Responsibilities
- Design and maintain an internal platform for provisioning AI resources, enabling seamless access to multiple LLMs and MCP servers through tools such as LiteLLM.
- Establish full observability for LLM operations, including logging, monitoring, and alerting, using platforms such as Datadog and Langfuse.
- Enforce governance policies for AI models, covering security protocols, data privacy compliance, rate limiting, and accurate tracking of API usage costs.
- Collaborate closely with AI-focused software engineers to advance cloud-native system designs and integrate agentic workflows.
- Develop and manage monitoring infrastructure to ensure high reliability and strong performance of AI systems in production environments.