Responsibilities
- Design and support an internal platform for managing access to large language models and model control plane servers, including integration with tools such as LiteLLM.
- Establish end-to-end visibility into LLM operations through logging, monitoring, and alerting, using platforms such as Datadog and Langfuse.
- Implement policies and controls for model usage, covering security, data protection, rate limits, and cost tracking for LLM API calls.
- Work directly with AI-focused developers to enable cloud-native system designs and integration of autonomous agent workflows.
- Develop and manage infrastructure to track performance, stability, and uptime of AI systems in production.