Sumo Logic is looking for a Staff Machine Learning Engineer to lead the design and delivery of the next generation of Agentic AI systems for our Security Operations Center. You will evaluate, prototype, and productionize state-of-the-art agentic AI technologies, building scalable multi-agent architectures that reason over large-scale machine data to drive real-time security detection, investigation, and response.
What You'll Do
- Lead and collaborate on the technical evaluation and adoption of cutting-edge agentic AI platforms, including Anthropic (Claude), LangChain/LangGraph, AWS Bedrock, and other emerging frameworks.
- Architect, prototype, and productionize multi-agent AI systems for Agentic SOC use cases such as detection, triage, investigation, and response workflows.
- Own the design of core agent architecture components, including planning, execution, tool orchestration, memory, context engineering, and long-running agent workflows.
- Lead AI agent evaluation systems, including offline and online evaluation pipelines, golden datasets, synthetic data generation, human- and LLM-based judging, and continuous quality monitoring.
- Drive LLM fine-tuning and alignment efforts to improve domain-specific reasoning, accuracy, and reliability for security and observability use cases.
- Design scalable LLMOps and AI agent infrastructure, including inference routing, latency optimization, cost control, and production observability.
- Partner with product, security, and data platform leadership to deliver end-to-end AI agent capabilities from prototype to customer-facing production systems.
- Lead and collaborate on technical direction and mentorship for AI engineers working on agentic AI and LLM systems.
- Define and implement best practices for AI safety, reliability, evaluation, and monitoring in production agentic systems.
- Operate as a senior technical owner in ambiguous problem spaces, setting technical direction, breaking down complex problems, and driving delivery across teams.
What We're Looking For
- B.Tech, M.Tech, or Ph.D. in Computer Science, Machine Learning, Data Science, or a related technical field.
- 5+ years of hands-on industry experience building, operating, and leading production ML/AI systems, with demonstrated technical leadership.
- Strong foundation in machine learning, distributed systems, data pipelines, and large-scale system design.
- Deep understanding of LLMs, prompt engineering, context engineering, agentic AI design patterns, and reasoning workflows.
- Strong proficiency in Python and modern ML/AI ecosystems.
- Experience designing and operating evaluation frameworks for ML/LLM systems (offline and online).
- Proven ability to lead complex technical initiatives across teams and influence architecture decisions.
- Excellent communication skills and ability to translate complex AI systems into business impact.
Nice to Have
- Hands-on experience building and scaling agentic AI systems or multi-agent architectures in production.
- Experience with modern agent frameworks such as LangGraph, LangChain, CrewAI, or similar.
- Experience with major foundation model platforms such as Anthropic, OpenAI, AWS Bedrock, or Vertex AI.
- Experience with LLM fine-tuning pipelines (SFT, RLHF/RLAIF, preference learning, domain adaptation).
- Strong background in LLMOps, including inference optimization, latency/cost management, observability, and production monitoring.
- Experience with ML infrastructure and tooling such as PyTorch, MLflow, Airflow, Docker, Kubernetes, and cloud platforms (AWS/GCP/Azure).
- Experience applying AI/ML to security, observability, or large-scale log/telemetry data.
Technical Stack
- Python, Anthropic (Claude), LangChain/LangGraph, AWS Bedrock
- PyTorch, MLflow, Airflow, Docker, Kubernetes
- AWS, GCP, Azure
Benefits & Compensation
- Compensation range: $221,000 - $260,000
Work Mode
This is a local-country position in the USA.