Fundamental is looking for an MLOps Team Lead to lead and mentor a team of MLOps engineers, define the strategic roadmap, and architect scalable ML infrastructure and pipelines. You will bridge the critical gap between research and production in our mission-driven, low-ego environment.
What You'll Do
- Lead and mentor a team of MLOps engineers, fostering technical growth and a culture of operational excellence
- Define and drive the MLOps roadmap, aligning infrastructure capabilities with research, engineering, and product objectives
- Establish best practices, standards, and processes for ML infrastructure, deployment, and operations
- Own technical decision-making for ML infrastructure architecture and tooling choices
- Architect and oversee scalable, automated machine learning pipelines, CI/CD workflows, and orchestration frameworks
- Drive the design and implementation of robust model serving infrastructure using platforms like Triton, TorchServe, TensorFlow Serving, and KServe
- Define inference architecture strategy optimized for ultra-low latency and high throughput
- Design and maintain feature stores, robust data pipelines, and scalable storage solutions to efficiently handle large volumes of data
- Collaborate with research teams to bridge the gap between experimentation and production
- Define logging, alerting, and monitoring strategy to track model performance, drift, and system reliability
What We're Looking For
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 7+ years of experience in MLOps, with 3+ years in a technical leadership role
- Strong software engineering skills in Python, with experience in Bash and/or Go
- Proven track record of building and leading high-performing MLOps or infrastructure teams
- Experience building and designing MLOps infrastructure from the ground up
- Deep experience with MLOps platforms (MLflow, WandB, etc.) and frameworks (PyTorch, TensorFlow, etc.)
- Deep experience with model serving frameworks (Triton, TorchServe, TensorFlow Serving, KServe) for scalable, low-latency inference
- Experience building and managing data pipelines to support both model training and inference
- Solid experience with Kubernetes on a major cloud provider (AWS, GCP, or Azure) and with infrastructure as code (Terraform, Helm, GitOps)
- Proficient with observability and monitoring tools (Prometheus, Grafana, Datadog, OpenTelemetry)
- Excellent communication skills with ability to translate between research and production contexts
Nice to Have
- Experience with workflow orchestration tools (Kubeflow, Airflow, Argo Workflows)
- Experience with FastAPI and backend applications
- Familiarity with data platforms like Databricks or Snowflake
- Experience with LLM/foundation model serving and optimization
- Exposure to SRE practices or cloud security certifications
- Experience scaling ML infrastructure at AI startups
Technical Stack
- Languages: Python, Bash, Go
- MLOps Platforms: MLflow, WandB
- ML Frameworks: PyTorch, TensorFlow
- Model Serving: Triton, TorchServe, TensorFlow Serving, KServe
- Infrastructure: Kubernetes, AWS, GCP, Azure, Terraform, Helm, GitOps
- Observability: Prometheus, Grafana, Datadog, OpenTelemetry
- Orchestration & Tools: Kubeflow, Airflow, Argo Workflows, FastAPI, Databricks, Snowflake
Team & Environment
As MLOps Team Lead, you will guide and grow a team of MLOps engineers in a mission-driven, low-ego culture that values diversity of thought, ownership, and a bias toward action.
Benefits & Compensation
- Competitive compensation with salary and equity
- Comprehensive health coverage (medical, dental, and vision) and a 401(k) plan
- Fertility support
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
We are an equal opportunity employer.