Fundamental is looking for an MLOps Team Lead to lead and mentor a team of MLOps engineers, define the strategic roadmap, and architect scalable ML infrastructure and pipelines. You will bridge the critical gap between research and production in our mission-driven, low-ego environment.
What You'll Do
- Lead and mentor a team of MLOps engineers, fostering technical growth and a culture of operational excellence
- Define and drive the MLOps roadmap, aligning infrastructure capabilities with research, engineering, and product objectives
- Establish best practices, standards, and processes for ML infrastructure, deployment, and operations
- Own technical decision-making for ML infrastructure architecture and tooling choices
- Architect and oversee scalable, automated machine learning pipelines, CI/CD workflows, and orchestration frameworks
- Drive the design and implementation of robust model serving infrastructure using platforms like Triton, TorchServe, TensorFlow Serving, and KServe
- Define inference architecture strategy optimized for ultra-low latency and high throughput
- Design and maintain feature stores, robust data pipelines, and scalable storage solutions to efficiently handle large volumes of data
- Collaborate with research teams to bridge the gap between experimentation and production
- Define logging, alerting, and monitoring strategy to track model performance, drift, and system reliability
What We're Looking For
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 7+ years of experience in MLOps, with 3+ years in a technical leadership role
- Strong software engineering skills in Python, with experience in Bash and/or Go
- Proven track record of building and leading high-performing MLOps or infrastructure teams
- Experience building and designing MLOps infrastructure from the ground up
- Deep experience with MLOps platforms (MLflow, WandB, etc.) and frameworks (PyTorch, TensorFlow, etc.)
- Deep experience with model serving frameworks (Triton, TorchServe, TensorFlow Serving, KServe) for scalable, low-latency inference
- Experience building and managing data pipelines to support both model training and inference
- Solid experience with Kubernetes on a major cloud provider (AWS, GCP, or Azure) and with infrastructure as code (Terraform, Helm, GitOps)
- Proficient with observability and monitoring tools (Prometheus, Grafana, Datadog, OpenTelemetry)
- Excellent communication skills with ability to translate between research and production contexts
Nice to Have
- Experience with workflow orchestration tools (Kubeflow, Airflow, Argo Workflows)
- Experience with FastAPI and backend applications
- Familiarity with data platforms like Databricks or Snowflake
- Experience with LLM/foundation model serving and optimization
- Exposure to SRE practices or cloud security certifications
- Experience scaling ML infrastructure at AI startups
Technical Stack
- Languages: Python, Bash, Go
- MLOps Platforms: MLflow, WandB
- ML Frameworks: PyTorch, TensorFlow
- Model Serving: Triton, TorchServe, TensorFlow Serving, KServe
- Infrastructure: Kubernetes, AWS, GCP, Azure, Terraform, Helm, GitOps
- Observability: Prometheus, Grafana, Datadog, OpenTelemetry
- Orchestration & Tools: Kubeflow, Airflow, Argo Workflows, FastAPI, Databricks, Snowflake
Team & Environment
As MLOps Team Lead, you will guide and grow a team of MLOps engineers in a mission-driven, low-ego culture that values diversity of thought, ownership, and a bias toward action.
Benefits & Compensation
- Competitive compensation with salary and equity
- Comprehensive health coverage (medical, dental, and vision) and a 401(k) plan
- Fertility support
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
We are an equal opportunity employer.