BentoML is hiring a Senior Site Reliability Engineer to own the infrastructure behind large language model and generative AI services that run worldwide. You will architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning extensive GPU resources into responsive inference pools.
What You'll Do
- Design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-premises clusters.
- Manage infrastructure with Terraform or Pulumi and follow GitOps workflows.
- Maintain automated build and release pipelines with reliable rollback capabilities.
- Manage NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates, and extend these practices to AMD GPUs.
- Operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
- Share an on-call rotation, lead post-incident reviews, and keep runbooks current.
- Establish standard SRE processes and teach best practices to the wider engineering team.
What We're Looking For
- Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
- Hands-on experience with AWS and Google Cloud.
- Strong skills with Terraform or Pulumi, GitOps tools like Argo CD or Flux, and CI/CD pipelines.
- Deep understanding of Linux and networking fundamentals.
- Experience managing NVIDIA GPU clusters.
- Solid background with Prometheus and Grafana at scale.
- Clear written and spoken communication and comfort working across time zones.
Nice to Have
- Familiarity with Azure or Oracle Cloud.
- AMD/ROCm knowledge.
- Familiarity with specialized GPU clouds such as Lambda or Nebius.
Technical Stack
- Kubernetes, AWS, Google Cloud
- Terraform, Pulumi
- GitOps (Argo CD, Flux)
- CI/CD pipelines, Linux
- NVIDIA GPU, AMD/ROCm
- Prometheus, Grafana
Benefits & Compensation
- Competitive salary and equity
- Remote work
- Learning budget
- Paid conference travel
Work Mode
This is a remote position open to candidates based in North America and Asia.
BentoML is an equal opportunity employer.





