Remote (Global), Full-time

BentoML is hiring a Site Reliability Engineer

About the Role

BentoML is hiring a Senior Site Reliability Engineer to own the infrastructure behind large language model and generative AI services deployed worldwide. You will architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning large pools of GPU capacity into responsive inference pools.

What You'll Do

  • Design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-premises clusters.
  • Manage infrastructure with Terraform or Pulumi and follow GitOps workflows.
  • Maintain automated build and release pipelines with reliable rollback capabilities.
  • Manage NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates, and extend these practices to AMD GPUs.
  • Operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
  • Share an on-call rotation, lead post-incident reviews, and keep runbooks current.
  • Establish standard SRE processes and teach best practices to the wider engineering team.

What We're Looking For

  • Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
  • Hands-on experience with AWS and Google Cloud.
  • Strong skills with Terraform or Pulumi, GitOps tools like Argo CD or Flux, and CI/CD pipelines.
  • Deep understanding of Linux and networking fundamentals.
  • Experience managing NVIDIA GPU clusters.
  • Solid background with Prometheus and Grafana at scale.
  • Clear written and spoken communication and comfort working across time zones.

Nice to Have

  • Familiarity with Azure or Oracle Cloud.
  • AMD/ROCm knowledge.
  • Familiarity with specialized GPU clouds such as Lambda or Nebius.

Technical Stack

  • Kubernetes, AWS, Google Cloud
  • Terraform, Pulumi
  • GitOps (Argo CD, Flux)
  • CI/CD pipelines, Linux
  • NVIDIA GPU, AMD/ROCm
  • Prometheus, Grafana

Benefits & Compensation

  • Competitive salary and equity
  • Remote work
  • Learning budget
  • Paid conference travel

Work Mode

This is a fully remote position, open to candidates based in North America and Asia.

BentoML is an equal opportunity employer.

Required Skills
Kubernetes, AWS, Google Cloud, Terraform, Pulumi, GitOps, Argo CD, Flux, CI/CD, Linux, NVIDIA GPU, AMD/ROCm, Infrastructure as Code, Site Reliability Engineering, Distributed Systems
About company
BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. Our portfolio includes both open source and commercial products.

Job Details
Category: Infrastructure
Posted: 7 months ago