United States | Employment | USD 150,000 - 220,000 yearly

Deepgram is hiring a Site Reliability Engineer

About the Role

Deepgram is looking for a Site Reliability Engineer to build and operate the hybrid infrastructure foundation powering our advanced AI/ML research and product development. In this role, you will architect, build, and run the platform that spans AWS and our bare metal data centers to train and deploy complex models at scale.

What You'll Do

  • Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
  • Develop and manage the entire infrastructure using Infrastructure-as-Code principles with Terraform.
  • Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
  • Provision, manage, and maintain on-premises bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) and create automation for operational tasks.
  • Collaborate with AI researchers and ML engineers to understand infrastructure needs and build tools and workflows.
  • Automate the lifecycle of single-tenant, managed deployments.

What We're Looking For

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance compute job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure, including server provisioning, configuration, and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).

Nice to Have

  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.

Technical Stack

  • Kubernetes, AWS, Terraform, Slurm
  • Python, Go, Bash
  • GitLab CI, Jenkins, ArgoCD
  • Calico, Cilium, Ceph, Rook

Benefits & Compensation

  • Medical, dental, vision benefits
  • Annual wellness stipend and mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO and generous paid parental leave
  • Flexible schedule
  • 12 paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions

Deepgram is an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, gender identity or expression, age, marital status, veteran status, disability status, pregnancy, parental status, genetic information, political affiliation, or any other status protected by the laws or regulations in the locations where we operate.

Required Skills
Kubernetes, AWS, Terraform, Slurm, Python, Go, Bash, GitLab CI, Jenkins, ArgoCD, Platform Engineering, DevOps, SRE, Bare Metal Infrastructure, HPC
About company
Deepgram

Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice agents at scale. More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’.

Job Details
Department: Engineering
Category: Infrastructure
Posted: 14 days ago