United States | Hybrid | USD 150,000 – 220,000 / year

Deepgram is hiring a Site Reliability Engineer

Build and operate the foundational infrastructure that powers AI/ML research and product development in a hybrid environment. Focus on scalability, reliability, and automation across cloud and on-premises systems using modern platform engineering practices.

Responsibilities

  • Design and maintain a robust, scalable Kubernetes platform running on both AWS and on-premises environments to support diverse applications and services.
  • Implement Infrastructure-as-Code using Terraform to manage and version infrastructure across multiple environments.
  • Monitor system performance and reliability using observability tools like Prometheus and Grafana.
  • Automate deployment, scaling, and failover of AI/ML workloads using Kubernetes operators and CI/CD pipelines.
  • Collaborate with ML engineers to optimize GPU resource allocation and scheduling via Slurm and Kubernetes device plugins.
  • Ensure high availability and disaster recovery readiness for critical platform components.
  • Develop and maintain internal developer platform tooling to streamline service deployment and configuration.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in site reliability, platform engineering, or systems administration.
  • Strong proficiency in Kubernetes, containerization, and cloud infrastructure (AWS).
  • Hands-on experience with Infrastructure-as-Code tools such as Terraform or Pulumi.
  • Solid understanding of networking, distributed systems, and Linux internals.
  • Experience with monitoring, logging, and observability stacks (e.g., Prometheus, Loki, Grafana).

Tech Stack

Kubernetes, Terraform, AWS (EC2, S3, EKS, VPC), Prometheus, Grafana, Slurm, Docker, GitLab CI/CD, ArgoCD

Benefits

  • Comprehensive health, dental, and vision insurance
  • 401(k) matching program
  • Unlimited paid time off
  • Flexible work hours and remote-friendly policy
  • Annual learning and development stipend
  • Onsite and virtual wellness programs
  • Company-sponsored tech talks and hackathons
  • Parental leave policy
  • Employee resource groups and inclusion initiatives
  • Free healthy meals and snacks in office
  • Commuter benefits program
  • Stocked kitchens and game rooms
  • Annual retreats and team-building events
  • Mental health support and counseling services
  • Pet insurance option
  • Volunteer time off program

Work Arrangement

Hybrid (remote and on-site options available)

Additional Information

  • This role participates in a rotating on-call schedule for incident response.
  • Candidates must be located in the U.S. or Canada for tax and compliance reasons.
  • We are committed to building a diverse and inclusive team.
  • The hiring process includes a technical screening, system design interview, and culture fit discussion.
  • Relocation assistance is available for eligible candidates.

Required Skills

Kubernetes, AWS, Terraform, Python, Go, Bash

About the company

Deepgram
Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS), and for building production-grade voice agents at scale. More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’.

Job Details

  • Department: Engineering
  • Category: Infrastructure
  • Posted: 3 months ago