Build and operate the foundational infrastructure that powers AI/ML research and product development in a hybrid environment. Focus on scalability, reliability, and automation across cloud and on-premises systems using modern platform engineering practices.

Responsibilities

Design and maintain a robust, scalable Kubernetes platform running on both AWS and on-premises environments to support diverse applications and services.
Implement Infrastructure-as-Code using Terraform to manage and version infrastructure across multiple environments.
Monitor system performance and reliability using observability tools like Prometheus and Grafana.
Automate deployment, scaling, and failover of AI/ML workloads using Kubernetes operators and CI/CD pipelines.
Collaborate with ML engineers to optimize GPU resource allocation and scheduling via Slurm and Kubernetes device plugins.
Ensure high availability and disaster recovery readiness for critical platform components.
Develop and maintain internal developer platform tooling to streamline service deployment and configuration.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field.
3+ years of experience in site reliability, platform engineering, or systems administration.
Strong proficiency in Kubernetes, containerization, and cloud infrastructure (AWS).
Hands-on experience with Infrastructure-as-Code tools such as Terraform or Pulumi.
Solid understanding of networking, distributed systems, and Linux internals.
Experience with monitoring, logging, and observability stacks (e.g., Prometheus, Loki, Grafana).

Tech Stack

Kubernetes, Terraform, AWS (EC2, S3, EKS, VPC), Prometheus, Grafana, Slurm, Docker, GitLab CI/CD, ArgoCD

Benefits

Comprehensive health, dental, and vision insurance
401(k) matching program
Unlimited paid time off
Flexible work hours and remote-friendly policy
Annual learning and development stipend
Onsite and virtual wellness programs
Company-sponsored tech talks and hackathons
Parental leave policy
Employee resource groups and inclusion initiatives
Free healthy meals and snacks in office
Commuter benefits program
Stocked kitchens and game rooms
Annual retreats and team-building events
Mental health support and counseling services
Pet insurance option
Volunteer time off program

Work Arrangement

Hybrid (remote and on-site options available)

Additional Information

This role participates in a rotating on-call schedule for incident response.
Candidates must be located in the U.S. or Canada for tax and compliance reasons.
We are committed to building a diverse and inclusive team.
The hiring process includes a technical screening, a system design interview, and a culture-fit discussion.
Relocation assistance is available for eligible candidates.

Deepgram is hiring a Site Reliability Engineer
