Deepgram is hiring a Platform Engineer – AI/ML Infrastructure at the senior or staff level. You will build and operate the hybrid infrastructure foundation that powers advanced AI/ML research and product development. Your core mission is to architect, build, and run the platform, spanning AWS and bare-metal data centers, on which we train and deploy complex models at scale.
What You'll Do
- Architect and maintain our core computing platform using Kubernetes on AWS and on premises.
- Develop and manage our entire infrastructure using Infrastructure-as-Code principles with Terraform.
- Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
- Provision, manage, and maintain on-premises bare-metal server infrastructure for high-performance GPU computing.
- Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions across hybrid environments.
- Develop a comprehensive observability stack (monitoring, logging, tracing) and create automation for operational tasks.
- Collaborate with AI researchers and ML engineers to build tools and workflows that accelerate their development cycle.
- Automate the lifecycle of single-tenant, managed deployments.
What We're Looking For
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
- Proven, hands-on experience building and managing production infrastructure with Terraform.
- Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
- Experience with high-performance compute job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
- Experience managing bare-metal infrastructure, including server provisioning, configuration, and lifecycle management.
- Strong scripting and automation skills (e.g., Python, Go, Bash).
Nice to Have
- Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
- Familiarity with FinOps principles and cloud cost optimization strategies.
- Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
- Experience in a multi-region or hybrid cloud environment.
Technical Stack
- Kubernetes, AWS, Terraform, Slurm
- Python, Go, Bash
- GitLab CI, Jenkins, ArgoCD
- Calico, Cilium, Ceph, Rook
Team & Environment
You will collaborate closely with AI researchers and ML engineers to build the platform that accelerates their work.
Deepgram is an equal opportunity employer. We want all voices and perspectives represented in our workforce. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, gender identity or expression, age, marital status, veteran status, disability status, pregnancy, parental status, genetic information, political affiliation, or any other status protected by the laws or regulations in the locations where we operate.





