Full-time

Deepgram is hiring a Platform Engineer – AI/ML Infrastructure

About the Role

Deepgram is hiring a Platform Engineer – AI/ML Infrastructure at the senior or staff level. You will build and operate the hybrid infrastructure foundation that powers advanced AI/ML research and product development. Your core mission is to architect, build, and run the platform spanning AWS and bare metal data centers to train and deploy complex models at scale.

What You'll Do

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premises.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code principles with Terraform.
  • Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
  • Provision, manage, and maintain on-premises bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) and create automation for operational tasks.
  • Collaborate with AI researchers and ML engineers to build tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.

What We're Looking For

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance compute job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure, including server provisioning, configuration, and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).

Nice to Have

  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.

Technical Stack

  • Kubernetes, AWS, Terraform, Slurm
  • Python, Go, Bash
  • GitLab CI, Jenkins, ArgoCD
  • Calico, Cilium, Ceph, Rook

Team & Environment

You will collaborate closely with AI researchers and ML engineers to build the platform that accelerates their work.

Deepgram is an equal opportunity employer. We want all voices and perspectives represented in our workforce. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, gender identity or expression, age, marital status, veteran status, disability status, pregnancy, parental status, genetic information, political affiliation, or any other status protected by the laws or regulations in the locations where we operate.

Required Skills
  • Kubernetes, AWS, Terraform, Slurm
  • Python, Go, Bash
  • GitLab CI, Jenkins, ArgoCD
  • AI/ML Infrastructure, Distributed Systems, CI/CD, Infrastructure as Code, Monitoring

About company
Deepgram

Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS), and for building production-grade voice agents at scale. More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’.

Job Details
Category: Infrastructure
Posted: 8 months ago