DigitalOcean is hiring a Senior Cloud Support Engineer to be the ultimate technical authority for our most complex customer challenges. This role bridges deep support expertise and solutions architecture, focusing on Kubernetes and GPU/GradientAI workloads for customers building large-scale AI/ML infrastructure.
What You'll Do
- Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure.
- Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean.
- Lead technical discovery and solution design for strategic accounts, conducting architectural reviews and proof-of-concept implementations.
- Drive resolution of systemic technical challenges by identifying patterns and partnering with Engineering.
- Act as a trusted technical advisor to our highest-value customers and strategic partners.
- Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations.
- Conduct executive technical briefings and workshops for C-level and VP-level stakeholders.
- Mentor and develop IC1-IC3 engineers through structured coaching and technical reviews.
- Design and implement support frameworks including escalation workflows and automation tools.
- Create authoritative technical documentation including architectural reference guides and troubleshooting runbooks.
- Lead critical incident response for platform-wide or high-impact customer issues.
- Represent the Support organization in cross-functional initiatives and strategic planning sessions.
- Participate in an operational on-call rotation to support critical incidents and escalations.
What We're Looking For
- 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles.
- 5+ years in senior technical customer-facing roles managing enterprise customer relationships.
- Expert-level Kubernetes knowledge: production-scale architecture design, cluster operations, advanced troubleshooting.
- Deep GPU/AI/ML infrastructure expertise: multi-GPU and multi-node training, distributed computing frameworks.
- Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale.
- Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face).
- Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization.
- Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores.
- Experience with large-scale distributed AI/ML workloads including data parallelism and model parallelism.
- Proven experience designing fault-tolerant, scalable cloud architectures.
- Expert-level Linux system administration: kernel tuning, performance profiling, security hardening.
- Advanced networking expertise: deep understanding of TCP/IP, routing protocols, load balancing, network security.
- Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++).
- Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools.
- Exceptional communication abilities for audiences ranging from junior engineers to C-level executives.
- Demonstrated leadership capabilities including mentoring and leading cross-functional initiatives.
- Strong consultative approach to discovering underlying customer needs and crafting solutions.
- Track record of driving organizational improvement through process design and automation.
Nice to Have
- Kubernetes certifications: CKA, CKAD, or CKS.
- Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert.
- GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications.
- Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools.
- Published technical content: blog posts, whitepapers, solution guides.
- Speaking experience at technical conferences, meetups, or webinars.
- Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums).
- Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic.
- Multi-cloud or hybrid-cloud architecture experience spanning AWS, GCP, Azure, and on-premises.
- Experience with DigitalOcean or Paperspace products as a user or customer.
- Database expertise: experience with relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale.
- Security & compliance knowledge: experience with SOC 2, HIPAA, GDPR in cloud environments.
- Bare Metal infrastructure expertise: server provisioning, hardware troubleshooting, BIOS/firmware management.
- Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering.
Technical Stack
- Kubernetes, GPU/GradientAI, AI/ML infrastructure
- PyTorch, TensorFlow, Hugging Face, CUDA, TensorRT, vLLM
- Linux, Python, Go, Rust, C++
- Terraform, CloudFormation, Pulumi
- Prometheus, Grafana, Datadog, New Relic
- PostgreSQL, MySQL, MongoDB, Redis
Team & Environment
You'll be a senior member of the AI/ML Support team, a group dedicated to the most technically complex customer challenges.
DigitalOcean fosters a culture built on a growth mindset, thinking big and bold, and winning together while learning, having fun, and making a profound difference.