DigitalOcean is looking for a Senior Cloud Support Engineer to serve as the definitive technical authority for resolving the most complex customer challenges, particularly around Kubernetes and GPU/GradientAI workloads. In this role, you will bridge deep support expertise with solutions architecture, designing sophisticated cloud infrastructure while maintaining a customer-first mentality and participating in an operational on-call rotation for critical incidents.
What You'll Do
- Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure.
- Architect enterprise-grade solutions for customers building large-scale AI/ML workloads, including multi-cluster Kubernetes deployments and distributed GPU training infrastructure.
- Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews and performance optimization workshops.
- Drive resolution of systemic technical challenges by identifying patterns and partnering with Engineering to implement platform-level improvements.
- Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams.
- Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations.
- Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities and architectural best practices.
- Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, and pair troubleshooting sessions.
- Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices.
- Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, and customer-facing solution guides.
- Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms.
What We're Looking For
- 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles, with a consistent record of technical leadership.
- 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements.
- Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking.
- Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns.
- Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale.
- Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns.
- Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization, and inference performance tuning.
- Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring.
- Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training.
- Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence.
- Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation.
- Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues.
- Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar).
- Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools.
- Exceptional communication abilities: Can translate highly complex technical concepts into clear, actionable guidance for audiences ranging from junior engineers to C-level executives.
- Demonstrated leadership capabilities including mentoring team members, leading cross-functional initiatives, and influencing without direct authority.
- Strong consultative approach: Ability to discover underlying customer needs, challenge assumptions respectfully, and craft solutions that balance technical excellence with business pragmatism.
- Track record of driving organizational improvement through process design, automation, documentation, and strategic initiatives.
Nice to Have
- Bare Metal infrastructure expertise: Server provisioning, hardware troubleshooting, BIOS/firmware management, RAID configuration, and performance tuning.
- Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering, and datacenter networking concepts.
- Kubernetes certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or CKS (Certified Kubernetes Security Specialist).
- Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert.
- GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications, or similar specialized credentials.
- Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools.
- Published technical content: Blog posts, whitepapers, solution guides, or technical documentation demonstrating thought leadership.
- Speaking experience at technical conferences, meetups, or webinars on topics related to cloud infrastructure, AI/ML, or DevOps.
- Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums, cloud-native communities).
- Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic, or similar monitoring/alerting systems.
- Multi-cloud or hybrid-cloud architecture experience: Designing solutions that span AWS, GCP, Azure, and on-premises infrastructure.
- Experience with DigitalOcean or Paperspace products as a user or customer.
- Database expertise: Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale.
- Security & compliance knowledge: Experience with SOC 2, HIPAA, GDPR, or other compliance frameworks in cloud environments.
Technical Stack
- Kubernetes, GPU/GradientAI
- Python, PyTorch, TensorFlow, Hugging Face
- CUDA, TensorRT, vLLM
- Linux
- Terraform, CloudFormation, Pulumi
- Go, Rust, C++
- Prometheus, Grafana, Datadog, New Relic
- PostgreSQL, MySQL, MongoDB, Redis
Team & Environment
You will be part of the AI/ML Support team at DigitalOcean, a group dedicated to solving the most challenging technical problems for customers building advanced AI/ML workloads.
DigitalOcean is an equal opportunity employer.



