Bengaluru or Hyderabad Employment

DigitalOcean is hiring a Senior Cloud Support Engineer

About the Role

DigitalOcean is hiring a Senior Cloud Support Engineer to be the ultimate technical authority for our most complex customer challenges. This role bridges deep support expertise and solutions architecture, focusing on Kubernetes and GPU/GradientAI workloads for customers building large-scale AI/ML infrastructure.

What You'll Do

  • Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure.
  • Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean.
  • Lead technical discovery and solution design for strategic accounts, conducting architectural reviews and proof-of-concept implementations.
  • Drive resolution of systemic technical challenges by identifying patterns and partnering with Engineering.
  • Act as a trusted technical advisor to our highest-value customers and strategic partners.
  • Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations.
  • Conduct executive technical briefings and workshops for C-level and VP-level stakeholders.
  • Mentor and develop IC1-IC3 engineers through structured coaching and technical reviews.
  • Design and implement support frameworks including escalation workflows and automation tools.
  • Create authoritative technical documentation including architectural reference guides and troubleshooting runbooks.
  • Lead critical incident response for platform-wide or high-impact customer issues.
  • Represent the Support organization in cross-functional initiatives and strategic planning sessions.
  • Participate in an operational on-call rotation to support critical incidents and escalations.

What We're Looking For

  • 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles.
  • 5+ years in senior technical customer-facing roles managing enterprise customer relationships.
  • Expert-level Kubernetes knowledge: production-scale architecture design, cluster operations, advanced troubleshooting.
  • Deep GPU/AI/ML infrastructure expertise: multi-GPU and multi-node training, distributed computing frameworks.
  • Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale.
  • Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face).
  • Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization.
  • Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores.
  • Experience with large-scale distributed AI/ML workloads including data parallelism and model parallelism.
  • Proven experience designing fault-tolerant, scalable cloud architectures.
  • Expert-level Linux system administration: kernel tuning, performance profiling, security hardening.
  • Advanced networking expertise: deep understanding of TCP/IP, routing protocols, load balancing, network security.
  • Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++).
  • Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools.
  • Exceptional communication abilities for audiences ranging from junior engineers to C-level executives.
  • Demonstrated leadership capabilities including mentoring and leading cross-functional initiatives.
  • Strong consultative approach to discover underlying customer needs and craft solutions.
  • Track record of driving organizational improvement through process design and automation.

Nice to Have

  • Kubernetes certifications: CKA, CKAD, or CKS.
  • Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert.
  • GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications.
  • Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools.
  • Published technical content: blog posts, whitepapers, solution guides.
  • Speaking experience at technical conferences, meetups, or webinars.
  • Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums).
  • Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic.
  • Multi-cloud or hybrid-cloud architecture experience spanning AWS, GCP, Azure, and on-premises.
  • Experience with DigitalOcean or Paperspace products as a user or customer.
  • Database expertise: experience with relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale.
  • Security & compliance knowledge: experience with SOC2, HIPAA, GDPR in cloud environments.
  • Bare Metal infrastructure expertise: server provisioning, hardware troubleshooting, BIOS/firmware management.
  • Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering.

Technical Stack

  • Kubernetes, GPU/GradientAI, AI/ML infrastructure
  • PyTorch, TensorFlow, Hugging Face, CUDA, TensorRT, vLLM
  • Linux, Python, Go, Rust, C++
  • Terraform, CloudFormation, Pulumi
  • Prometheus, Grafana, Datadog, New Relic
  • PostgreSQL, MySQL, MongoDB, Redis

Team & Environment

You'll be a senior member of the AI/ML Support team, a group dedicated to the most technically complex customer challenges.

DigitalOcean fosters a culture built on a growth mindset, thinking big and bold, and winning together while learning, having fun, and making a profound difference.

Required Skills

  • Kubernetes, GPU, AI/ML infrastructure, PyTorch, TensorFlow, Hugging Face, CUDA, TensorRT, vLLM, Linux, Distributed Computing, Model Training, Model Deployment, Site Reliability Engineering, DevOps
About the Company
DigitalOcean

DigitalOcean builds the simplest scalable cloud for a strong community of top talent and the dreamers and builders in the world.

Job Details
Department: Information Technology
Category: Infrastructure
Posted: 14 days ago