Hyderabad or Bengaluru Employment

DigitalOcean is hiring a Senior Cloud Support Engineer

About the Role

DigitalOcean is looking for a Senior Cloud Support Engineer to serve as the definitive technical authority for resolving the most complex customer challenges, particularly around Kubernetes and GPU/GradientAI workloads. In this role, you will bridge deep support expertise with solutions architecture and design sophisticated cloud infrastructure while maintaining a customer-first mentality. You will also participate in an operational on-call rotation for critical incidents.

What You'll Do

  • Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure.
  • Architect enterprise-grade solutions for customers building large-scale AI/ML workloads, including multi-cluster Kubernetes deployments and distributed GPU training infrastructure.
  • Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews and performance optimization workshops.
  • Drive resolution of systemic technical challenges by identifying patterns and partnering with Engineering to implement platform-level improvements.
  • Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams.
  • Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations.
  • Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities and architectural best practices.
  • Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, and pair troubleshooting sessions.
  • Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices.
  • Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, and customer-facing solution guides.
  • Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms.

What We're Looking For

  • 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles with consistent demonstration of technical leadership.
  • 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements.
  • Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking.
  • Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns.
  • Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale.
  • Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns.
  • Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization, and inference performance tuning.
  • Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring.
  • Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training.
  • Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence.
  • Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation.
  • Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues.
  • Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar).
  • Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools.
  • Exceptional communication abilities: Can translate highly complex technical concepts into clear, actionable guidance for audiences ranging from junior engineers to C-level executives.
  • Demonstrated leadership capabilities including mentoring team members, leading cross-functional initiatives, and influencing without direct authority.
  • Strong consultative approach: Ability to discover underlying customer needs, challenge assumptions respectfully, and craft solutions that balance technical excellence with business pragmatism.
  • Track record of driving organizational improvement through process design, automation, documentation, and strategic initiatives.

Nice to Have

  • Bare Metal infrastructure expertise: Server provisioning, hardware troubleshooting, BIOS/firmware management, RAID configuration, and performance tuning.
  • Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering, and datacenter networking concepts.
  • Kubernetes certifications: CKA (Certified Kubernetes Administrator), CKAD, or CKS (Certified Kubernetes Security Specialist).
  • Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert.
  • GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications, or similar specialized credentials.
  • Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools.
  • Published technical content: Blog posts, whitepapers, solution guides, or technical documentation demonstrating thought leadership.
  • Speaking experience at technical conferences, meetups, or webinars on topics related to cloud infrastructure, AI/ML, or DevOps.
  • Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums, cloud-native communities).
  • Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic, or similar monitoring/alerting systems.
  • Multi-cloud or hybrid-cloud architecture experience: Designing solutions that span AWS, GCP, Azure, and on-premises infrastructure.
  • Experience with DigitalOcean or Paperspace products as a user or customer.
  • Database expertise: Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale.
  • Security & compliance knowledge: Experience with SOC2, HIPAA, GDPR, or other compliance frameworks in cloud environments.

Technical Stack

  • Kubernetes, GPU/GradientAI
  • Python, PyTorch, TensorFlow, Hugging Face
  • CUDA, TensorRT, vLLM
  • Linux
  • Terraform, CloudFormation, Pulumi
  • Go, Rust, C++
  • Prometheus, Grafana, Datadog, New Relic
  • PostgreSQL, MySQL, MongoDB, Redis

Team & Environment

You will be part of the AI/ML Support team at DigitalOcean, a group dedicated to solving the most challenging technical problems for customers building advanced AI/ML workloads.

DigitalOcean is an equal opportunity employer.

Required Skills
Kubernetes, GPU, AI/ML Infrastructure, Python, PyTorch, TensorFlow, Hugging Face, CUDA, TensorRT, vLLM, Linux, DevOps, SRE, Distributed Computing, Performance Optimization
About company
DigitalOcean

DigitalOcean builds the simplest scalable cloud for a strong community of top talent and the dreamers and builders in the world.

Job Details
Department: Information Technology
Category: Infrastructure
Posted: 14 days ago