Hyderabad or Bengaluru

DigitalOcean is hiring a Senior Cloud Support Engineer

Responsibilities

  • Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure, coordinating cross-functional responses that span Engineering, Product, and Operations
  • Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean, including multi-cluster Kubernetes deployments, distributed GPU training infrastructure, and hybrid/multi-cloud architectures
  • Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews, performance optimization workshops, and proof-of-concept implementations
  • Drive resolution of systemic technical challenges by identifying patterns across customer issues, partnering with Engineering to implement platform-level improvements, and advocating for product enhancements that eliminate entire classes of problems
  • Research and evaluate emerging technologies in the AI/ML and cloud infrastructure space, identifying opportunities for DigitalOcean to differentiate and expand our capabilities
  • Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams and understanding their business objectives
  • Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations, managing complex project timelines, stakeholder expectations, and technical deliverables
  • Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities, architectural best practices, and roadmap vision to C-level and VP-level stakeholders
  • Partner strategically with Customer Success to drive expansion opportunities, prevent churn through proactive technical guidance, and transform technical challenges into growth opportunities
  • Influence product strategy by synthesizing customer insights, competitive intelligence, and technical trends into actionable recommendations for Product and Engineering leadership
  • Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, pair troubleshooting sessions, and career development guidance
  • Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices that elevate team capabilities
  • Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, customer-facing solution guides, and internal training curricula
  • Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms and ensuring timely, effective resolution
  • Represent the Support organization in cross-functional initiatives, product design reviews, and strategic planning sessions, ensuring the voice of the customer influences critical decisions

Requirements

  • 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles with consistent demonstration of technical leadership
  • 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements
  • Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking (CNI, service meshes, ingress controllers)
  • Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns
  • Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale
  • Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns
  • Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization (INT4, INT8, FP8), and inference performance tuning
  • Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring
  • Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training
  • Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence
  • Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation
  • Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues
  • Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar)
  • Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools
  • Exceptional communication abilities: Can translate highly complex technical concepts into clear, actionable guidance for audiences ranging from junior engineers to C-level executives
  • Demonstrated leadership capabilities including mentoring team members, leading cross-functional initiatives, and influencing without direct authority
  • Strong consultative approach: Ability to discover underlying customer needs, challenge assumptions respectfully, and craft solutions that balance technical excellence with business pragmatism
  • Track record of driving organizational improvement through process design, automation, documentation, and strategic initiatives

Nice to Have

  • Bare Metal infrastructure expertise: Server provisioning, hardware troubleshooting, BIOS/firmware management, RAID configuration, and performance tuning
  • Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering, and datacenter networking concepts
  • Kubernetes certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or CKS (Certified Kubernetes Security Specialist)
  • Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert
  • GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications, or similar specialized credentials
  • Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools
  • Published technical content: Blog posts, whitepapers, solution guides, or technical documentation demonstrating thought leadership
  • Speaking experience at technical conferences, meetups, or webinars on topics related to cloud infrastructure, AI/ML, or DevOps
  • Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums, cloud-native communities)
  • Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic, or similar monitoring/alerting systems
  • Multi-cloud or hybrid-cloud architecture experience: Designing solutions that span AWS, GCP, Azure, and on-premises infrastructure
  • Experience with DigitalOcean or Paperspace products as a user or customer
  • Database expertise: Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale
  • Security & compliance knowledge: Experience with SOC 2, HIPAA, GDPR, or other compliance frameworks in cloud environments

Additional Information

  • JR: 2026-7534
  • You may apply to a maximum of 3 positions within any 180-day period. This policy promotes better role-candidate matching and encourages thoughtful applications where your qualifications align most strongly.

Required Skills

  • Technical Support
  • DevOps

About the company

DigitalOcean
DigitalOcean builds the simplest scalable cloud for a strong community of top talent and the dreamers and builders in the world.

Job Details

  • Department: Information Technology
  • Category: Other
  • Posted: 3 months ago