Requirements
- 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles, with a consistent record of technical leadership
- 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements
- Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking (CNI, service meshes, ingress controllers)
- Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns
- Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale
- Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns
- Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization (INT4, INT8, FP8), and inference performance tuning
- Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring
- Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training
- Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence
- Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation
- Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues
- Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar)
- Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools
- Exceptional communication abilities: Ability to translate highly complex technical concepts into clear, actionable guidance for audiences ranging from junior engineers to C-level executives
- Demonstrated leadership capabilities including mentoring team members, leading cross-functional initiatives, and influencing without direct authority
- Strong consultative approach: Ability to discover underlying customer needs, challenge assumptions respectfully, and craft solutions that balance technical excellence with business pragmatism
- Track record of driving organizational improvement through process design, automation, documentation, and strategic initiatives
Nice to Have
- Kubernetes certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or CKS (Certified Kubernetes Security Specialist)
- Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert
- GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications, or similar specialized credentials
- Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools
- Published technical content: Blog posts, whitepapers, solution guides, or technical documentation demonstrating thought leadership
- Speaking experience at technical conferences, meetups, or webinars on topics related to cloud infrastructure, AI/ML, or DevOps
- Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums, cloud-native communities)
- Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic, or similar monitoring/alerting systems
- Multi-cloud or hybrid-cloud architecture experience: Designing solutions that span AWS, GCP, Azure, and on-premises infrastructure
- Experience with DigitalOcean or Paperspace products as a user or customer
- Database expertise: Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale
- Security & compliance knowledge: Experience with SOC 2, HIPAA, GDPR, or other compliance frameworks in cloud environments
Work Arrangement
Hybrid


