Bellevue, Washington, United States · On-site · Full-time · USD 314,000 – 465,000 / year

Lambda is hiring a Staff Software Engineer - Managed Kubernetes

Responsibilities

  • Define the technical roadmap for a bare-metal Kubernetes platform, focusing on control plane scalability, high availability, multi-tenancy, and cluster lifecycle management
  • Extend and integrate open-source tools from NVIDIA’s ecosystem, including GPU Operator, Network Operator, DCGM, NCCL, AICR, and Topograph for intelligent GPU scheduling
  • Develop orchestration systems designed specifically for GPU-accelerated applications
  • Lead engineering efforts for core services that power managed platform offerings
  • Collaborate with networking teams to shape solutions for AI workloads, including CNI plugins like Cilium and Multus, high-speed fabrics such as InfiniBand and RoCE, RDMA, and GPUDirect
  • Contribute to storage architecture planning for AI use cases, working closely with storage teams to align with Kubernetes, Slurm, and future platform needs
  • Develop foundational components for running Managed Slurm on Kubernetes to support traditional HPC applications
  • Design platform-level services for model inference, including scalable serving infrastructure, load-based autoscaling, and multi-model deployment strategies
  • Build self-healing mechanisms and automated responses for incident detection, root cause analysis, and system resilience
  • Lead large-scale chaos engineering initiatives to test system reliability under failure conditions
  • Establish best practices for managed service operations, including automated upgrades, security patching, and zero-downtime maintenance
  • Act as a technical liaison between orchestration and infrastructure teams, translating platform needs into implementable specifications
  • Influence cross-infrastructure decisions that support robust managed services, drawing on end-to-end system understanding that extends beyond Kubernetes
  • Provide input on bare-metal provisioning, network layout, and storage configurations to meet orchestration service requirements
  • Promote consistency and standardization across the full infrastructure technology stack
  • Work directly with customers and internal stakeholders to understand deployment patterns and guide migration to managed platforms
  • Set technical direction for Kubernetes-based services, shaping team roadmaps and priorities
  • Lead design reviews and architectural discussions to ensure systems are scalable, maintainable, and customer-aligned
  • Mentor engineers and establish best practices in Kubernetes development, distributed systems, and cloud-native engineering
  • Partner with Network, Storage, Security, and Customer Success teams to deliver integrated solutions
  • Engage with NVIDIA and open-source communities to track advancements in GPU orchestration and contribute improvements
  • Represent the company through technical publications, conference presentations, and strategic customer interactions
  • Help define the AIOps strategy by designing systems for predictive capacity planning, anomaly detection, and proactive infrastructure maintenance

Benefits

  • Competitive cash and equity compensation package
  • Comprehensive health, dental, and vision insurance for employees and dependents
  • Wellness and commuter allowances for eligible roles
  • 401(k) plan with a 2% employer contribution for U.S.-based employees
  • Flexible paid time off policy that is actively used by the team

Work Arrangement

On-site: San Francisco, San Jose, Bellevue

  • This position requires working from our San Francisco, San Jose, or Bellevue office 4 days per week
  • Lambda’s designated work-from-home day is currently Tuesday

Other

  • You do not need to match all of the listed expectations to apply for this position
  • Lambda is an Equal Opportunity employer

About company
Lambda
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. The company builds and scales AI cloud infrastructure, including high-performance storage, networking, and compute systems for AI training and inference. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence.
Job Details
Department: Data Center Business
Category: infrastructure
Posted: a month ago