Responsibilities
- Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
- Define SLOs/SLIs, monitor error budgets, and streamline reporting
- Support services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
- Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
- Scale systems sustainably through mechanisms like automation, and evolve them by pushing for changes that improve reliability and velocity
- Lead triage and root-cause analysis of high-severity incidents
- Practice balanced incident response and blameless postmortems
- Participate in an on-call rotation to support production services


