Responsibilities
- Set up and configure physical hardware, including racking, cabling, and regular maintenance for on-site compute clusters.
- Diagnose and fix issues with physical servers or network devices to maintain high availability for customer systems.
- Support team members during on-site customer deployments and live demonstrations.
- Create and update essential operational documentation such as runbooks for hardware setup, EKS upgrades, core service updates, and cluster management.
- Contribute to building and sustaining monitoring solutions for Kubernetes services and apps, and provide users with read-only access to logs and metrics.
- Ensure the on-premises Kubernetes platform remains highly available, scalable, and secure for critical workloads.
- Apply Kubernetes cluster lifecycle best practices, including updates, security patches, and configuration management with tools such as Helm and Kustomize, or GitOps workflows with Argo CD or Flux.
- Work closely with software engineering teams to understand requirements, implement platform improvements, and enforce security policies, network segmentation, and access controls in Kubernetes.
Work Arrangement
Hybrid
Other
This role requires occasional in-person work with physical servers at the Richmond, CA facility.