Responsibilities
- Drive the stability and reliability of Epic's GCP infrastructure—setting and tracking SLOs/SLIs, reducing toil, and engineering out recurring sources of instability
- Build and operate Epic's GCP infrastructure for high availability, scalability, and cost efficiency
- Manage and harden our Docker and GKE container platform, including workload scheduling, autoscaling, networking, and graceful failure handling
- Maintain and improve CI/CD pipelines that enable fast, low-risk delivery across engineering teams
- Own and evolve the observability stack—metrics, logs, traces, dashboards, and alerts—so that signals are actionable, noise is low, and on-call has the context to resolve issues quickly
- Write and maintain Terraform to codify infrastructure across the organization, with a focus on consistency, change safety, and reproducibility
- Contribute to capacity planning, cost optimization, and architectural reviews, with reliability as a first-class consideration
- Champion platform security best practices, including secrets management, IAM policies, and network segmentation
- Support compliance-aware infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we mature our SOC 2 and student-data compliance programs
- Partner with data engineering to operate the orchestration platform and supporting infrastructure—deployment, scaling, reliability, and observability
- Collaborate with backend and data engineers to troubleshoot service and platform issues
- Lead by example in a frequent on-call rotation; drive incident response, blameless post-mortems, and the follow-through that turns one-time outages into systemic, lasting reliability improvements
- Provide guidance to developers on infrastructure concerns and best practices
Requirements
- Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
- 5+ years of experience in infrastructure, platform, DevOps, or a related engineering role
- Hands-on experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and related services)
- Experience with Docker and Kubernetes (GKE): containerizing workloads, deploying to GKE, packaging with Helm, and understanding cluster fundamentals
- Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or similar)
- Experience with an observability platform such as New Relic (metrics, logging, alerting, dashboards)
- Proficiency in Terraform for managing infrastructure as code
- Scripting/programming skills in Python, Bash, or similar
- Comfort participating in a frequent production on-call rotation
- Track record of measurably improving reliability of production systems—e.g., defining SLOs, reducing incident frequency or MTTR, eliminating recurring failure modes
- Strong problem-solving skills, a sense of ownership, and the ability to work effectively as systems evolve
- Fluency in English for daily collaboration and technical documentation
- Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners
Nice to Have
- Experience operating workflow orchestration platforms (e.g., Dagster, Airflow) as a service for data or platform teams
- Familiarity with the operational footprint of data platforms (warehouse infrastructure, job schedulers, batch workloads)
- Experience in distributed or global engineering teams
- Working knowledge of compliance frameworks (e.g., SOC 2, FERPA, COPPA) and GRC tools
Team
Structure: global, bilingual (English–Chinese) engineering team
Additional Information
- This is a fully remote, US-based role
- Frequent on-call rotation