US Remote | Remote (Country) | Employment | 160K to 200K

Epic is hiring a Senior Software Engineer, Infrastructure

Responsibilities

  • Drive the stability and reliability of Epic's GCP infrastructure—setting and tracking SLOs/SLIs, reducing toil, and engineering out recurring sources of instability
  • Build and operate Epic's GCP infrastructure for high availability, scalability, and cost efficiency
  • Manage and harden our Docker and GKE container platform, including workload scheduling, autoscaling, networking, and graceful failure handling
  • Maintain and improve CI/CD pipelines that enable fast, safe, low-risk delivery across engineering teams
  • Own and evolve the observability stack—metrics, logs, traces, dashboards, and alerts—so that signals are actionable, noise is low, and on-call has the context to resolve issues quickly
  • Write and maintain Terraform to codify infrastructure across the organization, with a focus on consistency, change safety, and reproducibility
  • Contribute to capacity planning, cost optimization, and architectural reviews, with reliability as a first-class consideration
  • Champion platform security best practices, including secrets management, IAM policies, and network segmentation
  • Support compliance-aware infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we mature our SOC 2 and student-data compliance programs
  • Partner with data engineering to operate the orchestration platform and supporting infrastructure—deployment, scaling, reliability, and observability
  • Collaborate with backend and data engineers to troubleshoot service and platform issues
  • Lead by example in a frequent on-call rotation; drive incident response, blameless post-mortems, and the follow-through that turns one-time outages into systemic, lasting reliability improvements
  • Provide guidance to developers on infrastructure concerns and best practices

Requirements

  • Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
  • 5+ years of experience in infrastructure, platform, DevOps, or a related engineering role
  • Hands-on experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and related services)
  • Experience with Docker and Kubernetes (GKE)—containerizing workloads, deploying to GKE, Helm, and cluster fundamentals
  • Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or similar)
  • Experience with an observability platform such as New Relic (metrics, logging, alerting, dashboards)
  • Proficiency in Terraform for managing infrastructure as code
  • Scripting/programming skills in Python, Bash, or similar
  • Comfort participating in a frequent production on-call rotation
  • Track record of measurably improving reliability of production systems—e.g., defining SLOs, reducing incident frequency or MTTR, eliminating recurring failure modes
  • Strong problem-solving skills, sense of ownership, and ability to work effectively in evolving systems
  • Fluency in English for daily collaboration and technical documentation
  • Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners

Nice to Have

  • Experience operating workflow orchestration platforms (e.g., Dagster, Airflow) as a service for data or platform teams
  • Familiarity with the operational footprint of data platforms (warehouse infrastructure, job schedulers, batch workloads)
  • Experience in distributed or global engineering teams
  • Working knowledge of compliance frameworks (e.g., SOC 2, FERPA, COPPA) and GRC tools

Team

Structure: global, bilingual (English–Chinese) engineering team

Additional Information

  • This is a fully remote, US-based role
  • Frequent on-call rotation
  • Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners
Required Skills
infrastructure, platform, DevOps, or a related engineering role; GCP; Docker; Kubernetes; CI/CD pipelines; an observability platform such as New Relic; Terraform for managing infrastructure as code; Mandarin Chinese; the operational footprint of data platforms; distributed or global engineering teams; compliance frameworks
About company
Epic

Welcome to the leading digital library for kids 12 and under.

Our fun, safe reading app is designed to help kids discover the joy of books and build the literacy skills they need to succeed in school and beyond. With a growing library of 40,000+ high-quality books, audiobooks and learning videos from more than 250 publishers, Epic keeps kids reading, exploring and coming back for more.

Reading is the gateway to every subject, idea and opportunity. Epic builds strong readers with books that meet kids at their level and grow with them.

Every child deserves books. Epic gives kids instant access to thousands of high-quality titles that spark curiosity and build reading confidence.

Job Details
Category other
Posted 20 days ago