Responsibilities
- Participate in round-the-clock on-call duties for critical infrastructure services
- Lead incident response efforts, including diagnosis, containment, and system recovery
- Maintain and enhance operational runbooks, escalation protocols, and response procedures
- Design engineering solutions to reduce incident recurrence and improve mean time to resolution
- Enhance the stability of core systems such as Kubernetes (GKE), cloud networking, load balancers, and edge platforms like Cloudflare
- Detect and resolve systemic reliability risks through proactive improvements and root cause analysis
- Support capacity planning, scalability testing, and resilience validation for infrastructure components
- Implement security fixes across cloud and containerized environments
- Enforce security policies including least-privilege IAM, network segmentation, and runtime protections
- Collaborate with security teams on vulnerability management and incident investigations
- Respond to security events and contribute to post-incident reviews
- Develop automation to streamline operational and security workflows
- Build tools that increase visibility during incidents and strengthen security enforcement
- Reduce manual effort using scripts, automation frameworks, and process optimization
- Ensure infrastructure changes follow governance, audit, and change control standards
- Support safe deployment and rollback of configuration and system updates
- Contribute to postmortems and continuous improvement programs
- Partner with infrastructure, platform, data, and security teams on shared goals
- Maintain documentation and operational standards across teams
- Guide junior engineers and drive small-scale reliability or security projects
Requirements
- Minimum of four years of experience managing large-scale production systems
- Proven experience with Google Cloud Platform or an equivalent public cloud
- Hands-on production experience with Kubernetes, specifically GKE
- Ability to detect recurring system issues and implement durable solutions
- Experience leading incident management or reliability improvement programs
- Solid knowledge of operational, security, and reliability best practices
- Adaptability in on-call and high-pressure incident scenarios
- Strong problem-solving and communication abilities
- Background in supporting live production environments
- Experience mentoring less experienced engineers and influencing technical peers
Nice to Have
- Knowledge of Cloudflare, network infrastructure, or edge security technologies
- Exposure to security tools and vulnerability remediation processes
- Proficiency in scripting or automation using languages such as Python, Go, or Bash
- Experience operating in environments with compliance requirements such as SOC 2 or ISO
Work Arrangement
Remote (Worldwide)
Team
Team size: over 2,000 members
Structure: global, remote-first organization


