Responsibilities
- Manage and evolve production systems hosted on Google Cloud Platform and Cloudflare
- Ensure system availability, performance, and reliability against defined service level objectives (SLOs)
- Design and maintain scalable architectures that withstand sudden traffic spikes and heavy workloads
- Lead capacity planning, autoscaling strategy, removal of performance bottlenecks, and fault-tolerance improvements
- Manage infrastructure as code with Terraform, supplemented by Terragrunt where needed
- Oversee Kubernetes operations on Google Kubernetes Engine (GKE), including version upgrades, scaling, and security hardening
- Maintain internal tooling, deployment pipelines, and operational best practices
- Improve observability in Datadog through comprehensive monitoring, log management, and application performance monitoring (APM)
- Lead incident response, conduct root cause analyses, and implement lasting reliability improvements
- Reduce alert fatigue and improve detection of critical system failures
- Administer core security tooling and support vulnerability remediation across teams
- Improve deployment stability and streamline developer workflows through automated CI/CD pipelines and platform controls
- Own infrastructure cost tracking, drive cost reductions, and support financial operations (FinOps) initiatives
Benefits
- Work alongside experienced engineers worldwide who have built platforms used by millions
- Join a fast-growing company where ideas ship as live features within days
- Exercise independent judgment with full ownership of technical outcomes
- Integrate AI-driven solutions and modern automation tools into daily workflows
- Receive equity and share in the company's long-term growth
Compensation
Equity offered as part of compensation
Work Arrangement
Remote (Worldwide) — Los Angeles, New York, Seoul, Beijing, London, Lisbon, Belgrade, and other global locations
Team
High autonomy and ownership
Other
- On-call responsibilities implied by incident management duties