Responsibilities
- Ensure consistent performance and uptime of production systems running on Google Cloud Platform and Kubernetes, with Node.js services and Postgres databases
- Take primary responsibility during critical system outages, lead incident resolution, and conduct follow-up analyses to prevent recurrence
- Enhance monitoring capabilities, refine alerting systems, and optimize on-call procedures to proactively detect and resolve issues
- Establish service level objectives (SLOs) and service level agreements (SLAs) for key services, and promote their consistent use across engineering teams
- Develop internal tools and automated frameworks that enable safer code deployments and streamline infrastructure management for development teams
- Work closely with Product, Engineering, and Machine Learning teams to integrate reliability practices into the development lifecycle
- Create and maintain technical roadmaps that balance immediate stability needs with long-term scalability for an expanding user base
- Promote best practices in platform engineering, including blameless postmortems, operational discipline, and a culture of ongoing learning
Work Arrangement
Remote (Worldwide) — San Francisco, Seoul, Tokyo, Taipei, Ljubljana
Other
Mastering a new language is among the most transformative abilities a person can develop, yet nearly 99% of learners fail to reach proficiency because of ineffective learning methods. The mission is to empower millions of people to succeed in language acquisition and positively transform their lives.