About the Role
Responsibilities
- Shape reliability and operational excellence engineering practices to maintain high system uptime, reducing operational toil through automation, clear ownership, and well-defined runbooks.
- Drive performance testing, tuning, and capacity planning to ensure systems scale effectively and meet SLAs, making informed trade-offs between cost, scalability, and reliability.
- Identify systemic manual processes and eliminate them through automation and software-driven solutions, improving efficiency and reducing human error.
- Work across services and codebases to identify, debug, and resolve reliability and performance issues, contributing code changes where necessary to improve system behavior in production.
- Embed security, compliance, and governance into the Engineering Platform and delivery pipelines by default, minimizing the need for manual enforcement and ensuring data privacy and regulatory compliance.
- Design, implement, and operate observability solutions that provide actionable insights into system health, reliability, and cost, enabling teams to detect and resolve issues proactively.
- Participate in incident response and postmortem reviews, driving learning, systemic fixes, and preventative improvements rather than short-term workarounds.
- Make system cost and efficiency visible, helping teams understand and optimize their cloud usage in line with business objectives, budget constraints, and cloud governance best practices.
- Partner with Engineering and Product teams to shape and deliver an Engineering Platform roadmap that balances delivery speed, reliability, and long-term sustainability.
Work Arrangement
Hybrid