Responsibilities
- Improve key system metrics such as response time, throughput, and uptime across large-scale production systems.
- Own end-to-end performance, from low-level operating system behavior to container orchestration, and prevent performance regressions.
- Establish and maintain service level indicators, objectives, and error budgets to align reliability with business goals.
- Collaborate with development teams to analyze code performance, optimize execution speed, and strengthen service resilience.
- Own database performance tuning, including query efficiency, index design, replication strategies, and resource isolation.
- Develop and manage automated, AI-assisted testing processes that simulate traffic, identify system limits, and forecast capacity needs.
- Direct the transition from legacy monolithic architectures to scalable, multi-tenant Kubernetes environments.
- Lower infrastructure costs by making strategic architectural choices, optimizing resource allocation, and implementing dynamic scaling.
- Create and maintain automated systems for deploying infrastructure, managing configurations, and ensuring system observability.
- Define and enforce high standards for system reliability, performance benchmarks, and incident response procedures.
Work Arrangement
Remote (Worldwide) — US, Canada, Costa Rica