Improve key system metrics such as response time, throughput, and uptime across large-scale production systems.
Manage end-to-end performance, from low-level operating system behavior to container orchestration, to prevent performance degradation.
Establish and maintain service level indicators, objectives, and error budgets to align reliability with business goals.
Collaborate with development teams to analyze code performance, optimize execution speed, and strengthen service resilience.
Take primary responsibility for tuning database performance, including query efficiency, index design, replication strategies, and resource separation.
Develop and manage automated, AI-assisted load tests that simulate traffic, identify system limits, and forecast capacity needs.
Direct the transition from legacy monolithic architectures to scalable, multi-tenant Kubernetes environments.
Lower infrastructure costs by making strategic architectural choices, optimizing resource allocation, and implementing dynamic scaling.
Create and maintain automated systems for deploying infrastructure, managing configurations, and ensuring system observability.
Define and enforce high standards for system reliability, performance benchmarks, and incident handling procedures.

Remote (Worldwide) — US, Canada, Costa Rica

OfficeSpace Software is hiring a Senior Site Reliability Engineer