Responsibilities
- Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems
- Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded
- Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty
- Establish and maintain reliability standards that support business-critical uptime requirements
- Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript
- Manage and optimize AWS services including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), and S3, as well as SingleStore and MongoDB
- Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter
- Serve as incident commander during production outages and service degradations
- Lead a thorough post-mortem within 48 hours of each incident
- Drive "never-again" corrective actions to completion, ensuring systemic improvements
- Maintain and improve incident response procedures and runbooks
- Implement and enforce least-privilege IAM policies across all AWS resources
- Manage security patch pipelines and vulnerability remediation processes
- Support compliance initiatives, including SOC 2 and ISO 27001 certification requirements
- Ensure security best practices are embedded in all infrastructure and operational procedures
- Participate in a follow-the-sun on-call rotation, with a one-week primary/secondary commitment every five weeks
- Provide 24×7 support coverage across AU/NZ, EU/ZA, and MX time zones
- Maintain operational runbooks and knowledge transfer documentation
- Continuously improve on-call experience and reduce alert fatigue


