Taguig City, Philippines | Remote (Global)

Acquireai is hiring a Site Reliability Engineer

Responsibilities

  • Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems
  • Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded
  • Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty
  • Establish and maintain reliability standards that support business-critical uptime requirements
  • Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript
  • Manage and optimize AWS services including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), SingleStore, MongoDB, and S3
  • Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter
  • Serve as incident commander during production outages and service degradations
  • Lead comprehensive post-mortem processes within 48 hours of incidents
  • Drive "never-again" corrective actions to completion, ensuring systemic improvements
  • Maintain and improve incident response procedures and runbooks
  • Implement and enforce least-privilege IAM policies across all AWS resources
  • Manage security patch pipelines and vulnerability remediation processes
  • Support compliance initiatives including SOC 2 and ISO 27001 certification requirements
  • Ensure security best practices are embedded in all infrastructure and operational procedures
  • Participate in a follow-the-sun on-call rotation with a one-week primary/secondary commitment every five weeks
  • Provide 24×7 support coverage across AU/NZ, EU/ZA, and MX time zones
  • Maintain operational runbooks and knowledge transfer documentation
  • Continuously improve on-call experience and reduce alert fatigue

Required Skills
Prometheus, Grafana, TypeScript, AWS, EKS, MongoDB, Kubernetes, Infrastructure as Code, Monitoring, Cloud Architecture
About company
Acquireai
We’re an award-winning global outsourcer providing contact center and back-office services on behalf of our global clients.
Job Details
Department: Information Technology
Category: Infrastructure
Posted 3 months ago