Taguig City, Philippines Remote (Global)

Acquireai is hiring a Site Reliability Engineer

Responsibilities

Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems
Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded
Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty
Establish and maintain reliability standards that support business-critical uptime requirements
Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript
Manage and optimize AWS services and data platforms, including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), SingleStore, MongoDB, and S3
Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter
Serve as incident commander during production outages and service degradations
Lead comprehensive post-mortem processes within 48 hours of incidents
Drive "never-again" corrective actions to completion, ensuring systemic improvements
Maintain and improve incident response procedures and runbooks
Implement and enforce least-privilege IAM policies across all AWS resources
Manage security patch pipelines and vulnerability remediation processes
Support compliance initiatives, including SOC 2 and ISO 27001 certification requirements
Ensure security best practices are embedded in all infrastructure and operational procedures
Participate in a follow-the-sun on-call rotation, with a one-week primary/secondary commitment every five weeks
Provide 24×7 support coverage across AU/NZ, EU/ZA, and MX time zones
Maintain operational runbooks and knowledge transfer documentation
Continuously improve on-call experience and reduce alert fatigue

Required Skills

Prometheus, Grafana, TypeScript, AWS, EKS, MongoDB, Kubernetes, Infrastructure as Code, Monitoring, Cloud Architecture

About company

We’re an award-winning global outsourcer providing contact center and back office services on behalf of our global clients.

Job Details

Department: Information Technology

Category: Infrastructure

Posted: 3 months ago
