Acquireai is looking for a Site Reliability Engineer to serve as the guardian of our production systems. You will ensure the reliability, scalability, and performance of our IoT telemetry platform by defining SLOs, automating operational processes, and building the infrastructure and tooling that enable our engineering teams to deploy with confidence.
What You'll Do
- Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems.
- Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded.
- Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty.
- Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript.
- Manage and optimize AWS services including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), and S3, along with our SingleStore and MongoDB data stores.
- Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter.
- Serve as incident commander during production outages and service degradations.
- Lead comprehensive post-mortem processes within 48 hours of incidents and drive 'never-again' corrective actions to completion.
- Maintain and improve incident response procedures and runbooks.
- Implement and enforce least-privilege IAM policies across all AWS resources.
- Manage security patch pipelines and vulnerability remediation processes.
- Support compliance initiatives including SOC 2 and ISO 27001 certification requirements.
- Participate in a follow-the-sun on-call rotation with a one-week primary/secondary commitment every five weeks.
- Provide 24×7 support coverage across AU/NZ, EU/ZA, and MX time zones.
- Maintain operational runbooks and knowledge transfer documentation.
What We're Looking For
- Proven experience defining and enforcing Service Level Objectives (SLOs) and error budgets in a production environment.
- Deep hands-on experience with monitoring and alerting tools like Prometheus and Grafana.
- Expertise in Infrastructure as Code using Pulumi, Terraform, or similar tools.
- Strong experience managing and optimizing AWS services, particularly EKS and MSK.
- Proficiency in a scripting or programming language such as TypeScript, Python, or Go.
- Experience automating operational workflows and eliminating manual toil.
- Demonstrated ability to lead incident response and post-mortem processes.
- Strong knowledge of cloud security best practices, including IAM policy management and vulnerability remediation.
- Experience supporting SOC 2, ISO 27001, or similar compliance frameworks.
- Willingness to participate in a global on-call rotation.
Technical Stack
- Monitoring & Alerting: Prometheus, Grafana, PagerDuty
- Infrastructure as Code: Pulumi, TypeScript
- Cloud Platform: AWS
- Core Services: EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), SingleStore, MongoDB, S3
Work Mode
This is a global, remote position. Candidates should be located in the AU/NZ, EU/ZA, or MX time zones, and authorized to work where they are based, to support our follow-the-sun on-call model.
Acquireai is an equal opportunity employer.


