This role leads the development and automation of CI/CD systems, enforces security and compliance standards, and ensures high system reliability and observability. It requires strong software engineering skills and a builder mindset to advance platform automation and integrate AI into SRE workflows.
Responsibilities
- Establish and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical services.
- Use error budget policies to guide data-driven discussions between engineering and product teams on balancing release speed with system stability.
- Perform capacity planning and proactive risk analysis to prevent system failures.
- Lead incident response as incident commander, coordinating resolution efforts and maintaining clear communication with stakeholders during outages.
- Conduct blameless postmortems and ensure follow-up actions are tracked and completed.
- Improve runbooks, escalation procedures, and on-call practices to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
- Develop and maintain observability frameworks using tools like Prometheus, Grafana, OpenTelemetry, and ELK for comprehensive system monitoring.
- Design actionable alerting systems that reduce noise and prevent alert fatigue.
- Promote the adoption of distributed tracing and structured logging across services.
- Measure and reduce engineering toil through automation initiatives.
- Build internal tools and self-service platforms to enhance developer efficiency and system dependability.
- Collaborate with infrastructure and platform teams on cloud-native designs for resilience, auto-scaling, and disaster recovery.
- Guide CI/CD pipeline development and deployment strategies such as canary and blue/green releases to reduce production risks.
- Manage infrastructure via Infrastructure as Code (IaC) tools like Terraform to ensure consistency and reliability.
- Mentor junior SREs and foster a culture of ownership, learning, and continuous improvement.
- Advocate for SRE principles across engineering teams to integrate reliability into the development lifecycle.
- Align SRE strategy with organizational goals through collaboration with key stakeholders.
- Hold regular one-on-one meetings with direct reports and engage in team rituals.
- Integrate AI tools into platform operations, deploying AI-powered features into production.
- Support the development and delivery of AI-driven solutions, translating technical capabilities into business value.
- Work with cross-functional teams including Product, Data, Security, and Legal to ensure safe, effective, and compliant AI deployment.
Requirements
- Minimum of 7 years of experience in site reliability engineering, platform engineering, or a related technical field.
- Demonstrated experience implementing and managing SLOs, SLIs, and error budgets in production environments.
- Proven track record in incident management, including leading postmortems and implementing reliability enhancements.
- Hands-on experience with observability technologies such as Prometheus, Grafana, OpenTelemetry, or similar tools.
- Solid knowledge of cloud platforms (AWS, Azure, or GCP) and container orchestration with Kubernetes.
- Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
Nice to Have
- Experience with chaos engineering tools like Chaos Monkey, Gremlin, or LitmusChaos.
- Familiarity with Infrastructure as Code tools such as Terraform or Pulumi.
- Understanding of DevSecOps principles and security tooling integration.
- Experience working with GitOps workflows and CI/CD pipeline design.
- Bilingual proficiency in English and Spanish.
Tech Stack
Prometheus, Grafana, OpenTelemetry, ELK, AWS, Azure, GCP, Kubernetes, Terraform, Pulumi, Python, Go, Bash, Chaos Monkey, Gremlin, LitmusChaos, GitOps, CI/CD pipelines
Work Arrangement
Fully remote within India
Team
Leadership role with direct reports; works closely with product, engineering, data, security, and legal teams.
Culture
- Builder mindset
- Focus on automation and continuous improvement
- AI-native engineering culture
- Blameless incident response
- Data-informed decision making
- Reliability and quality focus
- Collaborative cross-functional partnerships
Additional Information
- Bilingual proficiency in English and Spanish is preferred.
- Position is fully remote within India.
- Full-time role.
- The candidate is expected to integrate AI tools into platform operations and support the delivery of AI-powered offerings in production.
- Mentoring junior SREs and conducting regular one-on-ones with direct reports is a key responsibility.


