This role leads the development and automation of CI/CD systems, enforces security and compliance standards, and ensures high system reliability and observability. It requires strong software engineering skills and a builder mindset to advance platform automation and integrate AI into SRE workflows.

Responsibilities

Establish and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical services.
Use error budget policies to guide data-driven discussions between engineering and product teams on balancing release speed with system stability.
Perform capacity planning and proactive risk analysis to prevent system failures.
Lead incident response as incident commander, coordinating resolution efforts and maintaining clear communication with stakeholders during outages.
Conduct blameless postmortems and ensure follow-up actions are tracked and completed.
Improve runbooks, escalation procedures, and on-call practices to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
Develop and maintain observability frameworks using tools like Prometheus, Grafana, OpenTelemetry, and ELK for comprehensive system monitoring.
Design actionable alerting systems that reduce noise and prevent alert fatigue.
Promote the adoption of distributed tracing and structured logging across services.
Measure and reduce engineering toil through automation initiatives.
Build internal tools and self-service platforms to enhance developer efficiency and system dependability.
Collaborate with infrastructure and platform teams on cloud-native designs for resilience, auto-scaling, and disaster recovery.
Guide CI/CD pipeline development and deployment strategies such as canary and blue/green releases to reduce production risks.
Manage infrastructure via Infrastructure as Code (IaC) tools like Terraform to ensure consistency and reliability.
Mentor junior SREs and foster a culture of ownership, learning, and continuous improvement.
Advocate for SRE principles across engineering teams to integrate reliability into the development lifecycle.
Align SRE strategy with organizational goals through collaboration with key stakeholders.
Hold regular one-on-one meetings with direct reports and engage in team rituals.
Integrate AI tools into platform operations and deploy AI-powered features to production.
Support the development and delivery of AI-driven solutions, translating technical capabilities into business value.
Work with cross-functional teams including Product, Data, Security, and Legal to ensure safe, effective, and compliant AI deployment.

Requirements

Minimum of 7 years of experience in site reliability engineering, platform engineering, or a related technical field.
Demonstrated experience implementing and managing SLOs, SLIs, and error budgets in production environments.
Proven track record in incident management, including leading postmortems and implementing reliability enhancements.
Hands-on experience with observability technologies such as Prometheus, Grafana, OpenTelemetry, or similar tools.
Solid knowledge of cloud platforms (AWS, Azure, or GCP) and container orchestration with Kubernetes.
Proficiency in at least one programming or scripting language such as Python, Go, or Bash.

Nice to Have

Experience with chaos engineering tools like Chaos Monkey, Gremlin, or LitmusChaos.
Familiarity with Infrastructure as Code tools such as Terraform or Pulumi.
Understanding of DevSecOps principles and security tooling integration.
Experience working with GitOps workflows and CI/CD pipeline design.
Bilingual proficiency in English and Spanish.

Tech Stack

Prometheus, Grafana, OpenTelemetry, ELK, AWS, Azure, GCP, Kubernetes, Terraform, Pulumi, Python, Go, Bash, Chaos Monkey, Gremlin, LitmusChaos, GitOps, CI/CD pipelines

Work Arrangement

Fully remote within India

Team

Leadership role with direct reports; works closely with product, engineering, data, security, and legal teams.

Culture

Builder mindset
Focus on automation and continuous improvement
AI-native engineering culture
Blameless incident response
Data-informed decision making
Reliability and quality focus
Collaborative cross-functional partnerships

Additional Information

Bilingual proficiency in English and Spanish is preferred.
Position is fully remote within India.
Full-time role.
The candidate is expected to integrate AI tools into platform operations and support the delivery of AI-powered offerings to production.
Mentoring junior SREs and conducting regular one-on-ones with direct reports is a key responsibility.

Concentrix is hiring a Lead Site Reliability Engineer
