Concentrix is hiring a Lead Site Reliability Engineer to shape and scale our DevSecOps ecosystem. In this hands-on leadership role, you will own the reliability of production systems, lead the design of automated pipelines, and champion SRE principles across the software delivery lifecycle.
What You'll Do
- Define, implement, and own Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical services.
- Use error budget policies to drive data-informed conversations on release velocity vs. reliability trade-offs.
- Conduct capacity planning and proactive risk assessments to prevent incidents.
- Lead incident response as incident commander, coordinating teams and driving resolution.
- Facilitate blameless postmortems and ensure action items are tracked and resolved.
- Develop and improve runbooks, escalation paths, and on-call practices to reduce MTTD and MTTR.
- Design and maintain observability strategies using tooling such as Prometheus, Grafana, and OpenTelemetry.
- Define intelligent, actionable alerting to minimize alert fatigue.
- Drive adoption of distributed tracing and structured logging across services.
- Identify and measure toil, and lead initiatives to eliminate it through automation.
- Build internal tooling and self-service capabilities to improve developer productivity.
- Collaborate on cloud-native patterns for fault tolerance, auto-scaling, and disaster recovery.
- Provide SRE input into CI/CD pipelines and deployment strategies (canary, blue/green).
- Manage infrastructure using Infrastructure as Code (IaC) practices with a focus on reliability.
- Mentor and grow junior SREs, fostering a culture of ownership and continuous improvement.
- Act as an SRE advocate across engineering, embedding reliability into the development lifecycle.
- Partner with stakeholders to align SRE strategy with organizational goals.
- Conduct regular 1:1s with direct reports and participate in team rituals.
- Embed AI tools and practices into how we build and run our platform.
- Support engagements and solution design for AI-powered offerings.
- Collaborate with cross-functional partners to ensure AI is delivered safely and effectively.
What We're Looking For
- 7+ years of experience in SRE, platform engineering, or a related discipline.
- Proven experience defining and managing SLOs, SLIs, and error budgets in a production environment.
- Strong incident management experience, including leading postmortems and driving reliability improvements.
- Hands-on experience with observability tooling (Prometheus, Grafana, OpenTelemetry, or similar).
- Solid understanding of cloud platforms (AWS, Azure, or GCP) and containerized environments (Kubernetes).
- Proficiency in at least one scripting or programming language (Python, Go, or Bash).
Nice to Have
- Experience with chaos engineering tools (e.g., Chaos Monkey, Gremlin, LitmusChaos).
- Familiarity with IaC tooling such as Terraform or Pulumi.
- Knowledge of DevSecOps practices and security tooling.
- Experience with GitOps workflows and CI/CD pipelines.
- Bilingual proficiency (English & Spanish).
Technical Stack
- Observability: Prometheus, Grafana, OpenTelemetry, ELK
- Cloud Platforms: AWS, Azure, GCP
- Infrastructure: Kubernetes
- Languages: Python, Go, Bash
- Infrastructure as Code: Terraform, Pulumi
Work Mode
This is a global, work-at-home position based in India.