India Remote (Country)

Concentrix is hiring a Lead Site Reliability Engineer

This role leads the development and automation of CI/CD systems, enforces security and compliance standards, and ensures high system reliability and observability. It requires strong software engineering skills and a builder mindset to advance platform automation and integrate AI into SRE workflows.

Responsibilities

  • Establish and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical services.
  • Use error budget policies to guide data-driven discussions between engineering and product teams on balancing release speed with system stability.
  • Perform capacity planning and proactive risk analysis to prevent system failures.
  • Lead incident response as incident commander, coordinating resolution efforts and maintaining clear communication with stakeholders during outages.
  • Conduct blameless postmortems and ensure follow-up actions are tracked and completed.
  • Improve runbooks, escalation procedures, and on-call practices to reduce mean time to detection and resolution.
  • Develop and maintain observability frameworks using tools like Prometheus, Grafana, OpenTelemetry, and ELK for comprehensive system monitoring.
  • Design actionable alerting systems that reduce noise and prevent alert fatigue.
  • Promote the adoption of distributed tracing and structured logging across services.
  • Measure and reduce engineering toil through automation initiatives.
  • Build internal tools and self-service platforms to enhance developer efficiency and system dependability.
  • Collaborate with infrastructure and platform teams on cloud-native designs for resilience, auto-scaling, and disaster recovery.
  • Guide CI/CD pipeline development and deployment strategies such as canary and blue/green releases to reduce production risks.
  • Manage infrastructure via Infrastructure as Code (IaC) tools like Terraform to ensure consistency and reliability.
  • Mentor junior SREs and foster a culture of ownership, learning, and continuous improvement.
  • Advocate for SRE principles across engineering teams to integrate reliability into the development lifecycle.
  • Align SRE strategy with organizational goals through collaboration with key stakeholders.
  • Hold regular one-on-one meetings with direct reports and engage in team rituals.
  • Integrate AI tools into platform operations, deploying AI-powered features into production.
  • Support the development and delivery of AI-driven solutions, translating technical capabilities into business value.
  • Work with cross-functional teams including Product, Data, Security, and Legal to ensure safe, effective, and compliant AI deployment.

Requirements

  • Minimum of 7 years of experience in site reliability engineering, platform engineering, or a related technical field.
  • Demonstrated experience implementing and managing SLOs, SLIs, and error budgets in production environments.
  • Proven track record in incident management, including leading postmortems and implementing reliability enhancements.
  • Hands-on experience with observability technologies such as Prometheus, Grafana, OpenTelemetry, or similar tools.
  • Solid knowledge of cloud platforms (AWS, Azure, or GCP) and container orchestration with Kubernetes.
  • Proficiency in at least one programming or scripting language such as Python, Go, or Bash.

Nice to Have

  • Experience with chaos engineering tools like Chaos Monkey, Gremlin, or LitmusChaos.
  • Familiarity with Infrastructure as Code tools such as Terraform or Pulumi.
  • Understanding of DevSecOps principles and security tooling integration.
  • Experience working with GitOps workflows and CI/CD pipeline design.
  • Bilingual proficiency in English and Spanish.

Tech Stack

Prometheus, Grafana, OpenTelemetry, ELK, AWS, Azure, GCP, Kubernetes, Terraform, Pulumi, Python, Go, Bash, Chaos Monkey, Gremlin, LitmusChaos, GitOps, CI/CD pipelines

Work Arrangement

Fully remote within India

Team

Leadership role with direct reports; works closely with product, engineering, data, security, and legal teams.

  • Builder mindset
  • Focus on automation and continuous improvement
  • AI-native engineering culture
  • Blameless incident response
  • Data-informed decision making
  • Reliability and quality focus
  • Collaborative cross-functional partnerships

Additional Information

  • Bilingual proficiency in English and Spanish is preferred.
  • Position is fully remote within India.
  • Full-time role.
  • Candidate is expected to integrate AI tools into platform operations and support production of AI-powered offerings.
  • Mentoring junior SREs and conducting regular one-on-ones with direct reports is a key responsibility.

Required Skills

Prometheus, Grafana, OpenTelemetry, ELK, AWS, Azure, GCP, Kubernetes, Terraform, Pulumi, Python, Go, Bash, Chaos Monkey, Gremlin

About the company

Concentrix

Concentrix is an international company with a presence in more than 70 countries and a leader in improving customer experience and optimizing business processes.

Job Details

  • Department: Information Technology
  • Category: Infrastructure
  • Posted: 2 months ago