Thoughtworks is looking for a Lead Service Reliability Engineer to join our DAMO service line. You will take a multifaceted approach to ensure technical excellence and operational efficiency, championing SRE principles to evolve our infrastructure towards a more customer-focused and agile model.
What You'll Do
- Understand SRE goals from both technical and business perspectives.
- Provide solutions to improve reliability, fault tolerance, and incident response times (MTTR, MTTD).
- Enhance the incident management process, including prioritization, triage, communication, and post-mortem analysis.
- Manage client stakeholder expectations during incidents, provide technical analysis and remediation plans, and interface with C-level executives as needed.
- Act as a liaison with client engineering teams, building trust and influencing senior stakeholders for better decision-making.
- Identify opportunities to enhance system performance and reliability aligned with business SLAs, SLOs, and KPIs.
- Collaborate with Thoughtworks application development leads and solution architects to recommend design changes and reliability best practices.
- Oversee and mentor other SREs on the team, contributing to their growth.
What We're Looking For
- Ability to program in one or more high-level languages such as Python, Golang, Shell scripting, Ruby, or Java.
- Familiarity with DevOps and GitOps practices, integrating observability automation into CI/CD pipelines (e.g., GitLab, Jenkins, CircleCI).
- In-depth knowledge of configuration management and Infrastructure as Code tools (e.g., Terraform, Ansible, ARM, CloudFormation).
- Expertise in observability, including logging, tracing, and monitoring tools (e.g., Grafana, Prometheus, Graylog, Jaeger, Zipkin, ELK stack).
- Strong understanding of container-based architecture and hands-on experience with orchestration tools (e.g., Kubernetes, AWS EKS, Docker Swarm, Nomad).
- In-depth experience in application and infrastructure performance tuning and scaling under heavy load scenarios.
- Good understanding of SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortems, synthetic monitoring, distributed tracing, end-user monitoring, and performance testing.
- Experience with network load balancing, security tech stacks, Transport Layer Security (TLS), certificate management, and standard networking protocols.
- Strong communication and articulation skills, with proficiency in English.
- Ability to convey resolutions to audiences with varying levels of technical and business proficiency and bring them to consensus.
- Excellent problem-solving and analytical skills with a focus on continuous improvement.
- Good listening and presentation skills.
- Ability to solve challenging and difficult-to-debug issues with a determined attitude.
- Ability to collaborate with cross-functional teams for capacity planning, scalability assessments, and solution design.
- Ability to work under pressure with composure during production incidents.
- Ability to understand client requirements and break them down into their technical and business aspects.
- Willingness to be part of a team that provides 24x7 availability on a rotation and as-needed basis.
Technical Stack
- Languages: Python, Golang, Shell scripting, Ruby, Java
- CI/CD: GitLab, Jenkins, CircleCI
- Infrastructure as Code: Terraform, Ansible, ARM, CloudFormation
- Observability: Grafana, Prometheus, Graylog, Jaeger, Zipkin, ELK stack
- Orchestration: Kubernetes, AWS EKS, Docker Swarm, Nomad
Team & Environment
You will be part of the DAMO service line, collaborating with Thoughtworks application development leads, solution architects, and client engineering teams.
Work Mode
This is an onsite position.
Thoughtworks is an equal opportunity employer.





