This role is central to maintaining and improving the reliability, scalability, and security of our cloud-native platform. The Senior Site Reliability Engineer will work across infrastructure, automation, and observability domains to ensure systems are resilient, performant, and secure by design. You will drive best practices in Infrastructure as Code, lead incident response efforts, and partner closely with engineering and security teams to embed reliability into the development lifecycle. This position requires deep technical expertise, proactive problem-solving, and a strong commitment to operational excellence in a dynamic, distributed environment.
Responsibilities
- Ensure high availability, security, and operational health of critical platform services across multiple environments and regions.
- Design and implement reusable, production-grade infrastructure using Terraform, CI/CD pipelines, and container platforms.
- Serve as a technical expert in AWS, Infrastructure as Code, and observability, providing guidance on complex system challenges.
- Collaborate with Product, Security, and Engineering teams to influence roadmaps with a focus on operability and security.
- Evaluate and balance tradeoffs between reliability, cost, performance, and security, communicating impacts to stakeholders.
- Build and maintain secure, resilient cloud infrastructure on AWS using automation and best practices.
- Develop and manage Infrastructure as Code with Terraform, including modules, state management, and CI/CD integration.
- Enhance system observability through metrics, logging, distributed tracing, dashboards, and proactive alerting.
- Automate repetitive operational tasks and create runbooks to reduce manual effort and improve consistency.
- Work closely with security teams to implement secure configurations, manage secrets, and meet compliance standards.
- Lead production readiness reviews, reliability assessments, and resilience testing for engineering initiatives.
- Use data to guide decisions that improve system performance and reliability.
- Manage multiple priorities effectively in an Agile, team-driven environment.
- Promote automation, standardization, and continuous improvement across operations.
- Mentor junior engineers and contribute to design and release evaluations.
- Communicate clearly in written and spoken English with internal teams and clients.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Minimum of 5 years in a senior-level site reliability, platform engineering, or similar infrastructure-focused role.
- Proven experience operating large-scale production systems in AWS.
- Extensive hands-on experience with Terraform, including modules, remote state, and workspaces.
- Solid understanding of Linux systems, networking, and storage fundamentals.
- Experience with containerization using Docker and building CI/CD pipelines.
- Strong knowledge of observability practices, including monitoring, logging, distributed tracing, and alerting.
- Track record of managing incident response, postmortems, SLOs, and operational procedures.
- Familiarity with security collaboration, including threat modeling, vulnerability management, and secrets handling.
- Excellent communication, collaboration, and mentoring abilities in cross-functional settings.
Tech Stack
Amazon Web Services (EC2, S3, RDS, VPC, IAM, Lambda, EKS), Terraform (modules, providers, remote state), Docker, CloudWatch, Datadog, Prometheus, Grafana, OpenTelemetry, Kubernetes, Jenkins, Ansible, HashiCorp Vault, New Relic, Elasticsearch, Fluentd, Kibana
Benefits
- Commitment to diversity, equity, and inclusion in all employment practices.
- Opportunity to mentor and support growth of junior team members.
- Engagement with modern cloud and observability technologies in real-world applications.
- Collaborative Agile/Scrum work environment fostering innovation and teamwork.
Compensation
Not specified
Work Arrangement
Hybrid
Team
This role operates within a cross-functional engineering organization that values collaboration, innovation, and continuous improvement. The team works in an Agile/Scrum framework with a focus on delivering high-quality, production-ready systems. Close partnerships with product, security, and platform teams ensure alignment on reliability, scalability, and security goals. Engineers are empowered to lead initiatives, mentor peers, and influence technical direction across the organization.
Additional Information
- This role includes participation in an on-call incident response rotation, with appropriate compensation and support.
- Candidates must be authorized to work in the country where the position is based.
- Remote work options may be available within specified time zones.
- The company supports professional development through training, certifications, and conference attendance.
- We value inclusive team culture and encourage diverse perspectives in technical decision-making.
- Security and compliance are integral to our operations, and all engineers are expected to adhere to best practices.
- The engineering team follows a DevOps model, where ownership of services extends through the full lifecycle.
- Code reviews and pair programming are standard practices to ensure quality and knowledge sharing.
- We use a blameless postmortem process to learn from incidents and improve system resilience.


