Join a forward-thinking engineering team dedicated to building resilient, observable, and highly reliable cloud systems. In this role, you will drive site reliability engineering practices across infrastructure and applications, ensuring systems are robust, scalable, and well-monitored.
Key Responsibilities
- Apply and promote the principles of the Well-Architected Framework, with a focus on system resiliency
- Design and execute controlled chaos engineering tests to identify weaknesses and improve fault tolerance
- Support cloud migration initiatives by evaluating workloads and minimizing operational disruption
- Oversee migration progress to ensure smooth, reliable transitions to cloud environments
- Improve observability through the design and implementation of monitoring, logging, and alerting solutions
- Collaborate with IT teams to align observability strategies with business and technical requirements
- Review cloud deployments for adherence to internal standards and reliability benchmarks
- Identify and resolve gaps in system visibility and monitoring coverage
- Stay current with emerging technologies and lead knowledge-sharing sessions across teams
- Contribute to capacity planning, performance analysis, and system optimization efforts
- Guide peers through technical mentorship and collaborative problem-solving
- Evaluate and enhance the organization’s overall resilience posture
- Participate in a rotating on-call schedule to support critical system reliability
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science or a related technical field
- Minimum of 5 years of experience with cloud platforms, including at least 3 years focused on AWS
- At least 3 years in a Site Reliability Engineering (SRE) or similar infrastructure-focused role
- Proven experience with monitoring and application performance monitoring (APM) tools, logging systems, and alerting platforms
- Familiarity with incident, problem, and change management workflows
- Deep understanding of SRE methodologies, including SLIs, SLOs, and error budgets
- Strong diagnostic abilities and experience mentoring technical colleagues
- Hands-on expertise with Kubernetes and containerized environments
- Advanced skills in CI/CD pipelines and Infrastructure as Code tools such as Terraform (HCL) and AWS CloudFormation
- Proficient with Git and version control best practices
- Excellent organizational habits and documentation practices
- Effective time management and research capabilities
- Strong command of Linux systems, networking fundamentals, and scripting languages
Preferred Skills
- Experience with message streaming platforms, particularly Kafka (Amazon MSK)
- Working knowledge of relational databases including Postgres and MySQL
- Proficiency in scripting or programming with Python or Go
Technology Environment
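Our stack centers on AWS, Kubernetes, and Linux. Infrastructure as Code (IaC) is written in Terraform (HCL) and AWS CloudFormation, versioned in Git, and delivered through CI/CD pipelines. The environment also includes APM tools, logging and notification systems, Kafka (MSK), Postgres, and MySQL, with networking and scripting work done largely in Python and Go.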
What We Offer
- Competitive compensation and benefits package
- A stimulating technical environment that encourages innovation
- Ongoing learning opportunities and access to international training programs
Work Environment
This role supports a culture centered on cloud resilience, continuous learning, peer mentorship, adherence to best practices, and driving organizational change toward greater system reliability.


