As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and operational excellence of a globally distributed API and microservices platform. You will apply SRE principles to automate operations, enhance system observability, reduce manual toil, and improve incident response times. Working closely with development and operations teams, you will help bridge the gap between software development and infrastructure, ensuring systems are resilient, self-healing, and aligned with business objectives. Your work will directly impact the stability and performance of mission-critical banking services operating 24/7 across multiple regions including Hong Kong, the UK, and the US.

Responsibilities

Maintain and improve the reliability, scalability, and operational health of a distributed API and microservices platform spanning Hong Kong, the UK, and the US.
Apply Site Reliability Engineering practices to enhance system observability, reduce manual effort, manage risk, and support safe, continuous business-driven changes.
Advance technical expertise in core technologies including AWS, Kubernetes, Kong API Gateway, Istio Service Mesh, and hybrid cloud infrastructure.
Build and refine observability solutions using monitoring and alerting tools to support incident response, capacity planning, release safety, and efficient resource use.
Lead incident triage, root cause analysis, and resolution using data-driven approaches, ensuring technical insights inform long-term improvements.
Design and implement automated and self-healing mechanisms to address recurring system failures and improve platform resilience.
Write and maintain code for platform reliability initiatives, including monitoring-as-code, streamlined tenant onboarding, and enhanced release controls.
Develop and sustain operational documentation such as runbooks, onboarding guides, and release procedures to support platform operations.
Participate in an on-call rotation to provide continuous support for critical systems operating around the clock.

Requirements

Fluency in written and spoken English is essential.
Experience collaborating in a global, multicultural team environment.
Commitment to transparent, honest, and accountable communication.
Treat failure as a path to more resilient systems, using blameless post-mortems to learn from incidents.
Strong foundation in evidence-based and structured problem-solving techniques.
Approach decisions through first principles and functional reasoning.
Demonstrate initiative and drive tasks to completion without over-analysis.
Take ownership of outcomes and deliver high-quality results on schedule.
Collaborate without ego, seeking the best solutions regardless of origin.
Prioritize team success and collective outcomes over individual contributions.
Solid understanding of distributed systems and networking fundamentals.
Programming experience in at least one of: Python, Java, Go, Ruby, or Bash scripting.
Ability to debug, optimize code, and automate repetitive operational tasks to reduce toil.
Proven experience with observability platforms such as Splunk, Datadog, AppDynamics, or CloudWatch.
Familiarity with SLOs, SLIs, and error budgets to measure and manage system reliability.
Hands-on experience with AWS cloud services.
Proficiency with Linux and Python scripting.
Experience with Kubernetes and Docker containerization technologies.
Knowledge of Doris as part of the data stack.
Experience using Jenkins for CI/CD pipelines.
Proficiency with GitHub for version control.
Production support background with incident management responsibilities.

Nice to Have

Production support experience in containerized or virtualized environments, especially with Kubernetes.
Background in large-scale API development and management using platforms like Kong.
Skills in performance analysis and tuning of infrastructure and applications.
Familiarity with service mesh technologies, particularly Istio and Envoy.
Experience working in DevOps and Agile environments.
CI/CD pipeline development experience.
Proficiency with Infrastructure

Tech Stack

AWS, Kubernetes, Docker, Istio, Envoy, Kong API Gateway, Splunk, Datadog, AppDynamics, CloudWatch, Jenkins, GitHub, Linux, Python, Doris, SLO/SLI frameworks

Benefits

Competitive salary and performance-based bonuses aligned with global standards.
Comprehensive health, dental, and wellness benefits for employees and dependents.
Flexible working hours and remote work options to support work-life balance.
Generous paid time off, including vacation, sick leave, and parental leave.
Professional development opportunities including certifications, training, and conference access.
Employee stock purchase plans and long-term incentive programs.

Work Arrangement

Hybrid (a combination of remote and on-site work, with regional offices in Hong Kong, the UK, and the US)

Team

You will join a global Site Reliability Engineering team responsible for the uptime, performance, and resilience of a mission-critical banking platform. The team operates in a highly collaborative, agile environment with members distributed across multiple time zones. Emphasis is placed on automation, observability, and continuous improvement using modern DevOps practices. You will work closely with platform engineers, developers, and operations teams to deliver reliable, scalable services to internal and external customers.

Additional Information

This role requires participation in a 24/7 on-call rotation with appropriate compensation and support.
Candidates must be authorized to work in the country where the position is based.
The company supports visa sponsorship for eligible roles and qualified candidates.

HSBC Group is hiring a Site Reliability Engineer


