Hong Kong S.A.R. Remote (Country)

HSBC Group is hiring a Site Reliability Engineer

As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and operational excellence of a globally distributed API and microservices platform. You will apply SRE principles to automate operations, enhance system observability, reduce manual toil, and improve incident response times. Working closely with development and operations teams, you will help bridge the gap between software development and infrastructure, ensuring systems are resilient, self-healing, and aligned with business objectives. Your work will directly impact the stability and performance of mission-critical banking services operating 24/7 across multiple regions including Hong Kong, the UK, and the US.

Responsibilities

  • Maintain and improve the reliability, scalability, and operational health of a distributed API and microservices platform spanning Hong Kong, the UK, and the US.
  • Apply Site Reliability Engineering practices to enhance system observability, reduce manual effort, manage risk, and support safe, continuous business-driven changes.
  • Advance technical expertise in core technologies including AWS, Kubernetes, Kong API Gateway, Istio Service Mesh, and hybrid cloud infrastructure.
  • Build and refine observability solutions using monitoring and alerting tools to support incident response, capacity planning, release safety, and efficient resource use.
  • Lead incident triage, root cause analysis, and resolution using data-driven approaches, ensuring technical insights inform long-term improvements.
  • Design and implement automated and self-healing mechanisms to address recurring system failures and improve platform resilience.
  • Write and maintain code for platform reliability initiatives, including monitoring-as-code, streamlined tenant onboarding, and enhanced release controls.
  • Develop and sustain operational documentation such as runbooks, onboarding guides, and release procedures to support platform operations.
  • Participate in an on-call rotation to provide continuous support for critical systems operating around the clock.

Requirements

  • Fluency in written and spoken English is essential.
  • Experience collaborating in a global, multicultural team environment.
  • Commitment to transparent, honest, and accountable communication.
  • Willingness to treat failure as a path to more resilient systems, learning from it through blameless post-mortems.
  • Strong foundation in evidence-based and structured problem-solving techniques.
  • Ability to approach decisions through first principles and functional reasoning.
  • Initiative and the drive to take tasks to completion without over-analysis.
  • Ownership of outcomes and delivery of high-quality results on schedule.
  • Ability to collaborate without ego, seeking the best solutions regardless of origin.
  • Readiness to prioritize team success and collective outcomes over individual contributions.
  • Solid understanding of distributed systems and networking fundamentals.
  • Programming experience in at least one of: Python, Java, Go, Ruby, or Bash scripting.
  • Ability to debug, optimize code, and automate repetitive operational tasks to reduce toil.
  • Proven experience with observability platforms such as Splunk, Datadog, AppDynamics, or CloudWatch.
  • Familiarity with SLOs, SLIs, and error budgets to measure and manage system reliability.
  • Hands-on experience with AWS cloud services.
  • Proficiency with Linux and Python scripting.
  • Experience with Kubernetes and Docker containerization technologies.
  • Knowledge of Doris as part of the data stack.
  • Experience using Jenkins for CI/CD pipelines.
  • Proficiency with GitHub for version control.
  • Production support background with incident management responsibilities.

Nice to Have

  • Production support experience in containerized or virtualized environments, especially with Kubernetes.
  • Background in large-scale API development and management using platforms like Kong.
  • Skills in performance analysis and tuning of infrastructure and applications.
  • Familiarity with service mesh technologies, particularly Istio and Envoy.
  • Experience working in DevOps and Agile environments.
  • CI/CD pipeline development experience.
  • Proficiency with Infrastructure

Tech Stack

AWS, Kubernetes, Docker, Istio, Envoy, Kong API Gateway, Splunk, Datadog, AppDynamics, CloudWatch, Jenkins, GitHub, Linux, Python, Doris, SLO/SLI frameworks

Benefits

  • Competitive salary and performance-based bonuses aligned with global standards.
  • Comprehensive health, dental, and wellness benefits for employees and dependents.
  • Flexible working hours and remote work options to support work-life balance.
  • Generous paid time off, including vacation, sick leave, and parental leave.
  • Professional development opportunities including certifications, training, and conference access.
  • Employee stock purchase plans and long-term incentive programs.

Work Arrangement

Hybrid: a combination of remote and on-site work, with regional office presence in Hong Kong, the UK, and the US.

Team

You will join a global Site Reliability Engineering team responsible for the uptime, performance, and resilience of a mission-critical banking platform. The team operates in a highly collaborative, agile environment with members distributed across multiple time zones. Emphasis is placed on automation, observability, and continuous improvement using modern DevOps practices. You will work closely with platform engineers, developers, and operations teams to deliver reliable, scalable services to internal and external customers.

Additional Information

  • This role requires participation in a 24/7 on-call rotation with appropriate compensation and support.
  • Candidates must be authorized to work in the country where the position is based.
  • The company supports visa sponsorship for eligible roles and qualified candidates.

Required Skills

AWS, Kubernetes, Docker, Python, Java, Go, Bash, Ruby, Distributed Systems, Networking, Kong API Gateway, Istio Service Mesh, Envoy, Splunk, Datadog, AppDynamics, CloudWatch, Jenkins, GitHub, Terraform, Ansible, Doris, Problem-Solving, English

About the Company

HSBC Group is a global banking and financial services institution.

Job Details

  • Department: Information Technology
  • Category: Infrastructure
  • Posted: 3 months ago