As a Site Reliability Engineer, you will play a critical role in ensuring the stability, scalability, and security of our global cloud infrastructure. You will work at the intersection of development and operations, applying engineering principles to build automated, self-healing systems that support high-performance trading platforms. Your responsibilities will span infrastructure design, incident response, performance optimization, and proactive monitoring, all while fostering a culture of reliability and continuous improvement across engineering teams.

Responsibilities

Architect, deploy, and manage highly available and scalable AWS infrastructure using Infrastructure-as-Code tools
Operate and secure Kubernetes clusters, including EKS and self-managed setups, to support containerized services
Build and maintain CI/CD and GitOps pipelines to streamline application deployment and testing
Develop observability solutions using Prometheus, Grafana, Datadog, or equivalent tools to enhance system reliability
Enforce cloud security standards, including IAM policies and compliance with the SOC 2 and ISO 27001 frameworks
Diagnose and resolve infrastructure issues through root cause analysis and implement performance optimizations
Automate provisioning and configuration using Terraform, Ansible, or similar tools
Collaborate with engineering, architecture, and security teams to advance DevOps practices
Design disaster recovery, failover, and backup strategies to ensure continuous operations

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical discipline
Minimum of 5 years in cloud infrastructure, Site Reliability Engineering, or DevOps roles
Deep experience with AWS services including EC2, S3, Lambda, RDS, VPC, and IAM
Hands-on experience managing Kubernetes environments such as EKS, K3s, or self-hosted clusters
Proficiency in scripting and automation with Python, Bash, or equivalent languages
Proven experience with Infrastructure-as-Code tools like Terraform, CloudFormation, or Ansible
Familiarity with monitoring, logging, and observability platforms such as Prometheus, Grafana, or Datadog
Solid understanding of networking concepts including VPCs, DNS, load balancing, and firewalls
Experience working with CI/CD, DevOps, and GitOps methodologies
Background in operating low-latency, high-performance systems
Knowledge of serverless and event-driven architectures
Ability to work and communicate effectively in asynchronous environments
Demonstrated commitment to improving system availability and performance through data-driven insights
Strong problem-solving skills, ownership mindset, and collaborative approach
Exposure to cloud cost optimization and FinOps principles

Nice to Have

Interest in or experience with trading systems or financial markets
AWS Certified SysOps Administrator - Associate certification, held or in progress
Familiarity with Rust compilation workflows and tooling
Prior experience in cryptocurrency, traditional finance, or trading environments

Tech Stack

AWS, EC2, S3, Lambda, RDS, VPC, IAM, Kubernetes, EKS, K3s, Terraform, Ansible, CloudFormation, Prometheus, Grafana, Datadog, LGTM stack, Python, Bash, Infrastructure-as-Code (IaC), CI/CD, GitOps, Serverless, Event-driven computing, Rust

Benefits

Competitive compensation package with benefits tailored to employment or contractor status
Flexible working hours and full remote capability across global locations
Opportunity to shape and grow within an entrepreneurial, excellence-driven environment
Professional development plan with learning and certification support aligned to team and individual goals

Compensation

Competitive salary package, with benefits tailored to employment or contractor status

Work Arrangement

Fully remote with flexible hours, supporting a globally distributed team

Team

You will join a high-performing, globally distributed engineering team responsible for maintaining and scaling mission-critical infrastructure. The team emphasizes collaboration, knowledge sharing, and continuous learning, with a strong focus on operational excellence and proactive system design.

Additional Information

This role supports 24/7 systems with occasional on-call responsibilities and incident response duties
Regular participation in cross-team initiatives and architecture reviews is expected
Opportunities for mentorship, technical leadership, and process improvement are encouraged
The team follows agile practices with a focus on automation, observability, and security-by-design
Candidates must be comfortable working in a fast-paced environment with evolving technical challenges

Keyrock is hiring a Site Reliability Engineer
