As a Site Reliability Engineer, you will play a critical role in ensuring the stability, scalability, and security of our global cloud infrastructure. You will work at the intersection of development and operations, applying engineering principles to build automated, self-healing systems that support high-performance trading platforms. Your responsibilities will span infrastructure design, incident response, performance optimization, and proactive monitoring, all while fostering a culture of reliability and continuous improvement across engineering teams.
Responsibilities
- Architect, deploy, and manage highly available and scalable AWS infrastructure using Infrastructure-as-Code tools
- Operate and secure Kubernetes clusters, including EKS and self-managed setups, to support containerized services
- Build and maintain CI/CD and GitOps pipelines to streamline application deployment and testing
- Develop observability solutions using Prometheus, Grafana, Datadog, or equivalent tools to enhance system reliability
- Enforce cloud security standards, including IAM policies and compliance with the SOC 2 and ISO 27001 frameworks
- Diagnose and resolve infrastructure issues through root cause analysis and implement performance optimizations
- Automate provisioning and configuration using Terraform, Ansible, or similar tools
- Collaborate with engineering, architecture, and security teams to advance DevOps practices
- Design disaster recovery, failover, and backup strategies to ensure continuous operations
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline
- Minimum of 5 years in cloud infrastructure, Site Reliability Engineering, or DevOps roles
- Deep experience with AWS services including EC2, S3, Lambda, RDS, VPC, and IAM
- Hands-on experience managing Kubernetes environments such as EKS, K3s, or self-hosted clusters
- Proficiency in scripting and automation with Python, Bash, or equivalent languages
- Proven experience with Infrastructure-as-Code tools like Terraform, CloudFormation, or Ansible
- Familiarity with monitoring, logging, and observability platforms such as Prometheus, Grafana, or Datadog
- Solid understanding of networking concepts including VPCs, DNS, load balancing, and firewalls
- Experience working with CI/CD, DevOps, and GitOps methodologies
- Background in operating low-latency, high-performance systems
- Knowledge of serverless and event-driven architectures
- Ability to work and communicate effectively in asynchronous environments
- Demonstrated commitment to improving system availability and performance through data-driven insights
- Strong problem-solving skills, ownership mindset, and collaborative approach
- Exposure to cloud cost optimization and FinOps principles
Nice to Have
- Interest in or experience with trading systems or financial markets
- AWS Certified SysOps Administrator - Associate certification, held or pursued
- Familiarity with Rust compilation workflows and tooling
- Prior experience in cryptocurrency, traditional finance, or trading environments
Tech Stack
AWS, EC2, S3, Lambda, RDS, VPC, IAM, Kubernetes, EKS, K3s, Terraform, Ansible, CloudFormation, Prometheus, Grafana, Datadog, LGTM stack, Python, Bash, Infrastructure-as-Code (IaC), CI/CD, GitOps, Serverless, Event-driven computing, Rust
Benefits
- Competitive compensation package with benefits tailored to employment or contractor status
- Flexible working hours and full remote capability across global locations
- Opportunity to shape and grow within an entrepreneurial, excellence-driven environment
- Professional development plan with learning and certification support aligned to team and individual goals
Compensation
Competitive salary package, with benefits tailored to employment or contractor status
Work Arrangement
Fully remote with flexible hours, supporting a globally distributed team
Team
You will join a high-performing, globally distributed engineering team responsible for maintaining and scaling mission-critical infrastructure. The team emphasizes collaboration, knowledge sharing, and continuous learning, with a strong focus on operational excellence and proactive system design.
Additional Information
- This role supports 24/7 systems with occasional on-call responsibilities and incident response duties
- Regular participation in cross-team initiatives and architecture reviews is expected
- Mentorship, technical leadership, and process improvement are encouraged, and opportunities for each are available
- The team follows agile practices with a focus on automation, observability, and security-by-design
- Candidates must be comfortable working in a fast-paced environment with evolving technical challenges


