Shape the backbone of a growing AI platform as a Staff Site Reliability Engineer. You’ll ensure production systems remain stable, secure, and scalable across AWS infrastructure, with a focus on EKS, Istio, and GitOps-driven workflows. This role is central to maintaining system health, improving automation, and enabling fast, reliable deployments in a cloud-native environment.
What You’ll Do
- Operate and optimize AWS environments, ensuring high availability and performance across EC2, EKS, IAM, and networking services.
- Manage Kubernetes clusters for workload stability, scaling, and configuration integrity.
- Implement and maintain Istio service mesh for secure traffic management and observability.
- Support GitOps pipelines using Flux CD to enforce consistent, auditable infrastructure changes.
- Build automation tools and platform-level improvements that enhance developer velocity and system resilience.
- Design and maintain monitoring, logging, and tracing systems across AI models, microservices, and data workflows.
- Lead incident response efforts, conduct root cause analysis, and drive long-term fixes.
- Collaborate with development teams to improve deployment practices and operational readiness.
- Support Nx-based monorepos to streamline development workflows at scale.
- Participate in an on-call rotation to ensure system reliability around the clock.
What We’re Looking For
- Proven experience with core AWS services including EKS, EC2, IAM, VPC, and load balancing in production environments.
- Strong Kubernetes expertise, particularly with EKS—covering autoscaling, networking, RBAC, and cluster lifecycle management.
- Hands-on experience with Istio or similar service mesh technologies.
- Experience with GitHub and GitHub Actions for CI/CD pipeline development and maintenance.
- Familiarity with monorepo architectures, especially Nx.
- Understanding of GitOps principles and tools like Flux CD.
- Solid foundation in Linux, containerization with Docker, and networking concepts.
- Proficiency with infrastructure-as-code tools such as AWS CDK or Terraform.
- Knowledge of SLOs, error budgets, incident response, and production best practices.
- Excellent written and verbal communication skills in English.
What You’ll Receive
- Indefinite contract with full legal benefits in Colombia
- Prepaid health insurance and life coverage
- Internet and home office allowances
- Competitive salary above market average
- Full remote flexibility with strong work-life balance
- Annual personal time-off allowance
- Sick leave top-up covering 100% of salary from Day 3 to Day 90
- Service recognition awards, including extra paid time off
- Vacation bonus after 5 years of service
- Training budget to support continuous learning
- Mentorship from seasoned technical leaders
Our Culture
We believe diverse perspectives drive innovation. Our environment values independence, trust, and accountability—where ideas are heard, collaboration is natural, and growth is supported. We promote inclusivity, encourage knowledge sharing, and recognize contributions meaningfully. With a global footprint and remote-first mindset, we offer room to lead, learn, and shape the future.
Equal Opportunity
We encourage applicants from all backgrounds. Committed to equity and inclusion, we aim to reflect the diversity of our users in every team we build. Everyone is welcome to contribute, grow, and belong.


