Jobgether is hiring a Senior Site Reliability Engineer to design, build, and maintain highly available, secure, and scalable systems for our production and machine learning environments. You will collaborate closely with software engineers, data scientists, and platform architects to ensure system reliability and performance.
What You'll Do
- Design, implement, and maintain cloud-native infrastructure on Kubernetes (EKS/GKE/AKS) for production systems.
- Architect and manage microservice deployments, ensuring reliable CI/CD pipelines and service performance.
- Collaborate with ML and Data teams to design, optimize, and monitor ML/AI workflows using tools like Databricks, Spark, Flyte, or Airflow.
- Establish and enforce SLOs/SLIs, conduct incident postmortems, and enhance system reliability and developer velocity.
- Lead architecture improvements focused on scalability, fault tolerance, performance, and cost optimization.
- Support secure infrastructure practices, including IAM, secret management, policy-as-code, and compliance controls.
- Mentor junior engineers and contribute to best practices across observability, infrastructure-as-code, and production readiness.
What We're Looking For
- Bachelor’s degree in a related field or equivalent work experience.
- 8+ years of experience in software, systems, or DevOps engineering.
- Strong expertise in Kubernetes deployment, scaling, networking, monitoring, and debugging.
- Proficiency in Golang and Python.
- Solid understanding of distributed systems, cloud architecture, and container orchestration.
- Experience building and maintaining microservice-based architectures in production.
- Familiarity with CI/CD pipelines (GitLab CI, ArgoCD, Flux, or similar).
- Deep experience with monitoring/observability tools (Datadog, Prometheus, Grafana, OpenTelemetry).
Nice to Have
- Experience designing or operating ML workflows and data pipelines.
- Background in system design or infrastructure architecture.
- Exposure to multi-cloud environments (AWS, GCP, Azure).
- Knowledge of security, compliance, and automation in production-grade systems.
- Contributions to open-source projects or internal platform tooling, or experience leading SRE transformations.
- Familiarity with service meshes (Istio, Linkerd) and API gateways (Kong, Envoy).
Technical Stack
- Kubernetes (EKS/GKE/AKS), Golang, Python
- Databricks, Spark, Flyte, Airflow
- GitLab CI, ArgoCD, Flux
- Datadog, Prometheus, Grafana, OpenTelemetry
- AWS, GCP, Azure
- Istio, Linkerd, Kong, Envoy
Team & Environment
You will collaborate closely with software engineers, data scientists, and platform architects.
Benefits & Compensation
- Competitive salary range: $138,000–$213,000.
- Performance-based bonuses and stock options.
- Unlimited paid time off.
- Health, dental, and vision coverage.
- Remote or hybrid work flexibility.
- Opportunities for professional growth and impact in a collaborative environment.
Work Mode
This is a remote position open to candidates in the United States; hybrid work is also available.



