Join Albatross as a Site Reliability Engineer to take ownership of the reliability and observability of our platform. This is a hands-on leadership role where you will design, build, and maintain our observability stack, lead incident response, oversee releases, and establish processes and standards.
What You'll Do
- Own and evolve our observability stack, including Prometheus, Grafana, Loki, and Jaeger, along with dashboards, alerts, and SLOs.
- Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
- Lead incident response and establish blameless postmortems, runbooks, and automated remediation.
- Define, track, and improve SLIs and SLOs to proactively reduce reliability risk.
- Own the release process end-to-end, improving deployment speed, safety, and recovery.
- Implement progressive rollouts, feature flags, and rollback strategies.
- Embed observability into the development lifecycle in close collaboration with engineering teams.
- Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.
What We're Looking For
- 5+ years in SRE, platform engineering, DevOps, or a similar hands-on role.
- Strong production experience with Kubernetes and modern observability stacks like Prometheus, Grafana, Loki, and Jaeger/OpenTelemetry.
- Proven track record of leading incident response and building monitoring systems that teams actually use.
- Deep distributed systems knowledge and production debugging experience.
- A pragmatic approach to tooling, and experience building alerting that teams trust.
- Clear communication across engineering, product, and leadership.
- A STEM degree (Computer Science, Engineering, Mathematics, or similar).
Nice to Have
- Contributions to open-source observability projects.
- A background in high-scale or high-availability environments.
Technical Stack
- Prometheus
- Grafana
- Loki
- Jaeger
- OpenTelemetry
- Kubernetes
Benefits & Compensation
- Remote-first, async-friendly culture.
- Ownership and autonomy; you'll shape how we do reliability.
- A team that cares about building things right.
Work Mode
This is a fully remote position open to candidates based in Europe.


