Feldera is looking for a Software Engineer: Reliability and Performance to ensure our platform runs flawlessly under real-world conditions. You will push our engine and control plane to their limits by designing automated tests, validating correctness under stress, and ensuring smooth upgrades and recovery.
What You'll Do
- Design and run long-lived workloads that mimic production environments, including sustained load, skewed data distributions, and upgrade workflows.
- Build metrics and dashboards that continuously measure throughput, latency, and resource efficiency, and use them to guide system improvements.
- Run chaos and fault injection experiments involving node failures, crashes, network partitions, resource contention, and rolling upgrades.
- Own and evolve CI/CD pipelines to make them faster, more reliable, and more reflective of production conditions.
- Work closely with core systems engineers to pinpoint bottlenecks, identify regressions, and improve reliability mechanisms.
What We're Looking For
- Strong background in systems engineering, performance testing, or site reliability engineering.
- Fluency in Python and Linux fundamentals.
- Experience with distributed systems and database concepts (consistency, fault tolerance, transactions).
- Experience with CI/CD pipeline engineering (GitHub Actions, Docker, Kubernetes).
- Hands-on experience running large-scale and long-running workloads, preferably in a cloud-native environment.
- Curiosity, rigor, and the ability to design experiments that simulate messy real-world conditions.
Nice to Have
- Rust experience is strongly valued.
Technical Stack
- Languages: Python, Rust
- Platform: Linux
- Infrastructure: GitHub Actions, Docker, Kubernetes
Work Mode
This is a remote position open to candidates in time zones from UTC+0 to UTC+5:30.