About the Role
In this role, you will collaborate with engineering teams to improve platform stability, automate operational tasks, and champion best practices in monitoring, incident response, and system design.
Responsibilities
- Collaborate with development teams to ensure service reliability and scalability
- Design and implement automated solutions for operational workflows
- Monitor system performance and respond to incidents in production environments
- Develop and maintain tools for deployment, monitoring, and diagnostics
- Contribute to capacity planning and system architecture improvements
- Troubleshoot complex technical issues across distributed systems
- Enforce observability standards through logging, metrics, and tracing
- Participate in on-call rotations for critical system support
- Maximize system uptime and reduce mean time to recovery (MTTR)
- Drive post-incident reviews and implement corrective actions
- Support continuous integration and delivery pipelines
- Ensure configurations adhere to security and compliance standards
- Improve system resilience through proactive failure testing
- Collaborate on disaster recovery planning and execution
- Document system architecture and operational procedures
- Mentor junior engineers in reliability best practices
- Evaluate new technologies for operational efficiency
- Work closely with product teams to align reliability goals
- Maintain infrastructure as code for consistency and repeatability
- Analyze system dependencies to reduce single points of failure
- Implement scalable solutions for data processing systems
- Promote a culture of shared ownership for system health
- Contribute to service level objective definitions and tracking
- Support cloud infrastructure management and optimization
- Ensure efficient resource utilization across environments
Nice to Have
- Master’s degree in computer science or related field
- Experience with big data platforms such as Apache Spark or Flink
- Contributions to open-source infrastructure projects
- Certifications in cloud or DevOps technologies
- Prior work in high-throughput data processing environments
- Exposure to service mesh technologies like Istio
- Knowledge of gRPC and protocol-level observability
- Experience with large-scale event streaming systems
- Background in machine learning infrastructure operations
- Leadership in reliability initiatives across engineering teams
Compensation
Competitive salary with performance-based incentives
Work Arrangement
Hybrid work model with flexibility for remote or office-based work
Team
Part of a distributed engineering team focused on data infrastructure and reliability
About the Datacraft team
This team is responsible for building and maintaining scalable data infrastructure that powers core product functionality. The focus is on reliability, automation, and performance at scale.
What We Value
Collaboration, transparency, technical excellence, and a proactive approach to system health and incident prevention.
Sponsorship
Visa sponsorship is available for qualified candidates.