About the Role
In this role, you will collaborate with engineering teams to improve platform stability, automate operational tasks, and champion best practices in monitoring, incident response, and system design.
Responsibilities
- Collaborate with development teams to ensure service reliability and scalability
- Design and implement automated solutions for operational workflows
- Monitor system performance and respond to incidents in production environments
- Develop and maintain tools for deployment, monitoring, and diagnostics
- Contribute to capacity planning and system architecture improvements
- Troubleshoot complex technical issues across distributed systems
- Enforce observability standards through logging, metrics, and tracing
- Participate in on-call rotations for critical system support
- Maximize system uptime and reduce mean time to recovery (MTTR)
- Drive post-incident reviews and implement corrective actions
- Support continuous integration and delivery pipelines
- Ensure configurations adhere to security and compliance standards
- Improve system resilience through proactive failure testing
- Collaborate on disaster recovery planning and execution
- Document system architecture and operational procedures
- Mentor junior engineers in reliability best practices
- Evaluate new technologies for operational efficiency
- Work closely with product teams to align reliability goals
- Maintain infrastructure as code for consistency and repeatability
- Analyze system dependencies to reduce single points of failure
- Implement scalable solutions for data processing systems
- Promote a culture of shared ownership for system health
- Contribute to service level objective definitions and tracking
- Support cloud infrastructure management and optimization
- Ensure efficient resource utilization across environments
Nice to Have
- Master’s degree in computer science or related field
- Experience with big data platforms such as Apache Spark or Flink
- Contributions to open-source infrastructure projects
- Certifications in cloud or DevOps technologies
- Prior work in high-throughput data processing environments
- Exposure to service mesh technologies like Istio
- Knowledge of gRPC and protocol-level observability
- Experience with large-scale event streaming systems
- Background in machine learning infrastructure operations
- Leadership in reliability initiatives across engineering teams
Compensation
Competitive salary with performance-based incentives
Work Arrangement
Hybrid work model with flexibility for remote or office-based work
Team
Part of a distributed engineering team focused on data infrastructure and reliability
About the Datacraft team
This team is responsible for building and maintaining scalable data infrastructure that powers core product functionality. The focus is on reliability, automation, and performance at scale.
What We Value
Collaboration, transparency, technical excellence, and a proactive approach to system health and incident prevention.
Sponsorship
Visa sponsorship is available for qualified candidates.