About the Role
The role involves building and maintaining scalable systems that support high-availability services, combining software engineering practices with operational rigor to improve system resilience and efficiency.
Responsibilities
- Design and implement reliable, scalable infrastructure for cloud-native applications
- Develop automation tools to reduce manual intervention in system operations
- Monitor system performance and proactively address potential issues
- Respond to incidents with clear escalation paths and post-incident reviews
- Improve system uptime and reduce mean time to recovery
- Collaborate with development teams to enhance service reliability
- Define and track key reliability metrics and service level objectives
- Troubleshoot complex production issues across distributed systems
- Optimize resource usage and cost-efficiency in cloud environments
- Contribute to disaster recovery and business continuity planning
- Implement observability solutions including logging, metrics, and tracing
- Enforce best practices in configuration management and deployment safety
- Support CI/CD pipelines with reliability-focused testing and validation
- Drive improvements in system architecture for fault tolerance
- Participate in on-call rotations with a focus on sustainable operations
- Mentor engineers in reliability principles and operational discipline
- Evaluate new technologies for improving system stability
- Document system behavior, failure modes, and recovery procedures
- Ensure compliance with security and operational standards
- Work across time zones to support global service operations
Nice to Have
- Experience with financial technology or regulated environments
- Familiarity with formal incident management frameworks
- Contributions to open-source projects related to infrastructure
- Background in performance tuning and load testing
- Knowledge of networking protocols and distributed consensus algorithms
Compensation
Competitive salary with performance-based incentives
Work Arrangement
Hybrid work model with flexible remote options
Team
Collaborative engineering team focused on building resilient, cloud-native systems
Our Tech Stack
- We use Google Cloud Platform as our primary infrastructure
- Services are containerized using Docker and orchestrated with Kubernetes
- Infrastructure is managed through Terraform for consistent deployments
- Monitoring is powered by Prometheus and Grafana
- Logging and tracing are handled via Fluentd and OpenTelemetry
Engineering Culture
- We value transparency, ownership, and continuous learning
- Engineers are encouraged to propose and lead technical initiatives
- Blameless postmortems are standard practice after incidents
- We maintain a strong focus on documentation and knowledge sharing
- Team members are supported in attending conferences and training
Visa Sponsorship
Available for qualified candidates requiring work authorization