About the Role

In this role, you will build and maintain scalable systems that support high-availability services, combining software engineering practices with operational rigor to improve system resilience and efficiency.

Responsibilities

Design and implement reliable, scalable infrastructure for cloud-native applications
Develop automation tools to reduce manual intervention in system operations
Monitor system performance and proactively address potential issues
Respond to incidents with clear escalation paths and post-incident reviews
Improve system uptime and reduce mean time to recovery
Collaborate with development teams to enhance service reliability
Define and track key reliability metrics and service level objectives
Troubleshoot complex production issues across distributed systems
Optimize resource usage and cost-efficiency in cloud environments
Contribute to disaster recovery and business continuity planning
Implement observability solutions including logging, metrics, and tracing
Enforce best practices in configuration management and deployment safety
Support CI/CD pipelines with reliability-focused testing and validation
Drive improvements in system architecture for fault tolerance
Participate in on-call rotations with a focus on sustainable operations
Mentor engineers in reliability principles and operational discipline
Evaluate new technologies for improving system stability
Document system behavior, failure modes, and recovery procedures
Ensure compliance with security and operational standards
Work across time zones to support global service operations

Nice to Have

Experience with financial technology or regulated environments
Familiarity with formal incident management frameworks
Contributions to open-source projects related to infrastructure
Background in performance tuning and load testing
Knowledge of networking protocols and distributed consensus algorithms

Compensation

Competitive salary with performance-based incentives

Work Arrangement

Hybrid work model with flexible remote options

Team

Collaborative engineering team focused on building resilient, cloud-native systems

Our Tech Stack

We use Google Cloud Platform as our primary infrastructure
Services are containerized using Docker and orchestrated with Kubernetes
Infrastructure is managed through Terraform for consistent deployments
Monitoring is powered by Prometheus and Grafana
Logging and tracing are handled via Fluentd and OpenTelemetry

Engineering Culture

We value transparency, ownership, and continuous learning
Engineers are encouraged to propose and lead technical initiatives
Blameless postmortems are standard practice after incidents
We maintain a strong focus on documentation and knowledge sharing
Team members are supported in attending conferences and training

Visa Sponsorship

Available for qualified candidates requiring work authorization

Thought Machine is hiring a Senior Site Reliability Engineer