About the Role
This role bridges software engineering and systems operations to build and maintain resilient, low-latency financial platforms for fast-paced market environments.
Responsibilities
- Design and implement automated deployment pipelines for production systems
- Monitor system health and proactively identify performance bottlenecks
- Respond to and resolve critical production incidents with minimal downtime
- Develop tools to improve operational efficiency and reduce manual intervention
- Collaborate with development teams to enhance system reliability
- Maintain and scale infrastructure supporting high-frequency trading platforms
- Enforce observability standards across logging, metrics, and tracing
- Participate in on-call rotations for rapid incident response
- Optimize system performance under heavy transaction loads
- Ensure configurations adhere to security and compliance requirements
- Troubleshoot complex distributed system failures
- Drive post-incident reviews to prevent recurrence
- Implement disaster recovery and failover strategies
- Support capacity planning for future growth
- Integrate reliability best practices into the development lifecycle
- Manage configuration consistency across environments
- Automate routine operational tasks to increase team velocity
- Contribute to system architecture discussions with engineering teams
- Maintain documentation for operational procedures and system design
- Evaluate new technologies for improving platform stability
- Enforce SLA and SLO compliance across services
- Work closely with security teams to address vulnerabilities
- Improve deployment reliability through canary and blue-green strategies
- Support audit readiness for regulatory requirements
- Promote a culture of shared ownership for system reliability
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexible remote options
Team
Collaborative engineering team focused on high-performance systems
Technology Stack
- Primary languages include Go and Python
- Infrastructure runs on AWS with Kubernetes orchestration
- Monitoring stack includes Prometheus, Grafana, and ELK
- CI/CD powered by Jenkins and GitLab CI
- Infrastructure provisioning and configuration management via Terraform and Ansible
Performance Expectations
- Maintain 99.99% uptime for core trading services
- Respond to critical incidents within five minutes
- Reduce mean time to resolution by 20% year over year
- Achieve full automation of routine operational tasks
- Ensure all services meet defined SLOs
Available for qualified candidates