About the Role
The successful candidate will design and maintain reliable systems, automate operational workflows, and support scalable infrastructure, with a focus on uptime, performance, and resilience.
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid or remote options available
Team
Collaborative engineering environment focused on system reliability
Responsibilities
- Design and implement reliable, scalable systems
- Monitor infrastructure and application performance
- Respond to incidents and resolve outages
- Automate operational tasks and deployment pipelines
- Improve system observability and alerting
- Collaborate with development teams on service design
- Maintain documentation for systems and procedures
- Conduct root cause analysis for recurring issues
- Support capacity planning and resource optimization
- Enforce best practices in configuration management
Requirements
- Proven experience in site reliability or systems engineering
- Strong scripting skills in languages such as Python or Bash
- Experience with containerization and orchestration tools
- Familiarity with cloud platforms like AWS or GCP
- Knowledge of monitoring tools such as Prometheus or Grafana
- Understanding of networking and distributed systems
- Experience with CI/CD pipelines
- Proficiency with configuration management tools
- Ability to troubleshoot complex technical issues
- Strong communication and collaboration skills
Preferred Qualifications
- Experience with large-scale production environments
- Background in incident management frameworks
- Knowledge of security best practices in infrastructure
- Familiarity with service-level objectives and error budgets
- Contributions to open-source projects
- Experience with database administration
- Understanding of compliance and audit requirements
Available for qualified candidates