About the Role

This role maintains and improves the reliability of large-scale systems, combining software engineering and operations expertise to support critical infrastructure and services.

Responsibilities

Monitor system performance and respond to incidents
Design and maintain scalable infrastructure systems
Implement automated solutions for operational tasks
Collaborate with development teams to improve code deployability
Troubleshoot complex production issues
Develop tools to enhance system observability
Support continuous integration and delivery pipelines
Optimize system reliability and uptime
Participate in on-call rotation for critical systems
Document system architecture and operational procedures
Enforce security and compliance standards
Contribute to disaster recovery planning
Evaluate new technologies for operational efficiency
Improve monitoring and alerting frameworks
Work with distributed systems and cloud platforms
Ensure efficient resource utilization across environments
Drive incident post-mortem analysis and follow-up actions
Promote best practices in configuration management
Support containerized application deployments
Maintain high availability for critical services
Assist in capacity planning and forecasting
Integrate feedback loops for system improvements
Collaborate on performance tuning initiatives
Support global infrastructure with low-latency requirements
Contribute to internal knowledge sharing

Nice to Have

Experience with large-scale production environments
Background in open-source contributions
Familiarity with service mesh technologies
Knowledge of database administration
Experience with infrastructure as code tools
Understanding of site reliability engineering principles
Experience collaborating with globally distributed teams
Proficiency with automation frameworks
Experience in agile development environments
Strong grasp of system architecture patterns

Compensation

Competitive salary and benefits package

Work Arrangement

Hybrid work model with flexibility for remote operations

Team

Collaborative engineering environment focused on system stability and performance

About the Team

This team focuses on building resilient systems using open-source technologies.
Engineers work closely with development and operations groups to deliver reliable services.

Technology Stack

Primary tools include Linux, Kubernetes, Prometheus, and Git.
Cloud platforms and containerized environments are central to operations.

Available for qualified candidates

Red Hat is hiring a Site Reliability Engineer