About the Role

This role bridges development and operations by applying software engineering principles to infrastructure and operations problems. The focus is on building and maintaining reliable systems at scale.

Responsibilities

Design and implement scalable monitoring solutions for distributed systems
Develop automation tools to improve system reliability and reduce manual intervention
Respond to and resolve critical production incidents in a timely manner
Collaborate with development teams to enhance application performance and resilience
Drive post-incident reviews and implement corrective actions
Optimize system performance and availability across cloud environments
Maintain and improve CI/CD pipelines for faster and safer deployments
Enforce best practices in configuration management and infrastructure as code
Support capacity planning and system scalability initiatives
Contribute to disaster recovery planning and execution
Evaluate and integrate new technologies that improve system stability
Ensure compliance with security and operational standards
Mentor junior engineers and share operational knowledge
Participate in on-call rotations for critical systems
Improve observability through logging, tracing, and metrics collection
Troubleshoot complex cross-system issues in production environments
Promote a culture of blameless post-mortems and continuous improvement
Work closely with product teams to influence system design for reliability
Automate routine operational tasks to increase efficiency
Monitor system health and proactively address potential failures

Nice to Have

Master's degree in computer science or related field
Experience supporting mission-critical enterprise systems
Contributions to open-source projects
Familiarity with service mesh technologies
Knowledge of large-scale data replication and consistency models
Experience with performance benchmarking and tuning
Background in software development with production code contributions
Exposure to edge computing or hybrid cloud architectures
Certifications in cloud or systems administration
Track record of improving system uptime and reducing incident frequency

Compensation

Competitive salary and benefits package

Work Arrangement

Hybrid remote and office-based work model

Team

Collaborative engineering team focused on system reliability and scalability

Why This Role Matters

This position plays a key role in maintaining the stability and performance of large-scale services used by global customers.
Engineers in this role directly influence the reliability and efficiency of core infrastructure platforms.

Technology Environment

Work is conducted in a Linux-based, open-source environment with extensive use of cloud-native technologies.
Primary tools include Kubernetes, Prometheus, Git, and Ansible, running on public and private cloud infrastructure.

Red Hat is hiring a Senior Site Reliability Engineer