A leading technology organization is seeking a Senior Software Engineer specializing in Reliability to architect, execute, and maintain systems ensuring cloud-based production environments remain secure, compliant, and highly available. The selected professional will be a pivotal team member of a nascent Site Reliability Engineering (SRE) team, constructing processes and infrastructure supporting mission-critical workloads in regulated contexts. You will engage with engineering, product, and operational teams to establish service-level objectives, develop monitoring and automation, and elevate overall system reliability. The ideal candidate possesses extensive expertise in cloud infrastructure, automation, and observability, and thrives on resolving intricate distributed system challenges. This opportunity enables reshaping SRE culture and practices from inception while contributing to high-impact projects supporting regulated and commercial operations.

Accountabilities:
· Architect observability practices including metrics, traces, dashboards, logs, and alerting for production systems.
· Collaborate with engineering, product, and lab teams to define SLIs/SLOs, error budgets, and incident response protocols.
· Develop and sustain operational playbooks and runbooks for reliability and compliance.
· Engage in on-call rotations, championing automation and self-healing for production systems.
· Contribute to deployment processes and infrastructure automation using Infrastructure as Code (IaC).
· Participate in incident reviews, postmortems, and disaster recovery exercises to enhance system reliability.
· Mentor colleagues, promote best practices, and help establish SRE culture and strategy.

Requirements:
· Bachelor's degree in Computer Science, Engineering, or equivalent experience.
· 5+ years of experience in software engineering, SRE, or DevOps roles (Python or Go preferred).
· Proven experience deploying and operating production workloads in cloud environments (AWS, GCP, or Azure).
· Expertise in Infrastructure as Code (Terraform, Pulumi, Bicep/ARM).
· Experience with incident management platforms (e.g., Incident.io, ServiceNow, Opsgenie, PagerDuty).
· Strong knowledge of Kubernetes (AKS, GKE, EKS) and cloud networking.
· Proficiency with observability platforms such as DataDog, Prometheus/Grafana, or OpenTelemetry.
· Exceptional troubleshooting, root-cause analysis, and automation skills.
· Capacity to work independently and collaborate effectively across functional teams.
· Experience in regulated environments (healthcare, biotech) and familiarity with compliance-driven change management is a plus.

Benefits:
· Competitive salary: $131,325–$201,000 USD, with potential for pre-IPO equity and cash bonuses.
· Comprehensive medical, dental, and vision coverage.
· Paid time off and holidays.
· Remote work flexibility.
· Opportunities for professional growth, mentorship, and leadership in a foundational SRE team.
· Participation in shaping processes for high-reliability systems in regulated environments.

EX Squared is hiring a Senior Software Engineer - Reliability (Remote)

