A leading technology organization is seeking a Senior Software Engineer specializing in Reliability to architect, execute, and maintain systems ensuring cloud-based production environments remain secure, compliant, and highly available. The selected professional will be a pivotal team member of a nascent Site Reliability Engineering (SRE) team, constructing processes and infrastructure supporting mission-critical workloads in regulated contexts. You will engage with engineering, product, and operational teams to establish service-level objectives, develop monitoring and automation, and elevate overall system reliability. The ideal candidate possesses extensive expertise in cloud infrastructure, automation, and observability, and thrives on resolving intricate distributed system challenges. This opportunity enables reshaping SRE culture and practices from inception while contributing to high-impact projects supporting regulated and commercial operations.

Accountabilities:
· Architect observability practices including metrics, traces, dashboards, logs, and alerting for production systems.
· Collaborate with engineering, product, and lab teams to define SLIs/SLOs, error budgets, and incident response protocols.
· Develop and sustain operational playbooks and runbooks for reliability and compliance.
· Engage in on-call rotations, championing automation and self-healing for production systems.
· Contribute to deployment processes and infrastructure automation using Infrastructure as Code (IaC).
· Participate in incident reviews, postmortems, and disaster recovery exercises to enhance system reliability.
· Mentor colleagues, promote best practices, and help establish SRE culture and strategy.

Requirements:
· Bachelor's degree in Computer Science, Engineering, or equivalent experience.
· 5+ years of experience in software engineering, SRE, or DevOps roles (Python or Go preferred).
· Proven experience deploying and operating production workloads in cloud environments (AWS, GCP, or Azure).
· Expertise in Infrastructure as Code (Terraform, Pulumi, Bicep/ARM).
· Experience with incident management platforms (e.g., Incident.io, ServiceNow, Opsgenie, PagerDuty).
· Strong knowledge of Kubernetes (AKS, GKE, EKS) and cloud networking.
· Proficiency with observability platforms such as DataDog, Prometheus/Grafana, or OpenTelemetry.
· Exceptional troubleshooting, root-cause analysis, and automation skills.
· Capacity to work independently and collaborate effectively across functional teams.
· Experience in regulated environments (healthcare, biotech) and familiarity with compliance-driven change management is a plus.

Benefits:
· Competitive salary: $131,325–$201,000 USD, with potential for pre-IPO equity and cash bonuses.
· Comprehensive medical, dental, and vision coverage.
· Paid time off and holidays.
· Remote work flexibility.
· Opportunities for professional growth, mentorship, and leadership in a foundational SRE team.
· Participation in shaping processes for high-reliability systems in regulated environments.
California, United States Remote (Country) Employment
EX Squared is hiring a Senior Software Engineer - Reliability (Remote)


