The Site Reliability Engineer ensures high availability, reliability, scalability, and performance of critical customer-facing microservices that support eCommerce platforms. This role applies SRE methodologies centered on Service Level Objectives, Service Level Indicators, and error budgets to balance system stability with rapid feature delivery in collaboration with engineering and product teams.

Responsibilities

Establish and manage Service Level Indicators, Service Level Objectives, and error budgets for key microservices in partnership with engineering and product teams.
Use error budgeting to guide release decisions, prioritize reliability improvements, and mitigate operational risks.
Build and maintain observability systems covering metrics, logs, traces, and real-time telemetry for distributed services.
Reduce operational toil by converting repetitive tasks into tracked Jira issues with clear ownership and measurable outcomes.
Design and implement system resiliency features including graceful degradation, redundancy, automated failover, and disaster recovery procedures.
Lead response efforts during high-severity incidents, serve as the escalation point, and conduct blameless postmortems.
Document incident follow-ups and reliability enhancements in Jira, ensuring accountability and timely resolution.
Work with development teams to improve system reliability through release readiness checks, change validation, and testing strategies.
Conduct in-depth root cause analysis, debugging, and performance optimization across complex distributed systems.
Integrate reliability practices early in the development lifecycle by promoting operability, monitoring, and failure testing.
Advance system reliability through automation, self-healing mechanisms, chaos engineering, and capacity planning initiatives.
Maintain and update runbooks, playbooks, and knowledge bases, linking them to Jira tasks to reduce mean time to resolution.
Provide technical mentorship and leadership to junior engineers and SRE team members.
Collaborate with global, distributed teams using Jira for transparent planning, dependency tracking, and execution oversight.
Participate in on-call rotations and respond to critical or high-impact production incidents.

Requirements

Minimum of four years of experience in site reliability engineering, software development, or production operations for large-scale eCommerce systems.
Hands-on experience working with Java/J2EE-based distributed applications.
Demonstrated experience applying SLO-driven models to design and operate reliable systems.
Experience defining and monitoring SLIs such as availability, latency, error rates, throughput, and saturation.
Solid understanding of NoSQL databases and relational database systems, including the ability to write and execute database queries.
Proven experience deploying and managing services on cloud platforms like AWS, Azure, or Google Cloud.
Proficiency with observability, application performance monitoring, and caching tools such as Dynatrace, Splunk, ELK, Akamai, QuantumMetric, or Tealeaf.
Extensive use of Jira for managing backlogs, tracking incident follow-ups, reducing toil, and coordinating across teams.
Ability to independently own and improve system reliability from end to end.
Strong communication skills with the ability to influence engineering and product stakeholders.

Nice to Have

Background in developing and maintaining microservices using Spring Boot, Groovy, or similar technologies; experience with React is beneficial.
Solid understanding of CI/CD pipelines, release automation, and progressive delivery techniques.
Experience in eCommerce domains including Catalog, Customer Data, and Order Management systems.
Familiarity with search technologies such as Endeca, Solr, Lucene, or Elasticsearch.
Proficiency in scripting and automation using Python, Bash, Ruby, Perl, or PowerShell.
Experience with ITSM tools integrated into Jira workflows.
Exposure to capacity planning, load testing, and chaos engineering practices.

Tech Stack

Java, J2EE, React, Spring Boot, Groovy, AWS, Azure, Google Cloud, Dynatrace, Splunk, ELK, Akamai, QuantumMetric, Tealeaf, NoSQL, RDBMS, Python, Bash, Ruby, Perl, PowerShell, Endeca, Solr, Lucene, Elasticsearch

Work Arrangement

Global

Team

Global, distributed teams

Additional Information

Experience participating in on-call rotations and managing critical or high-severity production incidents.

Photon Group is hiring a Site Reliability Engineer
