Photon Group is hiring a Site Reliability Engineer to ensure the availability, reliability, scalability, and performance of our most critical, customer-facing eCommerce microservices. You will apply Google-inspired SRE principles to balance feature velocity and system reliability using Service Level Objectives, Service Level Indicators, and error budgets.
What You'll Do
- Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.
- Use error budgets to influence release decisions, prioritize reliability work, and manage operational risk.
- Design and maintain observability platforms including metrics, logs, traces, and real-time telemetry.
- Track, manage, and reduce operational toil by converting repetitive work into actionable Jira stories and epics.
- Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.
- Lead incident response, act as an escalation point for high-severity incidents, and drive blameless postmortems.
- Capture incident action items and reliability improvements in Jira, ensuring closure, accountability, and continuous improvement.
- Partner with scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.
- Perform deep root cause analysis, debugging, and performance tuning across distributed systems.
- Promote shift-left reliability by embedding operability, monitoring, and failure testing early in the software development lifecycle.
- Drive continuous improvement through automation, self-healing systems, chaos engineering, and capacity planning.
- Maintain runbooks, playbooks, and knowledge repositories, linking documentation to Jira tasks to reduce Mean Time to Resolution.
- Provide technical leadership and mentoring to junior SREs and engineers.
- Collaborate with global, distributed teams, leveraging Jira for transparent planning, dependency tracking, and execution.
What We're Looking For
- 4+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.
- Hands-on experience with Java/J2EE-based distributed systems.
- Proven ability to design and operate systems using SLO-driven reliability models.
- Experience defining and measuring SLIs (availability, latency, error rates, throughput, saturation).
- Good understanding of NoSQL technologies and RDBMS, including the ability to write queries.
- Experience deploying and operating services on cloud platforms (AWS, Azure, or Google Cloud).
- Expertise with observability, APM, CDN/caching, and digital experience analytics tools (Dynatrace, Splunk, ELK, Akamai, QuantumMetric/Tealeaf, etc.).
- Strong experience using Jira for backlog management, incident follow-ups, toil reduction tracking, and cross-team coordination.
- Ability to independently own services and drive reliability initiatives end-to-end.
- Strong communication skills and ability to influence engineering and product teams.
- Experience participating in an on-call rotation and handling critical or high-severity incidents.
Nice to Have
- Experience with React.
- Experience building and operating microservices architectures using Spring Boot, Groovy, or similar.
- Strong understanding of CI/CD pipelines, release automation, and progressive delivery.
- Experience with eCommerce domains such as Catalog, Customer Data, and Order Management.
- Familiarity with search platforms (Endeca, Solr, Lucene, Elasticsearch).
- Proficiency in scripting and automation (Python, Bash, Ruby, Perl, PowerShell).
- Experience with ITSM tools integrated with Jira workflows.
- Exposure to capacity planning, load testing, and chaos engineering.
Technical Stack
- Languages & Frameworks: Java/J2EE, React, Spring Boot, Groovy
- Databases & Search: NoSQL, RDBMS, Endeca, Solr, Lucene, Elasticsearch
- Cloud Platforms: AWS, Azure, Google Cloud
- Observability & Tools: Dynatrace, Splunk, ELK, Akamai, QuantumMetric/Tealeaf, Jira
- Scripting & Automation: Python, Bash, Ruby, Perl, PowerShell
Photon Group is an equal opportunity employer.