Responsibilities
- Ensure Platform Stability. Operate and support the Dotcom and OMNI platforms (including BOPIS and Same-Day Delivery), ensuring high availability, resilience, and hyper-stable customer experiences during normal operations and peak traffic events.
- Lead Incident Response. Triage, diagnose, and resolve L2/L3 production incidents; lead post-incident reviews and partner with engineering teams on permanent corrective actions to eliminate root causes.
- Drive Intelligent Automation. Build automation solutions, reduce operational toil, and create AI-driven reliability tools and agentic workflows to improve mean time to resolution, productivity, and overall stability.
- Enhance Observability. Develop and optimize observability through logs, metrics, traces, dashboards, and anomaly detection; refine alerting and telemetry pipelines to proactively identify and resolve issues.
- Validate Release Readiness. Ensure world-class readiness for releases, seasonal events, feature launches, and traffic spikes through resiliency checks, performance validation, and comprehensive change reviews.
- Maintain Reliability Standards. Maintain and optimize SLO/SLI frameworks; monitor error budgets and partner with application teams on continuous reliability improvements.
Requirements
- Deep SRE Expertise. 6+ years of hands-on SRE, DevOps, or Production Engineering experience in high-scale digital applications, with a strong understanding of reliability principles and operational excellence.
- Cloud-Native Technical Skills. Strong exposure to Azure AKS, Kubernetes, Docker, Service Mesh, and API-driven architectures, with operational support experience for React front-end and Spring Boot microservices in production environments.
- Observability and Automation Mastery. Hands-on experience with observability tools (Dynatrace, Splunk, Grafana, Prometheus) and strong scripting abilities (Python, Bash, PowerShell, YAML) to build automation that reduces toil and improves incident response.
- Incident Management Excellence. Proven experience in incident management, root cause analysis, and implementing permanent corrective actions that drive long-term reliability improvements.
- CI/CD and Platform Knowledge. Experience with SRE principles, CI/CD pipelines (Jenkins, GitHub Actions), and cloud platforms (Azure required; AWS/GCP/OCI a plus).
- Analytical Problem-Solver. Strong analytical and problem-solving abilities with clear communication skills under pressure, a collaborative mindset, and passion for reducing toil while improving developer and operator experiences.
Benefits
- Caring Community. Thrive in a supportive, mentorship-driven environment fostered by your leaders, while creating that same environment for your teams.
- Fulfilling Path. We invest in you, not just your role, with opportunities to learn, innovate, and lead.
- Meaningful Work. Your work creates real impact. With every decision leading to thousands of shipments, you bring beauty to life for our clients.


