Responsibilities
- Implement and enhance observability solutions for AWS EKS-based ERP systems.
- Monitor system health, performance, and availability, ensuring alerts are actionable and contributing to SLO/SLA alignment.
- Implement incident management tooling and best practices, and partner with engineering and cloud operations teams on incident response to ensure timely resolution and clear communication.
- Conduct root cause analysis and lead cross-functional incident retrospectives with engineering, cloud operations, and product support, identifying preventative improvements that reduce recurring issues and reliability risks and improve overall system resiliency.
- Develop automation and operational tooling to reduce manual effort and operational toil.
- Participate in a structured on-call rotation during primary business hours.
Requirements
- 2–3+ years of experience in Site Reliability Engineering, Cloud Operations, DevOps, or Software Engineering in cloud-based environments.
- Hands-on experience with monitoring and observability tools such as Datadog and CloudWatch.
- Experience participating in incident response processes and using tools like PagerDuty and JSM Operations.
- Working knowledge of Kubernetes (EKS experience preferred), with experience supporting containerized applications.
- Experience with Infrastructure as Code tooling such as Terraform.
- Proficiency in at least one scripting or programming language such as Python, Bash, or C# (.NET Core).
- Experience with CI/CD practices and modern DevOps methodologies.
- Bachelor’s degree in Computer Science or a related field.


