Responsibilities
- Deploy and manage monitoring tools to track the performance, reliability, and scalability of services hosted on Amazon EKS.
- Take part in on-call schedules to deliver urgent technical support and respond promptly to system incidents and user requests.
- Analyze the root causes of incidents and apply corrective actions that reduce operational burden and customer impact.
- Lead and contribute to post-incident reviews to strengthen response protocols and improve system resilience.
- Evaluate system architecture and deployment strategies to meet service level agreements and prepare for growth-related challenges.
- Use proven software engineering practices to diagnose and resolve complex system issues effectively.
- Work closely with support teams to boost system reliability, streamline operations, and improve customer experience through automation and self-service solutions.
