Responsibilities
- Respond to incidents within defined service-level agreements (SLAs) and conduct root cause investigations.
- Ensure system observability through active monitoring with tools such as Splunk, CloudWatch, and Zabbix.
- Diagnose and resolve issues across AWS infrastructure, network configurations, APIs, and integrated services.
- Administer Amazon Connect settings, including contact flows, Lex-powered bots, and integrations with Lambda, S3, QuickSight, and DynamoDB.
- Create workflow diagrams and develop standardized troubleshooting procedures for support teams.
- Maintain detailed runbooks, knowledge bases, and resolution guides for common alert scenarios.
- Review incident data, root cause analyses, and past events in ServiceNow to improve documentation and response strategies.
- Work with platform and operations teams to triage incidents, conduct practice drills, and enhance support processes.