Responsibilities
- Serve as the highest level of technical escalation for critical incidents, manage incident bridges, implement targeted workarounds, and develop comprehensive post-incident remediation strategies.
- Conduct in-depth troubleshooting across network, server, storage, virtualization, cloud, and backup systems by analyzing logs, traces, packet data, and telemetry to accurately determine root causes.
- Manage recurring technical problems by producing root cause analyses and known error documentation, establishing preventive measures, and overseeing resolution to closure in the ticketing system.
- Plan and execute high-risk infrastructure changes, including risk assessment, procedure documentation, validation testing, and rollback planning, while refining change patterns for consistency with approval board policies.
- Enhance system monitoring and observability by helping design effective alert thresholds, event correlation rules, custom metrics, and dashboards that minimize false alerts and speed up issue detection.
- Develop and secure operational automation using PowerShell, Python, Bash, and platform APIs to streamline diagnostics, recovery actions, compliance audits, and routine tasks, and distribute them safely for use by frontline teams.
- Act as a technical specialist and liaison with vendors and support teams, resolving complex issues and incorporating vendor recommendations into lasting, documented solutions.
- Improve the quality of operational documentation by creating and reviewing standardized procedures, runbooks, and knowledge articles that are clear, up-to-date, and accessible for Tier 1 and Tier 2 support.
- Maintain rigorous ITIL compliance in the service management platform, ensuring accurate ticket records, configuration item relationships, client communications, and complete incident, change, problem, and knowledge entries.
- Support round-the-clock operations by participating in on-call rotations and performing after-hours maintenance or emergency changes to maintain service availability.
Work Arrangement
Remote
Responsibilities (10)
- Act as the final technical escalation for P1/P2 incidents, run the technical bridge to stabilize service, define targeted workarounds, and deliver complete post‑incident remediation plans.
- Lead deep‑dive diagnostics across network, server, storage, virtualization, cloud, and/or backup platforms, correlating logs, traces, packet captures, and platform telemetry to isolate root cause with high confidence.
- Own problem management for chronic issues, producing RCAs and known error records, defining preventative actions, and tracking systemic fixes to closure in ServiceNow.
- Engineer and execute complex/high‑risk changes, including risk/impact analysis, MOP authoring, pre/post validation, and backout plans; improve change success by templating repeatable patterns aligned to CAB policy.
- Elevate monitoring/observability, assist with designing monitoring thresholds, event correlation, custom metrics, and dashboards that reduce alert noise and improve time to detect for critical services.
- Build and harden operational automation (PowerShell/Python/Bash and platform APIs) to accelerate diagnostics, remediation, compliance checks, and recurring maintenance, and package it for safe use by lower tiers.
- Serve as a subject‑matter expert and vendor/TAC coordinator, unblocking complex technical issues and integrating vendor guidance into durable, documented fixes.
- Raise the quality bar for operational content, authoring, and reviewing Tier 1/2‑ready SOPs, runbooks, and KBs; ensure artifacts are updated after incidents and changes and remain audit‑ready in the central repository.
- Uphold ITIL discipline in ServiceNow, maintaining exemplary ticket hygiene, CI relationships, client‑facing communications, and complete Incident/Change/Problem/Knowledge records within a 24x7 model.
- Support 24x7 operation through on‑call participation and after‑hours maintenance/emergency changes to ensure continuity of care.