Monitor system health and data integrations across cloud platforms, focusing on real-time detection of anomalies in workflows involving Azure Logic Apps, FTP/SFTP connections, and API-driven processes. Identify and respond to operational disruptions quickly, ensuring minimal impact on data integrity and system performance.
What You'll Do
- Track and triage alerts from monitoring systems, diagnosing root causes in logic flows, file transfers, or data routing issues.
- Investigate incidents by analyzing logs, API responses, and configuration changes to determine failure points.
- Resolve technical issues affecting data pipelines, including misconfigured integrations, failed transfers, or malformed payloads.
- Collaborate with engineering, QA, and operations teams to address recurring defects and implement long-term fixes.
- Document incidents thoroughly in tracking systems, ensuring traceability from detection to resolution.
- Refine alerting rules and monitoring dashboards to improve early detection and reduce false positives.
- Support client onboarding by validating integration setups and assisting with troubleshooting during implementation.
- Translate technical findings into clear summaries for non-technical stakeholders and leadership.
- Maintain and update runbooks, SOPs, and internal knowledge resources to standardize responses.
- Identify patterns in recurring issues and recommend automation or process improvements to reduce manual effort.
Requirements
- 2–5 years of experience in technical or production support within a software or SaaS environment.
- Hands-on experience with Microsoft Azure, particularly Logic Apps and related monitoring tools.
- Working knowledge of FTP/SFTP, API integrations, and network troubleshooting (e.g., authentication, timeouts, DNS).
- Ability to query and analyze data using SQL and Excel for validation and root-cause analysis.
- Skill in interpreting system logs, error messages, and health metrics to isolate issues.
- Strong attention to detail, with a focus on data accuracy and repeatable resolution processes.
- Experience managing multiple concurrent incidents under time pressure.
- Clear communication skills for documenting issues and explaining technical details across teams.
- Proven ability to work cross-functionally with engineering, QA, and operations groups.
- Commitment to structured workflows, root-cause analysis, and continuous improvement.
Benefits
- Fully remote role with flexibility in scheduling.
- Preference for afternoon/evening availability to support later-day production monitoring.
- Opportunity to engage across all operational areas of the platform.
- Participation in automation and process optimization initiatives.
- Clear career progression based on impact, scope, and experience.
- Exposure to large-scale customer communication systems and data orchestration.
- Inclusive culture guided by principles of efficiency, improvement, collaboration, and curiosity.
