What You'll Do
Monitor and analyze client applications in real time using advanced observability tools, focusing on rapid incident identification and resolution. You'll apply deep technical expertise in Dynatrace and Azure Insights to assess system health and drive effective responses during critical events.
Use KQL (Kusto Query Language) across the full data lifecycle, from querying and modeling to visualization, to extract meaningful insights and support troubleshooting efforts. You'll correlate telemetry from metrics, logs, and traces to uncover root causes in distributed systems and eliminate recurring issues.
Lead the refinement of alerting systems to reduce noise and ensure only high-impact incidents receive immediate attention. You'll analyze daily performance data to detect anomalies early, preventing outages before they impact services.
Take ownership of P1 incidents as the primary technical contact, coordinating between engineering, operations, and business teams. You'll ensure clear communication and timely resolution while upholding SLA standards and SRE principles.
Improve operational maturity by evaluating and redesigning runbooks, creating new procedural standards, and automating manual reporting workflows. Your work will directly enhance the reliability and efficiency of client environments.
Mentor junior engineers in APM best practices and technical procedures, fostering a proactive support culture. You'll collaborate with engineering teams to refine monitoring strategies and contribute to long-term product improvements.
Requirements
- Bachelor’s degree in computer science or a related field, or equivalent hands-on experience in cloud operations
- At least 5 years of technical experience in managed services or cloud hosting, with a focus on senior-level system administration
- Fluency in English and Spanish to support collaboration across distributed teams and clients
- Ability to obtain Microsoft Azure certification within 90 days of starting
Preferred Qualifications
- Advanced certifications in Dynatrace or other APM platforms
- Additional credentials in Azure, AWS, GCP, Linux, Windows, SQL, O365, VMware, Cisco, Palo Alto, Terraform, or DevOps practices
Technical Stack
You'll work extensively with Dynatrace, Azure Insights, and KQL, applying APM telemetry and SRE practices across multi-cloud environments. Additional technologies include Microsoft Azure, Terraform, AWS, GCP, Linux, Windows, O365, SQL, VMware, Cisco, and Palo Alto.