Responsibilities
- Lead, coach, and develop a team of eight engineers spanning infrastructure, DevOps/SRE, and end-user support, focusing on retention, career growth, and performance management.
- Develop and execute the platform strategy and roadmap to support business objectives and customer service level agreements.
- Define and monitor key operational metrics (system availability, MTTR, change success rate, incident frequency, and cloud cost efficiency) and report them to executive leadership.
- Ensure platform strategy, planning, and execution are aligned with product demands, business goals, and SLA commitments.
- Implement consistent operational rhythms including incident reviews, change advisory processes, service desk reporting, team retrospectives, and continuous improvement initiatives.
- Oversee the design, deployment, and 24/7 operations of a hybrid infrastructure environment combining Azure and on-premises systems for production and corporate use.
- Guarantee systems are highly available, scalable, performant, secure, and cost-effective across all environments.
- Directly contribute to cloud architecture and implementation, including networking, identity (Microsoft Entra ID, formerly Azure AD, and RBAC), storage, backup, monitoring, and observability.
- Lead cloud cost optimization through rightsizing, reserved instances, architectural enhancements, and governance policies across Azure workloads.
- Set and enforce technical standards for networking, security, identity, logging, alerting, and operational consistency.
- Drive DevOps and SRE adoption by implementing CI/CD pipelines, Infrastructure as Code (Terraform, ARM/Bicep), containers and orchestration (Kubernetes), and modern deployment methods.
- Personally implement Kubernetes clusters, container orchestration, service mesh, and cloud-native design patterns.
- Establish SRE practices including error budgets, SLOs/SLIs, blameless post-incident reviews, observability (metrics, logs, traces), and a culture of reliability.
- Enhance CI/CD tooling and workflows to accelerate releases, reduce deployment risk, and improve developer efficiency.
- Design and enforce change management processes that include risk assessment, testing, communication, and rollback plans to support speed, safety, and audit compliance.
- Implement security and compliance measures covering access controls, monitoring, vulnerability management, incident response, and audit readiness.
- Enforce infrastructure security standards including network segmentation, firewall policies, encryption (at rest and in transit), secrets handling, and privileged access controls.
- Lead response to infrastructure and platform incidents, including root cause analysis, remediation, and process improvements.
- Own disaster recovery planning and execution, including defining RPO/RTO, designing multi-region and hybrid solutions, creating runbooks, and conducting regular tests.
- Ensure reliable backup and recovery across critical systems with documented procedures and verified restore capabilities.
- Manage desktop, endpoint, and telecommunications services (laptops, mobile devices, productivity tools, collaboration platforms, and voice/conferencing) to ensure a secure and dependable user experience.
- Implement IT service management processes for incident, request, problem, and asset management with defined SLAs and user satisfaction tracking.
- Manage vendor relationships for infrastructure, telecom, SaaS, and managed services, including contract evaluation, license optimization, and service quality assurance.
Work Arrangement
On-site — Charlotte, NC
Other
Position is 100% on-site in Charlotte, NC.