Responsibilities
- Manage and enhance AWS services including EKS, EC2, load balancers (ALB/ELB), certificates (ACM), IAM policies, container registries (ECR), and Auto Scaling Groups to ensure high availability, cost control, and safe deployments.
- Keep Kubernetes environments stable by overseeing deployments, ingress routing, horizontal pod autoscaling (HPA), secrets, and Helm releases, and by responding to production incidents; you should be comfortable tracing a failing request through a public ALB, reverse proxy, internal ALB, and Kubernetes ingress.
- Collaborate with software teams to refine deployment processes, minimize manual intervention, and increase system reliability across all stages of development and production.
- Develop and support monitoring solutions using Grafana, Prometheus/Mimir, Loki, Grafana Alloy, Tempo, OpenTelemetry, Datadog, and CloudWatch to deliver actionable insights.
- Design dashboards, alerts, and data pipelines that provide clear visibility into system behavior for both engineering and operations staff.
- Write precise PromQL, LogQL, SQL, and CloudWatch queries that surface real signal rather than noise, and promote a culture of high-quality alerting.
- Support Linux-based hardware deployed remotely, including servers, embedded systems, third-party integrations, and the services that relay data from field devices to cloud infrastructure.
- Diagnose network issues such as connectivity problems, routing misconfigurations, ARP anomalies, DNS failures, firewall restrictions, VPN behavior, and TCP socket data-flow errors in remote environments where recoverability matters as much as uptime.
- Create and maintain runbooks, automated procedures, and configuration management standards to ensure consistent and resilient field operations.
- Improve and manage infrastructure as code with Terraform and Terraform Cloud, automate workflows with Ansible playbooks, maintain CI/CD pipelines in Azure DevOps and GitLab, and write shell and Python automation.
- Identify configuration drift, manual work, and undocumented processes as technical debt and lead efforts to eliminate them systematically.
- Strengthen and standardize Linux system configurations across cloud and edge environments to ensure consistency, repeatability, and operational safety.
- Approach platform development as a product, delivering opinionated, well-supported workflows that allow application teams to deploy and manage services without deep infrastructure knowledge.
- Collect input from engineering teams, prioritize improvements based on impact, and track usage and satisfaction metrics for platform features.
- Work closely with security teams to build guardrails into the platform, including secrets management, policy-as-code, software supply chain security, and least-privilege defaults.
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid or remote with field support considerations
Team
Cross-functional engineering organization focused on platform reliability and developer enablement
Available for qualified candidates