What You'll Do
Own the stability and performance of distributed systems serving millions of users. Troubleshoot and resolve deep infrastructure and application issues, often under pressure, and contribute to on-call rotations to ensure continuous service availability. Lead post-incident reviews, identifying root causes and driving improvements that reduce operational toil and customer impact.
Design and implement automated solutions to enhance system reliability, deployment efficiency, and monitoring coverage. Work closely with development teams to influence architectural decisions, ensuring systems are built with observability, security, and scalability in mind. Develop scripts and internal tools that accelerate delivery and improve operational workflows.
Requirements
- 10+ years in DevOps or Site Reliability Engineering roles
- 3+ years using an object-oriented language such as Java, C# (.NET), or C++
- Strong Linux administration, scripting, and debugging skills
- Proven experience with observability platforms such as New Relic, Splunk, or Datadog
- Deep knowledge of AWS services including VPC, EC2, ECS, Fargate, Route 53, and load balancing
- Proficiency with infrastructure-as-code and configuration management tools such as CloudFormation, Terraform, Helm, and Ansible
- Familiarity with containerization, Kubernetes, and microservices architecture
- Hands-on experience in CI/CD and full software development lifecycle practices
- Strong written and verbal communication abilities
- Commitment to automation, security, and enabling self-service platforms
Preferred Qualifications
- Experience with AWS CDK
Benefits
- Medical, dental, and vision insurance
- 401(k) plan with company matching
- Life insurance coverage
- Unlimited paid time off
- Complimentary training, onboarding, and professional support
- Clear pathways for career development
- Hybrid remote work model with flexibility to work from home in eligible states
- Access to physical offices in Austin, TX and Tampa, FL

