Responsibilities
- Adopting a Terraform-backed EKS cluster, then modernizing and maintaining it for elastic scale, reliability, performance, and security.
- Going deep into troubleshooting Postgres performance and queues of every shape and size, and coming out the other side with a plan for scaling another 10x to 100x.
- Identifying and correcting scaling issues before they affect our customers by relying on and improving our telemetry and traces in Datadog, AWS CloudWatch, and Honeycomb. If you see a blind spot, you are comfortable getting into the codebase to fix it.
- Maintaining and improving upon our >99.95% uptime track record.
- Supporting our product engineering team in moving fast to deliver customer value. Improving the day-to-day developer experience through canaries, faster cycle time, blue/green deploys, etc.
- Joining on-call rotations on a schedule with the rest of the engineering team.
Requirements
- 4+ years of experience as a DevOps engineer or in a similar role at a startup or mid-sized company, working with complex systems that operate at scale.
- Experience working in and on production Kubernetes clusters using infrastructure as code (we use Terraform, but others like Pulumi or CloudFormation are fine too).
- Experience working on complex AWS deployments (multi-account setups, complex VPC structures supporting EKS, and EKS itself).
- Experience operating and scaling different database technologies. We use Aurora Postgres, Mongo, and ClickHouse, so significant experience with at least one of these is a must.
- Some experience or familiarity with operating and scaling queues and streams such as SQS, Kinesis, Kafka, or similar.
- Strong problem-solving skills with a focus on reliability, scalability, and performance.
- Strong communication skills, with the ability to work in a fully distributed, remote-first team.
Work Arrangement
Remote (Worldwide)
Team
Team size: 20+. Structure: remote-first with a NYC base
Additional Information
- The company has a collaborative culture of sharing what works with AI tools, comparing notes, and iterating on workflows.
- Team members are expected to be familiar with AI tools like Cursor, Claude Code, Codex, or similar.
- Candidates may use AI tools in parts of the interview loop, but will sometimes be asked to refrain.
- The role involves high autonomy and high accountability, with expectations to document changes via runbooks and internal documentation.