Responsibilities
- Design, develop, and deploy new infrastructure services and automation tools to support platform growth and new product initiatives.
- Manage and optimize existing infrastructure components (compute, storage, networking) across 50+ global regions.
- Lead and participate in incident management, including conducting postmortems and root cause analyses and implementing long-term improvements.
- Evaluate infrastructure decisions and capacity planning strategies to improve reliability, scalability, and performance.
- Collaborate across teams to drive reliability, security, and compliance throughout the software lifecycle.
Requirements
- Bachelor’s degree or foreign equivalent in Computer Science or a related field, and four (4) years of experience in the job offered or in a Software Engineering-related role.
- Two (2) years of experience with designing and operating complex, large-scale distributed systems in production, including service discovery, load balancing, high availability, and disaster recovery across multi-region or multi-availability-zone deployments.
- Two (2) years of experience with implementing Infrastructure as Code (IaC) using tools such as Terraform or Pulumi, including authoring reusable modules, performing code reviews, and executing change management with drift detection and automated policy checks.
- Two (2) years of experience with administering Kubernetes in production, including cluster provisioning and upgrades, workload orchestration and autoscaling, Helm-based packaging, and network policy configuration.
- Two (2) years of experience with building internal automation and platform tooling using scripting languages such as Python, Bash, or Rush, including developing command-line tools or services that interact with cloud and Kubernetes APIs and implementing automated tests.
- Two (2) years of experience with configuring and operating observability stacks, including metrics, logs, and distributed tracing (e.g., Datadog, OpenTelemetry, Sentry), defining SLIs/SLOs, and creating actionable alerts integrated with incident response tooling (e.g., PagerDuty or Incident.io).
- Two (2) years of experience with designing and maintaining CI/CD pipelines (e.g., GitHub Actions or Buildkite), including build, test, and deployment automation, artifact management, and progressive delivery strategies (blue/green or canary).
- Two (2) years of experience with engineering cloud infrastructure on at least one major cloud platform (AWS, GCP, or Azure), including compute, networking (VPC/VNet design, routing, load balancing, and peering), identity and access management, and object/block storage.
- Two (2) years of experience with managing operational data stores and caches (e.g., PostgreSQL or MySQL; Redis; and a document or key-value store such as MongoDB or DynamoDB), including replication/backup configuration, schema or data modeling, and performance tuning.
- Two (2) years of experience with implementing network and platform security controls, including secrets management (e.g., CKMS, EKMS, CMEK), OS hardening and patching, least-privilege IAM policy design, and vulnerability remediation workflows with CI/CD gates.
Work Arrangement
Hybrid — San Francisco, CA 94103
Additional Information
- Must work in the office 3 days per week and from home 2 days per week.
- 40 hours/week
- Harvey is an equal opportunity employer and does not discriminate on the basis of race, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition, or any other basis protected by law.
- Applicants with disabilities can request reasonable accommodations by emailing accommodations@harvey.ai.