Responsibilities
- Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
- GCP Observability Engineering: Optimize the collection, processing, and storage of observability data to ensure high reliability and low latency of our Splunk and Grafana services.
- Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and 'observability-driven development.'
- Automation: Eliminate 'toil' by automating the deployment and scaling of observability agents and collectors.
Requirements
- GKE: At least 5 years of experience scaling and managing observability on Google Cloud Platform.
- Visualization: Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
- SRE Mindset: At least 3 years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
- Programming Proficiency: Strong coding skills in Python and Go for building internal tools and automating workflows.
- Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
- Problem Solving: A data-driven approach to debugging complex, cross-service performance bottlenecks.
Nice to Have
- Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
- Grafana Loki: Experience migrating from Splunk to Grafana Loki.
- Other Cloud Platforms: Experience managing native observability tools within AWS.
Benefits
- equity (where applicable)
- bonus
- health, dental and vision insurance
- 401(k)
- flexible spending account
- paid leave (including PTO and parental leave)


