This role is central to ensuring the stability, performance, and scalability of our real-time digital platform. As a Senior Site Reliability Engineer, you will bridge the gap between development and operations by building resilient systems, enforcing reliability standards, and driving automation across the infrastructure lifecycle. You will work closely with engineering teams to proactively identify risks, reduce toil, and improve system observability, all while maintaining service uptime under high-load conditions. Your expertise will directly influence the platform’s ability to support millions of concurrent users and deliver seamless interactive experiences. This position requires a strategic mindset, deep technical proficiency, and a commitment to operational excellence in a dynamic, fast-moving environment.
Responsibilities
- Design and maintain scalable, reliable infrastructure for real-time applications.
- Create automation tools to enhance system reliability and streamline deployment workflows.
- Implement monitoring, logging, and alerting systems for rapid incident detection and resolution.
- Collaborate with engineering teams to improve service performance, reliability, and observability.
- Lead incident response efforts, conduct root cause analysis, and apply learnings to prevent recurrence.
- Optimize infrastructure for performance, cost efficiency, and long-term scalability.
- Scale containerized environments using Kubernetes, Docker, and orchestration technologies.
- Establish and uphold reliability standards, including SLOs and operational best practices.
- Evaluate emerging tools and methodologies to enhance system resilience and engineering productivity.
Requirements
- Minimum of six years of experience in Site Reliability Engineering, DevOps, or infrastructure roles.
- Proven experience managing infrastructure for large-scale systems serving millions of users.
- Strong technical background with cloud platforms, particularly Google Cloud Platform (GCP).
- Hands-on expertise with Kubernetes, containerization, and distributed systems.
- Experience building monitoring and observability solutions using tools like Prometheus, Grafana, or Datadog.
- Proficiency in scripting or programming languages such as Python, Go, or TypeScript.
- Solid understanding of SLOs, SLIs, and incident management frameworks.
- Demonstrated ability to collaborate effectively across engineering teams.
Nice to Have
- Experience supporting real-time streaming, gaming, or large-scale consumer-facing applications.
- Knowledge of event-driven architectures and large-scale data processing systems.
- Track record of optimizing infrastructure costs in high-growth environments.
Tech Stack
Google Cloud Platform (GCP), Kubernetes, Docker, Prometheus, Grafana, Datadog, Python, Go, TypeScript
Benefits
- Unlimited paid time off to support work-life balance.
- 401(k) plan for long-term financial planning.
- Comprehensive health insurance coverage.
- Paid company holidays for rest and rejuvenation.
- Competitive base salary reflecting experience and impact.
Compensation
Base salary: $150k to $200k. Equity: stock options.
Work Arrangement
On-site in Santa Monica
Work Environment
- Fast-paced and collaborative work environment
- Emphasis on high standards and personal initiative
- Open and respectful communication practices
- Culture of real-time feedback and continuous improvement
- Work intensity aligned with ambitious goals
- Focus on gaming and interactive digital experiences
- Encouragement of self-driven projects and innovation
Additional Information
- This is a full-time, on-site role located in Santa Monica.
- The platform emphasizes real-time interaction, engagement, and gamified experiences.
- The company supports creators and audience participation within digital communities.


