What You'll Do
Design and maintain Upbound Spaces, the foundation of our control plane management platform, ensuring it scales efficiently across thousands of instances. You'll play a central role in operating and evolving a system used in both cloud and on-prem deployments, focusing on reliability, performance, and operational clarity.
Develop new features based on customer needs and deliver enhancements that improve system behavior and user experience. Investigate and resolve intricate issues in multi-control-plane environments, including reconciliation failures, resource inconsistencies, and performance degradation.
Write production-grade Go code that interacts with the Kubernetes API, building controllers, operators, and extensions with observability and maintainability as core priorities. Contribute to the full lifecycle of service development—from design and implementation to deployment and ongoing support—while ensuring systems remain production-ready.
Use metrics, logs, and distributed tracing to monitor, debug, and optimize live services. Create internal tools that streamline incident diagnosis, assess control plane health, and automate responses to common operational issues.
Document your work thoroughly, including design proposals, post-incident analyses, runbooks, and technical content to guide users and teammates. Support the release process for self-hosted versions of Spaces, helping diagnose problems in customer-run environments.
Participate in on-call rotations to respond to platform incidents, lead resolution efforts, and implement follow-up improvements to prevent recurrence.
Requirements
- Proven experience running large-scale cloud services with a focus on monitoring, alerting, incident management, and post-mortem analysis
- Strong debugging skills in distributed systems, with hands-on use of observability tools such as Prometheus, Grafana, and OpenTelemetry, including distributed tracing
- Direct experience building and managing Kubernetes controllers and operators, including tuning reconciliation logic and handling API rate limits
- Ability to collaborate with customers to understand, replicate, and fix complex technical problems in their environments
- A mindset of ownership—stepping in to resolve issues even when they fall outside your immediate domain, especially during critical outages
- Commitment to operational excellence, with a focus on reliability, debuggability, and long-term system health
- Customer empathy, ensuring solutions are built with real-world use and supportability in mind
- Clear, thoughtful communication in both technical documentation and team collaboration
- Active support for a learning culture—helping teammates grow, sharing on-call knowledge, and fostering psychological safety
Technical Stack
Go, Kubernetes, Crossplane, Prometheus, Grafana, OpenTelemetry, distributed tracing, controllers, operators, add-ons, Kubernetes API
Work Mode
Remote - global
Our Culture
Rooted in operational rigor and continuous learning, we value ownership, clear communication, and teamwork. We prioritize customer needs, encourage open collaboration, and maintain a supportive environment where engineers can grow and thrive—even during high-pressure situations.


