Responsibilities
- Implement and maintain robust automation for deploying and operating Kong's Managed Gateways across various cloud environments.
- Monitor system health, performance, and uptime, targeting 99.99% availability for our core infrastructure.
- Resolve complex production incidents efficiently, participating actively in on-call rotations to maintain service continuity.
- Build resilient tools and systems that enhance the overall reliability and operational efficiency of our platform.
- Proactively prevent technical debt to keep operations sustainable and scalable as Kong grows.
- Collaborate closely with engineering teams to design, review, and implement resilient and highly scalable services.
Requirements
- 2+ years of experience applying Site Reliability Engineering (SRE) principles and practices in a production environment.
- Proficiency in Go or Python for automation, tooling, and infrastructure as code.
- Hands-on experience with Kubernetes and major cloud platforms such as AWS, GCP, or Azure.
- Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog).
- Solid understanding of networking concepts, distributed systems, and API gateways.
Nice to Have
- Experience with Kong Gateway or other API management platforms.
- Relevant cloud certifications (e.g., AWS Certified DevOps Engineer - Professional, Certified Kubernetes Administrator).
- Active contributions to open-source projects or developer communities.