Role Overview
We’re seeking a Platform Engineer to join our Product Reliability team, focused on building robust, scalable systems within our energy management platform. In this role, you’ll partner with product teams to enhance system availability, performance, and fault tolerance, ensuring services remain resilient under real-world demands.
Key Responsibilities
- Advise engineering teams on reliability best practices, including infrastructure design and failure mitigation strategies
- Collaborate directly on code and configuration to strengthen system resilience and operational performance
- Identify opportunities for improvement in core platform infrastructure based on hands-on experience and incident analysis
- Support the development of proof-of-concept solutions to evolve deployment architecture in line with scaling needs
- Guide teams in implementing observability frameworks using tools like Datadog, Prometheus, and Grafana
- Contribute to post-incident reviews, helping teams implement corrective actions and prevent recurrence
- Use metrics and monitoring data to detect patterns, recommend changes, and improve service reliability
- Work across distributed systems to solve complex technical challenges in high-availability environments
Required Qualifications
- Proven experience with AWS, Terraform, and Kubernetes in production environments
- Familiarity with observability platforms such as Datadog, Prometheus, or similar tooling
- Programming experience in Python or a similar language, applied to analyzing application behavior in production
- Strong written communication skills, particularly in asynchronous formats like Slack, Notion, or technical documentation
- Ability to thrive in autonomous settings, define structure in ambiguous situations, and drive initiatives independently
- Experience collaborating with developers and product stakeholders to deliver measurable improvements
- Demonstrated commitment to continuous learning and iterative problem solving
Preferred Background
- Prior work as a Site Reliability Engineer or similar role
- Experience supporting SaaS platforms at scale, including knowledge transfer across teams
- Background in incident response, outage management, and technical post-mortem facilitation
- Exposure to large relational databases and performance tuning
- Experience defining and tracking service level objectives to guide reliability improvements
Technology Environment
Our platform runs on AWS, with infrastructure managed through Terraform and workloads orchestrated by Kubernetes. We use Datadog, Grafana, and Prometheus for monitoring and Rootly for incident management. Our teams work in Python, TypeScript, Go, and C#.
Work Environment
This role is open to candidates based in Australia and can be performed fully remotely from anywhere within the country. We value autonomy, clear documentation, and inclusive collaboration across distributed teams.
Culture and Values
We foster a culture rooted in empathy, sustainability, and technical excellence. Our teams operate with independence while maintaining strong accountability. We prioritize diversity, proactive learning, and transparent communication, especially in written form, to support long-term growth and innovation.


