Join a newly formed team focused on strengthening the reliability of our platform through robust observability. As a Senior Platform Engineer, you'll play a central role in defining how we monitor, alert, and understand system behavior across a complex, cloud-native architecture. This is a foundational opportunity to build standards, tooling, and practices that empower product teams to operate with confidence.
What You’ll Do
- Design and implement monitoring and alerting strategies across distributed systems, ensuring visibility into critical services.
- Establish observability best practices and guide teams in adopting them through collaboration and documentation.
- Work alongside engineers to build effective dashboards tailored to specific service needs and operational requirements.
- Define meaningful Service Level Indicators and Objectives that align with business commitments and operational health.
- Refine alerting systems using core metrics—latency, traffic, errors, and saturation—to reduce noise and improve incident response quality.
- Support teams in transitioning to on-call responsibilities by improving signal clarity and operational preparedness.
- Enhance self-service capabilities so teams can independently manage their monitoring needs.
- Analyze incident data to uncover patterns and drive systemic improvements in reliability and detection.
- Collaborate with FinOps to ensure efficient use of observability resources and manage platform costs.
- Contribute to broader platform reliability initiatives where they align closely with the team's observability work.
What We’re Looking For
- Proven experience with AWS infrastructure and services, particularly in supporting engineering teams at scale.
- Hands-on experience using Terraform for infrastructure automation and day-to-day management.
- Strong familiarity with Kubernetes, including deployment patterns and operational workflows.
- Direct experience with observability platforms such as Datadog, Grafana, Prometheus, or similar tools.
- Ability to read and understand application code—particularly in Python or C-family languages—to inform instrumentation and debugging.
- Track record of working effectively in small, independent teams with minimal oversight.
- Comfort navigating ambiguity and creating structure where none exists.
- Strong written communication skills, especially in asynchronous environments using tools like Slack and Notion.
- Ownership mindset—driving progress independently and taking responsibility for outcomes.
Nice to Have
- Prior work in observability, SRE, or data-centric engineering roles.
- Experience supporting SaaS products and enabling engineering teams through training and knowledge transfer.
- Background in building scalable observability solutions for internet-facing services.
- Experience diagnosing performance issues in large relational databases, especially PostgreSQL or Amazon RDS.
- Practical use of SLOs to guide reliability improvements and operational decision-making.
Technology Environment
Our stack includes AWS, Kubernetes, Terraform, Datadog, Grafana, Prometheus, Rootly, Python, TypeScript, Go, C#, and PostgreSQL on Amazon RDS. You'll work with these tools daily and help evolve how they are used across the organization.
Work Environment
This role is hybrid, ideally based in Melbourne, Australia, though remote candidates within Australia will be considered. We operate with a high degree of autonomy, trust, and asynchronous communication. You’ll join a supportive team culture focused on clear goals, continuous learning, and personal growth as the platform matures.


