Responsibilities
- Create scalable distributed training systems designed for mixed hardware setups under constrained network conditions.
- Develop and refine parallel training strategies, including data, tensor, and pipeline parallelism, with custom sharding to reduce communication costs.
- Optimize GPU utilization, memory management, and compute throughput across distributed nodes.
- Implement reliable checkpointing, state synchronization, and recovery protocols for long training runs that are prone to failure.
- Develop monitoring and metrics infrastructure to track training progress, model performance, and system bottlenecks.
- Design fault-tolerant training architectures that handle node failures, network partitions, and nodes joining or leaving mid-run.
- Build peer-to-peer network topologies to enable decentralized coordination among geographically dispersed nodes.
- Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle controls.
- Analyze and refine communication patterns to minimize latency and bandwidth use in multi-node settings.
Benefits
- Significant equity, offering real ownership in a purpose-driven organization
- Market-competitive salary for senior engineering positions in Australia
- Visa sponsorship offered for outstanding candidates
- Primarily remote work with optional access to a Melbourne-based office
- A world-class team with prior experience at top tech firms and startups
Compensation
Equity-heavy, with a competitive base salary for senior roles in Australia
Work Arrangement
Remote-first, with optional access to a Melbourne hub
Team
World-class engineers with backgrounds at Google, Amazon, Microsoft, and leading startups