xAI is seeking a Software Engineer for the ML and Data Infrastructure team. This team builds the foundational infrastructure for frontier AI models and truth-seeking agents. You will collaborate with pre-training, multimodal, reasoning, and product teams to tackle ambiguous, high-stakes problems in a fast-paced, meritocratic environment.
What You'll Do
- Design, build, and operate petabyte-to-exabyte scale distributed systems for data acquisition, web crawling, preprocessing, filtering, classification, and multimodal pipelines.
- Architect high-performance search and retrieval engines at trillion-document scale, integrated with LLMs and agents for truth-seeking, low-hallucination reasoning, and real-time knowledge access.
- Develop reliable inference serving infrastructure: load balancing, autoscaling, KV caching, batching, fault tolerance, monitoring, CI/CD, and benchmarking, targeting 100% uptime and optimal tail latency.
- Optimize low-level performance: CUDA kernels, Triton and CUTLASS extensions, quantization, distillation, speculative decoding, GPU memory hierarchy, and model-hardware co-design for next-generation architectures.
- Innovate on compilers, runtimes, distributed profiling and debugging tools, and interconnect fabrics.
- Manage complex workloads across clouds and clusters: orchestration, data bookkeeping and verifiability, high-speed interconnect validation, failure analysis, and telemetry and automation for production reliability.
What We're Looking For
- Strong systems engineering skills with proven impact on large-scale distributed infrastructure.
- Proficiency in Python and at least one compiled language (Rust, C++, Go, or Java); experience building bespoke libraries, optimizing performance, and debugging complex systems.
- Hands-on experience in at least one key area: petabyte-scale data pipelines and crawling, web-scale search and retrieval, inference optimization, compiler features, or high-speed interconnects.
- Deep understanding of distributed systems challenges: sustaining high operations per second, latency and throughput tradeoffs, fault tolerance, monitoring, and scaling production systems to billions of users or clusters of 100,000+ GPUs.
- Passion for AI infrastructure: keeping up with state-of-the-art techniques, first-principles problem-solving, meticulous organization and bookkeeping, and delivering rigorous, high-quality results.
Nice to Have
- Experience with multimodal data, epistemics and truth-seeking in retrieval, or agentic systems.
- Low-level optimizations: CUDA kernel development, GPU profiling, low-precision numerics, or interconnect pathfinding.
- Production expertise in inference reliability, CI/CD for ML, or cluster networking.
- A track record of owning end-to-end projects in hyperscale environments, with strong debugging skills, vendor management experience, or open-source contributions.
Technical Stack
- Languages: Python, Rust, C++, Go, Java
- Infrastructure: Spark, Ray, Kubernetes
- ML/Performance: CUDA, Triton, CUTLASS, JAX, XLA, MLIR
- Ops & Observability: Prometheus, Grafana, Buildkite, ArgoCD
Team & Environment
You will join a small team within a flat organizational structure. The team is highly motivated and focused on engineering excellence. All employees are expected to be hands-on and contribute directly to the company's mission to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.
Benefits & Compensation
- Total compensation range: $180,000 - $440,000 USD
- Equity
- Comprehensive medical, vision, and dental coverage
- 401(k) retirement plan
- Short-term and long-term disability insurance
- Life insurance
- Various other discounts and perks
xAI is an equal opportunity employer.