Responsibilities

Lead the end-to-end architecture for all server types in AI clusters, defining system roles, configurations, and lifecycle management strategies.
Establish and manage server scaling formulas based on processor count, cluster scale, and workload categories, including capacity planning and headroom policies.
Define hardware platform specifications, including CPU selection and core count strategy, vendor roadmaps, memory configuration, PCIe topology, NIC integration, and local NVMe usage.
Convert software and runtime behaviors into quantifiable hardware demands such as CPU load, memory performance, IO bursts, and concurrency needs, and communicate constraints to software teams.
Build performance and scalability models, validate them through microbenchmarks and full-workload testing, and lead resolution of cross-stack bottlenecks.
Specify baseline configurations for operating systems, BIOS, firmware, and drivers per server type, enabling infrastructure teams to implement consistently across fleets.
Monitor advancements in server components including next-gen CPUs, memory technologies, CXL, NVMe updates, and SmartNICs, and conduct proof-of-concept trials to assess adoption timing.
Manage technical relationships with hardware vendors, influence product roadmaps, request custom features, and collaborate on resolving performance or reliability issues.
Set technical qualification and acceptance standards for performance, stability, and operational support, working with hardware TPMs to execute validation and production deployment.
Support lab and staging deployments, lead root-cause analysis for rare failures, and resolve issues across firmware, drivers, OS, and runtime layers.

Cerebras Systems is hiring a Compute Server Platform Architect
