Responsibilities
- Lead the end-to-end architecture for all server types in AI clusters, defining system roles, configurations, and lifecycle management strategies.
- Establish and manage server scaling formulas based on processor count, cluster scale, and workload categories, including capacity planning and headroom policies.
- Define hardware platform specifications, including CPU selection and core count strategy, vendor roadmaps, memory configuration, PCIe topology, NIC integration, and local NVMe usage.
- Translate software and runtime behaviors into quantifiable hardware requirements, such as CPU load, memory bandwidth and latency, I/O bursts, and concurrency needs, and communicate the resulting constraints to software teams.
- Build performance and scalability models, validate them through microbenchmarks and full-workload testing, and lead the resolution of cross-stack bottlenecks.
- Specify baseline configurations for operating systems, BIOS, firmware, and drivers per server type, so that infrastructure teams can implement them consistently across fleets.
- Track advances in server components, including next-generation CPUs, memory technologies, CXL, NVMe, and SmartNICs, and run proof-of-concept trials to assess adoption timing.
- Manage technical relationships with hardware vendors, influence product roadmaps, request custom features, and collaborate on resolving performance or reliability issues.
- Set technical qualification and acceptance standards for performance, stability, and operational support, working with hardware TPMs to execute validation and production deployment.
- Support lab and staging deployments, lead root-cause analysis for rare failures, and resolve issues across firmware, drivers, OS, and runtime layers.