Pragmatike is hiring a CUDA Kernel Engineer on behalf of a fast-growing AI startup founded by MIT CSAIL researchers. You will design, implement, and optimize custom CUDA kernels from scratch for NVIDIA GPUs, directly powering the high-throughput AI systems used by Fortune 500 clients.
What You'll Do
- Design, implement, and optimize custom CUDA kernels for NVIDIA GPUs, focusing on maximizing occupancy, memory throughput, and warp efficiency.
- Profile GPU workloads using tools such as Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK.
- Analyze and eliminate performance bottlenecks including warp divergence, uncoalesced memory access, register pressure, and PCIe transfer overhead.
- Improve GPU memory pipelines and ensure proper memory coalescing.
- Collaborate closely with AI systems, model acceleration, and backend distributed systems teams.
- Contribute to GPU architecture decisions, kernel libraries, and internal performance-engineering best practices.
What We're Looking For
- Proven track record of building NVIDIA CUDA kernels from scratch, not just calling existing libraries.
- Strong ability to optimize kernels using tiling strategies, occupancy tuning, shared memory design, and warp scheduling.
- Deep understanding of the CUDA execution model (threads, warps, blocks, and grids), the GPU memory hierarchy, memory coalescing, and warp divergence.
- Experience diagnosing PCIe bottlenecks and optimizing host-device transfers using pinned memory, streams, and batching.
- Familiarity with C++, CUDA runtime APIs, and GPU debugging and profiling tooling.
Nice to Have
- Experience with multi-GPU or distributed GPU systems such as NCCL or NVLink.
- Background in GPU acceleration for ML frameworks or HPC workloads.
- Knowledge of model inference optimization with TensorRT, CUDA Graphs, or CUTLASS.
- Exposure to compiler-level optimization or PTX/SASS analysis.
- Startup experience or comfort working in fast-moving, ambiguous environments.
Technical Stack
- CUDA, C++, NVIDIA GPUs
- Nsight Compute, Nsight Systems, nvprof, CUDA-MEMCHECK
- TensorRT, CUDA Graphs, CUTLASS, NCCL, NVLink
Team & Environment
You will collaborate closely with AI systems, model acceleration, and backend distributed systems teams.
Benefits & Compensation
- Competitive salary & equity options
- Sign-on bonus
- Health, Dental, and Vision insurance
- 401k plan
Work Mode
This is a remote position open to candidates within the United States.
Pragmatike is an Equal Opportunity Employer committed to providing equal employment opportunities to all applicants without discrimination.




