Remote (United States), Full-time

Pragmatike (on behalf of a fast-growing AI startup) is hiring a CUDA Kernel Engineer (Remote US)

About the Role

Pragmatike is hiring a CUDA Kernel Engineer on behalf of a fast-growing AI startup founded by MIT CSAIL researchers. You will design, implement, and optimize custom CUDA kernels from scratch for NVIDIA GPUs, directly powering the high-throughput AI systems used by Fortune 500 clients.

What You'll Do

  • Design, implement, and optimize custom CUDA kernels for NVIDIA GPUs, focusing on maximizing occupancy, memory throughput, and warp efficiency.
  • Profile GPU workloads using tools such as Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK.
  • Analyze and eliminate performance bottlenecks including warp divergence, uncoalesced memory access, register pressure, and PCIe transfer overhead.
  • Improve GPU memory pipelines and ensure proper memory coalescing.
  • Collaborate closely with AI systems, model acceleration, and backend distributed systems teams.
  • Contribute to GPU architecture decisions, kernel libraries, and internal performance-engineering best practices.

What We're Looking For

  • Proven track record building NVIDIA CUDA kernels from scratch, not just calling existing libraries.
  • Strong ability to optimize kernels using tiling strategies, occupancy tuning, shared memory design, and warp scheduling.
  • Deep understanding of the CUDA execution model (threads, warps, blocks, and grids), the GPU memory hierarchy and memory coalescing, and warp divergence.
  • Experience diagnosing PCIe bottlenecks and optimizing host-device transfers using pinned memory, streams, and batching.
  • Familiarity with C++, CUDA runtime APIs, and GPU debugging and profiling tooling.

Nice to Have

  • Experience with multi-GPU or distributed GPU systems, including technologies such as NCCL or NVLink.
  • Background in GPU acceleration for ML frameworks or HPC workloads.
  • Knowledge of model inference optimization with TensorRT, CUDA Graphs, or CUTLASS.
  • Exposure to compiler-level optimization or PTX/SASS analysis.
  • Startup experience or comfort working in fast-moving, ambiguous environments.

Technical Stack

  • CUDA, C++, NVIDIA GPUs
  • Nsight Compute, Nsight Systems, nvprof, CUDA-MEMCHECK
  • TensorRT, CUDA Graphs, CUTLASS, NCCL, NVLink

Team & Environment

You will collaborate closely with AI systems, model acceleration, and backend distributed systems teams.

Benefits & Compensation

  • Competitive salary & equity options
  • Sign-on bonus
  • Health, Dental, and Vision insurance
  • 401(k) plan

Work Mode

This is a remote position open to candidates within the United States.

Pragmatike is an Equal Opportunity Employer and is committed to providing equal employment opportunities to all applicants without discrimination.

Required Skills

CUDA, C++, NVIDIA GPUs, Nsight Compute, Nsight Systems, nvprof, CUDA-MEMCHECK, TensorRT, CUDA Graphs, CUTLASS, GPU, Performance Optimization, Profiling, Deep Learning
About the Company
Pragmatike (on behalf of a fast-growing AI startup)

Pragmatike is recruiting on behalf of a fast-growing AI startup recognized as a Top 10 GenAI company by GTM Capital, founded by MIT CSAIL researchers.

Job Details
Category: Embedded
Posted: 21 days ago