**`cudarc` crate for Rust**
CUDA is a complex software stack, but fundamentally it boils down to three things:
- A piece of code running on the GPU (a *kernel*)
- A piece of code running on the CPU (the *driver*)
- Allocations: blocks of memory that can be created on the CPU or the GPU and `memcpy`'ed between them (a minimal sketch of this model follows below)
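As a concrete sketch of that model, here is roughly what the CPU side looks like with `cudarc`. The method names follow the crate's 0.x driver API and may differ between versions; treat this as a sketch rather than the exact code from this post:

```rust
use cudarc::driver::CudaDevice;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The CPU-side "driver" piece: a handle to GPU 0.
    let dev = CudaDevice::new(0)?;

    // An allocation created on the CPU...
    let host_data: Vec<f32> = vec![1.0; 1024];

    // ...copied into a matching allocation on the GPU...
    let device_data = dev.htod_copy(host_data)?;

    // ...and copied back to the CPU once the GPU is done with it.
    let roundtrip: Vec<f32> = dev.dtoh_sync_copy(&device_data)?;
    assert_eq!(roundtrip.len(), 1024);
    Ok(())
}
```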
Code can be generated and compiled on the fly using the *NVIDIA Runtime Compilation* library (`nvrtc`).
`nvrtc` takes a CUDA C++ program as a string and outputs a compiled *PTX* program. PTX is a portable virtual instruction set that can target many different NVIDIA GPUs, including ones that don't exist yet, thanks to CUDA's strong forward-compatibility guarantees.
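A minimal sketch of that round trip with `cudarc`: the `scale` kernel and the module name are placeholders of mine, and the `compile_ptx`/`load_ptx` calls follow the crate's 0.x API.

```rust
use cudarc::driver::CudaDevice;
use cudarc::nvrtc::compile_ptx;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // CUDA C++ source handed to nvrtc as a plain string.
    let src = r#"
extern "C" __global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { data[i] *= factor; }
}
"#;

    // nvrtc turns the source into PTX...
    let ptx = compile_ptx(src)?;

    // ...which the driver finishes compiling for whatever GPU is actually present.
    let dev = CudaDevice::new(0)?;
    dev.load_ptx(ptx, "scale_module", &["scale"])?;
    println!(
        "kernel loaded: {}",
        dev.get_func("scale_module", "scale").is_some()
    );
    Ok(())
}
```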
### NVIDIA Jetson
I have an NVIDIA Jetson AGX Orin device that I bought a long time ago when I first became interested in edge AI applications.
Now, the Orin is a test bed that I can use to experiment with writing raw CUDA kernels and executing them at machine speed.
The datasheet for this device claims that it should be able to achieve 204 GB/s of memory bandwidth. That's pretty good for a tiny device! A measly 10% of what you'd get on its big sister the H100, but closer to what you'd get on something like an Apple M2 Pro processor.
I started poking Claude for ideas. It gave me a starting point that didn't compile, but after fixing it up I was able to get some measurements across 3 different kernels:
- `write_kernel`: broadcast a single value to every slot in the output array
- `read_kernel`: sum all values in an input array into a single value
- `copy_kernel`: element-wise copy of an input array to an output array (a sketch of these kernels follows below)
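I don't reproduce the exact benchmark source here, but the sketch below shows what a kernel like `copy_kernel` or `write_kernel` and its launch look like with `cudarc`. The kernel bodies, block sizing, and buffer sizes are my own assumptions, not the measured code:

```rust
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::compile_ptx;

const KERNELS: &str = r#"
extern "C" __global__ void copy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = in[i]; }
}

extern "C" __global__ void write_kernel(float *out, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = value; }
}
"#;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;
    let ptx = compile_ptx(KERNELS)?;
    dev.load_ptx(ptx, "bw", &["copy_kernel", "write_kernel"])?;

    let n: usize = 100 * 1024 * 1024 / 4; // ~100 MB of f32s
    let input = dev.htod_copy(vec![1.0f32; n])?;
    let mut output = dev.alloc_zeros::<f32>(n)?;

    // One thread per element, grouped into fixed-size blocks.
    let cfg = LaunchConfig::for_num_elems(n as u32);
    let copy = dev.get_func("bw", "copy_kernel").unwrap();
    unsafe { copy.launch(cfg, (&input, &mut output, n as i32)) }?;
    dev.synchronize()?;
    Ok(())
}
```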
The results look like the following:
```text
CUDA Memory Bandwidth Test (Kernel-based)
Device: Orin
============================================================
 Size (MB)    Copy (GB/s)    Read (GB/s)    Write (GB/s)
------------------------------------------------------------
       100         108.86          43.98          139.37
       500         108.99          33.02          134.57
      1000         107.27          32.99          139.56
      2000         135.88          33.06          164.17
      4000         121.54          33.08          150.91
```
This shows we're only able to reach roughly 75-80% of the theoretical bandwidth (the best result, 164 GB/s from the 2000 MB write case, is about 80% of 204 GB/s), and oddly the maximum comes from the write kernel. It's not clear to me why the three kernels differ so much from one another. My device is in MAXN mode and I've pinned the clocks to their maximums by running `sudo jetson_clocks`.
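For context, here is my guess at how numbers like these are typically produced: time a fully synchronized kernel launch and divide the bytes moved by the elapsed seconds. Whether the copy counts as one buffer's worth of traffic or two (a read plus a write) is an assumption that changes the reported figure; this sketch counts two.

```rust
use std::time::Instant;
use cudarc::driver::{CudaDevice, CudaFunction, CudaSlice, LaunchAsync, LaunchConfig};

/// Times one launch of a copy kernel and returns GB/s, counting the copy as
/// one read plus one write of the buffer (an assumption about the benchmark).
fn copy_bandwidth_gbps(
    dev: &std::sync::Arc<CudaDevice>,
    copy: CudaFunction,
    input: &CudaSlice<f32>,
    output: &mut CudaSlice<f32>,
    n: usize,
) -> Result<f64, Box<dyn std::error::Error>> {
    let cfg = LaunchConfig::for_num_elems(n as u32);
    dev.synchronize()?; // make sure nothing else is still in flight
    let start = Instant::now();
    unsafe { copy.launch(cfg, (input, output, n as i32)) }?;
    dev.synchronize()?; // wait for the kernel itself, not just the launch call
    let secs = start.elapsed().as_secs_f64();
    let bytes = 2.0 * (n * std::mem::size_of::<f32>()) as f64; // read + write
    Ok(bytes / secs / 1e9)
}
```

Timing with a host-side `Instant` plus `synchronize()` also includes a little launch overhead, which matters less as the buffers grow.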