GPU Async API
User-space API reference for GPU asynchronous (GPUDirect RDMA) support.
The interface is split across three headers in include/:
GpuAsync.h – low-level ioctl wrappers (C-callable).
GpuAsyncUser.h – GpuAsyncCoreRegs: a thin C++ wrapper over the mapped GpuAsyncCore register block. Hides offset and behaviour differences between V1 and V4 firmware.
GpuAsyncLib.h – CUDA + GpuAsync helpers: DataGPU, CudaContext, GpuDmaBuffer_t, gpuMapHostFpgaMem, gpuMapFpgaMem, paired GpuBufferState_t, plus the AxiWrDesc64_t AXI stream descriptor.
Note
The ioctl wrappers in GpuAsync.h are static inline and compile
into the calling application. The GpuAsyncLib helpers
(DataGPU, CudaContext, gpuMapFpgaMem, etc.) are declared
in the header and defined in data_dev/app/src/GpuAsyncLib.cpp —
link that translation unit into your application.
The caller does not need to call cuInit(0) before constructing a
CudaContext: the CudaContext constructor calls cuInit(0) itself
(and throws on failure), so direct application use is straightforward.
Ioctl Wrappers (GpuAsync.h)
C wrappers around the GpuAsync ioctls (GPU_Add_Nvidia_Memory,
GPU_Is_Gpu_Async_Supp, GPU_Get_Gpu_Async_Ver,
GPU_Get_Max_Buffers, etc.; see Ioctl Command Code Reference). Safe to use from
either C or C++ code. gpuGetGpuAsyncVersion and gpuGetMaxBuffers
return ssize_t so kernel-level ioctl failures (-1 with errno
set) are representable without wraparound; callers must check for a
negative return before using the value.
Defines
-
GPU_Add_Nvidia_Memory
GPU command codes.
-
GPU_Rem_Nvidia_Memory
-
GPU_Set_Write_Enable
-
GPU_Is_Gpu_Async_Supp
-
GPU_Get_Gpu_Async_Ver
-
GPU_Get_Max_Buffers
Functions
-
static inline ssize_t gpuAddNvidiaMemory(int32_t fd, uint32_t write, uint64_t address, uint32_t size)
Adds an NVIDIA GPU memory region.
This function adds a specified memory region to the NVIDIA GPU, allowing for the region to be accessed as specified by the write flag.
- Parameters:
fd – File descriptor for the device.
write – Write access flag (1 for write access, 0 for read-only).
address – Memory address of the GPU region to add.
size – Size of the memory region to add, in bytes. Must be a multiple of 64 KiB.
- Returns:
On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.
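Because size must be a multiple of 64 KB, callers will usually want to round a requested length up before registering it. The sketch below shows that arithmetic only. It assumes the granularity is 65536 bytes (the gpuMapFpgaMem documentation states the same constraint as 64 KiB), and kGpuPageSize, isGpuSizeAligned and roundUpGpuSize are hypothetical names that are not part of GpuAsync.h.

```cpp
#include <cassert>
#include <cstdint>

// Assumed granularity: 64 KiB (65536 bytes).
// Hypothetical constant, not defined by GpuAsync.h.
constexpr uint32_t kGpuPageSize = 64u * 1024u;

// True if size is a non-zero multiple of the assumed granularity.
inline bool isGpuSizeAligned(uint32_t size) {
    return size != 0 && (size % kGpuPageSize) == 0;
}

// Round size up to the next multiple of the granularity.
// Returns 0 if the rounded value would not fit in the uint32_t size argument.
inline uint32_t roundUpGpuSize(uint32_t size) {
    const uint64_t rounded =
        ((static_cast<uint64_t>(size) + kGpuPageSize - 1) / kGpuPageSize) * kGpuPageSize;
    return rounded > UINT32_MAX ? 0u : static_cast<uint32_t>(rounded);
}
```

A caller might pass roundUpGpuSize(requested) as the size argument and treat a result of 0 as "request too large".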
-
static inline ssize_t gpuRemNvidiaMemory(int32_t fd)
Removes an NVIDIA GPU memory region.
This function removes a previously added memory region from the NVIDIA GPU, ceasing its accessibility.
- Parameters:
fd – File descriptor for the device.
- Returns:
On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.
-
static inline ssize_t gpuSetWriteEn(int32_t fd, uint32_t idx)
Set write enable for buffer.
This function enables a DMA buffer for DMA operations.
- Parameters:
fd – File descriptor for the device.
idx – Buffer index to enable.
- Returns:
0 on success, negative error code on failure. Returns -ENOTSUPP if the firmware does not support GPUDirect.
-
static inline bool gpuIsGpuAsyncSupported(int32_t fd)
Check if the firmware supports GPU Async.
Note
The ioctl returns 1 (supported), 0 (not supported), or a negative errno on failure. The wrapper must compare against > 0 because casting -1 directly to bool yields true and would produce a false-positive "supported" result on driver/fd errors.
- Parameters:
fd – File descriptor for the device.
- Returns:
true if the firmware and driver support GPU Async, false otherwise (including when the driver was built without GPUAsync support and returns -ENOTSUPP, or when the ioctl itself fails with -1 and errno set, e.g. ENOTTY or ENOTSUPP). Callers should treat a false return as "not supported" without attempting any further GPU Async ioctls.
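The conversion pitfall described in the note can be reproduced in isolation. This is a minimal sketch, not the wrapper's actual source: naiveSupported and checkedSupported are hypothetical names, and the long parameter stands in for the raw ioctl return value.

```cpp
#include <cassert>

// Naive conversion: any non-zero value, including -1, becomes true.
inline bool naiveSupported(long ioctlRet) {
    return static_cast<bool>(ioctlRet);
}

// Comparison described in the note: only a strictly positive return
// means "supported"; 0 and every negative value map to false.
inline bool checkedSupported(long ioctlRet) {
    return ioctlRet > 0;
}
```

With a failed ioctl (-1), the naive form reports support while the checked form does not.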
-
static inline ssize_t gpuGetGpuAsyncVersion(int32_t fd)
Get the version of GpuAsyncCore in the firmware.
Note
The return type is ssize_t (signed) so the -1 failure case is representable without wraparound. Callers must explicitly check for a negative return before using the value.
- Parameters:
fd – File descriptor for the device.
- Returns:
On success, the version of GpuAsyncCore (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).
-
static inline ssize_t gpuGetMaxBuffers(int32_t fd)
Get the maximum number of DMA buffers.
Note
Same signed-return rationale as gpuGetGpuAsyncVersion() — callers must explicitly check for a negative return before using the value.
- Parameters:
fd – File descriptor for the device.
- Returns:
On success, the number of DMA buffers available for use (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).
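The reason for the signed return, and the check callers are expected to make, can be sketched as follows. The helper names (uncheckedCount, checkedCount, countOr) are hypothetical and not part of GpuAsync.h. The ssize_t argument stands in for the value returned by gpuGetMaxBuffers() or gpuGetGpuAsyncVersion().

```cpp
#include <cassert>
#include <cstdint>
#include <sys/types.h>  // ssize_t (POSIX)

// What happens if a -1 failure is narrowed without a check:
// it wraps around to a very large unsigned count.
inline uint32_t uncheckedCount(ssize_t raw) {
    return static_cast<uint32_t>(raw);
}

// Hypothetical helper: validate first, then narrow.
// Returns false on failure and leaves *out untouched.
inline bool checkedCount(ssize_t raw, uint32_t *out) {
    if (raw < 0 || static_cast<uint64_t>(raw) > UINT32_MAX) return false;
    *out = static_cast<uint32_t>(raw);
    return true;
}

// Convenience form for call sites that prefer a fallback value.
inline uint32_t countOr(ssize_t raw, uint32_t fallback) {
    uint32_t value = fallback;
    checkedCount(raw, &value);
    return value;
}
```

A call site could then read, for example, countOr(gpuGetMaxBuffers(fd), 0) and treat 0 as "no usable buffers".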
-
struct GpuNvidiaData
- #include <GpuAsync.h>
Represents NVIDIA GPU memory data.
This structure is used for managing memory regions in NVIDIA GPUs, specifically for adding or removing access to these regions.
Register Wrapper (GpuAsyncUser.h)
C++11 wrapper class GpuAsyncCoreRegs over the mapped GpuAsyncCore
register block. The lifetime of a GpuAsyncCoreRegs instance must be
within the lifetime of the memory mapping it was constructed over.
Code calling into the class does not need to be aware of register
offsets or the underlying GpuAsyncCore version (V1 vs V4).
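The lifetime rule is easier to see with a toy version of the pattern. The sketch below uses MiniRegs, a hypothetical stand-in that only mimics the offset-based readReg/writeReg overloads. It does not reproduce GpuAsyncCoreRegs, its register offsets, or its version handling, and a plain array stands in for the real memory mapping.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a register wrapper; not GpuAsyncCoreRegs.
class MiniRegs {
 public:
    explicit MiniRegs(volatile void *regs)
        : regs_(static_cast<volatile uint8_t *>(regs)) {}

    // Byte offsets are assumed to be 4-byte aligned.
    uint32_t readReg(uint32_t offset) const {
        return *reinterpret_cast<volatile uint32_t *>(regs_ + offset);
    }
    void writeReg(uint32_t offset, uint32_t value) {
        *reinterpret_cast<volatile uint32_t *>(regs_ + offset) = value;
    }

 private:
    volatile uint8_t *regs_;  // non-owning: the mapping must outlive this object
};

// The wrapper is declared after the backing block and destroyed before it,
// so its lifetime stays within the lifetime of the "mapping".
inline uint32_t writeThenRead(uint32_t offset, uint32_t value) {
    uint32_t block[16] = {};  // stands in for the mmap'd register block
    MiniRegs regs(block);
    regs.writeReg(offset, value);
    return regs.readReg(offset);
}
```

With a real mapping, the same ordering applies: create the mapping, construct the wrapper, and let the wrapper go out of scope before unmapping.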
-
class GpuAsyncCoreRegs
- #include <GpuAsyncUser.h>
Thin wrapper around the C API and definitions in GpuAsyncRegs.h. The lifetime of this object must be within the lifetime of the memory mapped registers provided in the constructor.
Public Functions
-
GpuAsyncCoreRegs() = delete
-
inline explicit GpuAsyncCoreRegs(volatile void *regs, int versionOverride = -1)
- Parameters:
regs – Pointer to the memory mapped GpuAsyncCore registers.
versionOverride – Force this specific version of GpuAsyncCore, instead of reading it from the GpuAsyncReg_Version register.
-
inline volatile uint8_t *registers() const
-
inline uint32_t readReg(const GpuAsyncRegister &reg) const
-
inline uint32_t readReg(uint32_t offset) const
Read register at a specific offset, instead of using the GpuAsyncRegister struct.
-
inline void writeReg(const GpuAsyncRegister &reg, uint32_t value)
-
inline void writeReg(uint32_t offset, uint32_t value)
-
inline uint32_t version() const
Returns the version of GpuAsyncCore this firmware is running.
-
inline uint32_t maxBuffers() const
Returns the max number of buffers supported by the firmware.
-
inline uint32_t arCache() const
-
inline uint32_t awCache() const
-
inline uint32_t dmaDataBytes() const
Returns the number of DMA data bytes, DMA_AXI_CONFIG_G.DATA_BYTES_C.
-
inline uint32_t writeCount() const
-
inline void setWriteCount(uint32_t val)
-
inline uint32_t writeEnable() const
-
inline void setWriteEnable(uint32_t val)
-
inline uint32_t readCount() const
-
inline void setReadCount(uint32_t val)
-
inline uint32_t readEnable() const
-
inline void setReadEnable(uint32_t val)
-
inline void countReset()
-
inline uint32_t rxFrameCount() const
-
inline uint32_t txFrameCount() const
-
inline uint32_t axiWriteErrorCount() const
-
inline uint32_t axiReadErrorCount() const
-
inline uint32_t axiWriteErrorVal() const
-
inline uint32_t axiReadErrorVal() const
-
inline uint32_t axiWriteTimeoutCount() const
-
inline uint32_t axisDeMuxSelect() const
-
inline void setAxisDeMuxSelect(uint32_t val)
-
inline uint32_t minWriteBuffer() const
-
inline uint32_t minReadBuffer() const
-
inline uint32_t totalLatency(uint32_t buffer) const
Returns the total round-trip latency, in clock cycles, reported for the buffer.
Note
For V4+, the buffer argument is ignored and should be 0.
-
inline uint32_t totalLatencyOffset(uint32_t buffer) const
-
inline uint32_t gpuLatency(uint32_t buffer) const
Returns the GPU processing latency, in clock cycles, reported for the buffer.
Note
For V4+, the buffer argument is ignored and should be 0.
-
inline uint32_t gpuLatencyOffset(uint32_t buffer) const
-
inline uint32_t writeLatency(uint32_t buffer) const
Returns the FPGA -> GPU write latency, in clock cycles, reported for the buffer.
Note
For V4+, the buffer argument is ignored and should be 0.
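The latency accessors report clock cycles, so a conversion is needed before comparing against wall-clock budgets. The sketch below shows the arithmetic only. The reference does not state which clock drives these counters, so the frequency is a caller-supplied assumption, and cyclesToNanoseconds is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Convert a latency in clock cycles to whole nanoseconds (truncating).
// clockHz is an assumption supplied by the caller; the API reference
// does not specify the counter clock. Returns 0 if clockHz is 0.
inline uint64_t cyclesToNanoseconds(uint32_t cycles, uint64_t clockHz) {
    if (clockHz == 0) return 0;
    return (static_cast<uint64_t>(cycles) * 1000000000ull) / clockHz;
}
```

For instance, with a purely illustrative 250 MHz clock, a reading of 250 cycles corresponds to 1000 ns.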
-
inline uint32_t writeLatencyOffset(uint32_t buffer) const
-
inline uint32_t remoteWriteSize(uint32_t buffer) const
Gets the remote write max size, used for FPGA -> GPU transfers.
Note
buffer is ignored when version() >= 4, since in V4 all buffers share the same register
- Parameters:
buffer – The buffer to get the remote size for. Ignored in version >= 4
-
inline void setRemoteWriteMaxSize(uint32_t buffer, uint32_t size)
Sets the remote write max size, used for FPGA -> GPU transfers.
Note
buffer is ignored when version() >= 4, since in V4 all buffers share the same register
- Parameters:
buffer – The buffer to set the remote size for. Ignored in version >= 4
size – The maximum remote write size.
-
inline void setRemoteWriteAddress(uint32_t buffer, uint64_t addr)
Sets the remote write address for the specified buffer.
Used for FPGA -> GPU transfers
- Parameters:
buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4
addr – 64-bit address in GPU device memory
-
inline void setRemoteReadAddress(uint32_t buffer, uint64_t addr)
Sets the remote read address for the specified buffer.
Used for GPU -> FPGA transfers
- Parameters:
buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4
addr – 64-bit address in GPU device memory
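The buffer index limits above can be checked before programming an address. This is a minimal sketch: the limits (16 for V1, 1024 for V4) come from the parameter descriptions, while the treatment of other version numbers (anything below 4 uses the V1 limit, anything at or above 4 uses the V4 limit) is an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Exclusive upper bound on the buffer index for a given core version.
// Stated in the reference: < 16 for V1, < 1024 for V4.
// Assumption: versions below 4 share the V1 limit, versions >= 4 the V4 limit.
inline uint32_t bufferIndexLimit(uint32_t version) {
    return version >= 4 ? 1024u : 16u;
}

// True if buffer is a legal index for setRemoteWriteAddress /
// setRemoteReadAddress under the limits above.
inline bool bufferIndexValid(uint32_t version, uint32_t buffer) {
    return buffer < bufferIndexLimit(version);
}
```

In practice the value returned by maxBuffers() may be lower than these architectural limits, so a caller would likely check against both.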
-
inline void returnFreeListIndex(uint32_t buffer)
Arms free list buffer for remote write from FPGA -> GPU.
See also
freeListOffset() for something usable with CUDA
- Parameters:
buffer – Buffer index to trigger.
-
inline uint32_t freeListOffset(uint32_t buffer) const
Returns the offset of the free list register from the start of the GpuAsyncCore registers.
- Parameters:
buffer – The buffer index.
-
inline uint32_t remoteReadSizeOffset(uint32_t buffer) const
Returns the offset of the remote read size register from the start of the GpuAsyncCore registers.
This is usable in CUDA kernels.
- Parameters:
buffer – the buffer index.
-
inline uint32_t remoteReadSize(uint32_t buffer) const
Get the remote read size for the specified buffer.
-
inline void setRemoteReadSize(uint32_t buffer, uint32_t size)
Set the remote read size for the buffer.
- Parameters:
buffer – Buffer index
size – Size of the GPU -> FPGA transfer
CUDA Helpers (GpuAsyncLib.h)
Thin C++ helpers for the boilerplate CUDA + GpuAsync setup:
DataGPU (RAII for /dev/datadev_X), CudaContext (cuInit /
device select / context create), GpuDmaBuffer_t and
GpuBufferState_t (FPGA-mapped GPU memory), and the gpuMapFpgaMem
/ gpuMapHostFpgaMem / gpuUnmapFpgaMem helpers. The library is
deliberately scope-narrow — it does not own a CUDA stream or prescribe
a session lifecycle, leaving stream / buffer-pool / arming policy to
the calling application.
Functions
-
void checkError(CUresult status)
Print + abort if status is not CUDA_SUCCESS.
-
void checkError(cudaError_t status)
Print + abort if status is not cudaSuccess.
-
bool wasError(CUresult status)
Return true (and print) if status is not CUDA_SUCCESS.
-
int gpuMapHostFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size)
Map an FPGA register block to host + GPU via cuMemHostRegister.
Use for control / register access — not high-throughput data. Backed by dmaMapRegister(). The caller must call gpuUnmapFpgaMem() to release.
- Parameters:
outmem – Output buffer descriptor (zero-initialised on entry).
fd – DMA device fd.
offset – Register-block offset within the device.
size – Register-block size in bytes.
- Returns:
0 on success, -1 on failure.
-
int gpuMapFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size, int write)
Allocate a GPU buffer and register it with the FPGA via RDMA.
Backed by cudaMalloc + cuPointerSetAttribute + gpuAddNvidiaMemory.
- Parameters:
outmem – Output buffer descriptor (zero-initialised on entry).
fd – DMA device fd.
offset – Currently unused; reserved for future use.
size – Buffer size in bytes (must be a multiple of 64 KiB).
write – Non-zero if this buffer is for FPGA->GPU writes; zero for GPU->FPGA reads.
- Returns:
0 on success, -1 on failure.
-
void gpuUnmapFpgaMem(GpuDmaBuffer_t *mem)
Release a buffer previously returned by gpuMapHostFpgaMem / gpuMapFpgaMem.
-
int gpuInitBufferState(GpuBufferState_t *b, int fd, size_t bufSize)
Allocate paired rx + tx GPU buffers and register them with the FPGA.
- Returns:
0 on success, -1 on error (rx already cleaned up if tx failed).
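The "rx already cleaned up if tx failed" behaviour is a two-step allocation with rollback. The sketch below illustrates that pattern in general terms. It is not the implementation in GpuAsyncLib.cpp: initPair, MapFn, UnmapFn and liveBuffersAfter are hypothetical names, and the callbacks stand in for whatever maps and unmaps a single buffer.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for "map one buffer" and "unmap one buffer".
// which == 0 is the rx buffer, which == 1 is the tx buffer.
using MapFn = std::function<int(int which)>;    // 0 on success, -1 on failure
using UnmapFn = std::function<void(int which)>;

// Map rx, then tx. If tx fails, undo rx before reporting the error,
// so the caller never has to clean up after a failed init.
inline int initPair(const MapFn &map, const UnmapFn &unmap) {
    if (map(0) != 0) return -1;  // rx failed: nothing to undo
    if (map(1) != 0) {           // tx failed: release rx first
        unmap(0);
        return -1;
    }
    return 0;
}

// Demo: count how many buffers remain mapped after initPair returns.
inline int liveBuffersAfter(bool txFails) {
    int live = 0;
    initPair(
        [&](int which) -> int {
            if (which == 1 && txFails) return -1;
            ++live;
            return 0;
        },
        [&](int) { --live; });
    return live;
}
```

The practical consequence for callers is that a -1 return needs no matching gpuDestroyBufferState call under this pattern, though that should be confirmed against the library source.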
-
void gpuDestroyBufferState(GpuBufferState_t *b)
Tear down state allocated by gpuInitBufferState.
-
inline AxiWrDesc64_t UnpackAxiWriteDescriptor(const void *data)
-
class DataGPU
- #include <GpuAsyncLib.h>
RAII wrapper around an opened /dev/datadev_X file descriptor.
The constructor opens the device with O_RDWR and throws on failure. The destructor closes the descriptor.
Public Functions
-
explicit DataGPU(const char *path)
-
inline ~DataGPU()
-
inline int fd() const
Returns the underlying file descriptor (or -1 if closed).
Protected Attributes
-
int fd_
-
class CudaContext
- #include <GpuAsyncLib.h>
Wraps cuInit + cuDeviceGet + cuCtxCreate.
The constructor calls cuInit(0) and throws on failure. init() selects a device, verifies stream-memory-ops, and creates a CUDA context. The resulting context and device are exposed via context() and device().
Public Functions
-
CudaContext()
-
bool init(int device = -1, bool quiet = false)
Select a CUDA device and create a context.
- Parameters:
device – CUDA device index. < 0 selects device 0.
quiet – Suppress informational stderr output.
- Returns:
true on success.
-
void listDevices()
Print all visible CUDA devices to stdout.
-
int getAttribute(CUdevice_attribute attr)
Convenience wrapper around cuDeviceGetAttribute (returns 0 on failure).
-
inline CUdevice device() const
-
inline CUcontext context() const
-
struct GpuDmaBuffer_t
- #include <GpuAsyncLib.h>
FPGA memory mapped for GPU access (host-mapped or RDMA).
-
struct GpuBufferState_t
- #include <GpuAsyncLib.h>
A pair of FPGA-mapped GPU buffers: one for FPGA writes, one for FPGA reads.
Public Members
-
uint8_t *swFpgaRegs
-
GpuDmaBuffer_t bread
GPU -> FPGA read buffer (GPU is the source of FPGA reads).
-
GpuDmaBuffer_t bwrite
FPGA -> GPU write buffer (GPU is the destination of FPGA writes).
-
struct AxiWrDesc64_t
- #include <GpuAsyncLib.h>
64-bit AXI stream write descriptor as emitted by the FPGA.
Layout matches AxiStreamDmaV2Write.vhd.