GPU Async API

User-space API reference for GPU asynchronous (GPUDirect RDMA) support. The interface is split across three headers in include/:

GpuAsync.h — low-level ioctl wrappers (C-callable).
GpuAsyncUser.h — GpuAsyncCoreRegs: a thin C++ wrapper over the mapped GpuAsyncCore register block. Hides offset and behaviour differences between V1 and V4 firmware.
GpuAsyncLib.h — CUDA + GpuAsync helpers: DataGPU, CudaContext, GpuDmaBuffer_t, gpuMapHostFpgaMem, gpuMapFpgaMem, paired GpuBufferState_t, plus the AxiWrDesc64_t AXI stream descriptor.

Note

The ioctl wrappers in GpuAsync.h are static inline and compile into the calling application. The GpuAsyncLib helpers (DataGPU, CudaContext, gpuMapFpgaMem, etc.) are declared in the header and defined in data_dev/app/src/GpuAsyncLib.cpp — link that translation unit into your application.

The caller does not need to call cuInit(0) before constructing a CudaContext: the constructor calls cuInit(0) itself (and throws on failure), so direct application use is straightforward.

Ioctl Wrappers (GpuAsync.h)

C wrappers around the GpuAsync ioctls (GPU_Add_Nvidia_Memory, GPU_Is_Gpu_Async_Supp, GPU_Get_Gpu_Async_Ver, GPU_Get_Max_Buffers, etc.; see Ioctl Command Code Reference). Safe to use from either C or C++ code. gpuGetGpuAsyncVersion and gpuGetMaxBuffers return ssize_t so kernel-level ioctl failures (-1 with errno set) are representable without wraparound; callers must check for a negative return before using the value.

Defines

GPU_Add_Nvidia_Memory: GPU command codes.

GPU_Rem_Nvidia_Memory

GPU_Set_Write_Enable

GPU_Is_Gpu_Async_Supp

GPU_Get_Gpu_Async_Ver

GPU_Get_Max_Buffers

Functions

static inline ssize_t gpuAddNvidiaMemory(int32_t fd, uint32_t write, uint64_t address, uint32_t size)

Adds an NVIDIA GPU memory region.

This function adds a specified memory region to the NVIDIA GPU, allowing for the region to be accessed as specified by the write flag.

Parameters:

fd – File descriptor for the device.
write – Write access flag (1 for write access, 0 for read-only).
address – Memory address of the GPU region to add.
size – Size of the memory region to add, in bytes. Must be a multiple of 64 KiB.

Returns:

On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.
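The size constraint above can be checked before the ioctl is issued. The following is a minimal sketch, not part of GpuAsync.h: the helper names are hypothetical, and rejecting a zero-length region is an assumption of this example rather than documented behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Granularity stated in the parameter description above (64 KiB).
constexpr uint64_t kGpuRegionGranularity = 64ull * 1024ull;

// Hypothetical helper: true when `bytes` is a non-zero multiple of 64 KiB.
// Treating 0 as invalid is an assumption of this sketch.
constexpr bool isValidGpuRegionSize(uint64_t bytes) {
    return bytes != 0 && bytes % kGpuRegionGranularity == 0;
}

// Hypothetical helper: rounds `bytes` up to the next multiple of 64 KiB.
constexpr uint64_t roundUpGpuRegionSize(uint64_t bytes) {
    return (bytes + kGpuRegionGranularity - 1) / kGpuRegionGranularity *
           kGpuRegionGranularity;
}

// Narrows the rounded size to the uint32_t taken by gpuAddNvidiaMemory(),
// asserting that the value still fits.
inline uint32_t toGpuRegionSize(uint64_t bytes) {
    const uint64_t rounded = roundUpGpuRegionSize(bytes);
    assert(rounded <= UINT32_MAX);
    return static_cast<uint32_t>(rounded);
}
```

Rounding up before allocation keeps the size passed to gpuAddNvidiaMemory() consistent with the size of the GPU allocation it describes.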

static inline ssize_t gpuRemNvidiaMemory(int32_t fd)

Removes an NVIDIA GPU memory region.

This function removes a previously added memory region from the NVIDIA GPU, ceasing its accessibility.

Parameters:

fd – File descriptor for the device.

Returns:

On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.

static inline ssize_t gpuSetWriteEn(int32_t fd, uint32_t idx)

Set write enable for buffer.

This function enables a DMA buffer for DMA operations.

Parameters:

fd – File descriptor for the device.
idx – Buffer index to enable.

Returns:

0 on success, negative error code on failure. Returns -ENOTSUPP if the firmware does not support GPUDirect.

static inline bool gpuIsGpuAsyncSupported(int32_t fd)

Check if the firmware supports GPU Async.

Note

The ioctl returns 1 (supported), 0 (not supported), or a negative errno on failure. The wrapper compares the result against > 0 because casting -1 directly to bool yields true and would produce a false-positive “supported” result on driver/fd errors.

Parameters:

fd – File descriptor for the device.

Returns:

true if the firmware and driver support GPU Async, false otherwise (including when the driver was built without GPUAsync support and returns -ENOTSUPP, or when the ioctl itself fails with -1 and errno set, e.g. ENOTTY or ENOTSUPP). Callers should treat a false return as “not supported” without attempting any further GPU Async ioctls.
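The pitfall described in the note can be reproduced without a device. This sketch uses two hypothetical stand-in functions that take the raw ioctl result as an argument; they are for illustration only and are not the body of gpuIsGpuAsyncSupported().

```cpp
#include <cassert>

// Naive conversion: any non-zero value, including -1, becomes true.
inline bool naiveIsSupported(long ioctlResult) {
    return static_cast<bool>(ioctlResult);
}

// Comparison described in the note: only a strictly positive result counts
// as "supported"; 0 and negative values both map to false.
inline bool checkedIsSupported(long ioctlResult) {
    return ioctlResult > 0;
}
```

With a result of -1, the naive version reports support while the checked version does not, which is the false positive the wrapper avoids.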

static inline ssize_t gpuGetGpuAsyncVersion(int32_t fd)

Get the version of GpuAsyncCore in the firmware.

Note

The return type is ssize_t (signed) so the -1 failure case is representable without wraparound. Callers must explicitly check for a negative return before using the value.

Parameters:

fd – File descriptor for the device.

Returns:

On success, the version of GpuAsyncCore (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).

static inline ssize_t gpuGetMaxBuffers(int32_t fd)

Get the maximum number of DMA buffers.

Note

Same signed-return rationale as gpuGetGpuAsyncVersion() — callers must explicitly check for a negative return before using the value.

Parameters:

fd – File descriptor for the device.

Returns:

On success, the number of DMA buffers available for use (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).
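A caller might wrap the negative-return check for gpuGetGpuAsyncVersion() and gpuGetMaxBuffers() roughly as follows. This is a sketch under the assumption that the value is wanted as a uint32_t; takeUnsigned is a hypothetical name and is not declared in GpuAsync.h.

```cpp
#include <sys/types.h>  // ssize_t
#include <cassert>
#include <cstdint>

// Hypothetical helper. Accepts the ssize_t returned by one of the two
// wrappers and writes *out only when the call succeeded.
inline bool takeUnsigned(ssize_t rc, uint32_t* out) {
    assert(out != nullptr);
    if (rc < 0) {
        return false;  // ioctl failed; errno describes the cause
    }
    *out = static_cast<uint32_t>(rc);
    return true;
}
```

Skipping the check and casting directly would turn -1 into 0xFFFFFFFF, which reads as an implausibly large version or buffer count.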

struct GpuNvidiaData

#include <GpuAsync.h>

Represents NVIDIA GPU memory data.

This structure is used for managing memory regions in NVIDIA GPUs, specifically for adding or removing access to these regions.

Public Members

uint32_t write: Write permission flag (non-zero for write access).

uint64_t address: GPU memory address.

uint32_t size: Size of the memory region in bytes.

Register Wrapper (GpuAsyncUser.h)

C++11 wrapper class GpuAsyncCoreRegs over the mapped GpuAsyncCore register block. The lifetime of a GpuAsyncCoreRegs instance must be within the lifetime of the memory mapping it was constructed over. Code calling into the class does not need to be aware of register offsets or the underlying GpuAsyncCore version (V1 vs V4).

class GpuAsyncCoreRegs

#include <GpuAsyncUser.h>

Thin wrapper around the C API and definitions in GpuAsyncRegs.h. The lifetime of this object must be within the lifetime of the memory-mapped registers provided in the constructor.

Public Functions

GpuAsyncCoreRegs() = delete

inline explicit GpuAsyncCoreRegs(volatile void *regs, int versionOverride = -1)

Parameters:

regs – Pointer to the memory mapped GpuAsyncCore registers.
versionOverride – Force this specific version of GpuAsyncCore instead of reading it from the GpuAsyncReg_Version register.

inline volatile uint8_t *registers() const

inline uint32_t readReg(const GpuAsyncRegister &reg) const

inline uint32_t readReg(uint32_t offset) const: Read register at a specific offset, instead of using the GpuAsyncRegister struct.

inline void writeReg(const GpuAsyncRegister &reg, uint32_t value)

inline void writeReg(uint32_t offset, uint32_t value)
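For readers unfamiliar with offset-addressed register access, the sketch below shows one common way to read and write 32-bit registers through a volatile base pointer. RegView is a hypothetical class written for this example; it is an assumption about the general technique, not the actual implementation of readReg() or writeReg().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for offset-based access to a mapped register block.
class RegView {
public:
    explicit RegView(volatile void* base)
        : base_(static_cast<volatile uint8_t*>(base)) {}

    // Reads the 32-bit register at a byte offset (must be 4-byte aligned).
    uint32_t read(uint32_t offset) const {
        assert(offset % 4 == 0);
        return *reinterpret_cast<volatile uint32_t*>(base_ + offset);
    }

    // Writes the 32-bit register at a byte offset (must be 4-byte aligned).
    void write(uint32_t offset, uint32_t value) {
        assert(offset % 4 == 0);
        *reinterpret_cast<volatile uint32_t*>(base_ + offset) = value;
    }

private:
    volatile uint8_t* base_;  // not owned; must outlive this object
};
```

As with GpuAsyncCoreRegs, the view does not own the memory, so it must not outlive the mapping it was constructed over.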

inline uint32_t version() const: Returns the version of GpuAsyncCore this firmware is running.

inline uint32_t maxBuffers() const: Returns the max number of buffers supported by the firmware.

inline uint32_t arCache() const

inline uint32_t awCache() const

inline uint32_t dmaDataBytes() const: Returns the number of DMA data bytes, DMA_AXI_CONFIG_G.DATA_BYTES_C.

inline uint32_t writeCount() const

inline void setWriteCount(uint32_t val)

inline uint32_t writeEnable() const

inline void setWriteEnable(uint32_t val)

inline uint32_t readCount() const

inline void setReadCount(uint32_t val)

inline uint32_t readEnable() const

inline void setReadEnable(uint32_t val)

inline void countReset()

inline uint32_t rxFrameCount() const

inline uint32_t txFrameCount() const

inline uint32_t axiWriteErrorCount() const

inline uint32_t axiReadErrorCount() const

inline uint32_t axiWriteErrorVal() const

inline uint32_t axiReadErrorVal() const

inline uint32_t axiWriteTimeoutCount() const

inline uint32_t axisDeMuxSelect() const

inline void setAxisDeMuxSelect(uint32_t val)

inline uint32_t minWriteBuffer() const

inline uint32_t minReadBuffer() const

inline uint32_t totalLatency(uint32_t buffer) const: Returns the total round-trip latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.

inline uint32_t totalLatencyOffset(uint32_t buffer) const

inline uint32_t gpuLatency(uint32_t buffer) const: Returns the GPU processing latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.

inline uint32_t gpuLatencyOffset(uint32_t buffer) const

inline uint32_t writeLatency(uint32_t buffer) const: Returns the FPGA -> GPU write latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.
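The latency accessors report clock cycles, so turning them into time requires the frequency of the clock that drives the counters. That frequency is not stated in this reference; in the sketch below it is a caller-supplied assumption, and cyclesToNanoseconds is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Converts a cycle count to nanoseconds. `clockHz` must be the frequency of
// the firmware clock behind the latency counters; the caller has to know it,
// since this reference does not specify it.
inline double cyclesToNanoseconds(uint32_t cycles, double clockHz) {
    assert(clockHz > 0.0);
    return static_cast<double>(cycles) * 1e9 / clockHz;
}
```

For example, 250 cycles at an assumed 250 MHz clock correspond to 1000 ns.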

inline uint32_t writeLatencyOffset(uint32_t buffer) const

inline uint32_t remoteWriteSize(uint32_t buffer) const

Gets the remote write max size, used for FPGA -> GPU transfers.

Note

buffer is ignored when version() >= 4, since in V4 all buffers share the same register.

Parameters:

buffer – The buffer to get the remote write size for. Ignored in version >= 4.

inline void setRemoteWriteMaxSize(uint32_t buffer, uint32_t size)

Sets the remote write max size, used for FPGA -> GPU transfers.

Note

buffer is ignored when version() >= 4, since in V4 all buffers share the same register.

Parameters:

buffer – The buffer to set the remote size for. Ignored in version >= 4.
size – The maximum remote write size to set.

inline void setRemoteWriteAddress(uint32_t buffer, uint64_t addr)

Sets the remote write address for the specified buffer.

Used for FPGA -> GPU transfers.

Parameters:

buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4.
addr – 64-bit address in GPU device memory.

inline void setRemoteReadAddress(uint32_t buffer, uint64_t addr)

Sets the remote read address for the specified buffer.

Used for GPU -> FPGA transfers.

Parameters:

buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4.
addr – 64-bit address in GPU device memory.
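The buffer index limits given for setRemoteWriteAddress() and setRemoteReadAddress() can be validated before the call. The helpers below are hypothetical and take the value returned by version() as input; treating every version below 4 like V1 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Exclusive upper bound on the buffer index, taken from the parameter
// descriptions above: 16 for V1, 1024 for V4. Versions 2 and 3 are assumed
// (not documented here) to share the V1 limit.
constexpr uint32_t bufferIndexLimit(uint32_t version) {
    return version >= 4 ? 1024u : 16u;
}

// True when `buffer` is a legal index for the given GpuAsyncCore version.
constexpr bool isValidBufferIndex(uint32_t version, uint32_t buffer) {
    return buffer < bufferIndexLimit(version);
}
```

Where the firmware value is available, comparing against maxBuffers() as well may be a tighter check than these fixed limits.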

inline void returnFreeListIndex(uint32_t buffer)

Arms the free-list buffer for a remote write from FPGA -> GPU.

CUDA Helpers (GpuAsyncLib.h)

Thin C++ helpers for the boilerplate CUDA + GpuAsync setup: DataGPU (RAII for /dev/datadev_X), CudaContext (cuInit / device select / context create), GpuDmaBuffer_t and GpuBufferState_t (FPGA-mapped GPU memory), and the gpuMapFpgaMem / gpuMapHostFpgaMem / gpuUnmapFpgaMem helpers. The library is deliberately narrow in scope: it does not own a CUDA stream or prescribe a session lifecycle, leaving stream, buffer-pool, and arming policy to the calling application.

Defines

deviceFunc

globalFunc

hostFunc

Functions

void checkError(CUresult status): Print + abort if status is not CUDA_SUCCESS.

void checkError(cudaError_t status): Print + abort if status is not cudaSuccess.

bool wasError(CUresult status): Return true (and print) if status is not CUDA_SUCCESS.

int gpuMapHostFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size)

Map an FPGA register block to host + GPU via cuMemHostRegister.

Use for control / register access — not high-throughput data. Backed by dmaMapRegister(). The caller must call gpuUnmapFpgaMem() to release.

Parameters:

outmem – Output buffer descriptor (zero-initialised on entry).
fd – DMA device fd.
offset – Register-block offset within the device.
size – Register-block size in bytes.

Returns:

0 on success, -1 on failure.

int gpuMapFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size, int write)

Allocate a GPU buffer and register it with the FPGA via RDMA.

Backed by cudaMalloc + cuPointerSetAttribute + gpuAddNvidiaMemory.

Parameters:

outmem – Output buffer descriptor (zero-initialised on entry).
fd – DMA device fd.
offset – Currently unused; reserved for future use.
size – Buffer size in bytes (must be a multiple of 64 KiB).
write – Non-zero if this buffer is for FPGA->GPU writes; zero for GPU->FPGA reads.

Returns:

0 on success, -1 on failure.

void gpuUnmapFpgaMem(GpuDmaBuffer_t *mem): Release a buffer previously returned by gpuMapHostFpgaMem / gpuMapFpgaMem.

int gpuInitBufferState(GpuBufferState_t *b, int fd, size_t bufSize)

Allocate paired rx + tx GPU buffers and register them with the FPGA.

Returns:

0 on success, -1 on error (rx already cleaned up if tx failed).
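The rollback behaviour noted in the return description (rx is released when tx fails) follows a common two-step acquisition pattern. The sketch below illustrates that pattern with caller-supplied callables standing in for the real mapping calls; initPair is a hypothetical name and this is not the source of gpuInitBufferState().

```cpp
#include <cassert>
#include <functional>

// Hypothetical illustration of paired setup with rollback. Each map callable
// returns 0 on success and non-zero on failure, mirroring the 0 / -1
// convention used by the helpers in this header.
inline int initPair(const std::function<int()>& mapRx,
                    const std::function<int()>& mapTx,
                    const std::function<void()>& unmapRx) {
    if (mapRx() != 0) {
        return -1;   // nothing acquired yet, nothing to undo
    }
    if (mapTx() != 0) {
        unmapRx();   // undo the first step so the caller sees a clean state
        return -1;
    }
    return 0;
}
```

Because the first buffer is released on a partial failure, the caller only needs to tear down state after a successful return.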

void gpuDestroyBufferState(GpuBufferState_t *b): Tear down state allocated by gpuInitBufferState.

inline AxiWrDesc64_t UnpackAxiWriteDescriptor(const void *data)

class DataGPU

#include <GpuAsyncLib.h>

RAII wrapper around an opened /dev/datadev_X file descriptor.

The constructor opens the device with O_RDWR and throws on failure. The destructor closes the descriptor.

Public Functions

explicit DataGPU(const char *path)

inline ~DataGPU()

inline int fd() const: Returns the underlying file descriptor (or -1 if closed).

Protected Attributes

int fd_
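The RAII behaviour described for DataGPU (open with O_RDWR, throw on failure, close in the destructor) can be sketched with plain POSIX calls. FdGuard is a hypothetical class written for this example, and the exception type and message format are assumptions; the real DataGPU may differ in both.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper around a file descriptor.
class FdGuard {
public:
    // Opens `path` read/write and throws std::runtime_error on failure.
    explicit FdGuard(const char* path) : fd_(::open(path, O_RDWR)) {
        if (fd_ < 0) {
            throw std::runtime_error(std::string("open failed for ") + path +
                                     ": " + std::strerror(errno));
        }
    }

    // Closes the descriptor when the guard goes out of scope.
    ~FdGuard() {
        if (fd_ >= 0) {
            ::close(fd_);
        }
    }

    // Copying would lead to a double close, so it is disabled.
    FdGuard(const FdGuard&) = delete;
    FdGuard& operator=(const FdGuard&) = delete;

    int fd() const { return fd_; }

private:
    int fd_;
};
```

Declaring such a guard before any object that uses the descriptor ensures the descriptor is closed last, since locals are destroyed in reverse order of construction.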

class CudaContext

#include <GpuAsyncLib.h>

Wraps cuInit + cuDeviceGet + cuCtxCreate.

The constructor calls cuInit(0) and throws on failure. init() selects a device, verifies stream-memory-ops, and creates a CUDA context. The resulting context and device are exposed via context() and device().

Public Functions

CudaContext()

bool init(int device = -1, bool quiet = false)

Select a CUDA device and create a context.

Parameters:

device – CUDA device index. < 0 selects device 0.
quiet – Suppress informational stderr output.

Returns:

true on success.

void listDevices(): Print all visible CUDA devices to stdout.

int getAttribute(CUdevice_attribute attr): Convenience wrapper around cuDeviceGetAttribute (returns 0 on failure).

inline CUdevice device() const

inline CUcontext context() const

Protected Attributes

CUcontext context_

CUdevice device_

struct GpuDmaBuffer_t

#include <GpuAsyncLib.h>

FPGA memory mapped for GPU access (host-mapped or RDMA).

Public Members

int fd: Owning DMA device fd.

uint8_t *ptr: Host-accessible pointer (NULL when gpuOnly == 1).

size_t size: Size of the block in bytes.

CUdeviceptr dptr: Device pointer for the block.

int gpuOnly: 1 if FPGA<->GPU only (no host mapping).

struct GpuBufferState_t

#include <GpuAsyncLib.h>

A pair of FPGA-mapped GPU buffers: one for FPGA writes, one for FPGA reads.

Public Members

uint8_t *swFpgaRegs

GpuDmaBuffer_t bread: GPU -> FPGA read buffer (GPU is the source of FPGA reads).

GpuDmaBuffer_t bwrite: FPGA -> GPU write buffer (GPU is the destination of FPGA writes).

struct AxiWrDesc64_t

#include <GpuAsyncLib.h>

64-bit AXI stream write descriptor as emitted by the FPGA.

Layout matches AxiStreamDmaV2Write.vhd.

Public Functions

inline uint32_t result() const

inline uint32_t overflow() const

inline uint32_t cont() const

inline uint32_t lastUser() const

inline uint32_t firstUser() const

Public Members

uint32_t flags

uint32_t size