GPU Async API

User-space API reference for GPU asynchronous (GPUDirect RDMA) support. The interface is split across three headers in include/:

  • GpuAsync.h — low-level ioctl wrappers (C-callable).

  • GpuAsyncUser.h — GpuAsyncCoreRegs: a thin C++ wrapper over the mapped GpuAsyncCore register block. Hides offset and behaviour differences between V1 and V4 firmware.

  • GpuAsyncLib.h — CUDA + GpuAsync helpers: DataGPU, CudaContext, GpuDmaBuffer_t, gpuMapHostFpgaMem, gpuMapFpgaMem, paired GpuBufferState_t, plus the AxiWrDesc64_t AXI stream descriptor.

Note

The ioctl wrappers in GpuAsync.h are static inline and compile into the calling application. The GpuAsyncLib helpers (DataGPU, CudaContext, gpuMapFpgaMem, etc.) are declared in the header and defined in data_dev/app/src/GpuAsyncLib.cpp — link that translation unit into your application.

The caller does not need to call cuInit(0) before constructing a CudaContext: the constructor calls cuInit(0) itself and throws on failure, so direct application use is straightforward.

Ioctl Wrappers (GpuAsync.h)

C wrappers around the GpuAsync ioctls (GPU_Add_Nvidia_Memory, GPU_Is_Gpu_Async_Supp, GPU_Get_Gpu_Async_Ver, GPU_Get_Max_Buffers, etc.; see Ioctl Command Code Reference). Safe to use from either C or C++ code. gpuGetGpuAsyncVersion and gpuGetMaxBuffers return ssize_t so kernel-level ioctl failures (-1 with errno set) are representable without wraparound; callers must check for a negative return before using the value.

Defines

GPU_Add_Nvidia_Memory

GPU command codes.

GPU_Rem_Nvidia_Memory
GPU_Set_Write_Enable
GPU_Is_Gpu_Async_Supp
GPU_Get_Gpu_Async_Ver
GPU_Get_Max_Buffers

Functions

static inline ssize_t gpuAddNvidiaMemory(int32_t fd, uint32_t write, uint64_t address, uint32_t size)

Adds an NVIDIA GPU memory region.

This function registers the specified GPU memory region with the driver so that the FPGA can access it. The write flag selects whether the region is registered for write access or read-only access.

Parameters:
  • fd – File descriptor for the device.

  • write – Write access flag (1 for write access, 0 for read-only).

  • address – Memory address of the GPU region to add.

  • size – Size of the memory region to add, in bytes. Must be a multiple of 64 KiB.

Returns:

On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.
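The size constraint above can be checked before the ioctl is issued. The following is a minimal sketch of that arithmetic only; the helper names and the choice to reject a zero size are assumptions for illustration and are not part of GpuAsync.h.

```cpp
#include <cassert>
#include <cstdint>

// 64 KiB granularity taken from the parameter description above.
static const uint32_t kRegionGranularity = 64u * 1024u;

// Hypothetical helper: true when size is a non-zero multiple of 64 KiB.
// Rejecting zero is an assumption; the reference text does not say.
static inline bool isValidRegionSize(uint32_t size) {
    return size != 0 && (size % kRegionGranularity) == 0;
}

// Hypothetical helper: rounds size up to the next multiple of 64 KiB.
// Returns 0 when the rounded value would not fit in the 32-bit size field.
static inline uint32_t roundUpRegionSize(uint32_t size) {
    const uint64_t rounded =
        (static_cast<uint64_t>(size) + kRegionGranularity - 1) /
        kRegionGranularity * kRegionGranularity;
    return rounded > UINT32_MAX ? 0u : static_cast<uint32_t>(rounded);
}
```

Doing the rounding in 64-bit arithmetic avoids a silent overflow when the requested size is close to the 32-bit limit.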

static inline ssize_t gpuRemNvidiaMemory(int32_t fd)

Removes an NVIDIA GPU memory region.

This function unregisters the memory region previously added with gpuAddNvidiaMemory(), so the region is no longer accessible.

Parameters:
  • fd – File descriptor for the device.

Returns:

On success, returns the result of the ioctl call. On failure, returns a negative error code. Returns -ENOTSUPP if the firmware does not support GPUDirect.

static inline ssize_t gpuSetWriteEn(int32_t fd, uint32_t idx)

Set write enable for buffer.

This function enables the DMA buffer at index idx for write operations.

Parameters:
  • fd – File descriptor for the device.

  • idx – Buffer index to enable.

Returns:

0 on success, negative error code on failure. Returns -ENOTSUPP if the firmware does not support GPUDirect.

static inline bool gpuIsGpuAsyncSupported(int32_t fd)

Check if the firmware supports GPU Async.

Note

The ioctl returns 1 (supported), 0 (not supported), or a negative errno on failure. We must compare against >0 because casting -1 directly to bool yields true and would produce a false-positive “supported” result on driver/fd errors.

Parameters:
  • fd – File descriptor for the device.

Returns:

true if the firmware and driver support GPU Async, false otherwise (including when the driver was built without GPUAsync support and returns -ENOTSUPP, or when the ioctl itself fails with -1 and errno set, e.g. ENOTTY or ENOTSUPP). Callers should treat a false return as “not supported” without attempting any further GPU Async ioctls.
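The note above turns on how C++ converts integers to bool. The sketch below isolates that conversion with two hypothetical helper functions; it does not call the real ioctl, and the integer argument stands in for whatever the ioctl returned.

```cpp
#include <cassert>

// Naive conversion: every non-zero value, including a negative error
// code such as -1, converts to true.
static inline bool supportedNaive(int ioctlResult) {
    return static_cast<bool>(ioctlResult);
}

// Comparison described in the note: only a strictly positive result
// counts as "supported"; 0 and negative values both yield false.
static inline bool supportedChecked(int ioctlResult) {
    return ioctlResult > 0;
}
```

With a failing ioctl result of -1, the naive form reports support while the checked form does not, which is the false positive the note warns about.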

static inline ssize_t gpuGetGpuAsyncVersion(int32_t fd)

Get the version of GpuAsyncCore in the firmware.

Note

The return type is ssize_t (signed) so the -1 failure case is representable without wraparound. Callers must explicitly check for a negative return before using the value.

Parameters:
  • fd – File descriptor for the device.

Returns:

On success, the version of GpuAsyncCore (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).

static inline ssize_t gpuGetMaxBuffers(int32_t fd)

Get the maximum number of DMA buffers.

Note

Same signed-return rationale as gpuGetGpuAsyncVersion() — callers must explicitly check for a negative return before using the value.

Parameters:
  • fd – File descriptor for the device.

Returns:

On success, the number of DMA buffers available for use (>= 0). On failure, -1 with errno set to indicate the cause (for example ENOTSUPP or ENOTTY if the driver was compiled without GPUAsync support, or another ioctl error).
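Both query wrappers above require a sign check before the value is used. The sketch below shows one way to do that; fakeGetMaxBuffers is a hypothetical stand-in for gpuGetMaxBuffers(fd) so the example compiles without the driver headers, and the value 16 is arbitrary.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <sys/types.h>  // ssize_t

// Hypothetical stand-in for gpuGetMaxBuffers(fd). On failure it behaves
// as the reference describes: returns -1 with errno set.
static ssize_t fakeGetMaxBuffers(bool fail) {
    if (fail) {
        errno = ENOTTY;
        return -1;
    }
    return 16;  // arbitrary example value
}

// Checks the sign before narrowing to an unsigned count. On failure the
// output is left untouched and errno still describes the cause.
static bool queryMaxBuffers(bool fail, uint32_t *count) {
    const ssize_t ret = fakeGetMaxBuffers(fail);
    if (ret < 0) {
        return false;
    }
    *count = static_cast<uint32_t>(ret);
    return true;
}
```

Skipping the check and casting -1 straight to uint32_t would yield 0xFFFFFFFF, which a caller could mistake for a very large buffer count.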

struct GpuNvidiaData
#include <GpuAsync.h>

Represents NVIDIA GPU memory data.

This structure is used for managing memory regions in NVIDIA GPUs, specifically for adding or removing access to these regions.

Public Members

uint32_t write

Write permission flag (non-zero for write access).

uint64_t address

GPU memory address.

uint32_t size

Size of the memory region in bytes.

Register Wrapper (GpuAsyncUser.h)

C++11 wrapper class GpuAsyncCoreRegs over the mapped GpuAsyncCore register block. The lifetime of a GpuAsyncCoreRegs instance must be within the lifetime of the memory mapping it was constructed over. Code calling into the class does not need to be aware of register offsets or the underlying GpuAsyncCore version (V1 vs V4).

class GpuAsyncCoreRegs
#include <GpuAsyncUser.h>

Thin wrapper around the C API and definitions in GpuAsyncRegs.h. The lifetime of this object must be within the lifetime of the memory mapped registers provided in the constructor.
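The lifetime rule follows from the wrapper holding a non-owning pointer to the mapping. The sketch below illustrates that with MiniRegs, a hypothetical miniature that is not the real GpuAsyncCoreRegs, constructed over an ordinary array that stands in for the mapped register block.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of a register wrapper, for illustration only.
class MiniRegs {
  public:
    explicit MiniRegs(volatile void *regs)
        : regs_(static_cast<volatile uint8_t *>(regs)) {}

    // 32-bit read at a byte offset from the start of the block.
    uint32_t readReg(uint32_t offset) const {
        return *reinterpret_cast<volatile uint32_t *>(regs_ + offset);
    }

    // 32-bit write at a byte offset from the start of the block.
    void writeReg(uint32_t offset, uint32_t value) {
        *reinterpret_cast<volatile uint32_t *>(regs_ + offset) = value;
    }

  private:
    volatile uint8_t *regs_;  // non-owning: the mapping must outlive this object
};

// The static array stands in for a mapped register block. Because it has
// static storage duration, it outlives the wrapper constructed over it.
static uint32_t writeThenReadBack() {
    static volatile uint32_t fakeBlock[4] = {0, 0, 0, 0};
    MiniRegs regs(fakeBlock);
    regs.writeReg(8, 0xABCD1234u);
    return regs.readReg(8);
}
```

In a real application the block would come from a memory mapping of the device, and the wrapper would need to be destroyed before that mapping is released.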

Public Functions

GpuAsyncCoreRegs() = delete
inline explicit GpuAsyncCoreRegs(volatile void *regs, int versionOverride = -1)
Parameters:
  • regs – Pointer to the memory mapped GpuAsyncCore registers.

  • versionOverride – Force this specific version of GpuAsyncCore, instead of reading it from the GpuAsyncReg_Version register.

inline volatile uint8_t *registers() const
inline uint32_t readReg(const GpuAsyncRegister &reg) const
inline uint32_t readReg(uint32_t offset) const

Read register at a specific offset, instead of using the GpuAsyncRegister struct.

inline void writeReg(const GpuAsyncRegister &reg, uint32_t value)
inline void writeReg(uint32_t offset, uint32_t value)
inline uint32_t version() const

Returns the version of GpuAsyncCore this firmware is running.

inline uint32_t maxBuffers() const

Returns the max number of buffers supported by the firmware.

inline uint32_t arCache() const
inline uint32_t awCache() const
inline uint32_t dmaDataBytes() const

Returns the number of DMA data bytes, DMA_AXI_CONFIG_G.DATA_BYTES_C.

inline uint32_t writeCount() const
inline void setWriteCount(uint32_t val)
inline uint32_t writeEnable() const
inline void setWriteEnable(uint32_t val)
inline uint32_t readCount() const
inline void setReadCount(uint32_t val)
inline uint32_t readEnable() const
inline void setReadEnable(uint32_t val)
inline void countReset()
inline uint32_t rxFrameCount() const
inline uint32_t txFrameCount() const
inline uint32_t axiWriteErrorCount() const
inline uint32_t axiReadErrorCount() const
inline uint32_t axiWriteErrorVal() const
inline uint32_t axiReadErrorVal() const
inline uint32_t axiWriteTimeoutCount() const
inline uint32_t axisDeMuxSelect() const
inline void setAxisDeMuxSelect(uint32_t val)
inline uint32_t minWriteBuffer() const
inline uint32_t minReadBuffer() const
inline uint32_t totalLatency(uint32_t buffer) const

Returns the total round-trip latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.

inline uint32_t totalLatencyOffset(uint32_t buffer) const
inline uint32_t gpuLatency(uint32_t buffer) const

Returns the GPU processing latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.

inline uint32_t gpuLatencyOffset(uint32_t buffer) const
inline uint32_t writeLatency(uint32_t buffer) const

Returns the FPGA -> GPU write latency, in clock cycles, reported for the buffer.

Note

For V4+, the buffer argument is ignored and should be 0.
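The latency readers above report clock cycles, so converting to time needs the frequency of the clock that drives the counters. The reference does not state that frequency; the sketch below takes it as a caller-supplied parameter, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a cycle count to nanoseconds.
// clockHz is an assumption supplied by the caller; it must match the
// clock that drives the latency counters in the firmware build in use.
static inline double cyclesToNanoseconds(uint32_t cycles, double clockHz) {
    assert(clockHz > 0.0);
    return static_cast<double>(cycles) * 1e9 / clockHz;
}
```

For example, with an assumed 250 MHz clock each cycle is 4 ns, so a reading of 250 cycles corresponds to 1000 ns.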

inline uint32_t writeLatencyOffset(uint32_t buffer) const
inline uint32_t remoteWriteSize(uint32_t buffer) const

Gets the remote write max size, used for FPGA -> GPU transfers.

Note

buffer is ignored when version() >= 4, since in V4 all buffers share the same register.

Parameters:

buffer – The buffer to read the remote write size for. Ignored in version >= 4.
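The note above describes a per-buffer register in V1 and a single shared register in V4. The sketch below shows how an offset lookup could hide that difference. All offsets and the stride are invented for illustration; the real values are defined in GpuAsyncRegs.h and are not reproduced here. Treating every version below 4 like V1 is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register map, for illustration only.
static const uint32_t kV1RemoteWriteSizeBase = 0x100;  // one register per buffer
static const uint32_t kV1BufferStride        = 0x10;   // spacing between buffers
static const uint32_t kV4RemoteWriteSize     = 0x040;  // one shared register

// Returns the byte offset of the remote write size register.
// For version >= 4 the buffer argument is ignored, as the note describes.
static inline uint32_t remoteWriteSizeOffsetSketch(uint32_t version,
                                                   uint32_t buffer) {
    if (version >= 4) {
        return kV4RemoteWriteSize;
    }
    return kV1RemoteWriteSizeBase + buffer * kV1BufferStride;
}
```

Callers pass the same arguments for either firmware version and the lookup picks the layout, which is the kind of difference the wrapper class is described as hiding.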

inline void setRemoteWriteMaxSize(uint32_t buffer, uint32_t size)

Sets the remote write max size, used for FPGA -> GPU transfers.

Note

buffer is ignored when version() >= 4, since in V4 all buffers share the same register.

Parameters:
  • buffer – The buffer to set the remote write size for. Ignored in version >= 4.

  • size – The maximum remote write size.

inline void setRemoteWriteAddress(uint32_t buffer, uint64_t addr)

Sets the remote write address for the specified buffer.

Used for FPGA -> GPU transfers.

Parameters:
  • buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4

  • addr – 64-bit address in GPU device memory

inline void setRemoteReadAddress(uint32_t buffer, uint64_t addr)

Sets the remote read address for the specified buffer.

Used for GPU -> FPGA transfers.

Parameters:
  • buffer – The buffer index. Must be < 16 for V1, and < 1024 for V4

  • addr – 64-bit address in GPU device memory
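Both address setters above document the same buffer index limits. The sketch below is one way a caller might validate an index before calling them. The helper names are hypothetical, and treating every version below 4 like V1 is an assumption, since the reference only names V1 and V4.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on the buffer index, from the parameter descriptions above:
// fewer than 16 for V1, fewer than 1024 for V4.
static inline uint32_t bufferIndexLimit(uint32_t version) {
    return version >= 4 ? 1024u : 16u;
}

// Hypothetical pre-check to run before setRemoteWriteAddress() or
// setRemoteReadAddress().
static inline bool bufferIndexValid(uint32_t version, uint32_t buffer) {
    return buffer < bufferIndexLimit(version);
}
```

The value returned by maxBuffers() may be lower than these limits on a given firmware build, so checking against it as well is a reasonable extra step.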

inline void returnFreeListIndex(uint32_t buffer)

Arms free list buffer for remote write from FPGA -> GPU.

See also

freeListOffset() for a register offset that is usable from CUDA kernels.

Parameters:

buffer – Buffer index to trigger.

inline uint32_t freeListOffset(uint32_t buffer) const

Returns the offset of the free list register from the start of the GpuAsyncCore registers.

Parameters:

buffer – The buffer index.

inline uint32_t remoteReadSizeOffset(uint32_t buffer) const

Returns the offset of the remote read size register from the start of the GpuAsyncCore registers.

This is usable in CUDA kernels.

Parameters:

buffer – The buffer index.

inline uint32_t remoteReadSize(uint32_t buffer) const

Get the remote read size for the specified buffer.

inline void setRemoteReadSize(uint32_t buffer, uint32_t size)

Set the remote read size for the buffer.

Parameters:
  • buffer – Buffer index

  • size – Size of the GPU -> FPGA transfer

Protected Functions

inline uint32_t versionSwitch() const
inline uint32_t readRegV1V4(const GpuAsyncRegister &v1, const GpuAsyncRegister &v4) const
inline void writeRegV1V4(const GpuAsyncRegister &v1, const GpuAsyncRegister &v4, uint32_t val)

Protected Attributes

volatile uint8_t *regs_
uint32_t version_

CUDA Helpers (GpuAsyncLib.h)

Thin C++ helpers for the boilerplate CUDA + GpuAsync setup: DataGPU (RAII for /dev/datadev_X), CudaContext (cuInit / device select / context create), GpuDmaBuffer_t and GpuBufferState_t (FPGA-mapped GPU memory), and the gpuMapFpgaMem / gpuMapHostFpgaMem / gpuUnmapFpgaMem helpers. The library is deliberately scope-narrow — it does not own a CUDA stream or prescribe a session lifecycle, leaving stream / buffer-pool / arming policy to the calling application.

Defines

deviceFunc
globalFunc
hostFunc

Functions

void checkError(CUresult status)

Print + abort if status is not CUDA_SUCCESS.

void checkError(cudaError_t status)

Print + abort if status is not cudaSuccess.

bool wasError(CUresult status)

Return true (and print) if status is not CUDA_SUCCESS.

int gpuMapHostFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size)

Map an FPGA register block to host + GPU via cuMemHostRegister.

Use for control / register access — not high-throughput data. Backed by dmaMapRegister(). The caller must call gpuUnmapFpgaMem() to release.

Parameters:
  • outmem – Output buffer descriptor (zero-initialised on entry).

  • fd – DMA device fd.

  • offset – Register-block offset within the device.

  • size – Register-block size in bytes.

Returns:

0 on success, -1 on failure.

int gpuMapFpgaMem(GpuDmaBuffer_t *outmem, int fd, uint64_t offset, size_t size, int write)

Allocate a GPU buffer and register it with the FPGA via RDMA.

Backed by cudaMalloc + cuPointerSetAttribute + gpuAddNvidiaMemory.

Parameters:
  • outmem – Output buffer descriptor (zero-initialised on entry).

  • fd – DMA device fd.

  • offset – Currently unused; reserved for future use.

  • size – Buffer size in bytes (must be a multiple of 64 KiB).

  • write – Non-zero if this buffer is for FPGA->GPU writes; zero for GPU->FPGA reads.

Returns:

0 on success, -1 on failure.

void gpuUnmapFpgaMem(GpuDmaBuffer_t *mem)

Release a buffer previously mapped by gpuMapHostFpgaMem / gpuMapFpgaMem.

int gpuInitBufferState(GpuBufferState_t *b, int fd, size_t bufSize)

Allocate paired rx + tx GPU buffers and register them with the FPGA.

Returns:

0 on success, -1 on error (rx already cleaned up if tx failed).
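The return contract above means a failed call leaves nothing for the caller to release. The sketch below shows that rollback pattern with hypothetical stand-ins (FakeBuffer, fakeMap, fakeUnmap) in place of GpuDmaBuffer_t, gpuMapFpgaMem and gpuUnmapFpgaMem; it illustrates the pattern and is not the library's implementation.

```cpp
#include <cassert>

// Hypothetical stand-ins so the sketch compiles without CUDA or the driver.
struct FakeBuffer {
    bool mapped = false;
};

struct FakePair {
    FakeBuffer rx;
    FakeBuffer tx;
};

// Maps a buffer, or fails on request to simulate an allocation error.
static int fakeMap(FakeBuffer *b, bool fail) {
    if (fail) {
        return -1;
    }
    b->mapped = true;
    return 0;
}

static void fakeUnmap(FakeBuffer *b) {
    b->mapped = false;
}

// Maps rx, then tx. If tx fails, rx is unmapped before returning so the
// caller never has to clean up a half-initialised pair.
static int initPair(FakePair *p, bool failRx, bool failTx) {
    if (fakeMap(&p->rx, failRx) != 0) {
        return -1;
    }
    if (fakeMap(&p->tx, failTx) != 0) {
        fakeUnmap(&p->rx);
        return -1;
    }
    return 0;
}
```

A caller therefore needs the matching teardown call only after a successful return.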

void gpuDestroyBufferState(GpuBufferState_t *b)

Tear down state allocated by gpuInitBufferState.

inline AxiWrDesc64_t UnpackAxiWriteDescriptor(const void *data)
class DataGPU
#include <GpuAsyncLib.h>

RAII wrapper around an opened /dev/datadev_X file descriptor.

The constructor opens the device with O_RDWR and throws on failure. The destructor closes the descriptor.

Public Functions

explicit DataGPU(const char *path)
inline ~DataGPU()
inline int fd() const

Returns the underlying file descriptor (or -1 if closed).
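The open-in-constructor, close-in-destructor behaviour described above is a standard RAII pattern. The sketch below shows it with ScopedFd, a hypothetical class written for illustration; the real DataGPU is defined in GpuAsyncLib and may differ in detail, for example in the exception type it throws.

```cpp
#include <cassert>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical sketch of an RAII file descriptor wrapper.
class ScopedFd {
  public:
    // Opens path with O_RDWR and throws if the open fails.
    explicit ScopedFd(const char *path) : fd_(::open(path, O_RDWR)) {
        if (fd_ < 0) {
            throw std::runtime_error(std::string("open failed: ") + path);
        }
    }

    // Closes the descriptor when the object goes out of scope.
    ~ScopedFd() {
        if (fd_ >= 0) {
            ::close(fd_);
        }
    }

    // Copying is disabled so the descriptor is closed exactly once.
    ScopedFd(const ScopedFd &) = delete;
    ScopedFd &operator=(const ScopedFd &) = delete;

    int fd() const { return fd_; }

  private:
    int fd_;
};

// Returns true if constructing a ScopedFd for path throws.
static bool openThrows(const char *path) {
    try {
        ScopedFd dev(path);
        return false;
    } catch (const std::runtime_error &) {
        return true;
    }
}
```

Because the constructor throws, code that reaches the line after construction can rely on fd() being a valid descriptor.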

Protected Attributes

int fd_
class CudaContext
#include <GpuAsyncLib.h>

Wraps cuInit + cuDeviceGet + cuCtxCreate.

The constructor calls cuInit(0) and throws on failure. init() selects a device, verifies stream-memory-ops, and creates a CUDA context. The resulting context and device are exposed via context() and device().

Public Functions

CudaContext()
bool init(int device = -1, bool quiet = false)

Select a CUDA device and create a context.

Parameters:
  • device – CUDA device index. A negative value selects device 0.

  • quiet – Suppress informational stderr output.

Returns:

true on success.

void listDevices()

Print all visible CUDA devices to stdout.

int getAttribute(CUdevice_attribute attr)

Convenience wrapper around cuDeviceGetAttribute (returns 0 on failure).

inline CUdevice device() const
inline CUcontext context() const

Protected Attributes

CUcontext context_
CUdevice device_
struct GpuDmaBuffer_t
#include <GpuAsyncLib.h>

FPGA memory mapped for GPU access (host-mapped or RDMA).

Public Members

int fd

Owning DMA device fd.

uint8_t *ptr

Host-accessible pointer (NULL when gpuOnly == 1).

size_t size

Size of the block in bytes.

CUdeviceptr dptr

Device pointer for the block.

int gpuOnly

1 if FPGA<->GPU only (no host mapping).
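The mapping helpers expect this descriptor to be zero-initialised on entry. The sketch below shows value-initialisation on DmaBufferSketch, a hypothetical mirror of the members listed above in which uint64_t stands in for CUdeviceptr so that no CUDA header is needed. The hostPointerUsable helper is also hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of GpuDmaBuffer_t, for illustration only.
struct DmaBufferSketch {
    int fd;
    uint8_t *ptr;
    size_t size;
    uint64_t dptr;  // stand-in for CUdeviceptr
    int gpuOnly;
};

// Value-initialisation with {} zeroes every member, which gives the
// zero-initialised descriptor the mapping helpers ask for.
static inline DmaBufferSketch makeEmptyBuffer() {
    return DmaBufferSketch{};
}

// Follows the documented rule that ptr is NULL when gpuOnly == 1:
// the host pointer is only usable for host-mapped blocks.
static inline bool hostPointerUsable(const DmaBufferSketch &b) {
    return b.gpuOnly == 0 && b.ptr != nullptr;
}
```

Checking gpuOnly before dereferencing ptr keeps host code from touching a buffer that exists only on the FPGA and GPU side.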

struct GpuBufferState_t
#include <GpuAsyncLib.h>

A pair of FPGA-mapped GPU buffers: one for FPGA writes, one for FPGA reads.

Public Members

uint8_t *swFpgaRegs
GpuDmaBuffer_t bread

GPU -> FPGA read buffer (GPU is the source of FPGA reads).

GpuDmaBuffer_t bwrite

FPGA -> GPU write buffer (GPU is the destination of FPGA writes).

struct AxiWrDesc64_t
#include <GpuAsyncLib.h>

64-bit AXI stream write descriptor as emitted by the FPGA.

Layout matches AxiStreamDmaV2Write.vhd.

Public Members

uint32_t result
uint32_t overflow

Overflow bit.

uint32_t cont

Continue bit.

uint32_t reserved0
uint32_t lastUser
uint32_t firstUser
uint32_t size