PCIe DMA Model
==============

This page explains how ``axi-pcie-core`` moves data between host memory and
the FPGA application over PCIe.  The central entity is ``AxiPcieDma``, which
wraps surf's ``AxiStreamDmaV2`` engine and presents per-lane AXI-Stream
interfaces to the application.

Inbound and Outbound FIFOs
--------------------------

**Inbound (IB)** refers to data flowing from the host into the FPGA
application.  **Outbound (OB)** refers to data flowing from the FPGA
application back to the host.

``AxiPcieDma`` instantiates one ``AxiStreamDmaV2`` lane for each DMA channel.
Each lane contains a dedicated IB FIFO (host → FPGA application) and a
dedicated OB FIFO (FPGA application → host).  The number of lanes is set by
the ``DMA_SIZE_G`` generic.  Its maximum of 8 follows from the
``AxiPcieCrossbar`` slave-port budget: the crossbar provides 10 AXI4 slave
ports, namely one descriptor port, up to eight DMA lane ports, and one user
general-purpose port.
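
The port budget is simple arithmetic, and the Python sketch below makes it
explicit.  The function and constant names are illustrative only and do not
correspond to identifiers in the firmware; the lower bound of one lane is an
assumption, since only the maximum is stated here.

.. code-block:: python

    MAX_DMA_LANES = 8  # upper bound on DMA_SIZE_G stated in this section


    def crossbar_slave_ports(dma_size: int) -> int:
        """Return the number of crossbar slave ports used for a lane count.

        Hypothetical helper: one descriptor port, one port per DMA lane,
        and one user general-purpose port.
        """
        # Assumption: at least one lane is always instantiated.
        if not 1 <= dma_size <= MAX_DMA_LANES:
            raise ValueError(f"DMA_SIZE_G must be in 1..{MAX_DMA_LANES}")
        return 1 + dma_size + 1


    # With the maximum lane count, the full 10-port budget is consumed.
    assert crossbar_slave_ports(MAX_DMA_LANES) == 10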

Descriptor Rings
----------------

Each ``AxiStreamDmaV2`` lane uses a descriptor ring: a host-resident circular
buffer whose entries each point to a host DMA buffer (physical address + byte
count).  The DMA engine fetches descriptors from the ring over PCIe using the
descriptor AXI4 slave port on ``AxiPcieCrossbar``.  After the transfer
completes, the engine writes a completion status word back to the descriptor
entry, signalling the driver that the buffer is ready.
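
The ring behaviour described above can be modelled in a few lines of Python.
This is a conceptual sketch only: the ``Descriptor`` fields, the
``DescriptorRing`` class, and its method names are hypothetical and do not
reflect the actual descriptor layout or the driver's data structures.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class Descriptor:
        addr: int           # physical address of the host DMA buffer
        size: int           # byte count of the buffer
        done: bool = False  # completion status written back by the engine


    class DescriptorRing:
        """Toy circular buffer of descriptors (hypothetical layout)."""

        def __init__(self, entries: int) -> None:
            self.slots: List[Optional[Descriptor]] = [None] * entries
            self.head = 0   # next slot the driver fills
            self.hw = 0     # next slot the engine processes
            self.tail = 0   # next slot the driver reaps
            self.count = 0  # descriptors currently owned by the ring

        def post(self, addr: int, size: int) -> bool:
            """Driver side: queue a buffer; returns False if the ring is full."""
            if self.count == len(self.slots):
                return False
            self.slots[self.head] = Descriptor(addr, size)
            self.head = (self.head + 1) % len(self.slots)
            self.count += 1
            return True

        def engine_complete(self) -> bool:
            """Engine side: mark the next pending descriptor as completed."""
            desc = self.slots[self.hw]
            if desc is None or desc.done:
                return False
            desc.done = True
            self.hw = (self.hw + 1) % len(self.slots)
            return True

        def reap(self) -> Optional[Descriptor]:
            """Driver side: collect the oldest completed descriptor, if any."""
            desc = self.slots[self.tail]
            if desc is None or not desc.done:
                return None
            self.slots[self.tail] = None
            self.tail = (self.tail + 1) % len(self.slots)
            self.count -= 1
            return desc

In this model the driver calls ``post()`` to hand buffers to the engine and
``reap()`` to recover them once the completion status has been written back.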

The descriptor address space is at most 40 bits wide, even though the internal
AXI4 bus uses 64-bit addresses (``ADDR_WIDTH_C = 64`` in ``AXI_PCIE_CONFIG_C``).
This 40-bit limit is an architectural constraint of ``AxiStreamDmaV2``, so
software must allocate descriptor ring memory within the low 1 TiB
(2\ :sup:`40` bytes) of host physical address space.
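
A host-side sanity check for the 40-bit limit might look like the sketch
below.  The function name and arguments are hypothetical; a real driver would
typically enforce this through its DMA allocation mask rather than an explicit
check.

.. code-block:: python

    DESC_ADDR_BITS = 40                     # descriptor address width
    DESC_ADDR_LIMIT = 1 << DESC_ADDR_BITS   # first address that is out of range


    def ring_fits_descriptor_space(base_addr: int, size_bytes: int) -> bool:
        """Return True if [base_addr, base_addr + size_bytes) is 40-bit addressable."""
        if base_addr < 0 or size_bytes <= 0:
            raise ValueError("base_addr must be >= 0 and size_bytes must be > 0")
        # The last byte of the region must still lie below the 2**40 boundary.
        return base_addr + size_bytes <= DESC_ADDR_LIMIT

A region that ends exactly at the 2\ :sup:`40` boundary is accepted, while one
that crosses it by even a single byte is rejected.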

Back-Pressure via tReady
------------------------

``tReady`` is the AXI-Stream handshake signal asserted by a slave to accept a
data beat from its upstream master.  When a downstream application sink
de-asserts ``tReady``, ``AxiStreamDmaV2`` stops delivering IB data on that
lane.  The stall propagates back through ``AxiPcieDma`` and ultimately stalls
PCIe completion scheduling for that DMA channel.  Conversely, when an
application source has no data to send, it simply de-asserts ``tValid``; this
produces no OB beats and does not affect other lanes.

This is the canonical SLAC AXI-Stream back-pressure model: the FPGA
application must keep its IB sink draining fast enough to absorb the expected
PCIe throughput, or it must accept that the IB FIFO will fill and PCIe
transfers will stall on that lane.  Per-lane IB/OB traffic monitors (accessible
at BAR0 offsets ``0x0006_0000`` and ``0x0006_8000``) can be used to observe
stall conditions.
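
The effect of a slow sink can be illustrated with a cycle-based toy model.
This is a behavioural sketch only: the FIFO depth, the assumption that the DMA
side always has a beat to offer, and the function name are all illustrative
and are not taken from the firmware.

.. code-block:: python

    from collections import deque
    from typing import Dict, Iterable


    def simulate_ib_lane(sink_ready: Iterable[bool], fifo_depth: int = 4) -> Dict[str, int]:
        """Simulate one IB lane, one loop iteration per clock cycle.

        sink_ready holds the application's tReady value for each cycle.
        """
        fifo: deque = deque()
        delivered = 0  # beats accepted by the application sink
        stalled = 0    # cycles in which the upstream side could not push
        next_beat = 0

        for ready in sink_ready:
            # Sink side: a beat moves only when tValid (FIFO not empty)
            # and tReady are both high in the same cycle.
            if ready and fifo:
                fifo.popleft()
                delivered += 1
            # Source side (assumed always to have data): push if there is
            # room, otherwise count the cycle as a stall.
            if len(fifo) < fifo_depth:
                fifo.append(next_beat)
                next_beat += 1
            else:
                stalled += 1

        return {"delivered": delivered, "stalled": stalled}


    # A sink that never asserts tReady fills the FIFO, then stalls the source.
    print(simulate_ib_lane([False] * 6))  # {'delivered': 0, 'stalled': 2}

With ``tReady`` held high the same six cycles deliver five beats and never
stall, which is the behaviour the application should aim for at the expected
PCIe throughput.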

DMA IRQ Flow
------------

When ``AxiStreamDmaV2`` completes one or more transfers, it asserts the
``dmaIrq`` output of ``AxiPcieDma``.  ``AxiPcieUltrascalePlusIrqFsm``
receives this level-sensitive signal and converts it to a rising-edge MSI
request for the PCIe PHY IP (``usrIrqReq``).  The FSM waits for the PHY's
``usrIrqAck`` handshake before de-asserting the request and returning to
idle.  Holding the request until it is acknowledged ensures that a new
interrupt request cannot collide with one that the PHY has not yet accepted.
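
The request/acknowledge handshake can be sketched as a two-state machine.
The Python model below is a simplification under stated assumptions: the state
names, the class, and the one-call-per-clock ``step()`` interface are
hypothetical, and the real ``AxiPcieUltrascalePlusIrqFsm`` may contain
additional states or conditions.

.. code-block:: python

    from enum import Enum, auto


    class State(Enum):
        IDLE = auto()
        WAIT_ACK = auto()


    class IrqFsm:
        """Toy model of a level-to-request interrupt FSM (assumed behaviour)."""

        def __init__(self) -> None:
            self.state = State.IDLE
            self.usr_irq_req = False

        def step(self, dma_irq: bool, usr_irq_ack: bool) -> bool:
            """Advance one clock cycle and return the usrIrqReq level."""
            if self.state is State.IDLE:
                # A high dmaIrq level raises a new request.
                if dma_irq:
                    self.usr_irq_req = True
                    self.state = State.WAIT_ACK
            elif usr_irq_ack:
                # The request is dropped only once the PHY acknowledges it.
                self.usr_irq_req = False
                self.state = State.IDLE
            return self.usr_irq_req

In this model, a ``dmaIrq`` level that is still high after the acknowledge
raises a fresh request on the next cycle, so pending completions are not lost.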

The PCIe PHY delivers the MSI to the host.  The host-side kernel DMA driver
handles the interrupt, reads the completion status words from the descriptor
ring, recycles completed descriptors, and schedules new transfers.  User-space
software such as PyRogue reaches the driver through the
``rogue.hardware.axi.AxiStreamDma`` class.

End-to-End Data-Flow Diagram
----------------------------

The diagram below traces both the IB path (host → FPGA application) and the
OB path (FPGA application → host), including the descriptor and IRQ paths.
Lane indices run from 0 to N, where N = ``DMA_SIZE_G`` - 1:

.. code-block:: text

    Host (DMA buffers + descriptor ring in host physical memory)
      |
      |  PCIe lanes (Gen3 x16 or Gen4 x8, board-dependent)
      |
      v
    PCIe PHY (per-board .dcp wrapper, 250 MHz recovered clock)
      |
      |  AXI4 (256-bit data, 64-bit addr, ID_BITS=4)
      |
      v
    AxiPcieCrossbar  (DMA_SIZE_G + 2 slaves -> 1 master -> PCIe PHY)
      |
      +---> AxiPcieReg  (register path -> BAR0 AXI-Lite slaves)
      |
      +---> AxiPcieDma  (AxiStreamDmaV2 engine, DMA_SIZE_G lanes)
               |
               |  IB path (host -> FPGA application)
               +--- IB FIFO 0 --> appAxisMasters[0]  -> application lane 0
               +--- IB FIFO 1 --> appAxisMasters[1]  -> application lane 1
               +--- ...
               +--- IB FIFO N --> appAxisMasters[N]  -> application lane N
               |
               |  OB path (FPGA application -> host)
               +--- OB FIFO 0 <-- appAxisSlaves[0]   <- application lane 0
               +--- OB FIFO 1 <-- appAxisSlaves[1]   <- application lane 1
               +--- ...
               +--- OB FIFO N <-- appAxisSlaves[N]   <- application lane N
               |
               |  Descriptor path (one AXI4 slave port on AxiPcieCrossbar)
               +--- descriptor read/write <-> host descriptor ring (<=40-bit addr)
               |
               |  IRQ path
               +--- dmaIrq -> AxiPcieUltrascalePlusIrqFsm -> usrIrqReq (MSI)
                                                                  |
                                                                  v
                                                            PCIe PHY -> host MSI