Sébastien Bourdeauducq edited this page Mar 16, 2017 · 9 revisions

ARTIQ Direct Memory Access (DMA) #553

API sketch

# obtain a handle to a named DMA sequence
my_burst = DMA("my_burst")
# record events into it
with my_burst:
    delay(10*ns)
    ttl0.pulse(20*ns)
    for i in range(100):
        dds2.pulse(300*MHz + i*1*MHz, 220*ns)
# timeline is unaltered and rewound to before the `with`

# potentially in a new experiment, new kernel:

# retrieve a reference to a previously recorded DMA sequence
my_pulse = DMA("my_burst")
t = now_mu()
for i in range(100):
    ttl2.pulse(3*us)
    # trigger one playback of the sequence,
    # and wait until the DMA engine has finished
    my_pulse.play()
    # timeline advanced by length of my_pulse
# recorded sequence length: 10 ns + 20 ns + 100*220 ns = 22030 ns
assert t + seconds_to_mu(100*(3*us + 22030*ns)) == now_mu()

# release the DMA sequence
my_pulse.free()

Features

  • DMA sequences persist across kernels and experiments. Otherwise the traffic and CPU time spent re-recording them would reduce the benefit.
  • There are two use cases: generating the RTIO event sequences at compile time or at runtime. Runtime seems more powerful and generic, especially if sequences persist.
  • DMA takes 3 system (CPU) clock cycles per RTIO event.
  • Playback of large DMA sequences will mostly stall on FIFO depth and will not be limited by DMA throughput. In other cases DMA will compete with the CPU for memory bandwidth and slow the CPU down.
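
To put the 3-cycles-per-event figure in perspective, here is a minimal sketch of the resulting upper bound on DMA throughput. The function names are hypothetical, and the system clock frequency is a parameter because these notes do not specify one. FIFO stalls and memory contention are ignored.

```python
def dma_event_rate(sys_clk_hz, cycles_per_event=3):
    """Upper bound on RTIO events per second issued by one DMA engine."""
    return sys_clk_hz / cycles_per_event


def dma_stream_time(n_events, sys_clk_hz, cycles_per_event=3):
    """Minimum time in seconds for one DMA engine to issue n_events,
    ignoring FIFO stalls and memory contention."""
    return n_events * cycles_per_event / sys_clk_hz


if __name__ == "__main__":
    SYS_CLK_HZ = 125e6  # assumed value, for illustration only
    print("events/s:", dma_event_rate(SYS_CLK_HZ))
    print("time for 1000 events (s):", dma_stream_time(1000, SYS_CLK_HZ))
```

If the events in a sequence are spaced more closely than this bound allows, playback falls behind the timeline no matter how deep the FIFOs are.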

Format

  • An event in a DMA sequence should be serialized as [length, channel_number, timestamp, address, data]. A DMA sequence is a concatenation of events.
  • Timestamps in a DMA sequence are relative to the beginning of the sequence, but purely additive: no scaling of timestamps.
  • The first timestamp is whatever delay the first event has relative to the logical start of the sequence.
  • RTIO events in DMA are variable length, with up to 512 bits of data.
  • DMA sequences can be stored sequentially in DRAM. Old sequences are marked as unused explicitly by the experiment (DMA().free()). Leaking DMA sequences is the user's fault.
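
The serialization described above could look roughly like the following sketch. The field widths (1-byte length, 3-byte channel, 8-byte timestamp, 1-byte address), the little-endian byte order, and all function names are assumptions made for illustration. Only the field order and the 512-bit data limit come from these notes.

```python
import struct

HEADER_SIZE = 1 + 3 + 8 + 1  # length, channel, timestamp, address (assumed widths)
MAX_DATA_BYTES = 64          # 512 bits of data


def pack_event(channel, timestamp, address, data):
    """Serialize one event as [length, channel_number, timestamp, address, data]."""
    if len(data) > MAX_DATA_BYTES:
        raise ValueError("data exceeds 512 bits")
    length = HEADER_SIZE + len(data)  # total record length in bytes
    return (struct.pack("<B", length)
            + channel.to_bytes(3, "little")
            + struct.pack("<QB", timestamp, address)
            + data)


def unpack_sequence(buf):
    """Split a concatenation of events into (channel, timestamp, address, data)."""
    events = []
    pos = 0
    while pos < len(buf):
        length = buf[pos]
        if length < HEADER_SIZE:
            raise ValueError("corrupt record length")
        channel = int.from_bytes(buf[pos + 1:pos + 4], "little")
        timestamp, address = struct.unpack_from("<QB", buf, pos + 4)
        data = bytes(buf[pos + HEADER_SIZE:pos + length])
        events.append((channel, timestamp, address, data))
        pos += length
    return events


def to_absolute(events, start_mu):
    """Playback is purely additive: offset each relative timestamp by the start time."""
    return [(ch, start_mu + ts, addr, data) for ch, ts, addr, data in events]
```

Because each record carries its own length, a sequence is just the byte-wise concatenation of `pack_event` results, and the reader can walk it without an index.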

Multiple engines

  • There should be multiple (4?) DMA engines. Otherwise only one DMA sequence can be replayed at any given time.
  • With multiple engines DMA sequences can be "composed" and would interleave automatically through arbitration at the RTIO interface.
  • It might be good to support DMA sequences triggering replay of other DMA sequences (separate/same engine(s)?).

Arbitration

  • There needs to be arbitration between the kernel accessing RTIO channels and the DMA engine doing the same.
  • The arbitration does not need to be granular. The claim can be on the "entire" RTIO API.
  • The memory arbiter is round-robin. There will be a slow-down due to sharing of the bandwidth and DRAM dynamics but no (inherent) starvation. The slow-down seems irrelevant since the DRAM can easily outpace the CPU.
  • If multiple engines, the kernel and a DMA engine, or distributed DMA (DDMA) and upstream DRTIO access the same channel, they risk generating out-of-sequence events.
  • This design will also need to foresee distributed DMA, where concurrency becomes trickier still. The arbitration would be at the remote end and would need to handle the FIFO status traffic and "FIFO full" events.
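
The no-starvation property of round-robin arbitration mentioned above can be illustrated with a small software model. This is a sketch with a hypothetical class name, not the gateware arbiter. The requesters could stand for the CPU and one or more DMA engines.

```python
class RoundRobinArbiter:
    """Software model of a round-robin arbiter over n requesters."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # so that requester 0 is considered first

    def grant(self, requests):
        """Return the index of the granted requester, or None if nobody requests.

        The search starts just after the previously granted requester, so a
        requester that keeps asserting its request waits at most n - 1 grants.
        """
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # Requester 1 is idle; requesters 0 and 2 alternate.
    print([arbiter.grant([True, False, True]) for _ in range(4)])
```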

DMA for input events

  • Input DMA segments need to know which channels to collect.
  • Input DMA segments have pre-allocated memory. Their max length is specified at creation time.
  • They poll the channels round-robin. The polling rate should be maximized, but it seems acceptable to poll one channel per cycle. In a collection group of N channels each channel is then polled only once every N cycles, which reduces the maximum event rate for a single channel in a larger group. Users can allocate/split channels to multiple engines/simultaneous DMA sequences.
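
The polling scheme above can be modelled in a few lines. This is a toy sketch: the class name, the use of deques as channel FIFOs, and the choice to leave events in the FIFO once the buffer is full are all assumptions, since these notes do not specify overflow behavior.

```python
from collections import deque


class InputDMAModel:
    """Toy model: poll one channel per cycle, round-robin, into a bounded buffer."""

    def __init__(self, channels, max_length):
        self.channels = list(channels)  # channels to collect, fixed at creation
        self.max_length = max_length    # maximum length, fixed at creation
        self.buffer = []                # stands in for the pre-allocated memory
        self._next = 0

    def cycle(self, fifos):
        """Poll a single channel; fifos maps channel number -> deque of events."""
        channel = self.channels[self._next]
        self._next = (self._next + 1) % len(self.channels)
        fifo = fifos[channel]
        if fifo and len(self.buffer) < self.max_length:
            self.buffer.append((channel, fifo.popleft()))


if __name__ == "__main__":
    fifos = {3: deque(["a", "b"]), 5: deque(["x"])}
    dma = InputDMAModel([3, 5], max_length=2)
    for _ in range(4):
        dma.cycle(fifos)
    print(dma.buffer)
```

With N channels in a group, each channel is visited once every N cycles, which is why splitting channels across several engines raises the per-channel event rate.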

Distributed DMA (DDMA)

  • DMA data should be distributed: the DRTIO links are slower than local fabric and DRAM bandwidth.
  • The (DRTIO) gateware should determine where to intercept DRTIO events in DMA record mode and which DMA sequence to record into.
  • The gateware should trigger (broadcast) the execution of a DMA sequence.
  • Output DDMA for remote channels should be supported (i.e. the DMA engine on Metlino and the channel on Kasli).
  • Input DDMA for remote channels should not be supported. The polling traffic would kill the advantage.
  • TODO: interaction with the analyzer

Non-blocking DMA

  • DMA().play() should not block.
  • There should be a DMA().wait() method to wait for DMA completion.
  • Re-recording an in-use output DMA sequence is forbidden.
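
The intended semantics can be sketched as a small state model. The class below is hypothetical and runs on a host, not on the core device. Playback is modelled as completing when `wait()` is called, and raising an exception on a forbidden re-record is an assumption about how the rule would be enforced.

```python
class DMASequenceModel:
    """Host-side model of non-blocking play(), wait(), and the re-record rule."""

    def __init__(self, name):
        self.name = name
        self.events = []
        self.playing = False

    def record(self, events):
        """Replace the recorded events; forbidden while the sequence is in use."""
        if self.playing:
            raise RuntimeError("cannot re-record an in-use DMA sequence")
        self.events = list(events)

    def play(self):
        """Start playback and return immediately."""
        self.playing = True

    def wait(self):
        """Block until playback has finished (instantaneous in this model)."""
        self.playing = False


if __name__ == "__main__":
    seq = DMASequenceModel("my_burst")
    seq.record(["event0", "event1"])
    seq.play()          # returns at once; the kernel can do other work here
    seq.wait()          # synchronize before touching the sequence again
    seq.record(["event2"])
```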

TODO (needs specification/funding)

  • Input DMA API
  • Non-blocking DMA
  • Distributed DMA
  • Multiple engines
  • Complex arbitration (multiple DMA engines, incoming DRTIO, and CPU accessing simultaneously)