# Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

Anastasiia Ruzhanskaia, Systems Group, D-INFK, ETH Zürich, Zürich, Switzerland

David Cock, Systems Group, D-INFK, ETH Zürich, Zürich, Switzerland

Pengcheng Xu, Systems Group, D-INFK, ETH Zürich, Zürich, Switzerland

Timothy Roscoe, Systems Group, D-INFK, ETH Zürich, Zürich, Switzerland

## **Abstract**

Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts for asynchrony between cores and device.

In this paper we question this wisdom in the light of two trends: modern and emerging *cache-coherent interconnects* like CXL3.0, and workloads, particularly *microservices* and *serverless computing*. Like some others before us, we argue that the assumptions of the DMA-based model are obsolete, and in many use-cases *programmed I/O*, where the CPU explicitly transfers data and control information to and from a device via loads and stores, delivers a more efficient system.

However, we push this idea much further. We show, in a real hardware implementation, the gains in latency for fine-grained communication achievable using an open cache-coherence protocol which exposes cache transitions to a smart device, and that throughput is competitive with DMA over modern interconnects. We also demonstrate three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions, comparing our use of coherence with both traditional DMA-style interaction and a highly-optimized implementation using memory-mapped programmed I/O over PCIe.

# 1 Introduction

Modern interfaces between CPUs and high-performance devices like network interface adaptors (NICs), Graphics Processing Units (GPUs), etc. are designed to optimize *throughput for large transfers*. Built over a PCI Express (PCIe) interconnect, *descriptor rings* in memory hold queues of requests and completions, and the device uses DMA to read and write both descriptors and payload data. This design, along with the underlying PCIe interconnect, trades off latency for small transactions in favor of high throughput for large ones.

We question this consensus in the light of workloads which depend for performance not on bulk throughput, but on the cumulative latency of small transactions between CPU and device: closely-coupled accelerators for irregular workloads, or Remote Procedure Call (RPC) using small messages. We do this in the context of emerging *cache-coherent* peripheral interconnects like CXL.mem 3.0 [18].

The continuing rise in PCIe bandwidth means, for a given transfer size, the significance of PCIe *latency* (time to first byte) is now more of a concern, leading to many proposals for reducing the overhead of descriptor management (which we survey in Section 6).

We propose a more radical approach, aimed at use-cases where low latency is more important than high maximum throughput: using Programmed I/O (PIO) to issue loads and stores to Memory-Mapped I/O (MMIO) registers directly, avoiding DMA, descriptors, queues, and interrupts entirely.

Using a real hardware platform which implements a coherent interconnect between a server-class CPU and a large FPGA, we investigate the trade-offs between DMA with descriptors, PIO directly over a PCIe interconnect, and PIO over a full cache-coherence protocol.

We show that, for small (≤ 1KiB) transfers, PIO posted writes over PCIe significantly outperform DMA, although reads are much less efficient since, unlike writes, they cannot be pipelined and so incur PCIe's significant round-trip latency, in line with other recent critiques of DMA [39, 50, 51].

We go further, however, and show how a conventional MESI-like coherence protocol can be *exploited in novel ways* by intelligent devices interacting with an unmodified CPU using coherence messages. By *relaxing traditional coherence protocol assumptions* (for example, the independence of cache lines), we can achieve dramatically more efficient communication between a CPU and a device. A coherence-based message protocol which avoids PCIe significantly outperforms both DMA and PIO over PCIe and *completely eliminates tail latency*, while delivering the same or better throughput.

We first provide background and motivation, including the origins of descriptor-based DMA, and why it is time to question these underlying assumptions. In Section 3 we describe and motivate the experimental hardware platform we use for evaluation, calibrating performance against conventional PC servers. We also describe PIO with PCIe here.

In Section 4, we discuss why true cache-coherent device interconnects like CXL3.0 are fundamentally different from today's PCIe, and present a family of protocols that pass messages with low latency between software on an unmodified CPU, and a smart device with message-level access to the coherence protocol. We then present the implementation of these protocols on the hardware platform.

Section 5 compares traditional DMA-based I/O with PIO over PCIe and our new coherence-based protocols using three use-cases: (1) lightweight, synchronous local invocation of functionality on a computational accelerator, (2) I/O to and from an intelligent network interface, and (3) hardware offload of stream processor operators. We survey related work in Section 6 and conclude with Section 7.

# 2 Background and Motivation

Our work in this paper is motivated by the interaction between modern trends in platform interconnect and also the changing ways in which these interconnects are used, particularly in data center and cloud servers.

**Interconnects and devices:** In early machines, the CPU interacted with peripherals exclusively using *PIO*: the device exposed MMIO registers, and the processor issued loads and stores directly to these registers to perform both data transfer and control of the device.

This model made sense when devices were extremely simple, particularly combined with device interrupts to obviate the need for the CPU to poll status registers on the device waiting for an I/O operation to complete. However, it still requires synchronous rendezvous between CPU and device, and moreover requires all data to be transferred via CPU registers (since the CPU is copying data between device registers and main memory).

DMA removed this limitation by allowing the device itself or a dedicated DMA controller to copy data while the CPU continues to execute other code, providing parallelism and partly decoupling CPU and device. Further decoupling is provided by *descriptor rings*: producer/consumer buffers of I/O requests in memory, read and written by both CPU and DMA-capable device. In this case, interrupts need only be used when the queue of descriptors becomes full or empty.
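The producer/consumer structure of a descriptor ring can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the layout of any particular device: the `Descriptor` fields, the class name, and the head/tail convention (one slot left unused to tell full from empty) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    addr: int    # buffer address the device should DMA to or from
    length: int  # transfer length in bytes


class DescriptorRing:
    """Fixed-size producer/consumer ring shared by a driver and a device."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # next slot the producer (CPU) fills
        self.tail = 0  # next slot the consumer (device) drains

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot stays unused so that full and empty are distinguishable.
        return (self.head + 1) % len(self.slots) == self.tail

    def post(self, desc):
        """CPU side: enqueue a request; returns False if the ring is full."""
        if self.is_full():
            return False
        self.slots[self.head] = desc
        self.head = (self.head + 1) % len(self.slots)
        return True

    def consume(self):
        """Device side: dequeue the oldest request, or None if empty."""
        if self.is_empty():
            return None
        desc = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return desc
```

In this model the only points at which the two sides must synchronize are the full and empty conditions, which is where a real device would fall back to interrupts.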

Almost every high-speed device today uses this technique, with minor variations such as the format of descriptor queues, where they are stored, and whether the queue head and tail are identified by registers or flags in the descriptors. This model is optimized for *throughput* – it is most efficient when a large volume of data must be transferred. On modern PCIe interconnects, this consensus results in impressive bandwidth for applications which transfer data in large chunks, up to 64GiB/s for PCIe 5.0 x16.

Surprisingly, increasing PCIe throughput over time has not been accompanied by a reduction in *latency*, which has remained roughly $1\mu s$ for an interconnect round-trip message exchange [25]. For the large transfers used in GPU-derived programming models (e.g., machine learning workloads), this latency penalty is insignificant when amortized over the transfer. Indeed, much software has adapted to this model imposed by the hardware [24].
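To make the amortization argument concrete, the sketch below uses a deliberately simple model that is an assumption of this example rather than a measurement: total transfer time is a fixed round-trip latency plus size divided by bandwidth. The defaults reuse the roughly $1\mu s$ round trip and the 64GiB/s PCIe 5.0 x16 figure quoted above; the function names are illustrative.

```python
def transfer_time_us(size_bytes, latency_us=1.0, bandwidth_gib_s=64.0):
    """Fixed round-trip latency plus serialization time, in microseconds."""
    bytes_per_us = bandwidth_gib_s * 2**30 / 1e6
    return latency_us + size_bytes / bytes_per_us


def latency_share(size_bytes, latency_us=1.0, bandwidth_gib_s=64.0):
    """Fraction of the total transfer time spent on the fixed latency."""
    total = transfer_time_us(size_bytes, latency_us, bandwidth_gib_s)
    return latency_us / total


# Compare a single cache line with progressively larger bulk transfers.
for size in (64, 4096, 2**20, 64 * 2**20):
    print(f"{size:>10} B: latency is {latency_share(size):.1%} of total time")
```

Under these assumptions the fixed latency accounts for almost all of the time of a 64-byte transfer but only a small fraction of a multi-megabyte one, which is the regime the descriptor-based model was designed for.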

However, recent trends have led to proposed *cache coherent* peripheral interconnects, and the extension of processor cache coherence to heterogeneous devices: accelerators, networking interfaces, and storage [4, 12, 18, 47, 54]. The more far-reaching are fully *symmetric*: devices are full participants in a distributed directory-based cache coherence protocol, as in CXL.mem 3.0 [18].

**Fine-grained workloads:** While throughput-oriented workloads involving data transfer between devices and cores (e.g., AI training and inference) are hugely important today, we step back in this paper and consider other workloads that might be a poor fit today for this model, because they involve fine-grained, frequent interaction between the CPU and another device. Many irregular workloads have this property [7, 22, 34, 61].

Moreover, data center messaging (essentially, RPCs) exchange many small messages [33, 37]. While end-to-end latency is mostly network propagation time in this case, latency incurred in the end system is a good proxy for key resources (such as CPU cycles) wasted before invoking the application code, and in marshaling and sending any reply.

All these workloads incur a significant latency penalty when using DMA for data transfer. Moreover, the assumptions which led to the common DMA descriptor-based model do not hold for these workloads. For example, CPU execution and I/O were decoupled using descriptor rings because CPU cycles were precious (there being few cores), and the CPU had plenty of different tasks to work on at any time.

This is no longer true. Modern servers now have 100s of cores [5] and when running data center applications they are often dedicated to a single application, sometimes with miscellaneous functions moved to a small subset [40]. Most of the cores have nothing to do except handle small messages coming from the network.

As a second example: for throughput-oriented workloads DMA has evolved to efficiently transfer data to and from main memory without polluting the CPU cache. However, for small, fine-grained interactions, it is important that almost all the data gets into the right CPU cache as quickly as possible.

All this led us to rethink the DMA-based model, and whether directly involving the CPU in data transfers to and from devices might be a better approach for these scenarios on modern hardware. We are not the first to suggest this direction for some cases [13, 36, 39, 51], but we push the question much further: given a future cache-coherent interconnect, what *new ways of using PIO* might be compelling for data center applications?

We are not proposing abandoning the DMA model, or some of its enhancements we survey in Section 6, for large-transfer throughput-oriented applications, which can keep riding the current hardware bandwidth curve. Instead, we argue it is time to examine complementary alternatives.

Figure 1. PCIe XDMA invocation latency comparison.

Figure 2. PCIe PIO invocation latency comparison.

# 3 Experimental platform

We investigate the modern trade-offs between PIO, DMA, and cache-coherent interconnects using a real hardware platform rather than a simulator. Simulation would be appropriate for detailed quantitative comparisons of known techniques, but in this paper we explore unconventional uses of cache coherence and so want to establish not only that our techniques are performant, but that they can practically be implemented in a real system.

The Enzian research computer [15] can be thought of as a two-socket NUMA server, where one socket houses a Marvell Cavium ThunderX-1 CPU, with a Xilinx XCVU9P Field Programmable Gate Array (FPGA) in the other. The CPU is a 48-core ARMv8 processor running at 2.0GHz, with 128GiB of 2133MT/s DDR4 memory spread across 4 memory controllers. Each core has a 32KiB, 32-way associative, write-through, physically indexed, physically tagged, L1 data cache. These are connected to a 16MiB, write-back, 16-way associative, shared L2 last-level cache. Hardware keeps the L1 caches coherent with the L2, and so in ARM architecture parlance the point of coherence is the L1 cache, whereas the point of unification is the L2 cache write-through. The cache-line size is 128 bytes – note that this is double the conventional 64-byte line size.

The CPU and FPGA are connected by the CPU's native inter-socket cache coherent protocol, the Enzian Coherence Interface (ECI), which Enzian implements on the FPGA. It is a MOESI-like directory protocol with the physical address space statically partitioned between the NUMA nodes (CPU and FPGA). It uses 24 bidirectional lanes (organized into 2 12-lane links), each running at about 10Gb/s, for a theoretical inter-socket bandwidth of about 30GiB/s.

Both nodes have PCIe interfaces: the ThunderX-1 has a PCIe Gen.3 x8 interface, while the FPGA has a PCIe Gen.3 x16 interface. In this work we connect these two interfaces together with a cable; this replicates a PCIe accelerator card and allows us to compare PCIe with ECI in our experiments.

While Enzian provides a real hardware platform with server-class performance, direct comparison with modern server platforms is not straightforward: the latter do not offer full cache coherence between CPU and devices, but have more recent, faster cores, memory systems, and PCIe.

We therefore calibrate both PIO and DMA over PCIe with benchmark comparisons of Enzian and a modern PC (Intel Core i7-7700 3.6GHz Kaby Lake, PCIe Gen.3 x16) with 64-byte cache lines connected to an AMD Virtex UltraScale+ VCU118 card, using the "write then read" experiment detailed in Section 5.1. The VCU118 uses the same FPGA as Enzian, albeit a slightly slower speed grade.

**DMA performance over PCIe:** The first experiment uses the Xilinx XDMA IP and its descriptor-based protocol to transfer data between CPU and FPGA over PCIe. Figure 1 shows results for various data transfer sizes, running the Xilinx driver in both interrupt-driven and polled modes.

XDMA over PCIe is about 3 times faster on the PC than on Enzian, while the difference between interrupt-driven and polling performance is much less significant. Single transaction latency is almost constant on both platforms up to the PCIe transaction size limit of 4KiB, and then increases.

The performance difference between the machines is due to several factors. For small transfers, the CPU overhead of descriptor setup dominates, and an x86 core is simply faster than a ThunderX-1 core. In addition, the memory system is somewhat faster on the PC. The factor of 2 difference in *bandwidth* between the two platforms (PCIe Gen3 x8 vs. x16) does not appear to be a factor in these experiments.

**PIO performance over PCIe:** In the second calibration experiment the CPU performs PIO reads and writes over PCIe to the FPGA. PCIe is complex, and we spent considerable time optimizing CPU-initiated reads and writes for different platforms. We describe here how we arrived at optimally-efficient PIO latency to the FPGA over PCIe.

We pre-map PCIe apertures (BARs) into user space on the CPU and measure latency as the time to write a value to this space (thereby *invoking* a function on the device) and read a result back (Figure 3) – this is basically uncached MMIO.

Figure 3. Invoking with PIO & PCIe; error handling omitted.
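The write-then-read pattern described above can be illustrated as follows. This is not the authors' listing: it is a minimal sketch in which an anonymous memory mapping stands in for the PCIe aperture, and the register offsets `REQ_OFFSET` and `RESP_OFFSET` as well as the function name `pio_invoke` are hypothetical. A real measurement would map the device's BAR with uncached attributes and use native code.

```python
import mmap
import struct

REQ_OFFSET = 0x000   # hypothetical offset of the request registers
RESP_OFFSET = 0x100  # hypothetical offset of the response registers


def pio_invoke(bar, request_words, n_response_words):
    """Write 64-bit request words into the aperture, then read the reply."""
    for i, word in enumerate(request_words):
        # Stores to the aperture: these would be posted writes on PCIe.
        struct.pack_into("<Q", bar, REQ_OFFSET + 8 * i, word)
    # Loads from the aperture: these would be non-posted reads on PCIe.
    return [
        struct.unpack_from("<Q", bar, RESP_OFFSET + 8 * i)[0]
        for i in range(n_response_words)
    ]


# Stand-in aperture: anonymous memory instead of a real device BAR.
bar = mmap.mmap(-1, 4096)
struct.pack_into("<Q", bar, RESP_OFFSET, 42)  # pretend the device produced 42
result = pio_invoke(bar, [1, 2, 3], 1)
```

The measured latency in this style of benchmark is the time from the first store to the completion of the last load.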

PCIe memory writes are *posted* transactions and may complete out-of-order, allowing multiple requests in-flight at the same time. Most systems also have dedicated bus units to combine writes and coalesce transactions. Enzian's ThunderX-1 performs *write-combining* for stores to PCIe, allowing 512 bits per bus round-trip (confirmed with a logic analyzer) and resulting in efficient PIO write transactions.

In comparison, PIO *reads* over PCIe show low latency for small data but degrade quickly as size increases. This is because PCIe memory reads are non-posted, forcing each read to finish before the next can start and incurring a round-trip time cost (about $1\mu s$) for each word. Many systems also have a narrow read bus between CPU and PCIe; the ThunderX-1 peripheral access bus is only 128 bits wide.
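A back-of-the-envelope model shows why serialized reads degrade so quickly. The sketch assumes, as a simplification, that every read moves one bus-width chunk (16 bytes for a 128-bit bus) and costs one full round trip of about $1\mu s$; the function name and defaults are illustrative, not measured values.

```python
import math


def pio_read_latency_us(size_bytes, bus_bytes=16, round_trip_us=1.0):
    """Serialized reads: one full round trip per bus-width chunk."""
    return math.ceil(size_bytes / bus_bytes) * round_trip_us


# Latency grows linearly with size because reads cannot overlap.
for size in (16, 64, 256, 1024):
    print(f"{size:>5} B -> {pio_read_latency_us(size):.0f} us")
```

Posted writes avoid this linear growth because several of them can be in flight at once, so only the first pays the full interconnect latency.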

This means that on ThunderX-1, common optimization techniques like vector instructions have negligible benefit for PIO over PCIe: the hardware already coalesces writes into 512-bit transactions and reads are limited by the read bus width of 128 bits.

x86 has more optimization opportunities for PIO reads and writes. The Intel Data Streaming Accelerator [28] offers MOVDIR64B and ENQCMD instructions, allowing 64-byte PIO writes with write-combining mappings and user-space descriptor submission without driver intervention. Mapping memory with *write-through* attributes (as in Tide [26]) allows cache line-sized PIO reads. These architecture-specific extensions are not available on ThunderX-1.

Figure 2 shows the best figures we could obtain on both platforms. For transaction sizes above 32 bytes, the PC is about twice as fast as Enzian over the PCIe cable between CPU and FPGA. Here, the difference in PCIe bandwidth is responsible for the latency difference, in particular for reads (limited to 128 bits on both platforms). Writes, in contrast, are combined and pipelined at the PCIe interface.

Note this benchmark permits reordering of writes, and therefore already reflects the performance achievable using the reorder unit proposed in [39].



Figure 4. FastForward-style coherent messaging

## 4 PIO over a coherent interconnect

While PIO over PCIe outperforms DMA on our experimental hardware over a range of payload sizes (consistent with the findings of other authors), the main contribution of this paper is to show the further benefits of using a cache-coherent interconnect. We present a family of message protocols, all of which run over an existing MESI-like coherence protocol, which substantially outperform conventional PIO and DMA.

**Fast CPU-CPU message passing.** Our starting point is a set of software protocols used to communicate over shared memory between CPUs in a cache-coherent system, including FastForward [20] and the NIC emulation used in CC-NIC [51], and exemplified in Figure 4. These target machines with directory-based coherence which supports direct cache-to-cache forwarding of lines without writebacks, e.g. MOESI [45].

In these protocols, a receiving core generally spins reading from a cache line (usually part of a larger array organized as a ring buffer), holding it in cache in Shared state. The sending core starts to write to the line, causing it to be fetched in Exclusive state and invalidated in the receiver's cache. When the receiving core next polls the cache line, it is fetched from the sending core and arrives back in Shared state on the receiver, where it is immediately loaded into CPU registers.

In practice, the sender generally writes the entire line at once (exploiting a write buffer), so that in the fast case the line is transferred in two round-trips over the interconnect, as shown in Figure 4. Techniques like including a "finished" flag in the cache line data can be used to handle cases where the line is transferred to the receiver before it has been completely written.
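The fast path just described can be captured in a toy model that tracks one cache line in two caches and counts interconnect round trips. This is an illustrative sketch, not the FastForward implementation: the class name, the state letters, and the choice to leave the sender in Shared after forwarding (it would be Owned under MOESI) are assumptions of the example.

```python
class SharedLine:
    """One cache line tracked in two caches, with a round-trip counter."""

    def __init__(self):
        self.state = {"sender": "I", "receiver": "S"}
        self.data = None
        self.round_trips = 0

    def write(self, payload):
        """Sender writes the whole line, taking exclusive ownership first."""
        if self.state["sender"] not in ("E", "M"):
            self.round_trips += 1        # fetch exclusive, invalidate receiver
            self.state["receiver"] = "I"
        self.state["sender"] = "M"
        self.data = payload

    def poll(self):
        """Receiver reads the line; a miss pulls it from the sender's cache."""
        if self.state["receiver"] == "I":
            self.round_trips += 1        # cache-to-cache forward
            self.state["sender"] = "S"   # would be Owned under MOESI
            self.state["receiver"] = "S"
        return self.data
```

Polling a line that is already Shared in the receiver's cache adds no round trips in this model, so one message costs exactly two: one for the sender to gain ownership and one for the receiver to fetch the result.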

These protocols have excellent performance, and on modern hardware achieve register-to-register latencies in the low hundreds of nanoseconds. They are easily parallelized to many cache lines (with a single polled flag for many lines of payload, plus the necessary barriers) to achieve excellent throughput. This basic technique is used, for example, by CC-NIC [51] to emulate a NIC using a different CPU socket. However, they have several limitations.

The first is that the CPUs must busy-wait polling memory. This does not generate interconnect traffic since the line being polled is in local cache, but does consume energy. Moreover, optimal performance depends on timing, and if the receiver polls too frequently, the line can be forwarded multiple times before it is complete, leading to wasted interconnect round-trips and increased latency. For these reasons, these protocols are rarely used for real-world Inter-Process Communication (IPC), even when it makes sense to dedicate cores to individual clients and servers for long periods.

**The implications of message-level access.** The key insight behind this paper is that a *device* on a symmetric coherent interconnect sees more cache state and events than are visible to software, and can also initiate more operations than a CPU core can. Moreover, this does not require a cache on the device at all (indeed, a cache is rarely appropriate): it is sufficient to generate and respond to individual coherence messages. We explain this in more detail.

First, the device now **receives messages** from the CPU cache that request lines in Shared or Exclusive state, or which request that the caching state on the device be downgraded (e.g., from Exclusive to Shared, or from Shared to Invalid). It is free to react internally to these events as it chooses, as long as it does not violate the protocol.

Second, the device can also **issue such requests** itself at any time, allowing more fine-grained control of the state of a line in the CPU's cache.

Third, for lines which are homed at the device, the device's directory maintains explicit **information about the cache line state** at the CPU and other nodes as well as locally.

Fourth, *unlike a conventional cache*, the device **does not have to respond immediately** to every request from the CPU's cache. Instead, it can choose to delay the response (blocking the requesting core) until some other operations complete. Care must be taken to respond before any hardware-imposed timeouts, but these are typically generous (hundreds of milliseconds on the ThunderX-1, for example) and can be worked around in software (e.g., by sending a "try again" response within the timeout).

Finally, the device can **choose to interpret events** like remote requests as more than simply loads or stores to the line concerned. This last insight is important because, in combination with the previous observation, it allows us to construct efficient communication mechanisms which rely on a combination of the device's richer view of the protocol and the software on the CPU adhering to additional conventions to communicate channel semantics to the device.

This means requests to specific cache lines can be used to signal particular higher-level operations to the device, much as register access to traditional devices frequently has very different semantics to memory access.

**CPU-device message passing with coherence.** We can now present a family of related point-to-point messaging protocols which can be built over a symmetric, directory-based cache coherence protocol where one end-point (the device) has direct access to the low-level cache messages.

Figure 5 shows three such protocols, all with the same broad structure, operating on a pair of cache lines which are homed at the device. This allows the device to precisely track states of the cache lines used and greatly simplifies the protocol. They vary according to whether data flows (a) from CPU to device, (b) from device to CPU, or (c) both ways.

Time is on the vertical axis. Under each end-point (CPU or device) are two columns, one for each cache line, giving its state over time. The notation E->M within a cell indicates the silent upgrade of a line from Exclusive to Modified, while the arrows between cells indicate state changes corresponding to visible interconnect messages.

The first and second protocols implement *uni-directional* message passing between the CPU and device, and are the basis for the Ethernet NIC presented in Section 5.2. The third option combines the two by packing a request and response into one invocation of the protocol (i.e. 2 round-trips) and is best suited for send-and-receive interactions like RPC. We demonstrate this in Section 5.1.

Each protocol requires just two interconnect round trips, beginning with the two lines in a defined quiescent state. At each point one of the lines (which we refer to as *A*) serves to transfer information from CPU to device, and the other (*B*) from device to CPU. Over the course of one transaction both lines migrate from one cache to the other, ending in the same (caching protocol) state in which the other began. The two then swap roles for the next iteration.

There is no actual cache on the device: any payload is delivered directly into the device's execution pipeline, or read from an output register, as appropriate. The following describes the execution of variant (c) in detail:

- 1. *B* starts Exclusive in the CPU cache, and *A* Invalid, with the (directory) states the opposite on the device. Software writes payload data from registers into *B*, which (silently) transitions to Modified.
- 2. To initiate the exchange, software reads from *A*, triggering a *load shared* message to the device. The device interprets this as a signal that *B* contains fresh data. This coupling of independent line states is the first deviation from 'standard' coherence, and is impossible to achieve purely in software. It is nevertheless entirely invisible to the CPU.
- 3. The device now stalls the CPU by not responding to the load of *A* until it has data to send. Instead, it requests *B* in Exclusive to fetch the payload, leaving it Invalid in the CPU's cache.



Figure 5. Protocol variants for efficient CPU-device messaging



Figure 6. Invocation latency over ECI vs. FastForward

- 4. The device uses the data in *B* to compute a result, which it returns as the payload when it finally responds to the CPU's load from *A*.
- 5. The CPU cache receives *A* and the CPU unblocks, immediately loading the first word of the response into a register.

This *would* lead to *A* in Shared in both nodes, since software requested *A* using a load. However, as an optimization, the device can return the line in Exclusive instead, invalidating its own copy, so the protocol returns to a quiescent state with the roles of the lines reversed.
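The five steps of variant (c), including the Exclusive-return optimization, can be traced with a small state model. This is a sketch for following the line states and the round-trip count, not the FPGA implementation: the class name `CoherentChannel`, the `handler` callback standing in for the device's computation, and the bookkeeping of roles are all hypothetical.

```python
class CoherentChannel:
    """Toy model of the bidirectional exchange (variant c) on lines A and B."""

    def __init__(self, handler):
        self.handler = handler                 # device-side function: request -> response
        self.cpu_state = {"A": "I", "B": "E"}  # quiescent state in the CPU cache
        self.send, self.recv = "B", "A"        # line roles for the next invocation
        self.round_trips = 0

    def invoke(self, request):
        # Step 1: software writes the payload; E -> M is silent (no message).
        self.cpu_state[self.send] = "M"
        # Step 2: software loads the other line; the request reaches the device.
        # This round trip is counted here and completes in step 5.
        self.round_trips += 1
        # Step 3: the device withholds its reply and pulls the payload line.
        self.round_trips += 1
        self.cpu_state[self.send] = "I"
        # Step 4: the device computes the result from the payload.
        response = self.handler(request)
        # Step 5: the device answers the stalled load, granting Exclusive.
        self.cpu_state[self.recv] = "E"
        # The two lines swap roles for the next invocation.
        self.send, self.recv = self.recv, self.send
        return response
```

After one invocation the CPU holds *A* in Exclusive and *B* in Invalid, which is the mirror image of the starting state, so the next call simply uses *A* for the request and *B* for the reply.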

This protocol either eliminates or minimizes the drawbacks of a software-only protocol: the ability to stall the CPU eliminates both the need to poll, and the race condition. By coupling and overlapping two coherence transactions we are able to transfer data in both directions with the same two round-trips. All data is transferred atomically so there is no risk of half a payload being shipped. This is the best achievable result with an unmodified coherence protocol, since no cache transaction both sends and receives a payload.

Figure 6 shows the distribution of total latency for this operation implemented on Enzian. Median latency is around 1600ns when the result is returned in Shared state as requested by the CPU ("ECI unopt"). When we return the line in Exclusive, this drops to around 900ns ("ECI").

For comparison, we also show an implementation of the FastForward protocol [20] exchanging cache lines between CPU sockets in a dual-socket ThunderX-1-based Gigabyte R150-T61 server with similar CPU and DDR specification to Enzian; this achieves a median latency of about 1750ns.

The optimized protocol is performing 2 round-trip message exchanges over ECI, which has a one-way latency at the link layer of about 150ns. The rest of the overhead (300ns) is mostly incurred in the protocol processing in the directory controller, and would be lower in an ASIC implementation. The FPGA is clocked at about 300MHz.
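Spelled out, the latency budget implied by these figures is the following, where each round trip is taken to be two one-way link traversals (a simplifying assumption that ignores any asymmetry between request and response paths):

$$
t_{\text{ECI}} \approx \underbrace{2 \times 2 \times 150\,\text{ns}}_{\text{two round trips on the link}} + \underbrace{300\,\text{ns}}_{\text{protocol processing}} = 900\,\text{ns}
$$

This matches the median of around 900ns reported for the optimized protocol.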

**Returning a line in Exclusive.** The ability for the device to unilaterally return a line in Exclusive delivers an important performance benefit. The ECI protocol, and the ThunderX-1 last-level cache, both support this. Interestingly, it is also proposed in the (as yet unimplemented) CXL.mem 3.0 standard [18], along with the "back invalidate" operation providing symmetric cache coherence. We suggest this is a useful feature for future protocols.

**Handling larger messages.** The protocols above handle one line at a time (128B on Enzian). They can be used repeatedly to transfer more data than this, but this is inefficient. Instead, we extend the protocols described so far using parallelism and pipelining to give much better throughput. This leads to two further protocols, which are the ones we evaluate in Section 5.

For the NIC example of Section 5.2, with predominantly unidirectional communication, we dedicate one pair of cache lines as *control* (as in Figure 5a and Figure 5b) and add *overflow cache lines* to carry data. When the protocol invalidates the control lines, it also invalidates these overflow cache lines *in parallel*, exploiting the full bandwidth of the coherent link. We use this technique to transfer packets larger than one cache line, up to an Ethernet jumbo frame (9kB).

We can do even better when the application has a symmetrical communication pattern, i.e. the CPU sends and receives similar amounts of data in one invocation. This is the case for the device invocation and Timely Dataflow experiments we show in Section 5.1 and Section 5.3. In this case, the better option is to prefetch multiple instances of the protocol shown in Figure 5c.

To transfer n cache lines at a time, we maintain two groups of n lines. The CPU first writes all arguments for the device to the lines in group 0, and then *prefetches* all cache lines in group 1. The prefetches trigger the device logic to invalidate all cache lines in group 0, fetching the arguments. The prefetches are issued in parallel by the CPU and again provide enough concurrency to saturate the link.

*Handling timeouts.* The protocols block a core's request for a cache line until a device operation has completed, so a slow operation risks exceeding the hardware timeout on that request. This timeout is typically quite long (often hundreds of milliseconds), but if it expires the CPU's cache is likely to raise an exception or another fault such as a machine check.

We solve this issue with a small state machine on the device that returns a "not ready yet" response before an impending timeout, causing the software on the CPU to issue another request for the other cache line. This extends the response time indefinitely without the need for spinning. In practice, we target operations that are shorter than the timeout.
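The retry mechanism can be sketched from the software side as a loop that alternates between the two lines until a real reply arrives. Everything here is a hypothetical model for illustration: the `NOT_READY` marker, the `SlowDevice` class, and the idea of counting "timeout windows" are assumptions, and in hardware each `load` would block for up to one timeout window rather than return immediately.

```python
NOT_READY = object()  # hypothetical marker for the device's "not ready yet" reply


class SlowDevice:
    """Device model that needs several timeout windows to finish."""

    def __init__(self, windows_needed, result):
        self.remaining = windows_needed
        self.result = result

    def load(self, line):
        """Answer a stalled load: the result if ready, else NOT_READY.

        The line name is accepted only to mirror the protocol; this model
        treats both lines identically.
        """
        self.remaining -= 1
        return self.result if self.remaining <= 0 else NOT_READY


def blocking_receive(device):
    """CPU side: alternate between the two lines until a real reply arrives."""
    lines = ("A", "B")
    attempts = 0
    while True:
        reply = device.load(lines[attempts % 2])
        attempts += 1
        if reply is not NOT_READY:
            return reply, attempts
```

Each iteration corresponds to one blocked load rather than a poll, so the core issues only one request per timeout window while it waits.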

**Avoiding deadlocks.** It is also important to prevent deadlocks due to the device stalling a request until it can issue, and get a reply back for, another request to a different cache line. Existing standards for inter-operable cache coherent interconnects are silent on what happens in this situation, possibly because it has not occurred until now. Our design assumes that transactions on different cache lines can progress independently.

In some implementations, this assumption holds in order to maximize the memory bandwidth utilization and minimize request latency, and also to simplify reasoning about deadlock freedom of the coherence protocol itself. However, it may be the case that the cache stripes transactions across a limited number of independent units which might deadlock if both A and B were mapped to the same unit. Were this to be the case, it could be avoided by careful placement of A and B in the physical address space.

This is actually the case in the Enzian FPGA directory controller implementation we use [49], which follows similar provisioning choices in the CPU. The ThunderX-1 divides its last-level cache functionality into a set of units called TADs, each of which can handle up to 16 simultaneous cache transactions. We ensure that consecutive cache lines are mapped to different TADs to ensure that the operations can proceed independently.
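Placement of the two lines can be illustrated with a small helper. The mapping from address to cache unit below is purely hypothetical (consecutive lines rotating over `NUM_UNITS` units); the real address-to-TAD function of the ThunderX-1 is not given in this text, and the unit count is an assumption chosen for the example.

```python
LINE_SIZE = 128  # bytes per cache line on Enzian
NUM_UNITS = 8    # hypothetical number of independent cache units


def unit_of(addr):
    """Hypothetical mapping: consecutive lines rotate across units."""
    return (addr // LINE_SIZE) % NUM_UNITS


def pick_line_pair(base):
    """Return line-aligned addresses for A and B that map to different units."""
    a = base - base % LINE_SIZE
    b = a + LINE_SIZE
    # Skip forward until B lands in a different unit from A.
    while unit_of(b) == unit_of(a):
        b += LINE_SIZE
    return a, b
```

With a rotating mapping like this one, adjacent lines already fall in different units, which is the property the placement described above relies on.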

Another possible source of deadlocks is that the CPU and L2 cache might issue requests out of order, especially in the case of using prefetches. This is mainly relevant for handling requests larger than one cache line, as we discussed above. The device implementation needs to be aware of possible reorderings and should not rely on the request ordering to perform state transitions, to avoid deadlocks. In our implementation, we always buffer possible in-flight cache lines on the FPGA and advance state machines based on the number of requests we see, without relying on their actual order.
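The idea of advancing on a count of buffered lines rather than on arrival order can be sketched as follows. This is an illustrative software model, not the FPGA logic: the class name `GroupTracker` and its methods are hypothetical.

```python
class GroupTracker:
    """Device-side tracker for one group of n cache lines."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.buffered = {}  # line index -> payload, filled in any order

    def on_request(self, index, payload):
        """Buffer one in-flight line; return True once the group is complete."""
        self.buffered[index] = payload
        # Progress depends only on how many distinct lines have arrived.
        return len(self.buffered) == self.n_lines

    def drain(self):
        """Return payloads in line order and reset for the next transfer."""
        ordered = [self.buffered[i] for i in range(self.n_lines)]
        self.buffered = {}
        return ordered
```

Because completion is decided by the number of distinct lines seen, any permutation of the same requests drives the tracker to the same final state.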

Correctness and interoperability. An obvious concern given the complexity of underlying cache coherence protocols is the correctness of the implementation. An error in the device might cause the state machine in the CPU's cache to enter a state from which it cannot recover.

Worse, while our implementation relies on an existing, and well-tested, directory controller implementation on the FPGA, we are using the ECI protocol in unconventional ways unforeseen by the designers of the CPU and its cache.

In practice, after extensive testing we have not seen protocol-related problems even under heavy load. Moreover, we are reassured by recent work on exhaustive semi-open-state testing of coherence protocols [52] which can formally define the state space of a subset of a coherence protocol (which can nevertheless include the kinds of behaviors we describe here), and then exhaustively test a hardware implementation against that formal model.

*Enzian-specific implementation issues.* Our implementations, which will all be made available as open source, use a combination of SystemVerilog and SpinalHDL [16], together with a version of the existing Enzian FPGA directory controller [49].

The ARMv8-A weak memory model means that barriers are needed to reliably order reads and writes. For example, it is critical that the core's write buffer is drained into L1 cache (which is write-through to the L2) before a subsequent load from the core signals to the FPGA that the line can be pulled, and this requires a DMB barrier instruction between the writes and the read. On the ThunderX-1 this is sufficient to ensure ordering observed at the FPGA.

*Generality.* If we had only simulated the protocols in Figure 5 rather than implement them in a hardware platform, it seems unlikely that we could have become aware of the numerous subtleties and challenges that real coherent interconnects present. However, this raises generality concerns, since our implementation is tied to a single platform.

In principle, all the protocols we have described here could be implemented over any coherent interconnect which uses a symmetric, MOESI-like, distributed directory-based model of coherence, and where the interconnect hardware exposes major state transitions to the device-specific logic which serves as the end-point for messaging.

Several current and future standards satisfy these conditions, for example CCIX [12], TileLink [8], and CXL.mem 3.0 [18], although notably neither CXL 2.0 nor CXL.cache 3.0. We are also seeing something of a boom in custom interconnects from a range of vendors. By making the requirements for this fine-grained, low-latency communication clear, future designers can take this into account when designing new interconnects and parts.

**Figure 7.** Invocation latency for different payload sizes

# 5 Evaluation

We compare coherent PIO with our two other reference points: optimized PIO over PCIe described in the previous section, and descriptor-based DMA over PCIe. All use the same Enzian platform. Coherent PIO uses the Enzian native ECI implementation, while the PCIe approaches use a loop-back cable between PCIe interfaces on the CPU and FPGA, effectively turning the Enzian FPGA into a conventional PCIe FPGA card. DMA experiments use the Xilinx XDMA IP.

We explore three different application scenarios: synchronous invocation to an accelerator on the FPGA, a closely-coupled high-speed NIC, and a stream processor which moves some of its dataflow operator graph to the FPGA.

With the exception of the NIC case, the results show very little variance. All graphs show 95% confidence intervals, but these are typically not visible.

#### 5.1 Accelerator invocation

We first explore the scenario where the CPU invokes a function on the device configured as an accelerator, using the three communication approaches to send argument data and receive result data. In this case, the FPGA is configured to map the invocation into a write to, followed by a read from, onboard Block RAM.

For DMA numbers we use the Xilinx XDMA benchmark [60], which performs the operation using descriptor-based DMA. We implement coherent PIO with the protocol shown in Figure 5c to send a request and receive a response in two round-trips.

**Figure 8.** Invocation throughput for different payload sizes

Latency comparison. We first compare the latency for the CPU to complete a function invocation on the device, measured as the time taken for the CPU to send a request to the FPGA and read a response. We vary the payload size (of both request and response) and report median latency figures.

The results are shown in Figure 7. Up to payloads of 8KiB, the latency of DMA-based communication is dominated by descriptor setup and manipulation, and remains largely independent of payload size (upper scale). Faster PCIe interconnects or CXL would only amplify this effect.

Latency over ECI is dramatically lower across the board. It is constant up to 256 bytes (two Enzian cache lines) and then increases slowly, but linearly. This is due to the pipelined nature of the ECI PIO protocol: before we saturate the links, the incremental latency induced by each extra line is small.

Relative to ECI, PIO over PCIe performs poorly for any payload more than 16 bytes, due to the inherently-limited PIO read bandwidth discussed in Section 3, but competes well with DMA up to 2KiB payload.

We conclude that, for almost all transfers up to and beyond 8 KiB, coherent PIO is significantly lower latency than both the DMA option commonly used for these transaction sizes, and PIO over PCIe. More PCIe bandwidth (e.g. PCIe Gen5) will not change the PIO results significantly without support for more efficient MMIO reads, since PCIe round-trip latency remains the same. Neither would it improve DMA latency, which is dominated by descriptor overhead. In contrast, the lower interconnect latency available with newer CXL versions *would* improve things, but would also deliver the same benefit to the coherent PIO case.

*Throughput comparison.* We also show throughput figures, omitting PIO over PCIe since its performance (as seen above) is not practical for larger transfers. We measure how many back-to-back invocations one CPU core can make in 1 ms, using different transaction sizes. We calculate throughput as the total data volume transferred divided by the time elapsed, and report the median value in Figure 8.
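The throughput calculation can be written out as a short sketch. The function names are hypothetical, and counting the payload once per invocation is an assumption here; the text only specifies total data volume divided by elapsed time, with the median reported.

```python
from statistics import median


def throughput_gib_per_s(payload_bytes: int, invocations: int, elapsed_s: float) -> float:
    """Total data volume divided by elapsed time, in GiB/s."""
    return payload_bytes * invocations / elapsed_s / 2**30


def median_throughput(runs) -> float:
    """runs: iterable of (payload_bytes, invocations, elapsed_s) tuples."""
    return median(throughput_gib_per_s(*run) for run in runs)
```

For example, 1024 invocations with a 1 MiB payload completed in one second correspond to exactly 1 GiB/s under this definition.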

ECI PIO throughput increases with transaction size up to 32 KiB payloads, reaching a peak of 2.19 GiB/s, and then drops slightly due to thrashing in the 32 KiB L1 data cache. It performs better than PCIe DMA over all transaction sizes presented, and by a comfortable margin. At 64 KiB, we are still far from saturating throughput for PCIe DMA; it is well known that very large transaction sizes are needed to achieve maximum throughput.

In conclusion, ECI PIO shows a clear throughput advantage over PCIe DMA at multiple page sizes (up to 64 KiB) for device invocations, to the extent that the determining factor is less the throughput achieved and more the number of CPU cycles or cache size available.

As with the latency, this is unlikely to change with faster PCIe interconnects, and CXL would likely result in improvements across the board, including for the coherent PIO.

The results also show that there is an *optimal* native transaction size for coherent PIO, in this case the L1 cache size: larger transfers should be broken down into smaller transactions of optimal size to achieve maximum throughput.
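Breaking a large transfer into transactions of the optimal size can be sketched as follows. `split_transfer` is a hypothetical helper, and the 32 KiB default reflects the L1 data cache size reported above rather than a universal constant.

```python
def split_transfer(total_bytes: int, chunk_bytes: int = 32 * 1024):
    """Yield (offset, length) pairs covering total_bytes in chunk-sized pieces."""
    for offset in range(0, total_bytes, chunk_bytes):
        yield offset, min(chunk_bytes, total_bytes - offset)
```

An 80 KiB transfer would then be issued as two full 32 KiB transactions followed by one 16 KiB transaction.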

#### 5.2 Network interface adapter

We now compare approaches in the context of data center networking. Latencies of less than 1 μs are seen in data center packet switching and delivery [21, 23, 31], motivating minimizing latency between the network and CPU registers [19]. We show that ECI PIO is a good fit for such scenarios.

We implement 3 variations of a 100 Gb/s NIC in Enzian. The first (Figure 9a) is a conventional approach connecting one of Enzian's 100 Gb/s Ethernet MACs directly to the XDMA engines, resulting in a NIC programmed using descriptor rings. The second design (Figure 9b) buffers packets in onboard SRAM on the FPGA and allows the CPU to read and write packet and control data using PCIe reads and writes. The final approach (Figure 9c) uses a module on the FPGA to bridge between the ECI directory controller and the MAC. A variant of the protocol in Figure 5b delivers network packets to the CPU in multiple cache lines, and Figure 5a is used to send a packet. The NIC logic is clocked at 250 MHz.

For experiments we configure the Ethernet MAC in nearend PCS/PMA loopback mode to reliably measure packet delivery *inside* the server without network delays. We define *receive latency* as time from the last beat of the packet appearing on the ingress streaming interface of the MAC, to the CPU fully receiving the packet contents in its registers. Transmit latency is similarly defined as time from the CPU having the packet ready in registers, to the last beat of the packet appearing on the egress streaming interface on the MAC. We do not include the time for the packet to go through the Ethernet PCS/PMA loopback, since the MAC is fixed and the same in all implementations presented.

**Table 1.** NIC implementations: latency percentiles on selected packet sizes. Higher tail latencies are marked in **bold**.

| Size (bytes) | Percentiles RX (μs)   |        |        |        | Percentiles TX (μs) |       |       |       |
|----------|---------------------------|--------|--------|--------|---------------------|-------|-------|-------|
|          | P50                       | P95    | P99    | P100   | P50                 | P95   | P99   | P100  |
| PCIe DMA |                           |        |        |        |                     |       |       |       |
| 64       | 65.39                     | 66.36  | 67.65  | 100.01 | 10.06               | 10.35 | 10.59 | 16.49 |
| 1536     | 64.77                     | 65.59  | 66.21  | 133.84 | 10.89               | 11.12 | 11.33 | 30.84 |
| 9600     | 65.89                     | 67.12  | 68.05  | 123.61 | 15.73               | 16.05 | 16.29 | 41.99 |
| PCIe PIO |                           |        |        |        |                     |       |       |       |
| 64       | 3.25                      | 3.26   | 3.27   | 3.39   | 0.34                | 0.36  | 0.38  | 4.80  |
| 1536     | 72.89                     | 72.93  | 72.96  | 73.05  | 1.82                | 1.84  | 1.86  | 6.40  |
| 9600     | 450.28                    | 450.35 | 450.38 | 451.10 | 9.91                | 10.01 | 10.07 | 10.14 |
| ECI PIO  |                           |        |        |        |                     |       |       |       |
| 64       | 1.05                      | 1.06   | 1.07   | 1.17   | 1.06                | 1.12  | 1.14  | 1.18  |
| 1536     | 7.24                      | 7.29   | 7.39   | 7.43   | 3.09                | 3.26  | 3.50  | 3.59  |
| 9600     | 39.43                     | 39.48  | 39.50  | 39.55  | 9.07                | 9.19  | 9.65  | 9.95  |

To minimize disruptions from the Linux scheduler, we follow the common practice of freeing one CPU core from kernel tasks and IRQ processing [1], using the isolcpus kernel parameter together with core pinning using taskset. In this setup, the only disturbance from Linux would be the 250 Hz timer interrupt present on all cores. While a custom, completely *tickless* kernel would remove this disturbance, it is not commonly deployed in production environments due to performance complications. We use such a tickless kernel to measure tail latency below, but other experiments use a stock generic kernel from Ubuntu. We show figures for XDMA in both polled and interrupt-driven modes; in practice the difference in latency is small.

Figure 10 shows receive latency for different packet sizes for each NIC implementation. As before, DMA latency is stable across all packet sizes, suggesting that it is dominated by various DMA overheads, such as descriptor setup and manipulation. Note that system calls here actually result in little overhead (only a few μs); the overhead is dominated by cache misses manipulating DMA data structures in RAM.

In contrast, PIO over both ECI and PCIe offers much lower latency than DMA for small packets of up to 1024 bytes, highlighting the efficiency gains of PIO compared to DMA. However, the latency of PCIe PIO quickly degrades for larger packets due to the *non-posted read* problem discussed in Section 3, reaching over 350 μs for 8 KiB packets. ECI, by contrast, offers dramatically better latency even at this size.

Figure 10 also shows transmit latency. PCIe PIO competes well with ECI across all packet sizes due to combining posted writes. PIO over ECI has slightly higher latency for small packets, since it needs 2 round-trips over ECI for each 128 bytes, versus a single PCIe round trip. Both PIO solutions perform significantly better than DMA across all packet sizes.







**(a)** PCIe XDMA: packets transferred using descriptor-based DMA.

**(b)** PCIe PIO: CPU reads descriptors and data from FPGA SRAM.

**(c)** ECI PIO: CPU interacts with PIO interface and SRAM through the L2 cache.

**Figure 9.** Different NIC architectures. Blue denotes data movers; red denotes memory that holds packet data.

**Figure 10.** Latency of NIC implementations

Table 1 shows the tail latency impact for selected packet sizes for each NIC implementation. We select three representative packet sizes to evaluate tail latency: 64 for the smallest possible packet, 1536 for the normal Ethernet MTU, and 9600 for a common jumbo frame MTU. We use the *tickless* kernel setup mentioned earlier to avoid the 250 Hz Linux timer, in order to reveal the actual tail latency impact from our NIC.
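For reference, the percentile columns in Table 1 can be derived from raw latency samples along the following lines. This sketch uses the nearest-rank definition, which is an assumption: the text does not state which percentile method was used, and the function names are hypothetical.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_summary(samples):
    """Return the four percentiles reported per row (P100 is the maximum)."""
    return {f"P{p}": percentile(samples, p) for p in (50, 95, 99, 100)}
```

With this definition P100 is simply the worst observed latency, which is why a single timer interrupt during a run is enough to inflate that column.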

PIO over PCIe results in low tail latency compared to the descriptor-based approach, while ECI *completely eliminates* tail latency in this scenario.

These results show clear benefits of using PIO vs. DMA, even for large packets, and especially when using a coherent interconnect. The poor receive performance of PCIe-based PIO is due to the non-posted read limitation of PCIe, and disappears for transmit. This advantage should translate to future CXL 3.0 and PCIe Gen6 NICs.

#### 5.3 Offloading Timely Dataflow

Finally, we look at hardware offload of operators in a real-time stream processing system: Timely Dataflow [41, 44]. Timely schedules operators in a user-defined dataflow graph to evaluate complex functions over data streams, operating in variable-sized batches. Operators are independent and may be arbitrarily complex, thus benefiting from hardware offloading where the performance improvement outweighs the overhead of shipping data to the accelerator.

**Figure 11.** Timely synthetic filter graph offload

Timely also illustrates the flexibility of the ECI PIO protocol: our offloading implementation uses the protocol in Section 4 *synchronously* to exchange statistics between FPGA and CPU before and after processing a batch, and operates *asynchronously* for processing data in the batch in parallel.

We modify the latest Rust implementation of Timely [41] to include the ability to partition its dataflow graph between software operators and those implemented on the FPGA, with automatic insertion of communication channels between the CPU and FPGA to transfer both data and progress tracking information across the hardware/software boundary. We perform two experiments: an extreme synthetic benchmark with little actual computation, and real operators in the form of offloaded Bloom filters.

Communication-only graph. The synthetic benchmark uses a 31-element pipeline graph of trivial filter operators (filters). This large operator graph maximizes communication overhead in the form of Timely progress tracking data, while the simple operators mean the FPGA has little performance advantage (if any) over the CPU, mimicking cases where small batch sizes are needed to ensure data freshness, as in Differential Dataflow [42].

The first (blue) bars of Figure 11 show the baseline performance of the non-offloaded (CPU-only) system against batch size. The remaining bars cover the three offload cases.

ECI PIO batch latency is lower than both PIO and DMA over PCIe for all batch sizes, even a single cache line (128B). Moreover, PIO over ECI is the only technique that delivers lower latency than the software-only Rust implementation, even in this near-worst-case scenario for hardware offload.

**Bloom filters.** In our second Timely experiment we offload Bloom filters [9] to the FPGA, replicating the application scenario used to evaluate Fleet [57]. Bloom filters are used to efficiently check whether an element is a member of a set, using little memory. A Bloom filter consists of k different hash functions and a lookup table. Testing the presence of an element requires computing these k hash functions and querying the lookup table with the results.
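The structure just described can be sketched in a few lines. The hash functions below are hypothetical byte-wise hashes built from shifts, additions, and XORs with different seeds; they are illustrative only and are not the functions used in Fleet or in our FPGA implementation.

```python
MASK64 = (1 << 64) - 1


class BloomFilter:
    """Minimal Bloom filter: k hash functions plus a bit-array lookup table."""

    def __init__(self, num_bits: int, k: int):
        self.num_bits = num_bits
        self.k = k
        self.table = bytearray((num_bits + 7) // 8)

    def hashes(self, element: bytes):
        """Return k 64-bit hashes of the element (hypothetical hash family)."""
        out = []
        for seed in range(1, self.k + 1):
            h = (seed * 0x9E3779B97F4A7C15) & MASK64
            for byte in element:
                h = ((h << 5) + h) & MASK64  # multiply by 33 via shift and add
                h ^= byte
            out.append(h)
        return out

    def _positions(self, element: bytes):
        return [h % self.num_bits for h in self.hashes(element)]

    def add(self, element: bytes) -> None:
        for pos in self._positions(element):
            self.table[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, element: bytes) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all(self.table[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(element))
```

In the offloaded setting, only the `hashes` step would run on the device, with the table lookup performed wherever the lookup table resides.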

In the Fleet C++ software-only baseline, the hashes were computed in parallel using AVX instructions, processing the bytes of each element sequentially. We re-implemented this in Rust as a Timely operator using Arm SIMD instructions. On the FPGA side, the hash computations are parallelized and iteration over the element's bytes is pipelined. We measure a single CPU thread and compare it with a single offloaded operator.

We implement the Bloom filter on the FPGA to take 128-byte elements with k=8 hash functions that take byte-wide inputs. We unroll the hash function calculations for each byte lane by a factor of 2, resulting in a 64-cycle latency for each element. We then pipeline the calculations with an initiation interval of 2 cycles to saturate the 512-bit bus. The return value is the set of eight 64-bit hashes for each element.
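The figures in this paragraph can be checked with simple arithmetic. The sketch below assumes that unrolling by a factor of 2 means each hash consumes two bytes per cycle; the constant and function names are hypothetical.

```python
ELEMENT_BYTES = 128        # element size, from the text
BYTES_PER_CYCLE = 2        # assumed meaning of "unroll by a factor of 2"
INITIATION_INTERVAL = 2    # cycles between successive elements, from the text
K_HASHES = 8               # number of hash functions, from the text
HASH_BITS = 64             # width of each returned hash, from the text


def pipeline_figures():
    """Derive per-element latency and bus usage from the parameters above."""
    return {
        "latency_cycles": ELEMENT_BYTES // BYTES_PER_CYCLE,
        "input_bits_per_cycle": ELEMENT_BYTES * 8 // INITIATION_INTERVAL,
        "result_bits_per_element": K_HASHES * HASH_BITS,
    }


print(pipeline_figures())
```

Accepting one 1024-bit element every 2 cycles consumes 512 bits per cycle, which matches the bus width, and the eight 64-bit hashes returned per element likewise total 512 bits.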

In contrast to the synthetic benchmark above, this scenario has much lower progress tracking overhead (only a single operator is offloaded, compared with an entire subgraph) and much higher computation cost (the Bloom filters themselves). The k hash functions consist of sequential bit shifting, addition, and XOR-ing, which are fast in FPGA logic and deliver the offloading benefit even for small batches.

Instead of directly returning the hash results, the FPGA could have computed the lookup table positions or performed the lookup itself, returning either k indices or binary values. In the case of indices, the read-to-write ratio would stay the same as in our approach, but for the lookup, the read requests would be minimized and writes would dominate. In this experiment, we assume that the lookup table is located elsewhere and focus on efficiently calculating the hash function and transferring the results to the lookup table.

Figure 12 shows the results. Latencies for both PIO solutions and software (CPU) all start around 25 μs, due to the high overhead of streaming the input data. The per-element processing time is 2.6 μs on the CPU and 1.7 μs when offloading with ECI. As the batch size (and hence processing time) increases, the advantage of hardware offload increases, but so does the cost of PIO over PCIe. DMA latency barely increases, but remains significantly higher than that for ECI.

**Figure 12.** Offloading Bloom filters to the FPGA

As with the previous experiments, this shows the advantages of the efficient, low-latency communication between CPU and device that is possible using a cache-coherent interconnect, over and above both PIO and DMA over an interconnect like PCIe. We anticipate that a future implementation of CXL.mem 3.0, which similarly exploited cache coherence at the message level, would deliver similar benefits.

# 6 Related work

Cho *et al.* [14] provide an excellent survey of existing CPU-device communication mechanisms, and address high-speed storage devices with microsecond latencies. They question the conventional wisdom that *hiding* the latency of interacting with such devices is not possible using existing techniques, and show how access latency using existing descriptor-based protocols can be effectively hidden by clever use of prefetching, better hardware queues, and user-level context switching. Our work is different (and complementary) in reducing the actual latency of each message between CPU registers and device memory.

#### 6.1 Improving descriptor-based DMA

Cohort [58] proposes a single, uniform, single-producer, single-consumer, queue-based DMA descriptor interface to all accelerators on a System-on-Chip (SoC) which replaces the variety of ad-hoc interfaces found on most SoCs. Cohort mandates an MMU on each accelerator, which maintains consistency with CPU MMUs and allows seamless user-space access to accelerator queues, and integrates with the cache coherence protocol to optimize the use of lock-free descriptor queues. We focus instead on what a latency-optimized interface might look like, eschewing descriptors in favor of direct cache line transfers.

Ensō [50] shows the benefits of replacing the packet-based DMA software interface used by modern NICs with one based on a contiguous large buffer of opaque, variable-sized messages, with metadata sent via a different completion buffer. Ensō replaces descriptors with this buffer using an ingenious combination of direct MMIO writes to the PCIe device and polling memory queues written by the NIC. Our work similarly tries to reduce the overhead of descriptor management, but by focusing solely on latency-sensitive small payloads and avoiding DMA access completely.

#### 6.2 Coherent interconnects

Coherent replacements or extensions of existing device interconnects have been under development for some time and are beginning to see widespread availability. IBM's OpenCAPI [54] builds on PCIe by adding a protocol layer for coherence. It and the competing Gen-Z [35] and CCIX [12] protocols have meanwhile been either merged into or replaced by CXL [38], which seems likely to be the standard interoperable protocol in the short- to medium-term future. However, commercial CXL 3.0 hardware implementing the more recent revisions has yet to appear.

Other notable coherent interconnects include the RISC-V-specific TileLink [8, 17, 56] and NVIDIA's NVLink 2.0 [47, 59]. TileLink is an open standard and suitable for research but hasn't yet been implemented in server-scale hardware. NVLink is proprietary and closed.

Existing work has built on the available research platforms to explore the design space for practical applications. Centaur [48] demonstrated the FPGA offload of database operations on Intel HARP v1. However, while Intel HARP [27, 29] coupled a CPU and FPGA using a coherent QPI link, it provided a pre-configured coherent cache in the FPGA rather than user access to coherence protocol messages. In contrast, this access is readily available in Enzian [15].

Open, extensible protocols have been used to explore coherent offload in simulation, including making the case for protocol specialization building on the Spandex protocol family [2, 3]. CoNDA [10] likewise employed simulation to explore the design space for coherent device interconnects, with a focus on reducing unnecessary message traffic.

The Denovo protocol [53] is presented as an improvement specifically for CPU-GPU coherence, tailoring the protocol to match the comparatively predictable access patterns of typical GPU workloads.

#### 6.3 Cache line-based communication

Communication in software between peer nodes in a cache-coherent system is typically very different from using descriptor-based DMA. FastForward [20], Barrelfish [6], Concord [30], and Shinjuku [32] adopt a much more direct approach to sending small messages with low latency, e.g., by exploiting the cache coherence protocol to transfer lines between caches on demand, and local polling to provide synchronization. Recently, cache coherence between NUMA CPUs has been used to *simulate* communicating between software and a hypothetical cache-coherent NIC [51].

Other work [55] has already noted that the cache line is a better unit of transfer for small operations, as in the FastForward [20] and Barrelfish [6] protocols. More recently Concord [30] communicates scheduling decisions between workers and the dispatcher via a polled cache line, converting worker threads from interrupt-driven to poll-mode "CPU drivers". A similar technique is used in Shinjuku [32].

A recent study of data center RPC from Google [33] reinforces the importance of small transfers and highlights the large latency of PCIe transactions.

HyperPlane [43] observes that I/O stacks in cloud data centers frequently resort to spin polling the many descriptor queues provided by modern NIC hardware, and proposes a new QWAIT hardware instruction which leverages the processor's cache mechanism (much like MWAIT) to watch many different in-memory descriptor queues at once, providing hardware acceleration for select() or epoll()-like operations. Evaluation is performed using the Gem5 simulator, although this implementation is not available. We take a different approach based on transferring data directly in cache lines without descriptors.

#### 6.4 Exploring PIO

Neugebauer *et al.* [46] provide a comprehensive analysis of PCIe performance in the context of NICs. The space where PIO is preferable to DMA for PCIe has been explored before. The hXDP [11] FPGA NIC work highlighted the high overhead of small PCIe transactions, and consequently performs small batch computations solely on the CPU.

Larsen and Lee [36] demonstrate the benefit of write combining for PCIe-based PIO. Compared to a traditional DMA NIC, their FPGA-based prototype shows better latency and throughput for small-medium-sized messages and comparable throughput for large messages. The ThunderX-1 automatically performs combining for PCIe writes, and the results of Section 5.1 thus reflect this optimization. Current work [39] proposes a write-reordering unit in the device to simplify the use of fast PCIe posted writes, and uses Ensō [50] DMA to transfer data from the device.

Dagger [37] builds on CCI-P, the commercialized implementation of HARP's coherent cache, to construct an FPGA-based NIC specialized for low latency RPC. A host-coherent cache holds connection states and the necessary structures for the transport layer on the NIC, while the payload remains in host memory. This minimizes FPGA memory demands, and exploits the lower latency of cache misses compared to conventional PCIe-based NICs.

# 7 Conclusion and Future Work

It is important to regularly question long-held assumptions about systems, particularly in the light of new technological trade-offs and emerging workloads.

We show that modern workloads can benefit from techniques that optimize for small, frequent, low-latency interactions between CPU cores and devices as much as the more traditional large, throughput-oriented interactions for which current PCIe interconnects and DMA are optimized.

Moreover, the use of cache coherent interconnects between CPU cores and devices enables new techniques with this property, *particularly* when the device is able to participate in a standard cache coherence protocol *at the level of individual cache transitions* rather than naive coherence.

We have demonstrated the potential of these techniques by showing a family of messaging protocols implemented over Enzian's cache coherent interconnect, but more importantly this opens up a wide design space for the use of future coherent protocols like CXL.mem 3.0, which offers many possibilities for more efficient fine-grained communication within and between machines.

#### References

- [1] Hakan Akkan, Michael Lang, and Lorie M. Liebrock. 2012. Stepping towards noiseless Linux environment. In Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers. ACM, Venice Italy, 1–7. doi:10.1145/2318916.2318925
- [2] Johnathan Alsop, Weon Taek Na, Matthew D. Sinclair, Samuel Grayson, and Sarita Adve. 2022. A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Trans. Archit. Code Optim. 19, 3, Article 41 (aug 2022), 26 pages. doi:10.1145/3530819
- [3] Johnathan Alsop, Matthew D. Sinclair, and Sarita V. Adve. 2018. Spandex: a flexible interface for efficient heterogeneous coherence. In Proceedings of the 45th Annual International Symposium on Computer Architecture (Los Angeles, California) (ISCA '18). IEEE Press, 261–274. doi:10.1109/ISCA.2018.00031
- [4] AMBA® CHI Architecture Specification 2024. https://developer.arm.com/documentation/ihi0050/latest/
- [5] Ampere Computing. 2024. AmpereOne® 64-Bit Multi-Core Processors. https://amperecomputing.com/briefs/ampereone-family-product-brief.
- [6] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (Big Sky, Montana, USA) (SOSP '09). ACM, New York, NY, USA, 29–44. doi:10.1145/1629575.1629579
- [7] Andrew Bean. 2016. Improving memory access performance for irregular algorithms in heterogeneous CPU/FPGA systems. (Jan. 2016). doi:10.25560/41981 Accepted: 2016-10-25T15:31:13Z Publisher: Imperial College London.
- [8] Berkeley Architecture Research. 2022. TileLink. https://bar.eecs.berkeley.edu/projects/tilelink.html
- [9] Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. *Commun. ACM* 13, 7 (July 1970), 422–426. doi:10.1145/362686.362692
- [10] Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu. 2019. CoNDA: efficient cache coherence support for near-data accelerators. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA '19). Association for Computing Machinery, New York, NY, USA, 629-642. doi:10.1145/3307650.3322266

- [11] Marco Spaziani Brunella, Giacomo Belocchi, Marco Bonola, Salvatore Pontarelli, Giuseppe Siracusano, Giuseppe Bianchi, Aniello Cammarano, Alessandro Palumbo, Luca Petrucci, and Roberto Bifulco. 2020. hXDP: Efficient Software Packet Processing on FPGA NICs. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 973–990. https://www.usenix.org/conference/osdi20/presentation/brunella
- [12] CCIX Consortium and others. 2024. Cache Coherent Interconnect for Accelerators (CCIX). http://www.ccixconsortium.com
- [13] Mahesh Chaudhari, Kedar Kulkarni, Shreeya Badhe, and Vandana Inamdar. 2017. Evaluating Effect of Write Combining on PCIe Throughput to Improve HPC Interconnect Performance. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, Honolulu, HI, USA, 639–640. doi:10.1109/CLUSTER.2017.109
- [14] Shenghsun Cho, Amoghavarsha Suresh, Tapti Palit, Michael Ferdman, and Nima Honarmand. 2018. Taming the killer microsecond. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (Fukuoka, Japan) (MICRO-51). IEEE Press, 627–640. doi:10.1109/MICRO.2018.00057
- [15] David Cock, Abishek Ramdas, Daniel Schwyn, Michael Giardino, Adam Turowski, Zhenhao He, Nora Hossle, Dario Korolija, Melissa Licciardello, Kristina Martsenko, Reto Achermann, Gustavo Alonso, and Timothy Roscoe. 2022. Enzian: an open, general CPU/FPGA platform for systems software research. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS 2022). Association for Computing Machinery, New York, NY, USA, 590–607. doi:10.1145/3503222.3507742
- [16] SpinalHDL contributors. 2024. SpinalHDL. https://github.com/SpinalHDL/SpinalHDL.
- [17] Henry Cook, Wesley Terpstra, and Yunsup Lee. 2017. Diplomatic Design Patterns: A TileLink Case Study. In First Workshop on Computer Architecture Research with RISC-V (CARRV 2017).
- [18] CXL Consortium. 2024. Compute Express Link. https://www.computeexpresslink.org/
- [19] Geetanjali Gadre, Shreeya Badhe, and Kedar Kulkarni. 2016. Network processor–A simplified approach for transport layer offloading on NIC. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, Jaipur, India, 2542–2548. doi:10.1109/ICACCI.2016.7732440
- [20] John Giacomoni, Tipp Moseley, and Manish Vachharajani. 2008. Fast-Forward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Salt Lake City, UT, USA) (PPoPP '08). Association for Computing Machinery, New York, NY, USA, 43–52. doi:10.1145/1345206.1345215
- [21] Dan Gibson, Hema Hariharan, Eric Lance, Moray McLaren, Behnam Montazeri, Arjun Singh, Stephen Wang, Hassan M. G. Wassel, Zhehua Wu, Sunghwan Yoo, Raghuraman Balasubramanian, Prashant Chandra, Michael Cutforth, Peter Cuy, David Decotigny, Rakesh Gautam, Alex Iriza, Milo M. K. Martin, Rick Roy, Zuowei Shen, Ming Tan, Ye Tang, Monica Wong-Chan, Joe Zbiciak, and Amin Vahdat. 2022. Aquila: A unified, low-latency fabric for datacenter networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, 1249–1266. https://www.usenix.org/conference/nsdi22/presentation/gibson
- [22] Roberto Gioiosa, Thomas Warfel, Antonino Tumeo, and Ryan Friese. 2017. Pushing the Limits of Irregular Access Patterns on Emerging Network Architecture: A Case Study. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, Honolulu, HI, USA, 874–881. doi:10.1109/CLUSTER.2017.125
- [23] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik. 2017. Rearchitecting datacenter networks and stacks for low latency and high performance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, Los Angeles CA USA, 29–42. doi:10.1145/3098822.3098825
- [24] Sara Hooker. 2021. The hardware lottery. Commun. ACM 64, 12 (Dec. 2021), 58–65. doi:10.1145/3467017
- [25] Wentao Hou, Jie Zhang, Zeke Wang, and Ming Liu. 2024. Understanding Routable PCIe Performance for Composable Infrastructures. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 297–312. https://www.usenix.org/conference/nsdi24/presentation/hou
- [26] Jack Tigar Humphries, Neel Natu, Kostis Kaffes, Stanko Novaković, Paul Turner, Hank Levy, David Culler, and Christos Kozyrakis. 2024. Tide: A Split OS Architecture for Control Plane Offloading. http://arxiv.org/abs/2408.17351
- [27] Intel. 2024. Intel Acceleration Stack for Intel® Xeon® CPU with FPGAs Core Cache Interface (CCI-P). https://www.intel.com/content/www/us/en/docs/programmable/683193/current/acceleration-stack-for-cpu-with-fpgas.html
- [28] Intel. 2024. Intel® Data Streaming Accelerator User Guide. (July 2024). https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
- [29] Intel Harp. 2024. IvyTown Xeon + FPGA: The HARP Program. https://cpufpga.wordpress.com/wp-content/uploads/2016/04/harp\_isca\_2016\_final.pdf
- [30] Rishabh Iyer, Musa Unal, Marios Kogias, and George Candea. 2023. Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 466–481. doi:10.1145/3600006.3613136
- [31] Christoforos Kachris, Konstantinos Kanonakis, and Ioannis Tomkos. 2013. Optical interconnection networks in data centers: recent trends and future challenges. *IEEE Communications Magazine* 51, 9 (Sept. 2013), 39–45. doi:10.1109/MCOM.2013.6588648
- [32] Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. 2019. Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 345–360. https://www.usenix.org/conference/nsdi19/presentation/kaffes
- [33] Sagar Karandikar, Chris Leary, Chris Kennelly, Jerry Zhao, Dinesh Parimi, Borivoje Nikolic, Krste Asanovic, and Parthasarathy Ranganathan. 2021. A Hardware Accelerator for Protocol Buffers. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO '21). Association for Computing Machinery, New York, NY, USA, 462–478. doi:10.1145/3466752.3480051
- [34] Tomoya Kashimata, Toshiaki Kitamura, Keiji Kimura, and Hironori Kasahara. 2019. Cascaded DMA Controller for Speedup of Indirect Memory Access in Irregular Applications. In 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, Denver, CO, USA, 71–76. doi:10.1109/IA349570.2019.00017
- [35] Kimberly Keeton. 2015. The Machine: An Architecture for Memory-centric Computing. In Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers. ACM, Portland OR USA, 1–1. doi:10.1145/2768405.2768406
- [36] Steen Larsen and Ben Lee. 2015. Reevaluation of programmed I/O with write-combining buffers to improve I/O performance on cluster systems. In 2015 IEEE International Conference on Networking, Architecture and Storage (NAS). IEEE, Boston, MA, USA, 345–346. doi:10.1109/NAS.2015.7255219
- [37] Nikita Lazarev, Shaojie Xiang, Neil Adit, Zhiru Zhang, and Christina Delimitrou. 2021. Dagger: efficient and fast RPCs in cloud microservices with near-memory reconfigurable NICs. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 36–51. doi:10.1145/3445814.3446696
- [38] Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bianchini. 2023. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 574–587. doi:10.1145/3575693.3578835
- [39] Wei Siew Liew, Md Ashfaqur Rahaman, James McMahon, Ryan Stutsman, and Vijay Nagarajan. 2025. Stop Taking the Scenic Route: the Shortest Distance Between the CPU and the NIC is MMIO. In Proceedings of the 20th Workshop on Hot Topics in Operating Systems (Banff, Canada) (HotOS '25).
- [40] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. 2019. Snap: a microkernel approach to host networking. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 399–413. doi:10.1145/3341301.3359657
- [41] Frank McSherry. 2025. Timely Dataflow. https://github.com/TimelyDataflow/timely-dataflow
- [42] Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In Conference on Innovative Data Systems Research. https://api.semanticscholar.org/CorpusID:18593675
- [43] Amirhossein Mirhosseini, Hossein Golestani, and Thomas F. Wenisch. 2020. HyperPlane: A Scalable Low-Latency Notification Accelerator for Software Data Planes. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 852–867. doi:10.1109/MICRO50266.2020.00074
- [44] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). Association for Computing Machinery, New York, NY, USA, 439–455. doi:10.1145/2517349.2522738
- [45] Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2020. A Primer on Memory Consistency and Cache Coherence. Springer International Publishing, Cham. doi:10.1007/978-3-031-01764-3
- [46] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. 2018. Understanding PCIe performance for end host networking. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (Budapest, Hungary) (SIGCOMM '18). Association for Computing Machinery, New York, NY, USA, 327–341. doi:10.1145/3230543.3230560
- [47] NVIDIA. 2024. NVLink. https://www.nvidia.com/en-us/data-center/nvlink/
- [48] Muhsen Owaida, David Sidler, Kaan Kara, and Gustavo Alonso. 2017. Centaur: A Framework for Hybrid CPU-FPGA Databases. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 211–218. doi:10.1109/FCCM.2017.37
- [49] Abishek Ramdas. 2023. CCKit: FPGA acceleration in symmetric coherent heterogeneous platforms. Doctoral Thesis. ETH Zurich. doi:10.3929/ethz-b-000642567
- [50] Hugo Sadok, Nirav Atre, Zhipeng Zhao, Daniel S. Berger, James C. Hoe, Aurojit Panda, Justine Sherry, and Ren Wang. 2023. Ensō: A Streaming Interface for NIC-Application Communication. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 1005–1025. https://www.usenix.org/conference/osdi23/presentation/sadok

- [51] Henry N. Schuh, Arvind Krishnamurthy, David Culler, Henry M. Levy, Luigi Rizzo, Samira Khan, and Brent E. Stephens. 2024. CC-NIC: a Cache-Coherent Interface to the NIC. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA) (ASPLOS '24). Association for Computing Machinery, New York, NY, USA, 52–68. doi:10.1145/3617232.3624868
- [52] Jasmin Schult, Ben Fiedler, David A. Cock, and Timothy Roscoe. 2024. Semi-Open-State Testing for in-Silicon Coherent Interconnects. In Formal Methods in Computer-Aided Design, FMCAD 2024, Prague, Czech Republic, October 15-18, 2024, Nina Narodytska and Philipp Rümmer (Eds.). IEEE, 1–10. doi:10.34727/2024/ISBN.978-3-85448-065-5\_21
- [53] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 647–659. doi:10.1145/2830772.2830821
- [54] J. Stuecheli, W. J. Starke, J. D. Irish, L. B. Arimilli, D. Dreps, B. Blaner, C. Wollbrink, and B. Allison. 2018. IBM POWER9 opens up a new era of acceleration enablement: OpenCAPI. IBM Journal of Research and Development 62, 4/5 (2018), 8:1–8:8. doi:10.1147/JRD.2018.2856978
- [55] Sajjad Tamimi, Florian Stock, Andreas Koch, Arthur Bernhardt, and Ilia Petrov. 2022. An Evaluation of Using CCIX for Cache-Coherent Host-FPGA Interfacing. In 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 1–9. doi:10.1109/FCCM53951.2022.9786103
- [56] Wesley W Terpstra. 2017. TileLink: A free and open-source, high-performance scalable cache-coherent fabric designed for RISC-V. In Proc. 7th RISC-V Workshop.
- [57] James Thomas, Pat Hanrahan, and Matei Zaharia. 2020. Fleet: A Framework for Massively Parallel Streaming on FPGAs. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, Lausanne Switzerland, 639–651. doi:10.1145/3373376.3378495
- [58] Tianrui Wei, Nazerke Turtayeva, Marcelo Orenes-Vera, Omkar Lonkar, and Jonathan Balkind. 2023. Cohort: Software-Oriented Acceleration for Heterogeneous SoCs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 105–117. doi:10.1145/3582016.3582059
- [59] Ying Wei, Yi Chieh Huang, Haiming Tang, Nithya Sankaran, Ish Chadha, Dai Dai, Olakanmi Oluwole, Vishnu Balan, and Edward Lee. 2023. 9.3 NVLink-C2C: A Coherent Off Package Chip-to-Chip Interconnect with 40Gbps/pin Single-ended Signaling. In 2023 IEEE International Solid-State Circuits Conference (ISSCC). 160–162. doi:10.1109/ISSCC42615.2023.10067395
- [60] Xilinx DMA Benchmark 2024. https://github.com/Xilinx/dma\_ip\_drivers/tree/master
- [61] Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, Seaside CA USA, 255–265. doi:10.1145/3373087.3375312