Vivado HLS/AutoESL: Agilent packet engine case study
How Xilinx's Vivado HLS enabled the creation of an in-fabric, processor-free UDP network packet engine
Gigabit Ethernet is one of the most common interconnect options for linking a workstation or laptop to an FPGA-based embedded platform, largely because of the availability of the hardened tri-mode Ethernet MAC (TEMAC) primitive. The main obstacle to developing Ethernet-based FPGA designs is the processor traditionally thought necessary to handle the Internet Protocol (IP) stack. Agilent Technologies approached the problem by using the Vivado HLS/AutoESL high-level synthesis tool to develop a high-performance IPv4 User Datagram Protocol (UDP) packet transfer engine.
Agilent’s Measurement Research Lab wrote original C source code based on Internet Engineering Task Force requests for comments (RFCs) detailing the packet exchanges of several protocols: UDP, the Address Resolution Protocol (ARP) and the Dynamic Host Configuration Protocol (DHCP). The design implements a hardware packet-processing engine without any need for a CPU. The architecture handles traffic at line rate with minimal latency and is compact in terms of logic resources. Vivado HLS/AutoESL made it easy to modify the user interface to adapt to one or more FIFO streams or to multiple RAM interface ports. AutoESL is a new addition to the Xilinx ISE Design Suite and is called Vivado HLS in the new Vivado Design Suite.
NB: For consistency with our general overview of the Vivado tool suite this month and to avoid confusion, this article adopts the convention of referring to the software as Vivado HLS/AutoESL. It was called AutoESL when used by Agilent. As the project described here used series-5 Xilinx FPGAs, it would still normally be undertaken using ISE and known as AutoESL. The Vivado suite, where the Vivado HLS name is now used, covers series-7 and future devices.
IPv4 user datagram protocol
Internet Protocol version 4 (IPv4) dominates the Internet, with version 6 (IPv6) growing steadily in popularity. When most developers discuss IP, they commonly refer to the Transmission Control Protocol (TCP), the connection-based protocol that provides reliability and congestion management. But for many applications, increased bandwidth and minimal latency trump reliability. So, applications such as video streaming, telephony, gaming and distributed sensor networks typically use UDP instead.
UDP is connectionless and provides no inherent reliability. If packets are lost, duplicated or sent out of order, the sender has no way of knowing and it is the responsibility of the user’s application to perform packet inspection to handle these errors. UDP has therefore been nicknamed the ‘unreliable’ protocol. But in comparison to TCP, it offers higher performance. UDP support is available in nearly every major operating system. High-level software programming languages refer to network streams as ‘sockets’ and UDP as a datagram socket.
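Part of UDP's performance advantage is its minimal header: just four 16-bit fields totaling eight bytes (RFC 768), against TCP's 20-byte minimum. A sketch of the layout in C (the struct and helper names here are illustrative, not from Agilent's code):

```c
#include <stdint.h>

/* UDP header layout per RFC 768: four 16-bit big-endian fields. */
typedef struct {
    uint16_t src_port;   /* sending application's port          */
    uint16_t dst_port;   /* receiving application's port        */
    uint16_t length;     /* header (8 bytes) + payload length   */
    uint16_t checksum;   /* optional for IPv4; 0 = not computed */
} udp_header_t;

/* Serialize a header into a network-order (big-endian) byte buffer. */
static void udp_header_pack(const udp_header_t *h, uint8_t out[8])
{
    const uint16_t f[4] = { h->src_port, h->dst_port, h->length, h->checksum };
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = (uint8_t)(f[i] >> 8);   /* high byte first */
        out[2 * i + 1] = (uint8_t)(f[i] & 0xFF);
    }
}
```

With so few fields to fill, a hardware engine can assemble a UDP header in a handful of cycles, which is what makes a processor-free implementation plausible in the first place.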
Sensor network architecture
Agilent has developed a LAN-based sensor network that interfaces an analog-to-digital converter (ADC) with a Xilinx Virtex-5 FPGA. The FPGA performs data aggregation and then streams a requested number of samples to a predetermined IP address — that is, a host PC.
Because the block RAM of the FPGA was almost completely devoted to signal processing, there was not enough memory to contain the firmware for a soft processor. Instead, Agilent opted to implement a minimal set of networking functions to transfer sensor data via UDP back to a host. Because of the need for high bandwidth and low latency, UDP packet streaming was the preferred network mode.
Given the time-sensitive nature of the data, a new sample set is more pertinent than any retransmission of lost samples. One of the two most challenging issues the designers faced was avoiding overloading the host device: they had to find a way of efficiently handling the large number of inbound samples. The second major challenge was quickly formatting the UDP packet, then calculating the required IP header fields and the optional but necessary UDP payload checksum, before the next set of samples overflowed internal buffers.
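The checksum in question is the standard Internet one's-complement sum (RFC 1071), run over the UDP pseudo-header, header and payload. A minimal software reference implementation, of the kind useful as a golden model when verifying the hardware:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit
 * big-endian words, carries folded back in, final sum inverted. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                      /* sum 16-bit words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)                          /* pad an odd trailing byte */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because the sum is associative, hardware can accumulate it incrementally as payload bytes stream past, which is exactly the property the TX Flow module exploits below.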
Initial HDL design
An HDL implementation of the packet engine was straightforward given preexisting pseudocode, but not the best option for the FPGA hardware. C and pseudocode from various sources simplified verification. In addition, tools such as Wireshark, the open-source packet analyzer, and high-level languages such as Java simplified the process of simulation and in-lab verification.
Using the pseudocode, the task of developing Verilog to generate the packet headers involved coding a state machine, reading the sample FIFO and assembling the packet into a RAM-based buffer. The team broke the design into three main modules: RX Flow, TX Flow and LAN MCU (Figure 1).
The UDP packet engine design consisted of three main modules. (Source: Agilent/Xilinx)
As packets arrive from the LAN, the RX Flow inspects them and passes them either to the instrument core or to the LAN MCU for processing, such as when handling ARP or DHCP packets.
The TX Flow packet engine reads N ADC samples from a TX FIFO and computes a running payload checksum for calculating the UDP checksum. The TX FIFO buffers new samples as they arrive, while the LAN MCU prepares the payload of a yet-to-be-transmitted packet.
After fetching the last requested sample, the LAN MCU computes the remaining header fields of the IP/UDP packet. In network terminology, this procedure is a TX checksum offload.
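In software terms, the offload step amounts to adding the IPv4 pseudo-header words and the UDP header fields to the running payload sum that TX Flow has already accumulated. A hedged sketch of that final step, with the one's-complement arithmetic inlined (the function signature and argument conventions are assumptions for illustration, not Agilent's actual code):

```c
#include <stdint.h>

/* Finish a UDP checksum (RFC 768): combine a running one's-complement
 * sum of the payload with the IPv4 pseudo-header (source/destination
 * address, protocol, UDP length) and the UDP header fields. Addresses
 * and ports are host-order values; udp_len covers header + payload. */
uint16_t udp_checksum_finish(uint32_t payload_sum,
                             uint32_t src_ip, uint32_t dst_ip,
                             uint16_t src_port, uint16_t dst_port,
                             uint16_t udp_len)
{
    uint32_t sum = payload_sum;
    sum += (src_ip >> 16) + (src_ip & 0xFFFF);   /* pseudo-header IPs  */
    sum += (dst_ip >> 16) + (dst_ip & 0xFFFF);
    sum += 17;                                   /* protocol = UDP     */
    sum += udp_len;                              /* pseudo-header len  */
    sum += udp_len;                              /* UDP length field   */
    sum += src_port + dst_port;                  /* UDP header ports   */
    while (sum >> 16)                            /* fold carries       */
        sum = (sum & 0xFFFF) + (sum >> 16);
    uint16_t c = (uint16_t)~sum;
    return c ? c : 0xFFFF;   /* a transmitted 0 means "no checksum" */
}
```

The length appears twice because it occurs both in the pseudo-header and as the UDP header's own length field; the checksum field itself is summed as zero.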
Once the packet fields are generated, the LAN MCU sends the packet to the TEMAC for transmission but retains it until the TEMAC acknowledges successful transmission (not reception by the destination device). While this first packet awaits transmission by the TEMAC, new sensor samples are arriving in the TX FIFO. When the first packet is finished, the packet engine releases the buffer to prepare for the next packet, and the process continues in a double-buffered fashion.
If the TEMAC signals an error and an overflow of the next transmit buffer is imminent, the packet is dropped so that the next sample set can continue, and an exception is noted. Because a timestamp for each sample set is incorporated into the packet format, the host will detect the discontinuity in the set and accommodate it.
The latency to transmit a packet is the number of cycles it takes to read in N ADC samples plus the cycles to generate the packet header fields. The fields include the IPv4 flags, source and destination address fields, UDP pseudo-header, and both the IP and UDP checksums. The checksum computations are problematic: they require reading the entire packet, yet the checksum fields lie in the header, before the payload bytes.
Coding HDL in the dark
To support the high-bandwidth and low-latency requirements of the sensor network, Agilent needed an optimal hardware design to keep up with the required sample rate. The straightforward approach implemented first in Verilog failed to meet a target 125MHz clock rate without floorplanning, and took 17 clock cycles to generate the IP/UDP packet header fields.
As the team developed the initial HDL design, ChipScope was vital to understanding the nuances of the TEMAC interface, but it also impeded the goal of achieving a 125MHz clock. Additional logic-capture circuits altered the critical path and required manual floorplanning for timing closure.
The critical path was the calculation of the IP and UDP header checksums, because the straightforward design used a four-operand adder to sum multiple header fields together in various states. The HDL design attempted a ‘greedy’ scheduling approach that tried to do as much work as possible per state machine cycle. By removing ChipScope from these operations and floorplanning manually, the team closed timing.
The HDL design also used only one port of a 32bit-wide block RAM that acted as the transmit packet buffer. Agilent chose a 32bit-wide memory because it is the native width of the BRAM primitive and allows byte-enable write accesses, which avoid the need for read-modify-write access to the transmit buffer.
Using byte enables, the finite state machine (FSM) writes directly to the header field bytes needing modification at a RAM address. However, what seemed like good design choices based on knowledge of the underlying Xilinx fabric and algorithm yielded a design that failed to meet timing without manual placement of the four-input adders.
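The behavior being exploited can be modeled in C: a byte-enabled write updates only the selected lanes of a 32-bit word, which is what spares the writer a read-modify-write cycle. A sketch (the function and its conventions are illustrative, not part of the design):

```c
#include <stdint.h>

/* Model of a byte-enabled write to one word of a 32bit-wide RAM:
 * only lanes whose enable bit is set are updated, so the writer
 * never needs a read-modify-write sequence. Bit i of 'be' enables
 * byte lane i (bits 8i..8i+7 of the word). */
uint32_t ram_write_be(uint32_t old_word, uint32_t new_word, uint8_t be)
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++)
        if (be & (1u << i))
            mask |= 0xFFu << (8 * i);       /* widen enable bit to lane mask */
    return (old_word & ~mask) | (new_word & mask);
}
```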
Because the UDP algorithms were already available in various forms, either as C code or as pseudocode in IP-related RFC documentation, recoding the UDP packet engine in C was not a major task, and it yielded better insight into the packet header processing. Simply taking the pseudocode and starting to write Verilog might have made for quicker coding, but that methodology would have sacrificed performance by not fully studying the data and control flows involved.
Running Vivado HLS/AutoESL
Vivado HLS/AutoESL’s ability to abstract the FIFO and RAM interfaces offered one of the best opportunities to optimize performance. Being able to code directly in C, the Agilent team could now easily include both ARP and DHCP routines in the packet engine. Figure 2 shows a flowchart of the design. The HDL design used a byte-wide FIFO interface that connected to the aggregation and sensor interface of the design, which remained in Verilog. The Verilog design also used a 32bit memory interface that collected 4 bytes of sample data and saved them in the transmit buffer RAM as a 32bit word.
The packet engine flowchart shows the inclusion of ARP and DHCP. (Source: Agilent/Xilinx)
Through its ‘array reshape’ directive, Vivado HLS/AutoESL optimized the memory interface so that the transmit buffers, although written in C code as an 8bit memory, became a 32bit memory. This meant the C code could avoid many bit manipulations of the header fields, which would otherwise require bit shifting to place them into a 32bit word. It also alleviated little-endian vs. big-endian byte-ordering issues.
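In HLS source the pattern looks something like the following: the C code stays byte-addressed while the directive widens the physical memory (pragma syntax per the Vivado HLS user guide; the buffer name and field offsets are illustrative, not Agilent's code):

```c
#include <stdint.h>

#define TX_BUF_BYTES 1536   /* one maximum-size Ethernet frame */

/* The C code addresses the transmit buffer one byte at a time; the
 * reshape directive asks Vivado HLS to implement it as a 32bit-wide
 * RAM, packing four consecutive bytes into each physical word.
 * (A plain C compiler simply ignores the HLS pragma.) */
void write_ip_ttl_proto(uint8_t txbuf[TX_BUF_BYTES])
{
#pragma HLS ARRAY_RESHAPE variable=txbuf cyclic factor=4
    txbuf[22] = 64;   /* IPv4 TTL (offsets assume a 14-byte Ethernet header) */
    txbuf[23] = 17;   /* IPv4 protocol field: 17 = UDP */
}
```

Each byte store lands in the correct lane of a 32bit word without the C code ever shifting or masking, which is how the endianness headaches disappear.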
This optimization reduced the latency of the TX offload function, which computes the packet checksums and generates the header fields, from 17 clock cycles as originally written in Verilog to just seven, while easily meeting timing. Vivado HLS/AutoESL could do better still: the version used did not have the ability to manipulate byte enables on RAM writes, although byte-enabled memory support is on the tool’s long-term roadmap.
Another optimization Vivado HLS/AutoESL performed, discovered by serendipity, was to access both ports of the memory, since Xilinx block RAM is inherently dual-port. The Verilog design had reserved the second port of the transmit buffer so that the TEMAC interface could access the buffer without any need for arbitration. By allowing Vivado HLS/AutoESL to optimize for a true dual-port RAM, the design became capable of performing reads or writes at two different locations of the buffer per cycle, in effect halving the cycles needed to generate the header. The reduction in latency was well worth the effort of creating a simple arbiter in Verilog so that the TEMAC interface could still access the memory port that Vivado HLS/AutoESL had usurped.
Agilent controlled the bit widths of the transmit buffer and the sample FIFO interfaces via directives. Vivado HLS/AutoESL does not automatically optimize a design. You have to experiment with a variety of directives and determine through trial and error which delivers an improvement. For this design, reducing the number of clock cycles to process the packet fields while operating at 125MHz was the goal.
The ‘array reshape’ and loop ‘pipeline’ directives were the most important for optimizing the design. The reshape directive alters the bit width of the RAM and FIFO interfaces, which ultimately allowed multiple header fields to be processed and written back to memory in parallel each clock cycle. The combination that yielded the fewest cycles was a transmit-buffer width of 32 bits. The width of the FIFO feeding ADC samples was not a factor in reducing overall latency, because samples cannot be forced to arrive any faster.
The loop-pipelining directive is extremely important too, because it tells the compiler that loops that push to and pop from FIFO interfaces can operate back-to-back. Without the pipeline directive, Vivado HLS/AutoESL spent three to 20 clock cycles between pops of the FIFO for scheduling reasons. It is therefore vital to use pipelining as much as possible to attain low latency when streaming data between memories.
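The pattern is a loop that pops one sample per iteration, pipelined to an initiation interval of one. A sketch in HLS-style C, with the FIFO modeled as a plain array so it compiles anywhere (names are illustrative; a real Vivado HLS design would use a stream interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Accumulate N samples popped from a FIFO into a running payload sum.
 * With PIPELINE II=1, Vivado HLS schedules one FIFO pop per clock;
 * without the directive the tool may insert several idle cycles
 * between pops. (A plain C compiler ignores the pragma.) */
uint32_t drain_fifo(const uint16_t fifo[], size_t n)
{
    uint32_t payload_sum = 0;
    for (size_t i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        payload_sum += fifo[i];   /* one pop and accumulate per iteration */
    }
    return payload_sum;
}
```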
Xilinx block RAM also has a programmable data-output latency of one to three clock cycles; allowing three cycles of read latency gives the minimum ‘clock-to-Q’ timing. Experimenting with different read latencies was simply a matter of changing the ‘latency’ directive for the RAM primitive or ‘core’ resource. Because of the scheduling Vivado HLS/AutoESL performed, adding a read latency of three cycles to RAM accesses tacked only one additional cycle onto the overall packet header generation. The extra cycle of memory latency allowed more slack in the design, which aided the place-and-route effort.
Agilent also implemented ARP and DHCP routines in the Vivado HLS/AutoESL design. It had avoided doing so before because the routines are extremely cumbersome to write in Verilog and require a great number of states. For instance, the ARP request/response exchange would require more than 70 states, and one coding error in the Verilog FSM would likely take days to track down. For this reason, many designers prefer to use a CPU to run these network routines.
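The state count reflects the size of the exchange: an ARP request alone carries nine fields over 28 bytes (RFC 826), each of which the Verilog FSM must sequence byte by byte. The same packet is a few lines of C, as in this sketch (the function and argument names are placeholders, not Agilent's code):

```c
#include <stdint.h>
#include <string.h>

/* Build a 28-byte ARP request payload (RFC 826) asking "who has
 * target_ip?". All multi-byte fields are big-endian. */
void arp_build_request(uint8_t pkt[28],
                       const uint8_t my_mac[6], uint32_t my_ip,
                       uint32_t target_ip)
{
    static const uint8_t fixed[8] = {
        0x00, 0x01,   /* HTYPE: Ethernet     */
        0x08, 0x00,   /* PTYPE: IPv4         */
        0x06, 0x04,   /* HLEN = 6, PLEN = 4  */
        0x00, 0x01    /* OPER: 1 = request   */
    };
    memcpy(pkt, fixed, 8);
    memcpy(pkt + 8, my_mac, 6);                 /* sender MAC */
    for (int i = 0; i < 4; i++) {
        pkt[14 + i] = (uint8_t)(my_ip     >> (24 - 8 * i)); /* sender IP */
        pkt[24 + i] = (uint8_t)(target_ip >> (24 - 8 * i)); /* target IP */
    }
    memset(pkt + 18, 0x00, 6);                  /* target MAC: unknown */
}
```

Parsing the matching reply and handling retries roughly doubles the work again, which is why the exchange balloons to 70-plus FSM states in Verilog.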
Overall, Vivado HLS/AutoESL excelled at generating a synthesizable netlist for the UDP packet engine. The module it generated fit between two preexisting ADC and TEMAC interface modules and performed the necessary packet header generation and additional tasks. The team was able to integrate the design Vivado HLS/AutoESL created into the core design and simulate it with Mentor Graphics’ ModelSim to perform functional verification. With the streamlined design, timing closure was reached with less synthesis, map and place-and-route effort than for the original HDL design. Yet the result had significantly more functionality, such as ARP and DHCP.
Comparing the original Verilog design with the hybrid design that used Vivado HLS/AutoESL to craft LAN MCU and TX Flow modules yielded impressive results. Table 1 shows a comparison of lookup table (LUT) usage.
The Vivado HLS/AutoESL design used more lookup tables but incorporated more functionality. (Source: Agilent/Xilinx)
The HDL version of TX Flow was smaller by more than 37 percent, but the Vivado HLS/AutoESL design incorporated more functionality. Most impressive is that Vivado HLS/AutoESL reduced the number of cycles to perform our packet header generation by 59 percent.
Table 2 shows the latency of the TX Offload algorithm.
Vivado HLS/AutoESL improved the latency of the TX Offload algorithm. (Source: Agilent/Xilinx)
The critical path of the HDL design was computing the UDP checksum. The HDL design suffered from 10 levels of logic and a total path delay of 6.4ns, whereas Vivado HLS/AutoESL optimized this to only three levels of logic and a path delay of 3.5ns. Development time for the HDL design was about a month. The Vivado HLS/AutoESL design took about the same, but the result incorporated more functionality while building familiarity with the nuances of the tool.
Latency and throughput
Vivado HLS/AutoESL has a significant advantage over HDL design in that it performs control- and data-flow analyses, then uses the results to reorder operations to minimize latency and increase throughput. In this case, the greedy algorithm had tried to do too many arithmetic operations per clock cycle; the tool rescheduled the checksum calculations to use only two-input adders, scheduled in such a way as to avoid increasing overall execution latency.
Software compilers perform these kinds of analyses intrinsically. As state machines become more complex, the HDL designer is at a disadvantage compared with the omniscience of the compiler. An HDL designer would typically not have the opportunity to explore more than two architectural choices, because of the time constraints of delivering a design, yet such exploration may be vital to delivering a low-power design.
The most important benefit of Vivado HLS/AutoESL was its ability to try a variety of scenarios that would be tedious in Verilog. These included changing bit widths of FIFOs and RAMs, partitioning a large RAM into smaller memories, reordering arithmetic operations and using dual-port instead of single-port RAM. In an HDL design, each scenario would likely cost an additional day of coding, followed by testbench modification to verify correct functionality. With Vivado HLS/AutoESL these changes took minutes, were seamless and did not entail any major modification of the source code.
Modifying large state machines is extremely cumbersome in Verilog. The advent of tools like Vivado HLS/AutoESL recalls the days when processor designers began to employ microprogramming instead of the hand-constructed state machines of early microprocessors such as the 8086 and 68000. With the arrival of RISC architectures and hardware description languages, microprogramming is mostly a lost art form, but its lesson is well learned: abstraction is necessary to manage complexity. Just as microprogramming offered a higher layer of abstraction for state machine design, so do Vivado HLS/AutoESL and high-level synthesis tools in general. Tools of this caliber allow a designer to focus on the algorithms themselves rather than the low-level implementation, which is error prone, difficult to modify and inflexible in the face of future requirements.
About the authors
Dr. Nathan Jachimiec is an R&D Engineer in the Technology Leadership Organization of Agilent Technologies.
Dr. Fernando Martinez Vallina is a Software Applications Engineer at Xilinx.
This article originally appeared in Issue 79 of Xcell Journal. Many thanks to them and Xilinx for permission to reproduce this version. To read the extended original, download the issue here. For more on the magazine itself and to download other issues, click here.