The Terminator Architecture

The Case for a VLIW Processor vs. Multi-processor SOC to Terminate TCP and Process L5-L7 Protocols

A Chelsio Communications White Paper

Introduction
The goal of 10 gigabit-per-second (Gbps) Ethernet is simple: to drive the convergence of the high-speed storage/network fabric through economies of scale, and to match the performance attributes of competing high-speed interconnects such as SONET/SDH, Fibre Channel, and InfiniBand.

To compete with these interconnects, 10Gbps Ethernet must first match their low latency. In Ethernet networks, reliable end-to-end transport uses the Transmission Control Protocol (TCP), which has predominantly been implemented in software. However, to achieve full 10Gbps performance and to reduce latency to the level of the competing interconnects, the TCP function must be offloaded from the host CPU, freeing processing cycles and returning them to the high-performance applications.

This offload function is accomplished through the use of a TCP Offload Engine (TOE). In addition to enabling high bandwidth and low end-to-end latency, a TOE must also feature a flat performance profile and achieve wire-rate performance for a wide range of connection roundtrip times (RTTs). The flat performance profile requirement specifies that the 10Gbps performance should be attainable for a single connection, all the way up to thousands or even tens of thousands of simultaneously sending and/or receiving connections. This requirement also calls for the wire bandwidth to be distributed equally across all connections with the same priority level. The RTT requirement specifies that the TOE should provide robust support for everything from intra-cluster connections with RTTs in the 10 microsecond (μs) range, to LAN connections with RTTs in the 1-10 millisecond (ms) range, and all the way out to WAN connections with RTTs in the 10-100 ms range. Without fine-granularity timers, the TOE won’t deliver the robust support required for high-speed clusters.
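To give a sense of what this RTT range implies, the amount of data that must be in flight to keep a link full is the bandwidth-delay product (link rate times RTT). The sketch below evaluates it at 10 Gbps for one sample RTT from each of the three ranges quoted above. The function name and the sample RTT values are illustrative choices, not figures from Chelsio.

```python
LINK_RATE_BPS = 10_000_000_000  # 10 Gbps Ethernet, as in the text

def bytes_in_flight(rtt_us, rate_bps=LINK_RATE_BPS):
    """Bandwidth-delay product in bytes for an RTT given in microseconds."""
    return rate_bps * rtt_us // 8_000_000

# Sample RTTs picked from the cluster, LAN and WAN ranges (illustrative).
for label, rtt_us in [("cluster", 10), ("LAN", 1_000), ("WAN", 100_000)]:
    print(f"{label:8s} RTT={rtt_us:>7,d} us  in flight={bytes_in_flight(rtt_us):>12,d} bytes")
```

The four orders of magnitude between the cluster and WAN cases are one way to see why a single design has to cope with both very small and very large per-connection windows.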

Two approaches to TOE implementation can be followed. The first uses a system-on-a-chip (SOC) implementation with multiple RISC processors and special-purpose engines, while the second — the Chelsio Communications approach — is characterized by a pipelined VLIW processor implementation.

While both approaches have their advantages, the multiple-RISC SOC implementation also has a number of scaling problems. First, it cannot achieve wire-rate speeds below a minimum threshold number of connections. Above a modest threshold number of connections, on the other hand, it suffers from memory contention between the different processors competing for the path to off-chip memory. The approach also suffers from “cache thrashing” when scaling to a large number of connections. Finally, the architecture is inherently store-and-forward and cannot achieve low latency via cut-through processing (also known as flow-through processing), which is the ability to perform required TCP processing on the fly (for either the send or receive) without having to store packets in off-chip memory.

The Terminator architecture favored by Chelsio suffers from none of these shortcomings; it is a pipelined VLIW implementation capable of simultaneous cut-through processing on both send and receive paths that is independent of the number of simultaneously active connections. Furthermore, the Terminator architecture is high bandwidth and has low end-to-end latency; the first-generation Terminator product — the T1 chip — has a verified flat performance profile of 7+ Gbps (limited by PCI-X 1.0 performance) for anything from one to more than 10,000 TCP connections, with a measured 9 μs user/application to user/application latency.

For iSCSI, the data throughput has been measured at 854 megabytes per second (MBps), with a measured rate of 534K I/O operations per second (IOPS); for both of these numbers, the hardware CRC generation/checking was enabled. The architecture offloads the CRC-32C generation/checking in iSCSI, recovers PDUs on receive, and performs Direct Data Placement (DDP) for the recovered PDUs. Finally, it is important to note that all the bandwidth benchmark results employ standard 1500B Ethernet frames.
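For readers unfamiliar with the checksum being offloaded, the following is a minimal bit-at-a-time software sketch of CRC-32C (the Castagnoli polynomial used by iSCSI digests). It is written for clarity, not speed, and says nothing about how the Terminator hardware computes the digest; the function name is an illustrative choice.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form

def crc32c(data: bytes) -> int:
    """Compute CRC-32C one bit at a time (slow, but easy to follow)."""
    crc = 0xFFFFFFFF                      # initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):                # process the byte LSB first
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # final inversion

print(hex(crc32c(b"123456789")))
```

Doing this per byte in software for every PDU at multi-gigabit rates is the host cost that hardware generation and checking removes.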


Multi-RISC Architecture
The typical multi-RISC SOC implementation shown in Figure 1 is a full-duplex device with one terminal connected to a host computer via a PCI-X peripheral bus (for example) and the other terminal connected to Ethernet. The rest of the SOC is composed of a cluster of microprocessors (μP) and some type of memory bus or fabric to connect the μP with an off-chip memory.

The packet information processing proceeds in the following manner:

A packet arrives through an input unit and is stored in the scratch pad memory buffer. Header information is forwarded to a μP in the processor cluster that is responsible for processing that particular packet. Mapping to a specific processor is typically based on TCP 4-tuple information; it could, for example, use a hash function over the 4-tuple to one of the cluster processors, or the map could be performed via a TCAM or a search tree lookup.
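The hash-based variant of this mapping can be sketched as follows. This is a minimal illustration, assuming a CRC-32 hash over a string form of the 4-tuple and a simple modulo over the core count; the function name, the key encoding, and the choice of hash are all assumptions, not details of any particular SOC.

```python
import zlib

def core_for_connection(src_ip, src_port, dst_ip, dst_port, num_cores):
    """Map a TCP 4-tuple to a core index so every packet of a connection
    lands on the same processor (hypothetical encoding and hash)."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_cores

core = core_for_connection("10.0.0.1", 40000, "10.0.0.2", 3260, num_cores=8)
print("connection handled by core", core)
```

Because the mapping depends only on the 4-tuple, it is stable for the life of the connection, which is what lets the connection state stay in one processor's cache.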

The TCP protocol is stateful, typically requiring a minimum of 128B of storage, plus additional storage to manage send/receive buffers and TCP timers. A specific packet, therefore, needs to be processed on the specific μP which already has the connection state in its data cache (D$), or the connection state must be retrieved from a common cluster cache memory or from the off-chip memory and stored in the appropriate processor D$ before processing can begin.

The typical per-processor D$ size in 0.13 μm technology is 8KB to 16KB, which at 128B per connection is barely sufficient for storing a maximum of 64-128 connections per μP D$. The number of μP cores with a D$ of this size is realistically only scalable to a modest 8-16 cores, so the total on-chip D$ capacity is only enough to store 512 to 2K connections (8 × 64 to 16 × 128), depending on assumptions about D$ size and the number of μP. (Note that this is a best-case number of connections, since it does not assume any D$ memory requirements for timers or send and receive buffers.)
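The capacity estimate can be reproduced from the stated inputs: 128B of state per connection, 8KB to 16KB of D$ per core, and 8 to 16 cores. The short sketch below multiplies these out; the function names are illustrative, and as in the text it ignores any cache space for timers and buffers.

```python
TCB_BYTES = 128  # minimum per-connection state, from the text

def connections_per_cache(cache_bytes, tcb_bytes=TCB_BYTES):
    """Connections whose state fits in one data cache (best case)."""
    return cache_bytes // tcb_bytes

def on_chip_capacity(cores, cache_bytes):
    """Total cache-resident connections across all cores (best case)."""
    return cores * connections_per_cache(cache_bytes)

low = on_chip_capacity(cores=8, cache_bytes=8 * 1024)
high = on_chip_capacity(cores=16, cache_bytes=16 * 1024)
print(f"best-case cache-resident connections: {low} to {high}")
```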

Beyond this modest number of connections, there will invariably be a significant loss of performance due to cache misses and contention between the different μP for the cache-replacement path to the off-chip memory. The multi-RISC SOC architecture, therefore, has significant scaling issues, even in a best-case scenario, when the number of simultaneously active connections exceeds a few thousand. Once that threshold is crossed, the processor caches will start to thrash and performance will decrease and become unpredictable due to the probabilistic and temporal nature of cache misses. In addition, input packets will start to be dropped due to the increased per-packet processing time, or the input packets will need to be stored in off-chip memory.

All of this is the good news for the multi-RISC SOC chip architecture.

First, it must be noted that an accepted rule of thumb is that TCP processing requires 1 MHz of processor clock per 1 Mbps of TCP throughput. Further, because the TCP protocol is stateful, it is impossible to concurrently process the packets belonging to a single connection in more than one processor. Processing a single 10 Gbps TCP connection, therefore, would require a 10 GHz processor core, which is close to 10 times faster than what is feasible with today’s SOC process technology. As a result, the multi-RISC SOC architecture will not reach wire rate for a single TCP connection, and will require on the order of 10 connections and 10 processor cores to reach wire-rate speeds.
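The arithmetic behind this argument is sketched below. It applies the 1 MHz per 1 Mbps rule of thumb directly and assumes a 1 GHz core as a stand-in for "what is feasible"; that clock value and the function names are assumptions made for illustration.

```python
import math

MHZ_PER_MBPS = 1.0  # rule of thumb quoted in the text

def single_connection_limit_mbps(core_mhz):
    """Best-case throughput of one connection pinned to one core."""
    return core_mhz / MHZ_PER_MBPS

def cores_for_wire_rate(link_mbps, core_mhz):
    """Cores (and hence distinct connections) needed to fill the link."""
    return math.ceil(link_mbps * MHZ_PER_MBPS / core_mhz)

CORE_MHZ = 1_000    # assumed 1 GHz core
LINK_MBPS = 10_000  # 10 Gbps Ethernet
print("one connection tops out at", single_connection_limit_mbps(CORE_MHZ), "Mbps")
print("cores needed for wire rate:", cores_for_wire_rate(LINK_MBPS, CORE_MHZ))
```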

Another observation is that it is difficult to achieve load balancing across the available processors, i.e. a hash function will not necessarily balance the active connections across the available processors, and a search tree or TCAM lookup scheme is slow to react to change in the connection activity. To exacerbate this effect, a single overwhelmed RISC core will affect the performance of the other RISC cores through head-of-line blocking, etc.
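The imbalance is easy to reproduce in a toy model. The sketch below assigns flows to cores by hashing the 4-tuple and sums the offered load per core; one heavy flow is enough to overload a single core no matter how evenly the remaining flows spread. The traffic mix, the hash, and the function name are all hypothetical.

```python
import zlib
from collections import Counter

def per_core_load(flows, num_cores):
    """flows maps a 4-tuple to its offered load in Mbps; returns Mbps per core."""
    load = Counter({core: 0 for core in range(num_cores)})
    for four_tuple, mbps in flows.items():
        core = zlib.crc32(repr(four_tuple).encode()) % num_cores
        load[core] += mbps
    return load

# Hypothetical mix: one 5 Gbps flow plus fifty 100 Mbps flows, 10 Gbps in total.
flows = {("10.0.0.1", 5000, "10.0.0.2", 3260): 5_000}
for port in range(50):
    flows[("10.0.0.1", 6000 + port, "10.0.0.2", 3260)] = 100

load = per_core_load(flows, num_cores=8)
print("ideal share per core:", sum(load.values()) / 8, "Mbps")
print("busiest core carries:", max(load.values()), "Mbps")
```

A static hash has no knowledge of per-connection activity, so the busiest core here carries at least four times the ideal share.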

Another important observation is that TCP termination consists of the basic Read-Modify-Write cycle for the per-connection state described above, which is held in a Transmission Control Block (TCB). However, beyond that, TCP termination involves memory management for the per-connection TCP send buffer, memory management for the per-connection TCP receive buffer, management of the various per-connection TCP timers, and management of the per-connection delayed acknowledgement timers. Each of these activities requires memory storage for data structures in the already scarce processor D$, thereby lowering the number of cache-resident TCB state blocks and increasing the cache miss ratio. In addition, TCP timers typically require a dedicated processor from the processor pool.

In summary, the multi-RISC SOC architecture has two modalities of operation. In the first mode, the TCB state is cache-resident and the performance is predictable, but in the second mode, the caches are thrashing and the overall performance degrades. The cache-resident operation only extends in the best case to a few thousand simultaneously active connections. It is interesting to note that a single-chip multi-RISC SOC implementation might be the sweet spot for this architecture; in other words, in that configuration, the connection state is by definition cache-resident and the TCP processing performance is therefore predictable. However, it is clear that scaling this architecture to a large number of simultaneously active connections and maintaining reasonable performance is close to impossible.


Chelsio VLIW Terminator Architecture
The Chelsio pipelined VLIW Terminator architecture is depicted in Figure 2. The packets arrive through the input unit from the core side, typically connected to either a PCI-X or a PCI-Express peripheral bus, and through another input unit from the wire side connected to 10 Gbps Ethernet. The packet headers subsequently enter the processing pipeline, where each processing step has an upper boundary or limit for the worst-case processing time, making the processing rate of the overall pipeline predictable.

The packets proceed to the VLIW processor which determines the TCP processing required for each packet. The processing consists of the following steps: the classification of the TCP/IP header information, the update of the TCP congestion window, the classification of received TCP/IP payload information, the update of the TCP receive window, the scheduling of TCP timers, and the scheduling of TCP/IP response messages.

The classification of the TCP/IP header information is accomplished by the TCP classification co-processor (CP) and is based on the processing of the header sequence number, acknowledgement number, and TCP flag settings (SYN, FIN, ACK, RST, CWR, URG), and on comparing these with the current TCP connection state and the stored sequence number and acknowledgement number information. The operation of the TCP classification CP has been formally verified for all combinations of header values (sequence number, acknowledgement number, TCP flag settings) and stored values (sequence numbers, acknowledgement number).
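One small part of this classification, comparing a segment's sequence number with the stored next-expected value, can be sketched in software. The example uses standard 32-bit wraparound arithmetic for TCP sequence numbers. It is a simplified illustration, not the logic of the Chelsio co-processor: it ignores flags and connection state, and the category names are invented here.

```python
SEQ_MOD = 1 << 32
HALF = 1 << 31

def seq_diff(a, b):
    """Signed distance a - b in 32-bit sequence space (handles wraparound)."""
    return ((a - b + HALF) % SEQ_MOD) - HALF

def classify_segment(seq, length, rcv_nxt, rcv_wnd):
    """Classify a received segment relative to the next expected byte."""
    offset = seq_diff(seq, rcv_nxt)
    if offset == 0:
        return "in-order"
    if offset < 0:
        # Starts before rcv_nxt: entirely old data, or old data plus new.
        return "duplicate" if offset + length <= 0 else "partial-overlap"
    return "out-of-order" if offset < rcv_wnd else "outside-window"

print(classify_segment(seq=1000, length=100, rcv_nxt=1000, rcv_wnd=65535))
print(classify_segment(seq=0xFFFFFFF0, length=16, rcv_nxt=0x10, rcv_wnd=65535))
```

The second call shows why plain integer comparison is not enough: the segment's sequence number is numerically larger than the expected one, yet it lies behind it once wraparound is taken into account.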

The TX flow control processor manages flow control, including the Nagle algorithm, the Silly Window Syndrome (SWS) avoidance, and peer window, as well as congestion control – slow start and congestion avoidance.
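As a reminder of what slow start and congestion avoidance do to the congestion window, here is a minimal sketch following the standard formulas from RFC 5681. It is a software illustration of the algorithms named above, not a description of the TX flow control processor; the function names and the 1460-byte MSS (a 1500B frame minus 40 bytes of IP and TCP headers) are choices made for the example.

```python
MSS = 1460  # payload bytes per segment for a 1500B Ethernet frame

def on_ack(cwnd, ssthresh, acked_bytes, mss=MSS):
    """Return the new congestion window (bytes) after an ACK of new data."""
    if cwnd < ssthresh:
        # Slow start: grow by up to one MSS per ACK (about 2x per RTT).
        return cwnd + min(acked_bytes, mss)
    # Congestion avoidance: grow by about one MSS per RTT.
    return cwnd + max(1, mss * mss // cwnd)

def on_timeout(flight_size, mss=MSS):
    """Return (cwnd, ssthresh) after a retransmission timeout."""
    ssthresh = max(flight_size // 2, 2 * mss)
    return mss, ssthresh

cwnd, ssthresh = MSS, 8 * MSS
for _ in range(12):
    cwnd = on_ack(cwnd, ssthresh, acked_bytes=MSS)
print("cwnd after 12 ACKs:", cwnd, "bytes")
```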

The TCP protocol has several types of per-connection timers, which are managed by the Timer Co-Processor (Timer CP). These per-connection timers include the following: the connection establishment timeout timer; the retransmission timeout timer; the delayed acknowledgement timer; the persist-probe timer; the keep-alive timer; the FINWAIT2 state timer; and the TIMEWAIT state timer. The Timer CP supports all of these different per-connection timers for any number of connections. The Timer CP allows the use of fine-grained retransmit timers, with values as low as tens of μs, which are required for robust intra-cluster TCP operation.
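The source does not say how the Timer CP is built. As a point of reference, a common software technique for managing many timers at fine granularity is a hashed timing wheel, sketched below. The class, its method names, and the 10 μs tick are assumptions made purely for illustration.

```python
class TimerWheel:
    """Hashed timing wheel: timers sit in the bucket for their expiry tick
    modulo the wheel size, so each tick inspects only one bucket."""

    def __init__(self, slots=256, tick_us=10):
        self.buckets = [dict() for _ in range(slots)]
        self.expiry_of = {}   # (conn_id, kind) -> absolute expiry tick
        self.tick_us = tick_us
        self.now = 0          # current tick count

    def schedule(self, conn_id, kind, delay_us):
        """Arm (or re-arm) one timer of the given kind for a connection."""
        self.cancel(conn_id, kind)
        expiry = self.now + max(1, delay_us // self.tick_us)
        self.expiry_of[(conn_id, kind)] = expiry
        self.buckets[expiry % len(self.buckets)][(conn_id, kind)] = expiry

    def cancel(self, conn_id, kind):
        """Disarm a timer if it is pending; do nothing otherwise."""
        expiry = self.expiry_of.pop((conn_id, kind), None)
        if expiry is not None:
            del self.buckets[expiry % len(self.buckets)][(conn_id, kind)]

    def advance(self):
        """Move forward one tick and return the timers that expire on it."""
        self.now += 1
        bucket = self.buckets[self.now % len(self.buckets)]
        fired = [key for key, expiry in bucket.items() if expiry == self.now]
        for key in fired:
            del bucket[key]
            del self.expiry_of[key]
        return fired

wheel = TimerWheel(slots=8, tick_us=10)
wheel.schedule(conn_id=1, kind="retransmit", delay_us=30)
wheel.schedule(conn_id=1, kind="keep-alive", delay_us=100)
for _ in range(10):
    for conn_id, kind in wheel.advance():
        print(f"tick {wheel.now}: connection {conn_id} {kind} timer fired")
```

Arming and cancelling are constant-time regardless of how many timers are pending, which is the property that matters when every connection carries several timers.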

Finally, the VLIW processor creates response messages and updates the per-connection TCB state. This includes issuing TCP/IP header encapsulation instructions and payload generation instructions to the event dispatch unit.

There is no caching in the pipeline and the FIFO buffering between stages of the pipeline is matched to the off-chip memory latency and bandwidth; therefore, the processing rate is the same, independent of the number of connections. Furthermore, the pipeline is designed in such a way that the packet-processing rate is independent of the connection mix of packets in the pipeline. The pipeline rate is the same when all packets in the pipeline belong to the same TCP connection; it is the same when all packets in the pipeline belong to different TCP connections; and it is the same when some of the packets in the pipeline belong to different TCP connections.

An additional attribute of the pipeline design is that it achieves near-perfect multiplexing of available bandwidth over the active connections. For example, the bandwidth distribution for 2,000 identical connections (symmetric peer setup) was measured in a lab experiment where the aggregate bandwidth equaled the wire rate, and the bandwidth of each of the 2,000 connections was found to be the same to a precision of four significant digits.

The typical latency through the Terminator pipeline is on the order of 1 μs, making cut-through processing feasible on both the send and receive paths. In general, cut-through processing is conditioned in the send direction on the per-connection TCP send window state, and in the receive direction on the alignment of a packet with the next expected sequence number. Cut-through processing also takes into account fairness issues between different TCP connections.
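The two conditions can be written as simple predicates. This is a minimal sketch under one reading of the text: on receive, the segment must start exactly at the next expected sequence number; on send, the new data must fit within the peer's advertised window given what is already in flight. The function and parameter names are hypothetical, and the fairness considerations mentioned above are not modelled.

```python
SEQ_MOD = 1 << 32

def rx_cut_through_ok(seq, rcv_nxt):
    """Receive side: the segment is aligned with the next expected byte."""
    return seq == rcv_nxt

def tx_cut_through_ok(snd_una, snd_nxt, snd_wnd, length):
    """Send side: the payload fits in the remaining send window."""
    in_flight = (snd_nxt - snd_una) % SEQ_MOD  # unacknowledged bytes
    return in_flight + length <= snd_wnd

print(rx_cut_through_ok(seq=5000, rcv_nxt=5000))
print(tx_cut_through_ok(snd_una=1000, snd_nxt=3000, snd_wnd=4000, length=1460))
```

Packets that fail either test would have to be buffered rather than passed straight through.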

The Chelsio Terminator architecture can be implemented in a single chip with on-chip memory to store the TCB connection state and buffers, but it can also be scaled to support a large number of connections with the current architectural limit set at 1 million connections, and an aggregate size of the off-chip send and receive buffers of 4 GB. The architecture has also been determined to be very efficient; it achieves full-duplex 1 Gbps performance when the core clock frequency is 16 MHz, and it saturates PCI-X 1.0 with a clock frequency of 125 MHz. These performance metrics are for 1500B Ethernet frames, independent of the number of simultaneous connections in excess of 1,000.
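The two clock figures are consistent with each other if throughput scales linearly with core clock, as the short calculation below shows. Linear scaling is an assumption made here for illustration; the text states only the two data points (16 MHz for 1 Gbps and 125 MHz to saturate PCI-X 1.0).

```python
MHZ_PER_GBPS = 16.0  # from the text: full-duplex 1 Gbps at a 16 MHz core clock

def gbps_at_clock(core_mhz):
    """Throughput implied by a core clock, assuming linear scaling."""
    return core_mhz / MHZ_PER_GBPS

def clock_for_gbps(gbps):
    """Core clock implied by a target throughput, assuming linear scaling."""
    return gbps * MHZ_PER_GBPS

print("125 MHz corresponds to", gbps_at_clock(125), "Gbps")
print("10 Gbps would correspond to", clock_for_gbps(10), "MHz")
```

Under that assumption, 125 MHz maps to about 7.8 Gbps, in line with the 7+ Gbps PCI-X-limited figure quoted in the introduction.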

Conclusions
The Chelsio pipelined VLIW Terminator architecture delivers a high-bandwidth, low-latency solution that is capable of cut-through processing on both the send and receive paths, providing a superior alternative to the multi-processor SOC implementation. The Terminator architecture’s performance profile is agnostic with regard to the number of TCP connections and the RTT of the TCP connections. For iSCSI, the architecture adds the ability to achieve wire-rate CRC-32C generation and checking, and it also performs iSCSI DDP at wire rate.

For more information about Chelsio Communications and the Terminator architecture, visit the Chelsio web site at www.chelsio.com or send an e-mail to info@chelsio.com.


© Copyright 2007 Chelsio Communications