July 26-28 Cluster Symposium 2005
Neal Bierbaum, Helen Chen, Jeffrey Decker, Erik Van De Vreugde

Outline
- Parallel I/O in cluster computing
- TerraGRID – the Parallel Filesystem
- 10 GigE TOE and IB SDP
- Testbed configuration
- Benchmark methodology
- Results and analysis
- Conclusion and future work
Parallel I/O Requirements
- Cluster computing architecture
- Multiple nodes run single application in parallel
- Global data structure distributed in memory of multiple nodes
- Filesystem with parallel I/O paths and global name space can eliminate the serial I/O bottleneck
TerraGRID
- An iSCSI-based, block-level, scalable I/O platform
- Uses Shared Access Scheduling Scheme to enable a Linux file system to act as a massively parallel file system
- Each initiator uses SW RAID to issue requests to all targets in parallel
- Each target presents a file or a raw device as a block container
- All initiators share a global name space
Fully Harnesses Linux File System and Utilities
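
The striping idea behind "each initiator uses SW RAID to issue requests to all targets in parallel" can be illustrated with a minimal sketch. It assumes a plain RAID-0 layout and a 64 KiB chunk size. Neither is stated in the slides, and the function name `stripe_request` is hypothetical.

```python
from collections import defaultdict


def stripe_request(offset, length, num_targets, chunk_size=64 * 1024):
    """Split a logical byte range into per-target (target_offset, length) pieces.

    Assumption (not from the slides): RAID-0 layout, where logical chunk i
    lives on target i % num_targets, in chunk slot i // num_targets.
    """
    pieces = defaultdict(list)
    end = offset + length
    while offset < end:
        chunk = offset // chunk_size           # index of the logical chunk
        within = offset % chunk_size           # byte position inside that chunk
        n = min(chunk_size - within, end - offset)
        target = chunk % num_targets
        target_offset = (chunk // num_targets) * chunk_size + within
        pieces[target].append((target_offset, n))
        offset += n
    return dict(pieces)


if __name__ == "__main__":
    # A 256 KiB request over 4 targets touches each target exactly once,
    # so the four pieces could be issued in parallel.
    for target, parts in sorted(stripe_request(0, 256 * 1024, 4).items()):
        print(target, parts)
```

With this layout, a large sequential request fans out evenly across the targets, which is the property that lets aggregate throughput scale with the number of targets.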

InfiniBand
- InfiniBand (IB)
- Transport protocol implemented in silicon
- High speed (2.5 to 30 Gbps), low latency (100 ns) interconnect
- Sockets Direct Protocol (SDP)
- New AF_INET protocol family that supports reliable stream sockets
- Allows sockets applications transparent access to the hardware IB protocol stack
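
To illustrate what "transparent access" means for a sockets application, here is a minimal sketch of a stream-socket echo exchange in which the address family is the only transport-specific parameter. It runs over ordinary TCP on loopback. Using it over SDP would mean passing whatever address-family constant the installed IB stack provides, which is an assumption here and is not shown. The names `echo_once` and `recv_exact` are hypothetical.

```python
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket (or fewer if the peer closes)."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            break
        buf += part
    return buf


def echo_once(family=socket.AF_INET, host="127.0.0.1", payload=b"ping"):
    """Send payload to a one-shot echo server and return the reply.

    The application logic never refers to the transport; only `family`
    would change if a different reliable stream protocol were selected.
    """
    server = socket.socket(family, socket.SOCK_STREAM)
    server.bind((host, 0))                    # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        with conn:
            conn.sendall(recv_exact(conn, len(payload)))

    worker = threading.Thread(target=serve)
    worker.start()
    with socket.socket(family, socket.SOCK_STREAM) as client:
        client.connect((host, port))
        client.sendall(payload)
        reply = recv_exact(client, len(payload))
    worker.join()
    server.close()
    return reply


if __name__ == "__main__":
    print(echo_once())
```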
TCP Offload Engine (TOE)
- Adapters that deliver hardware-offloaded TCP/IP protocol stacks
- Implemented over 1 and 10 Gigabit Ethernet
- Cooperative TCP Offload
- Provide support for existing sockets-based applications
The Big Picture

Hardware Setup

Key Software
- Filesystem: TerraGRID v.1.0.0
- Kernel: Linux 2.4.25
- TMPFS used on targets for allocating RAM
- oneSIS used to boot all the nodes
- Mellanox IB/SDP stack used for IB
- Chelsio TOE module and driver
Key Hardware
- Mainboard: Tyan Thunder K8WE (S2895)
- 2 GB per initiator; 8 GB per target
- To avoid bug 56 in the AMD 8131 PCI-X HT
- 10GigE Switch: Fujitsu 10GbE Layer 2 XG800, 500 ns latency, with cut-through forwarding and flow control
- 10GbE/TOE NIC: Chelsio T210 Protocol Engine
- IB Switch: Voltaire ISR 9288, 150 ns latency
- IB HCA: Mellanox Technologies MT23108
Benchmark Methodology
- A custom Sandia test system integrates the definition, execution, and the organization of results and related information
- XML definition files define the test environment, the test program parameters, and the scheduling of simultaneous runs across multiple hosts
- Results of each run are reported in a series of XML, HTML and serialized compressed data files to allow easy reviewing and consistent, unambiguous searching and processing of results from a large number of test runs
- Test programs include IOZONE, NETPERF, and a custom file system operations test
- Remote test control processes also record system resource usage on participating hosts during each test run
- Post-processing tools convert data specific to a test type into a spreadsheet for further analysis
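
As a rough illustration of an XML-driven test definition, the sketch below parses one file into the three parts the slide names: test environment, program parameters, and the schedule of simultaneous runs. The schema, element names, attribute names, host names, and parameter values are all hypothetical. The actual Sandia format is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical definition file; every element and attribute name is an assumption.
EXAMPLE_DEFINITION = """
<testdef name="iozone-write">
  <environment filesystem="TerraGRID" transport="TOE"/>
  <program name="iozone">
    <param name="record_size" value="1m"/>
    <param name="file_size" value="4g"/>
  </program>
  <schedule>
    <run host="initiator01" start_offset_s="0"/>
    <run host="initiator02" start_offset_s="0"/>
  </schedule>
</testdef>
"""


def load_definition(xml_text):
    """Parse a test definition into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    program = root.find("program")
    return {
        "name": root.get("name"),
        "environment": dict(root.find("environment").attrib),
        "program": program.get("name"),
        "params": {p.get("name"): p.get("value") for p in program.findall("param")},
        # Runs sharing a start offset are meant to execute simultaneously.
        "runs": [
            (r.get("host"), float(r.get("start_offset_s")))
            for r in root.find("schedule").findall("run")
        ],
    }


if __name__ == "__main__":
    definition = load_definition(EXAMPLE_DEFINITION)
    for host, start in definition["runs"]:
        print(f"{host}: start {definition['program']} at t+{start}s")
```

Keeping the definition declarative like this is what makes results searchable afterwards, since each run can be stored alongside the exact definition that produced it.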
Technology Baseline – Back-to-back Netperf Throughput and Latency

TerraGRID Socket Connection Profile

Infrastructure Baseline – Netperf Throughput

Infrastructure Baseline – Netperf Latency

IOzone Aggregate Throughput

IOzone CPU Overhead

Summary IOzone Aggregate Throughput by Technology

Summary IOzone CPU Load by Technology

IOzone Work Efficiency

Conclusion
- 10 GbE and TOE outperformed IB and SDP for socket applications in our test environment
- Protocol offload (TOE and SDP) offered significant performance improvement
- Further improvement possible with RDMA and zero-copy
Future Plans
- Evaluate RDMA performance through DAPL or VAPI
- Evaluate 10 Gigabit Ethernet as a shared I/O infrastructure between large platforms
- Distance advantage (LAN, WAN)
- Existing technology leverage
- Infrastructure
- Knowledge base
- etc.
10 Gigabit Ethernet Market Trend
