Experiences with NFS over IB and iWARP RDMA

OpenFabrics Workshop, Sonoma, CA

May 1, 2007

Helen Chen, Noah Fischer, Matt Leininger, and Mitch Williams
Sandia National Laboratories
SAND 2007-2137C

Outline


  • Motivation and Background Information
  • Previous Study: NFS over RDMA (SDR IB)
  • This Study: extends the previous study to include DDR IB and 10 GbE iWARP
    • The Testbed
    • The Benchmark
    • Results and Analysis
  • Summary and Future Plans
    • The parallel NFS research collaboration with Open Grid Computing

 

Motivation


  • Scaling I/O for Commodity Clusters
    • While multi-core processor technology speeds ahead, filesystem capability is falling far behind.
    • Panasas, Lustre, and GPFS are being developed outside of the Linux mainstream, and they are complex to administer
    • The Linux mainstream distributed filesystem, NFS, is slowly being improved in functionality (NFSv4, NFS-over-RDMA, parallel-NFS)

The NFS RDMA Architecture



  • NFS is a family of protocols layered over RPC
  • XDR encodes RPC requests and results onto RPC transports
  • NFS RDMA is implemented as a new RPC transport mechanism
  • Selection of transport is an NFS mount option
[Figure: NFS RDMA Architecture]

Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad, “NFS over RDMA”, ACM SIGCOMM 2003 Workshops, August 25-27, 2003
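
As an illustration of the XDR layer named above, here is a minimal sketch of how XDR lays out a few values before they are handed to an RPC transport. It follows the general XDR rules (4-byte big-endian words, variable-length data padded to a 4-byte boundary). The helper names and the request body are hypothetical and are not the actual NFS READ arguments.

```python
import struct


def xdr_uint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as 4 big-endian bytes."""
    return struct.pack(">I", value)


def xdr_opaque(data: bytes) -> bytes:
    """Encode variable-length opaque data: length word, bytes, zero padding."""
    padding = (4 - len(data) % 4) % 4  # pad up to the next 4-byte boundary
    return xdr_uint(len(data)) + data + b"\x00" * padding


def xdr_string(text: str) -> bytes:
    """Encode an ASCII string using the same layout as opaque data."""
    return xdr_opaque(text.encode("ascii"))


# Hypothetical request body: a file name, an offset, and a byte count.
encoded = xdr_string("data.bin") + xdr_uint(0) + xdr_uint(131072)
```

The same encoded bytes can travel over TCP or over an RDMA transport, which is why the transport can be chosen at mount time without changing the NFS protocol itself.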

 

The NFS Protocol Stack


[Figure: NFS Protocol Stack]

Relevant OFA Stack



[Figure: Relevant OFA Stack]

iWARP: RDMA protocol for TCP/IP



  • iWARP is the suite of RDMA protocols for TCP/IP
  • An RNIC is an RDMA-capable NIC with offloaded iWARP as well as TCP/IP (TOE)
  • An RNIC typically exposes NIC, TOE, and iWARP interfaces to upper-layer applications
[Figure: iWARP, the RDMA protocol for TCP/IP]


Previous Study: NFS over IB
RDMA vs. TCP (IPoIB)


  • NFS over RDMA can easily fill the 10 Gb/s (1 GB/s) SDR IB pipe
    • iozone -i 0 -i 1 -r 64k -s 2g
    • Client cache < 2 GB < server cache

[Figure: NFS RDMA vs. TCP]
http://www.openfabrics.org/archives/sep2006devcon.htm

 

Write Performance Issues


  • Linux’s NFS client implementation lacks concurrency
    • Pdflush is activated when the dirty page cache reaches 34%
    • Application I/O is blocked while cached data is being flushed
    • Most visible with RDMA because of its huge bandwidth capacity and CPU efficiency
  • Being addressed by the Linux kernel community (Talpey, Tucker, et al.)
    • Support multi-threading to take advantage of multi-core hardware
    • Tune the Linux VM to flush data in the background as often as network bandwidth allows

This study evaluates NFS RDMA transport vs. TCP, using Iozone reading from server cache

 

This Study: DDR IB and iWARP RPC Transport with SRP IB Storage



[Figure: DDR IB and iWARP RPC Transport]


Key Testbed Hardware


  • Mainboard: iWILL DK8ES
    • Dual-core, dual-socket 2.4 GHz AMD Opteron
    • Dual Channel 400 Registered memory
      • 4 GB on server
      • 2 GB on client
  • DDR IB Switch: Mellanox InfiniScale III 24-port switch
  • DDR IB HCA: PCI-E Mellanox MT25204 InfiniHost III Lx
  • 10 GbE Switch: Fujitsu XG700 CX4
  • 10 GbE RNIC: PCI-E Chelsio Terminator 3
  • SDR IB SRP Storage: DDN S2A9550

 

Key Testbed Software


 

NFS Test Configuration


  • One NFS server and one to four clients
  • Ext2 filesystem built on IB SRP Storage at SDR
  • TCP/IPoIB-UD (MTU 2048), TCP/IPoIB-CM (MTU 65520), and IB RDMA transport at DDR
  • Host TCP/IP, TOE, and RNIC (iWARP) transport at 10GbE rate (MTU 9000)
  • Clients ran Iozone, reading 128 KB records
  • Each client read a 2 GB file, avoiding client-side cache effects and server-side disk I/O
    • This isolates the NFS RDMA transport for evaluation
  • System resources were monitored using “vmstat” at one-second intervals
  • All tests were repeated 10 times

 

NFS Throughput



[Figure: NFS Throughput]


NFS Throughput Conclusion


  • NFS over RDMA
    • IB RDMA can take advantage of the IB DDR pipe (theoretical maximum bandwidth 2.0 GB/s)
    • iWARP throughput is limited by the 10GbE rate (theoretical maximum bandwidth 1.25 GB/s)
    • Both RDMA transports outperformed their TCP counterparts, most noticeably on IB
  • NFS over TCP
    • IPoIB-CM performed significantly better than IPoIB-UD
      • MTU 65520 vs. 2048
    • Both the RNIC TOE and NIC performed surprisingly well
    • Can also easily fill the 10GbE pipe
    • A great all-in-one adapter
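
The theoretical maxima quoted in these slides can be reproduced with a short calculation. The sketch below assumes 4x InfiniBand links (10 Gb/s signaling for SDR, 20 Gb/s for DDR) with 8b/10b line coding, and treats the 10GbE figure as a 10 Gb/s data rate. The link width, the encoding, and the function name are assumptions that are not stated in the slides.

```python
def data_rate_gbytes(signal_gbps: float, payload_bits: int, line_bits: int) -> float:
    """Usable data rate in GB/s for a given signaling rate and line code."""
    # Scale the signaling rate by the line-code efficiency, then bits -> bytes.
    return signal_gbps * payload_bits / line_bits / 8


sdr_ib = data_rate_gbytes(10.0, 8, 10)  # assumed 4x SDR link with 8b/10b coding
ddr_ib = data_rate_gbytes(20.0, 8, 10)  # assumed 4x DDR link with 8b/10b coding
ten_gbe = 10.0 / 8                      # 10GbE is already quoted as a data rate

print(f"SDR IB: {sdr_ib} GB/s, DDR IB: {ddr_ib} GB/s, 10GbE: {ten_gbe} GB/s")
```

These are raw link ceilings only. Protocol headers and host overhead reduce what NFS can actually deliver.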

 

NFS CPU Efficiency


[Figure: NFS CPU Efficiency]


NFS CPU Efficiency Conclusion


  • Host Efficiency is based on CPU per MB transferred
    • Σ%cpu / 100 / file-size
  • IB RDMA and 10GbE iWARP delivered comparable CPU efficiency
  • RDMA demonstrated better CPU performance than TCP
    • Most significantly in IB
    • Both TOE and host TCP performed extremely well, with TOE better than host stack
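
The efficiency metric above (Σ%cpu / 100 / file-size) can be sketched as follows, assuming one CPU utilization sample per second, as collected with vmstat. The function name, the `interval_s` parameter, and the sample values are hypothetical. The slides do not say which vmstat columns make up %cpu, so the input here is simply a list of percentages.

```python
def cpu_seconds_per_mb(cpu_percent_samples, file_size_mb, interval_s=1.0):
    """Sum per-interval CPU utilization and divide by the data transferred."""
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    # Each sample is a percentage over one interval, so the sum divided by 100
    # gives CPU-seconds consumed during the run (for 1-second intervals).
    cpu_seconds = sum(cpu_percent_samples) / 100.0 * interval_s
    return cpu_seconds / file_size_mb


# Hypothetical samples for illustration only, not measurements from this study.
samples = [40, 55, 50, 55]
cost = cpu_seconds_per_mb(samples, file_size_mb=2048)
print(f"{cost:.6f} CPU-seconds per MB")
```

A lower value means the host spends less CPU time per megabyte moved, which is the sense in which the transports are compared here.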


SRP Target and Initiator Configuration


  • Target: DDN S2A9550
    • 1 controller, 1 SDR IB link
    • 4 Power LUNs, each striped across 4 tiers (8 plus 1 parity of 250 GB SATA II disks)
    • Block size = 4096
  • Initiator: OFED 1.2 beta
    • Increased maximum number of gather/scatter entries per I/O
      • modprobe ib_srp srp_sg_tablesize=64
    • Increased filesystem read-ahead sector count to 1024
      • hdparm -a 1024


SRP Performance


  • 1 to 4 concurrent sessions from 1 to 4 initiators

“iozone -i 0 -r 128k -s 8g -f /mnt/srp1/test”

8 GB > initiator memory, so the measurement reflects SRP performance

[Figure: SRP Performance]


SRP Performance Conclusion


  • Good but still room for tuning
    • Adjust maximum outstanding SCSI requests per LUN
    • Increase maximum SCSI command payload size
    • Evaluate Linux I/O Scheduling Algorithms
    • QoS?
    • etc…


Future Plans: The Need for pNFS


  • Large number of concurrent requests from parallel applications
  • Require parallelism in addition to RDMA

[Figure: The Need for pNFS]



The pNFS Architecture



[Figure: The pNFS Architecture]

  • pNFS extends NFSv4
    • To allow out-of-band I/O
    • A Standards-based scalable I/O solution
  • Asymmetric, Out-of-band solutions offer scalability
    • Control path (open/close) different from Data Path (read/write)

http://www3.ietf.org/proceedings/04nov/slides/nfsv4-8/pnfs-reqs-ietf61.ppt

 

The Sandia-Open Grid Computing Research Collaboration


[Figure: Open Grid Computing Research Collaboration]
  • Based on the CITI implementation at UMICH
  • Modified to stripe pNFS file data across RDMA-enabled Linux NFSv3 filers
  • Open Source Linux environment
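
To make the idea of striping file data across several filers concrete, here is a minimal sketch of generic round-robin striping. It maps a byte offset in the file to a data server and to an offset within that server's portion. The stripe unit, the server count, and the function name are hypothetical. This is not necessarily the layout used by the CITI implementation.

```python
def locate_stripe(offset: int, stripe_unit: int, num_servers: int):
    """Map a file byte offset to (server index, offset on that server)."""
    stripe_number = offset // stripe_unit       # which stripe unit holds the byte
    server = stripe_number % num_servers        # units rotate across the servers
    # Units already placed on this server, plus the position inside the unit.
    local_offset = (stripe_number // num_servers) * stripe_unit + offset % stripe_unit
    return server, local_offset


# Hypothetical layout: 64 KB stripe unit over 4 data servers.
for off in (0, 65536, 4 * 65536 + 100):
    print(off, locate_stripe(off, stripe_unit=65536, num_servers=4))
```

With a mapping like this, a client can send reads and writes for different stripe units to different servers in parallel, while opens and closes still go through the control path.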

 

Future SRP Study


  • Each Storage Server has to handle multiple independent large sequential writes and/or reads
    • Concurrent sequential I/O requests from compute nodes turned into random large accesses on target
    • A challenge for Parallel Filesystem and Storage vendors

[Figure: Future SRP Study]

 

Acknowledgment


  • Tom Tucker from Open Grid Computing and Tom Talpey from Network Appliance for their in-depth technical support for NFS/RDMA
  • Chas Williams from NRL, Randy Kreiser from DDN, and Dror Goldenberg from Mellanox for their assistance with IB SRP
  • Felix Marti from Chelsio and Steve Wise from Open Grid Computing for their expertise in iWARP
  • Jim Brandt from Sandia for his technical input and review
