Problem Solving with Large Data

This page is a work in progress; it describes my thoughts on the impact of large-data analysis problems on the cluster as a whole.

While compute-intensive jobs are also interesting, they are a more defined problem.

Heavy I/O has traditionally caused us the most pain.

Big Picture Hardware Summary Today

  • Gateways - dumbo, fantasia, snowwhite, wonderland
    • 128GiBytes RAM
    • 12-24 cores
    • no local storage
  • SAN Storage (home directories) - more than 1PiByte of slow, but reasonably robust storage
  • Project Machines - 1000g, amd, esp, got2d, t2dgenes, bipolar
    • gateway type compute server
    • 208 TiBytes of RAID60 storage (126 × 2TiByte disk drives each)
    • limited to project specific analysis
    • each has a local compute mini-cluster of 4 × 8-core, 96GiByte nodes (C6100)
    • total storage on these is now well past 1PiByte
  • Compute nodes - approximately 78
    • reachable via slurm
    • most 8 core
    • total of around 576 cores on the public side, 736 counting the project mini-clusters
    • at least 32GiBytes, many have 96GiBytes
  • Strong 10 Gigabit network backbone
  • Compute nodes are each 1 Gigabit, but the aggregating switch is dual 10 Gigabit

Large Analysis Goals

  • read lots of BAM or FASTQ files
  • do stuff with them
  • write lots of results back

Dataflow Limits

  • SAN Storage arrays provide ~50-250 MiBytes/second (gateway home directories)
    • low end is thrashing with many processes/users (>10-15 active)
    • high end is smaller number of highly sequential read/write (3-10 active)
    • absolute maximum is around 500MiBytes/second
  • Project machine RAID60 provides 200-2000 MiBytes/second
    • low end is thrashing case (>60-70 jobs)
    • high throughput is smaller number of highly sequential I/O (5-20)
    • absolute theoretical maximum is almost 5000 MiBytes/second
    • highest observed maximum is around 2000 MiBytes/second
    • highest practical tends to be around 1000-1500 MiBytes/second
  • All servers except compute nodes are 10 Gigabit/second -> roughly 1.25 GiBytes/second peak line rate
  • Compute nodes are 1 Gigabit/second -> roughly 125 MiBytes/second peak line rate (see the back-of-envelope sketch after this list)
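
To put these numbers in perspective, here is a back-of-envelope calculation. The 30 GiByte BAM size and the rates used are illustrative assumptions taken from the ranges above, not measurements; it just converts raw line rates to byte throughput and estimates how long one large file takes to stream over each path.

  // Back-of-envelope only: raw line-rate conversions and transfer-time
  // estimates.  The file size and the rates are illustrative assumptions.
  #include <cstdio>

  int main() {
      const double GIB = 1024.0 * 1024.0 * 1024.0;

      // Raw line rates, ignoring TCP/NFS protocol overhead.
      const double compute_node_link = 1e9 / 8.0;             // 1 Gbit/s  -> ~125 MBytes/s
      const double backbone_link     = 10e9 / 8.0;            // 10 Gbit/s -> ~1.25 GBytes/s
      const double raid60_sequential = 1000.0 * 1024 * 1024;  // ~1000 MiBytes/s practical RAID60

      const double bam_bytes = 30.0 * GIB;  // assume a ~30 GiByte BAM for illustration

      std::printf("compute node link: %6.0f seconds\n", bam_bytes / compute_node_link);
      std::printf("10 Gbit backbone:  %6.0f seconds\n", bam_bytes / backbone_link);
      std::printf("RAID60 sequential: %6.0f seconds\n", bam_bytes / raid60_sequential);
      return 0;
  }

The takeaway is that a single compute node's 1 Gigabit link is the narrowest pipe per node, but a few dozen nodes reading at once can easily swamp a fileserver or the dual 10 Gigabit uplink.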

I/O Bottlenecks and Challenges

Generally speaking, operating systems are not good at dealing with I/O bandwidth as a limited resource.

  • typically, optimization is for fairness, not throughput
  • local filesystem thrashing
  • NFS increases thrashing (large read/writes are broken down to smaller ones)
  • impossible to rate limit I/O
  • XFS can melt down with small file I/O
    • some directories have millions of files in them
    • rapidly changing metadata is costly in XFS
  • It is generally difficult for an operating system to predict and/or limit I/O bandwidth

Ad-Hoc Workarounds

  • limit max number of jobs in such a way that they don't thrash the resource
    • purely manual/guesswork
    • added load can render the fileserver unusable
  • preload compute nodes with data (e.g. pipeline jobs)
    • still manual - burden is on the researcher
    • can preload from gateway/project machine (inherent rate limiting)
    • difficult to scale

Engineered Workarounds

  • Lustre, GFS, Hadoop, CXFS, others
  • Defining I/O as a consumable resource in SLURM for scheduling and rate limiting (see the sketch after this list)
    • possible, but very unwieldy
    • limitations prevent general use (i.e. rate limiting is defined the wrong way in SLURM in this case)
  • Use or implement new ways of doing I/O
    • FTP/HTTP
      • requires passwords stored on compute clients
      • ftp/http servers are not designed for serving cluster data - i.e. no aggregate rate limiting
  • SLURM sbcast
    • unidirectional
    • for max efficiency you need to preallocate all nodes first
  • bittorrent
    • distributed storage of data
    • distributed bandwidth
    • increases odds of failure of any particular file
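
For the SLURM consumable-resource idea above, the closest native mechanism is cluster-wide licenses. The snippet below is only a sketch (the io_bw name and the counts are made up), and it also shows why the approach is unwieldy: the counts are a static, per-job declaration, and nothing actually measures or enforces the bandwidth a job ends up using.

  # slurm.conf (sketch): a cluster-wide pool of 100 hypothetical "io_bw"
  # units, each standing in for roughly 10 MiBytes/second of fileserver I/O.
  Licenses=io_bw:100

  # Job submission (sketch): reserve 5 units, i.e. declare "I will use ~50 MiBytes/s".
  sbatch -L io_bw:5 myjob.sh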

My Experimental Workaround

TCP/IP based client/server C++ reimplementation of read/write/open/close/lseek, using munge for authentication. A rough sketch of the client side follows the list below.

  • TCP/IP is good for large network I/O
  • trivial to write open/read/write/lseek/close wrapper (I just did)
  • can use munge to do automatic authentication from compute client back to data server
  • simple server code means we can buffer on 1MB boundaries (RAID stripe size - big win)
  • also can limit max number of simultaneous outstanding read/write calls
  • plugs right into libstatgen with no externally visible changes
  • possibly restartable
  • fails over to NFS when the data server is not found
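
To make the idea concrete, here is a minimal sketch of what the client side could look like. Everything in it (the message layout, the class and function names, the port number) is made up for illustration; the actual libstatgen hooks, the munge credential exchange, and real error handling are omitted.

  // Sketch only: a client-side remote read() forwarded over TCP.  Names,
  // message layout and port are illustrative; munge authentication and
  // error recovery are omitted.
  #include <cstdint>
  #include <string>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <unistd.h>

  // Hypothetical wire format: a fixed-size request header, then payload.
  struct Request {
      uint32_t opcode;   // 0=open, 1=read, 2=write, 3=lseek, 4=close
      int32_t  fd;       // server-side file descriptor
      uint64_t offset;   // file offset for read/write/lseek
      uint32_t length;   // bytes requested; assume the server caps this at 1MB (RAID stripe size)
  };

  class RemoteFile {
  public:
      // Connect to the (hypothetical) data server; the caller falls back to
      // plain NFS when this fails, as described above.
      bool connectServer(const std::string& host, uint16_t port = 8765) {
          sock_ = socket(AF_INET, SOCK_STREAM, 0);
          if (sock_ < 0) return false;
          sockaddr_in addr{};
          addr.sin_family = AF_INET;
          addr.sin_port = htons(port);
          if (inet_pton(AF_INET, host.c_str(), &addr.sin_addr) != 1 ||
              connect(sock_, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
              close(sock_);
              sock_ = -1;
              return false;
          }
          return true;
      }

      // Remote read(): send one request header, then receive the reply payload.
      ssize_t remoteRead(int remoteFd, void* buf, size_t count, uint64_t offset) {
          Request req{};
          req.opcode = 1;                                  // read
          req.fd     = remoteFd;
          req.offset = offset;
          req.length = static_cast<uint32_t>(count);
          if (!sendAll(&req, sizeof(req))) return -1;

          // In this sketch the reply is just a byte count followed by the data.
          uint32_t got = 0;
          if (!recvAll(&got, sizeof(got)) || got > count) return -1;
          if (!recvAll(buf, got)) return -1;
          return static_cast<ssize_t>(got);
      }

  private:
      // Loop until the full buffer is sent (TCP may send partial chunks).
      bool sendAll(const void* p, size_t n) {
          const char* c = static_cast<const char*>(p);
          while (n > 0) {
              ssize_t k = send(sock_, c, n, 0);
              if (k <= 0) return false;
              c += k; n -= static_cast<size_t>(k);
          }
          return true;
      }
      // Loop until the full buffer is received.
      bool recvAll(void* p, size_t n) {
          char* c = static_cast<char*>(p);
          while (n > 0) {
              ssize_t k = recv(sock_, c, n, 0);
              if (k <= 0) return false;
              c += k; n -= static_cast<size_t>(k);
          }
          return true;
      }
      int sock_ = -1;
  };

On the server side the same framing makes it straightforward to cap the number of outstanding requests and to do all disk I/O in 1MB-aligned chunks, which is where the rate limiting and RAID-friendly buffering mentioned above would live.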