Problem Solving with Large Data
This page is a work in progress; it describes my thoughts on the impact of large-data analysis problems on the cluster as a whole.
While compute-intensive jobs are also interesting, they are a better-defined problem.
Heavy I/O has traditionally caused us the most pain.
Big Picture Hardware Summary Today
- Gateways - dumbo, fantasia, snowwhite, wonderland
  - 128GiBytes RAM
  - 12-24 cores
  - no local storage
- SAN Storage (home directories) - more than 1PiByte of slow but reasonably robust storage
- Project Machines - 1000g, amd, esp, got2d, t2dgenes, bipolar
  - gateway-type compute server
  - 208 TiBytes of RAID60 storage (126 x 2TiByte disk drives each)
  - limited to project-specific analysis
  - each has a 4 x 8-core, 96GiByte local compute cluster (C6100)
  - total storage on these is now well past 1PiByte
- Compute nodes - approximately 78
  - reachable via SLURM
  - most have 8 cores
  - around 576 cores total on the public side, 736 counting the project mini-clusters
  - at least 32GiBytes RAM; many have 96GiBytes
- Strong 10 Gigabit network backbone
  - compute nodes are each 1 Gigabit, but the aggregating switch is dual 10 Gigabit
Large Analysis Goals
- read lots of BAM or FASTQ files
- do stuff with them
- write lots of results back
Dataflow Limits
- SAN Storage arrays provide ~50-250 MiBytes/second (gateway home directories)
  - low end is thrashing with many processes/users (>10-15 active)
  - high end is a smaller number of highly sequential reads/writes (3-10 active)
  - absolute maximum is around 500 MiBytes/second
- Project machine RAID60 provides 200-2000 MiBytes/second
  - low end is the thrashing case (>60-70 jobs)
  - high end is a smaller number of highly sequential I/O streams (5-20 active)
  - absolute theoretical maximum is almost 5000 MiBytes/second
  - highest observed maximum is around 2000 MiBytes/second
  - highest practical rate tends to be around 1000-1500 MiBytes/second
- All servers except compute nodes have 10 Gigabit/second links -> roughly 1.25 GBytes/second (~1.16 GiBytes/second) peak
- Compute nodes have 1 Gigabit/second links -> roughly 125 MBytes/second (~119 MiBytes/second) peak
I/O Bottlenecks and Challenges
Generally speaking, operating systems are not good at dealing with I/O bandwidth as a limited resource.
- typically, optimization is for fairness, not throughput
- local filesystem thrashing
- NFS increases thrashing (large reads/writes are broken down into smaller ones)
- impossible to rate limit I/O
- XFS can melt down with small-file I/O
  - some directories have millions of files in them
  - rapidly changing metadata is costly in XFS
- It is generally difficult for an operating system to predict and/or limit I/O bandwidth
Ad-Hoc Workarounds
- limit the max number of jobs so that they don't thrash the resource (see the sbatch sketch after this list)
  - purely manual/guesswork
  - added load can still render the fileserver unusable
- preload compute nodes with data (e.g. pipeline jobs)
  - still manual - the burden is on the researcher
  - can preload from the gateway/project machine (which gives inherent rate limiting)
  - difficult to scale
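One lightly engineered version of the job-count limit is SLURM's job-array throttle. The sketch below is illustrative only: the array size, the throttle of 20, and the script name remap_one_bam.sh are made-up values, not an actual pipeline.
 # Launch 1000 per-BAM tasks, but let at most 20 run at once, so the
 # project fileserver never sees more than ~20 concurrent readers.
 # (remap_one_bam.sh is a hypothetical per-task script.)
 sbatch --array=1-1000%20 remap_one_bam.sh
The throttle only caps how many tasks run at once; picking a number that keeps the fileserver in its sequential sweet spot is still guesswork.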
Engineered Workarounds
- Cluster/parallel filesystems - Lustre, GFS, Hadoop, CXFS, others
- Defining I/O as a consumable resource in SLURM for scheduling and rate limiting (see the sketch after this list)
  - possible, but very unwieldy
  - limitations prevent general use (i.e. rate limiting is defined the wrong way in SLURM for this case)
- Use or implement new ways of doing I/O
  - FTP/HTTP
    - requires passwords to be stored on the compute clients
    - FTP/HTTP servers are not designed for serving cluster data - i.e. no aggregate rate limiting
  - SLURM sbcast
    - unidirectional
    - for maximum efficiency, you need to preallocate all the nodes first
  - BitTorrent
    - distributed storage of the data
    - distributed bandwidth
    - increases the odds of failure for any particular file
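One way to approximate the "consumable resource" idea above with stock SLURM is its cluster-wide license counts, which model a fileserver's bandwidth as tokens that jobs must hold while they run. This is only a hedged sketch: the resource names (io_1000g, io_esp) and the counts are invented, and it only gates how many jobs start - it does not meter actual bytes per second, which is part of why this approach is unwieldy.
 # slurm.conf (sketch): treat each project fileserver's sequential
 # bandwidth as 100 cluster-wide tokens that jobs hold while running.
 Licenses=io_1000g:100,io_esp:100
 #
 # In a job script, hold 10 tokens (very roughly 10% of that
 # fileserver's sequential throughput) for the life of the job:
 #SBATCH --licenses=io_1000g:10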
My Experimental Workaround
A TCP/IP-based client/server C++ reimplementation of read/write/open/close/lseek, using munge for authentication.
- TCP/IP is good for large network I/O
- trivial to write an open/read/write/lseek/close wrapper (I just did)
- can use munge to do automatic authentication from the compute client back to the data server
- simple server code means we can buffer on 1MB boundaries (the RAID stripe size - a big win)
- can also limit the max number of simultaneous outstanding read/write calls
- plugs right into libstatgen with no externally visible changes
- possibly restartable
- fails over to NFS when the data server is not found (see the sketch below)
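To make the idea more concrete, here is a minimal sketch of what the client side of the open() wrapper could look like, including the munge handshake and the NFS failover. Everything specific is an assumption for illustration - the port 9000, the "OPEN <credential> <path>" wire format, and the function name remoteOpen are not the real implementation, and error handling is stripped to the bone.
 // Hypothetical client-side sketch (not the actual libstatgen code).
 #include <munge.h>
 #include <netdb.h>
 #include <sys/socket.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <cstdlib>
 #include <cstring>
 #include <string>

 // Open 'path' through the remote data server; if the server cannot be
 // reached, silently fail over to the local (NFS) path.
 static int remoteOpen(const char* serverHost, const char* path)
 {
     struct addrinfo hints, *res = NULL;
     memset(&hints, 0, sizeof(hints));
     hints.ai_family = AF_UNSPEC;
     hints.ai_socktype = SOCK_STREAM;
     if (getaddrinfo(serverHost, "9000" /* assumed port */, &hints, &res) != 0)
         return open(path, O_RDONLY);                    // failover to NFS

     int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
     if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) < 0)
     {
         if (sock >= 0) close(sock);
         freeaddrinfo(res);
         return open(path, O_RDONLY);                    // failover to NFS
     }
     freeaddrinfo(res);

     // munge_encode() builds a credential the server can verify with
     // munge_decode() to learn our uid/gid - no passwords on the client.
     char* cred = NULL;
     if (munge_encode(&cred, NULL, NULL, 0) != EMUNGE_SUCCESS)
     {
         close(sock);
         return open(path, O_RDONLY);                    // failover to NFS
     }

     // Assumed wire format: "OPEN <credential> <path>\n".
     std::string msg = std::string("OPEN ") + cred + " " + path + "\n";
     free(cred);
     if (write(sock, msg.data(), msg.size()) != (ssize_t)msg.size())
     {
         close(sock);
         return open(path, O_RDONLY);                    // failover to NFS
     }
     return sock;   // caller reads the file's bytes from this descriptor
 }
The server side would munge_decode() the credential to recover the client's uid/gid before honoring the request, and because all requests funnel through one simple server loop, that loop is also the natural place to enforce 1MB-aligned buffering and a cap on outstanding read/write calls.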