For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark you need to pull a lot of data into memory very quickly. Having a file system that supports high burst rates — up to network saturation — is a good thing. However, Hadoop’s eponymous file system (Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be.
What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and how that block matches to a file. (One implementation is a file allocation table or FAT of sorts.) In a distributed file system, the blocks are “distributed” among disks attached to multiple computers. Additionally, like RAID or most SAN systems, the blocks are duplicated so that if a node is lost from the network then no data is lost.