What is Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
Hadoop Concepts
Distributed Reliable File System
- Apache Hadoop Distributed File System (HDFS)
- Inspired by Google File System
- Single Logical View of distributed Linux File Systems
Data is typically replicated 3 times
- Fault Tolerant
- Better I/O
Distributed Compute Framework & Resource Manager
- Apache MapReduce and YARN
- Inspired by Google MapReduce
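To give a feel for the MapReduce programming model, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, chosen only for illustration. In a real cluster the map and reduce steps would run in parallel on many machines, with the framework handling the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

The point of the split is that each map call depends only on its own line and each reduce call only on its own key, which is what lets the work be spread across a cluster.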
HDFS Blocks
Files are broken into chunks
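The chunking can be sketched as simple arithmetic. The example below is an illustration only, not HDFS code: it assumes a block size of 128 MiB (the usual default in recent Hadoop releases, configurable per cluster) and reuses the replication factor of 3 mentioned earlier. The function names split_into_blocks and raw_storage_bytes are hypothetical.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB   # assumed block size; configurable in a real cluster
REPLICATION = 3          # replication factor from the notes above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds whatever bytes remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage_bytes(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    size = 300 * MIB
    for offset, length in split_into_blocks(size):
        print(f"block at offset {offset // MIB} MiB, length {length // MIB} MiB")
    print(f"raw storage: {raw_storage_bytes(size) // MIB} MiB")
```

Under these assumptions a 300 MiB file becomes two full 128 MiB blocks plus one 44 MiB block, and takes 900 MiB of raw disk across the cluster once replicated.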