Motivation

- Can we do large-scale computing for data mining on a cluster architecture built from commodity hardware?
- Challenges
    - How to distribute computation?
    - How can we make it easy to write distributed programs?
    - How can we deal with machine failures?
- Solution idea
    - Move computation close to data to minimize data movement.
    - Store data redundantly on multiple nodes for reliability.
- Hadoop and Spark address these challenges by following this solution idea:
    - Hadoop's HDFS provides the storage infrastructure (a distributed file system).
    - MapReduce and Spark provide the programming model (see the sketch below).
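To make the programming model concrete, here is a minimal word-count sketch using PySpark's RDD API (an assumption for illustration; PySpark must be installed, and the input path `input.txt` is a placeholder). The point is that the distributed computation is written as ordinary transformations, while data placement and failure handling are left to the framework.

```python
from pyspark import SparkContext

# Minimal word-count sketch with Spark's RDD programming model.
# "local[*]" runs on all local cores; on a real cluster this would
# point to the cluster manager instead.
sc = SparkContext("local[*]", "WordCountSketch")

counts = (
    sc.textFile("input.txt")               # placeholder input path
      .flatMap(lambda line: line.split())  # split each line into words
      .map(lambda word: (word, 1))         # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)

print(counts.take(10))  # inspect a few (word, count) pairs
sc.stop()
```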
Hadoop
YouTube video on Hadoop: https://www.youtube.com/watch?v=aReuLtY0YMI
Hadoop consists of four major components that are specifically designed to work on big data:
- HDFS (Hadoop Distributed File System)
- MapReduce: programming model for parallel data processing (see the sketch after this list)
- YARN (Yet Another Resource Negotiator)
- Hadoop Common
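As an illustration of MapReduce-style processing, below is a minimal word-count sketch in the Hadoop Streaming style (the file names `mapper.py` / `reducer.py` and the streaming setup are assumptions for illustration, not part of the notes above). The mapper emits `(word, 1)` pairs; the reducer sums the counts per word, relying on the framework to sort records by key between the two phases.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): read text from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical): input arrives sorted by word, so counts for the
# same word are contiguous and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

These scripts would typically be submitted via the Hadoop Streaming jar, which splits the input across mappers and shuffles keys to reducers.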
1. HDFS (Hadoop Distributed File System)