Motivation

- Can we do large-scale computing for data mining on a cluster architecture built from commodity hardware?
- Challenges
    - How to distribute computation?
    - How can we make it easy to write distributed programs?
    - How can we deal with machine failures?
- Solution idea
    - Move computation close to data to minimize data movement.
    - Store data redundantly on multiple nodes for reliability.
- Hadoop and Spark address these challenges by following this solution idea:
    - Hadoop's HDFS provides the storage infrastructure (a distributed file system).
    - MapReduce and Spark provide the programming model (see the sketch below).
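To make the programming model concrete, here is a minimal word-count sketch using PySpark's RDD API (an assumption for illustration; PySpark must be installed, and the input path `input.txt` is a placeholder). The point is that the distributed computation is written as ordinary transformations, while data placement and failure handling are left to the framework.

```python
from pyspark import SparkContext

# Minimal word-count sketch with Spark's RDD programming model.
# "local[*]" runs on all local cores; on a real cluster this would
# point to the cluster manager instead.
sc = SparkContext("local[*]", "WordCountSketch")

counts = (
    sc.textFile("input.txt")               # placeholder input path
      .flatMap(lambda line: line.split())  # split each line into words
      .map(lambda word: (word, 1))         # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)

print(counts.take(10))  # inspect a few (word, count) pairs
sc.stop()
```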
Hadoop
YouTube video on Hadoop: https://www.youtube.com/watch?v=aReuLtY0YMI
Hadoop consists of four major components that are specifically designed to work on big data:
- HDFS (Hadoop Distributed File System)
- MapReduce: programming model for parallel data processing (see the sketch after this list)
- YARN (Yet Another Resource Negotiator)
- Hadoop Common
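As an illustration of MapReduce-style processing, below is a minimal word-count sketch in the Hadoop Streaming style (the file names `mapper.py` / `reducer.py` and the streaming setup are assumptions for illustration, not part of the notes above). The mapper emits `(word, 1)` pairs; the reducer sums the counts per word, relying on the framework to sort records by key between the two phases.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): read text from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical): input arrives sorted by word, so counts for the
# same word are contiguous and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

These scripts would typically be submitted via the Hadoop Streaming jar, which splits the input across mappers and shuffles keys to reducers.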
1. HDFS (Hadoop Distributed File System)