DP-IV presentation - ashutosh
![Page 1: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/1.jpg)
Performance analysis of C-means Clustering on Big Data using Hadoop
Fuzzy C-means
Guided by: Prof. A. J. Umbarkar
Presented by: A. S. Sathe
BROAD AREA : DISTRIBUTED COMPUTING, DATA MINING
SUB AREA: CLUSTERING ALGORITHMS, DATA CLUSTERING
![Page 2: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/2.jpg)
Presentation Agenda
• Literature Survey
• Problem Statement
• Objectives achieved
• Results
• Future Scope
• References
![Page 3: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/3.jpg)
Data Growth Rate [7]
![Page 4: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/4.jpg)
Relevance
• Data Clustering: partitioning a data set into groups of similar items based on some criterion
• Big Data: a volume of data that is difficult to process using traditional database and software techniques
• Hadoop: a distributed computing framework based on the MapReduce architecture
• Document Clustering
  • Text-based data stored in files or in an unstructured format
  • Based on text properties such as word frequency, supplied keywords, etc.
  • Text properties are used as the similarity criteria
  • Documents are grouped according to these similarity criteria
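As a rough illustration of using word frequency as a similarity criterion, the sketch below builds a term-frequency vector per document and compares two documents with cosine similarity. The function names, the whitespace tokenizer, and the choice of cosine similarity are assumptions made for this example; the slides do not say which text features or similarity measure were used.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for words that are missing from tf_b.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents, used only to exercise the functions.
doc_a = term_frequencies("hadoop cluster runs mapreduce jobs")
doc_b = term_frequencies("mapreduce jobs run on a hadoop cluster")
doc_c = term_frequencies("fuzzy membership values sum to one")

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Documents that share many words score close to 1, while documents with no words in common score 0, which is the kind of signal a clustering algorithm can group on.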
![Page 5: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/5.jpg)
![Page 6: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/6.jpg)
Relevance
• Need for data clustering
  • Data mining is used for Knowledge Discovery from Data (KDD)
  • It is based on historical data
  • Historical data may be Big Data
  • Processing Big Data is a very tedious task
  • Data clustering is a preprocessing step for Big Data processing
  • The processed data is then used for data mining
  • Clustered data gives better results than randomly placed data
![Page 7: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/7.jpg)
Relevance
• Why text clustering
  • Text is a type of unstructured data
  • Free from any database constraints
  • Files can be very large without any restrictions
  • Real-world uses of text clustering
    • Retrieve, filter, and categorize documents
    • Information retrieval
  • Clustered data is useful for Knowledge Data Retrieval
![Page 8: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/8.jpg)
Relevance
• Why Hadoop
  • Distributed framework
  • Can use processor capacity on the fly
  • Made for Big Data processing
![Page 9: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/9.jpg)
Problem Statement
• “Performance Analysis of C-means Clustering on Big Data using Hadoop.”
![Page 10: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/10.jpg)
Objectives achieved
• Design of a processing model of the Fuzzy C-Means algorithm for Map-Reduce
• Implementation of the C-means algorithm on Map-Reduce
• Testing and performance analysis of the above algorithm with Big Data on Map-Reduce
• Comparison of C-means with other equivalent works
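The slides do not reproduce the processing model in text, so the following is only a sketch of one common way to split a Fuzzy C-Means iteration into map and reduce steps; it is not the presenter's design. It runs in plain Python on one-dimensional points rather than on Hadoop, and every name in it (`memberships`, `map_point`, `reduce_cluster`, `fcm_iteration`) is hypothetical. The map step emits, for each point and cluster, the partial sums needed for a weighted mean; the reduce step adds them up per cluster to get the new centroid.

```python
from collections import defaultdict


def memberships(x, centroids, m=2.0):
    """Fuzzy membership of point x in every cluster (standard FCM formula)."""
    dists = [abs(x - c) for c in centroids]
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        # Point sits exactly on a centroid: give it full membership there.
        return [1.0 if i == zeros[0] else 0.0 for i in range(len(dists))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


def map_point(x, centroids, m=2.0):
    """Map step: emit (cluster index, (u^m * x, u^m)) for one data point."""
    for j, u in enumerate(memberships(x, centroids, m)):
        weight = u ** m
        yield j, (weight * x, weight)


def reduce_cluster(values):
    """Reduce step: combine partial sums into one weighted-mean centroid."""
    numerator = sum(v[0] for v in values)
    denominator = sum(v[1] for v in values)
    return numerator / denominator


def fcm_iteration(points, centroids, m=2.0):
    """Run one map/shuffle/reduce round in-process and return new centroids.

    Assumes a non-empty list of points.
    """
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle-and-sort
    for x in points:
        for key, value in map_point(x, centroids, m):
            grouped[key].append(value)
    return [reduce_cluster(grouped[j]) for j in range(len(centroids))]


# Hypothetical data: two obvious groups, starting centroids 3 and 11.
points = [2.0, 3.0, 4.0, 10.0, 11.0, 12.0]
print(fcm_iteration(points, [3.0, 11.0]))
```

Because each mapper only needs the current centroids and its own slice of the data, the per-point work can be spread across nodes, and only small partial sums travel to the reducers.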
![Page 11: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/11.jpg)
Fuzzy C-means Clustering
![Page 12: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/12.jpg)
Fuzzy C-means Clustering
![Page 13: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/13.jpg)
Fuzzy C-means Clustering
![Page 14: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/14.jpg)
Fuzzy C-means Clustering
![Page 15: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/15.jpg)
Fuzzy C-means Clustering
![Page 16: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/16.jpg)
Fuzzy C-means Clustering
![Page 17: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/17.jpg)
Fuzzy C-means Clustering
![Page 18: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/18.jpg)
Fuzzy C-means Clustering
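• For example: we have initial centroids 3 and 11 (with m = 2)
• For node 2 (1st element):
  U11 = the membership of the first node in the first cluster
  U12 = the membership of the first node in the second cluster

$$U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} \approx 98.78\%$$

$$U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} \approx 1.22\%$$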
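The membership calculation for this example (data point 2, centroids 3 and 11, m = 2) can be checked with a few lines of Python. This is a minimal sketch of the standard Fuzzy C-Means membership formula for one-dimensional data; the function name `fcm_memberships` is made up for illustration, and the code assumes the point does not coincide with any centroid.

```python
def fcm_memberships(x, centroids, m=2.0):
    """Membership of point x in each cluster, for 1-D data.

    Assumes x is not exactly equal to any centroid (a zero distance
    would cause a division by zero).
    """
    distances = [abs(x - c) for c in centroids]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


u11, u12 = fcm_memberships(2.0, [3.0, 11.0], m=2.0)
print(f"U11 = {u11:.2%}, U12 = {u12:.2%}")  # prints "U11 = 98.78%, U12 = 1.22%"
```

The two memberships sum to 1, as they must for any point under this formula.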
![Page 19: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/19.jpg)
Dataset Conversion
![Page 20: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/20.jpg)
Hadoop based K-Means on Documents
![Page 21: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/21.jpg)
Fuzzy C-Means
on Documents
![Page 22: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/22.jpg)
Hadoop based Fuzzy C-Means on Documents
![Page 23: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/23.jpg)
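Results
Experimental Setup

| Algorithm | Split | 3 Centroids, 4 Itr | 3 Centroids, 6 Itr | 4 Centroids, 4 Itr | 4 Centroids, 6 Itr | 5 Centroids, 4 Itr | 5 Centroids, 6 Itr | 6 Centroids, 4 Itr | 6 Centroids, 6 Itr |
|---|---|---|---|---|---|---|---|---|---|
| Classical K-Means | Not Applicable | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 4 MB | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 8 MB | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based K-Means | 16 MB | | | | | | | | |
| Hadoop Based K-Means | 32 MB | √ | √ | √ | √ | √ | √ | √ | √ |
| Classical Fuzzy C-Means | Not Applicable | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 4 MB | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 8 MB | √ | √ | √ | √ | √ | √ | √ | √ |
| Hadoop Based Fuzzy C-Means | 16 MB | | | | | | | | |
| Hadoop Based Fuzzy C-Means | 32 MB | √ | √ | √ | √ | √ | √ | √ | √ |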
![Page 24: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/24.jpg)
[Figure: K-Means execution time. Bars for Classical K-Means, 2 Node K-Means, 4 Node K-Means and 8 Node K-Means (axis: No. of Nodes) against Time (Sec), 0 to 1000; series: 3, 4, 5 and 6 centroids.]
![Page 25: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/25.jpg)
[Figure: Fuzzy C-Means (FCM) execution time. Bars for Classical FCM, 2 Node FCM, 4 Node FCM and 8 Node FCM (axis: No. of Nodes) against Time in sec, 0 to 9000; series: 3, 4, 5 and 6 centroids.]
![Page 26: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/26.jpg)
[Figure, left panel: 4MB Split KM Performance. Speedup against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR and 6 ITR. Caption: Speedup Comparison of KM w.r.t. HKM]
[Figure, right panel: 4MB Split FCM Performance. Speedup against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR and 6 ITR. Caption: Speedup Comparison of FCM w.r.t. HFCM]
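The slides do not state how speedup was computed. The conventional definition, assumed here, is the classical single-machine running time divided by the Hadoop-based running time for the same workload. The sketch below applies that definition; the function name and all timings are hypothetical placeholders chosen for illustration, not measurements from this work.

```python
def speedup(t_classical, t_hadoop):
    """Speedup of a Hadoop run relative to the classical run (assumed definition)."""
    if t_hadoop <= 0:
        raise ValueError("Hadoop running time must be positive")
    return t_classical / t_hadoop


# Placeholder timings in seconds; NOT results from the presentation.
classical_time = 900.0
hadoop_times_by_nodes = {2: 600.0, 4: 450.0, 8: 300.0}

for nodes, t in sorted(hadoop_times_by_nodes.items()):
    print(f"{nodes} nodes: speedup = {speedup(classical_time, t):.2f}")
```

A value above 1 means the distributed run finished faster than the classical one; a value below 1 means the overhead of distribution outweighed the gain.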
![Page 27: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/27.jpg)
[Figure, left panel: 8MB Split HKM Performance. Speedup against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR and 6 ITR. Caption: Speedup Comparison of KM w.r.t. HKM]
[Figure, right panel: 8MB Split HFCM Performance. Speedup against No. of Nodes (2 Node, 4 Node, 8 Node); series: 4 ITR and 6 ITR. Caption: Speedup Comparison of FCM w.r.t. HFCM]
![Page 28: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/28.jpg)
[Figure: HKM and HFCM speedup performances and comparison. Two panels, 4 Iterations and 6 Iterations; each shows Speedup for HKM and HFCM at 4 MB, 8 MB and 32 MB splits; series: 2 Node, 4 Node and 8 Node.]
![Page 29: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/29.jpg)
Analysis based on cluster sizes
[Figure: Time for KM, 2 Node HKM, 4 Node HKM and 8 Node HKM; series: 3, 4, 5 and 6 Centroids.]
Average FCM and HFCM time consumption w.r.t. cluster sizes
CONT…
![Page 30: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/30.jpg)
Average KM and HKM time consumption w.r.t. cluster sizes
[Figure: Time for FCM, 2 Node HFCM, 4 Node HKM and 8 Node HKM (labels as in the source); series: 3, 4, 5 and 6 Centroids.]
![Page 31: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/31.jpg)
Future Scope
![Page 32: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/32.jpg)
Paper publication
• Submitted to IEEE CONECCT 2015
![Page 33: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/33.jpg)
Tools and Platform Required
1. Text Dataset
4. Hadoop 1.2.1
5. JDK 1.6
6. O.S.: Ubuntu 14.04
![Page 34: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/34.jpg)
References
1. Cui, Xiaoli et al. "Optimized big data K-means clustering using MapReduce." The Journal of Supercomputing, Vol. 70, pp. 1249-1259, 2014.
2. Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR), Vol. 31, pp. 264-323, 1999. DOI:10.1145/331499.331504
3. Zhao, Weizhong et al. "Parallel k-means clustering based on MapReduce." In Cloud Computing, Springer Berlin Heidelberg, Vol. 5931, pp. 674-679, 2009.
4. Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters." In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-9. IEEE, 2010. DOI:10.1109/IPDPSW.2010.5470880
![Page 35: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/35.jpg)
References (cont...)
5. J. Dean, S. Ghemawat. "MapReduce." Commun. ACM, 51(1), p. 107, Jan 2008.
6. A. Asuncion and D. J. Newman. UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/ (accessed: 07-Jan-2015)
7. https://www.linkedin.com/pulse/big-data-whats-deal-debarchan-sarkar (accessed: 09-Apr-2015)
![Page 36: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/36.jpg)
QUESTIONS???
![Page 37: DP-IV presentation - ashutosh](https://reader035.fdocuments.es/reader035/viewer/2022062902/58eef8681a28ab817b8b4591/html5/thumbnails/37.jpg)
Thank You