Hadoop presentation
Word cloud formation on a particular #hashtag
By: Priyaj Kumar, Amrendra Chaudhary, Ritinkar Pramanik, Saddam Hussain, Aniket Roy
Introduction
The last decade saw a social media boom in which several services became widely popular with people all over the world. Social media services such as Facebook, Twitter, and LinkedIn allow users to connect with their friends, colleagues, and other entities relevant to their interests. An average Twitter user follows 80 users, so hundreds or even thousands of tweets reach the user every day. By analyzing social network data, meaningful information can be discovered, such as the popular topics users are discussing and the trends of important events. The primary goal of this work is to visualize time-varying Twitter text data as word clouds. The animated word clouds preserve the context while the focus changes. The visualization therefore not only provides an overview of huge time-varying Twitter text data but also helps users identify how the content changes over time.
Big Data
• Big: larger volume than you've handled before; no litmus test; high value, underutilized
• Data: structured, unstructured, semi-structured
• Hadoop: distributed file system; distributed, batch computation
Hadoop
• Designed to solve problems that involve large amounts of data to process.
• It uses a divide-and-conquer approach to processing.
• Used to handle large, complex, unstructured data that doesn't fit into tables.
• Twitter data is relatively unstructured, so it is well suited to storage in Hadoop.
• Hadoop also has many applications in online retailing, search engines, risk analysis in finance, and other fields.
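The divide-and-conquer idea in the bullets above can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real Hadoop job would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map, shuffle and reduce over independent chunks of input."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Each string stands in for one block of a large file.
    chunks = ["hadoop stores big data", "Hadoop processes big data in batches"]
    print(word_count(chunks))
```

Because each chunk is mapped independently, the work can be split across as many workers as there are chunks; only the grouping step needs to bring matching keys together.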
A Canonical Hadoop Architecture
©2012 Cloudera, Inc.
Data Source → Flume → HDFS → Hive (Impala)
Analyzing Twitter Data with Hadoop
USE CASE EXAMPLE
Analyzing Twitter
• Social media is popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
• Tweets, followers, retweets
• Retweets are similar to e-mail forwarding
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
Analyzing Twitter Data with Hadoop
HOW DO WE ANSWER THESE QUESTIONS
Techniques
• SQL: filtering, aggregation, sorting
• Complex data: deeply nested, variable schema
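To make the three techniques concrete, here is a small Python sketch that answers "which user gets the most retweets?" over nested JSON tweets. It is an illustration under assumptions, not the pipeline from the slides: the function name `most_retweeted` is hypothetical, and the field names `retweeted_status`, `user` and `screen_name` are assumed to follow Twitter's classic JSON tweet format.

```python
import json
from collections import Counter


def most_retweeted(lines, top_n=3):
    """Return the top_n (screen_name, retweet_count) pairs.

    lines: iterable of strings, one JSON-encoded tweet per line.
    """
    counts = Counter()
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # variable-quality input: skip unparseable lines
        if not isinstance(tweet, dict):
            continue
        # Filtering: keep only tweets that are retweets of another tweet.
        original = tweet.get("retweeted_status")
        if not isinstance(original, dict):
            continue
        # Deeply nested data: the original author sits two levels down.
        user = original.get("user")
        name = user.get("screen_name") if isinstance(user, dict) else None
        if name:
            counts[name] += 1  # aggregation
    return counts.most_common(top_n)  # sorting, highest count first


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        '{"text": "RT hi", "retweeted_status": {"user": {"screen_name": "alice"}}}',
        '{"text": "just a tweet"}',
        '{"text": "RT yo", "retweeted_status": {"user": {"screen_name": "alice"}}}',
        '{"text": "RT ok", "retweeted_status": {"user": {"screen_name": "bob"}}}',
    ]
    print(most_retweeted(sample))
```

The `.get` lookups tolerate records that lack a field, which is the practical consequence of a variable schema.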
Architecture
• Flume: custom Flume source, sink to HDFS
• Hive: JSON SerDe parses the data
• Oozie: adds partitions hourly
What is Hive?
• Created at Facebook
• HiveQL: SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
Our Project (AIM)
• Sentiment analysis using Twitter data
• Word cloud formation
Working
The problem is to collect all tweets that contain the hashtag #pokemongo and infer meaning from the data.
Stages:
1. We collected all tweets that contain the hashtag #pokemongo, using the Application Programming Interface (API) provided by Twitter.
2. We ran the script for about 20 hours, from 00:00 to 20:00.
3. This collected 1.8 GB of data.
4. We extracted the data with timestamp.py and stored it in a .csv file.
5. Using R, we then formed the word cloud and made graphs.
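The extraction and counting steps can be sketched as follows. This is not the project's timestamp.py, whose contents are not shown in the slides, and the project built its word cloud in R rather than Python. The sketch assumes the input is a sequence of (timestamp, text) pairs with ISO 8601 timestamps; the names `hourly_word_counts` and `write_csv`, the stopword list, and the sample rows are all made up for illustration. It buckets tweets by hour and writes per-hour word frequencies to CSV, which is the kind of table a time-varying word cloud would be drawn from.

```python
import csv
import io
import re
from collections import Counter, defaultdict
from datetime import datetime

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "rt"}
WORD_RE = re.compile(r"[#@]?\w+")  # keeps #hashtags and @mentions whole


def hourly_word_counts(rows):
    """Map hour of day -> Counter of words seen in that hour.

    rows: iterable of (iso_timestamp, text) pairs.
    """
    buckets = defaultdict(Counter)
    for stamp, text in rows:
        hour = datetime.fromisoformat(stamp).hour
        words = (w.lower() for w in WORD_RE.findall(text))
        buckets[hour].update(w for w in words if w not in STOPWORDS)
    return buckets


def write_csv(buckets, out):
    """Write one (hour, word, count) row per word, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["hour", "word", "count"])
    for hour in sorted(buckets):
        for word, count in buckets[hour].most_common():
            writer.writerow([hour, word, count])


if __name__ == "__main__":
    # Made-up sample rows, not data from the project.
    rows = [
        ("2016-07-20T00:15:00", "Playing #pokemongo in the park"),
        ("2016-07-20T00:40:00", "RT #PokemonGo is fun"),
        ("2016-07-20T01:05:00", "#pokemongo servers down"),
    ]
    out = io.StringIO()
    write_csv(hourly_word_counts(rows), out)
    print(out.getvalue())
```

Keeping one frequency table per hour is what lets the word cloud be animated: each frame is drawn from the next hour's counts.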
Conclusion
We proposed improved dynamic word clouds for visualizing time-varying text data.
The circular layout and animation methods we applied preserve context while the focus changes.
A system that can visualize information in Twitter not only gives users an overview of the contents of huge time-varying text data, but also helps them identify how the content changes over time.
Thank You!