Hadoop presentation

Word cloud formation on a particular #hashtag By: Priyaj Kumar Amrendra Chaudhary Ritinkar Pramanik Saddam Hussain Aniket Roy

Transcript of Hadoop presentation

Page 1: Hadoop presentation

Word cloud formation on a particular #hashtag

By: Priyaj Kumar, Amrendra Chaudhary, Ritinkar Pramanik, Saddam Hussain, Aniket Roy

Page 2: Hadoop presentation

Introduction

The last decade saw a social media boom in which several services became widely popular and came to be used by people all over the world. Social media services such as Facebook, Twitter, and LinkedIn allow users to connect with friends, colleagues, and other entities relevant to their interests. An average Twitter user follows 80 users, so hundreds or even thousands of tweets reach the user every day. By analyzing social network data, meaningful information can be discovered, such as the popular topics users are discussing and the trends of important events. The primary goal of this work is to visualize time-varying Twitter text data as word clouds. The animated word clouds preserve the context while the focus changes. The visualization therefore not only provides an overview of huge time-varying Twitter text data but also helps users identify how the content changes over time.

Page 3: Hadoop presentation

Big Data

• Big: larger volume than you’ve handled before; no litmus test; high value, underutilized

• Data: structured, unstructured, semi-structured

• Hadoop: distributed file system; distributed batch computation

Page 4: Hadoop presentation

Hadoop

• Designed to solve problems that involve large amounts of data to process.

• It uses a divide-and-conquer approach to processing.

• Used to handle large and complex unstructured data that doesn’t fit into tables.

• Twitter data, being relatively unstructured, is well suited to storage in Hadoop.

• Hadoop also has many applications in online retailing, search engines, and finance (for example, risk analysis).
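
The divide-and-conquer idea can be illustrated with a minimal Python sketch. This is not Hadoop itself: it only mimics the pattern on one machine by splitting the input into chunks, counting words in each chunk independently, and merging the partial counts. All function names and the sample tweets are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Divide the input into chunks that could be processed independently."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """'Map' side: count the words inside a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """'Reduce' side: combine two partial counts into one."""
    return left + right


def word_count(lines, chunk_size=2):
    """Split, count each chunk, then merge the partial results."""
    chunks = split_into_chunks(lines, chunk_size)
    partial_counts = [count_words(chunk) for chunk in chunks]
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical sample input; real input would be far larger.
    sample = [
        "caught a pikachu today",
        "pikachu is everywhere",
        "servers are down again",
    ]
    print(word_count(sample).most_common(2))
```

On a real cluster the chunks would live on different machines and the counting would run in parallel; the sketch runs them one after another only to keep the example self-contained.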

Page 5: Hadoop presentation

A Canonical Hadoop Architecture

©2012 Cloudera, Inc.

[Diagram: Data Source → Flume → HDFS → Hive (Impala)]

Page 6: Hadoop presentation

Analyzing Twitter Data with Hadoop

USE CASE EXAMPLE

Page 7: Hadoop presentation

Analyzing Twitter

• Social media is popular with marketing teams

• Twitter is an effective tool for promotion

• Who is influential? Tweets, followers, retweets (similar to e-mail forwarding)

• Which Twitter user gets the most retweets?

• Who is influential in our industry?


Page 8: Hadoop presentation

Analyzing Twitter Data with Hadoop

HOW DO WE ANSWER THESE QUESTIONS

Page 9: Hadoop presentation

Techniques

• SQL: filtering, aggregation, sorting

• Complex data: deeply nested, variable schema
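
As a rough illustration of these three operations on nested, variable-schema records, the Python sketch below filters tweets that are retweets, aggregates them per original author, and sorts the result. The field names (`retweeted_status`, `user`, `screen_name`) are assumed for the example, and the sample records are made up.

```python
from collections import defaultdict


def most_retweeted_users(tweets, top_n=3):
    """Return (screen_name, retweet_count) pairs, most retweeted first."""
    totals = defaultdict(int)
    for tweet in tweets:
        # Filtering: keep only records that carry a nested retweet.
        original = tweet.get("retweeted_status")
        if not original:
            continue
        # Variable schema: the nested user block may be missing.
        name = original.get("user", {}).get("screen_name")
        if name is None:
            continue
        # Aggregation: count retweets per original author.
        totals[name] += 1
    # Sorting: highest count first, ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical records with an assumed field layout.
    sample = [
        {"text": "RT hello", "retweeted_status": {"user": {"screen_name": "alice"}}},
        {"text": "an original tweet"},
        {"text": "RT again", "retweeted_status": {"user": {"screen_name": "alice"}}},
        {"text": "RT hi", "retweeted_status": {"user": {"screen_name": "bob"}}},
    ]
    print(most_retweeted_users(sample))
```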

Page 10: Hadoop presentation

Architecture

[Diagram: Twitter → custom Flume source → sink to HDFS → Hive, where a JSON SerDe parses the data; Oozie adds partitions hourly]

Page 11: Hadoop presentation

What is Hive?

• Created at Facebook

• HiveQL: SQL-like interface

• Hive interpreter converts HiveQL to MapReduce code

• Returns results to the client
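
To give a feel for what converting a query to MapReduce means, the sketch below walks a simple grouped count through map, shuffle, and reduce steps in plain Python. It is a conceptual illustration only, not the plan Hive actually generates; the query in the comment, the `tweets` table, and the `hashtag` column are hypothetical.

```python
from collections import defaultdict

# Conceptual equivalent of a hypothetical query:
#   SELECT hashtag, COUNT(*) FROM tweets GROUP BY hashtag;


def map_phase(rows):
    """Emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["hashtag"], 1


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    # Hypothetical rows standing in for a table.
    rows = [
        {"hashtag": "#pokemongo"},
        {"hashtag": "#hadoop"},
        {"hashtag": "#pokemongo"},
    ]
    print(run_group_by_count(rows))
```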


Page 12: Hadoop presentation

Our Project (AIM)

• Sentiment analysis using Twitter data

• Word cloud formation

Page 13: Hadoop presentation

Working

The problem is to collect all tweets that contain the hashtag #pokemongo and infer meaning from the data.

Stages:

1. We collected all tweets that contain the hashtag #pokemongo through the Application Programming Interface (API) provided by Twitter.

2. We ran the collection script for about 20 hours, from 00:00 to 20:00.

3. This yielded 1.8 GB of data.

4. We extracted the data with timestamp.py and stored it in a .csv file.

5. Using R, we formed the word cloud and produced graphs.
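
The extraction and counting stages can be sketched as follows. The presentation does not show timestamp.py, and the word cloud itself was built in R, so this Python code is only a hypothetical stand-in. It assumes one JSON tweet per line with `created_at` and `text` fields and an assumed timestamp layout; all function names are invented for the example.

```python
import csv
import io
import json
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumed layout of the created_at field, e.g. "Sat Jul 16 00:05:00 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
WORD_PATTERN = re.compile(r"[#@]?\w+")


def extract_rows(json_lines, hashtag):
    """Yield (timestamp, text) for tweets whose text mentions the hashtag."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if hashtag.lower() not in text.lower():
            continue
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        yield created, text


def write_csv(rows, out_file):
    """Store the extracted rows as CSV with a header line."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "text"])
    for created, text in rows:
        writer.writerow([created.isoformat(), text])


def hourly_word_counts(rows):
    """Count words per hour; these frequencies could feed a word cloud."""
    counts = defaultdict(Counter)
    for created, text in rows:
        hour = created.strftime("%Y-%m-%d %H:00")
        counts[hour].update(w.lower() for w in WORD_PATTERN.findall(text))
    return counts


if __name__ == "__main__":
    # Hypothetical input lines; real input would come from the collected data.
    lines = [
        '{"created_at": "Sat Jul 16 00:05:00 +0000 2016", "text": "Caught one! #PokemonGo"}',
        '{"created_at": "Sat Jul 16 01:10:00 +0000 2016", "text": "unrelated tweet"}',
    ]
    rows = list(extract_rows(lines, "#pokemongo"))
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
    print(dict(hourly_word_counts(rows)))
```

Grouping the counts by hour is one simple way to obtain the time-varying frequencies that an animated word cloud needs; a different window size would work the same way.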

Page 14: Hadoop presentation

Conclusion

We proposed improved dynamic word clouds for visualizing time-varying text data.

The applied circular layout and animation methods preserve context while the focus is changing.

A system that can visualize information from Twitter not only gives users an overview of the contents of huge time-varying text data, but also assists them in identifying how the content changes over time.

Page 15: Hadoop presentation

Thank You!