Thesis Presentation V4

Transcript of Thesis Presentation V4

Page 1: Thesis Presentation V4

Cork Institute of Technology - Candidate for Master of Science Degree

Using Big Data Analytics in a Social Domain

Master’s in Cloud Computing 2013/2014

Ahmed Abdel-Aziz May 2015

EMCCAe, CISSP, PMP

Page 2: Thesis Presentation V4

Objective

1) Social Media, Analytics and the Marketing Campaign
2) Sentiment Analysis – Methodology & Techniques
3) The Need for Case-Study & an Analytics Prototype
4) Learning Outcomes & Future Work

Page 3: Thesis Presentation V4

Social Media & Social Analytics
Section 1 of 4

•  Social media sites are an offspring of the Web 2.0 movement – based on the cloud computing model (Software-as-a-Service)
•  88% of companies use social media for marketing

Page 4: Thesis Presentation V4

Marketing Campaign Lifecycle
Section 1 of 4

•  Consists of 5 phases
•  Social analytics answers social questions for each phase
   Ex: What is the sentiment trend?

Page 5: Thesis Presentation V4

Sentiment Analysis Methodology & Techniques
Section 2 of 4

•  Social analytics projects based on sentiment analysis benefit from a well-thought-out methodology

Page 6: Thesis Presentation V4

Sentiment Analysis Methodology & Techniques
Section 2 of 4

•  Social sentiment analysis starts with social listening
   –  Social listening can be performed using a variety of open source tools such as PostgreSQL, R, Wordle, and Circos, as well as tools such as Attensity 360 and Analyze
•  Social data comes from 3 main categories of sources
   –  Social user’s account – analytic capability is limited by the social media provider (FB, Twitter, LinkedIn)
   –  Social APIs – the social media provider offers an API to tap into social data, which allows development of unique analytic programs
   –  3rd party tools – provide very fast results but do not offer the same level of analytic capability as a custom program

Page 7: Thesis Presentation V4

Sentiment Analysis Methodology & Techniques
Section 2 of 4

•  Sentiment analysis techniques are grouped into two main categories:
   •  Supervised machine learning method
   •  Unsupervised method
•  Supervised learning method: learns the features/words that correlate with +ve/-ve sentiment, and can then identify the sentiment of new text
•  Unsupervised method: a lexicon is used with words pre-scored for polarity values; the sum of the scores gives the sentiment
•  Both techniques are widely used and offer comparable results
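
The unsupervised, lexicon-based idea on this slide can be sketched in a few lines of base R. The lexicon here is a tiny made-up word list (an assumption, not a published lexicon), and `score_sentiment` is a hypothetical helper name; a real project would load a full polarity lexicon instead.

```r
# Toy polarity lexicon: word -> score (illustrative values only, an assumption)
lexicon <- c(great = 1, love = 1, good = 1, bad = -1, slow = -1, hate = -1)

# Hypothetical helper: sum the lexicon scores of the words in one text
score_sentiment <- function(text, lexicon) {
  # lower-case the text, then split on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # keep only words found in the lexicon; unknown words contribute nothing
  scores <- lexicon[words[words %in% names(lexicon)]]
  total <- sum(scores)
  if (total > 0) "positive" else if (total < 0) "negative" else "neutral"
}

score_sentiment("I love the new phone, the camera is great", lexicon)
score_sentiment("Battery life is bad and the screen is slow", lexicon)
```

The first call sums to +2 and the second to -2, so the sign of the sum alone decides the label; texts with no lexicon words fall back to "neutral".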

Page 8: Thesis Presentation V4

Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Company launched a new product to market – the marketing campaign was launched long ago and is now in the Account Performance phase
•  Marketing team needs to measure upticks in the sentiment trend regarding the new product to take appropriate actions
•  Data science team believes continuous user surveys are ineffective and a computational approach is necessary -> better results and much less intrusive

Page 9: Thesis Presentation V4

Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Decision made to build a prototype of a tool to measure the sentiment trend, starting with Twitter specifically
•  Twitter found to be the social network of choice regarding brand and product sentiment topics -> thus Twitter
•  Data science team key objectives:
   –  Produce useful results quickly and cost-efficiently (Cloud Computing Value Proposition!?)
   –  Get buy-in from marketing management to build the full app

Page 10: Thesis Presentation V4

Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Applying the Analytic Project Lifecycle to the Prototype
   –  Data Preparation
      Input Data: raw tweets
      Output Data: clean tweet text ready for sentiment analysis
   –  Analytic Model Planning & Building
      Input Data: clean tweet text and learnt Naïve Bayesian model
      Output Data: sentiment of analyzed tweets
   –  Communicate Results
      Input Data: analyzed tweets and their sentiment
      Output Data: sentiment trend graph for both +ve and –ve sentiments
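
As an illustration of the Data Preparation step (raw tweets in, clean tweet text out), the sketch below strips typical tweet noise with base R regular expressions. `clean_tweet_text` is a hypothetical stand-in: it is not the `clean.text` function used by the prototype, whose body is not shown in the slides, and the exact cleaning rules are assumptions.

```r
# Hypothetical cleaning helper (assumed rules, not the prototype's clean.text)
clean_tweet_text <- function(txt) {
  txt <- gsub("http[^[:space:]]*", " ", txt)        # drop URLs
  txt <- gsub("\\bRT\\b", " ", txt, perl = TRUE)    # drop the retweet marker
  txt <- gsub("@\\w+", " ", txt)                    # drop @mentions
  txt <- gsub("[^[:alpha:][:space:]]", " ", txt)    # drop punctuation and digits
  txt <- tolower(txt)
  txt <- gsub("[[:space:]]+", " ", txt)             # collapse runs of whitespace
  trimws(txt)
}

# Made-up example tweets
raw <- c("RT @shop: Loving my new iPhone!!! http://t.co/abc123",
         "Screen cracked after 2 days :( #iPhone")
clean_tweet_text(raw)
# [1] "loving my new iphone"             "screen cracked after days iphone"
```

The function is vectorised, so it can be applied to the whole character vector of tweet texts in one call.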

Page 11: Thesis Presentation V4

Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Technology decisions made by data science team
   –  R programming language for social listening
   –  Twitter Social APIs for source of social data
   –  Leverage ready-made R packages to accelerate building time
   –  R programming for data preparation
   –  Leverage analytics cloud services such as Datumbox – supervised machine learning method using Naïve Bayesian
   –  R programming to build the main body of the prototype analytics application
   –  Make use of R plotting capabilities to present easy-to-understand results for non-technical members of the Marketing team
   –  Settle on the technologies to use to build the full-blown application dealing with much larger data sets – GPText/Pivotal HD
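
The prototype delegates classification to a cloud service, so the model itself does not appear in the slides. To show the idea behind a supervised Naïve Bayes sentiment classifier, here is a toy base R sketch. The four training tweets are made up, the function names (`tokenize`, `train_nb`, `predict_nb`) are hypothetical, and add-one smoothing is an assumption; this is not Datumbox's implementation.

```r
# Split a text into lower-case word tokens
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[words != ""]
}

# Learn log-priors and smoothed per-class word log-likelihoods
train_nb <- function(texts, labels) {
  tokens  <- lapply(texts, tokenize)
  vocab   <- unique(unlist(tokens))
  classes <- unique(labels)
  loglik  <- matrix(0, nrow = length(vocab), ncol = length(classes),
                    dimnames = list(vocab, classes))
  for (cl in classes) {
    counts <- table(factor(unlist(tokens[labels == cl]), levels = vocab))
    # add-one (Laplace) smoothing so unseen word/class pairs are not zero
    loglik[, cl] <- log((as.numeric(counts) + 1) / (sum(counts) + length(vocab)))
  }
  prior <- as.numeric(table(factor(labels, levels = classes))) / length(labels)
  list(vocab = vocab, classes = classes, loglik = loglik,
       logprior = setNames(log(prior), classes))
}

# Pick the class with the highest log-posterior score
predict_nb <- function(model, text) {
  words  <- tokenize(text)
  words  <- words[words %in% model$vocab]   # ignore words never seen in training
  scores <- model$logprior + colSums(model$loglik[words, , drop = FALSE])
  model$classes[which.max(scores)]
}

# Made-up labelled training data
texts  <- c("love this phone great camera", "great screen love it",
            "hate the battery bad phone",   "bad update slow and buggy")
labels <- c("positive", "positive", "negative", "negative")
model  <- train_nb(texts, labels)

predict_nb(model, "the camera is great")
predict_nb(model, "slow and bad battery")
```

With this training set the first text is labelled "positive" and the second "negative"; a realistic classifier would need far more labelled data.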

Page 12: Thesis Presentation V4

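Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Snippet of R code for the analytics application – Main Loop

    library(twitteR)  # provides searchTwitter()
    # clean.text() and getSentiment() are helpers assumed to be defined
    # elsewhere in the application (not shown on the slide)

    # each vector starts with an initial 0 and gains one count per polling round
    possentiments = 0
    negsentiments = 0
    for (poll in 1:5)
    {
      tweets = searchTwitter("iPhone", n=5, lang="en")
      tweet_txt = sapply(tweets, function(x) x$getText())
      tweet_clean = clean.text(tweet_txt)
      tweet_num = length(tweet_clean)
      # ........ (elided on the slide; tweet_df is assumed to be set up here)
      for (i in seq_len(tweet_num))
      {
        # second argument is the Datumbox API key (placeholder)
        tmp = getSentiment(tweet_clean[i], "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
        tweet_df$sentiment[i] = tmp$sentiment
        # ........... (elided on the slide)
      }
      possentiments <- c(possentiments, sum(tweet_df$sentiment=="positive"))
      negsentiments <- c(negsentiments, sum(tweet_df$sentiment=="negative"))
      Sys.sleep(5)  # wait 5 seconds before the next polling round
    }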

Page 13: Thesis Presentation V4

Need for Case-Study/Analytics Prototype
Section 3 of 4

•  Plotting the trend of both positive and negative sentiments
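
The graph on this slide is an image that the transcript does not preserve. A minimal sketch of how such a trend could be drawn with base R graphics follows. The counts are made-up placeholder values standing in for the `possentiments` and `negsentiments` vectors built by the main loop, and the output file name and colours are assumptions.

```r
# Placeholder counts per polling round (made-up values, not thesis results)
possentiments <- c(0, 2, 3, 1, 4, 3)
negsentiments <- c(0, 1, 0, 2, 1, 1)

# x axis: 0 is the initial entry, 1..5 are the polling rounds
rounds <- seq_along(possentiments) - 1

# Write to a PDF so the sketch also runs without a display
out_file <- file.path(tempdir(), "sentiment_trend.pdf")
pdf(out_file, width = 8, height = 5)
plot(rounds, possentiments, type = "o", col = "darkgreen",
     ylim = range(c(possentiments, negsentiments)),
     xlab = "Polling round", ylab = "Number of tweets",
     main = "Sentiment trend")
lines(rounds, negsentiments, type = "o", col = "red")
legend("topleft", legend = c("+ve sentiment", "-ve sentiment"),
       col = c("darkgreen", "red"), lty = 1, pch = 1)
invisible(dev.off())
```

Drawing both series on one set of axes lets a non-technical reader compare the +ve and -ve trends at a glance, which matches the stated goal of easy-to-understand results.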

Page 14: Thesis Presentation V4

Learning Outcomes
Section 4 of 4

•  Initial State
   –  Good foundation in cloud computing and data analytics
   –  Very little knowledge in social domain – not even an FB account :)
   –  Last coding experience was Java 13 years back
•  Initial research project stages
   –  Social media university
   –  Addictive analytics workshop -> Introduction to Marketing domain
   –  Pivotal workshop to learn data analytics in social domain -> Relevant Pivotal Data Analytics Platforms: GPText and Pivotal HD
•  Later research project stages – practical
   –  Learning enough about R to build a small-scale analytics application
   –  How to leverage the Datumbox analytics-as-a-service offering

Page 15: Thesis Presentation V4

Summary

•  Cloud, social, and data analytics synergy serves Marketing
•  "Is there an uptick in +ve/-ve sentiments of my product?" is a strategically important question in the Account Performance phase of a marketing campaign
•  The research answered the question using a computational approach based on a supervised learning method for sentiment analysis that is cloud-based
•  Data source and data analytics are in the cloud; data preparation and data presentation are on-premise using R
•  Future work: optimize & tune for large datasets -> can be all cloud