arXiv:1906.01815v1 [cs.CL] 5 Jun 2019

Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)

Santiago Castro†∗, Devamanyu HazarikaΦ∗, Verónica Pérez-Rosas†, Roger ZimmermannΦ, Rada Mihalcea†, Soujanya Poriaι

†Computer Science & Engineering, University of Michigan, USA
ΦSchool of Computing, National University of Singapore, Singapore
ιInformation Systems Technology and Design, SUTD, Singapore

{sacastro,vrncapr,mihalcea}@umich.edu, {hazarika,rogerz}@comp.nus.edu.sg, [email protected]

Abstract

Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight looking face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, the Multimodal Sarcasm Detection Dataset (MUStARD1), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD.

1 Introduction

Sarcasm plays an important role in daily conversations by allowing individuals to express their intent to mock or display contempt. It is achieved by using irony that reflects a negative connotation. For example, in the utterance Maybe it's a good thing we came here. It's like a lesson in what not to do, the sarcasm is explicit as the speaker expresses learning of a lesson in a positive light when, in reality, she means it in a negative way. However, there are also scenarios where sarcasm lacks explicit linguistic markers, thus requiring additional cues that can reveal the speaker's intentions. For instance, sarcasm can be expressed using a combination of verbal and non-verbal cues, such as a change of tone, overemphasis in a word, a drawn-out syllable, or a straight looking face. Moreover, sarcasm detection involves finding linguistic or contextual incongruity, which in turn requires further information, either from multiple modalities (Schifanella et al., 2016; Mishra et al., 2016a) or from the context history in a dialogue.

∗ Equal contribution.
1 MUStARD is an abbreviation for MUltimodal SARcasm Dataset. Similar to how "mustard" adds spice to our food and meals, we believe sarcasm adds spice to our interactions and lives.

This paper explores the role of multimodality and conversational context in sarcasm detection and introduces a new resource to further enable research in this area. More specifically, our paper makes the following contributions: (1) We curate a new dataset, MUStARD, for multimodal sarcasm research with high-quality annotations, including both multimodal and conversational context features; (2) We exemplify various scenarios where incongruity in sarcasm is evident across different modalities, thus stressing the role of multimodal approaches in solving this problem; (3) We introduce several baselines and show that multimodal models are significantly more effective than their unimodal variants; and (4) We also provide preceding turns in the dialogue, which act as context information. Consequently, we surmise that this property of MUStARD leads to a new sub-task for future work: sarcasm detection in conversational context.

The rest of the paper is organized as follows. Section 2 summarizes previous work on sarcasm detection using both unimodal and multimodal sources. Section 3 describes the dataset collection, the annotation process, and the types of sarcastic situations covered by our dataset. Section 4 explains how we extract features for the different modalities. Section 5 shows the experimental work around the new dataset, while Section 6 analyzes it. Finally, Section 7 offers conclusions and discusses open problems related to this resource.


[Figure 1: Sample sarcastic utterance in the dataset along with its context and transcript. Context: Joey: "Did you call the cops?" Rachel: "No, we took her to lunch." Target (sarcastic) utterance: Chandler: "Ah! Your own brand of vigilante justice."]


2 Related Work

Automated sarcasm detection has gained increased interest in recent years. It is a widely studied linguistic device whose significance is seen in sentiment analysis and human-machine interaction research. Various research projects have approached this problem through different modalities, such as text, speech, and visual data streams.

Sarcasm in Text: Traditional approaches for detecting sarcasm in text have considered rule-based techniques (Veale and Hao, 2010), lexical and pragmatic features (Carvalho et al., 2009), stylistic features (Davidov et al., 2010), situational disparity (Riloff et al., 2013), incongruity (Joshi et al., 2015), or user-provided annotations such as hashtags (Liebrecht et al., 2013).

Resources in this domain are collected using Twitter as a primary data source and are annotated using two main strategies: manual annotation (Riloff et al., 2013; Joshi et al., 2016a) and distant supervision through hashtags (Davidov et al., 2010; Abercrombie and Hovy, 2016). Other research leverages context to acquire shared knowledge between the speaker and the audience (Wallace et al., 2014; Bamman and Smith, 2015). A variety of contextual features have been explored, including the speaker's background and behavior on online platforms (Rajadesingan et al., 2015), embeddings of expressed sentiment and the speaker's personality traits (Poria et al., 2016), learning of user-specific representations (Wallace et al., 2016; Kolchinski and Potts, 2018), user-community features (Wallace et al., 2015), as well as stylistic and discourse features (Hazarika et al., 2018). In our dataset, we capitalize on the conversational format and provide context by including preceding utterances along with speaker identities. To the best of our knowledge, there is no prior work which deals with the task of sarcasm detection in conversation.

Sarcasm in Speech: Sarcasm detection in speech has mainly focused on the identification of prosodic cues in the form of acoustic patterns that are related to sarcastic behavior. Studied features include mean amplitude, amplitude range, speech rate, harmonics-to-noise ratio, and others (Cheang and Pell, 2008). Rockwell (2000) presented one of the initial approaches to this problem that studied the vocal tonalities of sarcastic speech. They found slower speaking rates and greater intensity as probable markers for sarcasm. Later, Tepperman et al. (2006) studied prosodic and spectral features of sound, both in and out of context, to determine sarcasm. In general, prosodic features such as intonation and stress are considered important indicators of sarcasm (Bryant, 2010; Woodland and Voyer, 2011). We take motivation from this previous research and include similar speech parameters as features in our dataset and baseline experiments.

Multimodal Sarcasm: Contextual information for sarcasm in text can be included from other modalities. These modalities help in providing additional cues in the form of both common and contrasting patterns. Prior work mainly considers multimodal learning for the readers' ability to perceive sarcasm. Such research couples textual features with cognitive features such as the gaze behavior of readers (Mishra et al., 2016a,b, 2017) or electro/magneto-encephalographic (EEG/MEG) signals (Filik et al., 2014; Thompson et al., 2016). In contrast, there is limited work exploring multimodal avenues to understand sarcasm conveyed by the opinion holder. Attardo et al. (2003) presented one of the preliminary explorations on this topic, where different phonological and visual markers for sarcasm were studied. However, this work did not analyze the interplay of the modalities. More recently, Schifanella et al. (2016) presented a multimodal approach for this task by considering visual content accompanying text in online sarcastic posts. They extracted semantic visual features from images using pre-trained networks and fused them with textual features. In our work, we extend these notions and propose to analyze video-based sarcasm in dialogues. To the best of our knowledge, ours is the first work to propose a resource on video-level sarcasm. Joshi et al. (2016b) proposed a dataset similar to ours, i.e., based on the TV show Friends. However, their corpus only includes the textual modality and is thus not multimodal in nature. Furthermore, we also analyze multiple challenges in sarcasm that call for multimodal learning and provide an evaluation setup for future works to test upon.

3 Dataset

To enable the exploration of multimodal sarcasm detection, we introduce a new dataset (MUStARD) consisting of short videos manually annotated for their sarcasm property.

3.1 Data Collection

To collect potentially sarcastic examples, we conduct web searches on different sources, mainly YouTube. We use keywords such as Friends sarcasm, Chandler sarcasm, Sarcasm 101, and Sarcasm in TV shows. Using this strategy, we obtain videos from three main TV shows: Friends, The Golden Girls, and Sarcasmaholics Anonymous. Note that during this initial search, we focus exclusively on sarcastic content. To obtain non-sarcastic videos, we select a subset of 400 videos from MELD, a multimodal emotion recognition dataset derived from the Friends TV series, originally collected by Poria et al. (2018). In addition, we collect videos from The Big Bang Theory, a TV show whose characters are often perceived as sarcastic. We obtain videos from seasons 1–8 and segment episodes using laughter cues from the audience. Specifically, we use open-source software for laughter detection (Ryokai et al., 2018) to obtain initial segmentation boundaries and fine-tune them using the subtitles' timestamps.

The collected set consists of 6,421 videos. Note that although some of the videos in our initial pool include information about their sarcastic nature, the majority of our videos are not labeled. Thus, we conduct a manual annotation as described next.

[Figure 2: Graphical user interface used by the annotators to label the videos in our dataset. Example utterance shown: "Can we maybe put the phones down and have an actual human conversation?"]

3.2 Annotation Process

We built a web-based annotation interface that shows each video along with its transcript and requests annotations for sarcasm. We also ask the annotators to flag misaligned videos, i.e., cases where the audio or video is not properly synchronized. The interface allows the annotators to watch a context video consisting of the previous video utterances whenever they deem it necessary. Given the large number of videos to be annotated, we request annotations in batches of four videos at a time. Our web interface is shown in Fig. 2.

We conduct the annotation in two steps. First, we annotate the videos from The Big Bang Theory, as it contains the largest set of videos. Second, we annotate the remaining videos, belonging to the other sources. The annotation is conducted by two graduate students who have first been provided with easy examples of explicit sarcastic content to illustrate sarcasm in videos. Each annotator labeled the full set of videos independently.

For the first step, after annotating the first part – consisting of 5,884 utterances from The Big Bang Theory – we noticed that the majority of them were labeled as non-sarcastic (98% were considered non-sarcastic by both annotators). In addition, our initial inter-annotator agreement was low (Kappa score of 0.1463). We thus decided to stop the annotation process and reconcile the annotation differences before proceeding further. The annotators discussed their disagreements for a subset of 20 videos and then re-annotated the videos. This time, we obtained an improved inter-annotator agreement of 0.2326. The annotation disagreements were reconciled by a third annotator, who identified the disagreement cases, watched the videos again, and decided the correct label for each one.

Next, we annotate the second part, consisting of 624 videos drawn from Friends, The Golden Girls, and Sarcasmaholics Anonymous. As before, the two annotators label each video independently. The inter-annotator agreement was calculated with a Kappa score of 0.5877. Again, the differences were reconciled by a third annotator.

The resulting set of annotations consists of 345 videos labeled as sarcastic and 6,020 videos labeled as non-sarcastic, for a total of 6,365 videos.

3.3 Transcriptions

Since we collect videos from several sources, some of them had subtitles or transcripts readily available. This is particularly the case for videos from Big Bang Theory and MELD. We use the MELD transcriptions directly. For Big Bang Theory, we extracted the transcript by applying manual substring matching on the episode subtitles. The remaining videos are manually transcribed.

3.4 Sarcasm Dataset: MUStARD

To enable our experiments, which focus explicitly on the multimodal aspects of sarcasm, we decided to work with a balanced sample of sarcastic and non-sarcastic videos. We thus obtain a balanced sample from the set of 6,365 annotated videos. We start by selecting all videos marked as sarcastic from the full set, and then we randomly obtain an equally sized non-sarcastic sample from the non-sarcastic subset by prioritizing the ones annotated by a larger number of annotators. Our dataset thus comprises 690 videos with an even number of sarcastic and non-sarcastic labels. Source, character, and label-ratio statistics are shown in Figs. 3 and 4.

In the remainder of this paper, we use the term utterance while referring to the videos in our dataset. We extend the definition of an utterance2 to include consecutive multi-sentence dialogues of the same speaker to prioritize completeness of information. As a result, 61.3% of the utterances from the dataset are single sentences, while the remaining utterances consist of two or more sentences. Each utterance in our dataset is coupled with its context utterances, which are preceding turns by the speakers participating in the dialogue. Some of the context videos contain multi-party dialogue between speakers participating in the scene. The number of turns in the context is manually set to include a coherent background of the target utterance. Table 1 shows general statistics for the utterances in our dataset.

2 An utterance is usually defined as a unit of speech bounded by breaths or pauses.

Statistics                         Utterance   Context
Unique words                       1,991       3,205
Avg. utterance length (tokens)     14          10
Max. utterance length (tokens)     73          71
Avg. duration (seconds)            5.22        13.95

Table 1: Dataset statistics by utterance and context.

Each utterance and its context consist of three modalities: video, audio, and transcription (text). Also, all the utterances are accompanied by their speaker identifiers. Fig. 1 illustrates a sarcastic utterance along with its associated context in the dataset. Fig. 4b provides the list of major characters present in the dataset. Fig. 4a details the distribution of labels per character. Some of the characters, such as Chandler and Sheldon, occupy major portions of the dataset. This is expected since they play comic roles in the shows. To avoid speaker bias of such popular characters, we also include non-sarcastic samples for these characters. In contrast, the dataset intentionally includes minor roles such as Dorothy from The Golden Girls, who is entirely sarcastic throughout the corpus. This allows the study of speaker bias for sarcasm detection.

3.5 Qualitative Aspects

Sarcasm detection in text often requires additional information that can be leveraged from associated modalities. Below, we analyze some cases that require multimodal reasoning. We exemplify using instances from our proposed dataset to further support our claim that sarcasm is often expressed in a multimodal way.

Role of Multimodality: Fig. 5 presents two cases where sarcasm is expressed through the incongruity between modalities.

[Figure 3: Ratio of the TV shows composing the dataset. (a) Source across the dataset: Friends 52%, Big Bang Theory 41%, The Golden Girls 6%, Sarcasmaholics 2%. (b) Sarcastic and non-sarcastic labels across the source shows (Friends, The Big Bang Theory, The Golden Girls, Sarcasmaholics Anonymous).]

[Figure 4: Speaker statistics. (a) Character-label ratio (sarcastic vs. non-sarcastic) per source show. (b) Overall character distribution, e.g., Chandler 23%, Sheldon 13%, Howard 7%, Dorothy 6%, Ross 6%, Penny 5%, Phoebe 5%, Rachel 5%, Joey 5%, Leonard 5%.]

[Figure 5: Incongruent modalities in sarcasm. 1) Chandler: "Oh my god! You almost gave me a heart attack!" (Text: suggests fear or anger; Audio: animated tone; Video: smirk, no sign of anxiety). 2) Sheldon: "It's just a privilege to watch your mind at work." (Text: suggests a compliment; Audio: neutral tone; Video: straight face).]

In the first case, the language modality indicates fear or anger, whereas the facial modality lacks any visible sign of anxiety that would corroborate the textual modality. In the second case, the text is indicative of a compliment, but the vocal tonality and facial expressions show indifference. In both cases, there exists incongruity between modalities, which acts as a strong indicator of sarcasm.

Multimodal information is also important in providing additional cues for sarcasm. For example, the vocal tonality of the speaker often indicates sarcasm. Text that otherwise looks straightforward is perceived as sarcastic only when the associated voices are heard. Sarcastic tonalities can range from a self-deprecating or brooding tone to something obnoxious and raging; such extremes are often seen when expressing sarcasm. Another marker of sarcasm is undue stress on particular words. For instance, in the phrase You did "really" well, if the speaker stresses the word really, then the sarcasm is evident. Fig. 6 provides sarcastic cases from the dataset where such vocal stresses exist.

It is important to note that sarcasm does not necessarily imply conflicting modalities.

[Figure 6: Vocal stress in sarcasm. Utterances: 1) Chandler: "Yes and we are very excited about it." 2) SA_man: "You got off to a really good start with the group." Remarks: Text and Video give a positive indication; Audio contains a stressed word.]

Rather, the availability of complementary information through multiple modalities improves the capacity of models to learn discriminative patterns responsible for this complex process.

Role of Context: In Fig. 7, we present two instances from the dataset where the role of conversational context is essential in determining the sarcastic nature of an utterance. In the first case, the sarcastic reference to the sun is apparent only when the topic of discussion is known, i.e., tanning. In the second case, the reference made by the speaker to a venus flytrap can be recognized as sarcastic only when it is known to be referred to as a thing to go on a date with. These examples demonstrate the importance of having contextual information. The availability of context in our proposed dataset provides models with the ability to utilize additional information while reasoning about sarcasm. Enhanced techniques would require commonsense reasoning to understand illogical statements (such as going on a date with a venus flytrap), which indicate the presence of sarcasm.

[Figure 7: Context importance in sarcasm detection. 1) Context: Chandler: "Hold on, there is something different." Ross: "I went to their tanning place your wife suggested." Utterance: Chandler: "Was that place the sun?" 2) Context: Dorothy: "Morning everybody, Rose honey I hope you don't mind, I borrowed your golf club, I have a date to play this morning." Blanche: "With a man?" Utterance: Dorothy: "No Blanche, with a venus fly trap."]

4 Multimodal Feature Extraction

We obtain several learning features from the three modalities included in our dataset. The process followed to extract each of them is described below.

Text Features: We represent the textual utterances in the dataset using BERT (Devlin et al., 2018), which provides a sentence representation u_t ∈ R^{d_t} for every utterance u. In particular, we average the last four transformer layers of the first token ([CLS]) in the utterance – using the BERT-Base model – to get a unique utterance representation of size d_t = 768. We also considered averaging Common Crawl pre-trained 300-dimensional GloVe word vectors (Pennington et al., 2014) for each token; however, it resulted in lower performance as compared to the BERT-based features.
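The following is a minimal sketch of this encoding step. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is named in the paper; the function encode_utterance is ours.

```python
# Minimal sketch: average the [CLS] vector from BERT's last four layers.
# Assumes the HuggingFace `transformers` library; the paper does not name a toolkit.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def encode_utterance(text: str) -> torch.Tensor:
    """Return a 768-dimensional utterance representation u_t."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of (embedding layer + 12 transformer layers),
    # each of shape (1, seq_len, 768)
    last_four = torch.stack(outputs.hidden_states[-4:])   # (4, 1, seq_len, 768)
    cls_vectors = last_four[:, 0, 0, :]                   # [CLS] from each of the 4 layers
    return cls_vectors.mean(dim=0)                        # (768,)

u_t = encode_utterance("It's just a privilege to watch your mind at work.")
print(u_t.shape)  # torch.Size([768])
```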

Speech Features: To leverage information from the audio modality, we obtain low-level features from the audio data stream for each utterance in the dataset. Through these features, we intend to provide information related to pitch, intonation, and other tonal-specific details of the speaker (Tepperman et al., 2006). We utilize the popular speech-processing library Librosa (McFee et al., 2018) and perform the processing pipeline described next. First, we load the audio sample for an utterance as a time series signal with a sampling rate of 22050 Hz. Then we remove background noise from the signal by applying a heuristic vocal-extraction method.3

Finally, we segment the audio signal into d_w non-overlapping windows to extract local features that include MFCCs, the mel spectrogram, the spectral centroid, and their associated temporal derivatives (deltas). Segmentation is done to achieve a fixed-length representation of the audio sources, which are otherwise variable in length across the dataset. All the extracted features are concatenated together to compose a d_a = 283 dimensional joint representation {u^a_i}_{i=1}^{d_w}, one vector per window. The final audio representation of each utterance is obtained by calculating the mean across the window segments, i.e., u^a = (1/d_w) Σ_i u^a_i ∈ R^{d_a}.

3 http://librosa.github.io/librosa/auto_examples/plot_vocal_separation.html#sphx-glr-auto-examples-plot-vocal-separation-py

[Figure 8: Feature extraction for the audio modality: 1) input audio (full spectrum), 2) vocal extraction (foreground), 3) audio features (MFCC, MFCC delta, mel spectrogram, mel spectrogram delta, spectral centroid). Utterance: "It's just a privilege to watch your mind at work."]

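As an illustration, here is a simplified sketch of this audio pipeline using Librosa. The vocal-separation step (footnote 3) is omitted, and the feature parameters (number of MFCCs, mel bands, number of windows) are our own choices, so the resulting dimensionality will not necessarily equal the paper's d_a = 283.

```python
# Simplified sketch of the audio feature pipeline (vocal separation omitted).
# Feature parameters below are assumptions, not the paper's exact configuration.
import librosa
import numpy as np

def encode_audio(path: str, n_windows: int = 10) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050)                           # time series at 22050 Hz

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # (13, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # (128, frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)       # (1, frames)
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc),
                       mel, librosa.feature.delta(mel),
                       centroid])                                  # (d, frames)

    # Segment the frame axis into non-overlapping windows, average within each
    # window, then average across windows to obtain a fixed-length vector u_a.
    # Assumes the utterance is long enough for every window to be non-empty.
    windows = np.array_split(feats, n_windows, axis=1)
    window_means = np.stack([w.mean(axis=1) for w in windows])     # (n_windows, d)
    return window_means.mean(axis=0)                               # (d,)

u_a = encode_audio("utterance.wav")
print(u_a.shape)
```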

Video Features: We extract visual features for each of the f frames in the utterance video using the pool5 layer of an ImageNet (Deng et al., 2009) pre-trained ResNet-152 (He et al., 2016) image classification model. We first preprocess every frame by resizing, center-cropping, and normalizing it. To obtain a visual representation of each utterance, we compute the mean of the obtained d_v = 2048 dimensional feature vectors u^v_i over the frames: u^v = (1/f) Σ_i u^v_i ∈ R^{d_v}. While we could use more advanced visual encoding techniques (e.g., recurrent neural network encoders), we decide to use the same averaging strategy as with the other modalities.
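A minimal sketch of this step, assuming torchvision for the pre-trained ResNet-152 and leaving video decoding out; replacing the final fully connected layer with an identity exposes the 2048-dimensional pooled (pool5) features.

```python
# Minimal sketch: mean-pooled ResNet-152 pool5 features over an utterance's frames.
# Assumes torchvision; frame loading/decoding is not shown.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()     # expose the 2048-d pooled output
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_video(frames) -> torch.Tensor:
    """frames: list of PIL images sampled from the utterance video."""
    batch = torch.stack([preprocess(f) for f in frames])   # (f, 3, 224, 224)
    with torch.no_grad():
        feats = resnet(batch)                               # (f, 2048)
    return feats.mean(dim=0)                                # u_v, (2048,)
```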

5 Experiments

To explore the role of multimodality in sarcasm detection, we conduct multiple experiments evaluating each modality separately as well as the combinations of modalities provided in the dataset. Additionally, we investigate the role of context and speaker information for improving predictions.

5.1 Experimental Setup

We perform two main sets of evaluations. The first set involves conducting five-fold cross-validation experiments where the folds are randomly created in a stratified manner. This is done to ensure label balance across folds. In each of the K iterations, the k-th fold acts as the testing set while the remaining folds are used for training. Validation folds can be obtained from a part of the training folds. As the folds are created in a randomized manner, there is overlap between speakers across the training and testing sets, thus resulting in a speaker-dependent setup. The second set of evaluations restricts utterances from the same speaker to appear either in the training or in the testing set. Utterances from The Big Bang Theory, The Golden Girls, and Sarcasmaholics Anonymous are made part of the training set, while Friends is used as the testing set.4 We call this the speaker-independent setup. Motivation for such a setup is discussed in Section 6.

During our experiments, we use precision, recall, and F-score as the main evaluation metrics, weighted across both the sarcastic and non-sarcastic classes. The weights are obtained based on the class ratios. For the speaker-dependent scenario, we report results averaged across the five cross-validation folds.
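A sketch of how this speaker-dependent evaluation loop could be set up with scikit-learn (the library the paper uses for its classifiers). The feature matrix X, label vector y, and the fixed C value are placeholders; the SVM baseline itself is described in Section 5.2.

```python
# Sketch of the speaker-dependent evaluation: stratified five-fold cross-validation
# with precision/recall/F-score weighted by the class ratios.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support
from sklearn.svm import SVC

def evaluate_speaker_dependent(X: np.ndarray, y: np.ndarray, C: float = 10.0):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(C=C, kernel="rbf", gamma="scale")   # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        p, r, f, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="weighted")
        scores.append((p, r, f))
    return np.mean(scores, axis=0)   # averaged precision, recall, F-score
```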

5.2 Baselines

The experiments are conducted using three main baseline methods:

Majority: This baseline assigns all the instances to the majority class, i.e., non-sarcastic.

Random: This baseline makes random (chance) predictions sampled uniformly across the test set.

SVM: We use Support Vector Machines (SVMs) as the primary baseline for our experiments. SVMs are strong predictors for small-sized datasets and at times outperform neural counterparts (Byvatov et al., 2003). We use the SVM classifiers from scikit-learn (Pedregosa et al., 2011) with an RBF kernel and a scaled gamma. The penalty term C is kept as a hyper-parameter which we tune for each experiment (we choose between 1, 10, 30, 500, and 1000). For the speaker-dependent setup, we scale the features by subtracting the mean and dividing them by the standard deviation. Multiple modalities are combined using early fusion, where the features drawn from the different modalities are concatenated together.
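Below is a sketch of this baseline as just described: early fusion by concatenation, standardization, and an RBF-kernel SVM with scaled gamma. The random matrices only stand in for the modality representations of Section 4 and are not real data.

```python
# Sketch of the SVM baseline with early fusion. Dummy random features stand in
# for the text (768-d), audio (283-d), and video (2048-d) representations.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def early_fusion(*modalities: np.ndarray) -> np.ndarray:
    """Concatenate per-utterance feature vectors from several modalities."""
    return np.concatenate(modalities, axis=1)

rng = np.random.default_rng(0)
text_feats, audio_feats, video_feats = (rng.normal(size=(690, d)) for d in (768, 283, 2048))
y = rng.integers(0, 2, size=690)          # sarcastic / non-sarcastic labels

X = early_fusion(text_feats, audio_feats, video_feats)   # (690, 3099)

# RBF-kernel SVM with scaled gamma; C is tuned from {1, 10, 30, 500, 1000} per
# experiment. Standardization is applied in the speaker-dependent setup.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=30))
clf.fit(X, y)
```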

6 Multimodal Sarcasm Classification

Table 2 presents the classification results for sarcasm prediction in the speaker-dependent setup.

4 Split details are released along with the dataset for consistent comparison by future works.

Algorithm   Modality   Precision   Recall    F-Score
Majority    -          25.0        50.0      33.3
Random      -          49.5        49.5      49.8
SVM         T          65.1        64.6      64.6
            A          65.9        64.6      64.6
            V          68.1        67.4      67.4
            T+A        66.6        66.2      66.2
            T+V        72.0        71.6      71.6
            A+V        66.2        65.7      65.7
            T+A+V      71.9        71.4      71.5
Δ (multimodal − unimodal)   ↑ 3.9%     ↑ 4.2%    ↑ 4.2%
Error rate reduction        ↑ 12.2%    ↑ 12.9%   ↑ 12.9%

Table 2: Speaker-dependent setup. All results are averaged across five folds, where each fold reports the weighted score across both the sarcastic and non-sarcastic classes.

Algorithm   Modality   Precision   Recall    F-Score
Majority    -          32.8        57.3      41.7
Random      -          51.1        50.2      50.4
SVM         T          60.9        59.6      59.8
            A          65.1        62.6      62.7
            V          54.9        53.4      53.6
            T+A        64.7        62.9      63.1
            T+V        62.2        61.5      61.7
            A+V        64.1        61.8      61.9
            T+A+V      64.3        62.6      62.8
Δ (multimodal − unimodal)   ↓ 0.4%    ↑ 0.3%    ↑ 0.4%
Error rate reduction        ↓ 1.1%    ↑ 0.8%    ↑ 1.1%

Table 3: Multimodal sarcasm classification, evaluated using the speaker-independent setup. Note: T=text, A=audio, V=video.

The lowest performance is obtained with the Majority baseline, which achieves a 33.3% weighted F-score (66.7% F-score for the non-sarcastic class and 0% for the sarcastic class). The pre-trained features for the visual modality provide the best performance among the unimodal variants. The addition of textual features through concatenation improves on the unimodal baseline and achieves the best performance. The tri-modal variant is unable to achieve the best score due to a slightly sub-optimal performance from the audio modality. Overall, the combination of visual and textual signals significantly improves over the unimodal variants, with a relative error rate reduction of up to 12.9%.

We manually investigate the utterances where the bimodal textual and visual model predicts sarcasm correctly while the unimodal textual model fails. In most of these samples, the textual component does not reveal any explicit sarcasm (see Fig. 9). As a result, the utterances require additional cues, which the model obtains from the multimodal signals.

The speaker-independent setup is more challenging than the speaker-dependent scenario, as it prevents the model from registering speaker-specific patterns. The presence of new speakers in the testing set requires a higher degree of generalization from the model.

Speaker    Utterance
Sheldon    Darn. If you weren't busy, I'd ask you to join us.
Chandler   I'm sorry, we don't have your sheep.
Chandler   I am sorry, it was a one time thing. I was very drunk and it was someone else's subconscious.

Figure 9: Sample sarcastic utterances correctly predicted by the T+V model but not by the T-only model in the speaker-dependent setup. The utterances either do not have explicit sarcastic markers or need commonsense reasoning to detect the irony, such as being drunk in someone else's subconscious.

Our setup also segregates the data at the source level; thus, testing involves an entirely new environment with respect to all the modalities. We believe that the speaker-independent setup is a strong test-bed for multimodal sarcasm research. The increased difficulty of this task is also noticed during model training, which now requires a smaller error margin (i.e., a higher C value) for the SVM's decision function to provide good test performance.

Table 3 presents the performance of our baselines in the speaker-independent setup. In this case, the multimodal variants do not greatly outperform their unimodal counterparts. Unlike in Table 2, the audio channel plays a more important role, and it is slightly improved by adding text. By inspecting the sarcastic examples correctly predicted by text plus audio but not by text alone, we observe a tendency towards a higher mean pitch (mean fundamental frequency) with respect to those incorrectly predicted, as Attardo et al. (2003) suggested. Failure cases seem to contain particular patterns of high pitch, also studied by Attardo et al. (2003), but on average they seem to have normal pitch. In this sense, future work can focus on analyzing the temporal localities of the audio channel.

In this setup, video features do not seem to work well. We hypothesize that, because the visual features capture generic object information (not specific to sarcasm) and the model is shallow, they may lead the model to capture character biases, which makes them unsuitable for the speaker-independent setup. This is also suggested by the statistics in Fig. 10, which we describe in the next section. By looking at the incorrect predictions of the best model, we infer that models should better capture the mismatches between the main speaker's facial expressions and the emotions of what is being said.

Setup                  Features        Precision   Recall   F-Score
Speaker-Dependent      T               65.1        64.6     64.6
                       + context       65.5        65.1     65.0
                       + speaker       67.7        67.2     67.3
                       Best (T+V)      72.0        71.6     71.8
                       + context       71.9        71.4     71.5
                       + speaker       72.1        71.7     71.8
Speaker-Independent    T               60.9        59.6     59.8
                       + context       57.9        54.5     54.1
                       + speaker       60.7        60.7     60.7
                       Best (T+A)      64.7        62.9     63.1
                       + context       65.2        62.9     63.0
                       + speaker       64.7        62.9     63.1

Table 4: Role of context and the utterance's speaker. Note: T=text, A=audio, V=video.

6.1 The Role of Context and Speaker Information

We investigate whether additional information, such as an utterance's context (i.e., the preceding utterances, cf. Section 3.5) and the speaker's identity, is helpful for the predictions. Context features are generated by averaging the representations (as per Section 4) of the utterances present in the context. For the speakers, we use a one-hot encoding vector whose size equals the number of unique speakers in a training fold.
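A small sketch of these two auxiliary inputs, under our own naming; the context representations are assumed to be the per-utterance vectors of Section 4.

```python
# Sketch of the auxiliary features: the context vector is the mean of the preceding
# utterances' representations, and the speaker vector is a one-hot encoding over the
# unique speakers of the training fold (unseen test speakers map to an all-zero vector).
import numpy as np

def context_feature(context_reps):
    """context_reps: list of per-utterance vectors (Section 4) for the context turns."""
    return np.mean(np.stack(context_reps), axis=0)

def speaker_feature(speaker, train_speakers):
    vocab = sorted(set(train_speakers))
    vec = np.zeros(len(vocab))
    if speaker in vocab:
        vec[vocab.index(speaker)] = 1.0
    return vec

# Toy usage with dummy vectors.
utterance_rep = np.ones(768)
ctx = [np.zeros(768), np.ones(768)]
x = np.concatenate([utterance_rep,
                    context_feature(ctx),
                    speaker_feature("Chandler", ["Chandler", "Ross", "Sheldon"])])
```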

Table 4 shows the results for both evaluation settings for the textual baseline and the best multimodal variant. For the context features, we see a slight improvement in the best variant of the speaker-independent setup (text plus audio); however, in the other models, there is no improvement. A possible reason could be the loss of temporal information when pooling across the conversation.

For the speaker features, we see an improvement in the speaker-dependent setup for the textual modality. Due to the speaker overlap across splits, the model can leverage speaker regularities for sarcastic tendencies. However, we do not observe the same trend for the best multimodal variant (text + video), where the score barely improves. To understand this result, we visualize the correct predictions made by this model. The results, as seen in Fig. 10, show a correlation between the class distributions among the overall ground truth and the correctly predicted instances per speaker. As this model does not use speaker information, this correlation indicates that the multimodal variant is able to learn speaker-specific information transitively through the input features, rendering additional speaker input redundant. Lastly, in the speaker-independent setup, the speaker information does not lead to improvement. This is also expected, as there is no speaker overlap between the splits.

[Figure 10: Correlation in speaker-specific sarcastic tendencies of the top-7 speakers (Chandler, Sheldon, Others, Howard, Dorothy, Ross, Joey): original sarcastic/non-sarcastic distribution vs. correct predictions. Predictions are obtained from the best performing model from Table 2. Speaker identifier features are not used.]


7 Conclusion and Future Work

In this paper, we provided a systematic introduction to multimodal learning for sarcasm detection. To enable research on this topic, we introduced a novel dataset, MUStARD, consisting of sarcastic and non-sarcastic videos drawn from different sources. By showing multiple examples from our curated dataset, we demonstrated the need for multimodal learning for sarcasm detection. Consequently, we developed models that leverage three different modalities, including text, speech, and visual signals. We also experimented with the integration of context and speaker information as additional input for our models.

The results of the baseline experiments supported the hypothesis that multimodality is important for sarcasm detection. In multiple evaluations, the multimodal variants were shown to significantly outperform their unimodal counterparts, with relative error rate reductions of up to 12.9%.

Moreover, while conducting this research, we identified several challenges that we believe are important to address in future work on multimodal sarcasm detection.

Multimodal fusion: So far, we have only explored early fusion for multimodal classification. Future work could investigate advanced spatiotemporal fusion strategies (e.g., Tensor Fusion (Zadeh et al., 2017), CCA (Hotelling, 1936)) to better encode the correspondence between modalities. Another direction could be to create fusion strategies that can better model the incongruity among modalities to identify sarcasm.

Multiparty conversation: The dialogues represented in our dataset are often multi-party conversations. Advanced techniques to learn multimodal relationships could incorporate better relationship modeling (Majumder et al., 2018) and exploit models that provide gesture, facial, and pose information about the people in the scene (Cao et al., 2018).

Neural baselines: As we strove to create a high-quality dataset with rich annotations, we had to trade off corpus size. Moreover, the occurrence of sarcastic utterances itself is scanty. To focus on the effects induced by the multimodal experiments, we chose a balanced version of the dataset with a limited size. This, however, raises the problem of over-fitting in complex neural models. As a consequence, in our initial experiments, we noticed that SVM classifiers perform better than their neural counterparts, such as CNNs. Future work should try to overcome this issue with solutions involving pre-training, transfer learning, domain adaptation, or low-parameter models.

Sarcasm detection in conversational context: Our proposed MUStARD is inherently a dialogue-level dataset where we aim to classify the last utterance in the dialogue. In a dialogue, to classify an utterance at time t, the preceding utterances at time < t can be considered as its context. In this work, although we utilize conversational context, we do not model various key conversation-specific factors such as interlocutors' goals, intents, dependencies, etc. (Poria et al., 2019). Considering these factors can improve the context modeling necessary for sarcasm detection in conversational context. Future work should try to leverage these factors to improve the baseline scores reported in this paper.

Main speaker localization: We currently extract visual features from each frame as a whole. As gestures and facial expressions are important features for sarcasm analysis, we believe the capability of models to identify the speakers in the multiparty videos is likely to be beneficial for the task.

Finally, we believe the resource introduced in this paper has the potential to enable novel research in multimodal sarcasm detection.

Acknowledgements

We are grateful to Gautam Naik for his help in curating part of the dataset from online resources. This research was partially supported by the Singapore MOE Academic Research Fund (grant #T1 251RES1820), by the Michigan Institute for Data Science, by the National Science Foundation (grant #1815291), by the John Templeton Foundation (grant #61156), and by DARPA (grant #HR001117S0026-AIDA-FP-045).


References

Gavin Abercrombie and Dirk Hovy. 2016. Putting sarcasm detection into context: The effects of class imbalance and manual labelling on supervised machine classification of Twitter conversations. In Proceedings of the ACL 2016 Student Research Workshop, pages 107–113.

Salvatore Attardo, Jodi Eisterhold, Jennifer Hay, and Isabella Poggi. 2003. Multimodal markers of irony and sarcasm. Humor, 16(2):243–260.

David Bamman and Noah A. Smith. 2015. Contextualized sarcasm detection on Twitter. ICWSM, 2:15.

Gregory A. Bryant. 2010. Prosodic contrasts in ironic speech. Discourse Processes, 47(7):545–566.

Evgeny Byvatov, Uli Fechner, Jens Sadowski, and Gisbert Schneider. 2003. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. Journal of Chemical Information and Computer Sciences, 43(6):1882–1889.

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2018. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008.

Paula Carvalho, Luís Sarmento, Mário J. Silva, and Eugénio De Oliveira. 2009. Clues for detecting irony in user-generated contents: Oh...!! it's so easy ;-). In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, pages 53–56. ACM.

Henry S. Cheang and Marc D. Pell. 2008. The sound of sarcasm. Speech Communication, 50(5):366–381.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 107–116. Association for Computational Linguistics.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ruth Filik, Hartmut Leuthold, Katie Wallington, and Jemma Page. 2014. Testing theories of irony processing using eye-tracking and ERPs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(3):811.

Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. 2018. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.

Aditya Joshi, Pushpak Bhattacharyya, Mark Carman, Jaya Saraswati, and Rajita Shukla. 2016a. How do cultural differences impact the quality of sarcasm annotation?: A case study of Indian annotators and American text. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 95–99.

Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. 2015. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 757–762.

Aditya Joshi, Vaibhav Tripathi, Pushpak Bhattacharyya, and Mark J. Carman. 2016b. Harnessing sequence labeling for sarcasm detection in dialogue from TV series 'Friends'. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 146–155.

Y. Alex Kolchinski and Christopher Potts. 2018. Representing social media users for sarcasm detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1115–1121.

C. C. Liebrecht, F. A. Kunneman, and A. P. J. van den Bosch. 2013. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37. New Brunswick, NJ: ACL.

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2018. DialogueRNN: An attentive RNN for emotion detection in conversations. arXiv preprint arXiv:1811.00405.

Brian McFee, Matt McVicar, Stefan Balke, Carl Thomé, Vincent Lostanlen, Colin Raffel, Dana Lee, Oriol Nieto, Eric Battenberg, Dan Ellis, Ryuichi Yamamoto, Josh Moore, WZY, Rachel Bittner, Keunwoo Choi, Pius Friesch, Fabian-Robert Stöter, Matt Vollrath, Siddhartha Kumar, nehz, Simon Waloschek, Seth, Rimvydas Naktinis, Douglas Repetto, Curtis "Fjord" Hawthorne, CJ Carr, João Felipe Santos, JackieWu, Erik, and Adrian Holovaty. 2018. librosa/librosa: 0.6.2.

Abhijit Mishra, Kuntal Dey, and Pushpak Bhattacharyya. 2017. Learning cognitive features from gaze data for sentiment and sarcasm classification using convolutional neural network. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 377–387.

Abhijit Mishra, Diptesh Kanojia, and Pushpak Bhattacharyya. 2016a. Predicting readers' sarcasm understandability by modeling gaze behavior. In AAAI, pages 3747–3753.

Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, and Pushpak Bhattacharyya. 2016b. Harnessing cognitive features for sarcasm detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1095–1104.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. 2016. A deeper look into sarcastic tweets using deep convolutional neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1601–1612.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. arXiv preprint arXiv:1905.02947.

Ashwin Rajadesingan, Reza Zafarani, and Huan Liu. 2015. Sarcasm detection on Twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 97–106. ACM.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 704–714.

Patricia Rockwell. 2000. Lower, slower, louder: Vocal cues of sarcasm. Journal of Psycholinguistic Research, 29(5):483–495.

Kimiko Ryokai, Elena Durán López, Noura Howell, Jon Gillick, and David Bamman. 2018. Capturing, representing, and interacting with laughter. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 358. ACM.

R. Schifanella, P. de Juan, J. Tetreault, L. Cao, et al. 2016. Detecting sarcasm in multimodal social platforms. In ACM Multimedia, pages 1136–1145. ACM.

Joseph Tepperman, David Traum, and Shrikanth Narayanan. 2006. "Yeah right": Sarcasm recognition for spoken dialogue systems. In Ninth International Conference on Spoken Language Processing.

Dominic Thompson, Ian G. Mackenzie, Hartmut Leuthold, and Ruth Filik. 2016. Emotional responses to irony and emoticons in written language: Evidence from EDA and facial EMG. Psychophysiology, 53(7):1054–1062.

Tony Veale and Yanfen Hao. 2010. Detecting ironic intent in creative comparisons. In ECAI, volume 215, pages 765–770.

Byron C. Wallace, Eugene Charniak, et al. 2015. Sparse, contextually informed models for irony detection: Exploiting user communities, entities and sentiment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1035–1044.

Byron C. Wallace, Laura Kertz, Eugene Charniak, et al. 2014. Humans require context to infer ironic intent (so computers probably do, too). In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 512–516.

Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. 2016. Modelling context with user embeddings for sarcasm detection in social media. In CoNLL 2016, page 167.

Jennifer Woodland and Daniel Voyer. 2011. Context and intonation in the perception of sarcasm. Metaphor and Symbol, 26(3):227–239.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114.