Info/CS 4300: Language and Information - in-class demo

Sentiment analysis

Discovering polarity lexicons from labeled data

In [32]:
%matplotlib inline

from __future__ import print_function, division
import json
from operator import itemgetter
from collections import defaultdict

import numpy as np
from matplotlib import pyplot as plt
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC
In [33]:
## loading movie review data: 
## http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

data = load_files('txt_sentoken')

## First review and its label (0 = negative, 1 = positive):
print(data.data[0])
print(data.target[0])
b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is that weak , but it's better than the other blockbuster right now ( sleepy hollow ) , but it makes the world is not enough look like a 4 star film . \nanyway , this definitely doesn't seem like an arnold movie . \nit just wasn't the type of film you can see him doing . \nsure he gave us a few chuckles with his well known one-liners , but he seemed confused as to where his character and the film was going . \nit's understandable , especially when the ending had to be changed according to some sources . \naside form that , he still walked through it , much like he has in the past few films . \ni'm sorry to say this arnold but maybe these are the end of your action days . \nspeaking of action , where was it in this film ? \nthere was hardly any explosions or fights . \nthe devil made a few places explode , but arnold wasn't kicking some devil butt . \nthe ending was changed to make it more spiritual , which undoubtedly ruined the film . \ni was at least hoping for a cool ending if nothing else occurred , but once again i was let down . \ni also don't know why the film took so long and cost so much . \nthere was really no super affects at all , unless you consider an invisible devil , who was in it for 5 minutes tops , worth the overpriced budget . \nthe budget should have gone into a better script , where at least audiences could be somewhat entertained instead of facing boredom . \nit's pitiful to see how scripts like these get bought and made into a movie . \ndo they even read these things anymore ? \nit sure doesn't seem like it . \nthankfully gabriel's performance gave some light to this poor film . \nwhen he walks down the street searching for robin tunney , you can't help but feel that he looked like a devil . \nthe guy is creepy looking anyway ! \nwhen it's all over , you're just glad it's the end of the movie . \ndon't bother to see this , if you're expecting a solid action flick , because it's neither solid nor does it have action . \nit's just another movie that we are suckered in to seeing , due to a strategic marketing campaign . \nsave your money and see the world is not enough for an entertaining experience . \n"
0
In [34]:
## Building the term-document matrix using CountVectorizer;
## min_df=50 keeps only terms that appear in at least 50 reviews
vec = CountVectorizer(min_df=50)
X = vec.fit_transform(data.data)
terms = vec.get_feature_names()
len(terms)
Out[34]:
2153
In [35]:
## METHOD 1: estimate a positive score for each word w as P(positive|w),
## i.e., the fraction of w's occurrences that fall in positive reviews
def wordscore_pos():
    total_count = X.sum(axis=0)                 # shape (1, n_terms)
    pos_count = X[data.target == 1].sum(axis=0) # shape (1, n_terms)

    # make sure they are 1d np.arrays
    total_count = np.asarray(total_count).ravel()
    pos_count = np.asarray(pos_count).ravel()

    prob = pos_count * 1.0 / total_count
    return zip(terms, prob)
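
The (term, score) pairs can be dropped into a dict for quick spot checks. A minimal sketch using only names defined above (the scores dict is a new helper, not part of the original demo):

scores = dict(wordscore_pos())
# words used mostly in positive reviews score near 1; mostly in negative, near 0
print(scores['superb'], scores['awful'])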
In [36]:
## most "negative" words: lowest P(positive|w)
negative_movies = sorted(wordscore_pos(), key=itemgetter(1))[:20]
negative_movies
Out[36]:
[('lame', 0.13513513513513514),
 ('wasted', 0.1440677966101695),
 ('poorly', 0.14893617021276595),
 ('waste', 0.15384615384615385),
 ('ridiculous', 0.15714285714285714),
 ('awful', 0.1590909090909091),
 ('worst', 0.1590909090909091),
 ('unfunny', 0.17045454545454544),
 ('stupid', 0.17786561264822134),
 ('dull', 0.1791044776119403),
 ('painfully', 0.18333333333333332),
 ('pointless', 0.18478260869565216),
 ('laughable', 0.1891891891891892),
 ('boring', 0.1925925925925926),
 ('terrible', 0.1958041958041958),
 ('embarrassing', 0.2),
 ('bland', 0.20238095238095238),
 ('mess', 0.20754716981132076),
 ('badly', 0.21052631578947367),
 ('anywhere', 0.21818181818181817)]
In [37]:
## most "positive" words: highest P(positive|w)
positive_movies = sorted(wordscore_pos(), key=itemgetter(1), reverse=True)[:20]
positive_movies
Out[37]:
[('outstanding', 0.9324324324324325),
 ('wonderfully', 0.8823529411764706),
 ('nomination', 0.8387096774193549),
 ('era', 0.8372093023255814),
 ('fantastic', 0.8181818181818182),
 ('portrayal', 0.8160919540229885),
 ('cameron', 0.8159509202453987),
 ('jackie', 0.8091872791519434),
 ('superb', 0.8064516129032258),
 ('pulp', 0.8037383177570093),
 ('memorable', 0.8027210884353742),
 ('terrific', 0.8),
 ('political', 0.7987012987012987),
 ('allows', 0.7934782608695652),
 ('excellent', 0.7934782608695652),
 ('satisfying', 0.7903225806451613),
 ('wars', 0.7877094972067039),
 ('contrast', 0.7846153846153846),
 ('perfectly', 0.7831325301204819),
 ('portrayed', 0.7777777777777778)]
In [38]:
# METHOD 2: use the per-class word probabilities P(word|class) that the Naive Bayes classifier has already estimated
In [39]:
sentiment_classifier = MultinomialNB()
sentiment_classifier.fit(X, data.target)
predicted_classes_train = sentiment_classifier.predict(X)
print("Accuracy on train: {:.2f}%".format(np.mean(predicted_classes_train == data.target) * 100))
Accuracy on train: 86.40%
In [40]:
# don't forget to check this against the majority baseline:

sum(data.target)/len(data.target)
Out[40]:
0.5
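
Train accuracy is optimistic, since the classifier has already seen these labels. A quick held-out estimate via cross-validation (a sketch, assuming a scikit-learn version that provides the model_selection module):

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(MultinomialNB(), X, data.target, cv=5)
print("Mean 5-fold CV accuracy: {:.2f}%".format(cv_scores.mean() * 100))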
In [41]:
positive_lexicon = []
negative_lexicon = []

## load_files assigns classes from the sorted folder names,
## so class 0 = 'neg' and class 1 = 'pos'

# log P(word|negative)
neg_log_probs = sentiment_classifier.feature_log_prob_[0, :]

# log P(word|positive)
pos_log_probs = sentiment_classifier.feature_log_prob_[1, :]

# log-odds of the negative class: the smallest values mark the most
# positive-leaning words, the largest the most negative-leaning
logodds = neg_log_probs - pos_log_probs

# positive
print("\nFeatures that are most indicative of positive sentiment:\n")
for i in np.argsort(logodds)[:20]:
    print(terms[i])

print("\n\nFeatures that are most indicative of negative sentiment\n")
# negative
for i in np.argsort(-logodds)[:20]:
    print(terms[i])

# put the top/bottom 500 words in the positive/negative lexicons
for i in np.argsort(logodds)[:500]:
    positive_lexicon.append(terms[i])

# negative
for i in np.argsort(-logodds)[:500]:
    negative_lexicon.append(terms[i])

positive_lexicon = set(positive_lexicon)
negative_lexicon = set(negative_lexicon)
Features that are most indicative of positive sentiment:

outstanding
wonderfully
era
nomination
cameron
fantastic
portrayal
jackie
superb
memorable
pulp
terrific
political
excellent
allows
wars
satisfying
perfectly
contrast
subtle


Features that are most indicative of negative sentiment

lame
wasted
poorly
waste
worst
ridiculous
awful
unfunny
stupid
dull
pointless
painfully
boring
laughable
terrible
bland
embarrassing
mess
badly
anywhere

Now we can try to apply what we learned to data without labels: the Kardashian transcripts.

In [42]:
## Loading the Kardashian data
with open("kardashian-transcripts.json", "rb") as f:
    transcripts = json.load(f)
In [43]:
msgs = [m['text'].lower() for transcript in transcripts
        for m in transcript if m['speaker'] == 'KIM']
In [44]:
## apply the same vectorizer, so terms match the movie-review vocabulary
X2 = vec.transform(msgs)
labels = sentiment_classifier.predict(X2)
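
One caveat when reusing the movie-review vocabulary: a message that shares no terms with it becomes an all-zero row, and the classifier then scores it by the class priors alone. A quick check of how often that happens (a sketch, using the names defined above):

n_empty = int((np.asarray(X2.sum(axis=1)).ravel() == 0).sum())
print("{} of {} messages share no terms with the movie-review vocabulary".format(n_empty, X2.shape[0]))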
In [45]:
#Looking at the classes assigned by the classifier:
list(zip(labels, msgs))[:20]
Out[45]:
[(0, 'you just need a duster.'),
 (0, 'when did you start to get gray hair?'),
 (0, 'oh, please.'),
 (0, 'oh, my god.'),
 (0, 'it is like $4,000.'),
 (0, "that's crazy."),
 (0, "that's way too much money, mom."),
 (0, "you have so many dresses, and it's just a plain black dress."),
 (0, "it's nothing special."),
 (0, 'capitol one, wells fargo...'),
 (0, "so what's, like, your plan?"),
 (0, 'like, where are you gonna stay?'),
 (0, "i'm just afraid if you go to new york, you'll come back hurt."),
 (1,
  "as much as i don't agree with rob going to new york, he's my little brother."),
 (1, "i'm gonna support him no matter what."),
 (1, "so i call rob to see what's going on with adrienne."),
 (1, "rob, you're crazy."),
 (1,
  "i know he feels that she's the one, and i get it, but rob is taking it to a whole 'nother level."),
 (0, 'i think that at some point, you just need to stop calling her.'),
 (0, 'and i think this is now, like, the turning point.')]
In [46]:
# label distribution: every message gets a class, even though not all messages are actually subjective
plt.hist(labels.tolist())
Out[46]:
(array([4911.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
         954.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 10 Patch objects>)
In [47]:
## We can look at the predicted class probabilities
label_prob = sentiment_classifier.predict_proba(X2)
positive_label_probabilities = label_prob[:, 1]  # column 1 = P(positive | message)
plt.hist(positive_label_probabilities)
Out[47]:
(array([ 123.,  391.,  834., 1564., 1626., 1004.,  228.,   78.,   13.,
           4.]),
 array([1.92734069e-04, 9.88536484e-02, 1.97514563e-01, 2.96175477e-01,
        3.94836391e-01, 4.93497306e-01, 5.92158220e-01, 6.90819134e-01,
        7.89480048e-01, 8.88140963e-01, 9.86801877e-01]),
 <a list of 10 Patch objects>)
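
For MultinomialNB, the hard labels from In [44] are just the argmax of these probabilities, i.e., a 0.5 threshold in the binary case. A sanity-check sketch (ties at exactly 0.5 aside):

# every message with P(positive) > 0.5 was labeled 1, the rest 0
assert (labels == (positive_label_probabilities > 0.5)).all()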
In [48]:
## documents that are considered most negative
kard_sentiment = sorted(set(zip(positive_label_probabilities, msgs)),
                        key=itemgetter(0))  # ascending: most negative first

for sent_score, m in kard_sentiment[:20]:
    print(sent_score, m)
    print("lexicon items:", ", ".join(set(m.split()).intersection(negative_lexicon)))
    print()
0.00019273406884585592 i know that you guys sometimes don't understand what goes on, and i'm sorry if, you know, at school, maybe your friends have said things to you or... come to you with things that maybe you don't know the whole truth about, but you have three big sisters that you can always come to, and when you get to that point and you do like boys, i want you to be able to call me and tell me whatever you want to.
lexicon items: sorry, maybe, big, do, whole, guys, have, point, whatever, get

0.0015970813143444077 i've had time to reflect on what my family's trying to say to me, and it probably wouldn't be such a bad idea if i put myself on a budget i just don't like lying to you guys, and i'm really sorry.
lexicon items: trying, budget, bad, idea, myself, if, just

0.0018022656570855005 this was supposed to be me!" like, i think, you know, like, what if i have a nervous breakdown when they're like, "does anyone object?" what if i'm, like, "this was supposed to be me!" like in a dream.
lexicon items: if, supposed, have, anyone

0.003536964441454643 you don't know when you're wrong, and that's probably why you and adrienne have a big problem 'cause you don't know when you're wrong and you don't know when to apologize!
lexicon items: big, problem, wrong, why, have

0.00946303806464491 i don't want to have to sneak behind anyone's back to do anything or feel guilty, and i was feeling like i was doing something that i shouldn't have been doing, like i was feeling guilty.
lexicon items: have, anything, do

0.011833571173051408 it's kind of scary 'cause they're, you know, one's talking and one's writing down everything you say, and i want to do it fast so they can hurry up and get out of there and go find this guy.
lexicon items: there, do, fast, scary, get, talking

0.01627306292774108 she's gonna kill me for saying anything, so you guys have to promise that you're not gonna tell her that i said anything
lexicon items: guys, kill, promise, anything, have

0.017836877079826948 scott's idea that i should marry shengo is ridiculous, but, like, in the back of my mind, i just, like, want to know, like, more about him, and, like, just, does he want the same things that i do?
lexicon items: should, idea

0.018020207994180915 this was the biggest waste of time not only for me, but everyone that put this together-- my mom, khloe, kourtney-- everyone wasting their time to talk to me about this addiction that i have.
lexicon items: waste, only

0.01864895420567231 my first memory of bruce was it was my 11th birthday party and it was at tower lane, and i thought it was so cool that you four: you, casey, brody, everyone was coming, and i was, like, "we have four new, like, brothers and sisters, and they're all coming." and that's why it was so much fun 'cause we had such a good time.
lexicon items: cool, brothers, why, have

0.02295108510281941 the idea that i need professional help for a shopping addiction is so ridiculous, i cannot even-- where would they even find someone like this?
lexicon items: idea, someone, even

0.028408387224013176 but he's my friend, and then you go and say no, and it makes me look bad-- it makes you look bad.
lexicon items: then

0.028421299801806424 okay, okay, let's get her like those babies that they rent in high school when they're trying to teach you a lesson.
lexicon items: rent, trying, get

0.02914446847980453 yeah, but just please tell me which one, so that you're not like wearing--
lexicon items: just, please

0.0291953344322823 i don't care what they say about me, you know what i mean, but if they say it about you and then it's, like...
lexicon items: then, care, if

0.029672919480405055 he said if i go and they lose, then i'll be, like, the bad luck charm of all time.
lexicon items: then, bad, luck, if

0.02995248607415213 i don't want to be one of those full of people that just say, "oh, yeah, sure, i'll do it, i'll do it."
lexicon items: just, do

0.03006564103298027 i've asked bruce to come over to help me party-proof my house before my mom gets there, because as soon as she gets there, all my ideas will be thrown out the door.
lexicon items: thrown, ideas

0.03242987093479214 i mean, that's like they bring that to you and then you just check, check, check, check.
lexicon items: then, just

0.033928853226912796 well, since i already planned my wedding and i know what i want my cake to look like and i know what i want my flowers to look like and i have it all planned out and cut out, i can give it to khloe.
lexicon items: cut, already, have, give

In [49]:
## documents that are considered most positive
kard_sentiment = sorted(set(zip(positive_label_probabilities, msgs)),
                        key=itemgetter(0))  # same ascending sort; most positive at the end

for sent_score, m in kard_sentiment[-20:]:
    print(sent_score, m)
    print("lexicon items:", ", ".join(set(m.split()).intersection(positive_lexicon)))
    print()
0.7719650891899279 everything from career to family is going perfect.
lexicon items: family

0.7789525214239469 to restore hair's natural defense against frizz.
lexicon items: against, natural

0.7810829263423256 and a strong immune system.
lexicon items: strong

0.7820114680538289 rob asked adrienne to be his date for the wedding and she said that, you know, he can tag along with her and her friends.
lexicon items: date, rob

0.7833492617671066 i'm bringing my best friend brittny gastineau with me for support.
lexicon items: bringing, best

0.7894137428759795 this is gonna be such an amazing life experience for her.
lexicon items: life, experience, amazing

0.7928196558948932 i'm in the best mood, everyone!
lexicon items: best

0.7933490786402522 thanks for understanding.
lexicon items: thanks

0.7934013715140062 fantastic. phenomenal.
lexicon items: 

0.7977885608563229 when it really comes down to it, the most important thing in life is family.
lexicon items: life, important, most

0.7979045195112897 rob is the baby in our family.
lexicon items: rob

0.7987598632018919 rob has just finished the last of his finals, and our whole family is getting together to celebrate.
lexicon items: family, rob

0.8013329401785663 this is the life that i've lived being in a relationship with reggie for so many years.
lexicon items: many, life, relationship

0.812986427226234 combining the insurances family's need most, term life and disability, in one affordable package.
lexicon items: life

0.8169893399694116 when imes to dance partners, working together is a thing of beauty.
lexicon items: dance

0.8326951462014087 rob lets me know that david from the modeling agency told him about a potential job offer in japan.
lexicon items: rob, job, lets, told

0.854909532289006 in a world with so many flaws, i guess few things come as close to perfect as this shine.
lexicon items: world, perfect, many

0.8924223058556139 kourtney and khloe are saying it's not a big deal, and all of our friends and family know it's not true, but the whole world thinks it's true.
lexicon items: world, family, friends

0.9199950326266608 with history, science,and the arts as well as english and math.
lexicon items: 

0.9868018770179279 which hasbeauty perfect 10 fromnice'n easy rich color stunninghigh gloss and flawlessgray coverage all in just10 minutes that's why it won the mostawards from beauty editors perfect 10 the color thatchanges everything
lexicon items: rich, perfect, color, beauty

We see that the polarity lexicon does not generalize well from one dataset to another: words like "cameron", "jackie", and "pulp" are strong positive cues in movie reviews but carry no sentiment in transcripts. What can we do?
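
One option is to make the lexicon more conservative: threshold the log-odds instead of taking a fixed top-500, which drops weakly polar and domain-specific entries that sit near the cutoff. A sketch, with an assumed threshold value that would need tuning on held-out data:

# recall logodds = log P(w|neg) - log P(w|pos): strongly negative values are positive words
threshold = 1.0  # assumed cutoff, not part of the original demo
positive_lexicon = {terms[i] for i in np.where(logodds < -threshold)[0]}
negative_lexicon = {terms[i] for i in np.where(logodds > threshold)[0]}

Even a small amount of labeled in-domain data would let us re-estimate these scores directly.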