%matplotlib inline
from __future__ import print_function, division
import json
from operator import itemgetter
from collections import defaultdict
import numpy as np
from matplotlib import pyplot as plt
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC
## loading movie review data:
## http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
data = load_files('txt_sentoken')
## First review and first label:
print(data.data[0])
print(data.target[0])
## Building the term document matrix using CountVectorizer
vec = CountVectorizer(min_df=50)
X = vec.fit_transform(data.data)
terms = vec.get_feature_names()
len(terms)
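## A quick sanity check (a sketch; "film" is just an assumed example of a
## word that survives the min_df cutoff): vec.vocabulary_ maps each term
## to its column index in X, so we can look up raw counts directly.
col = vec.vocabulary_['film']
print('film occurs', X[:, col].sum(), 'times across all reviews')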
## METHOD 1: estimate each word's positive score as P(positive|word)
def wordscore_pos():
    total_count = X.sum(axis=0)                  # shape (1, n_terms)
    pos_count = X[data.target == 1].sum(axis=0)  # shape (1, n_terms)
    # make sure they are 1d np.arrays
    total_count = np.asarray(total_count).ravel()
    pos_count = np.asarray(pos_count).ravel()
    prob = pos_count * 1.0 / total_count
    # return a list so the result can be sorted and reused
    return list(zip(terms, prob))
## most "negative" words
negative_movies = sorted(wordscore_pos(), key=itemgetter(1), reverse=False)[:20]
negative_movies
## most "positive" words
positive_movies = sorted(wordscore_pos(), key=itemgetter(1), reverse=True)[:20]
positive_movies
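## The raw ratio above can be noisy for words close to the min_df cutoff;
## a common remedy (a sketch, not part of the original method) is to add
## alpha pseudo-counts per class so no estimate hits exactly 0 or 1:
def wordscore_pos_smoothed(alpha=1.0):
    total_count = np.asarray(X.sum(axis=0)).ravel()
    pos_count = np.asarray(X[data.target == 1].sum(axis=0)).ravel()
    # alpha pseudo-counts for each of the two classes
    prob = (pos_count + alpha) / (total_count + 2 * alpha)
    return list(zip(terms, prob))
sorted(wordscore_pos_smoothed(), key=itemgetter(1))[:20]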
sentiment_classifier = MultinomialNB()
sentiment_classifier.fit(X, data.target)
predicted_classes_train = sentiment_classifier.predict(X)
print("Accuracy on train: {:.2f}%".format(np.mean(predicted_classes_train == data.target) * 100))
# don't forget to check this against the majority baseline:
sum(data.target)/len(data.target)
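## Training accuracy is optimistic; a quick held-out estimate (a sketch,
## assuming a scikit-learn version that ships sklearn.model_selection):
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(MultinomialNB(), X, data.target, cv=5)
print("5-fold CV accuracy: {:.2f}%".format(cv_scores.mean() * 100))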
# METHOD 2: Use class probabilities already calculated by your NB classifier
positive_lexicon = []
negative_lexicon = []
# P(word|negative): load_files sorts folder names alphabetically, so class 0 = 'neg'
negative_probs = sentiment_classifier.feature_log_prob_[0, :]
# P(word|positive): class 1 = 'pos'
positive_probs = sentiment_classifier.feature_log_prob_[1, :]
logodds = positive_probs - negative_probs
# positive
print("\nFeatures that are most indicative of positive sentiment:\n")
for i in np.argsort(-logodds)[:20]:
    print(terms[i])
print("\n\nFeatures that are most indicative of negative sentiment:\n")
# negative
for i in np.argsort(logodds)[:20]:
    print(terms[i])
# put the top/bottom words in the positive/negative lexicons
for i in np.argsort(-logodds)[:500]:
    positive_lexicon.append(terms[i])
# negative
for i in np.argsort(logodds)[:500]:
    negative_lexicon.append(terms[i])
positive_lexicon = set(positive_lexicon)
negative_lexicon = set(negative_lexicon)
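## A minimal sketch of lexicon-based scoring for raw text, using the
## TreebankWordTokenizer imported above (the example sentence is made up):
tokenizer = TreebankWordTokenizer()
def lexicon_score(text):
    tokens = set(tokenizer.tokenize(text.lower()))
    # positive-minus-negative lexicon hits; > 0 leans positive
    return len(tokens & positive_lexicon) - len(tokens & negative_lexicon)
lexicon_score("what a wonderful , moving film !")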
Now we can try to apply what we learned to data without labels.
## Loading the Kardashian data
with open("kardashian-transcripts.json", "rb") as f:
    transcripts = json.load(f)
msgs = [m['text'].lower() for transcript in transcripts
        for m in transcript if m['speaker'] == 'KIM']
## using the same transformation to get a term-document matrix (where terms match the ones in the movie-review data)
X2 = vec.transform(msgs)
labels = sentiment_classifier.predict(X2)
# Looking at the classes assigned by the classifier:
list(zip(labels, msgs))[:20]
# label distribution (everything has an assigned class, even though not everything might be subjective)
plt.hist(labels.tolist())
## We can look at the predicted class probabilities
label_prob = sentiment_classifier.predict_proba(X2)
positive_label_probabilities = label_prob[:, 1]
plt.hist(positive_label_probabilities)
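## Every message receives a label even when the model is unsure; one option
## (a sketch, the 0.9/0.1 cutoffs are arbitrary) is to keep only confident calls:
confident = (positive_label_probabilities > 0.9) | (positive_label_probabilities < 0.1)
print("confident predictions: {:.1f}%".format(confident.mean() * 100))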
## documents that are considered most negative
kard_sentiment = sorted(set(zip(positive_label_probabilities, msgs)),
                        key=itemgetter(0),
                        reverse=False)
for sent_score, m in kard_sentiment[:20]:
    print(sent_score, m)
    print("lexicon items:", ", ".join(list(set(m.split()).intersection(negative_lexicon))))
    print()
## documents that are considered most positive
kard_sentiment = sorted(set(zip(positive_label_probabilities, msgs)),
                        key=itemgetter(0),
                        reverse=False)
for sent_score, m in kard_sentiment[-20:]:
    print(sent_score, m)
    print("lexicon items:", ", ".join(list(set(m.split()).intersection(positive_lexicon))))
    print()
We see that the polarity lexicon does not generalize well from one dataset to another. What can we do?