%matplotlib inline
from __future__ import print_function
import json
from operator import itemgetter
from collections import defaultdict
from matplotlib import pyplot as plt
import numpy as np
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist,pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB
tokenizer = TweetTokenizer()
We use the movie review data again, but this time without the sentiment labels (we pretend the labels are unavailable).
## loading movie review data:
## http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
data = load_files('txt_sentoken')
print(data.data[0])
b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is that weak , but it's better than the other blockbuster right now ( sleepy hollow ) , but it makes the world is not enough look like a 4 star film . \nanyway , this definitely doesn't seem like an arnold movie . \nit just wasn't the type of film you can see him doing . \nsure he gave us a few chuckles with his well known one-liners , but he seemed confused as to where his character and the film was going . \nit's understandable , especially when the ending had to be changed according to some sources . \naside form that , he still walked through it , much like he has in the past few films . \ni'm sorry to say this arnold but maybe these are the end of your action days . \nspeaking of action , where was it in this film ? \nthere was hardly any explosions or fights . \nthe devil made a few places explode , but arnold wasn't kicking some devil butt . \nthe ending was changed to make it more spiritual , which undoubtedly ruined the film . 
\ni was at least hoping for a cool ending if nothing else occurred , but once again i was let down . \ni also don't know why the film took so long and cost so much . \nthere was really no super affects at all , unless you consider an invisible devil , who was in it for 5 minutes tops , worth the overpriced budget . \nthe budget should have gone into a better script , where at least audiences could be somewhat entertained instead of facing boredom . \nit's pitiful to see how scripts like these get bought and made into a movie . \ndo they even read these things anymore ? \nit sure doesn't seem like it . \nthankfully gabriel's performance gave some light to this poor film . \nwhen he walks down the street searching for robin tunney , you can't help but feel that he looked like a devil . \nthe guy is creepy looking anyway ! \nwhen it's all over , you're just glad it's the end of the movie . \ndon't bother to see this , if you're expecting a solid action flick , because it's neither solid nor does it have action . \nit's just another movie that we are suckered in to seeing , due to a strategic marketing campaign . \nsave your money and see the world is not enough for an entertaining experience . \n"
## building the term-document matrix
vec = CountVectorizer(min_df = 50)
X = vec.fit_transform(data.data)
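terms = list(vec.get_feature_names_out())  ## get_feature_names() was removed in scikit-learn 1.2; a list keeps terms.index() working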
len(terms)
2153
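`min_df = 50` keeps only the terms that occur in at least 50 reviews, which is how the vocabulary shrinks to 2153 terms. The filter can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's implementation: `vocabulary_with_min_df` and `toy_docs` are hypothetical names, and the tokenizer is a bare whitespace split, whereas `CountVectorizer`'s default token pattern also drops single-character tokens and punctuation.

```python
from collections import Counter

def vocabulary_with_min_df(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term is counted once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

toy_docs = ["a good film", "good good acting", "a dull film"]
toy_vocab = vocabulary_with_min_df(toy_docs, min_df=2)
print(toy_vocab)
```

Note that "good" appears three times in total but in only two documents; the threshold is applied to the document count, not the raw count.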
# PMI type measure via matrix multiplication
def getcollocations_matrix(X):
    ## multiply the transpose of X with X: entry (w1, w2) sums count(w1) * count(w2) over all docs
    ## (this equals the number of docs containing both words only if X is binary, e.g. CountVectorizer(binary=True))
    XX = X.T.dot(X)
    term_freqs = np.asarray(X.sum(axis=0))  ## total count of each word over the corpus, shape (1, n_terms)
    pmi = XX.toarray() * 1.0  ## casting to float, making it an array to use simple operations
    pmi /= term_freqs.T  ## dividing each row by the total count of w1
    pmi /= term_freqs  ## dividing each column by the total count of w2
    # this is not technically PMI because we are ignoring some normalization factor and not taking the log,
    # but it's sufficient for ranking
    return pmi
pmi_matrix = getcollocations_matrix(X)
pmi_matrix.shape
(2153, 2153)
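For comparison with the shortcut score, here is a minimal sketch of document-level PMI with the normalization and the log included, i.e. PMI(w1, w2) = log( P(w1, w2) / (P(w1) P(w2)) ), where each probability is the fraction of documents containing the word(s). The helper name `document_pmi` and the toy matrix are assumptions made for illustration; the sketch expects a dense array (a sparse matrix would need `X.toarray()` first) and binarizes the counts, which the notebook's function does not do.

```python
import numpy as np

def document_pmi(X):
    """Document-level PMI for every pair of terms (hypothetical helper)."""
    B = (np.asarray(X) > 0).astype(float)  # 1.0 if the term occurs in the document
    n_docs = B.shape[0]
    co_docs = B.T @ B                      # number of docs containing both terms
    doc_freq = B.sum(axis=0)               # number of docs containing each term
    p_joint = co_docs / n_docs
    p_single = doc_freq / n_docs
    # pairs that never co-occur get log(0) = -inf; silence that warning only
    with np.errstate(divide="ignore"):
        return np.log(p_joint) - np.log(np.outer(p_single, p_single))

# rows are documents, columns are terms
toy = np.array([[2, 1, 0],
                [1, 0, 1],
                [0, 3, 0],
                [1, 1, 0]])
toy_pmi = document_pmi(toy)
print(toy_pmi)
```

Because the log is monotonic and the normalization is a constant factor, the ranking within a column is the same as for the un-logged ratio computed on a binarized matrix; on raw counts the two can differ.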
def getcollocations(w, PMI_MATRIX=pmi_matrix, TERMS=terms):
    if w not in TERMS:
        return []
    idx = TERMS.index(w)
    col = PMI_MATRIX[:, idx].ravel().tolist()
    return sorted([(TERMS[i], val) for i, val in enumerate(col)], key=itemgetter(1), reverse=True)
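`getcollocations` returns a score for every term in the vocabulary, including the query word itself. A possible variant that drops the query word and keeps only the `k` best-scoring terms is sketched below; `top_collocations`, its `k` parameter, and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def top_collocations(word, score_matrix, vocab, k=5):
    """Return the k highest-scoring (term, score) pairs for `word`, excluding `word`."""
    index = {term: i for i, term in enumerate(vocab)}  # O(1) lookup instead of list.index
    if word not in index:
        return []
    col = np.asarray(score_matrix)[:, index[word]]
    order = np.argsort(-col)                           # indices sorted by descending score
    best = [i for i in order if vocab[i] != word][:k]
    return [(vocab[i], float(col[i])) for i in best]

toy_vocab_terms = ["good", "pretty", "bad"]
toy_scores = np.array([[0.9, 0.5, 0.2],
                       [0.5, 0.8, 0.1],
                       [0.2, 0.1, 0.7]])
toy_top = top_collocations("good", toy_scores, toy_vocab_terms, k=2)
print(toy_top)
```

With the notebook's objects, the equivalent call would be `top_collocations("good", pmi_matrix, terms, k=20)`.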
getcollocations("good")
[('good', 0.0012711337380982813), ('trek', 0.0010038914000850665), ('sean', 0.0009922470727116103), ('nudity', 0.0009374840201587473), ('nicely', 0.0009268742752181751), ('trash', 0.0009217014608968155), ('showed', 0.000916850400576306), ('compared', 0.00091151987499156), ('fairly', 0.0008716089901959017), ('comparison', 0.0008698557537213697), ('laughed', 0.0008665639627895953), ('crap', 0.0008473706979212659), ('pulp', 0.0008450365730278281), ('parts', 0.0008435572066033899), ('fifteen', 0.0008424927416009955), ('sorry', 0.0008413817621615216), ('pretty', 0.0008334590198961828), ('nights', 0.0008333717375608706), ('chris', 0.000833301911692621), ('doctor', 0.0008330167404996009), ('rating', 0.0008322781072402701), ('average', 0.0008295313148071339), ('forward', 0.0008295313148071339), ('watched', 0.0008295313148071339), ('cool', 0.0008275372491465399), ('stupid', 0.0008213343650560753), ('sadly', 0.0008174507616788748), ('matt', 0.0008162941129751053), ('hate', 0.0008140549843070009), ('kills', 0.0008135787895223813), ('terrific', 0.0008122494124153186), ('horrible', 0.0008093970595933685), ('agrees', 0.0008091330037872864), ('subplot', 0.0008082612810941305), ('totally', 0.0008068044294699522), ('sad', 0.0008064887782847135), ('technical', 0.0008033355890763824), ('therefore', 0.0008002537389904116), ('handled', 0.0007999051964211649), ('scientist', 0.0007949675100235034), ('lovely', 0.0007943816828237808), ('barry', 0.000792934345036231), ('villain', 0.0007926632563712613), ('event', 0.0007924384511368963), ('producers', 0.0007895539020453444), ('okay', 0.0007863956864371629), ('fit', 0.000785871771922548), ('mentioned', 0.0007854073087003715), ('detail', 0.0007852359533368501), ('information', 0.0007839070924927416), ('allen', 0.0007790149847387508), ('seven', 0.0007784369946922017), ('shouldn', 0.0007783256780906441), ('naturally', 0.0007776856076316881), ('comments', 0.0007747509449613798), ('entertain', 0.0007747509449613798), ('jail', 
0.0007734819016444898), ('fbi', 0.0007733651320337342), ('climactic', 0.0007732919036337689), ('bad', 0.0007712559966343031), ('ended', 0.0007711136165812794), ('judge', 0.0007694203499660373), ('ones', 0.0007682591154179706), ('nice', 0.0007668341805484552), ('kill', 0.000764042000480255), ('critics', 0.0007636954961716471), ('danny', 0.0007634624490260348), ('presented', 0.0007617330823469355), ('rent', 0.0007604037052398728), ('sub', 0.0007604037052398728), ('genius', 0.0007595059440766616), ('thankfully', 0.0007594300769361086), ('wanted', 0.0007590994107197358), ('breaking', 0.0007584286306808082), ('batman', 0.0007559768139867969), ('total', 0.0007557951979353888), ('wasn', 0.0007554151148587302), ('bigger', 0.0007552449284064951), ('ensemble', 0.000752202124443757), ('steals', 0.000752202124443757), ('lot', 0.0007517244632705712), ('kiss', 0.0007491175649023608), ('directing', 0.0007486014304357063), ('perspective', 0.0007479380707277437), ('badly', 0.0007476696718985352), ('crash', 0.0007476696718985352), ('adds', 0.0007473255088352558), ('really', 0.0007456730464617401), ('job', 0.0007452354168096464), ('army', 0.000744825652379645), ('brown', 0.0007446186605355376), ('mainly', 0.0007441383853416938), ('pay', 0.0007420807243907193), ('dumb', 0.000741226368392181), ('explosions', 0.0007406529596492268), ('yeah', 0.0007402779454924423), ('driver', 0.0007399796387768184), ('recommend', 0.0007395349929176807), ('blame', 0.0007389503091672745), ('twice', 0.0007382828701783492), ('gary', 0.0007379596761595932), ('wouldn', 0.0007373611687174524), ('cares', 0.0007367547861773888), ('killed', 0.000736304171629268), ('fiction', 0.0007362894228326887), ('price', 0.0007357152732515652), ('murphy', 0.0007352663926699596), ('hits', 0.0007338161630986185), ('accent', 0.0007334803204610447), ('acts', 0.0007329420521241115), ('saw', 0.0007328073077446421), ('suspenseful', 0.0007327526614129683), ('guilty', 0.0007299875570302779), ('advice', 0.000729680323209979), 
('ending', 0.0007295169009651391), ('aren', 0.0007290304055131928), ('jackson', 0.000728289303944846), ('ok', 0.0007282513286969607), ('actor', 0.0007277390106091889), ('news', 0.0007277251988989857), ('fights', 0.0007273426745772697), ('thinks', 0.00072659677209384), ('throw', 0.0007247925124324958), ('saying', 0.0007246200014638788), ('cop', 0.0007238458347956482), ('loves', 0.0007235356468040002), ('extra', 0.0007228772886176453), ('villains', 0.0007228772886176453), ('performance', 0.0007227970372811597), ('range', 0.0007226977363850031), ('flash', 0.0007224950161223425), ('gives', 0.0007223630680223859), ('thrills', 0.0007222643344441425), ('said', 0.0007214337497047879), ('surprised', 0.0007209945072622753), ('treat', 0.0007200792663256371), ('guys', 0.0007196493682561889), ('writing', 0.0007196184155951888), ('particular', 0.0007191066917321584), ('witty', 0.0007183570148845284), ('natural', 0.000717863637813866), ('acted', 0.0007174324884818456), ('liked', 0.0007171432011881029), ('cliched', 0.0007164134082425248), ('grace', 0.0007158548012965267), ('national', 0.000715805247454543), ('acting', 0.0007155453571609738), ('aliens', 0.0007152403336559289), ('chemistry', 0.0007146731327569155), ('guess', 0.0007139107996902104), ('instance', 0.0007133969307341352), ('violent', 0.0007131582166867087), ('mediocre', 0.0007118275471655811), ('alien', 0.0007110268412632576), ('scary', 0.0007110268412632576), ('ask', 0.0007106124899571602), ('probably', 0.0007102573316947909), ('nevertheless', 0.0007100225660637334), ('mean', 0.0007095577775416395), ('allowed', 0.0007091154787867436), ('loud', 0.0007090011237667811), ('flick', 0.0007089106899499742), ('fun', 0.000708794736453356), ('slightly', 0.000708672447749141), ('plain', 0.0007081364882499924), ('allows', 0.0007078066110039132), ('prison', 0.0007071414486880486), ('trailer', 0.00070639776026545), ('stuff', 0.0007058992438503015), ('fantastic', 0.0007056402742839906), ('dog', 0.0007054282047178778), ('critic', 
0.0007051016175860639), ('hey', 0.0007051016175860639), ('overall', 0.0007051016175860639), ('working', 0.0007049068919253111), ('developed', 0.0007046989324817885), ('person', 0.0007034519814486633), ('visuals', 0.0007030783704767782), ('emotion', 0.0007022736699219486), ('menace', 0.0007016129344864077), ('murdered', 0.0007013310207005769), ('requires', 0.0007008109383715443), ('track', 0.0007006720814390355), ('usual', 0.0007006493308681725), ('lines', 0.0006999170468685193), ('saving', 0.0006999170468685193), ('yes', 0.0006999170468685193), ('able', 0.0006998405782738653), ('get', 0.0006997175379902146), ('maybe', 0.0006996916307503651), ('think', 0.0006990926451643161), ('bring', 0.0006989242567311171), ('remember', 0.0006988801327250104), ('de', 0.0006986421524297788), ('annoying', 0.0006978596775361603), ('wonderfully', 0.0006977822236318833), ('disappointing', 0.00069756042381509), ('included', 0.0006972871921567213), ('friends', 0.0006965130357913436), ('tell', 0.0006964706559291111), ('williams', 0.0006959289155473312), ('realistic', 0.0006955301024152124), ('except', 0.0006951165184263484), ('episode', 0.0006940976307569897), ('impressive', 0.0006938603053760607), ('terribly', 0.0006936598063473447), ('very', 0.0006935024277037634), ('language', 0.000693488179178764), ('doing', 0.000693268245804233), ('feeling', 0.0006931194985944053), ('somewhere', 0.0006923647194453244), ('study', 0.0006912760956726117), ('theatre', 0.0006912760956726117), ('dull', 0.0006902443403059361), ('decided', 0.000689946718565549), ('hotel', 0.000689668476845466), ('seemingly', 0.0006890461727833452), ('thrillers', 0.0006890461727833452), ('mood', 0.0006884254725976731), ('confused', 0.0006883344952654942), ('anti', 0.0006881339316013725), ('brilliant', 0.0006881339316013725), ('reason', 0.000688112360680975), ('smart', 0.0006876378004322295), ('direction', 0.0006873678209267594), ('jackie', 0.0006873678209267594), ('actually', 0.0006873117883139156), ('drop', 
0.000686248633158629), ('planet', 0.0006861555320009626), ('brian', 0.0006859585872443607), ('above', 0.000685583233708249), ('lawyer', 0.000685264999188502), ('better', 0.0006851280870126166), ('warm', 0.0006849917675301333), ('biggest', 0.0006847808840354194), ('hundred', 0.0006846925138090629), ('screenplay', 0.0006846474207826003), ('did', 0.0006843905146410103), ('lose', 0.0006843633347158854), ('will', 0.0006842884672687007), ('direct', 0.0006840940063669222), ('scene', 0.0006840515924536997), ('george', 0.0006839995051918474), ('considered', 0.000683922094654818), ('sheer', 0.0006838028405842591), ('criminal', 0.0006834206854945138), ('general', 0.0006833962645302296), ('develops', 0.000683143435723522), ('rules', 0.000683143435723522), ('guy', 0.0006827392523224378), ('talent', 0.0006825963958166327), ('looks', 0.0006825376168108359), ('had', 0.0006825121813937092), ('great', 0.0006824846052572631), ('tension', 0.0006824512944512591), ('learn', 0.0006824341921233109), ('fact', 0.0006821735781395314), ('entertainment', 0.0006821027635973353), ('agent', 0.0006814007228772886), ('explained', 0.0006814007228772886), ('hit', 0.0006810888689995415), ('reasons', 0.0006807222621508924), ('moved', 0.0006804749066777271), ('offensive', 0.0006802156781418498), ('threatening', 0.0006799437006615852), ('feel', 0.0006798736033728572), ('huge', 0.000679823000596379), ('running', 0.0006792911231160586), ('master', 0.0006792539027043922), ('cops', 0.0006791217906937525), ('why', 0.000678920708442264), ('gore', 0.0006787074393876551), ('failure', 0.0006781978992679947), ('soundtrack', 0.0006781978992679947), ('besides', 0.0006781089319455142), ('either', 0.0006780236523876963), ('aforementioned', 0.0006779823246019844), ('feels', 0.000677834616034533), ('me', 0.0006772941322099265), ('definitely', 0.0006772162428792703), ('capable', 0.0006770439407617049), ('intelligent', 0.0006762483544623375), ('rated', 0.0006761248387811572), ('flicks', 0.0006759144046576647), ('girls', 
0.0006755789965569017), ('care', 0.0006753585539959397), ('anyway', 0.0006753514107461222), ('well', 0.0006750278432664558), ('relief', 0.0006745639263266804), ('done', 0.0006741033421380078), ('asking', 0.0006739941932807963), ('evil', 0.0006738733408165179), ('jump', 0.0006732428062202827), ('supporting', 0.0006732428062202827), ('gets', 0.0006727355113724907), ('feet', 0.0006727296638374928), ('sure', 0.0006725072227117109), ('although', 0.0006724942545826389), ('credit', 0.0006722064102747464), ('weird', 0.0006719203649937786), ('happening', 0.0006718035295973268), ('necessary', 0.0006717400320992552), ('right', 0.0006715253500819656), ('1996', 0.0006704431174468618), ('hurt', 0.0006704431174468618), ('basically', 0.0006703084795005513), ('dies', 0.0006700060619596082), ('roles', 0.0006697805845704244), ('interesting', 0.0006689558957182921), ('star', 0.0006687483070094306), ('usually', 0.0006687411040075134), ('whom', 0.0006685774776057497), ('try', 0.0006682335591501913), ('though', 0.0006680374524563834), ('haunting', 0.0006679343054291209), ('major', 0.0006678430076837096), ('role', 0.0006678107603149175), ('path', 0.0006675134798838656), ('regular', 0.0006675134798838656), ('sign', 0.0006674389889252802), ('loved', 0.0006670457995356336), ('don', 0.0006670226685137735), ('thing', 0.0006670088013375039), ('expecting', 0.0006667751707626963), ('make', 0.0006663531085448293), ('knows', 0.0006663202799443585), ('isn', 0.0006661965036826523), ('amount', 0.0006661387831026985), ('relatively', 0.0006661387831026985), ('bruce', 0.0006656732773143668), ('ideas', 0.0006653532420848887), ('he', 0.0006653376447000585), ('want', 0.0006651063577650056), ('reminded', 0.0006650552782505471), ('subplots', 0.0006650552782505471), ('grow', 0.0006648449508380707), ('rise', 0.0006648449508380707), ('sometimes', 0.0006648229310006633), ('special', 0.0006647811930510132), ('individual', 0.0006646885535313573), ('forever', 0.0006645170210014137), ('scenes', 
0.0006644715123710205), ('action', 0.0006642620639475215), ('aspect', 0.0006642051436742437), ('kind', 0.0006640702385978397), ('getting', 0.0006640336879613757), ('just', 0.0006639106047940057), ('believable', 0.0006636250518457072), ('boring', 0.0006636250518457072), ('cliche', 0.0006636250518457072), ('funny', 0.0006636250518457072), ('irritating', 0.0006636250518457072), ('weight', 0.0006636250518457072), ('went', 0.0006636250518457072), ('also', 0.0006635828794351934), ('effects', 0.0006633694181585554), ('jack', 0.0006633457483727081), ('bit', 0.0006630408748634487), ('need', 0.00066283752211646), ('but', 0.0006625489862068561), ('disappointment', 0.0006623869454056965), ('hardly', 0.0006622308815687205), ('tight', 0.0006621697337495544), ('likes', 0.000661949231007713), ('budget', 0.0006618118686439429), ('frightening', 0.0006616499772866426), ('heard', 0.0006615893921774687), ('black', 0.0006614045091292062), ('serves', 0.0006612206132520633), ('typical', 0.0006606624400071102), ('myself', 0.0006603810746369642), ('again', 0.0006602878569010809), ('superb', 0.0006600571752228807), ('we', 0.0006600378894032979), ('musical', 0.0006598544549602202), ('nobody', 0.0006598544549602202), ('afraid', 0.0006595453896417377), ('richard', 0.0006595031571137463), ('system', 0.0006593710451031064), ('him', 0.000658615728676154), ('longer', 0.0006585592117552819), ('terrible', 0.0006584042253888791), ('decides', 0.0006583581863548682), ('knowing', 0.0006583581863548682), ('does', 0.0006581230584311701), ('makes', 0.0006581059926947726), ('wars', 0.0006580639480592907), ('sounds', 0.0006580116820462604), ('nothing', 0.0006577440462556566), ('built', 0.0006576998281685133), ('reading', 0.0006575553105178501), ('confusing', 0.0006574476909907604), ('wasted', 0.0006572981180887036), ('grown', 0.0006572440417318061), ('drawn', 0.0006571006482461006), ('fly', 0.000656712290888981), ('responsible', 0.000656712290888981), ('played', 0.0006564938091899696), ('was', 
0.0006564883957972653), ('survive', 0.0006562747743727326), ('childhood', 0.0006560838580747332), ('gave', 0.0006559768907872017), ('too', 0.0006556821711522463), ('basic', 0.0006555973294443478), ('calls', 0.0006553297386976359), ('surprising', 0.0006553297386976359), ('some', 0.0006548712038000038), ('brief', 0.0006547372163299164), ('became', 0.0006544925970037938), ('beat', 0.0006542782201295704), ('started', 0.0006538658599067997), ('anyone', 0.0006534515545886386), ('jerry', 0.0006531742636276646), ('however', 0.0006529728297040989), ('heroes', 0.0006529479161105658), ('like', 0.000652946803213366), ('admit', 0.0006528050781743098), ('shoot', 0.0006528050781743098), ('case', 0.0006527621417708519), ('then', 0.0006527316279785068), ('depth', 0.0006526033071035145), ('script', 0.0006525362013649949), ('movies', 0.0006524133101944996), ('times', 0.0006520875564461009), ('buy', 0.0006517746044913196), ('provide', 0.0006517746044913196), ('performances', 0.0006514996521165078), ('tough', 0.0006513116963915387), ('thrown', 0.0006512548480284078), ('hill', 0.0006512208452691519), ('beginning', 0.000651103824452392), ('loving', 0.0006510856249939714), ('ups', 0.0006510245761777506), ('see', 0.0006509615377774679), ('course', 0.000650951656758376), ('problem', 0.0006504279627465027), ('best', 0.0006503077449163204), ('room', 0.00065000588100559), ('filmmakers', 0.0006499420610860018), ('places', 0.0006493805747227564), ('never', 0.0006493165422671562), ('supposedly', 0.0006487360282466047), ('kevin', 0.0006486366355657029), ('especially', 0.0006485261265981212), ('even', 0.000648425062841444), ('occasionally', 0.0006482891787988526), ('company', 0.0006482129946307112), ('money', 0.0006480713396930734), ('fair', 0.0006478244553731903), ('science', 0.0006477404096472726), ('not', 0.0006476204670378808), ('next', 0.0006475053385230357), ('know', 0.0006468572043976747), ('seems', 0.000646841698041768), ('memories', 0.0006467532284936977), ('unbelievable', 
0.0006467532284936977), ('sick', 0.0006463880375120525), ('actors', 0.0006462354435466341), ('supposed', 0.0006459044138513752), ('idea', 0.0006457879795324968), ('likable', 0.0006457743779827689), ('extremely', 0.0006455626764426486), ('ve', 0.0006454666510550725), ('plays', 0.0006453135893114008), ('creature', 0.0006451910226277709), ('held', 0.0006451910226277709), ('mike', 0.0006451910226277709), ('seconds', 0.0006451910226277709), ('time', 0.0006449425340547377), ('entertaining', 0.0006446039516335691), ('my', 0.0006441661752358124), ('help', 0.0006441611885932493), ('awful', 0.0006441436346040245), ('could', 0.0006440929900534719), ('considering', 0.0006440669964559454), ('dr', 0.0006440243119177749), ('should', 0.000643552677141085), ('slowly', 0.0006433766496732495), ('fans', 0.0006433099992381855), ('pull', 0.0006432382652953624), ('mistake', 0.0006431873237997343), ('moral', 0.0006431197833898005), ('occur', 0.0006428867689755288), ('characterization', 0.0006425946804843995), ('entirely', 0.0006425946804843995), ('fire', 0.0006425946804843995), ('bond', 0.0006422177921087489), ('nomination', 0.0006422177921087489), ('doesn', 0.0006421234962308942), ('series', 0.000641827148682892), ('today', 0.000641765780712276), ('albeit', 0.0006417129039074055), ('present', 0.0006415906262961427), ('ahead', 0.0006415042167841836), ('speed', 0.0006414399120310978), ('anywhere', 0.0006410014705327854), ('efforts', 0.0006410014705327854), ('mad', 0.0006410014705327854), ('possible', 0.0006410014705327854), ('realize', 0.0006410014705327854), ('selling', 0.0006410014705327854), ('it', 0.0006405730734197016), ('flashbacks', 0.000640446970990802), ('holes', 0.000640446970990802), ('predictable', 0.0006403799435736392), ('flaw', 0.0006403399623072613), ('generally', 0.0006402693157977394), ('used', 0.0006399878692194824), ('animals', 0.000639924157136932), ('got', 0.0006397980885480555), ('things', 0.0006396737955731069), ('non', 0.0006396386041886335), ('pieces', 
0.0006394303884971658), ('everything', 0.000639365173771159), ('so', 0.0006390972320839649), ('hasn', 0.0006390776966116186), ('place', 0.0006386716708311836), ('appearance', 0.0006386074407642222), ('largely', 0.0006386074407642222), ('stuck', 0.0006384594951043672), ('wants', 0.0006382347851870402), ('revolves', 0.0006381010113901031), ('theme', 0.0006378593064615462), ('seemed', 0.0006378000203469945), ('exciting', 0.0006377021982579842), ('fake', 0.0006377021982579842), ('saved', 0.0006376248166054836), ('go', 0.0006376136925584035), ('frank', 0.0006375101771202974), ('helped', 0.0006375101771202974), ('oh', 0.0006375101771202974), ('decent', 0.0006373228394249932), ('difference', 0.0006373228394249932), ('happened', 0.0006373228394249932), ('trust', 0.0006373228394249932), ('directors', 0.0006372308736472984), ('work', 0.0006371939070111661), ('etc', 0.0006370800497718789), ('our', 0.0006369861926514015), ('strikes', 0.000636961545298335), ('seen', 0.0006367336520799815), ('little', 0.0006363792864759592), ('funniest', 0.0006363527894410891), ('damn', 0.0006362882244259267), ('couple', 0.0006362330341904833), ('this', 0.0006362222842527883), ('way', 0.0006359903405904665), ('began', 0.0006359740080188027), ('pulls', 0.0006359740080188027), ('making', 0.0006359280760523128), ('instead', 0.0006357293085159097), ('always', 0.0006355965193658757), ('problems', 0.0006355965193658757), ('or', 0.0006355875258306249), ('entire', 0.000635364058522621), ('turn', 0.0006352884449487142), ('personal', 0.0006352463489707263), ('later', 0.0006351777737724782), ('exact', 0.000635109912899212), ('attention', 0.0006350561310452954), ('happens', 0.0006350094367225153), ('ever', 0.0006349762899425742), ('common', 0.0006349105063331525), ('describe', 0.000634717142390307), ('straight', 0.0006345914558274575), ('minor', 0.000634529550505457), ('been', 0.0006344902934719933), ('face', 0.0006343474760289847), ('fight', 0.0006343474760289847), ('twist', 0.0006343474760289847), 
('have', 0.0006342080874479353), ('move', 0.0006342056273089425), ('society', 0.0006341900697073896), ('followed', 0.0006339989334597381), ('combination', 0.0006338871367865835), ('nearly', 0.0006336322258054493), ('hot', 0.0006335790357188346), ('may', 0.0006335218734437214), ('if', 0.0006334845249732937), ('social', 0.0006334602767618115), ('strong', 0.0006329819174554437), ('add', 0.0006326933757003564), ('subtle', 0.0006325922256802605), ('talking', 0.0006325176275404396), ('patrick', 0.0006324149627737556), ('took', 0.0006322647216517789), ('eddie', 0.0006318913035611389), ('government', 0.0006318913035611389), ('put', 0.0006318847691429929), ('before', 0.0006317650285653122), ('learned', 0.000631720001276202), ('together', 0.000631683328804283), ('cross', 0.0006314900549657912), ('deserves', 0.0006314900549657912), ('give', 0.0006313901451384067), ('character', 0.000631182985573547), ('ability', 0.0006310363216211411), ('player', 0.0006309111408392287), ('poor', 0.0006306710681067937), ('formula', 0.0006306130913584845), ('needs', 0.0006305939406678666), ('interested', 0.0006305786823940408), ('do', 0.0006304871168155528), ('game', 0.0006304437992534219), ('suspense', 0.0006303616674400746), ('short', 0.0006301762085067098), ('wild', 0.0006300908072045678), ('follow', 0.0006299954039481207), ('second', 0.0006299059485256166), ('all', 0.0006294991237565685), ('ago', 0.00062944335947677), ('say', 0.0006293887843642655), ('because', 0.0006290448272023219), ('powerful', 0.0006289852826559587), ('seeing', 0.0006288777224349069), ('audiences', 0.0006284328142478287), ('worker', 0.0006284328142478287), ('days', 0.0006283205941024273), ('were', 0.0006281126594862564), ('shot', 0.0006281077627921833), ('charming', 0.0006280737097825443), ('oliver', 0.0006280737097825443), ('film', 0.0006279666236763682), ('singing', 0.0006279091202359556), ('leaves', 0.0006278772935280517), ('films', 0.0006278191103276649), ('quite', 0.0006278043814335809), ('laughable', 
0.0006277534274216149), ('battle', 0.0006275584729410491), ('powers', 0.0006275584729410491), ('details', 0.0006273766246440509), ('hell', 0.000627333056822895), ('taking', 0.0006272902091310146), ('mark', 0.0006271456627005742), ('perfectly', 0.0006271456627005742), ('robert', 0.000627137076486493), ('made', 0.0006271226129930316), ('generated', 0.000627086172503012), ('big', 0.0006268262942715562), ('starring', 0.0006266568084684328), ('suppose', 0.0006266568084684328), ('dramatic', 0.0006264244207177584), ('what', 0.0006260189663520424), ('dozen', 0.0006259190829908375), ('touches', 0.0006259190829908375), ('wrong', 0.0006259190829908375), ('seriously', 0.0006257867813457326), ('thoughts', 0.0006257867813457326), ('seem', 0.0006257614273719321), ('back', 0.0006256700813097204), ('loose', 0.0006256634493036858), ('sam', 0.0006256241759718609), ('violence', 0.0006255706449948188), ('any', 0.0006253106130995327), ('gotten', 0.0006251540343474053), ('record', 0.0006251540343474053), ('robin', 0.0006250970571295465), ('surprises', 0.0006250693710166432), ('completely', 0.0006249764337694657), ('join', 0.0006246470744029624), ('results', 0.0006245882840900773), ('people', 0.0006245715157190483), ('bunch', 0.0006245321967800837), ('industry', 0.0006245321967800837), ('cliches', 0.0006244274182888866), ('amazing', 0.0006244026472868915), ('point', 0.0006242677266906242), ('ass', 0.0006242017814390315), ('disturbing', 0.000624161911626727), ('which', 0.0006240510808640825), ('sense', 0.0006240167998774386), ('monster', 0.000623891198951584), ('write', 0.000623891198951584), ('ship', 0.00062371956814097), ('hold', 0.0006237077554940856), ('order', 0.0006236089285609969), ('movie', 0.0006234062228935263), ('unlike', 0.0006233902994508702), ('re', 0.0006232476883776215), ('save', 0.0006229081301665292), ('heart', 0.0006228812876201978), ('killer', 0.0006228611418740851), ('between', 0.0006227931995624544), ('take', 0.0006223776384022585), ('asks', 0.0006221484861053505), 
('edge', 0.0006221484861053505), ('finally', 0.0006221484861053505), ('lacking', 0.0006221484861053505), ('quiet', 0.0006221484861053505), ('shooting', 0.0006221484861053505), ('stunning', 0.0006221484861053505), ('tommy', 0.0006221484861053505), ('tradition', 0.0006221484861053505), ('going', 0.0006216814076623284), ('they', 0.000621589734442527), ('cast', 0.0006213394503626908), ('sound', 0.000621302025580037), ('mission', 0.0006211748578015862), ('there', 0.0006210483119477813), ('doubt', 0.0006209634413699117), ('kids', 0.0006208839566620469), ('brought', 0.0006208275763683964), ('inside', 0.0006208275763683964), ('six', 0.0006207377185631616), ('small', 0.0006206982565340094), ('thought', 0.0006206420733060639), ('race', 0.0006205155504462813), ('can', 0.000620277579253773), ('one', 0.0006202348373100844), ('explain', 0.000620135060583974), ('using', 0.0006200746578183327), ('many', 0.0006198587703310405), ('humanity', 0.0006197086881206236), ('much', 0.0006196181929293892), ('fan', 0.0006195233870078595), ('accept', 0.00061938338172266), ('trying', 0.0006192172800459613), ('1995', 0.0006191429378632956), ('lee', 0.00061902994732788), ('car', 0.0006189182239760392), ('claims', 0.0006188566951735761), ('out', 0.0006185562072468923), ('effectively', 0.0006185101908649683), ('frankly', 0.0006183778892198636), ('hard', 0.000618262266146714), ('told', 0.0006182356025449395), ('born', 0.0006181603547841623), ('fully', 0.0006180821561308057), ('air', 0.0006180282974556462), ('still', 0.0006179889451285238), ('rob', 0.0006177360854946742), ('against', 0.000617664533052339), ('silent', 0.0006176401637422683), ('failed', 0.0006175399788008664), ('plot', 0.0006173511305174044), ('important', 0.0006173256296239136), ('none', 0.0006170067630796864), ('broken', 0.0006169639153878059), ('shock', 0.0006169639153878059), ('south', 0.0006169639153878059), ('books', 0.0006168309776770996), ('spend', 0.0006166182773399696), ('means', 0.0006164822885998373), ('girlfriend', 
0.0006164407018291546), ('same', 0.0006164275804859909), ('suspects', 0.0006163878519747455), ('five', 0.000616306716282765), ('being', 0.0006162586897745075), ('weren', 0.0006162232624281567), ('obsessed', 0.0006160489911435334), ('whatever', 0.0006160489911435334), ('van', 0.0006160232548778717), ('college', 0.0006159579539052972), ('recently', 0.0006159579539052972), ('logic', 0.0006158641579628722), ('them', 0.0006158507656812686), ('marry', 0.000615458717437551), ('speech', 0.000615458717437551), ('far', 0.00061529015633726), ('would', 0.0006151668925359685), ('shows', 0.0006150671212228505), ('those', 0.0006149973540811511), ('here', 0.0006148167699391258), ('must', 0.0006147659258603031), ('long', 0.0006147065185682051), ('exist', 0.0006146527212125149), ('something', 0.0006145255546073584), ('land', 0.000614467640597877), ('no', 0.0006144303549400738), ('telling', 0.0006143521391616743), ('she', 0.0006141595264514455), ('winner', 0.0006140686356364499), ('almost', 0.0006140554976682077), ('throughout', 0.0006139081088059418), ('liners', 0.0006138531729572792), ('chance', 0.0006137787755299422), ('standing', 0.0006136259041039073), ('that', 0.0006135531164153746), ('fascinating', 0.0006135075349094428), ('ex', 0.0006133504267058808), ('quickly', 0.000613202560161352), ('minutes', 0.000613131841379186), ('obviously', 0.0006130527480043951), ('mess', 0.0006130184244643914), ('cute', 0.0006128626878052707), ('plenty', 0.0006128626878052707), ('comedies', 0.000612721993891633), ('enough', 0.000612576970934499), ('drama', 0.0006122731133100275), ('notice', 0.0006122731133100275), ('terms', 0.0006122731133100275), ('decide', 0.0006121138331036512), ('destroy', 0.0006117793446702613), ('50', 0.0006116035965103445), ('style', 0.0006116035965103445), ('succeeds', 0.0006115134692488488), ('theater', 0.0006115134692488488), ('has', 0.0006113816301969087), ('talented', 0.0006113753521468163), ('superior', 0.0006112336003842039), ('off', 0.000610998871659018), 
('introduced', 0.0006108366954488896), ('certain', 0.0006106851136645484), ('remarkable', 0.0006106272178441403), ('taste', 0.0006106272178441403), ('john', 0.0006101940874583805), ('end', 0.0006100413906444177), ('smith', 0.0006098461149111769), ('read', 0.0006098176152095687), ('other', 0.0006097992006179753), ('sweet', 0.0006096555446172912), ('use', 0.0006096429888972028), ('visually', 0.0006093470769262281), ('ed', 0.000609187059311489), ('fox', 0.0006090702897007335), ('play', 0.0006088723226785255), ('credits', 0.000608867812346123), ('tried', 0.0006085813851622432), ('part', 0.0006082067833354827), ('office', 0.0006081900264811919), ('main', 0.0006081150616067336), ('despite', 0.0006080087477847743), ('where', 0.0006079708396391428), ('meeting', 0.000607944182769612), ('ways', 0.0006078840587343283), ('involved', 0.0006076402223690117), ('figures', 0.0006075440615488869), ('door', 0.0006073354269123659), ('halfway', 0.0006073354269123659), ('screenwriter', 0.0006073354269123659), ('willing', 0.0006073354269123659), ('opening', 0.0006072386095320195), ('married', 0.0006071569563196793), ('truth', 0.0006070411277231013), ('humor', 0.0006069741327857078), ('highly', 0.0006068997487008076), ('effort', 0.0006068383443891115), ('comic', 0.0006066880695697419), ('led', 0.0006064640704892491), ('friend', 0.0006064299831964133), ('worse', 0.0006060657361243958), ('than', 0.0006058864534585655), ('driving', 0.0006057761575236307), ('final', 0.0006057761575236307), ('paced', 0.0006057761575236307), ('yet', 0.0006057761575236307), ('points', 0.0006056087513009137), ('editing', 0.0006055578598092078), ('disaster', 0.0006054973100782), ('works', 0.0006053961127561814), ('hope', 0.0006052028074954771), ('conclusion', 0.0006051498935888108), ('manage', 0.0006050699002122624), ('pg', 0.0006050699002122624), ('comes', 0.0006048901606608638), ('generation', 0.0006048665837135352), ('past', 0.0006048665837135352), ('adaptation', 0.0006046583680220675), ('score', 
0.0006045405100835009), ('students', 0.000604372815073769), ('value', 0.000604372815073769), ('you', 0.000604281417718327), ('only', 0.000604277821507802), ('watch', 0.0006039208079607493), ('how', 0.0006036931915274067), ('talk', 0.0006036321621141198), ('woody', 0.000603555542842432), ('about', 0.0006033704212867672), ('owner', 0.0006032955016779157), ('is', 0.0006032580875393416), ('are', 0.0006032575189779611), ('ll', 0.0006030994567791901), ('came', 0.0006030916856300515), ('field', 0.0006028570601796031), ('90', 0.0006025840683032954), ('enjoyed', 0.0006025840683032954), ('multiple', 0.0006025840683032954), ('everyone', 0.000602438547840305), ('obvious', 0.0006022624614353165), ('wonderful', 0.0006020791801019521), ('look', 0.000602031109907932), ('wait', 0.0006019862666482326), ('likely', 0.0006019160150124935), ('jeff', 0.0006018168362326266), ('dialogue', 0.0006017698266375452), ('didn', 0.0006017325599637242), ('now', 0.0006016610695602147), ('island', 0.0006015685107379979), ('over', 0.0006014962542014384), ('1998', 0.0006014102032351721), ('agree', 0.0006014102032351721), ('date', 0.0006014102032351721), ('opera', 0.0006014102032351721), ('remake', 0.0006011771888209005), ('be', 0.0006011213317860572), ('chief', 0.0006011096484109667), ('games', 0.0006011096484109667), ('producer', 0.0006010587069153386), ('while', 0.000600992473040906), ('due', 0.0006009869729725155), ('phone', 0.0006009869729725155), ('building', 0.0006006950900327521), ('given', 0.0006006665994669187), ('falls', 0.0006004557215967957), ('mary', 0.0006003950425352334), ('figure', 0.0006003187146630575), ('travel', 0.0006003187146630575), ('naked', 0.0006001903042428087), ('am', 0.0006000416871066832), ('addition', 0.0006000008053702085), ('unnecessary', 0.0005998149507066969), ('super', 0.0005997287208402928), ('hero', 0.0005996418225253119), ('forces', 0.0005995622374348592), ('onto', 0.0005994932191043153), ('stands', 0.0005994932191043153), ('choice', 0.0005994659892160929), 
('imagine', 0.0005994659892160929), ('worth', 0.0005993948842101705), ('enjoy', 0.0005993089675258589), ('without', 0.000599238187956086), ('position', 0.00059910594958293), ('for', 0.0005988931282437008), ('released', 0.0005988905987743094), ('several', 0.0005988859731006158), ('law', 0.0005988595053420486), ('technology', 0.000598522594227932), ('otherwise', 0.0005982196981782216), ('early', 0.0005981933738790719), ('who', 0.0005981748562238021), ('version', 0.000598001170434595), ('immediately', 0.0005979750275450198), ('studio', 0.0005979750275450198), ('yourself', 0.0005979538227568091), ('pacing', 0.0005977505062580819), ('witness', 0.0005977505062580819), ('most', 0.0005976870249575253), ('audience', 0.0005976437317292098), ('happen', 0.0005976396063496852), ('similar', 0.0005976193343234191), ('often', 0.000597486744313787), ('stop', 0.0005973863573051375), ('surprisingly', 0.0005971756847433556), ('mental', 0.0005970111735354374), ('every', 0.0005969647212682807), ('creating', 0.0005967546703459484), ('girl', 0.0005964550383015896), ('background', 0.000596414850427027), ('million', 0.0005963657560505341), ('1997', 0.0005962256325176275), ('around', 0.0005961969250385714), ('leave', 0.0005961050611055916), ('unfortunately', 0.0005960899107710949), ('growing', 0.000595860521903716), ('members', 0.0005957036958682102), ('brings', 0.0005954556467674971), ('faces', 0.0005954556467674971), ('apparently', 0.0005953574029716272), ('let', 0.0005953107082733549), ('student', 0.0005950985519268569), ('veteran', 0.0005950985519268569), ('more', 0.0005950716124445559), ('forget', 0.0005949507380788871), ('whole', 0.0005949163974879445), ('career', 0.0005948830146027255), ('catherine', 0.0005948612718024842), ('watching', 0.0005948059224834115), ('anything', 0.0005946316706465377), ('starts', 0.0005945849455816957), ('screen', 0.0005942032822377343), ('up', 0.0005941929153640528), ('model', 0.0005941237795240284), ('and', 0.0005940629658477276), ('capture', 
0.0005935439580085528), ('solid', 0.0005935439580085528), ('positive', 0.0005934339405927958), ('lots', 0.000593387363876636), ('buddy', 0.0005932723960329503), ('era', 0.0005932113472167296), ('storyline', 0.0005930761269415491), ('thriller', 0.0005929729550712593), ('old', 0.0005925904737386595), ('as', 0.0005925848590923172), ('cause', 0.0005925223677193814), ('handle', 0.0005925223677193814), ('heroine', 0.0005925223677193814), ('mouth', 0.0005925223677193814), ('provided', 0.0005925223677193814), ('easy', 0.0005922554657519402), ('sets', 0.0005920306479121454), ('twenty', 0.0005920159383452623), ('ben', 0.0005919268678523268), ('us', 0.0005918044934621259), ('haven', 0.0005917675621554076), ('stone', 0.0005917675621554076), ('taken', 0.0005917323378957556), ('fill', 0.0005916510112962647), ('least', 0.0005914705528654417), ('begins', 0.0005914642920627396), ('friendly', 0.0005914251040754566), ...]
##example part of speech (POS) tagging (note that you need to tokenize the sentence first)
pos_tag(tokenizer.tokenize("This was a great day but the time is running out fast"))
[('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('great', 'JJ'), ('day', 'NN'), ('but', 'CC'), ('the', 'DT'), ('time', 'NN'), ('is', 'VBZ'), ('running', 'VBG'), ('out', 'RP'), ('fast', 'RB')]
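The tags above follow the Penn Treebank convention, where adjective tags start with JJ (JJ, JJR, JJS) and adverb tags start with RB (RB, RBR, RBS). As a minimal sketch of the filtering step used below, the following keeps only those tokens by matching on the tag prefix. The helper name `keep_adj_adv` is made up for this illustration, and the sample data is the tagged sentence printed above, so no tagger is needed to run it.

```python
# Tagged tokens copied from the pos_tag example output above.
tagged = [('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('great', 'JJ'),
          ('day', 'NN'), ('but', 'CC'), ('the', 'DT'), ('time', 'NN'),
          ('is', 'VBZ'), ('running', 'VBG'), ('out', 'RP'), ('fast', 'RB')]

def keep_adj_adv(tagged_tokens):
    """Join the tokens whose Penn Treebank tag starts with JJ or RB.

    Hypothetical helper for illustration only.
    """
    # str.startswith accepts a tuple of prefixes, so this matches
    # JJ, JJR, JJS, RB, RBR and RBS, but not RP (particle).
    return " ".join(w for w, tag in tagged_tokens
                    if tag.startswith(("JJ", "RB")))

print(keep_adj_adv(tagged))  # prints: great fast
```

Matching on the prefix avoids having to spell out every tag variant by hand, at the cost of being less explicit about which tags are included.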
## POS tagging all reviews
## POS tagging is relatively slow, so this will take a while
reviews_pos_tagged=[pos_tag(tokenizer.tokenize(m)) for m in data.data]
## Reconstructing adjective-and-adverb-only reviews
reviews_adj_adv_only=[" ".join([w for w,tag in m if tag in ["JJ","JJR","JJS","RB","RBR","RBS"]])
                      for m in reviews_pos_tagged]
print(data.data[1])
b"good films are hard to find these days . \ngreat films are beyond rare . \nproof of life , russell crowe's one-two punch of a deft kidnap and rescue thriller , is one of those rare gems . \na taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these days , genuine motivation in a story that rings true . \nconsider the strange coincidence of russell crowe's character in proof of life making the moves on a distraught wife played by meg ryan's character in the film -- all while the real russell crowe was hitching up with married woman meg ryan in the outside world . \ni haven't seen this much chemistry between actors since mcqueen and mcgraw teamed up in peckinpah's masterpiece , the getaway . \nbut enough with the gossip , let's get to the review . \nthe film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south america who is kidnapped during a mass ambush of civilians by anti-government soldiers . \nupon discovering his identity , the rebel soldiers decide to ransom him for $6 million . \nthe only problem is that the company peter bowman works for is being auctioned off , and no one will step forward with the money . \nwith no choice available to her , bowman's wife alice ( ryan ) hires terry thorne ( crowe ) , a highly skilled negotiator and rescue operative , to arrange the return of her husband . \nbut when things go wrong -- as they always do in these situations -- terry and his team ( which includes the most surprising casting choice of the year : david caruso ) take matters into their own hands . \nthe film is notable in that it takes this very simple story line and creates a complex and intelligent character-driven vehicle filled with well-written dialogue , shades of motivation , and convincing acting by all the actors . 
\nthe script is based on both a book ( the long march to freedom ) and a magazine article pertaining to kidnap/ransom situations , and the story has been sharply pieced together by tony gilroy , screenwriter of the devil's advocate and dolores claiborne . \nthe biggest surprise for me was not the chemistry between crowe and ryan , but that between crowe and david caruso . \ndug out from b-movie hell , caruso pulls off a gutsy performance as crowe's right hand gun while providing most of the film's humor . \nryan cries a lot and smokes too many cigarettes , david morse ends up getting everyone at the guerilla camp to hate him , and crowe provides another memorable acting turn as the stoic , gunslinger character of terry thorne . \nthe most memorable pieces of the film lie in its action scenes . \nthe bulk of those scenes , which bookend the movie , work extremely well as establishment and closure devices for all of the story's characters . \nthe scenes are skillfully crafted and executed with amazing accuracy and poise . \ndirector taylor hackford mixes both his old-school style of filmmaking with the dizziness of a lars von trier film . \nproof of life is a thinking man's action movie . \nit is a film about the choices men and women make in the face of love and war , and the sacrifices one makes for those choices -- the sacrifices that help you sleep at night . \n"
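## It kind of works: most of what is left are adjectives and adverbs,
## but names and possessives (e.g. "crowe's", "david", "tony") slip through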
reviews_adj_adv_only[1]
"good hard great rare crowe's one-two rare taut strong subtle intelligent masterful together virtually unheard genuine true strange distraught real married outside much enough let's david american south anti-government only forward available bowman's ryan terry highly skilled wrong always most surprising own notable very simple complex intelligent character-driven well-written long sharply together tony biggest not gutsy right most film's ryan too many david memorable gunslinger terry most memorable extremely well skillfully amazing old-school trier man's"
## term doc matrix only for adj/adv
X = vec.fit_transform(reviews_adj_adv_only)
## note: on scikit-learn >= 1.2 this method is called get_feature_names_out()
terms = vec.get_feature_names()
len(terms)
576
pmi_matrix=getcollocations_matrix(X)
pmi_matrix.shape # n_words by n_words
(576, 576)
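`getcollocations_matrix` is defined earlier in the notebook and is not repeated here. To give a feel for what a word-by-word association matrix built from document co-occurrence looks like, here is a minimal pure-Python sketch of document-level pointwise mutual information (PMI). It is an illustration under stated assumptions, not necessarily the formula used by `getcollocations_matrix`, and its log-scale scores are not comparable to the numbers printed in this notebook. The function `doc_pmi` and the toy documents are made up.

```python
import math
from collections import Counter
from itertools import combinations

def doc_pmi(docs):
    """Return {(w1, w2): PMI} from document-level co-occurrence.

    PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ), where each
    probability is the fraction of documents containing the word(s).
    Hypothetical helper for illustration only.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for doc in docs:
        vocab = sorted(set(doc.split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # keys are alphabetically ordered pairs
    pmi = {}
    for (w1, w2), n_both in pair_df.items():
        p_both = n_both / n_docs
        p_w1 = word_df[w1] / n_docs
        p_w2 = word_df[w2] / n_docs
        pmi[(w1, w2)] = math.log(p_both / (p_w1 * p_w2))
    return pmi

# Made-up mini corpus of adjective-only "reviews".
toy_docs = ["good nice fun", "good nice", "bad dull", "bad dull boring"]
scores = doc_pmi(toy_docs)
print(scores[("good", "nice")])  # log(2): the pair co-occurs twice as often as chance
```

Pairs that never share a document (such as "bad" and "good" here) simply have no entry, which sidesteps taking the log of zero.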
getcollocations("good",pmi_matrix,terms)
[('good', 0.0012832614349917284), ('sean', 0.0009249332576569759), ('nicely', 0.0009142510507387667), ('fairly', 0.000867719006738669), ('pretty', 0.0008609133674701302), ('terrific', 0.000831187604463116), ('he', 0.0008287103623039518), ('sadly', 0.0008245212623420526), ('horrible', 0.0008203986560303424), ('technical', 0.000817968488099229), ('stupid', 0.0008165931732810714), ('forward', 0.0008158591569455379), ('lovely', 0.0008132714383389111), ('robin', 0.0008028131634819533), ('sad', 0.0008020759613116301), ('total', 0.0007996250034466595), ('cool', 0.0007961783439490446), ('totally', 0.0007939970334176773), ('naturally', 0.0007829087048832272), ('thankfully', 0.0007774887114619778), ('they', 0.0007756992159419563), ('bad', 0.0007720796800988853), ('nice', 0.0007720517274657402), ('average', 0.0007712639195805711), ('fun', 0.0007703436484226743), ('climactic', 0.0007673110589637576), ('badly', 0.0007654486534808358), ('dumb', 0.0007600492426269871), ('therefore', 0.0007561876508739784), ('mainly', 0.0007555888597477207), ('bigger', 0.0007541908292930254), ('twice', 0.0007477153143173636), ('really', 0.0007464406345154477), ('suspenseful', 0.0007449621931686967), ('anti', 0.0007446382965629712), ('guilty', 0.0007372021703231894), ('extra', 0.000736855251654802), ('gary', 0.0007360226468506723), ('smart', 0.000732211878708694), ('aren', 0.0007289455060155697), ('ve', 0.0007279344858962693), ('violent', 0.0007274069971383735), ('boring', 0.00072578337925946), ('forever', 0.0007236625698992255), ('co', 0.0007234410631438232), ('longer', 0.0007230160096402135), ('natural', 0.0007226849583537482), ('scary', 0.0007221226337134648), ('fantastic', 0.0007207509218907141), ('though', 0.0007206722287012948), ('nevertheless', 0.0007151637054419488), ('that', 0.000715019519211013), ('slightly', 0.000714275671039496), ('particular', 0.0007133757961783439), ('probably', 0.0007111342630959992), ('looking', 0.0007101544769016765), ('terribly', 0.0007101544769016765), 
('intelligent', 0.0007096530262048105), ('witty', 0.0007094402154212624), ('able', 0.0007077140835102619), ('usual', 0.0007077140835102619), ('brilliant', 0.0007048832271762208), ('realistic', 0.0007032908704883227), ('overall', 0.000703118537513442), ('maybe', 0.0007005098084086604), ('impressive', 0.0006997398403157801), ('somewhere', 0.0006987980005684003), ('very', 0.0006972942200561076), ('general', 0.0006970316067780315), ('plain', 0.0006970316067780315), ('disappointing', 0.0006963906581740977), ('capable', 0.0006948465547191663), ('better', 0.0006924412263889158), ('fair', 0.0006912556164518837), ('weird', 0.0006907289455060155), ('sure', 0.0006904748942965504), ('past', 0.0006904264112413089), ('right', 0.0006895282417358496), ('great', 0.0006882164679575816), ('seemingly', 0.0006870005005782542), ('actually', 0.000685391918869078), ('national', 0.0006848845969454146), ('wonderfully', 0.0006844011489946297), ('loud', 0.0006830979414751223), ('necessary', 0.0006830979414751223), ('peter', 0.0006824385805277526), ('evil', 0.0006819790259280704), ('relatively', 0.0006819790259280704), ('biggest', 0.0006811154333917554), ('dull', 0.0006802721088435374), ('believable', 0.0006794055201698514), ('huge', 0.0006784004824181208), ('robert', 0.0006760084925690022), ('funny', 0.0006758333405353084), ('major', 0.000675087264745043), ('fake', 0.0006748559296329996), ('it', 0.000674409891345073), ('there', 0.0006743076615601185), ('isn', 0.0006732595820762097), ('black', 0.0006723283793347488), ('offensive', 0.0006723283793347488), ('definitely', 0.0006717286216368588), ('danny', 0.0006713720089516268), ('well', 0.0006712165858124412), ('sometimes', 0.0006709129511677283), ('hardly', 0.0006708415850416599), ('special', 0.0006705840136359559), ('awful', 0.0006682137625701541), ('also', 0.0006677363413883081), ('brief', 0.0006667411628859836), ('musical', 0.0006664698190407897), ('responsible', 0.0006664307619721632), ('basic', 0.0006652512384996461), ('either', 
0.0006651259793698213), ('anyway', 0.0006640466187830329), ('again', 0.0006629841024745483), ('as', 0.0006629024930696557), ('just', 0.0006621746447228981), ('before', 0.0006605331446095777), ('usually', 0.0006603252990637598), ('occasionally', 0.0006601366661314207), ('basically', 0.000660085931918576), ('around', 0.0006583525129797142), ('interesting', 0.0006582175157064767), ('then', 0.000658144236310809), ('regular', 0.0006574892130675981), ('movie', 0.0006571630775452432), ('especially', 0.000655566729988453), ('unbelievable', 0.0006549354060959373), ('next', 0.0006545539933664034), ('terrible', 0.0006535061962626672), ('however', 0.0006534727007700416), ('supposedly', 0.0006532745386248571), ('too', 0.0006526945865931038), ('extremely', 0.0006523525785905076), ('typical', 0.0006521079769487413), ('best', 0.0006519679895476074), ('minor', 0.0006490255985362402), ('never', 0.0006488495909123945), ('social', 0.0006485234510712218), ('ahead', 0.0006484191197566994), ('frank', 0.0006484191197566994), ('together', 0.0006481608436062671), ('earlier', 0.0006481171080567661), ('even', 0.0006475500993617314), ('not', 0.000647194430057688), ('professional', 0.000646173728422413), ('predictable', 0.0006440198159943383), ('personal', 0.000643647334897754), ('tough', 0.0006435774946921443), ('generally', 0.000643126584626801), ('entirely', 0.0006429233575550971), ('always', 0.0006416607690493041), ('entire', 0.0006411056991798842), ('alien', 0.0006396646524035059), ('likable', 0.0006396301969953506), ('surprising', 0.0006393830685506504), ('re', 0.0006393282282497197), ('quite', 0.0006392831469314743), ('second', 0.0006375118821969114), ('little', 0.0006375096023289368), ('quiet', 0.0006369426751592356), ('instead', 0.0006365668977697612), ('nearly', 0.0006362510978789325), ('possible', 0.0006361034884989469), ('poor', 0.0006359642685921708), ('don', 0.0006359458947599255), ('common', 0.0006352968284533978), ('john', 0.0006348405541191062), ('strong', 
0.0006345571220687517), ('ever', 0.0006344707113487858), ('short', 0.0006341305662181353), ('straight', 0.0006339096148013345), ('we', 0.0006335726080949011), ('mean', 0.0006334621140927917), ('interested', 0.0006334041047416844), ('wild', 0.0006331171936267478), ('worse', 0.0006331171936267478), ('running', 0.0006329367463846492), ('so', 0.0006323964773073481), ('dramatic', 0.0006314423066345445), ('laughable', 0.0006312044528605038), ('funniest', 0.0006308765544434335), ('powerful', 0.0006299433051025408), ('back', 0.000629555760482866), ('largely', 0.000626832473966232), ('big', 0.0006259819025465529), ('completely', 0.0006258191508398261), ('wrong', 0.0006257682422617052), ('didn', 0.000625466230561772), ('tight', 0.0006249248888354765), ('finally', 0.0006248104337276312), ('ago', 0.0006243185861020256), ('decent', 0.0006241183259949558), ('more', 0.000621968372132822), ('perfectly', 0.0006215946588903384), ('fully', 0.0006202905790766413), ('hard', 0.0006202687831393604), ('much', 0.0006202111087208874), ('important', 0.00062015503875969), ('mental', 0.0006195398698270162), ('moral', 0.0006194579742725115), ('later', 0.0006191509803223855), ('many', 0.0006188591291767968), ('subtle', 0.0006185532540916463), ('here', 0.0006184390993909081), ('hot', 0.0006183186203300183), ('recently', 0.0006179294609753779), ('frankly', 0.0006176413819725922), ('small', 0.0006176413819725922), ('same', 0.0006169314493496352), ('remarkable', 0.0006160102867737209), ('certain', 0.0006154967938407428), ('still', 0.0006152448508223882), ('almost', 0.000614907621277048), ('enough', 0.0006135664210955028), ('visually', 0.0006133522057088936), ('obviously', 0.000612731403881253), ('other', 0.0006126687777268868), ('present', 0.000611591722914092), ('tony', 0.0006112076175770443), ('long', 0.0006096304083441553), ('quickly', 0.0006094667166229549), ('worth', 0.0006093904474805919), ('far', 0.0006088127147421085), ('positive', 0.0006075453209211171), ('unfunny', 0.000607317434454155), 
('comic', 0.0006067717063359034), ('due', 0.0006066120715802245), ('spectacular', 0.0006066120715802245), ('seriously', 0.000605535245417656), ('apparently', 0.0006054510915389225), ('obvious', 0.0006047294823925617), ('superior', 0.0006046339887381151), ('only', 0.0006043969151390608), ('immediately', 0.000604379143709377), ('likely', 0.000604379143709377), ('slowly', 0.0006036040778368514), ('effectively', 0.0006034193764666443), ('incredible', 0.0006034193764666443), ('unfortunately', 0.0006029501089433885), ('future', 0.000602698445311965), ('main', 0.0006023551447621176), ('final', 0.0006023019331768913), ('wonderful', 0.0006022374652947902), ('exciting', 0.0006015569709837226), ('top', 0.0006010365929811415), ('away', 0.0006009182398614209), ('incredibly', 0.0006000947518029163), ('now', 0.0005997020752003287), ('highly', 0.0005994754589733983), ('most', 0.0005991814703546777), ('several', 0.0005989389355912623), ('man', 0.0005986565034283526), ('often', 0.0005983923958135547), ('old', 0.0005982720466243037), ('oh', 0.0005976252260753322), ('star', 0.0005976252260753322), ('mary', 0.0005968833874133718), ('absolutely', 0.0005958495993425108), ('early', 0.0005956009613700224), ('french', 0.00059447983014862), ('pure', 0.00059447983014862), ('emotional', 0.0005941229995182786), ('surprisingly', 0.00059359055590756), ('practically', 0.0005928774586387854), ('easy', 0.0005917276087127468), ('few', 0.0005915726229176027), ('willing', 0.0005914467697907188), ('yet', 0.000591312086826336), ('similar', 0.0005912837020295414), ('least', 0.0005911743392196498), ('ll', 0.0005911494109321011), ('solid', 0.0005910033399138328), ('whole', 0.0005908314711500944), ('along', 0.0005894513353447312), ('happy', 0.0005885547820076039), ('unnecessary', 0.0005879470847623714), ('soon', 0.0005870488322717622), ('about', 0.000586717804716572), ('international', 0.000586391669194217), ('non', 0.0005850436423684831), ('up', 0.0005847487615003539), ('third', 0.0005845194097140311), 
('quick', 0.000583864118895966), ('sweet', 0.0005830216021298824), ('entertaining', 0.0005830033855511562), ('intriguing', 0.000582964482349131), ('effective', 0.0005827808830538584), ('simple', 0.0005823475887170154), ('utterly', 0.0005813365685977151), ('first', 0.000581192769237202), ('available', 0.0005810705106715834), ('middle', 0.0005807418508804796), ('double', 0.0005794409058740269), ('apart', 0.0005793995674345695), ('doesn', 0.0005786797017725769), ('last', 0.0005779384843487602), ('she', 0.0005770591757852905), ('what', 0.0005767341635770193), ('deep', 0.0005766355217394896), ('apparent', 0.0005766007375125712), ('pathetic', 0.0005766007375125712), ('honest', 0.0005762814680012133), ('else', 0.000576117518792678), ('single', 0.0005757673899744503), ('secret', 0.0005756074545883464), ('michael', 0.0005753841128657395), ('exactly', 0.0005752567854478683), ('free', 0.0005746851204444231), ('otherwise', 0.0005743176159709176), ('friendly', 0.0005740347566249902), ('rather', 0.0005732484076433122), ('light', 0.000571948524632783), ('constantly', 0.0005711551688047606), ('previous', 0.0005707371641211789), ('certainly', 0.0005704842058212914), ('such', 0.0005690939798374554), ('popular', 0.000569045232629571), ('talented', 0.0005689196710160164), ('already', 0.0005686201044674146), ('appropriate', 0.0005678171135140473), ('flat', 0.000567251746325019), ('known', 0.0005661712668082095), ('normal', 0.0005661712668082095), ('out', 0.0005661712668082095), ('real', 0.0005661712668082095), ('less', 0.0005645254201023716), ('enjoyable', 0.0005644129709485567), ('impossible', 0.0005626326963906581), ('originally', 0.0005621841452109685), ('clever', 0.0005617480537862704), ('convincing', 0.00056160536949524), ('virtually', 0.00056160536949524), ('perhaps', 0.0005612799383692617), ('excellent', 0.0005611161662117076), ('particularly', 0.0005609928710752076), ('aside', 0.0005608300284420943), ('classic', 0.0005605766890729504), ('nasty', 0.0005605095541401274), ('you', 
0.0005597375024126616), ('thoroughly', 0.0005594311326795403), ('safe', 0.000559004541911903), ('truly', 0.000559004541911903), ('fascinating', 0.0005583077769914288), ('different', 0.0005582527875521505), ('fast', 0.0005578452187669123), ('once', 0.0005570083587423185), ('easily', 0.0005566899298042442), ('emotionally', 0.0005566075629769897), ('new', 0.0005565231119904089), ('critical', 0.0005564096932425507), ('down', 0.0005563655897475252), ('key', 0.0005559568367369274), ('original', 0.0005557015278771899), ('over', 0.0005551008789097249), ('suddenly', 0.0005548065151022053), ('painful', 0.0005541761128504085), ('military', 0.0005538631957906397), ('serial', 0.0005533037380171138), ('intense', 0.0005525285856803009), ('cute', 0.0005520169851380043), ('nowhere', 0.000551579223849235), ('young', 0.0005511013442752416), ('bright', 0.0005506171111266653), ('dangerous', 0.0005504442871746481), ('silly', 0.0005503408202033747), ('humorous', 0.0005500868558193399), ('necessarily', 0.000549928648498138), ('soft', 0.0005490885130683066), ('somewhat', 0.0005486772108113266), ('crazy', 0.0005482544545674433), ('essentially', 0.0005479076775563317), ('close', 0.0005473421765129823), ('half', 0.0005471174260983178), ('mysterious', 0.0005469790204757277), ('potential', 0.0005464211063381557), ('slow', 0.0005463207498317021), ('familiar', 0.0005452879004095461), ('worst', 0.0005451500564069146), ('mad', 0.0005449398443029017), ('screen', 0.0005445799896841676), ('indeed', 0.0005441534953212235), ('animated', 0.0005440552016985138), ('visual', 0.0005425807973578674), ('low', 0.0005424444362627787), ('serious', 0.0005416989106494434), ('giant', 0.0005416520387180901), ('rarely', 0.0005411931226843178), ('steve', 0.0005411931226843178), ('no', 0.0005411194408432445), ('literally', 0.0005408003845691623), ('favorite', 0.0005407377919320594), ('like', 0.0005404362092260182), ('memorable', 0.0005401736065976285), ('aware', 0.0005397819281010471), ('standard', 
0.0005393928960807942), ('chris', 0.0005387077352093038), ('shallow', 0.0005387077352093038), ('true', 0.0005377934893765022), ('sci', 0.0005376344086021505), ('poorly', 0.0005375615485386457), ('computer', 0.0005374203821656051), ('older', 0.0005372849776853417), ('physical', 0.0005366183710132754), ('rare', 0.0005366183710132754), ('clear', 0.0005357231027502098), ('comedic', 0.0005349215540298343), ('simply', 0.0005347540528206044), ('can', 0.0005345322842512801), ('unique', 0.0005337073180233351), ('complex', 0.0005332543326914531), ('female', 0.0005330442246013461), ('oddly', 0.0005325848357263666), ('surely', 0.0005325848357263666), ('fi', 0.0005324705961648636), ('dead', 0.000531612760912124), ('genuinely', 0.0005307855626326964), ('successful', 0.000529644088304454), ('rich', 0.0005296190009565806), ('constant', 0.0005294068988336504), ('jean', 0.0005293313556117849), ('lucky', 0.0005292470537555002), ('psychological', 0.0005290452820994744), ('cold', 0.0005289231571497747), ('mostly', 0.0005281963647661954), ('merely', 0.0005281316348195329), ('clearly', 0.0005275686804349225), ('overly', 0.0005265392781316348), ('life', 0.0005263623496107572), ('lee', 0.0005263000508358004), ('amazing', 0.0005259159703149652), ('ready', 0.0005259159703149652), ('late', 0.0005252946775020133), ('dark', 0.0005251443634163102), ('private', 0.0005248513141063681), ('open', 0.0005247766694708168), ('eventually', 0.0005237084217975938), ('united', 0.0005233792524564262), ('attractive', 0.0005230179690331935), ('graphic', 0.0005230179690331935), ('time', 0.0005229220728159156), ('weak', 0.0005226196308998857), ('large', 0.0005216863815589931), ('strange', 0.0005216863815589931), ('various', 0.0005211349160393747), ('to', 0.0005209980274352141), ('billy', 0.0005185727974747759), ('heavily', 0.0005182964905707506), ('genuine', 0.0005179533841954225), ('own', 0.0005177625711331581), ('barely', 0.000517274657402046), ('hilarious', 0.0005159235668789809), ('lead', 
0.0005143386860440776), ('ultimate', 0.0005140239132864007), ('high', 0.0005112873174747606), ('traditional', 0.0005108811040339703), ('difficult', 0.0005095541401273885), ('successfully', 0.000508161915700811), ('possibly', 0.0005075636942675159), ('grand', 0.000506484536873609), ('greatest', 0.0005064087442006763), ('thus', 0.0005060155697098372), ('further', 0.0005057372551826141), ('dimensional', 0.0005055100596501871), ('wide', 0.0005042462845010616), ('recent', 0.0005031142773769634), ('complete', 0.0005025188758652747), ('innocent', 0.0005022487044266374), ('year', 0.000499562882477832), ('thin', 0.0004989384288747346), ('alive', 0.0004978402518485979), ('initially', 0.0004965993738529634), ('english', 0.0004963966388564935), ('david', 0.0004960803527682509), ('perfect', 0.0004944686557157225), ('human', 0.0004931504004900357), ('beautiful', 0.0004920049200492005), ('eccentric', 0.0004909766454352442), ('political', 0.0004908043124603634), ('all', 0.0004905190716743539), ('married', 0.0004905190716743539), ('chinese', 0.0004899559039686428), ('sole', 0.0004887233104995393), ('sympathetic', 0.0004875363686404026), ('sexual', 0.00048720527433232763), ('empty', 0.0004862680638312445), ('off', 0.0004861263635698075), ('tim', 0.0004852896572641795), ('modern', 0.00048489829463735357), ('foreign', 0.00048319789150010983), ('narrative', 0.00048319789150010983), ('outstanding', 0.0004820106730934756), ('lame', 0.00048166809265773043), ('painfully', 0.00048124557678697803), ('hearted', 0.00048071145295036654), ('fresh', 0.00048038774153423834), ('numerous', 0.00047928359714952384), ('ex', 0.0004789413913988051), ('full', 0.000478736230452094), ('unable', 0.0004786720710287589), ('unusual', 0.0004786720710287589), ('public', 0.00047845459166890946), ('blue', 0.00047770700636942675), ('subject', 0.000476979902858971), ('worthy', 0.0004768904131961457), ('ultimately', 0.00047569136499234053), ('equally', 0.00047481181239143023), ('film', 0.0004737351416150324), 
('cheap', 0.00047335630503637184), ('former', 0.00047282951741550467), ('occasional', 0.0004725703718923361), ('alone', 0.0004720200181983621), ('sudden', 0.00047095155375410154), ('one', 0.0004706472969156914), ('near', 0.00047035766780989714), ('william', 0.0004702874232358514), ('british', 0.000470124355474674), ('fine', 0.00046983604175244), ('accidentally', 0.00046932618169627895), ('famous', 0.0004684393219425067), ('frequently', 0.00046834020232296744), ('steven', 0.0004639458991900605), ('green', 0.0004615526631588664), ('deadly', 0.00046123435097737757), ('sharp', 0.00046051254448132537), ('meanwhile', 0.0004596493532076958), ('of', 0.0004579326422713459), ('ill', 0.00045780254777070064), ('initial', 0.0004566758803028482), ('local', 0.00045631714041258675), ('ugly', 0.0004549590536851683), ('actual', 0.00045453186208546393), ('be', 0.0004542536908112378), ('ridiculous', 0.0004519259933272672), ('desperately', 0.00045158898662083375), ('somehow', 0.000449853902587711), ('heavy', 0.00044979161751985535), ('self', 0.0004482189195564992), ('american', 0.00044597422497684366), ('limited', 0.00044422668626490285), ('cinematic', 0.00044378462078763795), ('inevitable', 0.0004428268122535639), ('current', 0.0004411724156947087), ('younger', 0.00043862719021954696), ('bottom', 0.0004376939408786542), ('directly', 0.0004361048947036208), ('tiny', 0.00043312101910828024), ('unfortunate', 0.00043295449814745426), ('white', 0.0004315217691013869), ('greater', 0.00043136858423482627), ('tom', 0.0004311611954924057), ('teen', 0.00042832087141142803), ('extraordinary', 0.0004246284501061571), ('fellow', 0.0004246284501061571), ('latest', 0.0004246284501061571), ('unlikely', 0.0004246284501061571), ('odd', 0.00042015867694714494), ('latter', 0.00041755130927105446), ('romantic', 0.00041599779055115387), ('unexpected', 0.0004119529739835853), ('detective', 0.00040998608975766895), ('live', 0.00040875448935452505), ('red', 0.00040864780951076413), ('on', 
0.00040208180673768863), ('the', 0.00040030077848549186), ('creative', 0.00039927749786101345), ('ten', 0.0003963198867657466), ('two', 0.0003951403632932295), ('desperate', 0.00039465467715748723), ('central', 0.0003832012842421418), ('in', 0.00038137925611386335), ('previously', 0.0003809166978893468), ('and', 0.00037276543329929824), ('bizarre', 0.00034742327735958313), ('angry', 0.00033777263076626134)]
We can make this better by combining multiple seed terms: each word's score is the sum of its collocation scores over all the seeds.
def seed_score(pos_seed,PMI_MATRIX=pmi_matrix,TERMS=terms):
    ## sum each word's collocation score over all the seed terms
    score=defaultdict(int)
    for seed in pos_seed:
        c=dict(getcollocations(seed,PMI_MATRIX,TERMS))
        for w in c:
            score[w]+=c[w]
    return score
sorted(seed_score(['good','great','perfect','cool']).items(),key=itemgetter(1),reverse=True)
[('cool', 0.01233842898097543), ('perfect', 0.006798784836900034), ('great', 0.004248631481147458), ('frank', 0.004199495490947327), ('eccentric', 0.004084710911853615), ('fake', 0.003911762949434756), ('looking', 0.0038393777520473217), ('lovely', 0.0038028477315611427), ('greatest', 0.003730147480106928), ('amazing', 0.0036985631973772324), ('twice', 0.003661242122664058), ('anti', 0.003660653152490723), ('generally', 0.0035927347963626934), ('known', 0.0035564386989262358), ('totally', 0.003546906203489673), ('plain', 0.0035118779583992172), ('earlier', 0.003485008708798782), ('stupid', 0.0033319627079071512), ('sad', 0.0032958074094856364), ('convincing', 0.0032845107876468384), ('overall', 0.003268378795897639), ('nicely', 0.0032643270804776636), ('good', 0.003262124902614077), ('pretty', 0.003249145627494061), ('climactic', 0.003243555286391972), ('man', 0.003190950013225829), ('necessarily', 0.0031860920461633867), ('past', 0.0031797773077773686), ('fun', 0.003153100558314221), ('friendly', 0.0031332047075606456), ('terribly', 0.0031123173387858664), ('intriguing', 0.0031087143840738247), ('necessary', 0.0030837671387598064), ('extra', 0.0030693991457936233), ('actually', 0.0030683303367765517), ('apart', 0.0030613602501711593), ('they', 0.0030539757813460434), ('black', 0.003053138089032602), ('definitely', 0.0030493148349400295), ('best', 0.003043422883124742), ('bigger', 0.00302455408969759), ('musical', 0.003003959069610916), ('quiet', 0.0029999963749380303), ('john', 0.0029968693911054983), ('steven', 0.0029867301907443955), ('pure', 0.0029859258177321285), ('painful', 0.0029827416960734117), ('classic', 0.0029772145231463792), ('perfectly', 0.0029687945292047385), ('basically', 0.0029613324950677937), ('somewhere', 0.0029537299642018867), ('he', 0.002948724291562637), ('horrible', 0.002945643760366221), ('brilliant', 0.0029418273019810766), ('technical', 0.0029277317228736336), ('really', 0.0029263971284278645), ('fully', 0.0029147021133513534), 
('forward', 0.002905429173729839), ('mainly', 0.002901635335665364), ('visually', 0.002897170954974414), ('shallow', 0.002869477485777347), ('nasty', 0.002865858196500654), ('maybe', 0.0028622287923194237), ('isn', 0.0028566090192177563), ('all', 0.00285461092390361), ('regular', 0.002851129537218903), ('probably', 0.0028510038317324277), ('scary', 0.0028496778141188024), ('slightly', 0.002839870019252718), ('present', 0.002839191031164896), ('non', 0.002836054321034321), ('green', 0.002833792308872344), ('entire', 0.002819008284677596), ('aren', 0.0028185457174324195), ('interesting', 0.0028106487026091083), ('professional', 0.002801892043622532), ('especially', 0.0027872535559887897), ('excellent', 0.002783733520408009), ('sympathetic', 0.00278352694514409), ('same', 0.0027832670373233006), ('mary', 0.0027826630422002983), ('huge', 0.002781314871475908), ('nevertheless', 0.002771202856021355), ('constantly', 0.002761195678000438), ('weird', 0.0027601413120321343), ('anyway', 0.002757779963983885), ('forever', 0.0027568329027658684), ('tony', 0.002756290460877205), ('future', 0.002749234391960185), ('very', 0.0027481646751586373), ('sure', 0.0027459142198264248), ('though', 0.0027453667922239028), ('soft', 0.002742373141880715), ('nice', 0.002741853177318124), ('blue', 0.002739395424413957), ('wonderful', 0.0027377931144858384), ('light', 0.0027373283200756394), ('sean', 0.0027337378934548374), ('second', 0.0027335395520831232), ('entirely', 0.002733523367163213), ('realistic', 0.002731294773691186), ('memorable', 0.002730908137254546), ('badly', 0.0027307312781661053), ('still', 0.0027191290459714634), ('danny', 0.002718729213354672), ('third', 0.00271541531851348), ('also', 0.0027056370383890453), ('inevitable', 0.002703987742769333), ('famous', 0.0027032161393452437), ('before', 0.0026925347274034638), ('literally', 0.0026894086542412756), ('smart', 0.002683582489884059), ('deadly', 0.0026766545015301318), ('yet', 0.0026766197782430662), ('quick', 
0.002675617905789821), ('dumb', 0.0026710891292985525), ('poor', 0.002667166995342299), ('out', 0.0026667066493894324), ('wonderfully', 0.002665905523637924), ('it', 0.0026623466980691805), ('exactly', 0.0026622488358886177), ('always', 0.002652590008331684), ('again', 0.0026501302605318553), ('suspenseful', 0.0026485166277748695), ('straight', 0.002646189752081472), ('that', 0.002642264415676568), ('oh', 0.0026408399178469), ('didn', 0.002634709735067517), ('incredibly', 0.0026306080205473936), ('else', 0.0026236977815681873), ('not', 0.0026197610436078512), ('never', 0.0026185162906170443), ('then', 0.002613898264607873), ('original', 0.0026121619253186364), ('lucky', 0.002611194469540666), ('just', 0.0026082660263522482), ('final', 0.002605105324093498), ('obviously', 0.0025999411138145803), ('mental', 0.002598422855348094), ('chinese', 0.002594664847599488), ('right', 0.0025941719891221897), ('next', 0.002590120626479452), ('sadly', 0.002589445111141814), ('longer', 0.002578784157910541), ('alien', 0.0025777906620490405), ('likable', 0.0025776081267851907), ('together', 0.002577143083714865), ('ve', 0.002576658664841161), ('wrong', 0.002574267820832323), ('similar', 0.002570303153596778), ('about', 0.0025682590836591324), ('over', 0.0025672992416833455), ('easily', 0.0025649029055357614), ('comic', 0.0025648804047939894), ('certain', 0.002563451398133774), ('funny', 0.002562830642541606), ('boring', 0.0025616238435000253), ('desperately', 0.0025568098595270487), ('usual', 0.002556158669214207), ('like', 0.0025558699946088698), ('witty', 0.002554642026201358), ('late', 0.0025543690617334434), ('already', 0.0025527149553354602), ('originally', 0.0025514339703727), ('cold', 0.0025499605916891304), ('french', 0.0025444561222301753), ('believable', 0.002543143621598955), ('later', 0.002542810478900551), ('desperate', 0.002542474373414215), ('completely', 0.0025416873280540166), ('bad', 0.002541534118352169), ('long', 0.0025397945691588344), ('first', 
0.002535085273247567), ('co', 0.00253448031405291), ('evil', 0.0025334075067660966), ('close', 0.0025325295336182225), ('terrific', 0.002531797517920264), ('wild', 0.00253043821721538), ('utterly', 0.0025275094878181724), ('beautiful', 0.0025262843425938046), ('traditional', 0.002524371276082079), ('mean', 0.0025242814452110457), ('as', 0.002523913421332458), ('computer', 0.0025223013263525763), ('strong', 0.0025219961530189706), ('incredible', 0.002517568804243094), ('ever', 0.0025150040801195025), ('nearly', 0.002514305025450897), ('therefore', 0.002513451112442391), ('little', 0.002506072311279282), ('single', 0.002505291173635525), ('movie', 0.002503536342636675), ('whole', 0.0025007145988450545), ('older', 0.002498322493840491), ('almost', 0.0024977229944715953), ('slowly', 0.0024942831852932394), ('general', 0.00249251121682983), ('there', 0.0024915178366652618), ('only', 0.002490669377909969), ('well', 0.0024880154640546824), ('major', 0.0024870428132837117), ('emotionally', 0.0024865103239952464), ('tough', 0.002486159146802659), ('merely', 0.002481832205084195), ('too', 0.0024813276898331686), ('seemingly', 0.0024776359121710554), ('occasionally', 0.0024768922155345282), ('able', 0.002476601756585323), ('re', 0.002472412146757419), ('extremely', 0.002471343494995713), ('once', 0.002471033221713364), ('away', 0.002466459287948913), ('important', 0.0024656807279107595), ('ll', 0.0024621555414009303), ('so', 0.0024614535416055787), ('thankfully', 0.002460228506662616), ('key', 0.0024601953216293253), ('instead', 0.002459485644152606), ('effectively', 0.0024586780396513193), ('robert', 0.0024573908936458004), ('most', 0.0024567708406460402), ('intense', 0.0024540340246803705), ('solid', 0.002452225134191952), ('top', 0.002451460433879865), ('last', 0.0024507916818635083), ('special', 0.0024474442315980055), ('creative', 0.0024472815239837834), ('tim', 0.0024459044680487695), ('psychological', 0.00244476670365184), ('clear', 0.00244394486836231), 
('disappointing', 0.0024420242374239755), ('sometimes', 0.0024397798656203814), ('emotional', 0.002437342965428747), ('hearted', 0.002435117648234929), ('and', 0.002434775894362365), ('minor', 0.0024326013810353738), ('responsible', 0.0024319820376394546), ('here', 0.0024311048832102297), ('other', 0.0024299883603013535), ('normal', 0.0024298569825335708), ('willing', 0.0024261246443667258), ('much', 0.0024261229493739082), ('hot', 0.0024196044249443906), ('remarkable', 0.0024195606507870326), ('barely', 0.0024175002418381566), ('effective', 0.0024169124922279123), ('animated', 0.0024109840734820127), ('outstanding', 0.0024107809176646656), ('half', 0.0024084639659253332), ('subtle', 0.0024083972202186494), ('different', 0.002407773631130719), ('fascinating', 0.002403685035026676), ('least', 0.002403331373138169), ('lame', 0.002402847501293744), ('seriously', 0.002401916898466411), ('total', 0.002401279300347136), ('violent', 0.002400535983510144), ('far', 0.0024005213011034504), ('doesn', 0.002400464865939389), ('absolutely', 0.002400431236130183), ('unfortunately', 0.002399753462836699), ('usually', 0.002399454857408294), ('aware', 0.0023960326566863843), ('previously', 0.0023958414023801155), ('moral', 0.002395235049660093), ('ago', 0.002393629653696946), ('awful', 0.002393420826940957), ('surely', 0.002391676453026381), ('biggest', 0.002390865343509711), ('greater', 0.0023903319537811043), ('truly', 0.002386967231988023), ('successful', 0.0023869373994698283), ('real', 0.002385190944733403), ('powerful', 0.0023847176086883894), ('back', 0.0023806187028304715), ('short', 0.0023774499236466013), ('eventually', 0.0023698523692254974), ('quickly', 0.0023688699328822584), ('secret', 0.0023676820230258654), ('intelligent', 0.002366708277842242), ('robin', 0.0023651976511104983), ('dull', 0.00235964477869644), ('many', 0.0023590119795897794), ('michael', 0.0023589064413142794), ('off', 0.0023580212371917464), ('natural', 0.002356857167653292), ('personal', 
0.0023566081734952646), ('more', 0.00235539519766598), ('high', 0.0023548555530815605), ('hardly', 0.0023485136786730067), ('superior', 0.002347312160007487), ('big', 0.0023463307195843493), ('alive', 0.002345887108786913), ('now', 0.0023368253650724344), ('otherwise', 0.0023316118219038357), ('enough', 0.002331558244168371), ('such', 0.0023313159079823265), ('fairly', 0.0023300790747863434), ('even', 0.0023295884605102823), ('capable', 0.002329072615940134), ('tight', 0.0023289164961650087), ('mad', 0.0023252643113845844), ('several', 0.002324876098041507), ('ahead', 0.00232478891174848), ('private', 0.002323938810211654), ('indeed', 0.0023233119218103422), ('no', 0.002321993798982203), ('somewhat', 0.0023197875411107224), ('serial', 0.002316222892701973), ('simple', 0.0023154877313890524), ('we', 0.0023152809960851227), ('immediately', 0.002314386462682709), ('old', 0.002313901722294), ('new', 0.0023136153967201574), ('occasional', 0.002313367300043446), ('honest', 0.002310746954588941), ('suddenly', 0.0023091385180307403), ('visual', 0.002304223240593406), ('main', 0.0023038137423872776), ('apparent', 0.002302475481884322), ('entertaining', 0.0023023063162522428), ('potential', 0.002300250289886611), ('soon', 0.0022988242749470353), ('silly', 0.002298284087838806), ('gary', 0.0022899832288248365), ('she', 0.0022875778111765914), ('don', 0.0022863504541293096), ('relatively', 0.002285570771709969), ('pathetic', 0.002280629178427796), ('bright', 0.002280352499605687), ('attractive', 0.002280010685796501), ('initially', 0.002279932158451899), ('up', 0.0022772831624123012), ('unexpected', 0.0022765307261015653), ('difficult', 0.0022753479475171212), ('hilarious', 0.0022747276808401406), ('surprisingly', 0.002269646495223617), ('average', 0.0022684700842012396), ('small', 0.002266828621142853), ('typical', 0.0022574522552556682), ('however', 0.00225704169246753), ('better', 0.0022567705132825926), ('what', 0.002254185440970014), ('brief', 0.002253217282469064), 
('around', 0.0022473890284868177), ('guilty', 0.002245444106373558), ('impossible', 0.002245287470222215), ('obvious', 0.0022451600892185787), ('impressive', 0.0022442582461890724), ('perhaps', 0.002244145584435339), ('serious', 0.002243385622669684), ('decent', 0.0022328653519831255), ('certainly', 0.0022321207870243474), ('innocent', 0.00223192076106074), ('international', 0.0022311108082540463), ('ultimate', 0.002227728559313895), ('fair', 0.0022262788637288523), ('human', 0.002224433969022101), ('peter', 0.002223954022722312), ('less', 0.0022137925950437287), ('deep', 0.00221186169544016), ('happy', 0.0022112088452389167), ('possible', 0.0022095882316891602), ('english', 0.002206408349555691), ('rare', 0.0022058513645247073), ('tom', 0.0022030707241931058), ('common', 0.0022012160069994173), ('genuinely', 0.002198948512828223), ('basic', 0.0021977498874032183), ('standard', 0.0021941975079702502), ('actual', 0.002193307780211677), ('national', 0.0021906365425489404), ('star', 0.002187352805003062), ('numerous', 0.002184433922212774), ('sharp', 0.002180467631924875), ('true', 0.002179804625173467), ('lee', 0.002177677031010695), ('spectacular', 0.002177609291983217), ('worse', 0.0021744261458589323), ('quite', 0.0021718602677202976), ('lead', 0.0021657650352752524), ('favorite', 0.00216567820348113), ('finally', 0.002161142727699993), ('empty', 0.002158321847172612), ('strange', 0.0021572049950062994), ('clever', 0.0021557059229457523), ('few', 0.0021545920236595304), ('hard', 0.0021528800039478093), ('red', 0.002151662251742031), ('own', 0.002147796474615853), ('of', 0.0021453161208263016), ('frankly', 0.0021452378161904186), ('fast', 0.0021423510935287874), ('complete', 0.002138681263627506), ('particular', 0.002135804166824032), ('full', 0.002134401343947513), ('grand', 0.00213226141731942), ('dead', 0.0021316555748768394), ('fantastic', 0.0021304672573844597), ('social', 0.0021272845860293007), ('you', 0.002126860652727663), ('safe', 0.002125503559035235), 
('virtually', 0.0021228974792839818), ('popular', 0.0021225651112164565), ('equally', 0.0021202583856818067), ('often', 0.0021192912537311985), ('latest', 0.002118743729347657), ('unique', 0.002118664223964338), ('political', 0.002116365997444799), ('unfortunate', 0.002115077048141771), ('early', 0.0021141573314535653), ('funniest', 0.0021140007773808992), ('ready', 0.0021139804915892495), ('accidentally', 0.0021139204676988393), ('sci', 0.0021135829268120196), ('recent', 0.0021111884479918692), ('meanwhile', 0.002110182756405934), ('sexual', 0.0021044354971384597), ('cinematic', 0.002104275709167208), ('fi', 0.0021041621438043457), ('along', 0.0021024956883385973), ('easy', 0.0021016454330641684), ('physical', 0.0021010909668950396), ('teen', 0.0020943036771299685), ('unusual', 0.002093759530768171), ('thin', 0.002092294837065608), ('rich', 0.002090055886096352), ('rather', 0.0020870186335897253), ('various', 0.0020820717486230663), ('modern', 0.0020804173582543413), ('supposedly', 0.002080402822714092), ('ultimately', 0.0020762703131997516), ('william', 0.002075711512092768), ('practically', 0.0020746764726317962), ('white', 0.0020720529883677505), ('simply', 0.0020681061456906506), ('overly', 0.00206257446442149), ('female', 0.0020538277325249763), ('previous', 0.002052852117674875), ('tiny', 0.002048222963157593), ('thoroughly', 0.002047877902676961), ('due', 0.0020475837898075817), ('can', 0.0020473404084645776), ('american', 0.0020471900845285005), ('successfully', 0.002045906484840632), ('running', 0.0020436877607170455), ('aside', 0.0020407291382383754), ('apparently', 0.0020399939153677798), ('worthy', 0.0020367988916105664), ('unbelievable', 0.002035088832233652), ('british', 0.002032234003322862), ('ridiculous', 0.0020301322616863367), ('fresh', 0.0020269257906189767), ('clearly', 0.0020242662191528346), ('young', 0.0020212947952003455), ('directly', 0.0020190399267392815), ('worst', 0.002018321777439585), ('constant', 0.0020156129690804677), ('either', 
0.002014093341491322), ('dangerous', 0.0020136403323678005), ('former', 0.0020132344660249227), ('complex', 0.0020126673738765162), ('david', 0.0020110649362481653), ('interested', 0.0020088882476282043), ('predictable', 0.002008212956503268), ('exciting', 0.002007826204078083), ('to', 0.0020072741565426836), ('unable', 0.002005718101470375), ('graphic', 0.002005376865260159), ('comedic', 0.0020048017378319193), ('highly', 0.0019973415101339166), ('fine', 0.0019967923686048336), ('cheap', 0.001994060677197784), ('mysterious', 0.0019907877971670116), ('open', 0.00199015894799177), ('mostly', 0.0019884061604660734), ('latter', 0.0019880256460866478), ('live', 0.001986795686860045), ('rarely', 0.001984649462275682), ('ugly', 0.0019837783774960035), ('weak', 0.0019707188745212005), ('possibly', 0.0019706075424450655), ('thus', 0.001968906729120521), ('available', 0.001967797578191252), ('ex', 0.001966281279950839), ('sweet', 0.001964407940363985), ('two', 0.0019635062673976737), ('particularly', 0.0019613888258522964), ('be', 0.0019605016669826157), ('younger', 0.0019603510166971935), ('the', 0.001954997718634469), ('nowhere', 0.0019516959316690215), ('middle', 0.0019498844504859775), ('slow', 0.0019495347271947126), ('heavily', 0.0019485833263928904), ('offensive', 0.0019482222698296579), ('giant', 0.0019457706206236593), ('on', 0.0019406685251326917), ('poorly', 0.0019374121631600583), ('screen', 0.001929547443973368), ('enjoyable', 0.001923709317142371), ('talented', 0.0019224756187088095), ('dramatic', 0.001921834165809007), ('subject', 0.001920647737495943), ('time', 0.0019061790328508419), ('dark', 0.0019053830824877353), ('bizarre', 0.0018987347112403435), ('life', 0.0018962494533953567), ('ill', 0.0018936014090258438), ('unnecessary', 0.0018922404730965067), ('likely', 0.0018907534543162957), ('terrible', 0.0018903140224032968), ('film', 0.001887782153527676), ('critical', 0.001879942478999445), ('further', 0.0018762931089131496), ('loud', 
0.0018747595872502252), ('billy', 0.001845104604459428), ('sudden', 0.0018410687770122414), ('dimensional', 0.0018398050325275027), ('romantic', 0.0018293731001022885), ('central', 0.0018247775451691005), ('essentially', 0.0018232053751663139), ('large', 0.001822843226009556), ('married', 0.0018222599660140423), ('detective', 0.0018207588729618088), ('surprising', 0.0018139300761870559), ('unfunny', 0.0018076086167803196), ('initial', 0.0018063052931670369), ('public', 0.0018019761571857404), ('one', 0.0018005131509604578), ('sole', 0.001795965620240488), ('double', 0.001788234721973658), ('flat', 0.0017738175378711758), ('down', 0.0017713157865333839), ('largely', 0.0017605355035379654), ('wide', 0.0017602308875251382), ('in', 0.0017575874535586198), ('somehow', 0.0017483060301847226), ('worth', 0.0017339160020791128), ('painfully', 0.0017203082812571528), ('odd', 0.0017070813783993787), ('limited', 0.0017021925664718698), ('extraordinary', 0.001701478305054484), ('frequently', 0.0017001988086572553), ('familiar', 0.001692156191506733), ('year', 0.001689555348394929), ('angry', 0.0016884671186636357), ('free', 0.001682755737495471), ('foreign', 0.0016816202029281), ('chris', 0.001681151841632327), ('naturally', 0.0016765910783645183), ('low', 0.0016725509847084342), ('crazy', 0.001661263296294774), ('genuine', 0.001660915690145844), ('recently', 0.0016572720876909868), ('appropriate', 0.0016503542247951867), ('fellow', 0.0016395777540513942), ('cute', 0.001631364584010221), ('steve', 0.0016290806706737333), ('military', 0.0016101588895060573), ('humorous', 0.001605772161822982), ('positive', 0.0016056163317255743), ('local', 0.0016000580609697656), ('laughable', 0.0015909884579457525), ('bottom', 0.0015875328878313534), ('self', 0.0015710444112117714), ('ten', 0.0015664543675963226), ('alone', 0.0015632633997377963), ('unlikely', 0.0015330664386263545), ('jean', 0.001500748385125891), ('narrative', 0.0014983497303010325), ('near', 0.0014588298150256776), 
('united', 0.0014333462534527634), ('current', 0.0013395265610777166), ('oddly', 0.001328033675835792), ('heavy', 0.0012452460985996902)]
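To see what the summation does, here is a minimal sketch on made-up numbers. `toy_collocations` and `toy_seed_score` are hypothetical names introduced only for illustration; the toy dictionary stands in for the output of `getcollocations`, and its scores are not taken from the review data.

```python
from collections import defaultdict
from operator import itemgetter

## hypothetical toy data standing in for getcollocations(seed, ...):
## each seed maps to (word, score) pairs; the numbers are made up
toy_collocations = {
    'good':  [('fun', 0.5), ('nice', 0.25), ('dull', 0.125)],
    'great': [('fun', 0.25), ('nice', 0.125)],
}

def toy_seed_score(seeds, collocations=toy_collocations):
    ## sum each word's score over all the seeds, as seed_score does
    score = defaultdict(float)
    for seed in seeds:
        for w, s in collocations[seed]:
            score[w] += s
    return score

## rank words by their combined score, highest first
toy_ranked = sorted(toy_seed_score(['good', 'great']).items(),
                    key=itemgetter(1), reverse=True)
print(toy_ranked)
```

Words that appear among the collocations of both seeds accumulate score, while a word tied to a single seed ('dull' here) stays near the bottom.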
posscores=seed_score(['good','great','perfect','cool'])
negscores=seed_score(['bad','terrible','wrong',"crap","long","boring"])
## the sentiment polarity score of a word is its combined score against the positive seeds
## minus its combined score against the negative seeds (negative values lean negative)
sentscores={}
for w in terms:
    sentscores[w] = posscores[w] - negscores[w]
sorted(sentscores.items(),key=itemgetter(1),reverse=False)
[('terrible', -0.011337935788715956), ('boring', -0.009940296206073694), ('wrong', -0.0038957762492763384), ('bad', -0.0028038772074515574), ('laughable', -0.0027406189540628715), ('unfunny', -0.0027135011838022864), ('worst', -0.0026471332624708405), ('frankly', -0.002587338946017292), ('terribly', -0.0025576562935382737), ('horrible', -0.0024121874515479237), ('awful', -0.0021579647379428284), ('ugly', -0.0020067310545418757), ('oddly', -0.0019930152819700904), ('exciting', -0.001958238856113128), ('running', -0.0019564731340303175), ('total', -0.0018477526013480836), ('painfully', -0.0018460471673599388), ('successfully', -0.0018190646800686494), ('ridiculous', -0.0018031715407181696), ('sadly', -0.0017829759978268338), ('bottom', -0.001769537540949412), ('we', -0.0017453041163577711), ('current', -0.001733764414992582), ('dull', -0.0016994388711972343), ('positive', -0.0016958052468327694), ('fair', -0.00168358034258908), ('ten', -0.0016615516042460942), ('poorly', -0.0016255490883917791), ('longer', -0.0016219407642753233), ('supposedly', -0.0016197365007933275), ('long', -0.0016152178963842242), ('foreign', -0.0015749275429113633), ('responsible', -0.001571347852518935), ('complete', -0.001569859685280305), ('pathetic', -0.001562411460165396), ('sole', -0.0014762838822532996), ('stupid', -0.001403251021265037), ('particular', -0.001402573126479277), ('low', -0.0013894492381992967), ('worse', -0.0013611681182488936), ('giant', -0.0013523241357618141), ('chinese', -0.001337497643663317), ('unbelievable', -0.0013303905386979195), ('unnecessary', -0.00131660406790726), ('doesn', -0.0013067653462435543), ('down', -0.0013009964808874883), ('weak', -0.001290743094020337), ('seriously', -0.001280361005356888), ('guilty', -0.0012759457766346595), ('huge', -0.0012214986373703203), ('worth', -0.0012163717863471562), ('silly', -0.0012130592923207716), ('double', -0.001210481211132458), ('that', -0.0012070521514938042), ('disappointing', -0.0011807670388760487), 
('nowhere', -0.0011606225325787114), ('possible', -0.0011491794774951733), ('frequently', -0.0011469801859383087), ('desperately', -0.0011424453715553045), ('shallow', -0.0011384244435511679), ('predictable', -0.0011120428233618845), ('to', -0.0011058198988504373), ('completely', -0.0011032444587055746), ('offensive', -0.0010970960168989908), ('poor', -0.0010965921061633229), ('public', -0.0010840280896041703), ('gary', -0.001067755005125484), ('absolutely', -0.001026522206518719), ('graphic', -0.001024011269851582), ('overly', -0.0010201841232575807), ('thankfully', -0.0010149951549212268), ('angry', -0.0010110188791198813), ('no', -0.0010020614536842705), ('you', -0.0009949959520265542), ('lame', -0.000994378561571215), ('attractive', -0.0009846101524466954), ('re', -0.0009818428273642007), ('utterly', -0.0009700899144521528), ('middle', -0.0009700496756428923), ('they', -0.0009699456032204955), ('modern', -0.0009602179163986603), ('heavy', -0.000959430853419093), ('loud', -0.0009593880159746828), ('international', -0.000947889571331048), ('due', -0.0009449777026130446), ('flat', -0.000943498486118813), ('slow', -0.000936043860322225), ('superior', -0.0009253200744961609), ('surprising', -0.000919772111017412), ('ex', -0.0009109836158998105), ('equally', -0.0008988178254337453), ('subject', -0.0008972107221829695), ('military', -0.0008941009608490051), ('apart', -0.0008921972866551358), ('talented', -0.0008917010667096683), ('standard', -0.0008872939756282405), ('hardly', -0.0008776965462900465), ('practically', -0.0008750915480863412), ('obvious', -0.0008701220382583389), ('female', -0.0008680989032270711), ('cheap', -0.0008564492852698932), ('physical', -0.0008498830824418741), ('possibly', -0.0008469268319057162), ('aren', -0.0008397280021778201), ('ve', -0.0008365152484233855), ('steve', -0.0008355973078291424), ('accidentally', -0.0008353770315996422), ('basic', -0.0008346153861572892), ('alien', -0.0008185638092304041), ('essentially', 
-0.0008082539162401208), ('dumb', -0.0008028376104339493), ('aside', -0.0008012383951446411), ('unique', -0.0008003752649116298), ('sweet', -0.0007971894394690261), ('anyway', -0.0007960303504455316), ('largely', -0.0007877402329115519), ('peter', -0.0007858145554954835), ('up', -0.0007819843592687314), ('rich', -0.0007799646645649497), ('didn', -0.0007710518174823778), ('fi', -0.0007680731803592316), ('recently', -0.0007668801474510802), ('hard', -0.0007643704792226866), ('sci', -0.0007583426737492777), ('even', -0.0007573246173436503), ('unfortunately', -0.0007564513391728083), ('plain', -0.0007550527694653989), ('either', -0.0007528454883448344), ('of', -0.0007525540145690811), ('can', -0.0007441589482778972), ('chris', -0.0007405228629783673), ('entertaining', -0.0007355219005145538), ('dark', -0.0007340251639409206), ('potential', -0.000731803390525919), ('near', -0.0007309453448941252), ('totally', -0.0007257501605525398), ('sudden', -0.0007167056368809609), ('there', -0.0007132810375892829), ('better', -0.0007121005858083877), ('bizarre', -0.0007102499887883572), ('fascinating', -0.0007101600372832039), ('extremely', -0.000702831348441181), ('crazy', -0.0007001875489527892), ('odd', -0.0007000656180773442), ('half', -0.000697018395206481), ('apparently', -0.0006909176358485037), ('free', -0.0006857205953191604), ('appropriate', -0.0006853377884564126), ('complex', -0.000684395886488557), ('funny', -0.0006838122964675361), ('rather', -0.0006768704066807464), ('indeed', -0.0006748263747000886), ('safe', -0.0006704534362490105), ('easy', -0.0006698501733494017), ('least', -0.0006696310548561795), ('narrative', -0.0006693801173021394), ('brief', -0.0006678116323338111), ('now', -0.0006673157481890471), ('somehow', -0.0006641988757246794), ('twice', -0.0006641287075245849), ('too', -0.0006627836222457967), ('alone', -0.0006580188413029815), ('painful', -0.000657722958170764), ('otherwise', -0.0006568516671700813), ('wide', -0.0006545396855330964), ('dead', 
-0.000653123385060093), ('honest', -0.0006515573730144176), ('big', -0.0006496164697680088), ('lead', -0.0006494149269442007), ('central', -0.0006470915436464454), ('though', -0.0006442513977681823), ('so', -0.0006428789535567973), ('one', -0.0006418320196895418), ('interested', -0.0006401133976040372), ('time', -0.0006364896936115636), ('rarely', -0.0006363730750465154), ('merely', -0.0006308553338537993), ('tiny', -0.0006298494923142961), ('else', -0.0006277954394166021), ('critical', -0.0006250500330304016), ('be', -0.0006191685537481752), ('local', -0.0006158657943370705), ('major', -0.0006120423051833371), ('directly', -0.0006075876863173642), ('such', -0.0006049323327150537), ('various', -0.0006040198222932126), ('likely', -0.0005997093672459831), ('future', -0.000599449654797211), ('robin', -0.0005869254763942828), ('genuine', -0.0005823523828022148), ('only', -0.0005811959336262402), ('ultimate', -0.0005785256606300397), ('oh', -0.0005775859845771575), ('cute', -0.0005756849515812109), ('finally', -0.0005751148206935329), ('special', -0.0005682017753211362), ('former', -0.0005632603785982321), ('few', -0.0005574191632068473), ('aware', -0.0005555379716066988), ('much', -0.0005551280489618786), ('mostly', -0.0005543440959590073), ('latter', -0.000552853559619679), ('early', -0.0005520667304396258), ('available', -0.0005509522206528497), ('climactic', -0.0005506226915431576), ('already', -0.0005462021598010699), ('truly', -0.0005428115437271551), ('seemingly', -0.0005419132742930663), ('mary', -0.0005418292000601045), ('whole', -0.0005385437695464993), ('just', -0.0005370653451239643), ('along', -0.0005340060418226622), ('ll', -0.0005337104369835871), ('fellow', -0.0005223175816882971), ('familiar', -0.0005145307593608793), ('somewhere', -0.0005102507291033648), ('perhaps', -0.0005082367122989867), ('simple', -0.0005067682313010637), ('tough', -0.0005045886225163313), ('common', -0.000503857413382454), ('dramatic', -0.0005035860765712917), ('it', 
-0.000503461169935823), ('entirely', -0.0004971688457674106), ('main', -0.0004953088674902852), ('forever', -0.0004948695741177358), ('bright', -0.0004927768773291159), ('simply', -0.0004890112391900646), ('impressive', -0.0004887583209977186), ('large', -0.00048589159722396864), ('enough', -0.0004840620655300397), ('typical', -0.00048259380175424233), ('here', -0.00047975496205869637), ('bigger', -0.00047644760254505905), ('romantic', -0.0004672569278175954), ('therefore', -0.00046688494803137984), ('short', -0.00046492991647513904), ('average', -0.0004646315755049298), ('mainly', -0.00046421322270260136), ('deep', -0.00046330482043274637), ('ultimately', -0.00045854112467534234), ('decent', -0.00045700466155881945), ('unlikely', -0.00045562672156462466), ('cinematic', -0.00045532681631786295), ('straight', -0.00045515277058060053), ('incredibly', -0.00045403526773077135), ('far', -0.0004522929759844356), ('really', -0.0004501960190467589), ('single', -0.00044602936593327903), ('strange', -0.00044431205594842394), ('instead', -0.0004435105420098496), ('ever', -0.0004386290841801757), ('full', -0.0004378995395570035), ('self', -0.00043537616380091284), ('thus', -0.000435262083165897), ('ahead', -0.0004322469438230391), ('important', -0.00043116312351972746), ('back', -0.0004307387709850414), ('spectacular', -0.0004297835719436231), ('relatively', -0.0004260154655909695), ('the', -0.0004236841293395956), ('occasional', -0.0004234868072372209), ('maybe', -0.0004234443406997417), ('thin', -0.0004225904058378794), ('not', -0.0004213831255542376), ('dimensional', -0.0004203062524537585), ('happy', -0.00041551703086272753), ('away', -0.00041521498971391945), ('never', -0.00041427663947311974), ('certainly', -0.00041414089937138153), ('successful', -0.0004137343964682209), ('alive', -0.0004120879048515641), ('usual', -0.0004085566481352165), ('young', -0.00040793611859985153), ('easily', -0.0004066581498735587), ('don', -0.00040568204613010764), ('empty', 
-0.0004019144488789264), ('then', -0.00040178677514092313), ('apparent', -0.00039784352360013606), ('screen', -0.00039005452168201543), ('previous', -0.00038622574033567647), ('often', -0.0003804195026694524), ('different', -0.00037876868648600674), ('difficult', -0.0003764750831032368), ('old', -0.0003725818400590591), ('impossible', -0.0003710713261974004), ('on', -0.00036269638020895115), ('tom', -0.00036243731322769916), ('real', -0.0003623537714778427), ('however', -0.00036185401254073147), ('married', -0.0003600511230321961), ('other', -0.00035950949260183697), ('unable', -0.00035852239590076053), ('funniest', -0.00035829495399516183), ('ill', -0.0003565244942728442), ('many', -0.00035609966284931224), ('previously', -0.0003550544177524568), ('as', -0.00035229762209222117), ('quite', -0.00034903953306314765), ('desperate', -0.0003465632012755802), ('film', -0.0003457790164877331), ('violent', -0.000339928133520359), ('english', -0.00033205306151005455), ('small', -0.00032812818234550893), ('worthy', -0.00032729125644853033), ('fast', -0.0003202755225513534), ('likable', -0.00031919523251016345), ('barely', -0.0003171112852659013), ('naturally', -0.00031484551933644864), ('constant', -0.0003128648272295361), ('top', -0.0003118794791431177), ('jean', -0.00031142759612205057), ('eventually', -0.00030643819158205103), ('human', -0.0003004701642169835), ('evil', -0.00029877898921396723), ('believable', -0.0002976062581400152), ('white', -0.00029256051223548133), ('serious', -0.0002915094823329181), ('wild', -0.0002904852948483836), ('he', -0.0002895451984794026), ('year', -0.0002894236019519831), ('quickly', -0.00028486728778277575), ('david', -0.0002841836580306811), ('two', -0.0002803594837186671), ('hilarious', -0.0002792293408196405), ('billy', -0.0002784563033864521), ('fairly', -0.00027817137110490225), ('clear', -0.0002757689850125377), ('more', -0.00027483851236456084), ('little', -0.0002738519948942077), ('united', -0.0002712073575661905), ('social', 
-0.0002692116370561332), ('badly', -0.0002656577126927363), ('first', -0.0002656045421871446), ('soon', -0.00026406576498318787), ('limited', -0.00026077833129929196), ('pretty', -0.0002556044683908313), ('recent', -0.00025431639723114043), ('in', -0.0002529341542810658), ('comedic', -0.00024923527505396137), ('nearly', -0.00024558280477428654), ('entire', -0.00024552533171258266), ('personal', -0.0002429041811083158), ('genuinely', -0.00023638211781148218), ('ready', -0.00023297057991365734), ('interesting', -0.0002319057727939739), ('new', -0.0002288779420982481), ('basically', -0.00022794420212937563), ('certain', -0.00022774660798796816), ('exactly', -0.00022187133498753273), ('around', -0.0002212138544179596), ('national', -0.00021892498564345803), ('next', -0.0002158841794732266), ('less', -0.0002053370589299667), ('popular', -0.00020410840527040636), ('out', -0.00020327091044571987), ('immediately', -0.00020113758416102478), ('heavily', -0.0001978586638053557), ('virtually', -0.00019741273624147545), ('like', -0.00019201990770032797), ('effectively', -0.00018865752036787429), ('initial', -0.00018847850036724087), ('right', -0.00018667331121528363), ('again', -0.00018442600209907407), ('able', -0.00018271056934554336), ('obviously', -0.0001757623199258253), ('meanwhile', -0.0001700542754715078), ('favorite', -0.00016790618953759634), ('ago', -0.00016776635504992367), ('well', -0.00016722410208565514), ('further', -0.00016372379001952783), ('most', -0.00015933307977614858), ('close', -0.00015749033148161314), ('fantastic', -0.00015367714248264988), ('fine', -0.00014661878484316512), ('famous', -0.00014606101687336627), ('about', -0.00014439417455700205), ('own', -0.00014057528409486264), ('star', -0.00014039861913854286), ('suddenly', -0.00013654192505882642), ('tight', -0.00013514609804481277), ('younger', -0.0001342270569405857), ('french', -0.00013271869553648733), ('several', -0.00013224135315127988), ('over', -0.00013178399190350585), ('subtle', 
-0.0001304908962331229), ('extraordinary', -0.00013034796186768647), ('third', -0.00012689187862063335), ('teen', -0.00012547607896861357), ('open', -0.0001251771370265795), ('good', -0.00012464300361279563), ('mad', -0.00012423836030612785), ('what', -0.00011744539314099403), ('sure', -0.00011140597894659064), ('actual', -0.00010631033966002502), ('mental', -0.00010554178290598766), ('intense', -0.00010320983741859144), ('isn', -0.0001019431777130762), ('quick', -9.144819523439858e-05), ('cold', -8.453928225732842e-05), ('thoroughly', -8.446311268023136e-05), ('high', -8.108657336266872e-05), ('private', -8.082521935904967e-05), ('mysterious', -7.89915855672666e-05), ('psychological', -7.079354944579917e-05), ('last', -6.040588450787962e-05), ('emotional', -5.9231389028841126e-05), ('dangerous', -5.840343082568443e-05), ('numerous', -5.6507297730663385e-05), ('red', -5.518097392980415e-05), ('probably', -5.510557060454035e-05), ('enjoyable', -5.304302448897163e-05), ('almost', -5.268059132795041e-05), ('necessarily', -5.181284608931757e-05), ('live', -5.161664924643525e-05), ('incredible', -5.0838444616687316e-05), ('yet', -4.993545030473239e-05), ('actually', -4.970290822225044e-05), ('technical', -4.964327727627381e-05), ('once', -4.108719790097684e-05), ('very', -3.4591909186824834e-05), ('clearly', -3.0778856868112024e-05), ('occasionally', -2.9330789441932553e-05), ('life', -2.7919699401356872e-05), ('key', -2.5608117934161883e-05), ('rare', -2.1669136957346707e-05), ('later', -1.826701189417898e-05), ('off', -1.7711394330840666e-05), ('powerful', -1.5583342484341237e-05), ('together', -9.288051209772625e-06), ('true', -4.87041375588906e-06), ('similar', -3.89795186164733e-06), ('michael', -2.5784077662730463e-06), ('robert', -2.421454439686249e-07), ('regular', 9.074791722952363e-07), ('nice', 5.07184371876428e-06), ('particularly', 1.608419229506765e-05), ('terrific', 1.7873781404614906e-05), ('smart', 1.830028423949106e-05), ('visual', 
1.9897264289668766e-05), ('american', 2.1072590232642293e-05), ('general', 2.142930413662073e-05), ('realistic', 3.266246171006355e-05), ('intelligent', 3.9886651389091365e-05), ('mean', 5.054165961934035e-05), ('humorous', 5.432574384601043e-05), ('usually', 5.541394191310462e-05), ('innocent', 6.0470827055490034e-05), ('capable', 6.280583117186135e-05), ('originally', 6.378534225396108e-05), ('sexual', 6.454153211762643e-05), ('grand', 6.640245105755038e-05), ('she', 7.44441831502362e-05), ('also', 7.641971683174965e-05), ('same', 7.726204521844339e-05), ('surprisingly', 7.814986440048783e-05), ('strong', 7.969296011480593e-05), ('still', 8.722930070878283e-05), ('willing', 8.872637216204328e-05), ('necessary', 8.891509448837737e-05), ('inevitable', 8.992341325145042e-05), ('deadly', 9.701581319515075e-05), ('literally', 9.852399016292528e-05), ('fresh', 0.0001081531161383144), ('witty', 0.00011555628335896363), ('movie', 0.00011610771682999971), ('before', 0.0001294858280637642), ('biggest', 0.00013372314741634094), ('secret', 0.00014397664031058025), ('late', 0.00015327727974057386), ('tim', 0.00015454685279179918), ('original', 0.000166646316708497), ('greater', 0.00016828610277696545), ('slightly', 0.00016914106338807022), ('soft', 0.00016963763368304206), ('lee', 0.00018716734589623767), ('beautiful', 0.00018782224248475215), ('sympathetic', 0.00018964480188837264), ('nevertheless', 0.00019220598277431417), ('slowly', 0.00019659574810863443), ('non', 0.0001988968383649769), ('sometimes', 0.00019992977542792213), ('british', 0.00020169386175819472), ('clever', 0.00020410422641186548), ('blue', 0.00021376094430168493), ('and', 0.00021441861464148188), ('nasty', 0.00021891481020010596), ('minor', 0.00021980135793128047), ('william', 0.0002221413294301671), ('normal', 0.0002222509905768803), ('especially', 0.00024404651952017204), ('moral', 0.00024919162284080124), ('always', 0.0002532388991124115), ('political', 0.0002623995973431354), ('weird', 
0.0002690142191958027), ('latest', 0.0002748801528105005), ('highly', 0.0002803689422022479), ('wonderful', 0.0002952821816267323), ('serial', 0.0002977511634248602), ('initially', 0.0003026285852441848), ('memorable', 0.0003047299564975484), ('unfortunate', 0.00030614891738813574), ('final', 0.0003100101631852959), ('fun', 0.0003142947725700519), ('older', 0.0003187683928988676), ('brilliant', 0.0003215879003835055), ('co', 0.00032396165017799846), ('natural', 0.0003269851375293292), ('comic', 0.0003289997380387633), ('hot', 0.00033107428643060514), ('intriguing', 0.0003346515095666398), ('danny', 0.0003500028138064999), ('black', 0.00035883217167973543), ('unusual', 0.0003729001202466308), ('overall', 0.00038221150638625533), ('forward', 0.0003894123135382067), ('effective', 0.00039077394646117947), ('surely', 0.0003991479944951801), ('visually', 0.00039954294262994645), ('second', 0.0004001165402682627), ('extra', 0.00040435182000170397), ('fake', 0.0004087926346521024), ('lucky', 0.000414792487349712), ('detective', 0.0004258121327198735), ('scary', 0.00044858555977865663), ('all', 0.00046052119261720406), ('light', 0.0004666619849259899), ('unexpected', 0.0004755937806912272), ('present', 0.0004883839287330521), ('remarkable', 0.0004909132424787039), ('pure', 0.0004930297335705856), ('animated', 0.0005004334032993982), ('constantly', 0.0005005278442861616), ('solid', 0.0005104773753039919), ('sean', 0.0005354051081864117), ('fully', 0.000538860116644163), ('green', 0.0005626488560537576), ('sad', 0.0005690037657206867), ('classic', 0.0006113368528880467), ('best', 0.0006232564824836927), ('computer', 0.0006261920366367954), ('somewhat', 0.0006418769897289041), ('creative', 0.0006670668130810274), ('excellent', 0.0006877090591556421), ('steven', 0.000700126493842716), ('wonderfully', 0.0007011900004185701), ('tony', 0.000713347461311362), ('sharp', 0.000714782515078788), ('anti', 0.0007395501537729543), ('definitely', 0.0007444490420821688), ('musical', 
0.0007462502942548184), ('friendly', 0.000761119245625841), ('professional', 0.0007843680588404591), ('perfectly', 0.0008016637337694504), ('emotionally', 0.0008558689372788594), ('john', 0.0008935668249893006), ('past', 0.0009797392305912716), ('nicely', 0.000980917414583356), ('outstanding', 0.0009961964716528754), ('hearted', 0.0010123986481938667), ('traditional', 0.0010483779957726606), ('generally', 0.0010503112142917228), ('amazing', 0.0010694376494112855), ('suspenseful', 0.001114814515979623), ('man', 0.0011264816790070354), ('earlier', 0.0011453103516269971), ('known', 0.0011615678408455577), ('lovely', 0.0013584838796366254), ('quiet', 0.0013999389580393261), ('convincing', 0.001406054680451586), ('looking', 0.001440831372975097), ('great', 0.001442680743868476), ('eccentric', 0.0015794026794116564), ('frank', 0.0016222794801935836), ('greatest', 0.0016856586307944627), ('perfect', 0.004148090426618199), ('cool', 0.009074307660516472)]
Now let's apply this methodology to a real (and important!) scenario where we don't have any sentiment labels: the Kardashians.
## Loading the Kardashian data
with open("kardashian-transcripts.json", "rb") as f:
    transcripts = json.load(f)
msgs = [m['text'].lower() for transcript in transcripts
        for m in transcript]
## POS-tag every message (slow, but the comprehensions below need msgs_pos_tagged)
msgs_pos_tagged = [pos_tag(tokenizer.tokenize(m)) for m in msgs]
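## keep only adjectives and adverbs
## NOTE: "RBJ" is not a Penn Treebank tag; "RBR" (comparative adverb) was probably intended
msgs_adj_adv_only_tokenized=[[w for w,tag in m if tag in ["JJ","RB","RBS","RBJ","JJR","JJS"]]
                             for m in msgs_pos_tagged]
msgs_adj_adv_only=[" ".join([w for w,tag in m if tag in ["JJ","RB","RBS","RBJ","JJR","JJS"]])
                   for m in msgs_pos_tagged]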
msgs[23]
'and then if you could take out the trash, and then if you go to dash, maybe tomorrow or whatever, later today and just...'
msgs_adj_adv_only[23]
'then then maybe later just'
vec = CountVectorizer(min_df = 10)
X = vec.fit_transform(msgs_adj_adv_only)
terms_kard = vec.get_feature_names()
len(terms_kard)
358
pmi_matrix_kard=getcollocations_matrix(X)
getcollocations("good",pmi_matrix_kard,terms_kard)
[('good', 0.0013962375073486185), ('positive', 0.0005952380952380953), ('horrible', 0.0003968253968253968), ('awful', 0.00030525030525030525), ('nude', 0.0003006253006253006), ('proud', 0.00024366471734892786), ('extremely', 0.0002204585537918871), ('willing', 0.00018896447467876037), ('pretty', 0.0001670843776106934), ('bruce', 0.00016534391534391533), ('strong', 0.00015873015873015873), ('such', 0.00013598378084359391), ('anywhere', 0.00013227513227513228), ('dramatic', 0.00012025012025012025), ('everything', 0.00012025012025012025), ('honest', 0.00012025012025012025), ('online', 0.00012025012025012025), ('though', 0.00011671335200746966), ('adrienne', 0.00011337868480725624), ('half', 0.00011022927689594356), ('kimberly', 0.00011022927689594356), ('wish', 0.00010175010175010176), ('very', 9.831259831259832e-05), ('that', 9.101499927188001e-05), ('really', 8.846426043878273e-05), ('he', 8.818342151675484e-05), ('smart', 8.267195767195767e-05), ('all', 7.78089013383131e-05), ('fun', 7.78089013383131e-05), ('rob', 7.78089013383131e-05), ('instead', 7.348618459729571e-05), ('super', 7.348618459729571e-05), ('too', 7.297938332421091e-05), ('black', 7.215007215007215e-05), ('like', 6.961849067112225e-05), ('big', 6.764069264069264e-05), ('actually', 6.705191036988273e-05), ('um', 6.421122925977295e-05), ('sure', 6.081615277017576e-05), ('before', 6.012506012506013e-05), ('hard', 5.922767116796967e-05), ('it', 5.860290670417253e-05), ('about', 5.7510927076144466e-05), ('busy', 5.7510927076144466e-05), ('sometimes', 5.7510927076144466e-05), ('clean', 5.628729032984352e-05), ('real', 5.166997354497354e-05), ('always', 5.069778588942352e-05), ('close', 4.99151442547669e-05), ('they', 4.99151442547669e-05), ('ve', 4.9680800854509775e-05), ('great', 4.95412480431207e-05), ('also', 4.8393341076267905e-05), ('not', 4.468754468754469e-05), ('back', 4.4276194903809964e-05), ('uh', 4.409171075837742e-05), ('we', 4.166145898429363e-05), ('she', 4.101554489151388e-05), ('maybe', 
3.968253968253968e-05), ('single', 3.9485114111979786e-05), ('own', 3.834061805076298e-05), ('definitely', 3.7578162578162574e-05), ('still', 3.6743092298647854e-05), ('you', 3.651767455448437e-05), ('healthy', 3.575003575003575e-05), ('armenian', 3.480924533556112e-05), ('so', 3.4052186737495216e-05), ('pregnant', 3.348737525952716e-05), ('best', 3.094155140938767e-05), ('hot', 3.0761658668635414e-05), ('ready', 3.0292015024839453e-05), ('nervous', 3.0062530062530064e-05), ('different', 2.972474882587242e-05), ('least', 2.9394473838918284e-05), ('whole', 2.8446265005404792e-05), ('as', 2.7994736989445982e-05), ('again', 2.775002775002775e-05), ('gorgeous', 2.755731922398589e-05), ('re', 2.6577692729161045e-05), ('absolutely', 2.5936300446104367e-05), ('far', 2.5936300446104367e-05), ('well', 2.5221953188054885e-05), ('let', 2.519526329050139e-05), ('only', 2.4495394865765236e-05), ('just', 2.3795526441029087e-05), ('don', 2.2937884209560507e-05), ('cool', 2.2806057288815908e-05), ('probably', 2.2806057288815908e-05), ('then', 2.2675736961451248e-05), ('now', 2.2102528529263747e-05), ('few', 2.204585537918871e-05), ('right', 2.140374308659098e-05), ('old', 2.133469875405359e-05), ('happy', 2.077619878666999e-05), ('ll', 2.058756922570152e-05), ('here', 2.047975636318077e-05), ('long', 2.035002035002035e-05), ('perfect', 2.035002035002035e-05), ('next', 2.0194676683226302e-05), ('never', 2.010260368922983e-05), ('together', 1.9742557055989893e-05), ('bad', 1.917030902538149e-05), ('better', 1.917030902538149e-05), ('comfortable', 1.8630300320441167e-05), ('beautiful', 1.8119881133579763e-05), ('anymore', 1.740462266778056e-05), ('obviously', 1.6958350291683625e-05), ('last', 1.643169345032699e-05), ('gonna', 1.6380821334381707e-05), ('honestly', 1.5936762924714734e-05), ('first', 1.5873015873015872e-05), ('enough', 1.486237441293621e-05), ('already', 1.3778659611992945e-05), ('ever', 1.3096547750013096e-05), ('new', 1.2270420433685741e-05), ('crazy', 
1.1403028644407954e-05), ('up', 9.44822373393802e-06), ('wrong', 9.315150160220583e-06), ('else', 8.877525656049147e-06), ('there', 8.332291796858726e-06), ('little', 6.7401341286691605e-06), ('more', 5.249013185521122e-06), ('even', 3.3572368597749307e-06), ('much', 2.9071457642886215e-06), ('able', 0.0), ('acceptable', 0.0), ('accurate', 0.0), ('active', 0.0), ('afraid', 0.0), ('ago', 0.0), ('ahead', 0.0), ('alcoholic', 0.0), ('almost', 0.0), ('alone', 0.0), ('along', 0.0), ('amazing', 0.0), ('american', 0.0), ('anal', 0.0), ('angry', 0.0), ('annoying', 0.0), ('anxious', 0.0), ('anyway', 0.0), ('apart', 0.0), ('apparently', 0.0), ('around', 0.0), ('atm', 0.0), ('away', 0.0), ('awesome', 0.0), ('awkward', 0.0), ('barely', 0.0), ('basic', 0.0), ('basically', 0.0), ('belly', 0.0), ('bible', 0.0), ('bigger', 0.0), ('biggest', 0.0), ('boring', 0.0), ('bright', 0.0), ('bunim', 0.0), ('can', 0.0), ('certain', 0.0), ('certainly', 0.0), ('clear', 0.0), ('clearly', 0.0), ('cold', 0.0), ('common', 0.0), ('complete', 0.0), ('completely', 0.0), ('constantly', 0.0), ('couple', 0.0), ('cute', 0.0), ('dead', 0.0), ('deep', 0.0), ('delicious', 0.0), ('diaper', 0.0), ('didn', 0.0), ('difficult', 0.0), ('disappointed', 0.0), ('doesn', 0.0), ('double', 0.0), ('down', 0.0), ('dry', 0.0), ('dumb', 0.0), ('early', 0.0), ('easier', 0.0), ('easy', 0.0), ('embarrassing', 0.0), ('emotional', 0.0), ('entire', 0.0), ('eric', 0.0), ('especially', 0.0), ('everyone', 0.0), ('everywhere', 0.0), ('exactly', 0.0), ('excited', 0.0), ('exciting', 0.0), ('extra', 0.0), ('fabulous', 0.0), ('fair', 0.0), ('family', 0.0), ('fast', 0.0), ('fat', 0.0), ('favorite', 0.0), ('female', 0.0), ('finally', 0.0), ('fine', 0.0), ('forever', 0.0), ('forward', 0.0), ('free', 0.0), ('fresh', 0.0), ('full', 0.0), ('funny', 0.0), ('fur', 0.0), ('girlfriend', 0.0), ('glad', 0.0), ('god', 0.0), ('gray', 0.0), ('green', 0.0), ('gross', 0.0), ('grown', 0.0), ('guilty', 0.0), ('guys', 0.0), ('high', 0.0), ('hopefully', 
0.0), ('huge', 0.0), ('huh', 0.0), ('hundred', 0.0), ('hungry', 0.0), ('immediately', 0.0), ('important', 0.0), ('incredible', 0.0), ('inside', 0.0), ('interested', 0.0), ('isn', 0.0), ('jealous', 0.0), ('kardashian', 0.0), ('kelly', 0.0), ('khloe', 0.0), ('kim', 0.0), ('kourtney', 0.0), ('kris', 0.0), ('laker', 0.0), ('lamar', 0.0), ('late', 0.0), ('lately', 0.0), ('later', 0.0), ('less', 0.0), ('lily', 0.0), ('literally', 0.0), ('live', 0.0), ('love', 0.0), ('low', 0.0), ('luxurious', 0.0), ('mad', 0.0), ('major', 0.0), ('male', 0.0), ('many', 0.0), ('married', 0.0), ('mean', 0.0), ('miserable', 0.0), ('miss', 0.0), ('moral', 0.0), ('most', 0.0), ('murray', 0.0), ('naked', 0.0), ('natural', 0.0), ('necessary', 0.0), ('nice', 0.0), ('normal', 0.0), ('normally', 0.0), ('off', 0.0), ('often', 0.0), ('oh', 0.0), ('okay', 0.0), ('older', 0.0), ('once', 0.0), ('open', 0.0), ('other', 0.0), ('out', 0.0), ('outside', 0.0), ('past', 0.0), ('people', 0.0), ('personal', 0.0), ('poor', 0.0), ('possible', 0.0), ('possibly', 0.0), ('private', 0.0), ('professional', 0.0), ('public', 0.0), ('quiet', 0.0), ('rather', 0.0), ('red', 0.0), ('regular', 0.0), ('rich', 0.0), ('rid', 0.0), ('ridiculous', 0.0), ('rude', 0.0), ('sad', 0.0), ('safe', 0.0), ('same', 0.0), ('san', 0.0), ('scary', 0.0), ('scott', 0.0), ('second', 0.0), ('secret', 0.0), ('selfish', 0.0), ('sensitive', 0.0), ('serious', 0.0), ('seriously', 0.0), ('sexual', 0.0), ('sexy', 0.0), ('short', 0.0), ('sick', 0.0), ('sister', 0.0), ('small', 0.0), ('somewhere', 0.0), ('soon', 0.0), ('sorry', 0.0), ('special', 0.0), ('straight', 0.0), ('stupid', 0.0), ('sudden', 0.0), ('supportive', 0.0), ('sweet', 0.0), ('tall', 0.0), ('ten', 0.0), ('thebouncedryer', 0.0), ('top', 0.0), ('total', 0.0), ('totally', 0.0), ('touch', 0.0), ('tough', 0.0), ('true', 0.0), ('truly', 0.0), ('truthful', 0.0), ('tryclearblue', 0.0), ('twice', 0.0), ('ugly', 0.0), ('uncomfortable', 0.0), ('upset', 0.0), ('usually', 0.0), ('wear', 0.0), ('weird', 
0.0), ('welcome', 0.0), ('what', 0.0), ('white', 0.0), ('who', 0.0), ('won', 0.0), ('wonderful', 0.0), ('worried', 0.0), ('worse', 0.0), ('worst', 0.0), ('yeah', 0.0), ('year', 0.0), ('yes', 0.0), ('yet', 0.0), ('young', 0.0), ('younger', 0.0)]
posscores=seed_score(['good',"great"],pmi_matrix_kard,terms_kard)
negscores=seed_score(['bad'],pmi_matrix_kard,terms_kard)
## a word's sentiment polarity score is the difference between its association with the positive seeds
## and its association with the negative seed
sentscores={}
for w in terms_kard:
    sentscores[w]=posscores[w]-negscores[w]
neglexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[:10]
poslexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[-10:]
sorted(sentscores.items(),key=itemgetter(1),reverse=False)
[('bad', -0.004933680845053443), ('horrible', -0.0005693581780538303), ('san', -0.00040257648953301127), ('worried', -0.0003716090672612412), ('worst', -0.00023004370830457787), ('high', -0.00022469385462307607), ('rich', -0.00021003990758244065), ('ready', -0.00019097139906964003), ('normal', -0.00016950589032968896), ('busy', -0.0001525289805062962), ('can', -0.00014639145073927682), ('entire', -0.0001271294177472667), ('sorry', -0.00012326077909859053), ('able', -0.00011670194865114262), ('enough', -9.369757782068481e-05), ('kourtney', -8.350765556432388e-05), ('seriously', -7.791803023219573e-05), ('again', -7.359789968485622e-05), ('probably', -6.0485630200772624e-05), ('around', -5.891363261458702e-05), ('long', -5.397179310222788e-05), ('still', -5.271834981979909e-05), ('other', -5.242651351769991e-05), ('now', -4.557302009966452e-05), ('obviously', -4.4976494251856576e-05), ('too', -4.3628979161213044e-05), ('away', -4.052409175844072e-05), ('fast', -3.877141151200752e-05), ('especially', -3.5593426961842966e-05), ('little', -3.0184078924040153e-05), ('never', -2.7248013774370832e-05), ('right', -2.11946200471519e-05), ('not', -1.594014173398639e-05), ('more', -1.3921295839860365e-05), ('hard', -1.2875580688689063e-05), ('ever', -1.0818887271749949e-05), ('just', -8.23636069950847e-06), ('nice', -7.982349428942723e-06), ('together', -4.291860229563018e-06), ('much', -4.250653284081997e-06), ('acceptable', 0.0), ('accurate', 0.0), ('active', 0.0), ('afraid', 0.0), ('ago', 0.0), ('ahead', 0.0), ('alcoholic', 0.0), ('almost', 0.0), ('alone', 0.0), ('american', 0.0), ('anal', 0.0), ('angry', 0.0), ('annoying', 0.0), ('anxious', 0.0), ('anyway', 0.0), ('apart', 0.0), ('apparently', 0.0), ('atm', 0.0), ('awesome', 0.0), ('awkward', 0.0), ('barely', 0.0), ('basic', 0.0), ('basically', 0.0), ('belly', 0.0), ('bible', 0.0), ('bigger', 0.0), ('biggest', 0.0), ('boring', 0.0), ('bright', 0.0), ('bunim', 0.0), ('certain', 0.0), ('clear', 0.0), ('clearly', 0.0), 
('cold', 0.0), ('common', 0.0), ('complete', 0.0), ('completely', 0.0), ('constantly', 0.0), ('cute', 0.0), ('dead', 0.0), ('deep', 0.0), ('delicious', 0.0), ('diaper', 0.0), ('didn', 0.0), ('difficult', 0.0), ('disappointed', 0.0), ('doesn', 0.0), ('double', 0.0), ('down', 0.0), ('dry', 0.0), ('dumb', 0.0), ('early', 0.0), ('easier', 0.0), ('embarrassing', 0.0), ('emotional', 0.0), ('everyone', 0.0), ('everywhere', 0.0), ('exactly', 0.0), ('excited', 0.0), ('exciting', 0.0), ('extra', 0.0), ('fabulous', 0.0), ('fair', 0.0), ('family', 0.0), ('fat', 0.0), ('favorite', 0.0), ('finally', 0.0), ('fine', 0.0), ('forever', 0.0), ('forward', 0.0), ('free', 0.0), ('full', 0.0), ('funny', 0.0), ('fur', 0.0), ('girlfriend', 0.0), ('glad', 0.0), ('god', 0.0), ('gray', 0.0), ('green', 0.0), ('gross', 0.0), ('grown', 0.0), ('guilty', 0.0), ('guys', 0.0), ('huge', 0.0), ('huh', 0.0), ('hundred', 0.0), ('hungry', 0.0), ('immediately', 0.0), ('important', 0.0), ('incredible', 0.0), ('inside', 0.0), ('isn', 0.0), ('kardashian', 0.0), ('kelly', 0.0), ('khloe', 0.0), ('kim', 0.0), ('kris', 0.0), ('laker', 0.0), ('lamar', 0.0), ('late', 0.0), ('later', 0.0), ('lily', 0.0), ('literally', 0.0), ('live', 0.0), ('low', 0.0), ('luxurious', 0.0), ('mad', 0.0), ('major', 0.0), ('male', 0.0), ('mean', 0.0), ('miserable', 0.0), ('miss', 0.0), ('moral', 0.0), ('most', 0.0), ('murray', 0.0), ('natural', 0.0), ('necessary', 0.0), ('normally', 0.0), ('off', 0.0), ('often', 0.0), ('oh', 0.0), ('older', 0.0), ('open', 0.0), ('outside', 0.0), ('past', 0.0), ('people', 0.0), ('personal', 0.0), ('poor', 0.0), ('possible', 0.0), ('possibly', 0.0), ('private', 0.0), ('professional', 0.0), ('public', 0.0), ('quiet', 0.0), ('rather', 0.0), ('red', 0.0), ('regular', 0.0), ('rid', 0.0), ('ridiculous', 0.0), ('rude', 0.0), ('sad', 0.0), ('safe', 0.0), ('scary', 0.0), ('scott', 0.0), ('second', 0.0), ('secret', 0.0), ('selfish', 0.0), ('sensitive', 0.0), ('serious', 0.0), ('sexual', 0.0), ('sexy', 0.0), 
('short', 0.0), ('sick', 0.0), ('sister', 0.0), ('somewhere', 0.0), ('soon', 0.0), ('special', 0.0), ('straight', 0.0), ('stupid', 0.0), ('supportive', 0.0), ('sweet', 0.0), ('tall', 0.0), ('ten', 0.0), ('thebouncedryer', 0.0), ('top', 0.0), ('total', 0.0), ('touch', 0.0), ('tough', 0.0), ('true', 0.0), ('truly', 0.0), ('truthful', 0.0), ('tryclearblue', 0.0), ('twice', 0.0), ('ugly', 0.0), ('uncomfortable', 0.0), ('upset', 0.0), ('usually', 0.0), ('wear', 0.0), ('weird', 0.0), ('welcome', 0.0), ('what', 0.0), ('white', 0.0), ('who', 0.0), ('won', 0.0), ('wonderful', 0.0), ('worse', 0.0), ('yeah', 0.0), ('year', 0.0), ('yes', 0.0), ('yet', 0.0), ('young', 0.0), ('younger', 0.0), ('as', 2.4343249556039977e-06), ('only', 2.582367470460253e-06), ('so', 4.5872906918408804e-06), ('really', 5.177341671014243e-06), ('there', 6.622686249872567e-06), ('even', 7.352463528271139e-06), ('wrong', 9.315150160220583e-06), ('last', 9.688839274325689e-06), ('crazy', 1.1403028644407954e-05), ('already', 1.3778659611992945e-05), ('honestly', 1.5936762924714734e-05), ('anymore', 1.740462266778056e-05), ('beautiful', 1.8119881133579763e-05), ('comfortable', 1.8630300320441167e-05), ('it', 1.9896932173619897e-05), ('next', 2.0194676683226302e-05), ('perfect', 2.035002035002035e-05), ('then', 2.096979485492291e-05), ('old', 2.133469875405359e-05), ('few', 2.204585537918871e-05), ('whole', 2.2609708433704742e-05), ('cool', 2.2806057288815908e-05), ('don', 2.2937884209560507e-05), ('far', 2.5936300446104367e-05), ('first', 2.6511891191910728e-05), ('gorgeous', 2.755731922398589e-05), ('same', 2.8810141169691734e-05), ('least', 2.9394473838918284e-05), ('nervous', 3.0062530062530064e-05), ('here', 3.1602702288756697e-05), ('armenian', 3.480924533556112e-05), ('healthy', 3.575003575003575e-05), ('up', 3.6200497677223196e-05), ('re', 3.647728651862571e-05), ('actually', 3.97868532420839e-05), ('happy', 4.038519539431358e-05), ('uh', 4.409171075837742e-05), ('ll', 4.5509892890229304e-05), 
('new', 4.701363334704312e-05), ('easy', 4.801690194948622e-05), ('close', 4.99151442547669e-05), ('they', 4.99151442547669e-05), ('real', 5.166997354497354e-05), ('best', 5.2843997912662086e-05), ('totally', 5.382384186372808e-05), ('clean', 5.628729032984352e-05), ('about', 5.7510927076144466e-05), ('sometimes', 5.7510927076144466e-05), ('back', 5.767585427992636e-05), ('you', 5.783005781507255e-05), ('definitely', 5.885838048759397e-05), ('else', 5.915025521390049e-05), ('different', 5.960923005872315e-05), ('let', 6.0864961881548294e-05), ('big', 6.147251353650963e-05), ('um', 6.421122925977295e-05), ('like', 6.961849067112225e-05), ('black', 7.215007215007215e-05), ('better', 7.3450285142192e-05), ('instead', 7.348618459729571e-05), ('super', 7.348618459729571e-05), ('many', 7.416471984277079e-05), ('maybe', 7.713572320313893e-05), ('all', 7.78089013383131e-05), ('fun', 7.78089013383131e-05), ('rob', 7.78089013383131e-05), ('pregnant', 8.089646832357684e-05), ('smart', 8.267195767195767e-05), ('ve', 8.484810932455602e-05), ('amazing', 8.512087163772558e-05), ('naked', 8.917424647761726e-05), ('jealous', 9.134922809902256e-05), ('single', 9.538538802332195e-05), ('gonna', 9.754871131710454e-05), ('well', 9.76719213100689e-05), ('absolutely', 9.93739151923774e-05), ('we', 0.00010064285035531607), ('wish', 0.00010175010175010176), ('half', 0.00011022927689594356), ('kimberly', 0.00011022927689594356), ('very', 0.00011096570085334131), ('adrienne', 0.00011337868480725624), ('pretty', 0.00011623261051178672), ('though', 0.00011671335200746966), ('dramatic', 0.00012025012025012025), ('everything', 0.00012025012025012025), ('honest', 0.00012025012025012025), ('online', 0.00012025012025012025), ('sure', 0.00012539060711603653), ('always', 0.00012899712426001428), ('fresh', 0.00012914890869172155), ('anywhere', 0.00013227513227513228), ('he', 0.00013812099954422052), ('own', 0.00013903390707904913), ('she', 0.00015714944728096892), ('strong', 0.00015873015873015873), 
('bruce', 0.00016534391534391533), ('such', 0.0001856981514926353), ('less', 0.00018726591760299626), ('okay', 0.00018726591760299626), ('married', 0.00019206760779794487), ('interested', 0.00019712201852946972), ('sudden', 0.00019712201852946972), ('once', 0.00021100385082027748), ('lately', 0.0002203128442388191), ('extremely', 0.0002204585537918871), ('before', 0.0002303668034005113), ('hopefully', 0.00023408239700374532), ('proud', 0.00024366471734892786), ('along', 0.00024968789013732833), ('small', 0.00024968789013732833), ('hot', 0.0002920629390449092), ('nude', 0.0003006253006253006), ('awful', 0.00030525030525030525), ('that', 0.00031435967164242603), ('out', 0.00032102728731942215), ('eric', 0.0003745318352059925), ('also', 0.0004005512349072824), ('love', 0.0004406256884776382), ('female', 0.0005350454788657035), ('positive', 0.0005952380952380953), ('certainly', 0.0006242197253433209), ('willing', 0.0007240099535444639), ('couple', 0.0009363295880149813), ('good', 0.0014266084463663577), ('great', 0.004029259646779759)]
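As a toy illustration of the subtraction-and-sort step above, here is a minimal sketch. The words and the integer scores in `toy_pos` and `toy_neg` are invented for the example; they stand in for whatever `seed_score` returns and are not taken from the Kardashian data.

```python
from operator import itemgetter

# Hypothetical association scores with the positive and the negative seeds
# (made-up integers; the real scores are small floats).
toy_pos = {"great": 4, "awful": 1, "okay": 2, "bad": 0}
toy_neg = {"great": 0, "awful": 3, "okay": 2, "bad": 5}

# Polarity = association with positive seeds minus association with negative seeds
toy_sent = {w: toy_pos[w] - toy_neg[w] for w in toy_pos}

# Ascending sort: most negative words first, most positive words last
toy_ranked = sorted(toy_sent.items(), key=itemgetter(1))
toy_neglexicon = toy_ranked[:2]
toy_poslexicon = toy_ranked[-2:]

print(toy_ranked)
```

In this sketch a word that is equally associated with both seed sets ("okay") ends up with a score of 0, and slicing a fixed number of items off either end of the ranking can pull such neutral words into a lexicon when few words carry a signal.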
We (roughly) calculate each sentence's sentiment score by comparing the number of its words with a positive sentiment score against the number with a negative sentiment score (according to our automatically induced lexicon).
final_message_sentiment = {}
for k, m in enumerate(msgs_adj_adv_only_tokenized):
    m_sent_score = sum([sentscores.get(w,0)>0 for w in m])-sum([sentscores.get(w,0)<0 for w in m])
    final_message_sentiment[msgs[k]]=m_sent_score
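The counting rule used in the loop above can be checked by hand on a small example. The sketch below assumes a made-up lexicon (`toy_sentscores`) and a hypothetical helper name (`count_based_score`); neither comes from the notebook.

```python
# Toy lexicon with invented scores (not the values induced from the transcripts).
toy_sentscores = {"good": 0.0014, "great": 0.0040, "bad": -0.0049, "too": -0.00004}

def count_based_score(tokens, lexicon):
    """Number of positively scored tokens minus number of negatively scored tokens."""
    n_pos = sum(lexicon.get(w, 0) > 0 for w in tokens)
    n_neg = sum(lexicon.get(w, 0) < 0 for w in tokens)
    return n_pos - n_neg

# One positive word, two negative words, one word missing from the lexicon
toy_score_mixed = count_based_score(["good", "too", "bad", "unknown"], toy_sentscores)
# A repeated positive word counts once per occurrence
toy_score_positive = count_based_score(["good", "great", "great"], toy_sentscores)

print(toy_score_mixed, toy_score_positive)
```

Only the sign of each lexicon entry matters here, so a weakly negative word such as "too" pulls the score down as much as "bad" does, and repeated words are counted every time they occur. Words missing from the lexicon contribute nothing.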
sorted(final_message_sentiment.items(), key=itemgetter(1), reverse=False)[-10:]
[("i know i'm setting myself up here, but, honestly, the warm nuts in new york are so good.", 5), ('this whole experience in new york has really opened my eyes to so many different things.', 6), ('we were like, "oh, yeah, good to see you." and then over text message, it was back before like, "oh, so good to look in your eyes."', 6), ('my first memory of bruce was it was my 11th birthday party and it was at tower lane, and i thought it was so cool that you four: you, casey, brody, everyone was coming, and i was, like, "we have four new, like, brothers and sisters, and they\'re all coming." and that\'s why it was so much fun \'cause we had such a good time.', 6), ("i wouldn't be a good manager or a good mom if i didn't find out who's really single out there and who would be a great match for kim.", 6), ("i definitely feel protective over summer because she's so young and new to the industry, but i think the smart thing to do is let her learn her own lessons and kind of feel her way through on her own.", 6), ("with sex, there's a real fine li from being fun to turning trashy, and we are the ones that have to make sure all the pictures and everything else with carmen looks really, really fun and sexy.", 6), ('once you\'ve broken trust, and you start thinking in your mind, "i want to be a better person, i want to be a better dad," people aren\'t automatically gonna go, "oh, scott, that\'s so great!', 6), ("this is a great time to tell khloe that it's not always all about us and that maybe once in a while it's a great thing to help somebody else out.", 6), ("so, tonight, khloe, i ask you to honor that very same promise to his grandmother, that you will always support lamar and stand by him because you have realized very quickly what the rest of us already know: it's very easy to love lamar.", 8)]
sorted(final_message_sentiment.items(), key=itemgetter(1))[:10]
[('now, i do not know what case you have him on, but whatever it is, it is going bad, and it sounds like it is going bad right now.', -6), ('all you can do right now is just dote on her, take care of her and just realize that this stage of pregnancy, you just got to get through it.', -5), ('look, i promise i will never, ever, ever lie ever again.', -5), ("i need scott to start taking things a little bit more seriously when it comes to the baby, like getting the room together, reading baby books, just being more involved in what i'm going through.", -5), ("i couldn't be any more sorry, and i'll never excuse the way i acted the other night in vegas, but, like, i don't know what i ever did so bad to, like, deserve you to, like, hate me so much.", -5), ("do you think it's 'cause you guys are spending way too much time together now that you guys live together?", -5), ('now too much! too much!', -5), ("i'm just saying it's probably not the right thing to do.", -4), ("you're going too fast, you're going too fast.", -4), ('a little more, a little more!', -4)]
Pretty good considering that we had absolutely no sentiment labels to start with!