%matplotlib inline
from __future__ import print_function
import json
from operator import itemgetter
from collections import defaultdict
from matplotlib import pyplot as plt
import numpy as np
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist,pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB
tokenizer = TweetTokenizer()
We again use the movie review data, but this time we ignore the sentiment labels and treat the corpus as if it were unlabeled.
## loading movie review data:
## http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
data = load_files('txt_sentoken')
print(data.data[0])
b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is that weak , but it's better than the other blockbuster right now ( sleepy hollow ) , but it makes the world is not enough look like a 4 star film . \nanyway , this definitely doesn't seem like an arnold movie . \nit just wasn't the type of film you can see him doing . \nsure he gave us a few chuckles with his well known one-liners , but he seemed confused as to where his character and the film was going . \nit's understandable , especially when the ending had to be changed according to some sources . \naside form that , he still walked through it , much like he has in the past few films . \ni'm sorry to say this arnold but maybe these are the end of your action days . \nspeaking of action , where was it in this film ? \nthere was hardly any explosions or fights . \nthe devil made a few places explode , but arnold wasn't kicking some devil butt . \nthe ending was changed to make it more spiritual , which undoubtedly ruined the film . 
\ni was at least hoping for a cool ending if nothing else occurred , but once again i was let down . \ni also don't know why the film took so long and cost so much . \nthere was really no super affects at all , unless you consider an invisible devil , who was in it for 5 minutes tops , worth the overpriced budget . \nthe budget should have gone into a better script , where at least audiences could be somewhat entertained instead of facing boredom . \nit's pitiful to see how scripts like these get bought and made into a movie . \ndo they even read these things anymore ? \nit sure doesn't seem like it . \nthankfully gabriel's performance gave some light to this poor film . \nwhen he walks down the street searching for robin tunney , you can't help but feel that he looked like a devil . \nthe guy is creepy looking anyway ! \nwhen it's all over , you're just glad it's the end of the movie . \ndon't bother to see this , if you're expecting a solid action flick , because it's neither solid nor does it have action . \nit's just another movie that we are suckered in to seeing , due to a strategic marketing campaign . \nsave your money and see the world is not enough for an entertaining experience . \n"
## building the term-document matrix
vec = CountVectorizer(min_df = 50)
X = vec.fit_transform(data.data)
terms = vec.get_feature_names()  ## on scikit-learn >= 1.2 this method is gone; use list(vec.get_feature_names_out())
len(terms)
2153
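To make the effect of `min_df` concrete, here is a minimal pure-Python sketch of a document-frequency filter on a toy corpus. The corpus, the helper name `vocabulary_with_min_df`, and the whitespace tokenization are assumptions for illustration only; `CountVectorizer` has its own tokenizer and implementation.

```python
from collections import Counter

def vocabulary_with_min_df(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency)
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical toy corpus, not taken from the review data
toy_docs = [
    "good film good cast",
    "bad film",
    "good plot bad ending",
]
print(vocabulary_with_min_df(toy_docs, min_df=2))
```

With `min_df=50` on the review corpus, the same idea keeps only words that appear in at least 50 reviews, which is why the vocabulary shrinks to 2153 terms.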
# PMI-type measure via matrix multiplication
def getcollocations_matrix(X):
    # X holds raw counts, so XX[w1, w2] sums count(w1) * count(w2) over documents;
    # with a binary X it would be the number of docs in which both w1 (row) and w2 (column) occur
    XX = X.T.dot(X)
    term_freqs = np.asarray(X.sum(axis=0))  # total count of each word, shape (1, n_terms)
    pmi = XX.toarray() * 1.0  # cast to a dense float array so simple element-wise operations work
    pmi /= term_freqs.T  # divide each row by the total count of w1
    pmi /= term_freqs  # divide each column by the total count of w2
    # This is not technically PMI because we ignore a normalization factor and do not take the log,
    # but it is sufficient for ranking
    return pmi
pmi_matrix = getcollocations_matrix(X)
pmi_matrix.shape
(2153, 2153)
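For comparison, here is a minimal sketch of a document-level PMI that keeps the normalization factor and the log, i.e. log(N * df(w1, w2) / (df(w1) * df(w2))). The function name `document_pmi`, the binarization step, and the toy count array are assumptions for illustration, not part of the notebook. The sketch also assumes every term occurs in at least one document, which `min_df` guarantees for the real data.

```python
import numpy as np

def document_pmi(counts):
    """Document-level PMI for a dense (n_docs, n_terms) count array."""
    present = (counts > 0).astype(float)   # 1.0 if the term occurs in the document
    n_docs = present.shape[0]
    co_docs = present.T @ present          # number of docs containing both w1 and w2
    doc_freq = present.sum(axis=0)         # number of docs containing each term
    ratio = n_docs * co_docs / np.outer(doc_freq, doc_freq)
    with np.errstate(divide="ignore"):     # pairs that never co-occur give log(0) = -inf
        return np.log(ratio)

# Hypothetical toy counts: 4 documents, 3 terms
toy_counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [3, 1, 0],
])
pmi = document_pmi(toy_counts)
print(np.round(pmi, 3))
```

Because the log is monotonic and N is a constant, this variant differs from the unnormalized score mainly through the binarization step, which stops long reviews with many repeated words from dominating the counts.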
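def getcollocations(w, PMI_MATRIX=pmi_matrix, TERMS=terms):
    # Words outside the vocabulary have no scores
    if w not in TERMS:
        return []
    idx = TERMS.index(w)
    col = PMI_MATRIX[:, idx].ravel().tolist()  # scores of every term against w
    # Pair each term with its score and sort from highest to lowest
    return sorted([(TERMS[i], val) for i, val in enumerate(col)], key=itemgetter(1), reverse=True)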
getcollocations("good")
[('good', 0.0012711337380982813),
('trek', 0.0010038914000850665),
('sean', 0.0009922470727116103),
('nudity', 0.0009374840201587473),
('nicely', 0.0009268742752181751),
('trash', 0.0009217014608968155),
('showed', 0.000916850400576306),
('compared', 0.00091151987499156),
('fairly', 0.0008716089901959017),
('comparison', 0.0008698557537213697),
('laughed', 0.0008665639627895953),
('crap', 0.0008473706979212659),
('pulp', 0.0008450365730278281),
('parts', 0.0008435572066033899),
('fifteen', 0.0008424927416009955),
('sorry', 0.0008413817621615216),
('pretty', 0.0008334590198961828),
('nights', 0.0008333717375608706),
('chris', 0.000833301911692621),
('doctor', 0.0008330167404996009),
('rating', 0.0008322781072402701),
('average', 0.0008295313148071339),
('forward', 0.0008295313148071339),
('watched', 0.0008295313148071339),
('cool', 0.0008275372491465399),
('stupid', 0.0008213343650560753),
('sadly', 0.0008174507616788748),
('matt', 0.0008162941129751053),
('hate', 0.0008140549843070009),
('kills', 0.0008135787895223813),
('terrific', 0.0008122494124153186),
('horrible', 0.0008093970595933685),
('agrees', 0.0008091330037872864),
('subplot', 0.0008082612810941305),
('totally', 0.0008068044294699522),
('sad', 0.0008064887782847135),
('technical', 0.0008033355890763824),
('therefore', 0.0008002537389904116),
('handled', 0.0007999051964211649),
('scientist', 0.0007949675100235034),
('lovely', 0.0007943816828237808),
('barry', 0.000792934345036231),
('villain', 0.0007926632563712613),
('event', 0.0007924384511368963),
('producers', 0.0007895539020453444),
('okay', 0.0007863956864371629),
('fit', 0.000785871771922548),
('mentioned', 0.0007854073087003715),
('detail', 0.0007852359533368501),
('information', 0.0007839070924927416),
('allen', 0.0007790149847387508),
('seven', 0.0007784369946922017),
('shouldn', 0.0007783256780906441),
('naturally', 0.0007776856076316881),
('comments', 0.0007747509449613798),
('entertain', 0.0007747509449613798),
('jail', 0.0007734819016444898),
('fbi', 0.0007733651320337342),
('climactic', 0.0007732919036337689),
('bad', 0.0007712559966343031),
('ended', 0.0007711136165812794),
('judge', 0.0007694203499660373),
('ones', 0.0007682591154179706),
('nice', 0.0007668341805484552),
('kill', 0.000764042000480255),
('critics', 0.0007636954961716471),
('danny', 0.0007634624490260348),
('presented', 0.0007617330823469355),
('rent', 0.0007604037052398728),
('sub', 0.0007604037052398728),
('genius', 0.0007595059440766616),
('thankfully', 0.0007594300769361086),
('wanted', 0.0007590994107197358),
('breaking', 0.0007584286306808082),
('batman', 0.0007559768139867969),
('total', 0.0007557951979353888),
('wasn', 0.0007554151148587302),
('bigger', 0.0007552449284064951),
('ensemble', 0.000752202124443757),
('steals', 0.000752202124443757),
('lot', 0.0007517244632705712),
('kiss', 0.0007491175649023608),
('directing', 0.0007486014304357063),
('perspective', 0.0007479380707277437),
('badly', 0.0007476696718985352),
('crash', 0.0007476696718985352),
('adds', 0.0007473255088352558),
('really', 0.0007456730464617401),
('job', 0.0007452354168096464),
('army', 0.000744825652379645),
('brown', 0.0007446186605355376),
('mainly', 0.0007441383853416938),
('pay', 0.0007420807243907193),
('dumb', 0.000741226368392181),
('explosions', 0.0007406529596492268),
('yeah', 0.0007402779454924423),
('driver', 0.0007399796387768184),
('recommend', 0.0007395349929176807),
('blame', 0.0007389503091672745),
('twice', 0.0007382828701783492),
('gary', 0.0007379596761595932),
('wouldn', 0.0007373611687174524),
('cares', 0.0007367547861773888),
('killed', 0.000736304171629268),
('fiction', 0.0007362894228326887),
('price', 0.0007357152732515652),
('murphy', 0.0007352663926699596),
('hits', 0.0007338161630986185),
('accent', 0.0007334803204610447),
('acts', 0.0007329420521241115),
('saw', 0.0007328073077446421),
('suspenseful', 0.0007327526614129683),
('guilty', 0.0007299875570302779),
('advice', 0.000729680323209979),
('ending', 0.0007295169009651391),
('aren', 0.0007290304055131928),
('jackson', 0.000728289303944846),
('ok', 0.0007282513286969607),
('actor', 0.0007277390106091889),
('news', 0.0007277251988989857),
('fights', 0.0007273426745772697),
('thinks', 0.00072659677209384),
('throw', 0.0007247925124324958),
('saying', 0.0007246200014638788),
('cop', 0.0007238458347956482),
('loves', 0.0007235356468040002),
('extra', 0.0007228772886176453),
('villains', 0.0007228772886176453),
('performance', 0.0007227970372811597),
('range', 0.0007226977363850031),
('flash', 0.0007224950161223425),
('gives', 0.0007223630680223859),
('thrills', 0.0007222643344441425),
('said', 0.0007214337497047879),
('surprised', 0.0007209945072622753),
('treat', 0.0007200792663256371),
('guys', 0.0007196493682561889),
('writing', 0.0007196184155951888),
('particular', 0.0007191066917321584),
('witty', 0.0007183570148845284),
('natural', 0.000717863637813866),
('acted', 0.0007174324884818456),
('liked', 0.0007171432011881029),
('cliched', 0.0007164134082425248),
('grace', 0.0007158548012965267),
('national', 0.000715805247454543),
('acting', 0.0007155453571609738),
('aliens', 0.0007152403336559289),
('chemistry', 0.0007146731327569155),
('guess', 0.0007139107996902104),
('instance', 0.0007133969307341352),
('violent', 0.0007131582166867087),
('mediocre', 0.0007118275471655811),
('alien', 0.0007110268412632576),
('scary', 0.0007110268412632576),
('ask', 0.0007106124899571602),
('probably', 0.0007102573316947909),
('nevertheless', 0.0007100225660637334),
('mean', 0.0007095577775416395),
('allowed', 0.0007091154787867436),
('loud', 0.0007090011237667811),
('flick', 0.0007089106899499742),
('fun', 0.000708794736453356),
('slightly', 0.000708672447749141),
('plain', 0.0007081364882499924),
('allows', 0.0007078066110039132),
('prison', 0.0007071414486880486),
('trailer', 0.00070639776026545),
('stuff', 0.0007058992438503015),
('fantastic', 0.0007056402742839906),
('dog', 0.0007054282047178778),
('critic', 0.0007051016175860639),
('hey', 0.0007051016175860639),
('overall', 0.0007051016175860639),
('working', 0.0007049068919253111),
('developed', 0.0007046989324817885),
('person', 0.0007034519814486633),
('visuals', 0.0007030783704767782),
('emotion', 0.0007022736699219486),
('menace', 0.0007016129344864077),
('murdered', 0.0007013310207005769),
('requires', 0.0007008109383715443),
('track', 0.0007006720814390355),
('usual', 0.0007006493308681725),
('lines', 0.0006999170468685193),
('saving', 0.0006999170468685193),
('yes', 0.0006999170468685193),
('able', 0.0006998405782738653),
('get', 0.0006997175379902146),
('maybe', 0.0006996916307503651),
('think', 0.0006990926451643161),
('bring', 0.0006989242567311171),
('remember', 0.0006988801327250104),
('de', 0.0006986421524297788),
('annoying', 0.0006978596775361603),
('wonderfully', 0.0006977822236318833),
('disappointing', 0.00069756042381509),
('included', 0.0006972871921567213),
('friends', 0.0006965130357913436),
('tell', 0.0006964706559291111),
('williams', 0.0006959289155473312),
('realistic', 0.0006955301024152124),
('except', 0.0006951165184263484),
('episode', 0.0006940976307569897),
('impressive', 0.0006938603053760607),
('terribly', 0.0006936598063473447),
('very', 0.0006935024277037634),
('language', 0.000693488179178764),
('doing', 0.000693268245804233),
('feeling', 0.0006931194985944053),
('somewhere', 0.0006923647194453244),
('study', 0.0006912760956726117),
('theatre', 0.0006912760956726117),
('dull', 0.0006902443403059361),
('decided', 0.000689946718565549),
('hotel', 0.000689668476845466),
('seemingly', 0.0006890461727833452),
('thrillers', 0.0006890461727833452),
('mood', 0.0006884254725976731),
('confused', 0.0006883344952654942),
('anti', 0.0006881339316013725),
('brilliant', 0.0006881339316013725),
('reason', 0.000688112360680975),
('smart', 0.0006876378004322295),
('direction', 0.0006873678209267594),
('jackie', 0.0006873678209267594),
('actually', 0.0006873117883139156),
('drop', 0.000686248633158629),
('planet', 0.0006861555320009626),
('brian', 0.0006859585872443607),
('above', 0.000685583233708249),
('lawyer', 0.000685264999188502),
('better', 0.0006851280870126166),
('warm', 0.0006849917675301333),
('biggest', 0.0006847808840354194),
('hundred', 0.0006846925138090629),
('screenplay', 0.0006846474207826003),
('did', 0.0006843905146410103),
('lose', 0.0006843633347158854),
('will', 0.0006842884672687007),
('direct', 0.0006840940063669222),
('scene', 0.0006840515924536997),
('george', 0.0006839995051918474),
('considered', 0.000683922094654818),
('sheer', 0.0006838028405842591),
('criminal', 0.0006834206854945138),
('general', 0.0006833962645302296),
('develops', 0.000683143435723522),
('rules', 0.000683143435723522),
('guy', 0.0006827392523224378),
('talent', 0.0006825963958166327),
('looks', 0.0006825376168108359),
('had', 0.0006825121813937092),
('great', 0.0006824846052572631),
('tension', 0.0006824512944512591),
('learn', 0.0006824341921233109),
('fact', 0.0006821735781395314),
('entertainment', 0.0006821027635973353),
('agent', 0.0006814007228772886),
('explained', 0.0006814007228772886),
('hit', 0.0006810888689995415),
('reasons', 0.0006807222621508924),
('moved', 0.0006804749066777271),
('offensive', 0.0006802156781418498),
('threatening', 0.0006799437006615852),
('feel', 0.0006798736033728572),
('huge', 0.000679823000596379),
('running', 0.0006792911231160586),
('master', 0.0006792539027043922),
('cops', 0.0006791217906937525),
('why', 0.000678920708442264),
('gore', 0.0006787074393876551),
('failure', 0.0006781978992679947),
('soundtrack', 0.0006781978992679947),
('besides', 0.0006781089319455142),
('either', 0.0006780236523876963),
('aforementioned', 0.0006779823246019844),
('feels', 0.000677834616034533),
('me', 0.0006772941322099265),
('definitely', 0.0006772162428792703),
('capable', 0.0006770439407617049),
('intelligent', 0.0006762483544623375),
('rated', 0.0006761248387811572),
('flicks', 0.0006759144046576647),
('girls', 0.0006755789965569017),
('care', 0.0006753585539959397),
('anyway', 0.0006753514107461222),
('well', 0.0006750278432664558),
('relief', 0.0006745639263266804),
('done', 0.0006741033421380078),
('asking', 0.0006739941932807963),
('evil', 0.0006738733408165179),
('jump', 0.0006732428062202827),
('supporting', 0.0006732428062202827),
('gets', 0.0006727355113724907),
('feet', 0.0006727296638374928),
('sure', 0.0006725072227117109),
('although', 0.0006724942545826389),
('credit', 0.0006722064102747464),
('weird', 0.0006719203649937786),
('happening', 0.0006718035295973268),
('necessary', 0.0006717400320992552),
('right', 0.0006715253500819656),
('1996', 0.0006704431174468618),
('hurt', 0.0006704431174468618),
('basically', 0.0006703084795005513),
('dies', 0.0006700060619596082),
('roles', 0.0006697805845704244),
('interesting', 0.0006689558957182921),
('star', 0.0006687483070094306),
('usually', 0.0006687411040075134),
('whom', 0.0006685774776057497),
('try', 0.0006682335591501913),
('though', 0.0006680374524563834),
('haunting', 0.0006679343054291209),
('major', 0.0006678430076837096),
('role', 0.0006678107603149175),
('path', 0.0006675134798838656),
('regular', 0.0006675134798838656),
('sign', 0.0006674389889252802),
('loved', 0.0006670457995356336),
('don', 0.0006670226685137735),
('thing', 0.0006670088013375039),
('expecting', 0.0006667751707626963),
('make', 0.0006663531085448293),
('knows', 0.0006663202799443585),
('isn', 0.0006661965036826523),
('amount', 0.0006661387831026985),
('relatively', 0.0006661387831026985),
('bruce', 0.0006656732773143668),
('ideas', 0.0006653532420848887),
('he', 0.0006653376447000585),
('want', 0.0006651063577650056),
('reminded', 0.0006650552782505471),
('subplots', 0.0006650552782505471),
('grow', 0.0006648449508380707),
('rise', 0.0006648449508380707),
('sometimes', 0.0006648229310006633),
('special', 0.0006647811930510132),
('individual', 0.0006646885535313573),
('forever', 0.0006645170210014137),
('scenes', 0.0006644715123710205),
('action', 0.0006642620639475215),
('aspect', 0.0006642051436742437),
('kind', 0.0006640702385978397),
('getting', 0.0006640336879613757),
('just', 0.0006639106047940057),
('believable', 0.0006636250518457072),
('boring', 0.0006636250518457072),
('cliche', 0.0006636250518457072),
('funny', 0.0006636250518457072),
('irritating', 0.0006636250518457072),
('weight', 0.0006636250518457072),
('went', 0.0006636250518457072),
('also', 0.0006635828794351934),
('effects', 0.0006633694181585554),
('jack', 0.0006633457483727081),
('bit', 0.0006630408748634487),
('need', 0.00066283752211646),
('but', 0.0006625489862068561),
('disappointment', 0.0006623869454056965),
('hardly', 0.0006622308815687205),
('tight', 0.0006621697337495544),
('likes', 0.000661949231007713),
('budget', 0.0006618118686439429),
('frightening', 0.0006616499772866426),
('heard', 0.0006615893921774687),
('black', 0.0006614045091292062),
('serves', 0.0006612206132520633),
('typical', 0.0006606624400071102),
('myself', 0.0006603810746369642),
('again', 0.0006602878569010809),
('superb', 0.0006600571752228807),
('we', 0.0006600378894032979),
('musical', 0.0006598544549602202),
('nobody', 0.0006598544549602202),
('afraid', 0.0006595453896417377),
('richard', 0.0006595031571137463),
('system', 0.0006593710451031064),
('him', 0.000658615728676154),
('longer', 0.0006585592117552819),
('terrible', 0.0006584042253888791),
('decides', 0.0006583581863548682),
('knowing', 0.0006583581863548682),
('does', 0.0006581230584311701),
('makes', 0.0006581059926947726),
('wars', 0.0006580639480592907),
('sounds', 0.0006580116820462604),
('nothing', 0.0006577440462556566),
('built', 0.0006576998281685133),
('reading', 0.0006575553105178501),
('confusing', 0.0006574476909907604),
('wasted', 0.0006572981180887036),
('grown', 0.0006572440417318061),
('drawn', 0.0006571006482461006),
('fly', 0.000656712290888981),
('responsible', 0.000656712290888981),
('played', 0.0006564938091899696),
('was', 0.0006564883957972653),
('survive', 0.0006562747743727326),
('childhood', 0.0006560838580747332),
('gave', 0.0006559768907872017),
('too', 0.0006556821711522463),
('basic', 0.0006555973294443478),
('calls', 0.0006553297386976359),
('surprising', 0.0006553297386976359),
('some', 0.0006548712038000038),
('brief', 0.0006547372163299164),
('became', 0.0006544925970037938),
('beat', 0.0006542782201295704),
('started', 0.0006538658599067997),
('anyone', 0.0006534515545886386),
('jerry', 0.0006531742636276646),
('however', 0.0006529728297040989),
('heroes', 0.0006529479161105658),
('like', 0.000652946803213366),
('admit', 0.0006528050781743098),
('shoot', 0.0006528050781743098),
('case', 0.0006527621417708519),
('then', 0.0006527316279785068),
('depth', 0.0006526033071035145),
('script', 0.0006525362013649949),
('movies', 0.0006524133101944996),
('times', 0.0006520875564461009),
('buy', 0.0006517746044913196),
('provide', 0.0006517746044913196),
('performances', 0.0006514996521165078),
('tough', 0.0006513116963915387),
('thrown', 0.0006512548480284078),
('hill', 0.0006512208452691519),
('beginning', 0.000651103824452392),
('loving', 0.0006510856249939714),
('ups', 0.0006510245761777506),
('see', 0.0006509615377774679),
('course', 0.000650951656758376),
('problem', 0.0006504279627465027),
('best', 0.0006503077449163204),
('room', 0.00065000588100559),
('filmmakers', 0.0006499420610860018),
('places', 0.0006493805747227564),
('never', 0.0006493165422671562),
('supposedly', 0.0006487360282466047),
('kevin', 0.0006486366355657029),
('especially', 0.0006485261265981212),
('even', 0.000648425062841444),
('occasionally', 0.0006482891787988526),
('company', 0.0006482129946307112),
('money', 0.0006480713396930734),
('fair', 0.0006478244553731903),
('science', 0.0006477404096472726),
('not', 0.0006476204670378808),
('next', 0.0006475053385230357),
('know', 0.0006468572043976747),
('seems', 0.000646841698041768),
('memories', 0.0006467532284936977),
('unbelievable', 0.0006467532284936977),
('sick', 0.0006463880375120525),
('actors', 0.0006462354435466341),
('supposed', 0.0006459044138513752),
('idea', 0.0006457879795324968),
('likable', 0.0006457743779827689),
('extremely', 0.0006455626764426486),
('ve', 0.0006454666510550725),
('plays', 0.0006453135893114008),
('creature', 0.0006451910226277709),
('held', 0.0006451910226277709),
('mike', 0.0006451910226277709),
('seconds', 0.0006451910226277709),
('time', 0.0006449425340547377),
('entertaining', 0.0006446039516335691),
('my', 0.0006441661752358124),
('help', 0.0006441611885932493),
('awful', 0.0006441436346040245),
('could', 0.0006440929900534719),
('considering', 0.0006440669964559454),
('dr', 0.0006440243119177749),
('should', 0.000643552677141085),
('slowly', 0.0006433766496732495),
('fans', 0.0006433099992381855),
('pull', 0.0006432382652953624),
('mistake', 0.0006431873237997343),
('moral', 0.0006431197833898005),
('occur', 0.0006428867689755288),
('characterization', 0.0006425946804843995),
('entirely', 0.0006425946804843995),
('fire', 0.0006425946804843995),
('bond', 0.0006422177921087489),
('nomination', 0.0006422177921087489),
('doesn', 0.0006421234962308942),
('series', 0.000641827148682892),
('today', 0.000641765780712276),
('albeit', 0.0006417129039074055),
('present', 0.0006415906262961427),
('ahead', 0.0006415042167841836),
('speed', 0.0006414399120310978),
('anywhere', 0.0006410014705327854),
('efforts', 0.0006410014705327854),
('mad', 0.0006410014705327854),
('possible', 0.0006410014705327854),
('realize', 0.0006410014705327854),
('selling', 0.0006410014705327854),
('it', 0.0006405730734197016),
('flashbacks', 0.000640446970990802),
('holes', 0.000640446970990802),
('predictable', 0.0006403799435736392),
('flaw', 0.0006403399623072613),
('generally', 0.0006402693157977394),
('used', 0.0006399878692194824),
('animals', 0.000639924157136932),
('got', 0.0006397980885480555),
('things', 0.0006396737955731069),
('non', 0.0006396386041886335),
('pieces', 0.0006394303884971658),
('everything', 0.000639365173771159),
('so', 0.0006390972320839649),
('hasn', 0.0006390776966116186),
('place', 0.0006386716708311836),
('appearance', 0.0006386074407642222),
('largely', 0.0006386074407642222),
('stuck', 0.0006384594951043672),
('wants', 0.0006382347851870402),
('revolves', 0.0006381010113901031),
('theme', 0.0006378593064615462),
('seemed', 0.0006378000203469945),
('exciting', 0.0006377021982579842),
('fake', 0.0006377021982579842),
('saved', 0.0006376248166054836),
('go', 0.0006376136925584035),
('frank', 0.0006375101771202974),
('helped', 0.0006375101771202974),
('oh', 0.0006375101771202974),
('decent', 0.0006373228394249932),
('difference', 0.0006373228394249932),
('happened', 0.0006373228394249932),
('trust', 0.0006373228394249932),
('directors', 0.0006372308736472984),
('work', 0.0006371939070111661),
('etc', 0.0006370800497718789),
('our', 0.0006369861926514015),
('strikes', 0.000636961545298335),
('seen', 0.0006367336520799815),
('little', 0.0006363792864759592),
('funniest', 0.0006363527894410891),
('damn', 0.0006362882244259267),
('couple', 0.0006362330341904833),
('this', 0.0006362222842527883),
('way', 0.0006359903405904665),
('began', 0.0006359740080188027),
('pulls', 0.0006359740080188027),
('making', 0.0006359280760523128),
('instead', 0.0006357293085159097),
('always', 0.0006355965193658757),
('problems', 0.0006355965193658757),
('or', 0.0006355875258306249),
('entire', 0.000635364058522621),
('turn', 0.0006352884449487142),
('personal', 0.0006352463489707263),
('later', 0.0006351777737724782),
('exact', 0.000635109912899212),
('attention', 0.0006350561310452954),
('happens', 0.0006350094367225153),
('ever', 0.0006349762899425742),
('common', 0.0006349105063331525),
('describe', 0.000634717142390307),
('straight', 0.0006345914558274575),
('minor', 0.000634529550505457),
('been', 0.0006344902934719933),
('face', 0.0006343474760289847),
('fight', 0.0006343474760289847),
('twist', 0.0006343474760289847),
('have', 0.0006342080874479353),
('move', 0.0006342056273089425),
('society', 0.0006341900697073896),
('followed', 0.0006339989334597381),
('combination', 0.0006338871367865835),
('nearly', 0.0006336322258054493),
('hot', 0.0006335790357188346),
('may', 0.0006335218734437214),
('if', 0.0006334845249732937),
('social', 0.0006334602767618115),
('strong', 0.0006329819174554437),
('add', 0.0006326933757003564),
('subtle', 0.0006325922256802605),
('talking', 0.0006325176275404396),
('patrick', 0.0006324149627737556),
('took', 0.0006322647216517789),
('eddie', 0.0006318913035611389),
('government', 0.0006318913035611389),
('put', 0.0006318847691429929),
('before', 0.0006317650285653122),
('learned', 0.000631720001276202),
('together', 0.000631683328804283),
('cross', 0.0006314900549657912),
('deserves', 0.0006314900549657912),
('give', 0.0006313901451384067),
('character', 0.000631182985573547),
('ability', 0.0006310363216211411),
('player', 0.0006309111408392287),
('poor', 0.0006306710681067937),
('formula', 0.0006306130913584845),
('needs', 0.0006305939406678666),
('interested', 0.0006305786823940408),
('do', 0.0006304871168155528),
('game', 0.0006304437992534219),
('suspense', 0.0006303616674400746),
('short', 0.0006301762085067098),
('wild', 0.0006300908072045678),
('follow', 0.0006299954039481207),
('second', 0.0006299059485256166),
('all', 0.0006294991237565685),
('ago', 0.00062944335947677),
('say', 0.0006293887843642655),
('because', 0.0006290448272023219),
('powerful', 0.0006289852826559587),
('seeing', 0.0006288777224349069),
('audiences', 0.0006284328142478287),
('worker', 0.0006284328142478287),
('days', 0.0006283205941024273),
('were', 0.0006281126594862564),
('shot', 0.0006281077627921833),
('charming', 0.0006280737097825443),
('oliver', 0.0006280737097825443),
('film', 0.0006279666236763682),
('singing', 0.0006279091202359556),
('leaves', 0.0006278772935280517),
('films', 0.0006278191103276649),
('quite', 0.0006278043814335809),
('laughable', 0.0006277534274216149),
('battle', 0.0006275584729410491),
('powers', 0.0006275584729410491),
('details', 0.0006273766246440509),
('hell', 0.000627333056822895),
('taking', 0.0006272902091310146),
('mark', 0.0006271456627005742),
('perfectly', 0.0006271456627005742),
('robert', 0.000627137076486493),
('made', 0.0006271226129930316),
('generated', 0.000627086172503012),
('big', 0.0006268262942715562),
('starring', 0.0006266568084684328),
('suppose', 0.0006266568084684328),
('dramatic', 0.0006264244207177584),
('what', 0.0006260189663520424),
('dozen', 0.0006259190829908375),
('touches', 0.0006259190829908375),
('wrong', 0.0006259190829908375),
('seriously', 0.0006257867813457326),
('thoughts', 0.0006257867813457326),
('seem', 0.0006257614273719321),
('back', 0.0006256700813097204),
('loose', 0.0006256634493036858),
('sam', 0.0006256241759718609),
('violence', 0.0006255706449948188),
('any', 0.0006253106130995327),
('gotten', 0.0006251540343474053),
('record', 0.0006251540343474053),
('robin', 0.0006250970571295465),
('surprises', 0.0006250693710166432),
('completely', 0.0006249764337694657),
('join', 0.0006246470744029624),
('results', 0.0006245882840900773),
('people', 0.0006245715157190483),
('bunch', 0.0006245321967800837),
('industry', 0.0006245321967800837),
('cliches', 0.0006244274182888866),
('amazing', 0.0006244026472868915),
('point', 0.0006242677266906242),
('ass', 0.0006242017814390315),
('disturbing', 0.000624161911626727),
('which', 0.0006240510808640825),
('sense', 0.0006240167998774386),
('monster', 0.000623891198951584),
('write', 0.000623891198951584),
('ship', 0.00062371956814097),
('hold', 0.0006237077554940856),
('order', 0.0006236089285609969),
('movie', 0.0006234062228935263),
('unlike', 0.0006233902994508702),
('re', 0.0006232476883776215),
('save', 0.0006229081301665292),
('heart', 0.0006228812876201978),
('killer', 0.0006228611418740851),
('between', 0.0006227931995624544),
('take', 0.0006223776384022585),
('asks', 0.0006221484861053505),
('edge', 0.0006221484861053505),
('finally', 0.0006221484861053505),
('lacking', 0.0006221484861053505),
('quiet', 0.0006221484861053505),
('shooting', 0.0006221484861053505),
('stunning', 0.0006221484861053505),
('tommy', 0.0006221484861053505),
('tradition', 0.0006221484861053505),
('going', 0.0006216814076623284),
('they', 0.000621589734442527),
('cast', 0.0006213394503626908),
('sound', 0.000621302025580037),
('mission', 0.0006211748578015862),
('there', 0.0006210483119477813),
('doubt', 0.0006209634413699117),
('kids', 0.0006208839566620469),
('brought', 0.0006208275763683964),
('inside', 0.0006208275763683964),
('six', 0.0006207377185631616),
('small', 0.0006206982565340094),
('thought', 0.0006206420733060639),
('race', 0.0006205155504462813),
('can', 0.000620277579253773),
('one', 0.0006202348373100844),
('explain', 0.000620135060583974),
('using', 0.0006200746578183327),
('many', 0.0006198587703310405),
('humanity', 0.0006197086881206236),
('much', 0.0006196181929293892),
('fan', 0.0006195233870078595),
('accept', 0.00061938338172266),
('trying', 0.0006192172800459613),
('1995', 0.0006191429378632956),
('lee', 0.00061902994732788),
('car', 0.0006189182239760392),
('claims', 0.0006188566951735761),
('out', 0.0006185562072468923),
('effectively', 0.0006185101908649683),
('frankly', 0.0006183778892198636),
('hard', 0.000618262266146714),
('told', 0.0006182356025449395),
('born', 0.0006181603547841623),
('fully', 0.0006180821561308057),
('air', 0.0006180282974556462),
('still', 0.0006179889451285238),
('rob', 0.0006177360854946742),
('against', 0.000617664533052339),
('silent', 0.0006176401637422683),
('failed', 0.0006175399788008664),
('plot', 0.0006173511305174044),
('important', 0.0006173256296239136),
('none', 0.0006170067630796864),
('broken', 0.0006169639153878059),
('shock', 0.0006169639153878059),
('south', 0.0006169639153878059),
('books', 0.0006168309776770996),
('spend', 0.0006166182773399696),
('means', 0.0006164822885998373),
('girlfriend', 0.0006164407018291546),
('same', 0.0006164275804859909),
('suspects', 0.0006163878519747455),
('five', 0.000616306716282765),
('being', 0.0006162586897745075),
('weren', 0.0006162232624281567),
('obsessed', 0.0006160489911435334),
('whatever', 0.0006160489911435334),
('van', 0.0006160232548778717),
('college', 0.0006159579539052972),
('recently', 0.0006159579539052972),
('logic', 0.0006158641579628722),
('them', 0.0006158507656812686),
('marry', 0.000615458717437551),
('speech', 0.000615458717437551),
('far', 0.00061529015633726),
('would', 0.0006151668925359685),
('shows', 0.0006150671212228505),
('those', 0.0006149973540811511),
('here', 0.0006148167699391258),
('must', 0.0006147659258603031),
('long', 0.0006147065185682051),
('exist', 0.0006146527212125149),
('something', 0.0006145255546073584),
('land', 0.000614467640597877),
('no', 0.0006144303549400738),
('telling', 0.0006143521391616743),
('she', 0.0006141595264514455),
('winner', 0.0006140686356364499),
('almost', 0.0006140554976682077),
('throughout', 0.0006139081088059418),
('liners', 0.0006138531729572792),
('chance', 0.0006137787755299422),
('standing', 0.0006136259041039073),
('that', 0.0006135531164153746),
('fascinating', 0.0006135075349094428),
('ex', 0.0006133504267058808),
('quickly', 0.000613202560161352),
('minutes', 0.000613131841379186),
('obviously', 0.0006130527480043951),
('mess', 0.0006130184244643914),
('cute', 0.0006128626878052707),
('plenty', 0.0006128626878052707),
('comedies', 0.000612721993891633),
('enough', 0.000612576970934499),
('drama', 0.0006122731133100275),
('notice', 0.0006122731133100275),
('terms', 0.0006122731133100275),
('decide', 0.0006121138331036512),
('destroy', 0.0006117793446702613),
('50', 0.0006116035965103445),
('style', 0.0006116035965103445),
('succeeds', 0.0006115134692488488),
('theater', 0.0006115134692488488),
('has', 0.0006113816301969087),
('talented', 0.0006113753521468163),
('superior', 0.0006112336003842039),
('off', 0.000610998871659018),
('introduced', 0.0006108366954488896),
('certain', 0.0006106851136645484),
('remarkable', 0.0006106272178441403),
('taste', 0.0006106272178441403),
('john', 0.0006101940874583805),
('end', 0.0006100413906444177),
('smith', 0.0006098461149111769),
('read', 0.0006098176152095687),
('other', 0.0006097992006179753),
('sweet', 0.0006096555446172912),
('use', 0.0006096429888972028),
('visually', 0.0006093470769262281),
('ed', 0.000609187059311489),
('fox', 0.0006090702897007335),
('play', 0.0006088723226785255),
('credits', 0.000608867812346123),
('tried', 0.0006085813851622432),
('part', 0.0006082067833354827),
('office', 0.0006081900264811919),
('main', 0.0006081150616067336),
('despite', 0.0006080087477847743),
('where', 0.0006079708396391428),
('meeting', 0.000607944182769612),
('ways', 0.0006078840587343283),
('involved', 0.0006076402223690117),
('figures', 0.0006075440615488869),
('door', 0.0006073354269123659),
('halfway', 0.0006073354269123659),
('screenwriter', 0.0006073354269123659),
('willing', 0.0006073354269123659),
('opening', 0.0006072386095320195),
('married', 0.0006071569563196793),
('truth', 0.0006070411277231013),
('humor', 0.0006069741327857078),
('highly', 0.0006068997487008076),
('effort', 0.0006068383443891115),
('comic', 0.0006066880695697419),
('led', 0.0006064640704892491),
('friend', 0.0006064299831964133),
('worse', 0.0006060657361243958),
('than', 0.0006058864534585655),
('driving', 0.0006057761575236307),
('final', 0.0006057761575236307),
('paced', 0.0006057761575236307),
('yet', 0.0006057761575236307),
('points', 0.0006056087513009137),
('editing', 0.0006055578598092078),
('disaster', 0.0006054973100782),
('works', 0.0006053961127561814),
('hope', 0.0006052028074954771),
('conclusion', 0.0006051498935888108),
('manage', 0.0006050699002122624),
('pg', 0.0006050699002122624),
('comes', 0.0006048901606608638),
('generation', 0.0006048665837135352),
('past', 0.0006048665837135352),
('adaptation', 0.0006046583680220675),
('score', 0.0006045405100835009),
('students', 0.000604372815073769),
('value', 0.000604372815073769),
('you', 0.000604281417718327),
('only', 0.000604277821507802),
('watch', 0.0006039208079607493),
('how', 0.0006036931915274067),
('talk', 0.0006036321621141198),
('woody', 0.000603555542842432),
('about', 0.0006033704212867672),
('owner', 0.0006032955016779157),
('is', 0.0006032580875393416),
('are', 0.0006032575189779611),
('ll', 0.0006030994567791901),
('came', 0.0006030916856300515),
('field', 0.0006028570601796031),
('90', 0.0006025840683032954),
('enjoyed', 0.0006025840683032954),
('multiple', 0.0006025840683032954),
('everyone', 0.000602438547840305),
('obvious', 0.0006022624614353165),
('wonderful', 0.0006020791801019521),
('look', 0.000602031109907932),
('wait', 0.0006019862666482326),
('likely', 0.0006019160150124935),
('jeff', 0.0006018168362326266),
('dialogue', 0.0006017698266375452),
('didn', 0.0006017325599637242),
('now', 0.0006016610695602147),
('island', 0.0006015685107379979),
('over', 0.0006014962542014384),
('1998', 0.0006014102032351721),
('agree', 0.0006014102032351721),
('date', 0.0006014102032351721),
('opera', 0.0006014102032351721),
('remake', 0.0006011771888209005),
('be', 0.0006011213317860572),
('chief', 0.0006011096484109667),
('games', 0.0006011096484109667),
('producer', 0.0006010587069153386),
('while', 0.000600992473040906),
('due', 0.0006009869729725155),
('phone', 0.0006009869729725155),
('building', 0.0006006950900327521),
('given', 0.0006006665994669187),
('falls', 0.0006004557215967957),
('mary', 0.0006003950425352334),
('figure', 0.0006003187146630575),
('travel', 0.0006003187146630575),
('naked', 0.0006001903042428087),
('am', 0.0006000416871066832),
('addition', 0.0006000008053702085),
('unnecessary', 0.0005998149507066969),
('super', 0.0005997287208402928),
('hero', 0.0005996418225253119),
('forces', 0.0005995622374348592),
('onto', 0.0005994932191043153),
('stands', 0.0005994932191043153),
('choice', 0.0005994659892160929),
('imagine', 0.0005994659892160929),
('worth', 0.0005993948842101705),
('enjoy', 0.0005993089675258589),
('without', 0.000599238187956086),
('position', 0.00059910594958293),
('for', 0.0005988931282437008),
('released', 0.0005988905987743094),
('several', 0.0005988859731006158),
('law', 0.0005988595053420486),
('technology', 0.000598522594227932),
('otherwise', 0.0005982196981782216),
('early', 0.0005981933738790719),
('who', 0.0005981748562238021),
('version', 0.000598001170434595),
('immediately', 0.0005979750275450198),
('studio', 0.0005979750275450198),
('yourself', 0.0005979538227568091),
('pacing', 0.0005977505062580819),
('witness', 0.0005977505062580819),
('most', 0.0005976870249575253),
('audience', 0.0005976437317292098),
('happen', 0.0005976396063496852),
('similar', 0.0005976193343234191),
('often', 0.000597486744313787),
('stop', 0.0005973863573051375),
('surprisingly', 0.0005971756847433556),
('mental', 0.0005970111735354374),
('every', 0.0005969647212682807),
('creating', 0.0005967546703459484),
('girl', 0.0005964550383015896),
('background', 0.000596414850427027),
('million', 0.0005963657560505341),
('1997', 0.0005962256325176275),
('around', 0.0005961969250385714),
('leave', 0.0005961050611055916),
('unfortunately', 0.0005960899107710949),
('growing', 0.000595860521903716),
('members', 0.0005957036958682102),
('brings', 0.0005954556467674971),
('faces', 0.0005954556467674971),
('apparently', 0.0005953574029716272),
('let', 0.0005953107082733549),
('student', 0.0005950985519268569),
('veteran', 0.0005950985519268569),
('more', 0.0005950716124445559),
('forget', 0.0005949507380788871),
('whole', 0.0005949163974879445),
('career', 0.0005948830146027255),
('catherine', 0.0005948612718024842),
('watching', 0.0005948059224834115),
('anything', 0.0005946316706465377),
('starts', 0.0005945849455816957),
('screen', 0.0005942032822377343),
('up', 0.0005941929153640528),
('model', 0.0005941237795240284),
('and', 0.0005940629658477276),
('capture', 0.0005935439580085528),
('solid', 0.0005935439580085528),
('positive', 0.0005934339405927958),
('lots', 0.000593387363876636),
('buddy', 0.0005932723960329503),
('era', 0.0005932113472167296),
('storyline', 0.0005930761269415491),
('thriller', 0.0005929729550712593),
('old', 0.0005925904737386595),
('as', 0.0005925848590923172),
('cause', 0.0005925223677193814),
('handle', 0.0005925223677193814),
('heroine', 0.0005925223677193814),
('mouth', 0.0005925223677193814),
('provided', 0.0005925223677193814),
('easy', 0.0005922554657519402),
('sets', 0.0005920306479121454),
('twenty', 0.0005920159383452623),
('ben', 0.0005919268678523268),
('us', 0.0005918044934621259),
('haven', 0.0005917675621554076),
('stone', 0.0005917675621554076),
('taken', 0.0005917323378957556),
('fill', 0.0005916510112962647),
('least', 0.0005914705528654417),
('begins', 0.0005914642920627396),
('friendly', 0.0005914251040754566),
...]
## Example of part-of-speech (POS) tagging (note that the sentence must be tokenized first)
pos_tag(tokenizer.tokenize("This was a great day but the time is running out fast"))
[('This', 'DT'),
('was', 'VBD'),
('a', 'DT'),
('great', 'JJ'),
('day', 'NN'),
('but', 'CC'),
('the', 'DT'),
('time', 'NN'),
('is', 'VBZ'),
('running', 'VBG'),
('out', 'RP'),
('fast', 'RB')]
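Given tagged output like the list above, keeping only adjectives and adverbs comes down to filtering on the tag. The following is a minimal sketch of that filter. It uses a hard-coded copy of the tagged tokens so that it runs without the NLTK tagger model, and the names `ADJ_ADV_TAGS` and `keep_adj_adv` are illustrative, not part of the notebook.

```python
# Penn Treebank tags for adjectives (JJ, JJR, JJS) and adverbs (RB, RBR, RBS).
ADJ_ADV_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}

def keep_adj_adv(tagged_tokens, tags=ADJ_ADV_TAGS):
    """Join the words whose POS tag is in `tags` into a single string."""
    return " ".join(word for word, tag in tagged_tokens if tag in tags)

# Hard-coded copy of the tagger output for the example sentence.
tagged = [('This', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('great', 'JJ'),
          ('day', 'NN'), ('but', 'CC'), ('the', 'DT'), ('time', 'NN'),
          ('is', 'VBZ'), ('running', 'VBG'), ('out', 'RP'), ('fast', 'RB')]

print(keep_adj_adv(tagged))  # great fast
```

Only 'great' (JJ) and 'fast' (RB) survive; 'out' is tagged RP (particle) and is dropped.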
## POS tagging all reviews
## POS tagging is relatively slow, so this will take a while
reviews_pos_tagged=[pos_tag(tokenizer.tokenize(m)) for m in data.data]
## Reconstructing adjective-and-adverb-only reviews
reviews_adj_adv_only=[" ".join([w for w,tag in m if tag in ["JJ","RB","RBS","RBR","JJR","JJS"]])
                      for m in reviews_pos_tagged]
print(data.data[1])
b"good films are hard to find these days . \ngreat films are beyond rare . \nproof of life , russell crowe's one-two punch of a deft kidnap and rescue thriller , is one of those rare gems . \na taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these days , genuine motivation in a story that rings true . \nconsider the strange coincidence of russell crowe's character in proof of life making the moves on a distraught wife played by meg ryan's character in the film -- all while the real russell crowe was hitching up with married woman meg ryan in the outside world . \ni haven't seen this much chemistry between actors since mcqueen and mcgraw teamed up in peckinpah's masterpiece , the getaway . \nbut enough with the gossip , let's get to the review . \nthe film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south america who is kidnapped during a mass ambush of civilians by anti-government soldiers . \nupon discovering his identity , the rebel soldiers decide to ransom him for $6 million . \nthe only problem is that the company peter bowman works for is being auctioned off , and no one will step forward with the money . \nwith no choice available to her , bowman's wife alice ( ryan ) hires terry thorne ( crowe ) , a highly skilled negotiator and rescue operative , to arrange the return of her husband . \nbut when things go wrong -- as they always do in these situations -- terry and his team ( which includes the most surprising casting choice of the year : david caruso ) take matters into their own hands . \nthe film is notable in that it takes this very simple story line and creates a complex and intelligent character-driven vehicle filled with well-written dialogue , shades of motivation , and convincing acting by all the actors . 
\nthe script is based on both a book ( the long march to freedom ) and a magazine article pertaining to kidnap/ransom situations , and the story has been sharply pieced together by tony gilroy , screenwriter of the devil's advocate and dolores claiborne . \nthe biggest surprise for me was not the chemistry between crowe and ryan , but that between crowe and david caruso . \ndug out from b-movie hell , caruso pulls off a gutsy performance as crowe's right hand gun while providing most of the film's humor . \nryan cries a lot and smokes too many cigarettes , david morse ends up getting everyone at the guerilla camp to hate him , and crowe provides another memorable acting turn as the stoic , gunslinger character of terry thorne . \nthe most memorable pieces of the film lie in its action scenes . \nthe bulk of those scenes , which bookend the movie , work extremely well as establishment and closure devices for all of the story's characters . \nthe scenes are skillfully crafted and executed with amazing accuracy and poise . \ndirector taylor hackford mixes both his old-school style of filmmaking with the dizziness of a lars von trier film . \nproof of life is a thinking man's action movie . \nit is a film about the choices men and women make in the face of love and war , and the sacrifices one makes for those choices -- the sacrifices that help you sleep at night . \n"
## It mostly works, although some names and contractions (e.g. "crowe's", "david", "let's") slip through the tagger:
reviews_adj_adv_only[1]
"good hard great rare crowe's one-two rare taut strong subtle intelligent masterful together virtually unheard genuine true strange distraught real married outside much enough let's david american south anti-government only forward available bowman's ryan terry highly skilled wrong always most surprising own notable very simple complex intelligent character-driven well-written long sharply together tony biggest not gutsy right most film's ryan too many david memorable gunslinger terry most memorable extremely well skillfully amazing old-school trier man's"
## term doc matrix only for adj/adv
X = vec.fit_transform(reviews_adj_adv_only)
terms = vec.get_feature_names()
len(terms)
576
pmi_matrix=getcollocations_matrix(X)
pmi_matrix.shape # n_words by n_words
(576, 576)
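The `getcollocations_matrix` function used here is defined earlier in the notebook. Purely as an illustration of how a words-by-words association matrix can be derived from a term-document matrix, here is a sketch of document-level pointwise mutual information (PMI). It is an assumption and may differ from the notebook's own function: `pmi_from_term_doc` is a hypothetical name, it expects a dense array (for example `X.toarray()`), and setting undefined entries to 0 is one convention among several.

```python
import numpy as np

def pmi_from_term_doc(X):
    """Document-level PMI from a dense (n_docs, n_terms) count matrix."""
    B = (np.asarray(X) > 0).astype(float)   # presence/absence of each term per document
    n_docs = B.shape[0]
    p_joint = (B.T @ B) / n_docs            # P(a and b occur in the same document)
    p_word = B.sum(axis=0) / n_docs         # P(a occurs in a document)
    expected = np.outer(p_word, p_word)     # P(a) * P(b) under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / expected)
    pmi[~np.isfinite(pmi)] = 0.0            # pairs that never co-occur, or unseen terms
    return pmi

# Toy corpus: terms 0 and 1 always appear together, term 2 appears on its own.
toy_X = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])
toy_pmi = pmi_from_term_doc(toy_X)
print(toy_pmi.shape)
```

In the toy corpus, terms 0 and 1 get a positive score because they co-occur twice as often as independence would predict, while terms 0 and 2 never co-occur and fall back to 0.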
getcollocations("good",pmi_matrix,terms)
[('good', 0.0012832614349917284),
('sean', 0.0009249332576569759),
('nicely', 0.0009142510507387667),
('fairly', 0.000867719006738669),
('pretty', 0.0008609133674701302),
('terrific', 0.000831187604463116),
('he', 0.0008287103623039518),
('sadly', 0.0008245212623420526),
('horrible', 0.0008203986560303424),
('technical', 0.000817968488099229),
('stupid', 0.0008165931732810714),
('forward', 0.0008158591569455379),
('lovely', 0.0008132714383389111),
('robin', 0.0008028131634819533),
('sad', 0.0008020759613116301),
('total', 0.0007996250034466595),
('cool', 0.0007961783439490446),
('totally', 0.0007939970334176773),
('naturally', 0.0007829087048832272),
('thankfully', 0.0007774887114619778),
('they', 0.0007756992159419563),
('bad', 0.0007720796800988853),
('nice', 0.0007720517274657402),
('average', 0.0007712639195805711),
('fun', 0.0007703436484226743),
('climactic', 0.0007673110589637576),
('badly', 0.0007654486534808358),
('dumb', 0.0007600492426269871),
('therefore', 0.0007561876508739784),
('mainly', 0.0007555888597477207),
('bigger', 0.0007541908292930254),
('twice', 0.0007477153143173636),
('really', 0.0007464406345154477),
('suspenseful', 0.0007449621931686967),
('anti', 0.0007446382965629712),
('guilty', 0.0007372021703231894),
('extra', 0.000736855251654802),
('gary', 0.0007360226468506723),
('smart', 0.000732211878708694),
('aren', 0.0007289455060155697),
('ve', 0.0007279344858962693),
('violent', 0.0007274069971383735),
('boring', 0.00072578337925946),
('forever', 0.0007236625698992255),
('co', 0.0007234410631438232),
('longer', 0.0007230160096402135),
('natural', 0.0007226849583537482),
('scary', 0.0007221226337134648),
('fantastic', 0.0007207509218907141),
('though', 0.0007206722287012948),
('nevertheless', 0.0007151637054419488),
('that', 0.000715019519211013),
('slightly', 0.000714275671039496),
('particular', 0.0007133757961783439),
('probably', 0.0007111342630959992),
('looking', 0.0007101544769016765),
('terribly', 0.0007101544769016765),
('intelligent', 0.0007096530262048105),
('witty', 0.0007094402154212624),
('able', 0.0007077140835102619),
('usual', 0.0007077140835102619),
('brilliant', 0.0007048832271762208),
('realistic', 0.0007032908704883227),
('overall', 0.000703118537513442),
('maybe', 0.0007005098084086604),
('impressive', 0.0006997398403157801),
('somewhere', 0.0006987980005684003),
('very', 0.0006972942200561076),
('general', 0.0006970316067780315),
('plain', 0.0006970316067780315),
('disappointing', 0.0006963906581740977),
('capable', 0.0006948465547191663),
('better', 0.0006924412263889158),
('fair', 0.0006912556164518837),
('weird', 0.0006907289455060155),
('sure', 0.0006904748942965504),
('past', 0.0006904264112413089),
('right', 0.0006895282417358496),
('great', 0.0006882164679575816),
('seemingly', 0.0006870005005782542),
('actually', 0.000685391918869078),
('national', 0.0006848845969454146),
('wonderfully', 0.0006844011489946297),
('loud', 0.0006830979414751223),
('necessary', 0.0006830979414751223),
('peter', 0.0006824385805277526),
('evil', 0.0006819790259280704),
('relatively', 0.0006819790259280704),
('biggest', 0.0006811154333917554),
('dull', 0.0006802721088435374),
('believable', 0.0006794055201698514),
('huge', 0.0006784004824181208),
('robert', 0.0006760084925690022),
('funny', 0.0006758333405353084),
('major', 0.000675087264745043),
('fake', 0.0006748559296329996),
('it', 0.000674409891345073),
('there', 0.0006743076615601185),
('isn', 0.0006732595820762097),
('black', 0.0006723283793347488),
('offensive', 0.0006723283793347488),
('definitely', 0.0006717286216368588),
('danny', 0.0006713720089516268),
('well', 0.0006712165858124412),
('sometimes', 0.0006709129511677283),
('hardly', 0.0006708415850416599),
('special', 0.0006705840136359559),
('awful', 0.0006682137625701541),
('also', 0.0006677363413883081),
('brief', 0.0006667411628859836),
('musical', 0.0006664698190407897),
('responsible', 0.0006664307619721632),
('basic', 0.0006652512384996461),
('either', 0.0006651259793698213),
('anyway', 0.0006640466187830329),
('again', 0.0006629841024745483),
('as', 0.0006629024930696557),
('just', 0.0006621746447228981),
('before', 0.0006605331446095777),
('usually', 0.0006603252990637598),
('occasionally', 0.0006601366661314207),
('basically', 0.000660085931918576),
('around', 0.0006583525129797142),
('interesting', 0.0006582175157064767),
('then', 0.000658144236310809),
('regular', 0.0006574892130675981),
('movie', 0.0006571630775452432),
('especially', 0.000655566729988453),
('unbelievable', 0.0006549354060959373),
('next', 0.0006545539933664034),
('terrible', 0.0006535061962626672),
('however', 0.0006534727007700416),
('supposedly', 0.0006532745386248571),
('too', 0.0006526945865931038),
('extremely', 0.0006523525785905076),
('typical', 0.0006521079769487413),
('best', 0.0006519679895476074),
('minor', 0.0006490255985362402),
('never', 0.0006488495909123945),
('social', 0.0006485234510712218),
('ahead', 0.0006484191197566994),
('frank', 0.0006484191197566994),
('together', 0.0006481608436062671),
('earlier', 0.0006481171080567661),
('even', 0.0006475500993617314),
('not', 0.000647194430057688),
('professional', 0.000646173728422413),
('predictable', 0.0006440198159943383),
('personal', 0.000643647334897754),
('tough', 0.0006435774946921443),
('generally', 0.000643126584626801),
('entirely', 0.0006429233575550971),
('always', 0.0006416607690493041),
('entire', 0.0006411056991798842),
('alien', 0.0006396646524035059),
('likable', 0.0006396301969953506),
('surprising', 0.0006393830685506504),
('re', 0.0006393282282497197),
('quite', 0.0006392831469314743),
('second', 0.0006375118821969114),
('little', 0.0006375096023289368),
('quiet', 0.0006369426751592356),
('instead', 0.0006365668977697612),
('nearly', 0.0006362510978789325),
('possible', 0.0006361034884989469),
('poor', 0.0006359642685921708),
('don', 0.0006359458947599255),
('common', 0.0006352968284533978),
('john', 0.0006348405541191062),
('strong', 0.0006345571220687517),
('ever', 0.0006344707113487858),
('short', 0.0006341305662181353),
('straight', 0.0006339096148013345),
('we', 0.0006335726080949011),
('mean', 0.0006334621140927917),
('interested', 0.0006334041047416844),
('wild', 0.0006331171936267478),
('worse', 0.0006331171936267478),
('running', 0.0006329367463846492),
('so', 0.0006323964773073481),
('dramatic', 0.0006314423066345445),
('laughable', 0.0006312044528605038),
('funniest', 0.0006308765544434335),
('powerful', 0.0006299433051025408),
('back', 0.000629555760482866),
('largely', 0.000626832473966232),
('big', 0.0006259819025465529),
('completely', 0.0006258191508398261),
('wrong', 0.0006257682422617052),
('didn', 0.000625466230561772),
('tight', 0.0006249248888354765),
('finally', 0.0006248104337276312),
('ago', 0.0006243185861020256),
('decent', 0.0006241183259949558),
('more', 0.000621968372132822),
('perfectly', 0.0006215946588903384),
('fully', 0.0006202905790766413),
('hard', 0.0006202687831393604),
('much', 0.0006202111087208874),
('important', 0.00062015503875969),
('mental', 0.0006195398698270162),
('moral', 0.0006194579742725115),
('later', 0.0006191509803223855),
('many', 0.0006188591291767968),
('subtle', 0.0006185532540916463),
('here', 0.0006184390993909081),
('hot', 0.0006183186203300183),
('recently', 0.0006179294609753779),
('frankly', 0.0006176413819725922),
('small', 0.0006176413819725922),
('same', 0.0006169314493496352),
('remarkable', 0.0006160102867737209),
('certain', 0.0006154967938407428),
('still', 0.0006152448508223882),
('almost', 0.000614907621277048),
('enough', 0.0006135664210955028),
('visually', 0.0006133522057088936),
('obviously', 0.000612731403881253),
('other', 0.0006126687777268868),
('present', 0.000611591722914092),
('tony', 0.0006112076175770443),
('long', 0.0006096304083441553),
('quickly', 0.0006094667166229549),
('worth', 0.0006093904474805919),
('far', 0.0006088127147421085),
('positive', 0.0006075453209211171),
('unfunny', 0.000607317434454155),
('comic', 0.0006067717063359034),
('due', 0.0006066120715802245),
('spectacular', 0.0006066120715802245),
('seriously', 0.000605535245417656),
('apparently', 0.0006054510915389225),
('obvious', 0.0006047294823925617),
('superior', 0.0006046339887381151),
('only', 0.0006043969151390608),
('immediately', 0.000604379143709377),
('likely', 0.000604379143709377),
('slowly', 0.0006036040778368514),
('effectively', 0.0006034193764666443),
('incredible', 0.0006034193764666443),
('unfortunately', 0.0006029501089433885),
('future', 0.000602698445311965),
('main', 0.0006023551447621176),
('final', 0.0006023019331768913),
('wonderful', 0.0006022374652947902),
('exciting', 0.0006015569709837226),
('top', 0.0006010365929811415),
('away', 0.0006009182398614209),
('incredibly', 0.0006000947518029163),
('now', 0.0005997020752003287),
('highly', 0.0005994754589733983),
('most', 0.0005991814703546777),
('several', 0.0005989389355912623),
('man', 0.0005986565034283526),
('often', 0.0005983923958135547),
('old', 0.0005982720466243037),
('oh', 0.0005976252260753322),
('star', 0.0005976252260753322),
('mary', 0.0005968833874133718),
('absolutely', 0.0005958495993425108),
('early', 0.0005956009613700224),
('french', 0.00059447983014862),
('pure', 0.00059447983014862),
('emotional', 0.0005941229995182786),
('surprisingly', 0.00059359055590756),
('practically', 0.0005928774586387854),
('easy', 0.0005917276087127468),
('few', 0.0005915726229176027),
('willing', 0.0005914467697907188),
('yet', 0.000591312086826336),
('similar', 0.0005912837020295414),
('least', 0.0005911743392196498),
('ll', 0.0005911494109321011),
('solid', 0.0005910033399138328),
('whole', 0.0005908314711500944),
('along', 0.0005894513353447312),
('happy', 0.0005885547820076039),
('unnecessary', 0.0005879470847623714),
('soon', 0.0005870488322717622),
('about', 0.000586717804716572),
('international', 0.000586391669194217),
('non', 0.0005850436423684831),
('up', 0.0005847487615003539),
('third', 0.0005845194097140311),
('quick', 0.000583864118895966),
('sweet', 0.0005830216021298824),
('entertaining', 0.0005830033855511562),
('intriguing', 0.000582964482349131),
('effective', 0.0005827808830538584),
('simple', 0.0005823475887170154),
('utterly', 0.0005813365685977151),
('first', 0.000581192769237202),
('available', 0.0005810705106715834),
('middle', 0.0005807418508804796),
('double', 0.0005794409058740269),
('apart', 0.0005793995674345695),
('doesn', 0.0005786797017725769),
('last', 0.0005779384843487602),
('she', 0.0005770591757852905),
('what', 0.0005767341635770193),
('deep', 0.0005766355217394896),
('apparent', 0.0005766007375125712),
('pathetic', 0.0005766007375125712),
('honest', 0.0005762814680012133),
('else', 0.000576117518792678),
('single', 0.0005757673899744503),
('secret', 0.0005756074545883464),
('michael', 0.0005753841128657395),
('exactly', 0.0005752567854478683),
('free', 0.0005746851204444231),
('otherwise', 0.0005743176159709176),
('friendly', 0.0005740347566249902),
('rather', 0.0005732484076433122),
('light', 0.000571948524632783),
('constantly', 0.0005711551688047606),
('previous', 0.0005707371641211789),
('certainly', 0.0005704842058212914),
('such', 0.0005690939798374554),
('popular', 0.000569045232629571),
('talented', 0.0005689196710160164),
('already', 0.0005686201044674146),
('appropriate', 0.0005678171135140473),
('flat', 0.000567251746325019),
('known', 0.0005661712668082095),
('normal', 0.0005661712668082095),
('out', 0.0005661712668082095),
('real', 0.0005661712668082095),
('less', 0.0005645254201023716),
('enjoyable', 0.0005644129709485567),
('impossible', 0.0005626326963906581),
('originally', 0.0005621841452109685),
('clever', 0.0005617480537862704),
('convincing', 0.00056160536949524),
('virtually', 0.00056160536949524),
('perhaps', 0.0005612799383692617),
('excellent', 0.0005611161662117076),
('particularly', 0.0005609928710752076),
('aside', 0.0005608300284420943),
('classic', 0.0005605766890729504),
('nasty', 0.0005605095541401274),
('you', 0.0005597375024126616),
('thoroughly', 0.0005594311326795403),
('safe', 0.000559004541911903),
('truly', 0.000559004541911903),
('fascinating', 0.0005583077769914288),
('different', 0.0005582527875521505),
('fast', 0.0005578452187669123),
('once', 0.0005570083587423185),
('easily', 0.0005566899298042442),
('emotionally', 0.0005566075629769897),
('new', 0.0005565231119904089),
('critical', 0.0005564096932425507),
('down', 0.0005563655897475252),
('key', 0.0005559568367369274),
('original', 0.0005557015278771899),
('over', 0.0005551008789097249),
('suddenly', 0.0005548065151022053),
('painful', 0.0005541761128504085),
('military', 0.0005538631957906397),
('serial', 0.0005533037380171138),
('intense', 0.0005525285856803009),
('cute', 0.0005520169851380043),
('nowhere', 0.000551579223849235),
('young', 0.0005511013442752416),
('bright', 0.0005506171111266653),
('dangerous', 0.0005504442871746481),
('silly', 0.0005503408202033747),
('humorous', 0.0005500868558193399),
('necessarily', 0.000549928648498138),
('soft', 0.0005490885130683066),
('somewhat', 0.0005486772108113266),
('crazy', 0.0005482544545674433),
('essentially', 0.0005479076775563317),
('close', 0.0005473421765129823),
('half', 0.0005471174260983178),
('mysterious', 0.0005469790204757277),
('potential', 0.0005464211063381557),
('slow', 0.0005463207498317021),
('familiar', 0.0005452879004095461),
('worst', 0.0005451500564069146),
('mad', 0.0005449398443029017),
('screen', 0.0005445799896841676),
('indeed', 0.0005441534953212235),
('animated', 0.0005440552016985138),
('visual', 0.0005425807973578674),
('low', 0.0005424444362627787),
('serious', 0.0005416989106494434),
('giant', 0.0005416520387180901),
('rarely', 0.0005411931226843178),
('steve', 0.0005411931226843178),
('no', 0.0005411194408432445),
('literally', 0.0005408003845691623),
('favorite', 0.0005407377919320594),
('like', 0.0005404362092260182),
('memorable', 0.0005401736065976285),
('aware', 0.0005397819281010471),
('standard', 0.0005393928960807942),
('chris', 0.0005387077352093038),
('shallow', 0.0005387077352093038),
('true', 0.0005377934893765022),
('sci', 0.0005376344086021505),
('poorly', 0.0005375615485386457),
('computer', 0.0005374203821656051),
('older', 0.0005372849776853417),
('physical', 0.0005366183710132754),
('rare', 0.0005366183710132754),
('clear', 0.0005357231027502098),
('comedic', 0.0005349215540298343),
('simply', 0.0005347540528206044),
('can', 0.0005345322842512801),
('unique', 0.0005337073180233351),
('complex', 0.0005332543326914531),
('female', 0.0005330442246013461),
('oddly', 0.0005325848357263666),
('surely', 0.0005325848357263666),
('fi', 0.0005324705961648636),
('dead', 0.000531612760912124),
('genuinely', 0.0005307855626326964),
('successful', 0.000529644088304454),
('rich', 0.0005296190009565806),
('constant', 0.0005294068988336504),
('jean', 0.0005293313556117849),
('lucky', 0.0005292470537555002),
('psychological', 0.0005290452820994744),
('cold', 0.0005289231571497747),
('mostly', 0.0005281963647661954),
('merely', 0.0005281316348195329),
('clearly', 0.0005275686804349225),
('overly', 0.0005265392781316348),
('life', 0.0005263623496107572),
('lee', 0.0005263000508358004),
('amazing', 0.0005259159703149652),
('ready', 0.0005259159703149652),
('late', 0.0005252946775020133),
('dark', 0.0005251443634163102),
('private', 0.0005248513141063681),
('open', 0.0005247766694708168),
('eventually', 0.0005237084217975938),
('united', 0.0005233792524564262),
('attractive', 0.0005230179690331935),
('graphic', 0.0005230179690331935),
('time', 0.0005229220728159156),
('weak', 0.0005226196308998857),
('large', 0.0005216863815589931),
('strange', 0.0005216863815589931),
('various', 0.0005211349160393747),
('to', 0.0005209980274352141),
('billy', 0.0005185727974747759),
('heavily', 0.0005182964905707506),
('genuine', 0.0005179533841954225),
('own', 0.0005177625711331581),
('barely', 0.000517274657402046),
('hilarious', 0.0005159235668789809),
('lead', 0.0005143386860440776),
('ultimate', 0.0005140239132864007),
('high', 0.0005112873174747606),
('traditional', 0.0005108811040339703),
('difficult', 0.0005095541401273885),
('successfully', 0.000508161915700811),
('possibly', 0.0005075636942675159),
('grand', 0.000506484536873609),
('greatest', 0.0005064087442006763),
('thus', 0.0005060155697098372),
('further', 0.0005057372551826141),
('dimensional', 0.0005055100596501871),
('wide', 0.0005042462845010616),
('recent', 0.0005031142773769634),
('complete', 0.0005025188758652747),
('innocent', 0.0005022487044266374),
('year', 0.000499562882477832),
('thin', 0.0004989384288747346),
('alive', 0.0004978402518485979),
('initially', 0.0004965993738529634),
('english', 0.0004963966388564935),
('david', 0.0004960803527682509),
('perfect', 0.0004944686557157225),
('human', 0.0004931504004900357),
('beautiful', 0.0004920049200492005),
('eccentric', 0.0004909766454352442),
('political', 0.0004908043124603634),
('all', 0.0004905190716743539),
('married', 0.0004905190716743539),
('chinese', 0.0004899559039686428),
('sole', 0.0004887233104995393),
('sympathetic', 0.0004875363686404026),
('sexual', 0.00048720527433232763),
('empty', 0.0004862680638312445),
('off', 0.0004861263635698075),
('tim', 0.0004852896572641795),
('modern', 0.00048489829463735357),
('foreign', 0.00048319789150010983),
('narrative', 0.00048319789150010983),
('outstanding', 0.0004820106730934756),
('lame', 0.00048166809265773043),
('painfully', 0.00048124557678697803),
('hearted', 0.00048071145295036654),
('fresh', 0.00048038774153423834),
('numerous', 0.00047928359714952384),
('ex', 0.0004789413913988051),
('full', 0.000478736230452094),
('unable', 0.0004786720710287589),
('unusual', 0.0004786720710287589),
('public', 0.00047845459166890946),
('blue', 0.00047770700636942675),
('subject', 0.000476979902858971),
('worthy', 0.0004768904131961457),
('ultimately', 0.00047569136499234053),
('equally', 0.00047481181239143023),
('film', 0.0004737351416150324),
('cheap', 0.00047335630503637184),
('former', 0.00047282951741550467),
('occasional', 0.0004725703718923361),
('alone', 0.0004720200181983621),
('sudden', 0.00047095155375410154),
('one', 0.0004706472969156914),
('near', 0.00047035766780989714),
('william', 0.0004702874232358514),
('british', 0.000470124355474674),
('fine', 0.00046983604175244),
('accidentally', 0.00046932618169627895),
('famous', 0.0004684393219425067),
('frequently', 0.00046834020232296744),
('steven', 0.0004639458991900605),
('green', 0.0004615526631588664),
('deadly', 0.00046123435097737757),
('sharp', 0.00046051254448132537),
('meanwhile', 0.0004596493532076958),
('of', 0.0004579326422713459),
('ill', 0.00045780254777070064),
('initial', 0.0004566758803028482),
('local', 0.00045631714041258675),
('ugly', 0.0004549590536851683),
('actual', 0.00045453186208546393),
('be', 0.0004542536908112378),
('ridiculous', 0.0004519259933272672),
('desperately', 0.00045158898662083375),
('somehow', 0.000449853902587711),
('heavy', 0.00044979161751985535),
('self', 0.0004482189195564992),
('american', 0.00044597422497684366),
('limited', 0.00044422668626490285),
('cinematic', 0.00044378462078763795),
('inevitable', 0.0004428268122535639),
('current', 0.0004411724156947087),
('younger', 0.00043862719021954696),
('bottom', 0.0004376939408786542),
('directly', 0.0004361048947036208),
('tiny', 0.00043312101910828024),
('unfortunate', 0.00043295449814745426),
('white', 0.0004315217691013869),
('greater', 0.00043136858423482627),
('tom', 0.0004311611954924057),
('teen', 0.00042832087141142803),
('extraordinary', 0.0004246284501061571),
('fellow', 0.0004246284501061571),
('latest', 0.0004246284501061571),
('unlikely', 0.0004246284501061571),
('odd', 0.00042015867694714494),
('latter', 0.00041755130927105446),
('romantic', 0.00041599779055115387),
('unexpected', 0.0004119529739835853),
('detective', 0.00040998608975766895),
('live', 0.00040875448935452505),
('red', 0.00040864780951076413),
('on', 0.00040208180673768863),
('the', 0.00040030077848549186),
('creative', 0.00039927749786101345),
('ten', 0.0003963198867657466),
('two', 0.0003951403632932295),
('desperate', 0.00039465467715748723),
('central', 0.0003832012842421418),
('in', 0.00038137925611386335),
('previously', 0.0003809166978893468),
('and', 0.00037276543329929824),
('bizarre', 0.00034742327735958313),
('angry', 0.00033777263076626134)]
We can make this better by combining multiple seed terms, summing each word's score over all the seeds.
def seed_score(pos_seed, PMI_MATRIX=pmi_matrix, TERMS=terms):
    ## sum each word's collocation score over all seed terms
    score = defaultdict(int)
    for seed in pos_seed:
        c = dict(getcollocations(seed, PMI_MATRIX, TERMS))
        for w in c:
            score[w] += c[w]
    return score
sorted(seed_score(['good','great','perfect','cool']).items(),key=itemgetter(1),reverse=True)
[('cool', 0.01233842898097543),
('perfect', 0.006798784836900034),
('great', 0.004248631481147458),
('frank', 0.004199495490947327),
('eccentric', 0.004084710911853615),
('fake', 0.003911762949434756),
('looking', 0.0038393777520473217),
('lovely', 0.0038028477315611427),
('greatest', 0.003730147480106928),
('amazing', 0.0036985631973772324),
('twice', 0.003661242122664058),
('anti', 0.003660653152490723),
('generally', 0.0035927347963626934),
('known', 0.0035564386989262358),
('totally', 0.003546906203489673),
('plain', 0.0035118779583992172),
('earlier', 0.003485008708798782),
('stupid', 0.0033319627079071512),
('sad', 0.0032958074094856364),
('convincing', 0.0032845107876468384),
('overall', 0.003268378795897639),
('nicely', 0.0032643270804776636),
('good', 0.003262124902614077),
('pretty', 0.003249145627494061),
('climactic', 0.003243555286391972),
('man', 0.003190950013225829),
('necessarily', 0.0031860920461633867),
('past', 0.0031797773077773686),
('fun', 0.003153100558314221),
('friendly', 0.0031332047075606456),
('terribly', 0.0031123173387858664),
('intriguing', 0.0031087143840738247),
('necessary', 0.0030837671387598064),
('extra', 0.0030693991457936233),
('actually', 0.0030683303367765517),
('apart', 0.0030613602501711593),
('they', 0.0030539757813460434),
('black', 0.003053138089032602),
('definitely', 0.0030493148349400295),
('best', 0.003043422883124742),
('bigger', 0.00302455408969759),
('musical', 0.003003959069610916),
('quiet', 0.0029999963749380303),
('john', 0.0029968693911054983),
('steven', 0.0029867301907443955),
('pure', 0.0029859258177321285),
('painful', 0.0029827416960734117),
('classic', 0.0029772145231463792),
('perfectly', 0.0029687945292047385),
('basically', 0.0029613324950677937),
('somewhere', 0.0029537299642018867),
('he', 0.002948724291562637),
('horrible', 0.002945643760366221),
('brilliant', 0.0029418273019810766),
('technical', 0.0029277317228736336),
('really', 0.0029263971284278645),
('fully', 0.0029147021133513534),
('forward', 0.002905429173729839),
('mainly', 0.002901635335665364),
('visually', 0.002897170954974414),
('shallow', 0.002869477485777347),
('nasty', 0.002865858196500654),
('maybe', 0.0028622287923194237),
('isn', 0.0028566090192177563),
('all', 0.00285461092390361),
('regular', 0.002851129537218903),
('probably', 0.0028510038317324277),
('scary', 0.0028496778141188024),
('slightly', 0.002839870019252718),
('present', 0.002839191031164896),
('non', 0.002836054321034321),
('green', 0.002833792308872344),
('entire', 0.002819008284677596),
('aren', 0.0028185457174324195),
('interesting', 0.0028106487026091083),
('professional', 0.002801892043622532),
('especially', 0.0027872535559887897),
('excellent', 0.002783733520408009),
('sympathetic', 0.00278352694514409),
('same', 0.0027832670373233006),
('mary', 0.0027826630422002983),
('huge', 0.002781314871475908),
('nevertheless', 0.002771202856021355),
('constantly', 0.002761195678000438),
('weird', 0.0027601413120321343),
('anyway', 0.002757779963983885),
('forever', 0.0027568329027658684),
('tony', 0.002756290460877205),
('future', 0.002749234391960185),
('very', 0.0027481646751586373),
('sure', 0.0027459142198264248),
('though', 0.0027453667922239028),
('soft', 0.002742373141880715),
('nice', 0.002741853177318124),
('blue', 0.002739395424413957),
('wonderful', 0.0027377931144858384),
('light', 0.0027373283200756394),
('sean', 0.0027337378934548374),
('second', 0.0027335395520831232),
('entirely', 0.002733523367163213),
('realistic', 0.002731294773691186),
('memorable', 0.002730908137254546),
('badly', 0.0027307312781661053),
('still', 0.0027191290459714634),
('danny', 0.002718729213354672),
('third', 0.00271541531851348),
('also', 0.0027056370383890453),
('inevitable', 0.002703987742769333),
('famous', 0.0027032161393452437),
('before', 0.0026925347274034638),
('literally', 0.0026894086542412756),
('smart', 0.002683582489884059),
('deadly', 0.0026766545015301318),
('yet', 0.0026766197782430662),
('quick', 0.002675617905789821),
('dumb', 0.0026710891292985525),
('poor', 0.002667166995342299),
('out', 0.0026667066493894324),
('wonderfully', 0.002665905523637924),
('it', 0.0026623466980691805),
('exactly', 0.0026622488358886177),
('always', 0.002652590008331684),
('again', 0.0026501302605318553),
('suspenseful', 0.0026485166277748695),
('straight', 0.002646189752081472),
('that', 0.002642264415676568),
('oh', 0.0026408399178469),
('didn', 0.002634709735067517),
('incredibly', 0.0026306080205473936),
('else', 0.0026236977815681873),
('not', 0.0026197610436078512),
('never', 0.0026185162906170443),
('then', 0.002613898264607873),
('original', 0.0026121619253186364),
('lucky', 0.002611194469540666),
('just', 0.0026082660263522482),
('final', 0.002605105324093498),
('obviously', 0.0025999411138145803),
('mental', 0.002598422855348094),
('chinese', 0.002594664847599488),
('right', 0.0025941719891221897),
('next', 0.002590120626479452),
('sadly', 0.002589445111141814),
('longer', 0.002578784157910541),
('alien', 0.0025777906620490405),
('likable', 0.0025776081267851907),
('together', 0.002577143083714865),
('ve', 0.002576658664841161),
('wrong', 0.002574267820832323),
('similar', 0.002570303153596778),
('about', 0.0025682590836591324),
('over', 0.0025672992416833455),
('easily', 0.0025649029055357614),
('comic', 0.0025648804047939894),
('certain', 0.002563451398133774),
('funny', 0.002562830642541606),
('boring', 0.0025616238435000253),
('desperately', 0.0025568098595270487),
('usual', 0.002556158669214207),
('like', 0.0025558699946088698),
('witty', 0.002554642026201358),
('late', 0.0025543690617334434),
('already', 0.0025527149553354602),
('originally', 0.0025514339703727),
('cold', 0.0025499605916891304),
('french', 0.0025444561222301753),
('believable', 0.002543143621598955),
('later', 0.002542810478900551),
('desperate', 0.002542474373414215),
('completely', 0.0025416873280540166),
('bad', 0.002541534118352169),
('long', 0.0025397945691588344),
('first', 0.002535085273247567),
('co', 0.00253448031405291),
('evil', 0.0025334075067660966),
('close', 0.0025325295336182225),
('terrific', 0.002531797517920264),
('wild', 0.00253043821721538),
('utterly', 0.0025275094878181724),
('beautiful', 0.0025262843425938046),
('traditional', 0.002524371276082079),
('mean', 0.0025242814452110457),
('as', 0.002523913421332458),
('computer', 0.0025223013263525763),
('strong', 0.0025219961530189706),
('incredible', 0.002517568804243094),
('ever', 0.0025150040801195025),
('nearly', 0.002514305025450897),
('therefore', 0.002513451112442391),
('little', 0.002506072311279282),
('single', 0.002505291173635525),
('movie', 0.002503536342636675),
('whole', 0.0025007145988450545),
('older', 0.002498322493840491),
('almost', 0.0024977229944715953),
('slowly', 0.0024942831852932394),
('general', 0.00249251121682983),
('there', 0.0024915178366652618),
('only', 0.002490669377909969),
('well', 0.0024880154640546824),
('major', 0.0024870428132837117),
('emotionally', 0.0024865103239952464),
('tough', 0.002486159146802659),
('merely', 0.002481832205084195),
('too', 0.0024813276898331686),
('seemingly', 0.0024776359121710554),
('occasionally', 0.0024768922155345282),
('able', 0.002476601756585323),
('re', 0.002472412146757419),
('extremely', 0.002471343494995713),
('once', 0.002471033221713364),
('away', 0.002466459287948913),
('important', 0.0024656807279107595),
('ll', 0.0024621555414009303),
('so', 0.0024614535416055787),
('thankfully', 0.002460228506662616),
('key', 0.0024601953216293253),
('instead', 0.002459485644152606),
('effectively', 0.0024586780396513193),
('robert', 0.0024573908936458004),
('most', 0.0024567708406460402),
('intense', 0.0024540340246803705),
('solid', 0.002452225134191952),
('top', 0.002451460433879865),
('last', 0.0024507916818635083),
('special', 0.0024474442315980055),
('creative', 0.0024472815239837834),
('tim', 0.0024459044680487695),
('psychological', 0.00244476670365184),
('clear', 0.00244394486836231),
('disappointing', 0.0024420242374239755),
('sometimes', 0.0024397798656203814),
('emotional', 0.002437342965428747),
('hearted', 0.002435117648234929),
('and', 0.002434775894362365),
('minor', 0.0024326013810353738),
('responsible', 0.0024319820376394546),
('here', 0.0024311048832102297),
('other', 0.0024299883603013535),
('normal', 0.0024298569825335708),
('willing', 0.0024261246443667258),
('much', 0.0024261229493739082),
('hot', 0.0024196044249443906),
('remarkable', 0.0024195606507870326),
('barely', 0.0024175002418381566),
('effective', 0.0024169124922279123),
('animated', 0.0024109840734820127),
('outstanding', 0.0024107809176646656),
('half', 0.0024084639659253332),
('subtle', 0.0024083972202186494),
('different', 0.002407773631130719),
('fascinating', 0.002403685035026676),
('least', 0.002403331373138169),
('lame', 0.002402847501293744),
('seriously', 0.002401916898466411),
('total', 0.002401279300347136),
('violent', 0.002400535983510144),
('far', 0.0024005213011034504),
('doesn', 0.002400464865939389),
('absolutely', 0.002400431236130183),
('unfortunately', 0.002399753462836699),
('usually', 0.002399454857408294),
('aware', 0.0023960326566863843),
('previously', 0.0023958414023801155),
('moral', 0.002395235049660093),
('ago', 0.002393629653696946),
('awful', 0.002393420826940957),
('surely', 0.002391676453026381),
('biggest', 0.002390865343509711),
('greater', 0.0023903319537811043),
('truly', 0.002386967231988023),
('successful', 0.0023869373994698283),
('real', 0.002385190944733403),
('powerful', 0.0023847176086883894),
('back', 0.0023806187028304715),
('short', 0.0023774499236466013),
('eventually', 0.0023698523692254974),
('quickly', 0.0023688699328822584),
('secret', 0.0023676820230258654),
('intelligent', 0.002366708277842242),
('robin', 0.0023651976511104983),
('dull', 0.00235964477869644),
('many', 0.0023590119795897794),
('michael', 0.0023589064413142794),
('off', 0.0023580212371917464),
('natural', 0.002356857167653292),
('personal', 0.0023566081734952646),
('more', 0.00235539519766598),
('high', 0.0023548555530815605),
('hardly', 0.0023485136786730067),
('superior', 0.002347312160007487),
('big', 0.0023463307195843493),
('alive', 0.002345887108786913),
('now', 0.0023368253650724344),
('otherwise', 0.0023316118219038357),
('enough', 0.002331558244168371),
('such', 0.0023313159079823265),
('fairly', 0.0023300790747863434),
('even', 0.0023295884605102823),
('capable', 0.002329072615940134),
('tight', 0.0023289164961650087),
('mad', 0.0023252643113845844),
('several', 0.002324876098041507),
('ahead', 0.00232478891174848),
('private', 0.002323938810211654),
('indeed', 0.0023233119218103422),
('no', 0.002321993798982203),
('somewhat', 0.0023197875411107224),
('serial', 0.002316222892701973),
('simple', 0.0023154877313890524),
('we', 0.0023152809960851227),
('immediately', 0.002314386462682709),
('old', 0.002313901722294),
('new', 0.0023136153967201574),
('occasional', 0.002313367300043446),
('honest', 0.002310746954588941),
('suddenly', 0.0023091385180307403),
('visual', 0.002304223240593406),
('main', 0.0023038137423872776),
('apparent', 0.002302475481884322),
('entertaining', 0.0023023063162522428),
('potential', 0.002300250289886611),
('soon', 0.0022988242749470353),
('silly', 0.002298284087838806),
('gary', 0.0022899832288248365),
('she', 0.0022875778111765914),
('don', 0.0022863504541293096),
('relatively', 0.002285570771709969),
('pathetic', 0.002280629178427796),
('bright', 0.002280352499605687),
('attractive', 0.002280010685796501),
('initially', 0.002279932158451899),
('up', 0.0022772831624123012),
('unexpected', 0.0022765307261015653),
('difficult', 0.0022753479475171212),
('hilarious', 0.0022747276808401406),
('surprisingly', 0.002269646495223617),
('average', 0.0022684700842012396),
('small', 0.002266828621142853),
('typical', 0.0022574522552556682),
('however', 0.00225704169246753),
('better', 0.0022567705132825926),
('what', 0.002254185440970014),
('brief', 0.002253217282469064),
('around', 0.0022473890284868177),
('guilty', 0.002245444106373558),
('impossible', 0.002245287470222215),
('obvious', 0.0022451600892185787),
('impressive', 0.0022442582461890724),
('perhaps', 0.002244145584435339),
('serious', 0.002243385622669684),
('decent', 0.0022328653519831255),
('certainly', 0.0022321207870243474),
('innocent', 0.00223192076106074),
('international', 0.0022311108082540463),
('ultimate', 0.002227728559313895),
('fair', 0.0022262788637288523),
('human', 0.002224433969022101),
('peter', 0.002223954022722312),
('less', 0.0022137925950437287),
('deep', 0.00221186169544016),
('happy', 0.0022112088452389167),
('possible', 0.0022095882316891602),
('english', 0.002206408349555691),
('rare', 0.0022058513645247073),
('tom', 0.0022030707241931058),
('common', 0.0022012160069994173),
('genuinely', 0.002198948512828223),
('basic', 0.0021977498874032183),
('standard', 0.0021941975079702502),
('actual', 0.002193307780211677),
('national', 0.0021906365425489404),
('star', 0.002187352805003062),
('numerous', 0.002184433922212774),
('sharp', 0.002180467631924875),
('true', 0.002179804625173467),
('lee', 0.002177677031010695),
('spectacular', 0.002177609291983217),
('worse', 0.0021744261458589323),
('quite', 0.0021718602677202976),
('lead', 0.0021657650352752524),
('favorite', 0.00216567820348113),
('finally', 0.002161142727699993),
('empty', 0.002158321847172612),
('strange', 0.0021572049950062994),
('clever', 0.0021557059229457523),
('few', 0.0021545920236595304),
('hard', 0.0021528800039478093),
('red', 0.002151662251742031),
('own', 0.002147796474615853),
('of', 0.0021453161208263016),
('frankly', 0.0021452378161904186),
('fast', 0.0021423510935287874),
('complete', 0.002138681263627506),
('particular', 0.002135804166824032),
('full', 0.002134401343947513),
('grand', 0.00213226141731942),
('dead', 0.0021316555748768394),
('fantastic', 0.0021304672573844597),
('social', 0.0021272845860293007),
('you', 0.002126860652727663),
('safe', 0.002125503559035235),
('virtually', 0.0021228974792839818),
('popular', 0.0021225651112164565),
('equally', 0.0021202583856818067),
('often', 0.0021192912537311985),
('latest', 0.002118743729347657),
('unique', 0.002118664223964338),
('political', 0.002116365997444799),
('unfortunate', 0.002115077048141771),
('early', 0.0021141573314535653),
('funniest', 0.0021140007773808992),
('ready', 0.0021139804915892495),
('accidentally', 0.0021139204676988393),
('sci', 0.0021135829268120196),
('recent', 0.0021111884479918692),
('meanwhile', 0.002110182756405934),
('sexual', 0.0021044354971384597),
('cinematic', 0.002104275709167208),
('fi', 0.0021041621438043457),
('along', 0.0021024956883385973),
('easy', 0.0021016454330641684),
('physical', 0.0021010909668950396),
('teen', 0.0020943036771299685),
('unusual', 0.002093759530768171),
('thin', 0.002092294837065608),
('rich', 0.002090055886096352),
('rather', 0.0020870186335897253),
('various', 0.0020820717486230663),
('modern', 0.0020804173582543413),
('supposedly', 0.002080402822714092),
('ultimately', 0.0020762703131997516),
('william', 0.002075711512092768),
('practically', 0.0020746764726317962),
('white', 0.0020720529883677505),
('simply', 0.0020681061456906506),
('overly', 0.00206257446442149),
('female', 0.0020538277325249763),
('previous', 0.002052852117674875),
('tiny', 0.002048222963157593),
('thoroughly', 0.002047877902676961),
('due', 0.0020475837898075817),
('can', 0.0020473404084645776),
('american', 0.0020471900845285005),
('successfully', 0.002045906484840632),
('running', 0.0020436877607170455),
('aside', 0.0020407291382383754),
('apparently', 0.0020399939153677798),
('worthy', 0.0020367988916105664),
('unbelievable', 0.002035088832233652),
('british', 0.002032234003322862),
('ridiculous', 0.0020301322616863367),
('fresh', 0.0020269257906189767),
('clearly', 0.0020242662191528346),
('young', 0.0020212947952003455),
('directly', 0.0020190399267392815),
('worst', 0.002018321777439585),
('constant', 0.0020156129690804677),
('either', 0.002014093341491322),
('dangerous', 0.0020136403323678005),
('former', 0.0020132344660249227),
('complex', 0.0020126673738765162),
('david', 0.0020110649362481653),
('interested', 0.0020088882476282043),
('predictable', 0.002008212956503268),
('exciting', 0.002007826204078083),
('to', 0.0020072741565426836),
('unable', 0.002005718101470375),
('graphic', 0.002005376865260159),
('comedic', 0.0020048017378319193),
('highly', 0.0019973415101339166),
('fine', 0.0019967923686048336),
('cheap', 0.001994060677197784),
('mysterious', 0.0019907877971670116),
('open', 0.00199015894799177),
('mostly', 0.0019884061604660734),
('latter', 0.0019880256460866478),
('live', 0.001986795686860045),
('rarely', 0.001984649462275682),
('ugly', 0.0019837783774960035),
('weak', 0.0019707188745212005),
('possibly', 0.0019706075424450655),
('thus', 0.001968906729120521),
('available', 0.001967797578191252),
('ex', 0.001966281279950839),
('sweet', 0.001964407940363985),
('two', 0.0019635062673976737),
('particularly', 0.0019613888258522964),
('be', 0.0019605016669826157),
('younger', 0.0019603510166971935),
('the', 0.001954997718634469),
('nowhere', 0.0019516959316690215),
('middle', 0.0019498844504859775),
('slow', 0.0019495347271947126),
('heavily', 0.0019485833263928904),
('offensive', 0.0019482222698296579),
('giant', 0.0019457706206236593),
('on', 0.0019406685251326917),
('poorly', 0.0019374121631600583),
('screen', 0.001929547443973368),
('enjoyable', 0.001923709317142371),
('talented', 0.0019224756187088095),
('dramatic', 0.001921834165809007),
('subject', 0.001920647737495943),
('time', 0.0019061790328508419),
('dark', 0.0019053830824877353),
('bizarre', 0.0018987347112403435),
('life', 0.0018962494533953567),
('ill', 0.0018936014090258438),
('unnecessary', 0.0018922404730965067),
('likely', 0.0018907534543162957),
('terrible', 0.0018903140224032968),
('film', 0.001887782153527676),
('critical', 0.001879942478999445),
('further', 0.0018762931089131496),
('loud', 0.0018747595872502252),
('billy', 0.001845104604459428),
('sudden', 0.0018410687770122414),
('dimensional', 0.0018398050325275027),
('romantic', 0.0018293731001022885),
('central', 0.0018247775451691005),
('essentially', 0.0018232053751663139),
('large', 0.001822843226009556),
('married', 0.0018222599660140423),
('detective', 0.0018207588729618088),
('surprising', 0.0018139300761870559),
('unfunny', 0.0018076086167803196),
('initial', 0.0018063052931670369),
('public', 0.0018019761571857404),
('one', 0.0018005131509604578),
('sole', 0.001795965620240488),
('double', 0.001788234721973658),
('flat', 0.0017738175378711758),
('down', 0.0017713157865333839),
('largely', 0.0017605355035379654),
('wide', 0.0017602308875251382),
('in', 0.0017575874535586198),
('somehow', 0.0017483060301847226),
('worth', 0.0017339160020791128),
('painfully', 0.0017203082812571528),
('odd', 0.0017070813783993787),
('limited', 0.0017021925664718698),
('extraordinary', 0.001701478305054484),
('frequently', 0.0017001988086572553),
('familiar', 0.001692156191506733),
('year', 0.001689555348394929),
('angry', 0.0016884671186636357),
('free', 0.001682755737495471),
('foreign', 0.0016816202029281),
('chris', 0.001681151841632327),
('naturally', 0.0016765910783645183),
('low', 0.0016725509847084342),
('crazy', 0.001661263296294774),
('genuine', 0.001660915690145844),
('recently', 0.0016572720876909868),
('appropriate', 0.0016503542247951867),
('fellow', 0.0016395777540513942),
('cute', 0.001631364584010221),
('steve', 0.0016290806706737333),
('military', 0.0016101588895060573),
('humorous', 0.001605772161822982),
('positive', 0.0016056163317255743),
('local', 0.0016000580609697656),
('laughable', 0.0015909884579457525),
('bottom', 0.0015875328878313534),
('self', 0.0015710444112117714),
('ten', 0.0015664543675963226),
('alone', 0.0015632633997377963),
('unlikely', 0.0015330664386263545),
('jean', 0.001500748385125891),
('narrative', 0.0014983497303010325),
('near', 0.0014588298150256776),
('united', 0.0014333462534527634),
('current', 0.0013395265610777166),
('oddly', 0.001328033675835792),
('heavy', 0.0012452460985996902)]
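To make the seed combination concrete, here is a minimal, self-contained sketch on made-up numbers. The `toy_collocations` dictionary and the `toy_seed_score` helper are hypothetical stand-ins for `getcollocations` and `seed_score`. They are not the notebook's data or code, only an illustration of summing per-seed scores and then comparing a positive seed set with a negative one.

```python
from collections import defaultdict

# Hypothetical per-seed association scores, standing in for the
# (word, score) pairs that getcollocations returns in the notebook.
toy_collocations = {
    'good':  {'fun': 0.25, 'nice': 0.5, 'dull': 0.125},
    'great': {'fun': 0.5, 'nice': 0.25},
    'bad':   {'dull': 0.5, 'fun': 0.125},
    'awful': {'dull': 0.25},
}

def toy_seed_score(seeds, collocations=toy_collocations):
    """Sum each word's association score over all seeds."""
    score = defaultdict(float)
    for seed in seeds:
        for word, value in collocations[seed].items():
            score[word] += value
    return score

pos = toy_seed_score(['good', 'great'])
neg = toy_seed_score(['bad', 'awful'])

# Polarity: combined positive score minus combined negative score.
vocabulary = set(pos) | set(neg)
polarity = {w: pos[w] - neg[w] for w in vocabulary}

# Print words from most negative to most positive.
for word, value in sorted(polarity.items(), key=lambda kv: kv[1]):
    print(word, value)
```

A word that is missing from a seed's list simply contributes 0 for that seed, so words associated with several seeds accumulate larger totals than words associated with only one.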
posscores=seed_score(['good','great','perfect','cool'])
negscores=seed_score(['bad','terrible','wrong',"crap","long","boring"])
## the sentiment polarity score of a word is its combined score against the positive seeds
## minus its combined score against the negative seeds
sentscores={}
for w in terms:
    sentscores[w] = posscores[w] - negscores[w]
sorted(sentscores.items(),key=itemgetter(1),reverse=False)
[('terrible', -0.011337935788715956),
('boring', -0.009940296206073694),
('wrong', -0.0038957762492763384),
('bad', -0.0028038772074515574),
('laughable', -0.0027406189540628715),
('unfunny', -0.0027135011838022864),
('worst', -0.0026471332624708405),
('frankly', -0.002587338946017292),
('terribly', -0.0025576562935382737),
('horrible', -0.0024121874515479237),
('awful', -0.0021579647379428284),
('ugly', -0.0020067310545418757),
('oddly', -0.0019930152819700904),
('exciting', -0.001958238856113128),
('running', -0.0019564731340303175),
('total', -0.0018477526013480836),
('painfully', -0.0018460471673599388),
('successfully', -0.0018190646800686494),
('ridiculous', -0.0018031715407181696),
('sadly', -0.0017829759978268338),
('bottom', -0.001769537540949412),
('we', -0.0017453041163577711),
('current', -0.001733764414992582),
('dull', -0.0016994388711972343),
('positive', -0.0016958052468327694),
('fair', -0.00168358034258908),
('ten', -0.0016615516042460942),
('poorly', -0.0016255490883917791),
('longer', -0.0016219407642753233),
('supposedly', -0.0016197365007933275),
('long', -0.0016152178963842242),
('foreign', -0.0015749275429113633),
('responsible', -0.001571347852518935),
('complete', -0.001569859685280305),
('pathetic', -0.001562411460165396),
('sole', -0.0014762838822532996),
('stupid', -0.001403251021265037),
('particular', -0.001402573126479277),
('low', -0.0013894492381992967),
('worse', -0.0013611681182488936),
('giant', -0.0013523241357618141),
('chinese', -0.001337497643663317),
('unbelievable', -0.0013303905386979195),
('unnecessary', -0.00131660406790726),
('doesn', -0.0013067653462435543),
('down', -0.0013009964808874883),
('weak', -0.001290743094020337),
('seriously', -0.001280361005356888),
('guilty', -0.0012759457766346595),
('huge', -0.0012214986373703203),
('worth', -0.0012163717863471562),
('silly', -0.0012130592923207716),
('double', -0.001210481211132458),
('that', -0.0012070521514938042),
('disappointing', -0.0011807670388760487),
('nowhere', -0.0011606225325787114),
('possible', -0.0011491794774951733),
('frequently', -0.0011469801859383087),
('desperately', -0.0011424453715553045),
('shallow', -0.0011384244435511679),
('predictable', -0.0011120428233618845),
('to', -0.0011058198988504373),
('completely', -0.0011032444587055746),
('offensive', -0.0010970960168989908),
('poor', -0.0010965921061633229),
('public', -0.0010840280896041703),
('gary', -0.001067755005125484),
('absolutely', -0.001026522206518719),
('graphic', -0.001024011269851582),
('overly', -0.0010201841232575807),
('thankfully', -0.0010149951549212268),
('angry', -0.0010110188791198813),
('no', -0.0010020614536842705),
('you', -0.0009949959520265542),
('lame', -0.000994378561571215),
('attractive', -0.0009846101524466954),
('re', -0.0009818428273642007),
('utterly', -0.0009700899144521528),
('middle', -0.0009700496756428923),
('they', -0.0009699456032204955),
('modern', -0.0009602179163986603),
('heavy', -0.000959430853419093),
('loud', -0.0009593880159746828),
('international', -0.000947889571331048),
('due', -0.0009449777026130446),
('flat', -0.000943498486118813),
('slow', -0.000936043860322225),
('superior', -0.0009253200744961609),
('surprising', -0.000919772111017412),
('ex', -0.0009109836158998105),
('equally', -0.0008988178254337453),
('subject', -0.0008972107221829695),
('military', -0.0008941009608490051),
('apart', -0.0008921972866551358),
('talented', -0.0008917010667096683),
('standard', -0.0008872939756282405),
('hardly', -0.0008776965462900465),
('practically', -0.0008750915480863412),
('obvious', -0.0008701220382583389),
('female', -0.0008680989032270711),
('cheap', -0.0008564492852698932),
('physical', -0.0008498830824418741),
('possibly', -0.0008469268319057162),
('aren', -0.0008397280021778201),
('ve', -0.0008365152484233855),
('steve', -0.0008355973078291424),
('accidentally', -0.0008353770315996422),
('basic', -0.0008346153861572892),
('alien', -0.0008185638092304041),
('essentially', -0.0008082539162401208),
('dumb', -0.0008028376104339493),
('aside', -0.0008012383951446411),
('unique', -0.0008003752649116298),
('sweet', -0.0007971894394690261),
('anyway', -0.0007960303504455316),
('largely', -0.0007877402329115519),
('peter', -0.0007858145554954835),
('up', -0.0007819843592687314),
('rich', -0.0007799646645649497),
('didn', -0.0007710518174823778),
('fi', -0.0007680731803592316),
('recently', -0.0007668801474510802),
('hard', -0.0007643704792226866),
('sci', -0.0007583426737492777),
('even', -0.0007573246173436503),
('unfortunately', -0.0007564513391728083),
('plain', -0.0007550527694653989),
('either', -0.0007528454883448344),
('of', -0.0007525540145690811),
('can', -0.0007441589482778972),
('chris', -0.0007405228629783673),
('entertaining', -0.0007355219005145538),
('dark', -0.0007340251639409206),
('potential', -0.000731803390525919),
('near', -0.0007309453448941252),
('totally', -0.0007257501605525398),
('sudden', -0.0007167056368809609),
('there', -0.0007132810375892829),
('better', -0.0007121005858083877),
('bizarre', -0.0007102499887883572),
('fascinating', -0.0007101600372832039),
('extremely', -0.000702831348441181),
('crazy', -0.0007001875489527892),
('odd', -0.0007000656180773442),
('half', -0.000697018395206481),
('apparently', -0.0006909176358485037),
('free', -0.0006857205953191604),
('appropriate', -0.0006853377884564126),
('complex', -0.000684395886488557),
('funny', -0.0006838122964675361),
('rather', -0.0006768704066807464),
('indeed', -0.0006748263747000886),
('safe', -0.0006704534362490105),
('easy', -0.0006698501733494017),
('least', -0.0006696310548561795),
('narrative', -0.0006693801173021394),
('brief', -0.0006678116323338111),
('now', -0.0006673157481890471),
('somehow', -0.0006641988757246794),
('twice', -0.0006641287075245849),
('too', -0.0006627836222457967),
('alone', -0.0006580188413029815),
('painful', -0.000657722958170764),
('otherwise', -0.0006568516671700813),
('wide', -0.0006545396855330964),
('dead', -0.000653123385060093),
('honest', -0.0006515573730144176),
('big', -0.0006496164697680088),
('lead', -0.0006494149269442007),
('central', -0.0006470915436464454),
('though', -0.0006442513977681823),
('so', -0.0006428789535567973),
('one', -0.0006418320196895418),
('interested', -0.0006401133976040372),
('time', -0.0006364896936115636),
('rarely', -0.0006363730750465154),
('merely', -0.0006308553338537993),
('tiny', -0.0006298494923142961),
('else', -0.0006277954394166021),
('critical', -0.0006250500330304016),
('be', -0.0006191685537481752),
('local', -0.0006158657943370705),
('major', -0.0006120423051833371),
('directly', -0.0006075876863173642),
('such', -0.0006049323327150537),
('various', -0.0006040198222932126),
('likely', -0.0005997093672459831),
('future', -0.000599449654797211),
('robin', -0.0005869254763942828),
('genuine', -0.0005823523828022148),
('only', -0.0005811959336262402),
('ultimate', -0.0005785256606300397),
('oh', -0.0005775859845771575),
('cute', -0.0005756849515812109),
('finally', -0.0005751148206935329),
('special', -0.0005682017753211362),
('former', -0.0005632603785982321),
('few', -0.0005574191632068473),
('aware', -0.0005555379716066988),
('much', -0.0005551280489618786),
('mostly', -0.0005543440959590073),
('latter', -0.000552853559619679),
('early', -0.0005520667304396258),
('available', -0.0005509522206528497),
('climactic', -0.0005506226915431576),
('already', -0.0005462021598010699),
('truly', -0.0005428115437271551),
('seemingly', -0.0005419132742930663),
('mary', -0.0005418292000601045),
('whole', -0.0005385437695464993),
('just', -0.0005370653451239643),
('along', -0.0005340060418226622),
('ll', -0.0005337104369835871),
('fellow', -0.0005223175816882971),
('familiar', -0.0005145307593608793),
('somewhere', -0.0005102507291033648),
('perhaps', -0.0005082367122989867),
('simple', -0.0005067682313010637),
('tough', -0.0005045886225163313),
('common', -0.000503857413382454),
('dramatic', -0.0005035860765712917),
('it', -0.000503461169935823),
('entirely', -0.0004971688457674106),
('main', -0.0004953088674902852),
('forever', -0.0004948695741177358),
('bright', -0.0004927768773291159),
('simply', -0.0004890112391900646),
('impressive', -0.0004887583209977186),
('large', -0.00048589159722396864),
('enough', -0.0004840620655300397),
('typical', -0.00048259380175424233),
('here', -0.00047975496205869637),
('bigger', -0.00047644760254505905),
('romantic', -0.0004672569278175954),
('therefore', -0.00046688494803137984),
('short', -0.00046492991647513904),
('average', -0.0004646315755049298),
('mainly', -0.00046421322270260136),
('deep', -0.00046330482043274637),
('ultimately', -0.00045854112467534234),
('decent', -0.00045700466155881945),
('unlikely', -0.00045562672156462466),
('cinematic', -0.00045532681631786295),
('straight', -0.00045515277058060053),
('incredibly', -0.00045403526773077135),
('far', -0.0004522929759844356),
('really', -0.0004501960190467589),
('single', -0.00044602936593327903),
('strange', -0.00044431205594842394),
('instead', -0.0004435105420098496),
('ever', -0.0004386290841801757),
('full', -0.0004378995395570035),
('self', -0.00043537616380091284),
('thus', -0.000435262083165897),
('ahead', -0.0004322469438230391),
('important', -0.00043116312351972746),
('back', -0.0004307387709850414),
('spectacular', -0.0004297835719436231),
('relatively', -0.0004260154655909695),
('the', -0.0004236841293395956),
('occasional', -0.0004234868072372209),
('maybe', -0.0004234443406997417),
('thin', -0.0004225904058378794),
('not', -0.0004213831255542376),
('dimensional', -0.0004203062524537585),
('happy', -0.00041551703086272753),
('away', -0.00041521498971391945),
('never', -0.00041427663947311974),
('certainly', -0.00041414089937138153),
('successful', -0.0004137343964682209),
('alive', -0.0004120879048515641),
('usual', -0.0004085566481352165),
('young', -0.00040793611859985153),
('easily', -0.0004066581498735587),
('don', -0.00040568204613010764),
('empty', -0.0004019144488789264),
('then', -0.00040178677514092313),
('apparent', -0.00039784352360013606),
('screen', -0.00039005452168201543),
('previous', -0.00038622574033567647),
('often', -0.0003804195026694524),
('different', -0.00037876868648600674),
('difficult', -0.0003764750831032368),
('old', -0.0003725818400590591),
('impossible', -0.0003710713261974004),
('on', -0.00036269638020895115),
('tom', -0.00036243731322769916),
('real', -0.0003623537714778427),
('however', -0.00036185401254073147),
('married', -0.0003600511230321961),
('other', -0.00035950949260183697),
('unable', -0.00035852239590076053),
('funniest', -0.00035829495399516183),
('ill', -0.0003565244942728442),
('many', -0.00035609966284931224),
('previously', -0.0003550544177524568),
('as', -0.00035229762209222117),
('quite', -0.00034903953306314765),
('desperate', -0.0003465632012755802),
('film', -0.0003457790164877331),
('violent', -0.000339928133520359),
('english', -0.00033205306151005455),
('small', -0.00032812818234550893),
('worthy', -0.00032729125644853033),
('fast', -0.0003202755225513534),
('likable', -0.00031919523251016345),
('barely', -0.0003171112852659013),
('naturally', -0.00031484551933644864),
('constant', -0.0003128648272295361),
('top', -0.0003118794791431177),
('jean', -0.00031142759612205057),
('eventually', -0.00030643819158205103),
('human', -0.0003004701642169835),
('evil', -0.00029877898921396723),
('believable', -0.0002976062581400152),
('white', -0.00029256051223548133),
('serious', -0.0002915094823329181),
('wild', -0.0002904852948483836),
('he', -0.0002895451984794026),
('year', -0.0002894236019519831),
('quickly', -0.00028486728778277575),
('david', -0.0002841836580306811),
('two', -0.0002803594837186671),
('hilarious', -0.0002792293408196405),
('billy', -0.0002784563033864521),
('fairly', -0.00027817137110490225),
('clear', -0.0002757689850125377),
('more', -0.00027483851236456084),
('little', -0.0002738519948942077),
('united', -0.0002712073575661905),
('social', -0.0002692116370561332),
('badly', -0.0002656577126927363),
('first', -0.0002656045421871446),
('soon', -0.00026406576498318787),
('limited', -0.00026077833129929196),
('pretty', -0.0002556044683908313),
('recent', -0.00025431639723114043),
('in', -0.0002529341542810658),
('comedic', -0.00024923527505396137),
('nearly', -0.00024558280477428654),
('entire', -0.00024552533171258266),
('personal', -0.0002429041811083158),
('genuinely', -0.00023638211781148218),
('ready', -0.00023297057991365734),
('interesting', -0.0002319057727939739),
('new', -0.0002288779420982481),
('basically', -0.00022794420212937563),
('certain', -0.00022774660798796816),
('exactly', -0.00022187133498753273),
('around', -0.0002212138544179596),
('national', -0.00021892498564345803),
('next', -0.0002158841794732266),
('less', -0.0002053370589299667),
('popular', -0.00020410840527040636),
('out', -0.00020327091044571987),
('immediately', -0.00020113758416102478),
('heavily', -0.0001978586638053557),
('virtually', -0.00019741273624147545),
('like', -0.00019201990770032797),
('effectively', -0.00018865752036787429),
('initial', -0.00018847850036724087),
('right', -0.00018667331121528363),
('again', -0.00018442600209907407),
('able', -0.00018271056934554336),
('obviously', -0.0001757623199258253),
('meanwhile', -0.0001700542754715078),
('favorite', -0.00016790618953759634),
('ago', -0.00016776635504992367),
('well', -0.00016722410208565514),
('further', -0.00016372379001952783),
('most', -0.00015933307977614858),
('close', -0.00015749033148161314),
('fantastic', -0.00015367714248264988),
('fine', -0.00014661878484316512),
('famous', -0.00014606101687336627),
('about', -0.00014439417455700205),
('own', -0.00014057528409486264),
('star', -0.00014039861913854286),
('suddenly', -0.00013654192505882642),
('tight', -0.00013514609804481277),
('younger', -0.0001342270569405857),
('french', -0.00013271869553648733),
('several', -0.00013224135315127988),
('over', -0.00013178399190350585),
('subtle', -0.0001304908962331229),
('extraordinary', -0.00013034796186768647),
('third', -0.00012689187862063335),
('teen', -0.00012547607896861357),
('open', -0.0001251771370265795),
('good', -0.00012464300361279563),
('mad', -0.00012423836030612785),
('what', -0.00011744539314099403),
('sure', -0.00011140597894659064),
('actual', -0.00010631033966002502),
('mental', -0.00010554178290598766),
('intense', -0.00010320983741859144),
('isn', -0.0001019431777130762),
('quick', -9.144819523439858e-05),
('cold', -8.453928225732842e-05),
('thoroughly', -8.446311268023136e-05),
('high', -8.108657336266872e-05),
('private', -8.082521935904967e-05),
('mysterious', -7.89915855672666e-05),
('psychological', -7.079354944579917e-05),
('last', -6.040588450787962e-05),
('emotional', -5.9231389028841126e-05),
('dangerous', -5.840343082568443e-05),
('numerous', -5.6507297730663385e-05),
('red', -5.518097392980415e-05),
('probably', -5.510557060454035e-05),
('enjoyable', -5.304302448897163e-05),
('almost', -5.268059132795041e-05),
('necessarily', -5.181284608931757e-05),
('live', -5.161664924643525e-05),
('incredible', -5.0838444616687316e-05),
('yet', -4.993545030473239e-05),
('actually', -4.970290822225044e-05),
('technical', -4.964327727627381e-05),
('once', -4.108719790097684e-05),
('very', -3.4591909186824834e-05),
('clearly', -3.0778856868112024e-05),
('occasionally', -2.9330789441932553e-05),
('life', -2.7919699401356872e-05),
('key', -2.5608117934161883e-05),
('rare', -2.1669136957346707e-05),
('later', -1.826701189417898e-05),
('off', -1.7711394330840666e-05),
('powerful', -1.5583342484341237e-05),
('together', -9.288051209772625e-06),
('true', -4.87041375588906e-06),
('similar', -3.89795186164733e-06),
('michael', -2.5784077662730463e-06),
('robert', -2.421454439686249e-07),
('regular', 9.074791722952363e-07),
('nice', 5.07184371876428e-06),
('particularly', 1.608419229506765e-05),
('terrific', 1.7873781404614906e-05),
('smart', 1.830028423949106e-05),
('visual', 1.9897264289668766e-05),
('american', 2.1072590232642293e-05),
('general', 2.142930413662073e-05),
('realistic', 3.266246171006355e-05),
('intelligent', 3.9886651389091365e-05),
('mean', 5.054165961934035e-05),
('humorous', 5.432574384601043e-05),
('usually', 5.541394191310462e-05),
('innocent', 6.0470827055490034e-05),
('capable', 6.280583117186135e-05),
('originally', 6.378534225396108e-05),
('sexual', 6.454153211762643e-05),
('grand', 6.640245105755038e-05),
('she', 7.44441831502362e-05),
('also', 7.641971683174965e-05),
('same', 7.726204521844339e-05),
('surprisingly', 7.814986440048783e-05),
('strong', 7.969296011480593e-05),
('still', 8.722930070878283e-05),
('willing', 8.872637216204328e-05),
('necessary', 8.891509448837737e-05),
('inevitable', 8.992341325145042e-05),
('deadly', 9.701581319515075e-05),
('literally', 9.852399016292528e-05),
('fresh', 0.0001081531161383144),
('witty', 0.00011555628335896363),
('movie', 0.00011610771682999971),
('before', 0.0001294858280637642),
('biggest', 0.00013372314741634094),
('secret', 0.00014397664031058025),
('late', 0.00015327727974057386),
('tim', 0.00015454685279179918),
('original', 0.000166646316708497),
('greater', 0.00016828610277696545),
('slightly', 0.00016914106338807022),
('soft', 0.00016963763368304206),
('lee', 0.00018716734589623767),
('beautiful', 0.00018782224248475215),
('sympathetic', 0.00018964480188837264),
('nevertheless', 0.00019220598277431417),
('slowly', 0.00019659574810863443),
('non', 0.0001988968383649769),
('sometimes', 0.00019992977542792213),
('british', 0.00020169386175819472),
('clever', 0.00020410422641186548),
('blue', 0.00021376094430168493),
('and', 0.00021441861464148188),
('nasty', 0.00021891481020010596),
('minor', 0.00021980135793128047),
('william', 0.0002221413294301671),
('normal', 0.0002222509905768803),
('especially', 0.00024404651952017204),
('moral', 0.00024919162284080124),
('always', 0.0002532388991124115),
('political', 0.0002623995973431354),
('weird', 0.0002690142191958027),
('latest', 0.0002748801528105005),
('highly', 0.0002803689422022479),
('wonderful', 0.0002952821816267323),
('serial', 0.0002977511634248602),
('initially', 0.0003026285852441848),
('memorable', 0.0003047299564975484),
('unfortunate', 0.00030614891738813574),
('final', 0.0003100101631852959),
('fun', 0.0003142947725700519),
('older', 0.0003187683928988676),
('brilliant', 0.0003215879003835055),
('co', 0.00032396165017799846),
('natural', 0.0003269851375293292),
('comic', 0.0003289997380387633),
('hot', 0.00033107428643060514),
('intriguing', 0.0003346515095666398),
('danny', 0.0003500028138064999),
('black', 0.00035883217167973543),
('unusual', 0.0003729001202466308),
('overall', 0.00038221150638625533),
('forward', 0.0003894123135382067),
('effective', 0.00039077394646117947),
('surely', 0.0003991479944951801),
('visually', 0.00039954294262994645),
('second', 0.0004001165402682627),
('extra', 0.00040435182000170397),
('fake', 0.0004087926346521024),
('lucky', 0.000414792487349712),
('detective', 0.0004258121327198735),
('scary', 0.00044858555977865663),
('all', 0.00046052119261720406),
('light', 0.0004666619849259899),
('unexpected', 0.0004755937806912272),
('present', 0.0004883839287330521),
('remarkable', 0.0004909132424787039),
('pure', 0.0004930297335705856),
('animated', 0.0005004334032993982),
('constantly', 0.0005005278442861616),
('solid', 0.0005104773753039919),
('sean', 0.0005354051081864117),
('fully', 0.000538860116644163),
('green', 0.0005626488560537576),
('sad', 0.0005690037657206867),
('classic', 0.0006113368528880467),
('best', 0.0006232564824836927),
('computer', 0.0006261920366367954),
('somewhat', 0.0006418769897289041),
('creative', 0.0006670668130810274),
('excellent', 0.0006877090591556421),
('steven', 0.000700126493842716),
('wonderfully', 0.0007011900004185701),
('tony', 0.000713347461311362),
('sharp', 0.000714782515078788),
('anti', 0.0007395501537729543),
('definitely', 0.0007444490420821688),
('musical', 0.0007462502942548184),
('friendly', 0.000761119245625841),
('professional', 0.0007843680588404591),
('perfectly', 0.0008016637337694504),
('emotionally', 0.0008558689372788594),
('john', 0.0008935668249893006),
('past', 0.0009797392305912716),
('nicely', 0.000980917414583356),
('outstanding', 0.0009961964716528754),
('hearted', 0.0010123986481938667),
('traditional', 0.0010483779957726606),
('generally', 0.0010503112142917228),
('amazing', 0.0010694376494112855),
('suspenseful', 0.001114814515979623),
('man', 0.0011264816790070354),
('earlier', 0.0011453103516269971),
('known', 0.0011615678408455577),
('lovely', 0.0013584838796366254),
('quiet', 0.0013999389580393261),
('convincing', 0.001406054680451586),
('looking', 0.001440831372975097),
('great', 0.001442680743868476),
('eccentric', 0.0015794026794116564),
('frank', 0.0016222794801935836),
('greatest', 0.0016856586307944627),
('perfect', 0.004148090426618199),
('cool', 0.009074307660516472)]
Now let's apply this methodology to a real (and important!) scenario where we don't have any sentiment labels: the Kardashians.
## Loading the Kardashian data
with open("kardashian-transcripts.json", "rb") as f:
    transcripts = json.load(f)
msgs = [m['text'].lower() for transcript in transcripts
        for m in transcript]
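## POS-tag every message (slow); the filters below need msgs_pos_tagged
msgs_pos_tagged = [pos_tag(tokenizer.tokenize(m)) for m in msgs]
## keep only adjectives and adverbs
## (note: "RBJ" is not a Penn Treebank tag; "RBR", the comparative adverb, was probably intended)
adj_adv_tags = ["JJ","RB","RBS","RBJ","JJR","JJS"]
msgs_adj_adv_only_tokenized = [[w for w,tag in m if tag in adj_adv_tags]
                               for m in msgs_pos_tagged]
msgs_adj_adv_only = [" ".join(toks) for toks in msgs_adj_adv_only_tokenized]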
msgs[23]
'and then if you could take out the trash, and then if you go to dash, maybe tomorrow or whatever, later today and just...'
msgs_adj_adv_only[23]
'then then maybe later just'
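To make the filtering step concrete, here is a minimal sketch on a hand-tagged sentence. The `(word, tag)` pairs are typed in by hand as an assumption (they are not real `pos_tag` output), and the names `toy_tagged`, `ADJ_ADV_TAGS` and `keep_adj_adv` are hypothetical. The tag set uses the standard Penn Treebank adjective and adverb tags.

```python
# Hand-written (word, tag) pairs standing in for pos_tag output.
# The tags are illustrative assumptions, not the result of running a tagger.
toy_tagged = [("and", "CC"), ("then", "RB"), ("maybe", "RB"), ("take", "VB"),
              ("out", "RP"), ("the", "DT"), ("trash", "NN"), ("later", "RB")]

# Penn Treebank tags for adjectives (JJ*) and adverbs (RB*).
ADJ_ADV_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}

def keep_adj_adv(tagged):
    """Return only the words whose tag marks an adjective or adverb."""
    return [w for w, tag in tagged if tag in ADJ_ADV_TAGS]

toy_filtered = keep_adj_adv(toy_tagged)
print(" ".join(toy_filtered))
```

Everything except the adverbs is dropped, which is why a message such as `msgs[23]` shrinks to a handful of words.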
vec = CountVectorizer(min_df = 10)
X = vec.fit_transform(msgs_adj_adv_only)
terms_kard = vec.get_feature_names()
len(terms_kard)
358
pmi_matrix_kard=getcollocations_matrix(X)
getcollocations("good",pmi_matrix_kard,terms_kard)
[('good', 0.0013962375073486185),
('positive', 0.0005952380952380953),
('horrible', 0.0003968253968253968),
('awful', 0.00030525030525030525),
('nude', 0.0003006253006253006),
('proud', 0.00024366471734892786),
('extremely', 0.0002204585537918871),
('willing', 0.00018896447467876037),
('pretty', 0.0001670843776106934),
('bruce', 0.00016534391534391533),
('strong', 0.00015873015873015873),
('such', 0.00013598378084359391),
('anywhere', 0.00013227513227513228),
('dramatic', 0.00012025012025012025),
('everything', 0.00012025012025012025),
('honest', 0.00012025012025012025),
('online', 0.00012025012025012025),
('though', 0.00011671335200746966),
('adrienne', 0.00011337868480725624),
('half', 0.00011022927689594356),
('kimberly', 0.00011022927689594356),
('wish', 0.00010175010175010176),
('very', 9.831259831259832e-05),
('that', 9.101499927188001e-05),
('really', 8.846426043878273e-05),
('he', 8.818342151675484e-05),
('smart', 8.267195767195767e-05),
('all', 7.78089013383131e-05),
('fun', 7.78089013383131e-05),
('rob', 7.78089013383131e-05),
('instead', 7.348618459729571e-05),
('super', 7.348618459729571e-05),
('too', 7.297938332421091e-05),
('black', 7.215007215007215e-05),
('like', 6.961849067112225e-05),
('big', 6.764069264069264e-05),
('actually', 6.705191036988273e-05),
('um', 6.421122925977295e-05),
('sure', 6.081615277017576e-05),
('before', 6.012506012506013e-05),
('hard', 5.922767116796967e-05),
('it', 5.860290670417253e-05),
('about', 5.7510927076144466e-05),
('busy', 5.7510927076144466e-05),
('sometimes', 5.7510927076144466e-05),
('clean', 5.628729032984352e-05),
('real', 5.166997354497354e-05),
('always', 5.069778588942352e-05),
('close', 4.99151442547669e-05),
('they', 4.99151442547669e-05),
('ve', 4.9680800854509775e-05),
('great', 4.95412480431207e-05),
('also', 4.8393341076267905e-05),
('not', 4.468754468754469e-05),
('back', 4.4276194903809964e-05),
('uh', 4.409171075837742e-05),
('we', 4.166145898429363e-05),
('she', 4.101554489151388e-05),
('maybe', 3.968253968253968e-05),
('single', 3.9485114111979786e-05),
('own', 3.834061805076298e-05),
('definitely', 3.7578162578162574e-05),
('still', 3.6743092298647854e-05),
('you', 3.651767455448437e-05),
('healthy', 3.575003575003575e-05),
('armenian', 3.480924533556112e-05),
('so', 3.4052186737495216e-05),
('pregnant', 3.348737525952716e-05),
('best', 3.094155140938767e-05),
('hot', 3.0761658668635414e-05),
('ready', 3.0292015024839453e-05),
('nervous', 3.0062530062530064e-05),
('different', 2.972474882587242e-05),
('least', 2.9394473838918284e-05),
('whole', 2.8446265005404792e-05),
('as', 2.7994736989445982e-05),
('again', 2.775002775002775e-05),
('gorgeous', 2.755731922398589e-05),
('re', 2.6577692729161045e-05),
('absolutely', 2.5936300446104367e-05),
('far', 2.5936300446104367e-05),
('well', 2.5221953188054885e-05),
('let', 2.519526329050139e-05),
('only', 2.4495394865765236e-05),
('just', 2.3795526441029087e-05),
('don', 2.2937884209560507e-05),
('cool', 2.2806057288815908e-05),
('probably', 2.2806057288815908e-05),
('then', 2.2675736961451248e-05),
('now', 2.2102528529263747e-05),
('few', 2.204585537918871e-05),
('right', 2.140374308659098e-05),
('old', 2.133469875405359e-05),
('happy', 2.077619878666999e-05),
('ll', 2.058756922570152e-05),
('here', 2.047975636318077e-05),
('long', 2.035002035002035e-05),
('perfect', 2.035002035002035e-05),
('next', 2.0194676683226302e-05),
('never', 2.010260368922983e-05),
('together', 1.9742557055989893e-05),
('bad', 1.917030902538149e-05),
('better', 1.917030902538149e-05),
('comfortable', 1.8630300320441167e-05),
('beautiful', 1.8119881133579763e-05),
('anymore', 1.740462266778056e-05),
('obviously', 1.6958350291683625e-05),
('last', 1.643169345032699e-05),
('gonna', 1.6380821334381707e-05),
('honestly', 1.5936762924714734e-05),
('first', 1.5873015873015872e-05),
('enough', 1.486237441293621e-05),
('already', 1.3778659611992945e-05),
('ever', 1.3096547750013096e-05),
('new', 1.2270420433685741e-05),
('crazy', 1.1403028644407954e-05),
('up', 9.44822373393802e-06),
('wrong', 9.315150160220583e-06),
('else', 8.877525656049147e-06),
('there', 8.332291796858726e-06),
('little', 6.7401341286691605e-06),
('more', 5.249013185521122e-06),
('even', 3.3572368597749307e-06),
('much', 2.9071457642886215e-06),
('able', 0.0),
('acceptable', 0.0),
('accurate', 0.0),
('active', 0.0),
('afraid', 0.0),
('ago', 0.0),
('ahead', 0.0),
('alcoholic', 0.0),
('almost', 0.0),
('alone', 0.0),
('along', 0.0),
('amazing', 0.0),
('american', 0.0),
('anal', 0.0),
('angry', 0.0),
('annoying', 0.0),
('anxious', 0.0),
('anyway', 0.0),
('apart', 0.0),
('apparently', 0.0),
('around', 0.0),
('atm', 0.0),
('away', 0.0),
('awesome', 0.0),
('awkward', 0.0),
('barely', 0.0),
('basic', 0.0),
('basically', 0.0),
('belly', 0.0),
('bible', 0.0),
('bigger', 0.0),
('biggest', 0.0),
('boring', 0.0),
('bright', 0.0),
('bunim', 0.0),
('can', 0.0),
('certain', 0.0),
('certainly', 0.0),
('clear', 0.0),
('clearly', 0.0),
('cold', 0.0),
('common', 0.0),
('complete', 0.0),
('completely', 0.0),
('constantly', 0.0),
('couple', 0.0),
('cute', 0.0),
('dead', 0.0),
('deep', 0.0),
('delicious', 0.0),
('diaper', 0.0),
('didn', 0.0),
('difficult', 0.0),
('disappointed', 0.0),
('doesn', 0.0),
('double', 0.0),
('down', 0.0),
('dry', 0.0),
('dumb', 0.0),
('early', 0.0),
('easier', 0.0),
('easy', 0.0),
('embarrassing', 0.0),
('emotional', 0.0),
('entire', 0.0),
('eric', 0.0),
('especially', 0.0),
('everyone', 0.0),
('everywhere', 0.0),
('exactly', 0.0),
('excited', 0.0),
('exciting', 0.0),
('extra', 0.0),
('fabulous', 0.0),
('fair', 0.0),
('family', 0.0),
('fast', 0.0),
('fat', 0.0),
('favorite', 0.0),
('female', 0.0),
('finally', 0.0),
('fine', 0.0),
('forever', 0.0),
('forward', 0.0),
('free', 0.0),
('fresh', 0.0),
('full', 0.0),
('funny', 0.0),
('fur', 0.0),
('girlfriend', 0.0),
('glad', 0.0),
('god', 0.0),
('gray', 0.0),
('green', 0.0),
('gross', 0.0),
('grown', 0.0),
('guilty', 0.0),
('guys', 0.0),
('high', 0.0),
('hopefully', 0.0),
('huge', 0.0),
('huh', 0.0),
('hundred', 0.0),
('hungry', 0.0),
('immediately', 0.0),
('important', 0.0),
('incredible', 0.0),
('inside', 0.0),
('interested', 0.0),
('isn', 0.0),
('jealous', 0.0),
('kardashian', 0.0),
('kelly', 0.0),
('khloe', 0.0),
('kim', 0.0),
('kourtney', 0.0),
('kris', 0.0),
('laker', 0.0),
('lamar', 0.0),
('late', 0.0),
('lately', 0.0),
('later', 0.0),
('less', 0.0),
('lily', 0.0),
('literally', 0.0),
('live', 0.0),
('love', 0.0),
('low', 0.0),
('luxurious', 0.0),
('mad', 0.0),
('major', 0.0),
('male', 0.0),
('many', 0.0),
('married', 0.0),
('mean', 0.0),
('miserable', 0.0),
('miss', 0.0),
('moral', 0.0),
('most', 0.0),
('murray', 0.0),
('naked', 0.0),
('natural', 0.0),
('necessary', 0.0),
('nice', 0.0),
('normal', 0.0),
('normally', 0.0),
('off', 0.0),
('often', 0.0),
('oh', 0.0),
('okay', 0.0),
('older', 0.0),
('once', 0.0),
('open', 0.0),
('other', 0.0),
('out', 0.0),
('outside', 0.0),
('past', 0.0),
('people', 0.0),
('personal', 0.0),
('poor', 0.0),
('possible', 0.0),
('possibly', 0.0),
('private', 0.0),
('professional', 0.0),
('public', 0.0),
('quiet', 0.0),
('rather', 0.0),
('red', 0.0),
('regular', 0.0),
('rich', 0.0),
('rid', 0.0),
('ridiculous', 0.0),
('rude', 0.0),
('sad', 0.0),
('safe', 0.0),
('same', 0.0),
('san', 0.0),
('scary', 0.0),
('scott', 0.0),
('second', 0.0),
('secret', 0.0),
('selfish', 0.0),
('sensitive', 0.0),
('serious', 0.0),
('seriously', 0.0),
('sexual', 0.0),
('sexy', 0.0),
('short', 0.0),
('sick', 0.0),
('sister', 0.0),
('small', 0.0),
('somewhere', 0.0),
('soon', 0.0),
('sorry', 0.0),
('special', 0.0),
('straight', 0.0),
('stupid', 0.0),
('sudden', 0.0),
('supportive', 0.0),
('sweet', 0.0),
('tall', 0.0),
('ten', 0.0),
('thebouncedryer', 0.0),
('top', 0.0),
('total', 0.0),
('totally', 0.0),
('touch', 0.0),
('tough', 0.0),
('true', 0.0),
('truly', 0.0),
('truthful', 0.0),
('tryclearblue', 0.0),
('twice', 0.0),
('ugly', 0.0),
('uncomfortable', 0.0),
('upset', 0.0),
('usually', 0.0),
('wear', 0.0),
('weird', 0.0),
('welcome', 0.0),
('what', 0.0),
('white', 0.0),
('who', 0.0),
('won', 0.0),
('wonderful', 0.0),
('worried', 0.0),
('worse', 0.0),
('worst', 0.0),
('yeah', 0.0),
('year', 0.0),
('yes', 0.0),
('yet', 0.0),
('young', 0.0),
('younger', 0.0)]
posscores=seed_score(['good',"great"],pmi_matrix_kard,terms_kard)
negscores=seed_score(['bad'],pmi_matrix_kard,terms_kard)
## a word's sentiment polarity score is the difference between its association with the positive seeds
## and its association with the negative seeds
sentscores={}
for w in terms_kard:
    sentscores[w]=posscores[w]-negscores[w]
neglexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[:10]
poslexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[-10:]
sorted(sentscores.items(),key=itemgetter(1),reverse=False)
[('bad', -0.004933680845053443),
('horrible', -0.0005693581780538303),
('san', -0.00040257648953301127),
('worried', -0.0003716090672612412),
('worst', -0.00023004370830457787),
('high', -0.00022469385462307607),
('rich', -0.00021003990758244065),
('ready', -0.00019097139906964003),
('normal', -0.00016950589032968896),
('busy', -0.0001525289805062962),
('can', -0.00014639145073927682),
('entire', -0.0001271294177472667),
('sorry', -0.00012326077909859053),
('able', -0.00011670194865114262),
('enough', -9.369757782068481e-05),
('kourtney', -8.350765556432388e-05),
('seriously', -7.791803023219573e-05),
('again', -7.359789968485622e-05),
('probably', -6.0485630200772624e-05),
('around', -5.891363261458702e-05),
('long', -5.397179310222788e-05),
('still', -5.271834981979909e-05),
('other', -5.242651351769991e-05),
('now', -4.557302009966452e-05),
('obviously', -4.4976494251856576e-05),
('too', -4.3628979161213044e-05),
('away', -4.052409175844072e-05),
('fast', -3.877141151200752e-05),
('especially', -3.5593426961842966e-05),
('little', -3.0184078924040153e-05),
('never', -2.7248013774370832e-05),
('right', -2.11946200471519e-05),
('not', -1.594014173398639e-05),
('more', -1.3921295839860365e-05),
('hard', -1.2875580688689063e-05),
('ever', -1.0818887271749949e-05),
('just', -8.23636069950847e-06),
('nice', -7.982349428942723e-06),
('together', -4.291860229563018e-06),
('much', -4.250653284081997e-06),
('acceptable', 0.0),
('accurate', 0.0),
('active', 0.0),
('afraid', 0.0),
('ago', 0.0),
('ahead', 0.0),
('alcoholic', 0.0),
('almost', 0.0),
('alone', 0.0),
('american', 0.0),
('anal', 0.0),
('angry', 0.0),
('annoying', 0.0),
('anxious', 0.0),
('anyway', 0.0),
('apart', 0.0),
('apparently', 0.0),
('atm', 0.0),
('awesome', 0.0),
('awkward', 0.0),
('barely', 0.0),
('basic', 0.0),
('basically', 0.0),
('belly', 0.0),
('bible', 0.0),
('bigger', 0.0),
('biggest', 0.0),
('boring', 0.0),
('bright', 0.0),
('bunim', 0.0),
('certain', 0.0),
('clear', 0.0),
('clearly', 0.0),
('cold', 0.0),
('common', 0.0),
('complete', 0.0),
('completely', 0.0),
('constantly', 0.0),
('cute', 0.0),
('dead', 0.0),
('deep', 0.0),
('delicious', 0.0),
('diaper', 0.0),
('didn', 0.0),
('difficult', 0.0),
('disappointed', 0.0),
('doesn', 0.0),
('double', 0.0),
('down', 0.0),
('dry', 0.0),
('dumb', 0.0),
('early', 0.0),
('easier', 0.0),
('embarrassing', 0.0),
('emotional', 0.0),
('everyone', 0.0),
('everywhere', 0.0),
('exactly', 0.0),
('excited', 0.0),
('exciting', 0.0),
('extra', 0.0),
('fabulous', 0.0),
('fair', 0.0),
('family', 0.0),
('fat', 0.0),
('favorite', 0.0),
('finally', 0.0),
('fine', 0.0),
('forever', 0.0),
('forward', 0.0),
('free', 0.0),
('full', 0.0),
('funny', 0.0),
('fur', 0.0),
('girlfriend', 0.0),
('glad', 0.0),
('god', 0.0),
('gray', 0.0),
('green', 0.0),
('gross', 0.0),
('grown', 0.0),
('guilty', 0.0),
('guys', 0.0),
('huge', 0.0),
('huh', 0.0),
('hundred', 0.0),
('hungry', 0.0),
('immediately', 0.0),
('important', 0.0),
('incredible', 0.0),
('inside', 0.0),
('isn', 0.0),
('kardashian', 0.0),
('kelly', 0.0),
('khloe', 0.0),
('kim', 0.0),
('kris', 0.0),
('laker', 0.0),
('lamar', 0.0),
('late', 0.0),
('later', 0.0),
('lily', 0.0),
('literally', 0.0),
('live', 0.0),
('low', 0.0),
('luxurious', 0.0),
('mad', 0.0),
('major', 0.0),
('male', 0.0),
('mean', 0.0),
('miserable', 0.0),
('miss', 0.0),
('moral', 0.0),
('most', 0.0),
('murray', 0.0),
('natural', 0.0),
('necessary', 0.0),
('normally', 0.0),
('off', 0.0),
('often', 0.0),
('oh', 0.0),
('older', 0.0),
('open', 0.0),
('outside', 0.0),
('past', 0.0),
('people', 0.0),
('personal', 0.0),
('poor', 0.0),
('possible', 0.0),
('possibly', 0.0),
('private', 0.0),
('professional', 0.0),
('public', 0.0),
('quiet', 0.0),
('rather', 0.0),
('red', 0.0),
('regular', 0.0),
('rid', 0.0),
('ridiculous', 0.0),
('rude', 0.0),
('sad', 0.0),
('safe', 0.0),
('scary', 0.0),
('scott', 0.0),
('second', 0.0),
('secret', 0.0),
('selfish', 0.0),
('sensitive', 0.0),
('serious', 0.0),
('sexual', 0.0),
('sexy', 0.0),
('short', 0.0),
('sick', 0.0),
('sister', 0.0),
('somewhere', 0.0),
('soon', 0.0),
('special', 0.0),
('straight', 0.0),
('stupid', 0.0),
('supportive', 0.0),
('sweet', 0.0),
('tall', 0.0),
('ten', 0.0),
('thebouncedryer', 0.0),
('top', 0.0),
('total', 0.0),
('touch', 0.0),
('tough', 0.0),
('true', 0.0),
('truly', 0.0),
('truthful', 0.0),
('tryclearblue', 0.0),
('twice', 0.0),
('ugly', 0.0),
('uncomfortable', 0.0),
('upset', 0.0),
('usually', 0.0),
('wear', 0.0),
('weird', 0.0),
('welcome', 0.0),
('what', 0.0),
('white', 0.0),
('who', 0.0),
('won', 0.0),
('wonderful', 0.0),
('worse', 0.0),
('yeah', 0.0),
('year', 0.0),
('yes', 0.0),
('yet', 0.0),
('young', 0.0),
('younger', 0.0),
('as', 2.4343249556039977e-06),
('only', 2.582367470460253e-06),
('so', 4.5872906918408804e-06),
('really', 5.177341671014243e-06),
('there', 6.622686249872567e-06),
('even', 7.352463528271139e-06),
('wrong', 9.315150160220583e-06),
('last', 9.688839274325689e-06),
('crazy', 1.1403028644407954e-05),
('already', 1.3778659611992945e-05),
('honestly', 1.5936762924714734e-05),
('anymore', 1.740462266778056e-05),
('beautiful', 1.8119881133579763e-05),
('comfortable', 1.8630300320441167e-05),
('it', 1.9896932173619897e-05),
('next', 2.0194676683226302e-05),
('perfect', 2.035002035002035e-05),
('then', 2.096979485492291e-05),
('old', 2.133469875405359e-05),
('few', 2.204585537918871e-05),
('whole', 2.2609708433704742e-05),
('cool', 2.2806057288815908e-05),
('don', 2.2937884209560507e-05),
('far', 2.5936300446104367e-05),
('first', 2.6511891191910728e-05),
('gorgeous', 2.755731922398589e-05),
('same', 2.8810141169691734e-05),
('least', 2.9394473838918284e-05),
('nervous', 3.0062530062530064e-05),
('here', 3.1602702288756697e-05),
('armenian', 3.480924533556112e-05),
('healthy', 3.575003575003575e-05),
('up', 3.6200497677223196e-05),
('re', 3.647728651862571e-05),
('actually', 3.97868532420839e-05),
('happy', 4.038519539431358e-05),
('uh', 4.409171075837742e-05),
('ll', 4.5509892890229304e-05),
('new', 4.701363334704312e-05),
('easy', 4.801690194948622e-05),
('close', 4.99151442547669e-05),
('they', 4.99151442547669e-05),
('real', 5.166997354497354e-05),
('best', 5.2843997912662086e-05),
('totally', 5.382384186372808e-05),
('clean', 5.628729032984352e-05),
('about', 5.7510927076144466e-05),
('sometimes', 5.7510927076144466e-05),
('back', 5.767585427992636e-05),
('you', 5.783005781507255e-05),
('definitely', 5.885838048759397e-05),
('else', 5.915025521390049e-05),
('different', 5.960923005872315e-05),
('let', 6.0864961881548294e-05),
('big', 6.147251353650963e-05),
('um', 6.421122925977295e-05),
('like', 6.961849067112225e-05),
('black', 7.215007215007215e-05),
('better', 7.3450285142192e-05),
('instead', 7.348618459729571e-05),
('super', 7.348618459729571e-05),
('many', 7.416471984277079e-05),
('maybe', 7.713572320313893e-05),
('all', 7.78089013383131e-05),
('fun', 7.78089013383131e-05),
('rob', 7.78089013383131e-05),
('pregnant', 8.089646832357684e-05),
('smart', 8.267195767195767e-05),
('ve', 8.484810932455602e-05),
('amazing', 8.512087163772558e-05),
('naked', 8.917424647761726e-05),
('jealous', 9.134922809902256e-05),
('single', 9.538538802332195e-05),
('gonna', 9.754871131710454e-05),
('well', 9.76719213100689e-05),
('absolutely', 9.93739151923774e-05),
('we', 0.00010064285035531607),
('wish', 0.00010175010175010176),
('half', 0.00011022927689594356),
('kimberly', 0.00011022927689594356),
('very', 0.00011096570085334131),
('adrienne', 0.00011337868480725624),
('pretty', 0.00011623261051178672),
('though', 0.00011671335200746966),
('dramatic', 0.00012025012025012025),
('everything', 0.00012025012025012025),
('honest', 0.00012025012025012025),
('online', 0.00012025012025012025),
('sure', 0.00012539060711603653),
('always', 0.00012899712426001428),
('fresh', 0.00012914890869172155),
('anywhere', 0.00013227513227513228),
('he', 0.00013812099954422052),
('own', 0.00013903390707904913),
('she', 0.00015714944728096892),
('strong', 0.00015873015873015873),
('bruce', 0.00016534391534391533),
('such', 0.0001856981514926353),
('less', 0.00018726591760299626),
('okay', 0.00018726591760299626),
('married', 0.00019206760779794487),
('interested', 0.00019712201852946972),
('sudden', 0.00019712201852946972),
('once', 0.00021100385082027748),
('lately', 0.0002203128442388191),
('extremely', 0.0002204585537918871),
('before', 0.0002303668034005113),
('hopefully', 0.00023408239700374532),
('proud', 0.00024366471734892786),
('along', 0.00024968789013732833),
('small', 0.00024968789013732833),
('hot', 0.0002920629390449092),
('nude', 0.0003006253006253006),
('awful', 0.00030525030525030525),
('that', 0.00031435967164242603),
('out', 0.00032102728731942215),
('eric', 0.0003745318352059925),
('also', 0.0004005512349072824),
('love', 0.0004406256884776382),
('female', 0.0005350454788657035),
('positive', 0.0005952380952380953),
('certainly', 0.0006242197253433209),
('willing', 0.0007240099535444639),
('couple', 0.0009363295880149813),
('good', 0.0014266084463663577),
('great', 0.004029259646779759)]
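The polarity computation above is just a per-word subtraction followed by a sort. A minimal sketch with made-up numbers is shown below; `toy_pos` and `toy_neg` are hypothetical stand-ins for `posscores` and `negscores`, and their values are invented for illustration rather than taken from the Kardashian data.

```python
from operator import itemgetter

# Invented association scores with the positive and negative seeds.
toy_pos = {"great": 0.004, "love": 0.0004, "busy": 0.00006, "horrible": 0.0}
toy_neg = {"great": 0.0, "love": 0.0, "busy": 0.0002, "horrible": 0.0006}

# Polarity = association with positive seeds minus association with negative seeds.
toy_sent = {w: toy_pos[w] - toy_neg[w] for w in toy_pos}

# Ascending sort: the most negative words come first, the most positive last.
ranked = sorted(toy_sent.items(), key=itemgetter(1))
most_negative = ranked[0][0]
most_positive = ranked[-1][0]
print(most_negative, most_positive)
```

A word associated with both seed sets (like "busy" here) ends up on whichever side has the stronger association, and a word associated with neither stays at exactly 0.0, which explains the long run of zeros in the middle of the sorted list.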
We (roughly) calculate each sentence's sentiment score by comparing the number of its words that have a positive sentiment score with the number that have a negative sentiment score (according to our automatically induced lexicon).
final_message_sentiment = {}
for k, m in enumerate(msgs_adj_adv_only_tokenized):
    m_sent_score = sum([sentscores.get(w,0)>0 for w in m])-sum([sentscores.get(w,0)<0 for w in m])
    final_message_sentiment[msgs[k]]=m_sent_score
sorted(final_message_sentiment.items(), key=itemgetter(1), reverse=False)[-10:]
[("i know i'm setting myself up here, but, honestly, the warm nuts in new york are so good.",
5),
('this whole experience in new york has really opened my eyes to so many different things.',
6),
('we were like, "oh, yeah, good to see you." and then over text message, it was back before like, "oh, so good to look in your eyes."',
6),
('my first memory of bruce was it was my 11th birthday party and it was at tower lane, and i thought it was so cool that you four: you, casey, brody, everyone was coming, and i was, like, "we have four new, like, brothers and sisters, and they\'re all coming." and that\'s why it was so much fun \'cause we had such a good time.',
6),
("i wouldn't be a good manager or a good mom if i didn't find out who's really single out there and who would be a great match for kim.",
6),
("i definitely feel protective over summer because she's so young and new to the industry, but i think the smart thing to do is let her learn her own lessons and kind of feel her way through on her own.",
6),
("with sex, there's a real fine li from being fun to turning trashy, and we are the ones that have to make sure all the pictures and everything else with carmen looks really, really fun and sexy.",
6),
('once you\'ve broken trust, and you start thinking in your mind, "i want to be a better person, i want to be a better dad," people aren\'t automatically gonna go, "oh, scott, that\'s so great!',
6),
("this is a great time to tell khloe that it's not always all about us and that maybe once in a while it's a great thing to help somebody else out.",
6),
("so, tonight, khloe, i ask you to honor that very same promise to his grandmother, that you will always support lamar and stand by him because you have realized very quickly what the rest of us already know: it's very easy to love lamar.",
8)]
sorted(final_message_sentiment.items(), key=itemgetter(1))[:10]
[('now, i do not know what case you have him on, but whatever it is, it is going bad, and it sounds like it is going bad right now.',
-6),
('all you can do right now is just dote on her, take care of her and just realize that this stage of pregnancy, you just got to get through it.',
-5),
('look, i promise i will never, ever, ever lie ever again.', -5),
("i need scott to start taking things a little bit more seriously when it comes to the baby, like getting the room together, reading baby books, just being more involved in what i'm going through.",
-5),
("i couldn't be any more sorry, and i'll never excuse the way i acted the other night in vegas, but, like, i don't know what i ever did so bad to, like, deserve you to, like, hate me so much.",
-5),
("do you think it's 'cause you guys are spending way too much time together now that you guys live together?",
-5),
('now too much! too much!', -5),
("i'm just saying it's probably not the right thing to do.", -4),
("you're going too fast, you're going too fast.", -4),
('a little more, a little more!', -4)]
Pretty good considering that we had absolutely no sentiment labels to start with!
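It helps to see why some of the short messages score so low. The counting rule only looks at the sign of each word's score, so a repeated word with a tiny negative score counts as much as "bad". The sketch below re-applies the rule to two of the most negative messages, using lexicon scores copied from the output above. The token lists are an assumption about what survived the adjective/adverb filter, and `count_score` is a hypothetical helper name.

```python
# Scores copied from the induced lexicon printed above.
toy_scores = {
    "little": -3.0184078924040153e-05,
    "more": -1.3921295839860365e-05,
    "now": -4.557302009966452e-05,
    "too": -4.3628979161213044e-05,
    "much": -4.250653284081997e-06,
    "great": 0.004029259646779759,
}

def count_score(tokens, scores):
    """Number of positively scored tokens minus number of negatively scored tokens."""
    pos = sum(scores.get(w, 0) > 0 for w in tokens)
    neg = sum(scores.get(w, 0) < 0 for w in tokens)
    return pos - neg

# Assumed filtered tokens for 'a little more, a little more!'
tokens_a = ["little", "more", "little", "more"]
# Assumed filtered tokens for 'now too much! too much!'
tokens_b = ["now", "too", "much", "too", "much"]

score_a = count_score(tokens_a, toy_scores)
score_b = count_score(tokens_b, toy_scores)
print(score_a, score_b)
```

Under these assumptions the two messages come out at -4 and -5, matching the scores listed above. Weighting each word by the magnitude of its score, rather than only its sign, would be one way to keep near-neutral words from dominating.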