# Proto Information Retrieval System¶

We're going to build a very basic proto-IR system from scratch. In the first part we will preprocess the "documents" (in this case sentences from a Wikipedia article); in the second part we will implement a method for searching for the set of documents which are closest to a given query.

## Lecture: Basic text processing: sentences, types, tokens, similarity¶

In [1]:
import re

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

In [3]:
# Copy and pasted background info from the show's Wikipedia page
# (http://en.wikipedia.org/wiki/Keeping_Up_with_the_Kardashians)

background = u"""
Robert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987). The couple divorced in 1991.[3] In 1994, Robert entered the media spotlight when he defended O.J. Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial. Kris married former Olympic champion Bruce Jenner (born 1949) in 1991. Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997). Robert died in 2003 eight weeks after being diagnosed with esophageal cancer.[4] In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan.[5] Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California. Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey.[5] In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive.[6] In February 2007, a home sex video that Kim made with Ray J years earlier was leaked.[7] Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21.[7] Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for$5 million.[8]

In August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer.[6] Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [...] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain." The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending.[6] Keeping Up With the Kardashians premiered on October 14, 2007.[9]

On November 13, 2007, it was announced that E! renewed the show for its second season.[9] The following year, it was renewed for a third season. Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television's most-talked-about shows [...] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics."[10] The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010.[3] In 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony,[11] but filed for divorce 72 days later.[12] This caused widespread backlash from the public and media.[13] Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures.[14] A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split.[15] As of December 2013, eight seasons of the series have aired. On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine.[16] The deal was estimated at$40 million.[17][18] On January. 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale.

The show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show. As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show. Kris' second husband[19] 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began.
Since the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections.
Kim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010.[20] In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage. In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June.
Khloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008. The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship. In 2012, she served as a co-host during the second season of the American version of The X Factor.
Rob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons.
Kendall and Kylie have also established careers in the modeling industry.
In the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons.
The family earns an alleged total of $10 million per season of the series.[21] """  The simplest way to identify individual words is to split the string at every whitespace. In [4]: background.split()  Out[4]: ['Robert', 'Kardashian', '(1944—2003)', 'and', 'Kristen', 'Mary', '"Kris"', 'Houghton', '(born', '1955)', 'married', 'in', '1978', 'and', 'had', 'four', 'children', 'together,', 'daughters', 'Kourtney', '(born', '1979),', 'Kim', '(born', '1980),', 'and', 'Khloé,', '(born', '1984),', 'and', 'son', 'Rob', '(born', '1987).', 'The', 'couple', 'divorced', 'in', '1991.[3]', 'In', '1994,', 'Robert', 'entered', 'the', 'media', 'spotlight', 'when', 'he', 'defended', 'O.J.', 'Simpson', 'for', 'the', 'murders', 'of', 'Nicole', 'Brown', 'Simpson', 'and', 'Ronald', 'Goldman', 'during', 'a', 'lengthy', 'trial.', 'Kris', 'married', 'former', 'Olympic', 'champion', 'Bruce', 'Jenner', '(born', '1949)', 'in', '1991.', 'Bruce', 'and', 'Kris', 'had', 'two', 'daughters', 'together,', 'Kendall', '(born', '1995)', 'and', 'Kylie', '(born', '1997).', 'Robert', 'died', 'in', '2003', 'eight', 'weeks', 'after', 'being', 'diagnosed', 'with', 'esophageal', 'cancer.[4]', 'In', '2004,', 'Kim', 'became', 'a', 'personal', 'stylist', 'to', 'recording', 'artist', 'Brandy', 'Norwood;', 'she', 'eventually', 'developed', 'into', 'a', 'full-time', 'stylist,', 'and', 'was', 'a', 'personal', 'shopper', 'and', 'stylist', 'to', 'actress', 'Lindsay', 'Lohan.[5]', 'Khloé,', 'Kim', 'and', 'Kourtney', 'further', 'ventured', 'into', 'fashion,', 'opening', 'a', 'high', 'fashion', 'boutique', 'D-A-S-H', 'in', 'Calabasas,', 'California.', 'Throughout', "Kim's", 'early', 'career,', 'she', 'involved', 'herself', 'in', 'some', 'high-profile', 'relationships', 'including', "Norwood's", 'brother,', 'rapper', 'Ray', 'J', 'and', 'later,', 'singer', 'Nick', 'Lachey.[5]', 'In', '2006,', 'Kourtney', 'starred', 'in', 'her', 'first', 'reality', 'television', 'series,', 'Filthy', 'Rich:', 'Cattle', 'Drive.[6]', 'In', 'February', '2007,', 'a', 'home', 'sex', 'video', 'that', 'Kim', 'made', 'with', 'Ray', 'J', 'years', 'earlier', 'was', 'leaked.[7]', 'Vivid', 'Entertainment', 'bought', 'the', 'rights', 'for', '$1',
'million',
'and',
'released',
'the',
'film',
'as',
'Kim',
'Kardashian:',
'Superstar',
'on',
'February',
'21.[7]',
'Kim',
'sued',
'Vivid',
'for',
'ownership',
'of',
'the',
'tape,',
'but',
'dropped',
'the',
'suit',
'in',
'April',
'2007',
'and',
'settled',
'with',
'Vivid',
'Entertainment',
'for',
'$5', 'million.[8]', 'In', 'August', '2007,', 'it', 'was', 'announced', 'that', 'the', 'Kardashians', 'and', 'Jenners', 'would', 'star', 'in', 'a', 'yet-to-be-titled', 'reality', 'show', 'on', 'E!,', 'with', 'Ryan', 'Seacrest', 'serving', 'as', 'executive', 'producer.[6]', 'Seacrest', 'said', '"At', 'the', 'heart', 'of', 'the', 'series—despite', 'the', 'catfights', 'and', 'endless', 'sarcasm—is', 'a', 'family', 'that', 'truly', 'loves', 'and', 'supports', 'one', 'another', '[...]', 'The', 'familiar', 'dynamics', 'of', 'this', 'family', 'make', 'them', 'one', 'Hollywood', 'bunch', 'that', 'is', 'sure', 'to', 'entertain."', 'The', 'series', 'announcement', 'came', 'one', 'week', 'after', 'Paris', 'Hilton', 'and', 'her', 'friend', 'Nicole', 'Richie', 'announced', 'that', 'their', 'popular', 'E!', 'series', 'entitled', 'The', 'Simple', 'Life', 'was', 'ending.[6]', 'Keeping', 'Up', 'With', 'the', 'Kardashians', 'premiered', 'on', 'October', '14,', '2007.[9]', 'On', 'November', '13,', '2007,', 'it', 'was', 'announced', 'that', 'E!', 'renewed', 'the', 'show', 'for', 'its', 'second', 'season.[9]', 'The', 'following', 'year,', 'it', 'was', 'renewed', 'for', 'a', 'third', 'season.', 'Lisa', 'Berger,', 'executive', 'vice', 'president', 'of', 'original', 'programming', 'and', 'series', 'development', 'for', 'the', 'network,', 'said', '"viewers', 'have', 'embraced', 'the', 'Kardashian', 'family', 'and', 'the', 'series', 'has', 'become', 'one', 'of', "television's", 'most-talked-about', 'shows', '[...]', 'We', 'are', 'fortunate', 'to', 'work', 'with', 'Seacrest', 'and', 'Bunim-Murray,', 'which', 'have', 'an', 'exceptional', 'ability', 'to', 'capture', 'the', "Kardashians'", 'hilarious,', 'chaotic', 'and', 'always', 'entertaining', 'personalities', 'and', 'family', 'dynamics."[10]', 'The', 'Hollywood', 'Reporter', 'reported', 'that', 'the', 'family', 'made', 'an', 'estimated', '$65',
'million',
'throughout',
'2010.[3]',
'In',
'2011,',
'Kim',
'married',
'NBA',
'player',
'Kris',
'Humphries',
'in',
'a',
'highly',
'publicized',
'wedding',
'ceremony,[11]',
'but',
'filed',
'for',
'divorce',
'72',
'days',
'later.[12]',
'This',
'caused',
'backlash',
'from',
'the',
'public',
'and',
'media.[13]',
'Several',
'news',
'outlets',
'surmised',
'that',
"Kardashian's",
'marriage',
'to',
'Humphries',
'was',
'merely',
'a',
'publicity',
'stunt,',
'to',
'promote',
'the',
'Kardashian',
"family's",
'brand',
'and',
'their',
'subsequent',
'television',
'ventures.[14]',
'A',
'widely',
'circulated',
'petition',
'to',
'remove',
'all',
'Kardashian-related',
'programming',
'from',
'the',
'air',
'has',
'followed',
'since',
'their',
'split.[15]',
'As',
'of',
'December',
'2013,',
'eight',
'seasons',
'of',
'the',
'series',
'have',
'aired.',
'On',
'April',
'24,',
'2012,',
'E!',
'signed',
'a',
'three-year',
'deal',
'with',
'the',
'Kardashian',
'family',
'that',
'will',
'keep',
'the',
'series',
'airing',
'through',
'seasons',
'seven,',
'eight',
'and',
'nine.[16]',
'The',
'deal',
'was',
'estimated',
'at',
'$40', 'million.[17][18]', 'On', 'January.', '4,', '2015,', 'a', 'tenth', 'season', 'was', 'announced', 'to', 'premiere', 'on', 'Feb', '8,', '2015', 'after', 'the', 'Kourtney', 'and', 'Khloé', 'Take', 'The', 'Hamptons', 'season', 'finale.', 'The', 'show', 'revolves', 'around', 'the', 'children', 'of', 'Kris', 'Jenner,', 'and', 'originally', 'focused', 'on', 'her', 'children', 'from', 'her', 'first', 'marriage', 'to', 'deceased', 'attorney', 'Robert', 'Kardashian:', 'Kourtney,', 'Kim,', 'Khloé,', 'and', 'Rob', "Kourtney's", 'boyfriend', 'Scott', 'Disick', 'is', 'a', 'main', 'character', 'on', 'the', 'show.', 'As', 'the', 'series', 'progressed,', "Kris'", 'children', 'Kendall', 'and', 'Kylie', 'also', 'became', 'recurring', 'cast', 'members', 'of', 'the', 'show.', "Kris'", 'second', 'husband[19]', '1976', 'Summer', 'Olympics', 'decathlon', 'champion', 'Bruce', 'Jenner,', 'is', 'also', 'frequently', 'featured', 'on', 'the', 'show,', 'and', 'has', 'been', 'a', 'recurring', 'cast', 'member', 'since', 'the', 'show', 'began.', 'Since', 'the', "series'", 'premiere,', 'the', 'Kardashian', 'sisters', 'have', 'established', 'careers', 'in', 'the', 'fashion', 'industry,', 'co-owning', 'the', 'fashion', 'boutique', 'D-A-S-H', 'and', 'launching', 'several', 'fragrances', 'and', 'clothing', 'collections.', 'Kim', 'gained', 'notoriety', 'as', 'the', 'subject', 'of', 'a', 'sex', 'tape', 'in', '2007,', 'and', 'later', 'became', 'involved', 'in', 'a', 'relationship', 'with', 'New', 'Orleans', 'Saints', 'running', 'back', 'Reggie', 'Bush', 'from', '2007-', 'March', '2010.[20]', 'In', '2011,', 'she', 'received', 'widespread', 'criticism', 'after', 'filing', 'for', 'divorce', 'from', 'New', 'Jersey', 'Nets', 'power', 'forward', 'Kris', 'Humphries', 'after', 'a', '72-day', 'marriage.', 'In', '2012,', 'while', 'still', 'married', 'Kim', 'became', 'pregnant', 'by', 'rapper', 'Kanye', 'West;', 'after', 'suffering', 'from', 'preeclampsia', 'she', 'gave', 'birth', 'prematurely', 'to', 'their', 'daughter', 'North', 'the', 'following', 'June.', 'Khloé', 'attained', 'notoriety', 'in', 'her', 'own', 'right', 'after', 'being', 'arrested', 'for', 'driving', 'under', 'the', 'influence', 'in', '2007,', 'for', 'which', 'she', 'was', 'jailed', 'for', 'approximately', 'three', 'hours', 'in', '2008.', 'The', 'following', 'year', '(during', 'the', 'fourth', 'season),', 'she', 'married', 'Los', 'Angeles', 'Lakers', 'forward', 'Lamar', 'Odom', 'after', 'a', 'one-month', 'relationship.', 'In', '2012,', 'she', 'served', 'as', 'a', 'co-host', 'during', 'the', 'second', 'season', 'of', 'the', 'American', 'version', 'of', 'The', 'X', 'Factor.', 'Rob', 'launched', 'the', 'sock', 'line', '"Arthur', 'George"', 'in', '2012,', 'and', 'was', 'involved', 'in', 'a', 'relationship', 'with', 'singer', 'Adrienne', 'Bailon', 'in', 'the', 'second', 'and', 'third', 'seasons.', 'Kendall', 'and', 'Kylie', 'have', 'also', 'established', 'careers', 'in', 'the', 'modeling', 'industry.', 'In', 'the', 'eighth', 'season,', "Bruce's", 'sons', 'Brandon', 'and', 'Brody', 'Jenner,', 'and', "Brandon's", 'wife', 'Leah', 'Felder', '(daughter', 'of', 'Eagles', 'band', 'member', 'Don', 'Felder),', 'were', 'integrated', 'into', 'the', 'supporting', 'cast,', 'while', 'Kourtney,', 'Khloé,', 'and', "Kim's", 'friends', 'Malika', 'Haqq', 'and', 'Jonathan', 'Cheban', 'joined', 'the', 'series', 'in', 'the', 'second', 'and', 'third', 'seasons.', 'The', 'family', 'earns', 'an', 'alleged', 'total', 'of', '$10',
'million',
'per',
'season',
'of',
'the',
'series.[21]']

The output is not satisfactory, especially around punctuation. Let's try to catch a higher-level structure first: sentences. A simple idea is to break at periods.

In [5]:
background.split(".")

Out[5]:
['\nRobert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987)',
' The couple divorced in 1991',
'[3] In 1994, Robert entered the media spotlight when he defended O',
'J',
' Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial',
' Kris married former Olympic champion Bruce Jenner (born 1949) in 1991',
' Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997)',
' Robert died in 2003 eight weeks after being diagnosed with esophageal cancer',
'[4] In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan',
'[5] Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California',
" Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey",
'[5] In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive',
'[6] In February 2007, a home sex video that Kim made with Ray J years earlier was leaked',
'[7] Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21', '[7] Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for$5 million',
'[8]\n\nIn August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer',
'[6] Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [',
'',
'',
'] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain',
'" The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending',
'[6] Keeping Up With the Kardashians premiered on October 14, 2007',
'[9]\n\nOn November 13, 2007, it was announced that E! renewed the show for its second season',
'[9] The following year, it was renewed for a third season',
' Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television\'s most-talked-about shows [',
'',
'',
"] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics",
'"[10] The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010', '[3]\n\nIn 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony,[11] but filed for divorce 72 days later', '[12] This caused widespread backlash from the public and media', "[13] Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures", '[14] A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split', '[15] As of December 2013, eight seasons of the series have aired', ' On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine', '[16] The deal was estimated at$40 million',
'[17][18] On January',
' 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale',
"\n\nThe show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show",
" As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show",
" Kris' second husband[19] 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began",
"\nSince the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections",
'\nKim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010',
'[20] In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage',
' In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June',
'\nKhloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008',
' The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship',
' In 2012, she served as a co-host during the second season of the American version of The X Factor',
'\nRob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons',
'\nKendall and Kylie have also established careers in the modeling industry',
"\nIn the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons",
'\nThe family earns an alleged total of $10 million per season of the series', '[21]\n'] This is not perfect either. One problem consists of the Wikipedia footnotes such as "[9]", which are irrelevant to us. Let's remove them with a regular expression. In [6]: re.findall(r"$\d+$", background)  Out[6]: ['[3]', '[4]', '[5]', '[5]', '[6]', '[7]', '[7]', '[8]', '[6]', '[6]', '[9]', '[9]', '[10]', '[3]', '[11]', '[12]', '[13]', '[14]', '[15]', '[16]', '[17]', '[18]', '[19]', '[20]', '[21]'] In [7]: background_nofoot = re.sub(r"$\d+$", "", background)  We can now try to split sentences again. We also want to split on other punctuation marks, not just the period. This is possible with regular expressions. In [8]: splitter = re.compile(r""" [.!?] # split on punctuation """, re.VERBOSE)  While this looks better, it messes up som instances: O.J. Simpson's name, the "E!" network's name, and ellipses (...). To deal with this, we need to think about the decision process of whether a punctuation mark is an end of sentence or not. In class, we came up with three requirements: • It is succeeded by at least one whitespace character (not O.J) • The first character after the whitespace should be uppercase (not E! renewed) • The last character before it should not be uppercase (not J. Simpson) Are these rules complete? In [9]: #Note that we do not need to catch that last period, since we are using split splitter = re.compile(r""" (?<![A-Z]) # last character cannot be uppercase [.!?] # match punctuation \s+ # followed by whitespace (?=[A-Z]) # next character must be uppercase """, re.VERBOSE)  Running our sentence splitter on the text, it seems to do a good job. In [10]: for sentence in splitter.split(background_nofoot): print(sentence.strip()) print("--")  Robert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987) -- The couple divorced in 1991 -- In 1994, Robert entered the media spotlight when he defended O.J. Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial -- Kris married former Olympic champion Bruce Jenner (born 1949) in 1991 -- Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997) -- Robert died in 2003 eight weeks after being diagnosed with esophageal cancer -- In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan -- Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California -- Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey -- In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive -- In February 2007, a home sex video that Kim made with Ray J years earlier was leaked -- Vivid Entertainment bought the rights for$1 million and released the film as Kim Kardashian: Superstar on February 21
--
Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for $5 million -- In August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer -- Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [...] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain." The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending -- Keeping Up With the Kardashians premiered on October 14, 2007 -- On November 13, 2007, it was announced that E! renewed the show for its second season -- The following year, it was renewed for a third season -- Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television's most-talked-about shows [...] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics." The Hollywood Reporter reported that the family made an estimated$65 million throughout 2010
--
In 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony, but filed for divorce 72 days later
--
This caused widespread backlash from the public and media
--
Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures
--
A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split
--
As of December 2013, eight seasons of the series have aired
--
On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine
--
The deal was estimated at $40 million -- On January. 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale -- The show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show -- As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show -- Kris' second husband 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began -- Since the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections -- Kim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010 -- In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage -- In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June -- Khloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008 -- The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship -- In 2012, she served as a co-host during the second season of the American version of The X Factor -- Rob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons -- Kendall and Kylie have also established careers in the modeling industry -- In the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons -- The family earns an alleged total of$10 million per season of the series.
--


Now let's identify individual words within each sentence. Rather than splitting at whitespace, let's match all sequences of word characters.

In [11]:
word_splitter = re.compile(r"""
(\w+)
""", re.VERBOSE)

In [12]:
sent_words = [word_splitter.findall(sent)
for sent in splitter.split(background_nofoot)]

In [13]:
sent_words_lower = [[w.lower() for w in sent]
for sent in sent_words]


How many sentences do we have?

In [14]:
len(sent_words_lower)

Out[14]:
41

How many words are there in total? ("tokens")

In [15]:
allwords=[w for sent in sent_words_lower for w in sent]

In [16]:
sorted(allwords)

Out[16]:
['1',
'10',
'13',
'14',
'1944',
'1949',
'1955',
'1976',
'1978',
'1979',
'1980',
'1984',
'1987',
'1991',
'1991',
'1994',
'1995',
'1997',
'2003',
'2003',
'2004',
'2006',
'2007',
'2007',
'2007',
'2007',
'2007',
'2007',
'2007',
'2007',
'2008',
'2010',
'2010',
'2011',
'2011',
'2012',
'2012',
'2012',
'2012',
'2013',
'2015',
'2015',
'21',
'24',
'4',
'40',
'5',
'65',
'72',
'72',
'8',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'a',
'ability',
'actress',
'after',
'after',
'after',
'after',
'after',
'after',
'after',
'after',
'air',
'aired',
'airing',
'all',
'alleged',
'also',
'also',
'also',
'always',
'american',
'an',
'an',
'an',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'and',
'angeles',
'announced',
'announced',
'announced',
'announced',
'announcement',
'another',
'approximately',
'april',
'april',
'are',
'around',
'arrested',
'arthur',
'artist',
'as',
'as',
'as',
'as',
'as',
'as',
'at',
'at',
'attained',
'attorney',
'august',
'back',
'backlash',
'bailon',
'band',
'be',
'became',
'became',
'became',
'became',
'become',
'been',
'began',
'being',
'being',
'berger',
'birth',
'born',
'born',
'born',
'born',
'born',
'born',
'born',
'born',
'bought',
'boutique',
'boutique',
'boyfriend',
'brand',
'brandon',
'brandon',
'brandy',
'brody',
'brother',
'brown',
'bruce',
'bruce',
'bruce',
'bruce',
'bunch',
'bunim',
'bush',
'but',
'but',
'by',
'calabasas',
'california',
'came',
'cancer',
'capture',
'career',
'careers',
'careers',
'cast',
'cast',
'cast',
'catfights',
'cattle',
'caused',
'ceremony',
'champion',
'champion',
'chaotic',
'character',
'cheban',
'children',
'children',
'children',
'children',
'circulated',
'clothing',
'co',
'co',
'collections',
'couple',
'criticism',
'd',
'd',
'daughter',
'daughter',
'daughters',
'daughters',
'day',
'days',
'deal',
'deal',
'decathlon',
'deceased',
'december',
'defended',
'despite',
'developed',
'development',
'diagnosed',
'died',
'disick',
'divorce',
'divorce',
'divorced',
'don',
'drive',
'driving',
'dropped',
'during',
'during',
'during',
'dynamics',
'dynamics',
'e',
'e',
'e',
'e',
'eagles',
'earlier',
'early',
'earns',
'eight',
'eight',
'eight',
'eighth',
'embraced',
'ending',
'endless',
'entered',
'entertain',
'entertaining',
'entertainment',
'entertainment',
'entitled',
'esophageal',
'established',
'established',
'estimated',
'estimated',
'eventually',
'exceptional',
'executive',
'executive',
'factor',
'familiar',
'family',
'family',
'family',
'family',
'family',
'family',
'family',
'family',
'fashion',
'fashion',
'fashion',
'fashion',
'featured',
'feb',
'february',
'february',
'felder',
'felder',
'filed',
'filing',
'film',
'filthy',
'finale',
'first',
'first',
'focused',
'followed',
'following',
'following',
'following',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'for',
'former',
'fortunate',
'forward',
'forward',
'four',
'fourth',
'fragrances',
'frequently',
'friend',
'friends',
'from',
'from',
'from',
'from',
'from',
'from',
'full',
'further',
'gained',
'gave',
'george',
'goldman',
'h',
'h',
'hamptons',
'haqq',
'has',
'has',
'has',
'have',
'have',
'have',
'have',
'have',
'he',
'heart',
'her',
'her',
'her',
'her',
'her',
'herself',
'high',
'high',
'highly',
'hilarious',
'hilton',
'hollywood',
'hollywood',
'home',
'host',
'houghton',
'hours',
'humphries',
'humphries',
'humphries',
'husband',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'in',
'including',
'industry',
'industry',
'influence',
'integrated',
'into',
'into',
'into',
'involved',
'involved',
'involved',
'is',
'is',
'is',
'is',
'it',
'it',
'it',
'its',
'j',
'j',
'j',
'jailed',
'january',
'jenner',
'jenner',
'jenner',
'jenner',
'jenners',
'jersey',
'joined',
'jonathan',
'june',
'kanye',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashian',
'kardashians',
'kardashians',
'kardashians',
'keep',
'keeping',
'kendall',
'kendall',
'kendall',
'khloé',
'khloé',
'khloé',
'khloé',
'khloé',
'khloé',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kim',
'kourtney',
'kourtney',
'kourtney',
'kourtney',
'kourtney',
'kourtney',
'kourtney',
'kris',
'kris',
'kris',
'kris',
'kris',
'kris',
'kris',
'kris',
'kristen',
'kylie',
'kylie',
'kylie',
'lachey',
'lakers',
'lamar',
'later',
'later',
'later',
'launched',
'launching',
'leah',
'leaked',
'lengthy',
'life',
'lindsay',
'line',
'lisa',
'lohan',
'los',
'loves',
'main',
'make',
'malika',
'march',
'marriage',
'marriage',
'marriage',
'married',
'married',
'married',
'married',
'married',
'mary',
'media',
'media',
'member',
'member',
'members',
'merely',
'million',
'million',
'million',
'million',
'million',
'modeling',
'month',
'most',
'murders',
'murray',
'nba',
'nets',
'network',
'new',
'new',
'news',
'nick',
'nicole',
'nicole',
'nine',
'north',
'norwood',
'norwood',
'notoriety',
'notoriety',
'november',
'o',
'october',
'odom',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'of',
'olympic',
'olympics',
'on',
'on',
'on',
'on',
'on',
'on',
'on',
'on',
'on',
'on',
'one',
'one',
'one',
'one',
'one',
'opening',
'original',
'originally',
'orleans',
'outlets',
'own',
'ownership',
'owning',
'paris',
'per',
'personal',
'personal',
'personalities',
'petition',
'player',
'popular',
'power',
'preeclampsia',
'pregnant',
'prematurely',
'premiere',
'premiere',
'premiered',
'president',
'producer',
'profile',
'programming',
'programming',
'progressed',
'promote',
'public',
'publicity',
'publicized',
'rapper',
'rapper',
'ray',
'ray',
'reality',
'reality',
'recording',
'recurring',
'recurring',
'reggie',
'related',
'relationship',
'relationship',
'relationship',
'relationships',
'released',
'remove',
'renewed',
'renewed',
'reported',
'reporter',
'revolves',
'rich',
'richie',
'right',
'rights',
'rob',
'rob',
'rob',
'robert',
'robert',
'robert',
'robert',
'ronald',
'running',
'ryan',
's',
's',
's',
's',
's',
's',
's',
's',
's',
's',
's',
'said',
'said',
'saints',
'sarcasm',
'scott',
'seacrest',
'seacrest',
'seacrest',
'season',
'season',
'season',
'season',
'season',
'season',
'season',
'season',
'seasons',
'seasons',
'seasons',
'seasons',
'second',
'second',
'second',
'second',
'second',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'series',
'served',
'serving',
'settled',
'seven',
'several',
'several',
'sex',
'sex',
'she',
'she',
'she',
'she',
'she',
'she',
'she',
'shopper',
'show',
'show',
'show',
'show',
'show',
'show',
'show',
'shows',
'signed',
'simple',
'simpson',
'simpson',
'since',
'since',
'since',
'singer',
'singer',
'sisters',
'sock',
'some',
'son',
'sons',
'split',
'spotlight',
'star',
'starred',
'still',
'stunt',
'stylist',
'stylist',
'stylist',
'subject',
'subsequent',
'sued',
'suffering',
'suit',
'summer',
'superstar',
'supporting',
'supports',
'sure',
'surmised',
'take',
'talked',
'tape',
'tape',
'television',
'television',
'television',
'tenth',
'that',
'that',
'that',
'that',
'that',
'that',
'that',
'that',
'that',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'the',
'their',
'their',
'their',
'their',
'them',
'third',
'third',
'third',
'this',
'this',
'three',
'three',
'through',
'throughout',
'throughout',
'time',
'titled',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'to',
'together',
'together',
'total',
'trial',
'truly',
'two',
'under',
'up',
'ventured',
'ventures',
'version',
'vice',
'video',
'viewers',
'vivid',
'vivid',
'vivid',
'was',
'was',
'was',
'was',
'was',
'was',
'was',
'was',
'was',
'was',
'was',
'we',
'wedding',
'week',
'weeks',
'were',
'west',
'when',
'which',
'which',
'while',
'while',
'widely',
'wife',
'will',
'with',
'with',
'with',
'with',
'with',
'with',
'with',
'with',
'with',
'work',
'would',
'x',
'year',
'year',
'year',
'years',
'yet']
In [17]:
len(allwords)

Out[17]:
972

How many distinct types of words ("types") are there?

In [18]:
len(set(allwords))

Out[18]:
459
In [19]:
sorted(set(allwords))

Out[19]:
['1',
'10',
'13',
'14',
'1944',
'1949',
'1955',
'1976',
'1978',
'1979',
'1980',
'1984',
'1987',
'1991',
'1994',
'1995',
'1997',
'2003',
'2004',
'2006',
'2007',
'2008',
'2010',
'2011',
'2012',
'2013',
'2015',
'21',
'24',
'4',
'40',
'5',
'65',
'72',
'8',
'a',
'ability',
'actress',
'after',
'air',
'aired',
'airing',
'all',
'alleged',
'also',
'always',
'american',
'an',
'and',
'angeles',
'announced',
'announcement',
'another',
'approximately',
'april',
'are',
'around',
'arrested',
'arthur',
'artist',
'as',
'at',
'attained',
'attorney',
'august',
'back',
'backlash',
'bailon',
'band',
'be',
'became',
'become',
'been',
'began',
'being',
'berger',
'birth',
'born',
'bought',
'boutique',
'boyfriend',
'brand',
'brandon',
'brandy',
'brody',
'brother',
'brown',
'bruce',
'bunch',
'bunim',
'bush',
'but',
'by',
'calabasas',
'california',
'came',
'cancer',
'capture',
'career',
'careers',
'cast',
'catfights',
'cattle',
'caused',
'ceremony',
'champion',
'chaotic',
'character',
'cheban',
'children',
'circulated',
'clothing',
'co',
'collections',
'couple',
'criticism',
'd',
'daughter',
'daughters',
'day',
'days',
'deal',
'decathlon',
'deceased',
'december',
'defended',
'despite',
'developed',
'development',
'diagnosed',
'died',
'disick',
'divorce',
'divorced',
'don',
'drive',
'driving',
'dropped',
'during',
'dynamics',
'e',
'eagles',
'earlier',
'early',
'earns',
'eight',
'eighth',
'embraced',
'ending',
'endless',
'entered',
'entertain',
'entertaining',
'entertainment',
'entitled',
'esophageal',
'established',
'estimated',
'eventually',
'exceptional',
'executive',
'factor',
'familiar',
'family',
'fashion',
'featured',
'feb',
'february',
'felder',
'filed',
'filing',
'film',
'filthy',
'finale',
'first',
'focused',
'followed',
'following',
'for',
'former',
'fortunate',
'forward',
'four',
'fourth',
'fragrances',
'frequently',
'friend',
'friends',
'from',
'full',
'further',
'gained',
'gave',
'george',
'goldman',
'h',
'hamptons',
'haqq',
'has',
'have',
'he',
'heart',
'her',
'herself',
'high',
'highly',
'hilarious',
'hilton',
'hollywood',
'home',
'host',
'houghton',
'hours',
'humphries',
'husband',
'in',
'including',
'industry',
'influence',
'integrated',
'into',
'involved',
'is',
'it',
'its',
'j',
'jailed',
'january',
'jenner',
'jenners',
'jersey',
'joined',
'jonathan',
'june',
'kanye',
'kardashian',
'kardashians',
'keep',
'keeping',
'kendall',
'khloé',
'kim',
'kourtney',
'kris',
'kristen',
'kylie',
'lachey',
'lakers',
'lamar',
'later',
'launched',
'launching',
'leah',
'leaked',
'lengthy',
'life',
'lindsay',
'line',
'lisa',
'lohan',
'los',
'loves',
'main',
'make',
'malika',
'march',
'marriage',
'married',
'mary',
'media',
'member',
'members',
'merely',
'million',
'modeling',
'month',
'most',
'murders',
'murray',
'nba',
'nets',
'network',
'new',
'news',
'nick',
'nicole',
'nine',
'north',
'norwood',
'notoriety',
'november',
'o',
'october',
'odom',
'of',
'olympic',
'olympics',
'on',
'one',
'opening',
'original',
'originally',
'orleans',
'outlets',
'own',
'ownership',
'owning',
'paris',
'per',
'personal',
'personalities',
'petition',
'player',
'popular',
'power',
'preeclampsia',
'pregnant',
'prematurely',
'premiere',
'premiered',
'president',
'producer',
'profile',
'programming',
'progressed',
'promote',
'public',
'publicity',
'publicized',
'rapper',
'ray',
'reality',
'recording',
'recurring',
'reggie',
'related',
'relationship',
'relationships',
'released',
'remove',
'renewed',
'reported',
'reporter',
'revolves',
'rich',
'richie',
'right',
'rights',
'rob',
'robert',
'ronald',
'running',
'ryan',
's',
'said',
'saints',
'sarcasm',
'scott',
'seacrest',
'season',
'seasons',
'second',
'series',
'served',
'serving',
'settled',
'seven',
'several',
'sex',
'she',
'shopper',
'show',
'shows',
'signed',
'simple',
'simpson',
'since',
'singer',
'sisters',
'sock',
'some',
'son',
'sons',
'split',
'spotlight',
'star',
'starred',
'still',
'stunt',
'stylist',
'subject',
'subsequent',
'sued',
'suffering',
'suit',
'summer',
'superstar',
'supporting',
'supports',
'sure',
'surmised',
'take',
'talked',
'tape',
'television',
'tenth',
'that',
'the',
'their',
'them',
'third',
'this',
'three',
'through',
'throughout',
'time',
'titled',
'to',
'together',
'total',
'trial',
'truly',
'two',
'under',
'up',
'ventured',
'ventures',
'version',
'vice',
'video',
'viewers',
'vivid',
'was',
'we',
'wedding',
'week',
'weeks',
'were',
'west',
'when',
'which',
'while',
'widely',
'wife',
'will',
'with',
'work',
'would',
'x',
'year',
'years',
'yet']

## Answering queries on the data¶

We will try to retrieve the closest matching sentence to a given query. To do this, we must define what "closest" means. In other words, we need a similarity measure.

A simple one is the number of types in common between the query and the sentence.

In [20]:
def types_in_common(query_words, sentence):
A = set(query_words)
B = set(sentence)
return len(A.intersection(B))


A slightly more complex one is the the Jaccard similarity measure, which additionaly takes into account the total number of types that the query and the sentence has.

In [21]:
def jaccard(query_words, sentence):
A = set(query_words)
B = set(sentence)
return float(len(A.intersection(B)))/len(A.union(B))


Next we'll define a basic "search engine" which will go through all the sentences and calculate each one's similarity with the query. It returns a list of sentences sorted by their similarity score (if that score is greater than zero).

To calculate the similarity, this function takes as an argument a similarity_measure function.

In [22]:
from operator import itemgetter

def run_search(query, similarity_measure):
query_words = word_splitter.findall(query)
query_words = [w.lower() for w in query_words]

sent_scores = [(sent, similarity_measure(query_words, sent))
for sent in sent_words_lower]

sent_scores = sorted(sent_scores, key=itemgetter(1), reverse=True)
sent_scores = [(sent, score)
for sent, score in sent_scores
if score > 0]

joined_sents = [(" ".join(sent), score) for sent, score in sent_scores]
return joined_sents


Now we'll run two versions of the search engines (one using the types_in_commmon measure and one using the jaccard measure) for two different queries.

In [23]:
run_search("kris olympic",types_in_common)

Out[23]:
[('kris married former olympic champion bruce jenner born 1949 in 1991', 2),
('robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khloé born 1984 and son rob born 1987',
1),
('bruce and kris had two daughters together kendall born 1995 and kylie born 1997',
1),
('in 2011 kim married nba player kris humphries in a highly publicized wedding ceremony but filed for divorce 72 days later',
1),
('the show revolves around the children of kris jenner and originally focused on her children from her first marriage to deceased attorney robert kardashian kourtney kim khloé and rob kourtney s boyfriend scott disick is a main character on the show',
1),
('as the series progressed kris children kendall and kylie also became recurring cast members of the show',
1),
('kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began',
1),
('in 2011 she received widespread criticism after filing for divorce from new jersey nets power forward kris humphries after a 72 day marriage',
1)]
In [24]:
run_search("kourtney",jaccard)

Out[24]:
[('in 2006 kourtney starred in her first reality television series filthy rich cattle drive',
0.07692307692307693),
('khloé kim and kourtney further ventured into fashion opening a high fashion boutique d a s h in calabasas california',
0.05555555555555555),
('on january 4 2015 a tenth season was announced to premiere on feb 8 2015 after the kourtney and khloé take the hamptons season finale',
0.047619047619047616),
('robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khloé born 1984 and son rob born 1987',
0.03571428571428571),
('the show revolves around the children of kris jenner and originally focused on her children from her first marriage to deceased attorney robert kardashian kourtney kim khloé and rob kourtney s boyfriend scott disick is a main character on the show',
0.030303030303030304),
('in the eighth season bruce s sons brandon and brody jenner and brandon s wife leah felder daughter of eagles band member don felder were integrated into the supporting cast while kourtney khloé and kim s friends malika haqq and jonathan cheban joined the series in the second and third seasons',
0.02564102564102564)]

As we will be playing around with the internals of our information retrieval system, let's write a convenience function for displaying the results and highlighting the ones that we know are actually relevant for our information need.

In [25]:
def print_results(orderedlist, relevant_docs=[], maxresults=5):
"""Print search results while highlighting the ones we truly care about"""
count = 1
for item, score in orderedlist:
if item in relevant_docs:
print("{:d} !!! {:.2f} {}".format(count, score, item))
elif count <= maxresults:
print("{:d}     {:.2f} {}".format(count, score, item))
print()
count += 1

In [26]:
relevant_docs_champion = ["kris married former olympic champion bruce jenner born 1949 in 1991",
"kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began"]


The Jaccard measure ranks a completely irrelevant document on #1, gets one right on #2, but the actual best document is all the way down on position 16.

In [27]:
print_results(run_search("the olympic champion in kardashians", jaccard),
relevant_docs=relevant_docs_champion)

1     0.25 the couple divorced in 1991

2 !!! 0.23 kris married former olympic champion bruce jenner born 1949 in 1991

3     0.15 keeping up with the kardashians premiered on october 14 2007

4     0.14 kendall and kylie have also established careers in the modeling industry

5     0.10 in 2012 she served as a co host during the second season of the american version of the x factor

16 !!! 0.07 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began



# Lecture 5: Vector Space Model¶

In order to have a more flexible system that we can modify, let's implement a vector space model. We will use term frequency (TF) weights in order to address one of the issues with Jaccard, namely, that if a word occurs more than once, it should naturally matter more.

In [28]:
terms = sorted(set(allwords))

# TF (term frequency) vectorization
# We represent vectors in a "sparse" dictionary format.
# All keys not present in the dictionary are assumed to be zeros.

def doc_to_vec(term_list):
d = {}
for v in terms:
d[v] = term_list.count(v)
return d

def query_to_vec(term_list):
d = {}
for v in terms:
d[v] = term_list.count(v)
return d

In [29]:
import math

def dot(d, q):
sum=0
for v in d:  # iterates through keys
sum += d[v] * q[v]
return sum


One simple similarity measure operating on vectors is the dot product. The higher the dot product between two vectors, the more similar they are.

In [30]:
def dot_measure(query_words, sentence):
A = query_to_vec(query_words)
B = doc_to_vec(sentence)
return float(dot(A, B))

In [31]:
print_results(run_search("the olympic champion in kardashians",dot_measure),relevant_docs=relevant_docs_champion)

1     7.00 lisa berger executive vice president of original programming and series development for the network said viewers have embraced the kardashian family and the series has become one of television s most talked about shows we are fortunate to work with seacrest and bunim murray which have an exceptional ability to capture the kardashians hilarious chaotic and always entertaining personalities and family dynamics the hollywood reporter reported that the family made an estimated 65 million throughout 2010

2     6.00 seacrest said at the heart of the series despite the catfights and endless sarcasm is a family that truly loves and supports one another the familiar dynamics of this family make them one hollywood bunch that is sure to entertain the series announcement came one week after paris hilton and her friend nicole richie announced that their popular e series entitled the simple life was ending

3     6.00 in the eighth season bruce s sons brandon and brody jenner and brandon s wife leah felder daughter of eagles band member don felder were integrated into the supporting cast while kourtney khloé and kim s friends malika haqq and jonathan cheban joined the series in the second and third seasons

4     5.00 since the series premiere the kardashian sisters have established careers in the fashion industry co owning the fashion boutique d a s h and launching several fragrances and clothing collections

5     5.00 rob launched the sock line arthur george in 2012 and was involved in a relationship with singer adrienne bailon in the second and third seasons

10 !!! 3.00 kris married former olympic champion bruce jenner born 1949 in 1991

13 !!! 3.00 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began



We can see that this does even worse, as it rewards longer documents unfairly. We can address that by length normalizing the documents and the query, that is, dividing the vectors by their norm.

The resulting measure is the cosine similarity measure:

In [32]:
def norm(d):
sum_sq = 0
for v in d:
sum_sq += d[v] * d[v]
return math.sqrt(sum_sq)

def cos_measure(query_words, sentence):
A = query_to_vec(query_words)
B = doc_to_vec(sentence)
return float(dot(A, B)) / (norm(A) * norm(B))

In [33]:
print_results(run_search("the olympic champion in kardashians",cos_measure),relevant_docs=relevant_docs_champion)

1 !!! 0.40 kris married former olympic champion bruce jenner born 1949 in 1991

2     0.40 the couple divorced in 1991

3     0.38 rob launched the sock line arthur george in 2012 and was involved in a relationship with singer adrienne bailon in the second and third seasons

4     0.34 in 2012 she served as a co host during the second season of the american version of the x factor

5     0.33 since the series premiere the kardashian sisters have established careers in the fashion industry co owning the fashion boutique d a s h and launching several fragrances and clothing collections

15 !!! 0.24 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began



This already does better, ranking one of our important documents as first. The other is still low in the ranking.

Another issue we have at the moment is that all words matter equally, but intuition dictates that some words in the query (e.g. olympic) are more important than others (e.g. in). A way to address this is to weight the words according to their specificity, and a concrete implementation is to use inverse document frequency (IDF)

In [34]:
IDF = {}
DF = {}

for t in terms:
DF[t] = len([1 for sent in sent_words_lower if t in sent])
IDF[t] = 1 / float(DF[t] + 1)

In [35]:
for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[:10]:
print(IDF_t)

print("...")

for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[-10:]:
print(IDF_t)

('the', 0.03225806451612903)
('and', 0.041666666666666664)
('in', 0.043478260869565216)
('a', 0.047619047619047616)
('kim', 0.07692307692307693)
('of', 0.08333333333333333)
('was', 0.08333333333333333)
('for', 0.1)
('series', 0.1)
('to', 0.1)
...
('west', 0.5)
('when', 0.5)
('widely', 0.5)
('wife', 0.5)
('will', 0.5)
('work', 0.5)
('would', 0.5)
('x', 0.5)
('years', 0.5)
('yet', 0.5)

In [36]:
##TF-IDF weights

def doc_to_vec(term_list):
d = {}
for v in terms:
d[v] = term_list.count(v) * IDF[v]
return d

def query_to_vec(term_list):
d = {}
for v in terms:
d[v] = term_list.count(v) * IDF[v]
return d

In [37]:
print_results(run_search("the olympic champion in kardashians",cos_measure),relevant_docs=relevant_docs_champion)

1 !!! 0.52 kris married former olympic champion bruce jenner born 1949 in 1991

2 !!! 0.10 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began

3     0.08 keeping up with the kardashians premiered on october 14 2007

4     0.06 in august 2007 it was announced that the kardashians and jenners would star in a yet to be titled reality show on e with ryan seacrest serving as executive producer

5     0.03 lisa berger executive vice president of original programming and series development for the network said viewers have embraced the kardashian family and the series has become one of television s most talked about shows we are fortunate to work with seacrest and bunim murray which have an exceptional ability to capture the kardashians hilarious chaotic and always entertaining personalities and family dynamics the hollywood reporter reported that the family made an estimated 65 million throughout 2010



With TF-IDF and cosine similarity, we now get both relevant documents ranked first and second in the returned list. Neat!

# Lecture- Efficient retrieval¶

So far we made good progress on improving our IR system along the "ease of access" dimension (basically, getting the most relevant results on top). Now it's time to tackle about the scalability dimension: making our IR system answer queries efficiently.

If we time the run of our measure, we see that it takes 40 milliseconds to process a query on our collection of only 41 (short documents). If the run-time would increases linearly with the number of docs in the collection this would mean that in a web-scale collection of 1 billion documents, it would take us 11 1/2 days to run one search for a query!

On the tv transcript collection from Assingment 1 (containing about 40,000 sentences) running a single query takes 11 minutes.

In [38]:
%timeit a=run_search("the olympic champion in kardashians",cos_measure)

22.4 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Why so slow? let's look at our main data structure: the TERM-DOCUMENT matrix

In [ ]:
for d in sent_words_lower:
print (doc_to_vec(d).values())


### Building an inverted index¶

The first step is to reorganize the way we store information about which terms appear in which documents. Instead of storing a huge term-document martrix, we'll use an inverted index: a dictionary where each term links to a list that contains all unique document_ids of the documents in which it appears (a list of "postings").

Normaly this is implemented with linked lists, but for demonstrating the basic concept we will simply use regular lists.

In [40]:
from collections import defaultdict

docs=sent_words_lower

inverted_index=defaultdict(list)
for doc_idx, doc in enumerate(docs): # by iterating this way we make sure that doc_ids will always appear sorted in the postings
for term in doc:
if doc_idx not in inverted_index[term]: # this way we have unique doc_ids
inverted_index[term].append(doc_idx)

In [45]:
for term in inverted_index.keys():
print (term,inverted_index[term])

robert [0, 2, 5, 27]
kardashian [0, 11, 18, 21, 21, 22, 24, 27, 30]
1944 [0]
2003 [0, 5]
and [0, 0, 0, 0, 2, 4, 4, 6, 6, 7, 8, 11, 12, 13, 14, 14, 14, 18, 18, 18, 18, 18, 20, 21, 24, 26, 27, 27, 28, 29, 30, 30, 31, 37, 37, 38, 39, 39, 39, 39, 39]
kristen [0]
mary [0]
kris [0, 3, 4, 19, 27, 28, 29, 32]
houghton [0]
born [0, 0, 0, 0, 0, 3, 4, 4]
1955 [0]
married [0, 3, 19, 33, 35]
in [0, 1, 2, 3, 5, 6, 7, 8, 9, 9, 10, 12, 13, 13, 19, 19, 30, 31, 31, 32, 33, 34, 34, 34, 36, 37, 37, 37, 38, 39, 39]
1978 [0]
four [0]
children [0, 27, 27, 28]
together [0, 4]
daughters [0, 4]
kourtney [0, 7, 9, 26, 27, 27, 39]
1979 [0]
kim [0, 6, 7, 8, 10, 11, 12, 19, 27, 31, 33, 39]
1980 [0]
khloé [0, 7, 26, 27, 34, 39]
1984 [0]
son [0]
rob [0, 27, 37]
1987 [0]
the [1, 2, 2, 11, 11, 12, 12, 13, 14, 14, 14, 14, 14, 14, 15, 16, 17, 18, 18, 18, 18, 18, 18, 20, 21, 22, 23, 24, 24, 25, 26, 26, 27, 27, 27, 28, 28, 29, 29, 30, 30, 30, 30, 31, 33, 34, 35, 35, 36, 36, 36, 37, 37, 38, 39, 39, 39, 39, 40, 40]
couple [1]
divorced [1]
1991 [1, 3]
1994 [2]
entered [2]
media [2, 20]
spotlight [2]
when [2]
he [2]
defended [2]
o [2]
j [2, 8, 10]
simpson [2, 2]
for [2, 11, 12, 12, 16, 17, 18, 19, 32, 34, 34, 34]
murders [2]
of [2, 12, 14, 14, 18, 18, 23, 23, 27, 28, 31, 36, 36, 39, 40, 40]
nicole [2, 14]
brown [2]
ronald [2]
goldman [2]
during [2, 35, 36]
a [2, 6, 6, 6, 7, 7, 10, 13, 14, 17, 19, 21, 22, 24, 26, 27, 29, 30, 31, 31, 32, 35, 36, 37]
lengthy [2]
trial [2]
former [3]
olympic [3]
champion [3, 29]
bruce [3, 4, 29, 39]
jenner [3, 27, 29, 39]
1949 [3]
two [4]
kendall [4, 28, 38]
1995 [4]
kylie [4, 28, 38]
1997 [4]
died [5]
eight [5, 23, 24]
weeks [5]
after [5, 14, 26, 32, 32, 33, 34, 35]
being [5, 34]
diagnosed [5]
with [5, 10, 12, 13, 15, 18, 24, 31, 37]
esophageal [5]
cancer [5]
2004 [6]
became [6, 28, 31, 33]
personal [6, 6]
stylist [6, 6, 6]
to [6, 6, 13, 14, 18, 18, 21, 21, 22, 26, 27, 33]
recording [6]
artist [6]
brandy [6]
norwood [6, 8]
she [6, 8, 32, 33, 34, 35, 36]
eventually [6]
developed [6]
into [6, 7, 39]
full [6]
time [6]
was [6, 10, 13, 14, 16, 17, 21, 25, 26, 34, 37]
shopper [6]
actress [6]
lindsay [6]
lohan [6]
further [7]
ventured [7]
fashion [7, 7, 30, 30]
opening [7]
high [7, 8]
boutique [7, 30]
d [7, 30]
s [7, 8, 8, 18, 21, 21, 27, 30, 39, 39, 39]
h [7, 30]
calabasas [7]
california [7]
throughout [8, 18]
early [8]
career [8]
involved [8, 31, 37]
herself [8]
some [8]
profile [8]
relationships [8]
including [8]
brother [8]
rapper [8, 33]
ray [8, 10]
later [8, 19, 31]
singer [8, 37]
nick [8]
lachey [8]
2006 [9]
starred [9]
her [9, 14, 27, 27, 34]
first [9, 27]
reality [9, 13]
television [9, 18, 21]
series [9, 14, 14, 14, 18, 18, 23, 24, 28, 30, 39, 40]
filthy [9]
rich [9]
cattle [9]
drive [9]
february [10, 11]
2007 [10, 12, 13, 15, 16, 31, 31, 34]
home [10]
sex [10, 31]
video [10]
that [10, 13, 14, 14, 14, 16, 18, 21, 24]
years [10]
earlier [10]
leaked [10]
vivid [11, 12, 12]
entertainment [11, 12]
bought [11]
rights [11]
1 [11]
million [11, 12, 18, 25, 40]
released [11]
film [11]
as [11, 13, 23, 28, 31, 36]
superstar [11]
on [11, 13, 15, 16, 24, 26, 26, 27, 27, 29]
21 [11]
sued [12]
ownership [12]
tape [12, 31]
but [12, 19]
dropped [12]
suit [12]
april [12, 24]
settled [12]
5 [12]
august [13]
it [13, 16, 17]
announced [13, 14, 16, 26]
kardashians [13, 15, 18]
jenners [13]
would [13]
star [13]
yet [13]
be [13]
titled [13]
show [13, 16, 27, 27, 28, 29, 29]
e [13, 14, 16, 24]
ryan [13]
seacrest [13, 14, 18]
serving [13]
executive [13, 18]
producer [13]
said [14, 18]
at [14, 25]
heart [14]
despite [14]
catfights [14]
endless [14]
sarcasm [14]
is [14, 14, 27, 29]
family [14, 14, 18, 18, 18, 21, 24, 40]
truly [14]
loves [14]
supports [14]
one [14, 14, 14, 18, 35]
another [14]
familiar [14]
dynamics [14, 18]
this [14, 20]
make [14]
them [14]
hollywood [14, 18]
bunch [14]
sure [14]
entertain [14]
announcement [14]
came [14]
week [14]
paris [14]
hilton [14]
friend [14]
richie [14]
their [14, 21, 22, 33]
popular [14]
entitled [14]
simple [14]
life [14]
ending [14]
keeping [15]
up [15]
premiered [15]
october [15]
14 [15]
november [16]
13 [16]
renewed [16, 17]
its [16]
second [16, 29, 36, 37, 39]
season [16, 17, 26, 26, 35, 36, 39, 40]
following [17, 33, 35]
year [17, 24, 35]
third [17, 37, 39]
lisa [18]
berger [18]
vice [18]
president [18]
original [18]
programming [18, 22]
development [18]
network [18]
viewers [18]
have [18, 18, 23, 30, 38]
embraced [18]
has [18, 22, 29]
become [18]
most [18]
talked [18]
shows [18]
we [18]
are [18]
fortunate [18]
work [18]
bunim [18]
murray [18]
which [18, 34]
an [18, 18, 40]
exceptional [18]
ability [18]
capture [18]
hilarious [18]
chaotic [18]
always [18]
entertaining [18]
personalities [18]
reporter [18]
reported [18]
estimated [18, 25]
65 [18]
2010 [18, 31]
2011 [19, 32]
nba [19]
player [19]
humphries [19, 21, 32]
highly [19]
publicized [19]
wedding [19]
ceremony [19]
filed [19]
divorce [19, 32]
72 [19, 32]
days [19]
caused [20]
backlash [20]
from [20, 22, 27, 31, 32, 33]
public [20]
several [21, 30]
news [21]
outlets [21]
surmised [21]
marriage [21, 27, 32]
merely [21]
publicity [21]
stunt [21]
promote [21]
brand [21]
subsequent [21]
ventures [21]
widely [22]
circulated [22]
petition [22]
remove [22]
all [22]
related [22]
air [22]
followed [22]
since [22, 29, 30]
split [22]
december [23]
2013 [23]
seasons [23, 24, 37, 39]
aired [23]
24 [24]
2012 [24, 33, 36, 37]
signed [24]
three [24, 34]
deal [24, 25]
will [24]
keep [24]
airing [24]
through [24]
seven [24]
nine [24]
40 [25]
january [26]
4 [26]
2015 [26, 26]
tenth [26]
premiere [26, 30]
feb [26]
8 [26]
take [26]
hamptons [26]
finale [26]
revolves [27]
around [27]
originally [27]
focused [27]
deceased [27]
attorney [27]
boyfriend [27]
scott [27]
disick [27]
main [27]
character [27]
progressed [28]
also [28, 29, 38]
recurring [28, 29]
cast [28, 29, 39]
members [28]
husband [29]
1976 [29]
summer [29]
olympics [29]
decathlon [29]
frequently [29]
featured [29]
been [29]
member [29, 39]
began [29]
sisters [30]
established [30, 38]
careers [30, 38]
industry [30, 38]
co [30, 36]
owning [30]
launching [30]
fragrances [30]
clothing [30]
collections [30]
gained [31]
notoriety [31, 34]
subject [31]
relationship [31, 35, 37]
new [31, 32]
orleans [31]
saints [31]
running [31]
back [31]
reggie [31]
bush [31]
march [31]
criticism [32]
filing [32]
jersey [32]
nets [32]
power [32]
forward [32, 35]
day [32]
while [33, 39]
still [33]
pregnant [33]
by [33]
kanye [33]
west [33]
suffering [33]
preeclampsia [33]
gave [33]
birth [33]
prematurely [33]
daughter [33, 39]
north [33]
june [33]
attained [34]
own [34]
right [34]
arrested [34]
driving [34]
under [34]
influence [34]
jailed [34]
approximately [34]
hours [34]
2008 [34]
fourth [35]
los [35]
angeles [35]
lakers [35]
lamar [35]
odom [35]
month [35]
served [36]
host [36]
american [36]
version [36]
x [36]
factor [36]
launched [37]
sock [37]
line [37]
arthur [37]
george [37]
bailon [37]
modeling [38]
eighth [39]
sons [39]
brandon [39, 39]
brody [39]
wife [39]
leah [39]
felder [39, 39]
eagles [39]
band [39]
don [39]
were [39]
integrated [39]
supporting [39]
friends [39]
malika [39]
haqq [39]
jonathan [39]
cheban [39]
joined [39]
earns [40]
alleged [40]
total [40]
10 [40]
per [40]


Note that the document ids are sorted in all postings

Now we'll see how we can use the inverted index to make an efficient conjunctive boolean search, i.e., search for all documents that contain two given terms. This can be easily extended to conjunctive searches that span more than two terms, and to more general boolean searches that include disjunctive search.

The return merge posting will contain all the documents that contain both term1 and term2

In [46]:
def merge_postings(term1,term2):
postings1=inverted_index[term1]
postings2=inverted_index[term2]
merged_posting=[]
i,j=0,0
while i<len(postings1) and j<len(postings2):
if postings1[i]==postings2[j]:
merged_posting.append(postings1[i])
i+=1
j+=1
elif postings1[i]<postings2[j]:
i+=1
else:
j+=1
return merged_posting


All documents that contain both kim and kris:

In [47]:
merge_postings("kim","kris")

Out[47]:
[0, 19, 27]
In [48]:
for doc_idx in merge_postings("kim","kris"):
print (" ".join(docs[doc_idx]))
print()

robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khloé born 1984 and son rob born 1987

in 2011 kim married nba player kris humphries in a highly publicized wedding ceremony but filed for divorce 72 days later

the show revolves around the children of kris jenner and originally focused on her children from her first marriage to deceased attorney robert kardashian kourtney kim khloé and rob kourtney s boyfriend scott disick is a main character on the show



# Efficient ranked retrieval (using cosine similarity)¶

But our IR system is a ranked retrieval system using cosine similarity. Next we'll show how we can use inverted indeces to re-implement it efficiently.

We first need to redefine the inverted indexes to also keep track of term frequencies:

In [49]:
## keep TF together with docs in the postings
docs=sent_words_lower
inverted_index=defaultdict(list)
for doc_idx, doc in enumerate(docs):
for term in doc:
# comment out this one, because now I want :
#        if doc_idx not in inverted_index[term]:
inverted_index[term].append(doc_idx)

inverted_index_tf=defaultdict(list)

for term in inverted_index:
postings=inverted_index[term]
if len(postings)>0:
lastdoc=-1
i=0
while i<len(postings):
tf=0
lastdoc=postings[i]
while i<len(postings) and lastdoc==postings[i]:
tf+=1
lastdoc=postings[i]
i+=1
inverted_index_tf[term].append((lastdoc,tf))


In [52]:
for term in inverted_index_tf.keys():
print (term,inverted_index_tf[term])

robert [(0, 1), (2, 1), (5, 1), (27, 1)]
kardashian [(0, 1), (11, 1), (18, 1), (21, 2), (22, 1), (24, 1), (27, 1), (30, 1)]
1944 [(0, 1)]
2003 [(0, 1), (5, 1)]
and [(0, 4), (2, 1), (4, 2), (6, 2), (7, 1), (8, 1), (11, 1), (12, 1), (13, 1), (14, 3), (18, 5), (20, 1), (21, 1), (24, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 2), (31, 1), (37, 2), (38, 1), (39, 5)]
kristen [(0, 1)]
mary [(0, 1)]
kris [(0, 1), (3, 1), (4, 1), (19, 1), (27, 1), (28, 1), (29, 1), (32, 1)]
houghton [(0, 1)]
born [(0, 5), (3, 1), (4, 2)]
1955 [(0, 1)]
married [(0, 1), (3, 1), (19, 1), (33, 1), (35, 1)]
in [(0, 1), (1, 1), (2, 1), (3, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (12, 1), (13, 2), (19, 2), (30, 1), (31, 2), (32, 1), (33, 1), (34, 3), (36, 1), (37, 3), (38, 1), (39, 2)]
1978 [(0, 1)]
four [(0, 1)]
children [(0, 1), (27, 2), (28, 1)]
together [(0, 1), (4, 1)]
daughters [(0, 1), (4, 1)]
kourtney [(0, 1), (7, 1), (9, 1), (26, 1), (27, 2), (39, 1)]
1979 [(0, 1)]
kim [(0, 1), (6, 1), (7, 1), (8, 1), (10, 1), (11, 1), (12, 1), (19, 1), (27, 1), (31, 1), (33, 1), (39, 1)]
1980 [(0, 1)]
khloé [(0, 1), (7, 1), (26, 1), (27, 1), (34, 1), (39, 1)]
1984 [(0, 1)]
son [(0, 1)]
rob [(0, 1), (27, 1), (37, 1)]
1987 [(0, 1)]
the [(1, 1), (2, 2), (11, 2), (12, 2), (13, 1), (14, 6), (15, 1), (16, 1), (17, 1), (18, 6), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 2), (27, 3), (28, 2), (29, 2), (30, 4), (31, 1), (33, 1), (34, 1), (35, 2), (36, 3), (37, 2), (38, 1), (39, 4), (40, 2)]
couple [(1, 1)]
divorced [(1, 1)]
1991 [(1, 1), (3, 1)]
1994 [(2, 1)]
entered [(2, 1)]
media [(2, 1), (20, 1)]
spotlight [(2, 1)]
when [(2, 1)]
he [(2, 1)]
defended [(2, 1)]
o [(2, 1)]
j [(2, 1), (8, 1), (10, 1)]
simpson [(2, 2)]
for [(2, 1), (11, 1), (12, 2), (16, 1), (17, 1), (18, 1), (19, 1), (32, 1), (34, 3)]
murders [(2, 1)]
of [(2, 1), (12, 1), (14, 2), (18, 2), (23, 2), (27, 1), (28, 1), (31, 1), (36, 2), (39, 1), (40, 2)]
nicole [(2, 1), (14, 1)]
brown [(2, 1)]
ronald [(2, 1)]
goldman [(2, 1)]
during [(2, 1), (35, 1), (36, 1)]
a [(2, 1), (6, 3), (7, 2), (10, 1), (13, 1), (14, 1), (17, 1), (19, 1), (21, 1), (22, 1), (24, 1), (26, 1), (27, 1), (29, 1), (30, 1), (31, 2), (32, 1), (35, 1), (36, 1), (37, 1)]
lengthy [(2, 1)]
trial [(2, 1)]
former [(3, 1)]
olympic [(3, 1)]
champion [(3, 1), (29, 1)]
bruce [(3, 1), (4, 1), (29, 1), (39, 1)]
jenner [(3, 1), (27, 1), (29, 1), (39, 1)]
1949 [(3, 1)]
two [(4, 1)]
kendall [(4, 1), (28, 1), (38, 1)]
1995 [(4, 1)]
kylie [(4, 1), (28, 1), (38, 1)]
1997 [(4, 1)]
died [(5, 1)]
eight [(5, 1), (23, 1), (24, 1)]
weeks [(5, 1)]
after [(5, 1), (14, 1), (26, 1), (32, 2), (33, 1), (34, 1), (35, 1)]
being [(5, 1), (34, 1)]
diagnosed [(5, 1)]
with [(5, 1), (10, 1), (12, 1), (13, 1), (15, 1), (18, 1), (24, 1), (31, 1), (37, 1)]
esophageal [(5, 1)]
cancer [(5, 1)]
2004 [(6, 1)]
became [(6, 1), (28, 1), (31, 1), (33, 1)]
personal [(6, 2)]
stylist [(6, 3)]
to [(6, 2), (13, 1), (14, 1), (18, 2), (21, 2), (22, 1), (26, 1), (27, 1), (33, 1)]
recording [(6, 1)]
artist [(6, 1)]
brandy [(6, 1)]
norwood [(6, 1), (8, 1)]
she [(6, 1), (8, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]
eventually [(6, 1)]
developed [(6, 1)]
into [(6, 1), (7, 1), (39, 1)]
full [(6, 1)]
time [(6, 1)]
was [(6, 1), (10, 1), (13, 1), (14, 1), (16, 1), (17, 1), (21, 1), (25, 1), (26, 1), (34, 1), (37, 1)]
shopper [(6, 1)]
actress [(6, 1)]
lindsay [(6, 1)]
lohan [(6, 1)]
further [(7, 1)]
ventured [(7, 1)]
fashion [(7, 2), (30, 2)]
opening [(7, 1)]
high [(7, 1), (8, 1)]
boutique [(7, 1), (30, 1)]
d [(7, 1), (30, 1)]
s [(7, 1), (8, 2), (18, 1), (21, 2), (27, 1), (30, 1), (39, 3)]
h [(7, 1), (30, 1)]
calabasas [(7, 1)]
california [(7, 1)]
throughout [(8, 1), (18, 1)]
early [(8, 1)]
career [(8, 1)]
involved [(8, 1), (31, 1), (37, 1)]
herself [(8, 1)]
some [(8, 1)]
profile [(8, 1)]
relationships [(8, 1)]
including [(8, 1)]
brother [(8, 1)]
rapper [(8, 1), (33, 1)]
ray [(8, 1), (10, 1)]
later [(8, 1), (19, 1), (31, 1)]
singer [(8, 1), (37, 1)]
nick [(8, 1)]
lachey [(8, 1)]
2006 [(9, 1)]
starred [(9, 1)]
her [(9, 1), (14, 1), (27, 2), (34, 1)]
first [(9, 1), (27, 1)]
reality [(9, 1), (13, 1)]
television [(9, 1), (18, 1), (21, 1)]
series [(9, 1), (14, 3), (18, 2), (23, 1), (24, 1), (28, 1), (30, 1), (39, 1), (40, 1)]
filthy [(9, 1)]
rich [(9, 1)]
cattle [(9, 1)]
drive [(9, 1)]
february [(10, 1), (11, 1)]
2007 [(10, 1), (12, 1), (13, 1), (15, 1), (16, 1), (31, 2), (34, 1)]
home [(10, 1)]
sex [(10, 1), (31, 1)]
video [(10, 1)]
that [(10, 1), (13, 1), (14, 3), (16, 1), (18, 1), (21, 1), (24, 1)]
years [(10, 1)]
earlier [(10, 1)]
leaked [(10, 1)]
vivid [(11, 1), (12, 2)]
entertainment [(11, 1), (12, 1)]
bought [(11, 1)]
rights [(11, 1)]
1 [(11, 1)]
million [(11, 1), (12, 1), (18, 1), (25, 1), (40, 1)]
released [(11, 1)]
film [(11, 1)]
as [(11, 1), (13, 1), (23, 1), (28, 1), (31, 1), (36, 1)]
superstar [(11, 1)]
on [(11, 1), (13, 1), (15, 1), (16, 1), (24, 1), (26, 2), (27, 2), (29, 1)]
21 [(11, 1)]
sued [(12, 1)]
ownership [(12, 1)]
tape [(12, 1), (31, 1)]
but [(12, 1), (19, 1)]
dropped [(12, 1)]
suit [(12, 1)]
april [(12, 1), (24, 1)]
settled [(12, 1)]
5 [(12, 1)]
august [(13, 1)]
it [(13, 1), (16, 1), (17, 1)]
announced [(13, 1), (14, 1), (16, 1), (26, 1)]
kardashians [(13, 1), (15, 1), (18, 1)]
jenners [(13, 1)]
would [(13, 1)]
star [(13, 1)]
yet [(13, 1)]
be [(13, 1)]
titled [(13, 1)]
show [(13, 1), (16, 1), (27, 2), (28, 1), (29, 2)]
e [(13, 1), (14, 1), (16, 1), (24, 1)]
ryan [(13, 1)]
seacrest [(13, 1), (14, 1), (18, 1)]
serving [(13, 1)]
executive [(13, 1), (18, 1)]
producer [(13, 1)]
said [(14, 1), (18, 1)]
at [(14, 1), (25, 1)]
heart [(14, 1)]
despite [(14, 1)]
catfights [(14, 1)]
endless [(14, 1)]
sarcasm [(14, 1)]
is [(14, 2), (27, 1), (29, 1)]
family [(14, 2), (18, 3), (21, 1), (24, 1), (40, 1)]
truly [(14, 1)]
loves [(14, 1)]
supports [(14, 1)]
one [(14, 3), (18, 1), (35, 1)]
another [(14, 1)]
familiar [(14, 1)]
dynamics [(14, 1), (18, 1)]
this [(14, 1), (20, 1)]
make [(14, 1)]
them [(14, 1)]
hollywood [(14, 1), (18, 1)]
bunch [(14, 1)]
sure [(14, 1)]
entertain [(14, 1)]
announcement [(14, 1)]
came [(14, 1)]
week [(14, 1)]
paris [(14, 1)]
hilton [(14, 1)]
friend [(14, 1)]
richie [(14, 1)]
their [(14, 1), (21, 1), (22, 1), (33, 1)]
popular [(14, 1)]
entitled [(14, 1)]
simple [(14, 1)]
life [(14, 1)]
ending [(14, 1)]
keeping [(15, 1)]
up [(15, 1)]
premiered [(15, 1)]
october [(15, 1)]
14 [(15, 1)]
november [(16, 1)]
13 [(16, 1)]
renewed [(16, 1), (17, 1)]
its [(16, 1)]
second [(16, 1), (29, 1), (36, 1), (37, 1), (39, 1)]
season [(16, 1), (17, 1), (26, 2), (35, 1), (36, 1), (39, 1), (40, 1)]
following [(17, 1), (33, 1), (35, 1)]
year [(17, 1), (24, 1), (35, 1)]
third [(17, 1), (37, 1), (39, 1)]
lisa [(18, 1)]
berger [(18, 1)]
vice [(18, 1)]
president [(18, 1)]
original [(18, 1)]
programming [(18, 1), (22, 1)]
development [(18, 1)]
network [(18, 1)]
viewers [(18, 1)]
have [(18, 2), (23, 1), (30, 1), (38, 1)]
embraced [(18, 1)]
has [(18, 1), (22, 1), (29, 1)]
become [(18, 1)]
most [(18, 1)]
talked [(18, 1)]
shows [(18, 1)]
we [(18, 1)]
are [(18, 1)]
fortunate [(18, 1)]
work [(18, 1)]
bunim [(18, 1)]
murray [(18, 1)]
which [(18, 1), (34, 1)]
an [(18, 2), (40, 1)]
exceptional [(18, 1)]
ability [(18, 1)]
capture [(18, 1)]
hilarious [(18, 1)]
chaotic [(18, 1)]
always [(18, 1)]
entertaining [(18, 1)]
personalities [(18, 1)]
reporter [(18, 1)]
reported [(18, 1)]
estimated [(18, 1), (25, 1)]
65 [(18, 1)]
2010 [(18, 1), (31, 1)]
2011 [(19, 1), (32, 1)]
nba [(19, 1)]
player [(19, 1)]
humphries [(19, 1), (21, 1), (32, 1)]
highly [(19, 1)]
publicized [(19, 1)]
wedding [(19, 1)]
ceremony [(19, 1)]
filed [(19, 1)]
divorce [(19, 1), (32, 1)]
72 [(19, 1), (32, 1)]
days [(19, 1)]
caused [(20, 1)]
backlash [(20, 1)]
from [(20, 1), (22, 1), (27, 1), (31, 1), (32, 1), (33, 1)]
public [(20, 1)]
several [(21, 1), (30, 1)]
news [(21, 1)]
outlets [(21, 1)]
surmised [(21, 1)]
marriage [(21, 1), (27, 1), (32, 1)]
merely [(21, 1)]
publicity [(21, 1)]
stunt [(21, 1)]
promote [(21, 1)]
brand [(21, 1)]
subsequent [(21, 1)]
ventures [(21, 1)]
widely [(22, 1)]
circulated [(22, 1)]
petition [(22, 1)]
remove [(22, 1)]
all [(22, 1)]
related [(22, 1)]
air [(22, 1)]
followed [(22, 1)]
since [(22, 1), (29, 1), (30, 1)]
split [(22, 1)]
december [(23, 1)]
2013 [(23, 1)]
seasons [(23, 1), (24, 1), (37, 1), (39, 1)]
aired [(23, 1)]
24 [(24, 1)]
2012 [(24, 1), (33, 1), (36, 1), (37, 1)]
signed [(24, 1)]
three [(24, 1), (34, 1)]
deal [(24, 1), (25, 1)]
will [(24, 1)]
keep [(24, 1)]
airing [(24, 1)]
through [(24, 1)]
seven [(24, 1)]
nine [(24, 1)]
40 [(25, 1)]
january [(26, 1)]
4 [(26, 1)]
2015 [(26, 2)]
tenth [(26, 1)]
premiere [(26, 1), (30, 1)]
feb [(26, 1)]
8 [(26, 1)]
take [(26, 1)]
hamptons [(26, 1)]
finale [(26, 1)]
revolves [(27, 1)]
around [(27, 1)]
originally [(27, 1)]
focused [(27, 1)]
deceased [(27, 1)]
attorney [(27, 1)]
boyfriend [(27, 1)]
scott [(27, 1)]
disick [(27, 1)]
main [(27, 1)]
character [(27, 1)]
progressed [(28, 1)]
also [(28, 1), (29, 1), (38, 1)]
recurring [(28, 1), (29, 1)]
cast [(28, 1), (29, 1), (39, 1)]
members [(28, 1)]
husband [(29, 1)]
1976 [(29, 1)]
summer [(29, 1)]
olympics [(29, 1)]
decathlon [(29, 1)]
frequently [(29, 1)]
featured [(29, 1)]
been [(29, 1)]
member [(29, 1), (39, 1)]
began [(29, 1)]
sisters [(30, 1)]
established [(30, 1), (38, 1)]
careers [(30, 1), (38, 1)]
industry [(30, 1), (38, 1)]
co [(30, 1), (36, 1)]
owning [(30, 1)]
launching [(30, 1)]
fragrances [(30, 1)]
clothing [(30, 1)]
collections [(30, 1)]
gained [(31, 1)]
notoriety [(31, 1), (34, 1)]
subject [(31, 1)]
relationship [(31, 1), (35, 1), (37, 1)]
new [(31, 1), (32, 1)]
orleans [(31, 1)]
saints [(31, 1)]
running [(31, 1)]
back [(31, 1)]
reggie [(31, 1)]
bush [(31, 1)]
march [(31, 1)]
criticism [(32, 1)]
filing [(32, 1)]
jersey [(32, 1)]
nets [(32, 1)]
power [(32, 1)]
forward [(32, 1), (35, 1)]
day [(32, 1)]
while [(33, 1), (39, 1)]
still [(33, 1)]
pregnant [(33, 1)]
by [(33, 1)]
kanye [(33, 1)]
west [(33, 1)]
suffering [(33, 1)]
preeclampsia [(33, 1)]
gave [(33, 1)]
birth [(33, 1)]
prematurely [(33, 1)]
daughter [(33, 1), (39, 1)]
north [(33, 1)]
june [(33, 1)]
attained [(34, 1)]
own [(34, 1)]
right [(34, 1)]
arrested [(34, 1)]
driving [(34, 1)]
under [(34, 1)]
influence [(34, 1)]
jailed [(34, 1)]
approximately [(34, 1)]
hours [(34, 1)]
2008 [(34, 1)]
fourth [(35, 1)]
los [(35, 1)]
angeles [(35, 1)]
lakers [(35, 1)]
lamar [(35, 1)]
odom [(35, 1)]
month [(35, 1)]
served [(36, 1)]
host [(36, 1)]
american [(36, 1)]
version [(36, 1)]
x [(36, 1)]
factor [(36, 1)]
launched [(37, 1)]
sock [(37, 1)]
line [(37, 1)]
arthur [(37, 1)]
george [(37, 1)]
bailon [(37, 1)]
modeling [(38, 1)]
eighth [(39, 1)]
sons [(39, 1)]
brandon [(39, 2)]
brody [(39, 1)]
wife [(39, 1)]
leah [(39, 1)]
felder [(39, 2)]
eagles [(39, 1)]
band [(39, 1)]
don [(39, 1)]
were [(39, 1)]
integrated [(39, 1)]
supporting [(39, 1)]
friends [(39, 1)]
malika [(39, 1)]
haqq [(39, 1)]
jonathan [(39, 1)]
cheban [(39, 1)]
joined [(39, 1)]
earns [(40, 1)]
alleged [(40, 1)]
total [(40, 1)]
10 [(40, 1)]
per [(40, 1)]


quick check: we have 3 documents containing "born", with 5, 1 and 2 ocurances of the term:

In [53]:
inverted_index_tf["born"]

Out[53]:
[(0, 5), (3, 1), (4, 2)]
In [54]:
docs[0].count("born")

Out[54]:
5
In [55]:
" ".join(docs[0])

Out[55]:
'robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khloé born 1984 and son rob born 1987'

### Cosine similarity with inverted index (with some shortcuts)¶

Now we define a method that searches for a query efficiently using the inverted index. Our goal is to get the same results as before, but much more efficiently. The main intuition is to use the indexes to go only through the douments that have a chance to be relevant.

Note that some of the implementation details are inefficient (efficient implementations are part of your assigment)

In [56]:
## let's use the TF-IDF weighting
def docweight(queryterm, tf):
return tf*IDF[queryterm]

doc_norms={}
for doc_idx,doc in enumerate(docs):
doc_norms[doc_idx]=norm(doc_to_vec(doc))  ## this can be done more efficiently, but I only do it once, so i don't care

def run_search_on_index(query):
query_words = word_splitter.findall(query)
query_words = [w.lower() for w in query_words]
query_weights_dict={}  ## precomputed nonzero query  weights
large_vector=query_to_vec(query_words)
for t in large_vector:
if large_vector[t]>0:
query_weights_dict[t]=large_vector[t]
##this can be done more efficiently

## these are the score accumulators
## they will accumulate accumulate TF-IDF weight products (w_iq * w_ij)
doc_scores=defaultdict(int)

## since keys of our index are query words, we will iterate through those
for queryterm in query_words:
postings=inverted_index_tf[queryterm]
for doc_idx,tf in postings:   ## only touching docs that are involved in the query
weight_update=query_weights_dict[queryterm]*docweight(queryterm,tf)
doc_scores[doc_idx]+=weight_update  ## accumulate TF-IDF updates here
##(at the end of the outer loop, for each document that has any of these terms
##  we will have accumulated all TF-IDF weights that are needed to compute the
## dot-product part of the cosine similarity score

for d in doc_scores:
doc_scores[d]=doc_scores[d]/(float(doc_norms[d])*float(norm(query_weights_dict)))
## normalization part of the cosine similarity score
## same ORDER if you remove the query norm (every document is divided by it)

doc_idx_scores = sorted(doc_scores.items(), key=itemgetter(1), reverse=True)  ##further optimization possible
doc_scores = [(docs[doc_idx], score)
for doc_idx, score in doc_idx_scores
if score > 0]

joined_sents = [(" ".join(sent), score) for sent, score in doc_scores]
return joined_sents


Now let's see if we get the same results as with the non-efficient calculation of cosine similarity:

print_results(run_search_on_index("the olympic champion in kardashians"),relevant_docs=relevant_docs_champion)

We do (with the same scores):

In [57]:
print_results(run_search("the olympic champion in kardashians",cos_measure),relevant_docs=relevant_docs_champion)

1 !!! 0.52 kris married former olympic champion bruce jenner born 1949 in 1991

2 !!! 0.10 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began

3     0.08 keeping up with the kardashians premiered on october 14 2007

4     0.06 in august 2007 it was announced that the kardashians and jenners would star in a yet to be titled reality show on e with ryan seacrest serving as executive producer

5     0.03 lisa berger executive vice president of original programming and series development for the network said viewers have embraced the kardashian family and the series has become one of television s most talked about shows we are fortunate to work with seacrest and bunim murray which have an exceptional ability to capture the kardashians hilarious chaotic and always entertaining personalities and family dynamics the hollywood reporter reported that the family made an estimated 65 million throughout 2010



And it is ~100 times more efficient (on bigger collection, this efficiency gain is much bigger)

In [58]:
%timeit a=run_search_on_index("the olympic champion in kardashians")

291 µs ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [59]:
%timeit a=run_search("the olympic champion in kardashians",cos_measure)

21.2 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can make this even more efficient if we realize that terms with very low idf scores will not affect the similarity score that much but will incur a high efficiency cost. We will approximate the cos_similarity measure by removing terms that do not make it past a certain trashold.

We only add one line to the our "run_search_on_index" function:

In [60]:
## Only consider terms that have nontrivial IDF

def run_search_on_index(query):
query_words = word_splitter.findall(query)
query_words = [w.lower() for w in query_words]
query_weights_dict={}  ## precomputed nonzero query  weights
large_vector=query_to_vec(query_words)
for t in large_vector:
if large_vector[t]>0:
query_weights_dict[t]=large_vector[t]

doc_scores=defaultdict(int)
for queryterm in query_words:
if IDF[queryterm]>0.05:
# DOUBLE BENEFITS: 1) going through fewer postings, 2) we can cut these from the indexes too
postings=inverted_index_tf[queryterm]
for doc_idx,tf in postings:
weight_update=query_weights_dict[queryterm]*docweight(queryterm,tf)
doc_scores[doc_idx]+=weight_update

for d in doc_scores:
doc_scores[d]=doc_scores[d]/(float(doc_norms[d])*float(norm(query_weights_dict)))  ##same ORDER if you remove the last norm

doc_idx_scores = sorted(doc_scores.items(), key=itemgetter(1), reverse=True)  ##further optimization possible
doc_scores = [(docs[doc_idx], score)
for doc_idx, score in doc_idx_scores
if score > 0]

joined_sents = [(" ".join(sent), score) for sent, score in doc_scores]
return joined_sents


Note that ordering stayed the same, cosine similarity scores changed just slightly, but we got further improvement in efficiency.

In [61]:
print_results(run_search_on_index("the olympic champion in kardashians"),relevant_docs=relevant_docs_champion)

1 !!! 0.51 kris married former olympic champion bruce jenner born 1949 in 1991

2 !!! 0.10 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began

3     0.08 keeping up with the kardashians premiered on october 14 2007

4     0.05 in august 2007 it was announced that the kardashians and jenners would star in a yet to be titled reality show on e with ryan seacrest serving as executive producer

5     0.03 lisa berger executive vice president of original programming and series development for the network said viewers have embraced the kardashian family and the series has become one of television s most talked about shows we are fortunate to work with seacrest and bunim murray which have an exceptional ability to capture the kardashians hilarious chaotic and always entertaining personalities and family dynamics the hollywood reporter reported that the family made an estimated 65 million throughout 2010


In [62]:
%timeit a=run_search_on_index("the olympic champion in kardashians")

185 µs ± 5.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


This boost in efficiency becomes even more important when we have large collections of documents: On the tv transcript collection from Assgment 1 (containing about 40,000 sentences) running a single query using the old cosine similarity method takes 11 minutes. After this optimization step, running a query on that same collection takes only 8 miliseconds: that is 80,000 times faster!

Methodological note: To select the IDF treshhold (0.05 in this case) we can look at the IDF distribution in combination with the words that fall in different IDF buckets

In [67]:
plt.hist(IDF.values())
plt.xlabel("IDF")
plt.ylabel("Number of terms with the respective IDF")

Out[67]:
Text(0,0.5,'Number of terms with the respective IDF')
In [68]:
for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[:10]:
print(IDF_t)

print("...")

for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[-10:]:
print(IDF_t)

('the', 0.03225806451612903)
('and', 0.041666666666666664)
('in', 0.043478260869565216)
('a', 0.047619047619047616)
('kim', 0.07692307692307693)
('of', 0.08333333333333333)
('was', 0.08333333333333333)
('for', 0.1)
('series', 0.1)
('to', 0.1)
...
('west', 0.5)
('when', 0.5)
('widely', 0.5)
('wife', 0.5)
('will', 0.5)
('work', 0.5)
('would', 0.5)
('x', 0.5)
('years', 0.5)
('yet', 0.5)