Info/CS 4300: Language and Information - in-class demo

Proto Information Retrieval System

We're going to build a very basic proto-IR system from scratch. In the first part we will preprocess the "documents" (in this case sentences from a Wikipedia article); in the second part we will implement a method for searching for the set of documents which are closest to a given query.

Basic text processing: sentences, types, tokens, similarity

In [1]:
from __future__ import print_function
In [2]:
import re
In [3]:
# Copied and pasted background info from the show's Wikipedia page
# (http://en.wikipedia.org/wiki/Keeping_Up_with_the_Kardashians)

background = u"""
Robert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987). The couple divorced in 1991.[3] In 1994, Robert entered the media spotlight when he defended O.J. Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial. Kris married former Olympic champion Bruce Jenner (born 1949) in 1991. Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997). Robert died in 2003 eight weeks after being diagnosed with esophageal cancer.[4] In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan.[5] Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California. Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey.[5] In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive.[6] In February 2007, a home sex video that Kim made with Ray J years earlier was leaked.[7] Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21.[7] Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for $5 million.[8]

In August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer.[6] Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [...] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain." The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending.[6] Keeping Up With the Kardashians premiered on October 14, 2007.[9]

On November 13, 2007, it was announced that E! renewed the show for its second season.[9] The following year, it was renewed for a third season. Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television's most-talked-about shows [...] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics."[10] The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010.[3]

In 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony,[11] but filed for divorce 72 days later.[12] This caused widespread backlash from the public and media.[13] Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures.[14] A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split.[15] As of December 2013, eight seasons of the series have aired. On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine.[16] The deal was estimated at $40 million.[17][18] On January. 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale.

The show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show. As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show. Kris' second husband[19] 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began.
Since the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections.
Kim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010.[20] In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage. In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June.
Khloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008. The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship. In 2012, she served as a co-host during the second season of the American version of The X Factor.
Rob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons.
Kendall and Kylie have also established careers in the modeling industry.
In the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons.
The family earns an alleged total of $10 million per season of the series.[21]
"""

The simplest way to identify individual words is to split the string on whitespace.

In [4]:
background.split()
Out[4]:
[u'Robert',
 u'Kardashian',
 u'(1944\u20142003)',
 u'and',
 u'Kristen',
 u'Mary',
 u'"Kris"',
 u'Houghton',
 u'(born',
 u'1955)',
 u'married',
 u'in',
 u'1978',
 u'and',
 u'had',
 u'four',
 u'children',
 u'together,',
 u'daughters',
 u'Kourtney',
 u'(born',
 u'1979),',
 u'Kim',
 u'(born',
 u'1980),',
 u'and',
 u'Khlo\xe9,',
 u'(born',
 u'1984),',
 u'and',
 u'son',
 u'Rob',
 u'(born',
 u'1987).',
 u'The',
 u'couple',
 u'divorced',
 u'in',
 u'1991.[3]',
 u'In',
 u'1994,',
 u'Robert',
 u'entered',
 u'the',
 u'media',
 u'spotlight',
 u'when',
 u'he',
 u'defended',
 u'O.J.',
 u'Simpson',
 u'for',
 u'the',
 u'murders',
 u'of',
 u'Nicole',
 u'Brown',
 u'Simpson',
 u'and',
 u'Ronald',
 u'Goldman',
 u'during',
 u'a',
 u'lengthy',
 u'trial.',
 u'Kris',
 u'married',
 u'former',
 u'Olympic',
 u'champion',
 u'Bruce',
 u'Jenner',
 u'(born',
 u'1949)',
 u'in',
 u'1991.',
 u'Bruce',
 u'and',
 u'Kris',
 u'had',
 u'two',
 u'daughters',
 u'together,',
 u'Kendall',
 u'(born',
 u'1995)',
 u'and',
 u'Kylie',
 u'(born',
 u'1997).',
 u'Robert',
 u'died',
 u'in',
 u'2003',
 u'eight',
 u'weeks',
 u'after',
 u'being',
 u'diagnosed',
 u'with',
 u'esophageal',
 u'cancer.[4]',
 u'In',
 u'2004,',
 u'Kim',
 u'became',
 u'a',
 u'personal',
 u'stylist',
 u'to',
 u'recording',
 u'artist',
 u'Brandy',
 u'Norwood;',
 u'she',
 u'eventually',
 u'developed',
 u'into',
 u'a',
 u'full-time',
 u'stylist,',
 u'and',
 u'was',
 u'a',
 u'personal',
 u'shopper',
 u'and',
 u'stylist',
 u'to',
 u'actress',
 u'Lindsay',
 u'Lohan.[5]',
 u'Khlo\xe9,',
 u'Kim',
 u'and',
 u'Kourtney',
 u'further',
 u'ventured',
 u'into',
 u'fashion,',
 u'opening',
 u'a',
 u'high',
 u'fashion',
 u'boutique',
 u'D-A-S-H',
 u'in',
 u'Calabasas,',
 u'California.',
 u'Throughout',
 u"Kim's",
 u'early',
 u'career,',
 u'she',
 u'involved',
 u'herself',
 u'in',
 u'some',
 u'high-profile',
 u'relationships',
 u'including',
 u"Norwood's",
 u'brother,',
 u'rapper',
 u'Ray',
 u'J',
 u'and',
 u'later,',
 u'singer',
 u'Nick',
 u'Lachey.[5]',
 u'In',
 u'2006,',
 u'Kourtney',
 u'starred',
 u'in',
 u'her',
 u'first',
 u'reality',
 u'television',
 u'series,',
 u'Filthy',
 u'Rich:',
 u'Cattle',
 u'Drive.[6]',
 u'In',
 u'February',
 u'2007,',
 u'a',
 u'home',
 u'sex',
 u'video',
 u'that',
 u'Kim',
 u'made',
 u'with',
 u'Ray',
 u'J',
 u'years',
 u'earlier',
 u'was',
 u'leaked.[7]',
 u'Vivid',
 u'Entertainment',
 u'bought',
 u'the',
 u'rights',
 u'for',
 u'$1',
 u'million',
 u'and',
 u'released',
 u'the',
 u'film',
 u'as',
 u'Kim',
 u'Kardashian:',
 u'Superstar',
 u'on',
 u'February',
 u'21.[7]',
 u'Kim',
 u'sued',
 u'Vivid',
 u'for',
 u'ownership',
 u'of',
 u'the',
 u'tape,',
 u'but',
 u'dropped',
 u'the',
 u'suit',
 u'in',
 u'April',
 u'2007',
 u'and',
 u'settled',
 u'with',
 u'Vivid',
 u'Entertainment',
 u'for',
 u'$5',
 u'million.[8]',
 u'In',
 u'August',
 u'2007,',
 u'it',
 u'was',
 u'announced',
 u'that',
 u'the',
 u'Kardashians',
 u'and',
 u'Jenners',
 u'would',
 u'star',
 u'in',
 u'a',
 u'yet-to-be-titled',
 u'reality',
 u'show',
 u'on',
 u'E!,',
 u'with',
 u'Ryan',
 u'Seacrest',
 u'serving',
 u'as',
 u'executive',
 u'producer.[6]',
 u'Seacrest',
 u'said',
 u'"At',
 u'the',
 u'heart',
 u'of',
 u'the',
 u'series\u2014despite',
 u'the',
 u'catfights',
 u'and',
 u'endless',
 u'sarcasm\u2014is',
 u'a',
 u'family',
 u'that',
 u'truly',
 u'loves',
 u'and',
 u'supports',
 u'one',
 u'another',
 u'[...]',
 u'The',
 u'familiar',
 u'dynamics',
 u'of',
 u'this',
 u'family',
 u'make',
 u'them',
 u'one',
 u'Hollywood',
 u'bunch',
 u'that',
 u'is',
 u'sure',
 u'to',
 u'entertain."',
 u'The',
 u'series',
 u'announcement',
 u'came',
 u'one',
 u'week',
 u'after',
 u'Paris',
 u'Hilton',
 u'and',
 u'her',
 u'friend',
 u'Nicole',
 u'Richie',
 u'announced',
 u'that',
 u'their',
 u'popular',
 u'E!',
 u'series',
 u'entitled',
 u'The',
 u'Simple',
 u'Life',
 u'was',
 u'ending.[6]',
 u'Keeping',
 u'Up',
 u'With',
 u'the',
 u'Kardashians',
 u'premiered',
 u'on',
 u'October',
 u'14,',
 u'2007.[9]',
 u'On',
 u'November',
 u'13,',
 u'2007,',
 u'it',
 u'was',
 u'announced',
 u'that',
 u'E!',
 u'renewed',
 u'the',
 u'show',
 u'for',
 u'its',
 u'second',
 u'season.[9]',
 u'The',
 u'following',
 u'year,',
 u'it',
 u'was',
 u'renewed',
 u'for',
 u'a',
 u'third',
 u'season.',
 u'Lisa',
 u'Berger,',
 u'executive',
 u'vice',
 u'president',
 u'of',
 u'original',
 u'programming',
 u'and',
 u'series',
 u'development',
 u'for',
 u'the',
 u'network,',
 u'said',
 u'"viewers',
 u'have',
 u'embraced',
 u'the',
 u'Kardashian',
 u'family',
 u'and',
 u'the',
 u'series',
 u'has',
 u'become',
 u'one',
 u'of',
 u"television's",
 u'most-talked-about',
 u'shows',
 u'[...]',
 u'We',
 u'are',
 u'fortunate',
 u'to',
 u'work',
 u'with',
 u'Seacrest',
 u'and',
 u'Bunim-Murray,',
 u'which',
 u'have',
 u'an',
 u'exceptional',
 u'ability',
 u'to',
 u'capture',
 u'the',
 u"Kardashians'",
 u'hilarious,',
 u'chaotic',
 u'and',
 u'always',
 u'entertaining',
 u'personalities',
 u'and',
 u'family',
 u'dynamics."[10]',
 u'The',
 u'Hollywood',
 u'Reporter',
 u'reported',
 u'that',
 u'the',
 u'family',
 u'made',
 u'an',
 u'estimated',
 u'$65',
 u'million',
 u'throughout',
 u'2010.[3]',
 u'In',
 u'2011,',
 u'Kim',
 u'married',
 u'NBA',
 u'player',
 u'Kris',
 u'Humphries',
 u'in',
 u'a',
 u'highly',
 u'publicized',
 u'wedding',
 u'ceremony,[11]',
 u'but',
 u'filed',
 u'for',
 u'divorce',
 u'72',
 u'days',
 u'later.[12]',
 u'This',
 u'caused',
 u'widespread',
 u'backlash',
 u'from',
 u'the',
 u'public',
 u'and',
 u'media.[13]',
 u'Several',
 u'news',
 u'outlets',
 u'surmised',
 u'that',
 u"Kardashian's",
 u'marriage',
 u'to',
 u'Humphries',
 u'was',
 u'merely',
 u'a',
 u'publicity',
 u'stunt,',
 u'to',
 u'promote',
 u'the',
 u'Kardashian',
 u"family's",
 u'brand',
 u'and',
 u'their',
 u'subsequent',
 u'television',
 u'ventures.[14]',
 u'A',
 u'widely',
 u'circulated',
 u'petition',
 u'asking',
 u'to',
 u'remove',
 u'all',
 u'Kardashian-related',
 u'programming',
 u'from',
 u'the',
 u'air',
 u'has',
 u'followed',
 u'since',
 u'their',
 u'split.[15]',
 u'As',
 u'of',
 u'December',
 u'2013,',
 u'eight',
 u'seasons',
 u'of',
 u'the',
 u'series',
 u'have',
 u'aired.',
 u'On',
 u'April',
 u'24,',
 u'2012,',
 u'E!',
 u'signed',
 u'a',
 u'three-year',
 u'deal',
 u'with',
 u'the',
 u'Kardashian',
 u'family',
 u'that',
 u'will',
 u'keep',
 u'the',
 u'series',
 u'airing',
 u'through',
 u'seasons',
 u'seven,',
 u'eight',
 u'and',
 u'nine.[16]',
 u'The',
 u'deal',
 u'was',
 u'estimated',
 u'at',
 u'$40',
 u'million.[17][18]',
 u'On',
 u'January.',
 u'4,',
 u'2015,',
 u'a',
 u'tenth',
 u'season',
 u'was',
 u'announced',
 u'to',
 u'premiere',
 u'on',
 u'Feb',
 u'8,',
 u'2015',
 u'after',
 u'the',
 u'Kourtney',
 u'and',
 u'Khlo\xe9',
 u'Take',
 u'The',
 u'Hamptons',
 u'season',
 u'finale.',
 u'The',
 u'show',
 u'revolves',
 u'around',
 u'the',
 u'children',
 u'of',
 u'Kris',
 u'Jenner,',
 u'and',
 u'originally',
 u'focused',
 u'on',
 u'her',
 u'children',
 u'from',
 u'her',
 u'first',
 u'marriage',
 u'to',
 u'deceased',
 u'attorney',
 u'Robert',
 u'Kardashian:',
 u'Kourtney,',
 u'Kim,',
 u'Khlo\xe9,',
 u'and',
 u'Rob',
 u"Kourtney's",
 u'boyfriend',
 u'Scott',
 u'Disick',
 u'is',
 u'a',
 u'main',
 u'character',
 u'on',
 u'the',
 u'show.',
 u'As',
 u'the',
 u'series',
 u'progressed,',
 u"Kris'",
 u'children',
 u'Kendall',
 u'and',
 u'Kylie',
 u'also',
 u'became',
 u'recurring',
 u'cast',
 u'members',
 u'of',
 u'the',
 u'show.',
 u"Kris'",
 u'second',
 u'husband[19]',
 u'1976',
 u'Summer',
 u'Olympics',
 u'decathlon',
 u'champion',
 u'Bruce',
 u'Jenner,',
 u'is',
 u'also',
 u'frequently',
 u'featured',
 u'on',
 u'the',
 u'show,',
 u'and',
 u'has',
 u'been',
 u'a',
 u'recurring',
 u'cast',
 u'member',
 u'since',
 u'the',
 u'show',
 u'began.',
 u'Since',
 u'the',
 u"series'",
 u'premiere,',
 u'the',
 u'Kardashian',
 u'sisters',
 u'have',
 u'established',
 u'careers',
 u'in',
 u'the',
 u'fashion',
 u'industry,',
 u'co-owning',
 u'the',
 u'fashion',
 u'boutique',
 u'D-A-S-H',
 u'and',
 u'launching',
 u'several',
 u'fragrances',
 u'and',
 u'clothing',
 u'collections.',
 u'Kim',
 u'gained',
 u'notoriety',
 u'as',
 u'the',
 u'subject',
 u'of',
 u'a',
 u'sex',
 u'tape',
 u'in',
 u'2007,',
 u'and',
 u'later',
 u'became',
 u'involved',
 u'in',
 u'a',
 u'relationship',
 u'with',
 u'New',
 u'Orleans',
 u'Saints',
 u'running',
 u'back',
 u'Reggie',
 u'Bush',
 u'from',
 u'2007-',
 u'March',
 u'2010.[20]',
 u'In',
 u'2011,',
 u'she',
 u'received',
 u'widespread',
 u'criticism',
 u'after',
 u'filing',
 u'for',
 u'divorce',
 u'from',
 u'New',
 u'Jersey',
 u'Nets',
 u'power',
 u'forward',
 u'Kris',
 u'Humphries',
 u'after',
 u'a',
 u'72-day',
 u'marriage.',
 u'In',
 u'2012,',
 u'while',
 u'still',
 u'married',
 u'Kim',
 u'became',
 u'pregnant',
 u'by',
 u'rapper',
 u'Kanye',
 u'West;',
 u'after',
 u'suffering',
 u'from',
 u'preeclampsia',
 u'she',
 u'gave',
 u'birth',
 u'prematurely',
 u'to',
 u'their',
 u'daughter',
 u'North',
 u'the',
 u'following',
 u'June.',
 u'Khlo\xe9',
 u'attained',
 u'notoriety',
 u'in',
 u'her',
 u'own',
 u'right',
 u'after',
 u'being',
 u'arrested',
 u'for',
 u'driving',
 u'under',
 u'the',
 u'influence',
 u'in',
 u'2007,',
 u'for',
 u'which',
 u'she',
 u'was',
 u'jailed',
 u'for',
 u'approximately',
 u'three',
 u'hours',
 u'in',
 u'2008.',
 u'The',
 u'following',
 u'year',
 u'(during',
 u'the',
 u'fourth',
 u'season),',
 u'she',
 u'married',
 u'Los',
 u'Angeles',
 u'Lakers',
 u'forward',
 u'Lamar',
 u'Odom',
 u'after',
 u'a',
 u'one-month',
 u'relationship.',
 u'In',
 u'2012,',
 u'she',
 u'served',
 u'as',
 u'a',
 u'co-host',
 u'during',
 u'the',
 u'second',
 u'season',
 u'of',
 u'the',
 u'American',
 u'version',
 u'of',
 u'The',
 u'X',
 u'Factor.',
 u'Rob',
 u'launched',
 u'the',
 u'sock',
 u'line',
 u'"Arthur',
 u'George"',
 u'in',
 u'2012,',
 u'and',
 u'was',
 u'involved',
 u'in',
 u'a',
 u'relationship',
 u'with',
 u'singer',
 u'Adrienne',
 u'Bailon',
 u'in',
 u'the',
 u'second',
 u'and',
 u'third',
 u'seasons.',
 u'Kendall',
 u'and',
 u'Kylie',
 u'have',
 u'also',
 u'established',
 u'careers',
 u'in',
 u'the',
 u'modeling',
 u'industry.',
 u'In',
 u'the',
 u'eighth',
 u'season,',
 u"Bruce's",
 u'sons',
 u'Brandon',
 u'and',
 u'Brody',
 u'Jenner,',
 u'and',
 u"Brandon's",
 u'wife',
 u'Leah',
 u'Felder',
 u'(daughter',
 u'of',
 u'Eagles',
 u'band',
 u'member',
 u'Don',
 u'Felder),',
 u'were',
 u'integrated',
 u'into',
 u'the',
 u'supporting',
 u'cast,',
 u'while',
 u'Kourtney,',
 u'Khlo\xe9,',
 u'and',
 u"Kim's",
 u'friends',
 u'Malika',
 u'Haqq',
 u'and',
 u'Jonathan',
 u'Cheban',
 u'joined',
 u'the',
 u'series',
 u'in',
 u'the',
 u'second',
 u'and',
 u'third',
 u'seasons.',
 u'The',
 u'family',
 u'earns',
 u'an',
 u'alleged',
 u'total',
 u'of',
 u'$10',
 u'million',
 u'per',
 u'season',
 u'of',
 u'the',
 u'series.[21]']
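The same behaviour is easier to inspect on a short string. The following is a minimal sketch on a hand-made sample (the `sample` text is a hypothetical excerpt, not the full `background` string): punctuation and footnote markers stay glued to the neighbouring word.

```python
# Hypothetical short sample; the notebook itself works on `background`.
sample = u"The couple divorced in 1991.[3] In 1994, Robert defended O.J. Simpson."

# str.split() with no argument splits on runs of whitespace only,
# so punctuation and footnote markers remain attached to the words.
tokens = sample.split()

for tok in tokens:
    print(tok)
```

Note that "1994," (with the comma) is a different string from "1994", so a later lookup for the plain year would miss it.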

The output is not satisfactory, especially around punctuation. Let's try to catch a higher-level structure first: sentences. A simple idea is to break at periods.

In [5]:
background.split(".")
Out[5]:
[u'\nRobert Kardashian (1944\u20142003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khlo\xe9, (born 1984), and son Rob (born 1987)',
 u' The couple divorced in 1991',
 u'[3] In 1994, Robert entered the media spotlight when he defended O',
 u'J',
 u' Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial',
 u' Kris married former Olympic champion Bruce Jenner (born 1949) in 1991',
 u' Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997)',
 u' Robert died in 2003 eight weeks after being diagnosed with esophageal cancer',
 u'[4] In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan',
 u'[5] Khlo\xe9, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California',
 u" Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey",
 u'[5] In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive',
 u'[6] In February 2007, a home sex video that Kim made with Ray J years earlier was leaked',
 u'[7] Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21',
 u'[7] Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for $5 million',
 u'[8]\n\nIn August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer',
 u'[6] Seacrest said "At the heart of the series\u2014despite the catfights and endless sarcasm\u2014is a family that truly loves and supports one another [',
 u'',
 u'',
 u'] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain',
 u'" The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending',
 u'[6] Keeping Up With the Kardashians premiered on October 14, 2007',
 u'[9]\n\nOn November 13, 2007, it was announced that E! renewed the show for its second season',
 u'[9] The following year, it was renewed for a third season',
 u' Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television\'s most-talked-about shows [',
 u'',
 u'',
 u"] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics",
 u'"[10] The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010',
 u'[3]\n\nIn 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony,[11] but filed for divorce 72 days later',
 u'[12] This caused widespread backlash from the public and media',
 u"[13] Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures",
 u'[14] A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split',
 u'[15] As of December 2013, eight seasons of the series have aired',
 u' On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine',
 u'[16] The deal was estimated at $40 million',
 u'[17][18] On January',
 u' 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khlo\xe9 Take The Hamptons season finale',
 u"\n\nThe show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khlo\xe9, and Rob Kourtney's boyfriend Scott Disick is a main character on the show",
 u" As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show",
 u" Kris' second husband[19] 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began",
 u"\nSince the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections",
 u'\nKim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010',
 u'[20] In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage',
 u' In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June',
 u'\nKhlo\xe9 attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008',
 u' The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship',
 u' In 2012, she served as a co-host during the second season of the American version of The X Factor',
 u'\nRob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons',
 u'\nKendall and Kylie have also established careers in the modeling industry',
 u"\nIn the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khlo\xe9, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons",
 u'\nThe family earns an alleged total of $10 million per season of the series',
 u'[21]\n']

This is not perfect either. One problem is the Wikipedia footnote markers such as "[9]", which are irrelevant to us. Let's first find them with a regular expression, then remove them.

In [6]:
re.findall(r"\[\d+\]", background)
Out[6]:
[u'[3]',
 u'[4]',
 u'[5]',
 u'[5]',
 u'[6]',
 u'[7]',
 u'[7]',
 u'[8]',
 u'[6]',
 u'[6]',
 u'[9]',
 u'[9]',
 u'[10]',
 u'[3]',
 u'[11]',
 u'[12]',
 u'[13]',
 u'[14]',
 u'[15]',
 u'[16]',
 u'[17]',
 u'[18]',
 u'[19]',
 u'[20]',
 u'[21]']
In [7]:
background_nofoot = re.sub(r"\[\d+\]", "", background)
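As a quick check of what this pattern does and does not touch, here is a small sketch on a hand-made string (the `sample` text is an assumption for illustration). Only brackets that contain digits are removed; the elision marker "[...]" is left alone.

```python
import re

# Hypothetical sample, chosen to contain footnotes and an elision marker.
sample = (u"The couple divorced in 1991.[3] "
          u"The deal was estimated at $40 million.[17][18] "
          u"He said [...] yes.")

# \[ and \] match literal brackets; \d+ matches one or more digits.
footnote = re.compile(r"\[\d+\]")

cleaned = footnote.sub("", sample)
print(cleaned)
```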

We can now try to split sentences again. We also want to split on other punctuation marks, not just the period. This is possible with regular expressions.

In [8]:
splitter = re.compile(r"""
    [.!?]       # split on punctuation
    """, re.VERBOSE)

While this looks better, it messes up some instances: O.J. Simpson's name, the "E!" network's name, and ellipses (...).
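These failure cases can be reproduced on a short string. The sketch below uses a hand-made `sample` (an assumption, not notebook data) and a pattern equivalent to the splitter above, written without the verbose layout.

```python
import re

# Equivalent to the splitter above: split on any single '.', '!' or '?'.
naive_splitter = re.compile(r"[.!?]")

# Hypothetical sample containing an initialism, "E!", and an ellipsis.
sample = u"He defended O.J. Simpson. E! renewed the show [...] It premiered in 2007."

pieces = naive_splitter.split(sample)
for piece in pieces:
    print(repr(piece))
```

The initials are cut into "O" and "J", the network name is cut after "E", and the ellipsis yields empty strings, because every punctuation mark is treated as a sentence boundary.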

To deal with this, we need to think about the decision process of whether a punctuation mark is an end of sentence or not. In class, we came up with three requirements:

  • It is followed by at least one whitespace character (so the first period in "O.J." does not count)
  • The first character after the whitespace is uppercase (so the "!" in "E! renewed" does not count)
  • The character immediately before it is not uppercase (so the period in "J. Simpson" does not count)

Are these rules complete?

In [9]:
splitter = re.compile(r"""
    (?<![A-Z])  # last character cannot be uppercase
    [.!?]       # match punctuation
    \s+         # followed by whitespace
    (?=[A-Z])   # next character must be uppercase
    """, re.VERBOSE)

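Running our sentence splitter on the text, it does a reasonable job, though not a perfect one: a sentence that ends inside a closing quotation mark (such as `entertain." The series`) is not split, because the period is followed by a quote character rather than whitespace.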

In [10]:
for sentence in splitter.split(background_nofoot):
    print(sentence.strip())
    print("--")
Robert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987)
--
The couple divorced in 1991
--
In 1994, Robert entered the media spotlight when he defended O.J. Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial
--
Kris married former Olympic champion Bruce Jenner (born 1949) in 1991
--
Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997)
--
Robert died in 2003 eight weeks after being diagnosed with esophageal cancer
--
In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan
--
Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California
--
Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey
--
In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive
--
In February 2007, a home sex video that Kim made with Ray J years earlier was leaked
--
Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21
--
Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for $5 million
--
In August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer
--
Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [...] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain." The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending
--
Keeping Up With the Kardashians premiered on October 14, 2007
--
On November 13, 2007, it was announced that E! renewed the show for its second season
--
The following year, it was renewed for a third season
--
Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television's most-talked-about shows [...] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics." The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010
--
In 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony, but filed for divorce 72 days later
--
This caused widespread backlash from the public and media
--
Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures
--
A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split
--
As of December 2013, eight seasons of the series have aired
--
On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine
--
The deal was estimated at $40 million
--
On January. 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale
--
The show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show
--
As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show
--
Kris' second husband 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began
--
Since the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections
--
Kim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010
--
In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage
--
In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June
--
Khloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008
--
The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship
--
In 2012, she served as a co-host during the second season of the American version of The X Factor
--
Rob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons
--
Kendall and Kylie have also established careers in the modeling industry
--
In the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons
--
The family earns an alleged total of $10 million per season of the series.
--
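To probe whether the three rules are complete, it can help to run the same pattern on a few hand-made cases. The sketch below repeats the splitter definition so that it is self-contained; the sample strings are assumptions chosen to hit each rule, plus two cases the rules do not cover.

```python
import re

splitter = re.compile(r"""
    (?<![A-Z])  # character before the punctuation must not be uppercase
    [.!?]       # the punctuation mark itself
    \s+         # at least one whitespace character
    (?=[A-Z])   # next character must be uppercase; checked but not consumed
    """, re.VERBOSE)

# Hypothetical sample: initials, "E!", and a sentence ending in a quote.
sample = (u"He defended O.J. Simpson during a trial. "
          u"E! renewed the show. "
          u'She said "it is sure to entertain." The series ended.')

sentences = splitter.split(sample)
for s in sentences:
    print(s)

# A case the rules get wrong: a lowercase abbreviation before a name.
print(splitter.split(u"Dr. Smith arrived."))
```

On this sample the initials and "E!" survive, but the quoted sentence is not separated from the one after it, and "Dr. Smith" is split in two. So the rules are not complete; abbreviations and closing quotes would need extra handling.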

Now let's identify individual words within each sentence. Rather than splitting at whitespace, let's match all sequences of word characters.

In [11]:
word_splitter = re.compile(r"""
    (\w+)
    """, re.VERBOSE)
In [12]:
sent_words = [word_splitter.findall(sent)
              for sent in splitter.split(background_nofoot)]
In [13]:
sent_words_lower = [[w.lower() for w in sent]
                    for sent in sent_words]
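
To see why matching word characters is preferred over splitting at whitespace, here is a small sketch on a made-up sample (the variable names are illustrative and not part of the notebook). It also shows two side effects visible in the token list further down: hyphenated names such as D-A-S-H break into single letters, and an ASCII-only word pattern cuts a name at its accented character. The latter is, as far as we can tell, what `\w` does under Python 2 when `re.UNICODE` is not set, which would explain the token `khlo`.

```python
import re

sample = u"Kim sued Vivid for ownership of the tape, but dropped the suit."

# Splitting at whitespace leaves punctuation glued to the words.
whitespace_tokens = sample.split()

# Matching runs of word characters drops the punctuation.
word_tokens = re.findall(r"\w+", sample)

# Hyphens are not word characters, so the name falls apart into letters.
dash_tokens = re.findall(r"\w+", u"D-A-S-H")

# An explicit ASCII-only class stops at the accented letter ...
ascii_tokens = re.findall(r"[A-Za-z0-9_]+", u"Khlo\u00e9")
# ... while a Unicode-aware \w keeps the whole name.
unicode_tokens = re.findall(r"\w+", u"Khlo\u00e9", re.UNICODE)

print(whitespace_tokens)
print(word_tokens)
print(dash_tokens, ascii_tokens, unicode_tokens)
```

If keeping accented names intact matters for retrieval, compiling `word_splitter` with `re.VERBOSE | re.UNICODE` would be one option; the outputs shown in this notebook were produced without it.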

How many sentences do we have?

In [14]:
len(sent_words_lower)
Out[14]:
41

How many words are there in total? ("tokens")

In [15]:
allwords=[w for sent in sent_words_lower for w in sent]
In [16]:
sorted(allwords)
Out[16]:
[u'1',
 u'10',
 u'13',
 u'14',
 u'1944',
 u'1949',
 u'1955',
 u'1976',
 u'1978',
 u'1979',
 u'1980',
 u'1984',
 u'1987',
 u'1991',
 u'1991',
 u'1994',
 u'1995',
 u'1997',
 u'2003',
 u'2003',
 u'2004',
 u'2006',
 u'2007',
 u'2007',
 u'2007',
 u'2007',
 u'2007',
 u'2007',
 u'2007',
 u'2007',
 u'2008',
 u'2010',
 u'2010',
 u'2011',
 u'2011',
 u'2012',
 u'2012',
 u'2012',
 u'2012',
 u'2013',
 u'2015',
 u'2015',
 u'21',
 u'24',
 u'4',
 u'40',
 u'5',
 u'65',
 u'72',
 u'72',
 u'8',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'a',
 u'ability',
 u'about',
 u'actress',
 u'adrienne',
 u'after',
 u'after',
 u'after',
 u'after',
 u'after',
 u'after',
 u'after',
 u'after',
 u'air',
 u'aired',
 u'airing',
 u'all',
 u'alleged',
 u'also',
 u'also',
 u'also',
 u'always',
 u'american',
 u'an',
 u'an',
 u'an',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'and',
 u'angeles',
 u'announced',
 u'announced',
 u'announced',
 u'announced',
 u'announcement',
 u'another',
 u'approximately',
 u'april',
 u'april',
 u'are',
 u'around',
 u'arrested',
 u'arthur',
 u'artist',
 u'as',
 u'as',
 u'as',
 u'as',
 u'as',
 u'as',
 u'asking',
 u'at',
 u'at',
 u'attained',
 u'attorney',
 u'august',
 u'back',
 u'backlash',
 u'bailon',
 u'band',
 u'be',
 u'became',
 u'became',
 u'became',
 u'became',
 u'become',
 u'been',
 u'began',
 u'being',
 u'being',
 u'berger',
 u'birth',
 u'born',
 u'born',
 u'born',
 u'born',
 u'born',
 u'born',
 u'born',
 u'born',
 u'bought',
 u'boutique',
 u'boutique',
 u'boyfriend',
 u'brand',
 u'brandon',
 u'brandon',
 u'brandy',
 u'brody',
 u'brother',
 u'brown',
 u'bruce',
 u'bruce',
 u'bruce',
 u'bruce',
 u'bunch',
 u'bunim',
 u'bush',
 u'but',
 u'but',
 u'by',
 u'calabasas',
 u'california',
 u'came',
 u'cancer',
 u'capture',
 u'career',
 u'careers',
 u'careers',
 u'cast',
 u'cast',
 u'cast',
 u'catfights',
 u'cattle',
 u'caused',
 u'ceremony',
 u'champion',
 u'champion',
 u'chaotic',
 u'character',
 u'cheban',
 u'children',
 u'children',
 u'children',
 u'children',
 u'circulated',
 u'clothing',
 u'co',
 u'co',
 u'collections',
 u'couple',
 u'criticism',
 u'd',
 u'd',
 u'daughter',
 u'daughter',
 u'daughters',
 u'daughters',
 u'day',
 u'days',
 u'deal',
 u'deal',
 u'decathlon',
 u'deceased',
 u'december',
 u'defended',
 u'despite',
 u'developed',
 u'development',
 u'diagnosed',
 u'died',
 u'disick',
 u'divorce',
 u'divorce',
 u'divorced',
 u'don',
 u'drive',
 u'driving',
 u'dropped',
 u'during',
 u'during',
 u'during',
 u'dynamics',
 u'dynamics',
 u'e',
 u'e',
 u'e',
 u'e',
 u'eagles',
 u'earlier',
 u'early',
 u'earns',
 u'eight',
 u'eight',
 u'eight',
 u'eighth',
 u'embraced',
 u'ending',
 u'endless',
 u'entered',
 u'entertain',
 u'entertaining',
 u'entertainment',
 u'entertainment',
 u'entitled',
 u'esophageal',
 u'established',
 u'established',
 u'estimated',
 u'estimated',
 u'eventually',
 u'exceptional',
 u'executive',
 u'executive',
 u'factor',
 u'familiar',
 u'family',
 u'family',
 u'family',
 u'family',
 u'family',
 u'family',
 u'family',
 u'family',
 u'fashion',
 u'fashion',
 u'fashion',
 u'fashion',
 u'featured',
 u'feb',
 u'february',
 u'february',
 u'felder',
 u'felder',
 u'filed',
 u'filing',
 u'film',
 u'filthy',
 u'finale',
 u'first',
 u'first',
 u'focused',
 u'followed',
 u'following',
 u'following',
 u'following',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'for',
 u'former',
 u'fortunate',
 u'forward',
 u'forward',
 u'four',
 u'fourth',
 u'fragrances',
 u'frequently',
 u'friend',
 u'friends',
 u'from',
 u'from',
 u'from',
 u'from',
 u'from',
 u'from',
 u'full',
 u'further',
 u'gained',
 u'gave',
 u'george',
 u'goldman',
 u'h',
 u'h',
 u'had',
 u'had',
 u'hamptons',
 u'haqq',
 u'has',
 u'has',
 u'has',
 u'have',
 u'have',
 u'have',
 u'have',
 u'have',
 u'he',
 u'heart',
 u'her',
 u'her',
 u'her',
 u'her',
 u'her',
 u'herself',
 u'high',
 u'high',
 u'highly',
 u'hilarious',
 u'hilton',
 u'hollywood',
 u'hollywood',
 u'home',
 u'host',
 u'houghton',
 u'hours',
 u'humphries',
 u'humphries',
 u'humphries',
 u'husband',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'in',
 u'including',
 u'industry',
 u'industry',
 u'influence',
 u'integrated',
 u'into',
 u'into',
 u'into',
 u'involved',
 u'involved',
 u'involved',
 u'is',
 u'is',
 u'is',
 u'is',
 u'it',
 u'it',
 u'it',
 u'its',
 u'j',
 u'j',
 u'j',
 u'jailed',
 u'january',
 u'jenner',
 u'jenner',
 u'jenner',
 u'jenner',
 u'jenners',
 u'jersey',
 u'joined',
 u'jonathan',
 u'june',
 u'kanye',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashian',
 u'kardashians',
 u'kardashians',
 u'kardashians',
 u'keep',
 u'keeping',
 u'kendall',
 u'kendall',
 u'kendall',
 u'khlo',
 u'khlo',
 u'khlo',
 u'khlo',
 u'khlo',
 u'khlo',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kim',
 u'kourtney',
 u'kourtney',
 u'kourtney',
 u'kourtney',
 u'kourtney',
 u'kourtney',
 u'kourtney',
 u'kris',
 u'kris',
 u'kris',
 u'kris',
 u'kris',
 u'kris',
 u'kris',
 u'kris',
 u'kristen',
 u'kylie',
 u'kylie',
 u'kylie',
 u'lachey',
 u'lakers',
 u'lamar',
 u'later',
 u'later',
 u'later',
 u'launched',
 u'launching',
 u'leah',
 u'leaked',
 u'lengthy',
 u'life',
 u'lindsay',
 u'line',
 u'lisa',
 u'lohan',
 u'los',
 u'loves',
 u'made',
 u'made',
 u'main',
 u'make',
 u'malika',
 u'march',
 u'marriage',
 u'marriage',
 u'marriage',
 u'married',
 u'married',
 u'married',
 u'married',
 u'married',
 u'mary',
 u'media',
 u'media',
 u'member',
 u'member',
 u'members',
 u'merely',
 u'million',
 u'million',
 u'million',
 u'million',
 u'million',
 u'modeling',
 u'month',
 u'most',
 u'murders',
 u'murray',
 u'nba',
 u'nets',
 u'network',
 u'new',
 u'new',
 u'news',
 u'nick',
 u'nicole',
 u'nicole',
 u'nine',
 u'north',
 u'norwood',
 u'norwood',
 u'notoriety',
 u'notoriety',
 u'november',
 u'o',
 u'october',
 u'odom',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'of',
 u'olympic',
 u'olympics',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'on',
 u'one',
 u'one',
 u'one',
 u'one',
 u'one',
 u'opening',
 u'original',
 u'originally',
 u'orleans',
 u'outlets',
 u'own',
 u'ownership',
 u'owning',
 u'paris',
 u'per',
 u'personal',
 u'personal',
 u'personalities',
 u'petition',
 u'player',
 u'popular',
 u'power',
 u'preeclampsia',
 u'pregnant',
 u'prematurely',
 u'premiere',
 u'premiere',
 u'premiered',
 u'president',
 u'producer',
 u'profile',
 u'programming',
 u'programming',
 u'progressed',
 u'promote',
 u'public',
 u'publicity',
 u'publicized',
 u'rapper',
 u'rapper',
 u'ray',
 u'ray',
 u'reality',
 u'reality',
 u'received',
 u'recording',
 u'recurring',
 u'recurring',
 u'reggie',
 u'related',
 u'relationship',
 u'relationship',
 u'relationship',
 u'relationships',
 u'released',
 u'remove',
 u'renewed',
 u'renewed',
 u'reported',
 u'reporter',
 u'revolves',
 u'rich',
 u'richie',
 u'right',
 u'rights',
 u'rob',
 u'rob',
 u'rob',
 u'robert',
 u'robert',
 u'robert',
 u'robert',
 u'ronald',
 u'running',
 u'ryan',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u's',
 u'said',
 u'said',
 u'saints',
 u'sarcasm',
 u'scott',
 u'seacrest',
 u'seacrest',
 u'seacrest',
 u'season',
 u'season',
 u'season',
 u'season',
 u'season',
 u'season',
 u'season',
 u'season',
 u'seasons',
 u'seasons',
 u'seasons',
 u'seasons',
 u'second',
 u'second',
 u'second',
 u'second',
 u'second',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'series',
 u'served',
 u'serving',
 u'settled',
 u'seven',
 u'several',
 u'several',
 u'sex',
 u'sex',
 u'she',
 u'she',
 u'she',
 u'she',
 u'she',
 u'she',
 u'she',
 u'shopper',
 u'show',
 u'show',
 u'show',
 u'show',
 u'show',
 u'show',
 u'show',
 u'shows',
 u'signed',
 u'simple',
 u'simpson',
 u'simpson',
 u'since',
 u'since',
 u'since',
 u'singer',
 u'singer',
 u'sisters',
 u'sock',
 u'some',
 u'son',
 u'sons',
 u'split',
 u'spotlight',
 u'star',
 u'starred',
 u'still',
 u'stunt',
 u'stylist',
 u'stylist',
 u'stylist',
 u'subject',
 u'subsequent',
 u'sued',
 u'suffering',
 u'suit',
 u'summer',
 u'superstar',
 u'supporting',
 u'supports',
 u'sure',
 u'surmised',
 u'take',
 u'talked',
 u'tape',
 u'tape',
 u'television',
 u'television',
 u'television',
 u'tenth',
 u'that',
 u'that',
 u'that',
 u'that',
 u'that',
 u'that',
 u'that',
 u'that',
 u'that',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'the',
 u'their',
 u'their',
 u'their',
 u'their',
 u'them',
 u'third',
 u'third',
 u'third',
 u'this',
 u'this',
 u'three',
 u'three',
 u'through',
 u'throughout',
 u'throughout',
 u'time',
 u'titled',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'to',
 u'together',
 u'together',
 u'total',
 u'trial',
 u'truly',
 u'two',
 u'under',
 u'up',
 u'ventured',
 u'ventures',
 u'version',
 u'vice',
 u'video',
 u'viewers',
 u'vivid',
 u'vivid',
 u'vivid',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'was',
 u'we',
 u'wedding',
 u'week',
 u'weeks',
 u'were',
 u'west',
 u'when',
 u'which',
 u'which',
 u'while',
 u'while',
 u'widely',
 u'widespread',
 u'widespread',
 u'wife',
 u'will',
 u'with',
 u'with',
 u'with',
 u'with',
 u'with',
 u'with',
 u'with',
 u'with',
 u'with',
 u'work',
 u'would',
 u'x',
 u'year',
 u'year',
 u'year',
 u'years',
 u'yet']
In [17]:
len(allwords)
Out[17]:
972

How many distinct types of words ("types") are there?

In [18]:
len(set(allwords))
Out[18]:
459
In [19]:
sorted(set(allwords))
Out[19]:
[u'1',
 u'10',
 u'13',
 u'14',
 u'1944',
 u'1949',
 u'1955',
 u'1976',
 u'1978',
 u'1979',
 u'1980',
 u'1984',
 u'1987',
 u'1991',
 u'1994',
 u'1995',
 u'1997',
 u'2003',
 u'2004',
 u'2006',
 u'2007',
 u'2008',
 u'2010',
 u'2011',
 u'2012',
 u'2013',
 u'2015',
 u'21',
 u'24',
 u'4',
 u'40',
 u'5',
 u'65',
 u'72',
 u'8',
 u'a',
 u'ability',
 u'about',
 u'actress',
 u'adrienne',
 u'after',
 u'air',
 u'aired',
 u'airing',
 u'all',
 u'alleged',
 u'also',
 u'always',
 u'american',
 u'an',
 u'and',
 u'angeles',
 u'announced',
 u'announcement',
 u'another',
 u'approximately',
 u'april',
 u'are',
 u'around',
 u'arrested',
 u'arthur',
 u'artist',
 u'as',
 u'asking',
 u'at',
 u'attained',
 u'attorney',
 u'august',
 u'back',
 u'backlash',
 u'bailon',
 u'band',
 u'be',
 u'became',
 u'become',
 u'been',
 u'began',
 u'being',
 u'berger',
 u'birth',
 u'born',
 u'bought',
 u'boutique',
 u'boyfriend',
 u'brand',
 u'brandon',
 u'brandy',
 u'brody',
 u'brother',
 u'brown',
 u'bruce',
 u'bunch',
 u'bunim',
 u'bush',
 u'but',
 u'by',
 u'calabasas',
 u'california',
 u'came',
 u'cancer',
 u'capture',
 u'career',
 u'careers',
 u'cast',
 u'catfights',
 u'cattle',
 u'caused',
 u'ceremony',
 u'champion',
 u'chaotic',
 u'character',
 u'cheban',
 u'children',
 u'circulated',
 u'clothing',
 u'co',
 u'collections',
 u'couple',
 u'criticism',
 u'd',
 u'daughter',
 u'daughters',
 u'day',
 u'days',
 u'deal',
 u'decathlon',
 u'deceased',
 u'december',
 u'defended',
 u'despite',
 u'developed',
 u'development',
 u'diagnosed',
 u'died',
 u'disick',
 u'divorce',
 u'divorced',
 u'don',
 u'drive',
 u'driving',
 u'dropped',
 u'during',
 u'dynamics',
 u'e',
 u'eagles',
 u'earlier',
 u'early',
 u'earns',
 u'eight',
 u'eighth',
 u'embraced',
 u'ending',
 u'endless',
 u'entered',
 u'entertain',
 u'entertaining',
 u'entertainment',
 u'entitled',
 u'esophageal',
 u'established',
 u'estimated',
 u'eventually',
 u'exceptional',
 u'executive',
 u'factor',
 u'familiar',
 u'family',
 u'fashion',
 u'featured',
 u'feb',
 u'february',
 u'felder',
 u'filed',
 u'filing',
 u'film',
 u'filthy',
 u'finale',
 u'first',
 u'focused',
 u'followed',
 u'following',
 u'for',
 u'former',
 u'fortunate',
 u'forward',
 u'four',
 u'fourth',
 u'fragrances',
 u'frequently',
 u'friend',
 u'friends',
 u'from',
 u'full',
 u'further',
 u'gained',
 u'gave',
 u'george',
 u'goldman',
 u'h',
 u'had',
 u'hamptons',
 u'haqq',
 u'has',
 u'have',
 u'he',
 u'heart',
 u'her',
 u'herself',
 u'high',
 u'highly',
 u'hilarious',
 u'hilton',
 u'hollywood',
 u'home',
 u'host',
 u'houghton',
 u'hours',
 u'humphries',
 u'husband',
 u'in',
 u'including',
 u'industry',
 u'influence',
 u'integrated',
 u'into',
 u'involved',
 u'is',
 u'it',
 u'its',
 u'j',
 u'jailed',
 u'january',
 u'jenner',
 u'jenners',
 u'jersey',
 u'joined',
 u'jonathan',
 u'june',
 u'kanye',
 u'kardashian',
 u'kardashians',
 u'keep',
 u'keeping',
 u'kendall',
 u'khlo',
 u'kim',
 u'kourtney',
 u'kris',
 u'kristen',
 u'kylie',
 u'lachey',
 u'lakers',
 u'lamar',
 u'later',
 u'launched',
 u'launching',
 u'leah',
 u'leaked',
 u'lengthy',
 u'life',
 u'lindsay',
 u'line',
 u'lisa',
 u'lohan',
 u'los',
 u'loves',
 u'made',
 u'main',
 u'make',
 u'malika',
 u'march',
 u'marriage',
 u'married',
 u'mary',
 u'media',
 u'member',
 u'members',
 u'merely',
 u'million',
 u'modeling',
 u'month',
 u'most',
 u'murders',
 u'murray',
 u'nba',
 u'nets',
 u'network',
 u'new',
 u'news',
 u'nick',
 u'nicole',
 u'nine',
 u'north',
 u'norwood',
 u'notoriety',
 u'november',
 u'o',
 u'october',
 u'odom',
 u'of',
 u'olympic',
 u'olympics',
 u'on',
 u'one',
 u'opening',
 u'original',
 u'originally',
 u'orleans',
 u'outlets',
 u'own',
 u'ownership',
 u'owning',
 u'paris',
 u'per',
 u'personal',
 u'personalities',
 u'petition',
 u'player',
 u'popular',
 u'power',
 u'preeclampsia',
 u'pregnant',
 u'prematurely',
 u'premiere',
 u'premiered',
 u'president',
 u'producer',
 u'profile',
 u'programming',
 u'progressed',
 u'promote',
 u'public',
 u'publicity',
 u'publicized',
 u'rapper',
 u'ray',
 u'reality',
 u'received',
 u'recording',
 u'recurring',
 u'reggie',
 u'related',
 u'relationship',
 u'relationships',
 u'released',
 u'remove',
 u'renewed',
 u'reported',
 u'reporter',
 u'revolves',
 u'rich',
 u'richie',
 u'right',
 u'rights',
 u'rob',
 u'robert',
 u'ronald',
 u'running',
 u'ryan',
 u's',
 u'said',
 u'saints',
 u'sarcasm',
 u'scott',
 u'seacrest',
 u'season',
 u'seasons',
 u'second',
 u'series',
 u'served',
 u'serving',
 u'settled',
 u'seven',
 u'several',
 u'sex',
 u'she',
 u'shopper',
 u'show',
 u'shows',
 u'signed',
 u'simple',
 u'simpson',
 u'since',
 u'singer',
 u'sisters',
 u'sock',
 u'some',
 u'son',
 u'sons',
 u'split',
 u'spotlight',
 u'star',
 u'starred',
 u'still',
 u'stunt',
 u'stylist',
 u'subject',
 u'subsequent',
 u'sued',
 u'suffering',
 u'suit',
 u'summer',
 u'superstar',
 u'supporting',
 u'supports',
 u'sure',
 u'surmised',
 u'take',
 u'talked',
 u'tape',
 u'television',
 u'tenth',
 u'that',
 u'the',
 u'their',
 u'them',
 u'third',
 u'this',
 u'three',
 u'through',
 u'throughout',
 u'time',
 u'titled',
 u'to',
 u'together',
 u'total',
 u'trial',
 u'truly',
 u'two',
 u'under',
 u'up',
 u'ventured',
 u'ventures',
 u'version',
 u'vice',
 u'video',
 u'viewers',
 u'vivid',
 u'was',
 u'we',
 u'wedding',
 u'week',
 u'weeks',
 u'were',
 u'west',
 u'when',
 u'which',
 u'while',
 u'widely',
 u'widespread',
 u'wife',
 u'will',
 u'with',
 u'work',
 u'would',
 u'x',
 u'year',
 u'years',
 u'yet']
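
As a compact alternative to the token and type counts above, both numbers can be read off a single `collections.Counter`. This is only a sketch on a tiny made-up token list (`toy_tokens` is an assumption, not data from the notebook); applied to `allwords` it should give the same 972 tokens and 459 types.

```python
from collections import Counter

# Hypothetical miniature token list standing in for `allwords`.
toy_tokens = ["the", "show", "and", "the", "family", "and", "the", "series"]

counts = Counter(toy_tokens)

n_tokens = sum(counts.values())  # every occurrence counts
n_types = len(counts)            # each distinct word counts once

print(n_tokens, n_types)
print(counts.most_common(2))     # the most frequent types with their counts
```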

Answering queries on the data

We will try to retrieve the closest matching sentence to a given query. To do this, we must define what "closest" means. In other words, we need a similarity measure.

A simple one is the number of types in common between the query and the sentence.

In [20]:
def types_in_common(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return len(A.intersection(B))

A slightly more complex one is the Jaccard similarity measure, which additionally takes into account the total number of types that the query and the sentence have between them: it divides the number of shared types by the number of types in their union.

In [21]:
def jaccard(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return float(len(A.intersection(B)))/len(A.union(B))
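
To make the two measures concrete, here is a worked example that re-declares both functions so it runs on its own. The sentence is the top hit for the query "kourtney" shown further down; it has 14 tokens but 13 types (because "in" occurs twice), so the Jaccard score should come out as 1/13, matching the 0.0769 in that output.

```python
def types_in_common(query_words, sentence):
    # Size of the intersection of the two sets of types.
    return len(set(query_words) & set(sentence))

def jaccard(query_words, sentence):
    # Intersection size divided by union size.
    a, b = set(query_words), set(sentence)
    return float(len(a & b)) / len(a | b)

query = ["kourtney"]
sentence = ("in 2006 kourtney starred in her first reality television "
            "series filthy rich cattle drive").split()

overlap = types_in_common(query, sentence)
score = jaccard(query, sentence)

print(len(sentence), len(set(sentence)))  # tokens vs. types
print(overlap, score)
```

Because the union grows with sentence length, Jaccard penalizes long sentences that share only one word with the query, which `types_in_common` does not.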

Next we'll define a basic "search engine" which will go through all the sentences and calculate each one's similarity with the query. It returns a list of sentences sorted by their similarity score (if that score is greater than zero).

To calculate the similarity, this function takes as an argument a similarity_measure function.

In [22]:
from operator import itemgetter

def run_search(query, similarity_measure):
    query_words = word_splitter.findall(query)
    query_words = [w.lower() for w in query_words]
    
    sent_scores = [(sent, similarity_measure(query_words, sent))
                   for sent in sent_words_lower]

    sent_scores = sorted(sent_scores, key=itemgetter(1), reverse=True)
    sent_scores = [(sent, score)
                   for sent, score in sent_scores
                   if score > 0]

    joined_sents = [(" ".join(sent), score) for sent, score in sent_scores]
    return joined_sents
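
The same pattern can be tried in isolation on a toy corpus. The sketch below is a self-contained variant of `run_search` (the names `toy_search`, `toy_sentences` and `word_re` are made up for illustration); it takes the similarity function as an argument, drops zero-score sentences, and sorts the rest by score. Python's `sorted` is stable, also with `reverse=True`, so sentences with equal scores keep their original order.

```python
import re
from operator import itemgetter

word_re = re.compile(r"\w+")

# Hypothetical three-sentence corpus.
toy_sentences = [
    "Kris married former Olympic champion Bruce Jenner",
    "The couple divorced in 1991",
    "Bruce and Kris had two daughters together",
]
toy_docs = [[w.lower() for w in word_re.findall(s)] for s in toy_sentences]

def types_in_common(query_words, sentence):
    return len(set(query_words) & set(sentence))

def toy_search(query, similarity_measure, docs):
    query_words = [w.lower() for w in word_re.findall(query)]
    # Score every document with whichever measure was passed in.
    scored = [(" ".join(doc), similarity_measure(query_words, doc))
              for doc in docs]
    # Keep only matches, best score first.
    scored = [(text, score) for text, score in scored if score > 0]
    return sorted(scored, key=itemgetter(1), reverse=True)

results = toy_search("kris olympic", types_in_common, toy_docs)
for text, score in results:
    print(score, text)
```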

Now we'll run two versions of the search engine (one using the types_in_common measure and one using the jaccard measure) for two different queries.

In [23]:
run_search("kris olympic",types_in_common)
Out[23]:
[(u'kris married former olympic champion bruce jenner born 1949 in 1991', 2),
 (u'robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khlo born 1984 and son rob born 1987',
  1),
 (u'bruce and kris had two daughters together kendall born 1995 and kylie born 1997',
  1),
 (u'in 2011 kim married nba player kris humphries in a highly publicized wedding ceremony but filed for divorce 72 days later',
  1),
 (u'the show revolves around the children of kris jenner and originally focused on her children from her first marriage to deceased attorney robert kardashian kourtney kim khlo and rob kourtney s boyfriend scott disick is a main character on the show',
  1),
 (u'as the series progressed kris children kendall and kylie also became recurring cast members of the show',
  1),
 (u'kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began',
  1),
 (u'in 2011 she received widespread criticism after filing for divorce from new jersey nets power forward kris humphries after a 72 day marriage',
  1)]
In [24]:
run_search("kourtney",jaccard)
Out[24]:
[(u'in 2006 kourtney starred in her first reality television series filthy rich cattle drive',
  0.07692307692307693),
 (u'khlo kim and kourtney further ventured into fashion opening a high fashion boutique d a s h in calabasas california',
  0.05555555555555555),
 (u'on january 4 2015 a tenth season was announced to premiere on feb 8 2015 after the kourtney and khlo take the hamptons season finale',
  0.047619047619047616),
 (u'robert kardashian 1944 2003 and kristen mary kris houghton born 1955 married in 1978 and had four children together daughters kourtney born 1979 kim born 1980 and khlo born 1984 and son rob born 1987',
  0.03571428571428571),
 (u'the show revolves around the children of kris jenner and originally focused on her children from her first marriage to deceased attorney robert kardashian kourtney kim khlo and rob kourtney s boyfriend scott disick is a main character on the show',
  0.030303030303030304),
 (u'in the eighth season bruce s sons brandon and brody jenner and brandon s wife leah felder daughter of eagles band member don felder were integrated into the supporting cast while kourtney khlo and kim s friends malika haqq and jonathan cheban joined the series in the second and third seasons',
  0.02564102564102564)]

Lecture 4

As we will be playing around with the internals of our information retrieval system, let's write a convenience function for displaying the results and highlighting the ones that we know are actually relevant for our information need.

In [55]:
def print_results(orderedlist, relevant_docs=[], maxresults=5):
    """Print search results while highlighting the ones we truly care about"""
    count = 1
    for item, score in orderedlist:
        if item in relevant_docs:
            print("{:d} !!! {:.2f} {}".format(count, score, item))
        elif count <= maxresults:
            print("{:d}     {:.2f} {}".format(count, score, item))
        print()
        count += 1
In [56]:
relevant_docs_champion = ["kris married former olympic champion bruce jenner born 1949 in 1991",
                          "kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began"]

The Jaccard measure ranks a completely irrelevant document at #1 and gets one right at #2, but the other relevant document is all the way down at position 16.

In [57]:
print_results(run_search("the olympic champion in kardashians", jaccard),
              relevant_docs=relevant_docs_champion)
1     0.25 the couple divorced in 1991

2 !!! 0.23 kris married former olympic champion bruce jenner born 1949 in 1991

3     0.15 keeping up with the kardashians premiered on october 14 2007

4     0.14 kendall and kylie have also established careers in the modeling industry

5     0.10 in 2012 she served as a co host during the second season of the american version of the x factor











16 !!! 0.07 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began


























In order to have a more flexible system that we can modify, let's implement a vector space model. We will use term frequency (TF) weights in order to address one of the issues with Jaccard, namely, that if a word occurs more than once, it should naturally matter more.

In [48]:
terms = sorted(set(allwords))

# TF (term frequency) vectorization
# We represent vectors as dictionaries keyed by term.
# Every vocabulary term gets an entry (zero if absent), so dot() below can
# look up any key; a truly sparse format would omit the zero entries.

def doc_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d

def query_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d
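
For comparison, a genuinely sparse TF vector stores only the terms that occur. One way to sketch this is with `collections.Counter`, which returns 0 for missing keys without storing them. The helper name `sparse_tf` is an assumption for illustration; unlike `doc_to_vec` it does not restrict itself to the vocabulary in `terms`.

```python
from collections import Counter

def sparse_tf(term_list):
    # Only terms that actually occur get an entry.
    return Counter(term_list)

doc = ("in 2012 she served as a co host during the second season "
       "of the american version of the x factor").split()
vec = sparse_tf(doc)

# Repeated words get higher weights; absent words read as zero.
print(vec["the"], vec["of"], vec["olympic"])
print(len(doc), len(vec))  # tokens vs. stored entries
```

This also avoids calling `term_list.count(v)` once per vocabulary term, which is what makes the dictionary version above slow on larger collections.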
In [49]:
import math

def dot(d, q):
    total = 0  # avoid shadowing the built-in sum()
    for v in d:  # iterates through keys
        total += d[v] * q[v]
    return total

One simple similarity measure operating on vectors is the dot product. The higher the dot product between two vectors, the more similar they are.

In [50]:
def dot_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    return float(dot(A, B))
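
As a worked example of the dot product on TF vectors, the sketch below uses sparse `Counter` vectors (the helper `sparse_dot` is illustrative, not the notebook's `dot`). The first sentence is one of the relevant documents; the second is a longer sentence that only shares the common words "the" and "in" with the query, yet repeats them often enough to score higher.

```python
from collections import Counter

def sparse_dot(a, b):
    # Iterate over the smaller vector; missing terms contribute zero.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(term, 0) for term, weight in a.items())

query = Counter("the olympic champion in kardashians".split())

relevant_doc = Counter(
    "kris married former olympic champion bruce jenner born 1949 in 1991".split())
long_doc = Counter(
    ("rob launched the sock line arthur george in 2012 and was involved in a "
     "relationship with singer adrienne bailon in the second and third seasons").split())

relevant_score = sparse_dot(query, relevant_doc)  # olympic + champion + in
long_score = sparse_dot(query, long_doc)          # 2 x the + 3 x in

print(relevant_score, long_score)
```

These values agree with the 3.00 and 5.00 reported for the same two sentences in the dot-product search results.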
In [58]:
print_results(run_search("the olympic champion in kardashians",dot_measure),relevant_docs=relevant_docs_champion)
1     7.00 lisa berger executive vice president of original programming and series development for the network said viewers have embraced the kardashian family and the series has become one of television s most talked about shows we are fortunate to work with seacrest and bunim murray which have an exceptional ability to capture the kardashians hilarious chaotic and always entertaining personalities and family dynamics the hollywood reporter reported that the family made an estimated 65 million throughout 2010

2     6.00 seacrest said at the heart of the series despite the catfights and endless sarcasm is a family that truly loves and supports one another the familiar dynamics of this family make them one hollywood bunch that is sure to entertain the series announcement came one week after paris hilton and her friend nicole richie announced that their popular e series entitled the simple life was ending

3     6.00 in the eighth season bruce s sons brandon and brody jenner and brandon s wife leah felder daughter of eagles band member don felder were integrated into the supporting cast while kourtney khlo and kim s friends malika haqq and jonathan cheban joined the series in the second and third seasons

4     5.00 since the series premiere the kardashian sisters have established careers in the fashion industry co owning the fashion boutique d a s h and launching several fragrances and clothing collections

5     5.00 rob launched the sock line arthur george in 2012 and was involved in a relationship with singer adrienne bailon in the second and third seasons





10 !!! 3.00 kris married former olympic champion bruce jenner born 1949 in 1991



13 !!! 3.00 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began





























We can see that this does even worse, because the dot product unfairly rewards longer documents: more tokens mean more chances to match the query. We can address that by length-normalizing the documents and the query, that is, dividing each vector by its norm.

The resulting measure is the cosine similarity measure:

In [63]:
def norm(d):
    sum_sq = 0
    for v in d:
        sum_sq += d[v] * d[v]
    return math.sqrt(sum_sq)

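def cos_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    denom = norm(A) * norm(B)
    # Guard against an all-zero vector, e.g. a query with no vocabulary terms
    if denom == 0:
        return 0.0
    return float(dot(A, B)) / denom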
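
The effect of length normalization can be checked by hand on two short sentences. The sketch below uses sparse `Counter` vectors and illustrative helper names (`sparse_dot`, `sparse_norm`, `cosine`); it is not the notebook's implementation, but on plain TF weights it should compute the same quantity.

```python
import math
from collections import Counter

def sparse_dot(a, b):
    return sum(weight * b.get(term, 0) for term, weight in a.items())

def sparse_norm(a):
    return math.sqrt(sum(weight * weight for weight in a.values()))

def cosine(a, b):
    denom = sparse_norm(a) * sparse_norm(b)
    # An all-zero vector has no direction; treat it as dissimilar.
    return sparse_dot(a, b) / denom if denom else 0.0

query = Counter("the olympic champion in kardashians".split())
short_doc = Counter("the couple divorced in 1991".split())
relevant_doc = Counter(
    "kris married former olympic champion bruce jenner born 1949 in 1991".split())

# short_doc:    dot = 2, norms = sqrt(5) * sqrt(5)   -> 2 / 5
# relevant_doc: dot = 3, norms = sqrt(5) * sqrt(11)  -> 3 / sqrt(55)
short_score = cosine(query, short_doc)
relevant_score = cosine(query, relevant_doc)

print(round(short_score, 4), round(relevant_score, 4))
```

Both values round to 0.40, so the relevant sentence beats the irrelevant one only by a narrow margin under plain TF weights.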
In [64]:
print_results(run_search("the olympic champion in kardashians",cos_measure),relevant_docs=relevant_docs_champion)
1 !!! 0.40 kris married former olympic champion bruce jenner born 1949 in 1991

2     0.40 the couple divorced in 1991

3     0.38 rob launched the sock line arthur george in 2012 and was involved in a relationship with singer adrienne bailon in the second and third seasons

4     0.34 in 2012 she served as a co host during the second season of the american version of the x factor

5     0.33 since the series premiere the kardashian sisters have established careers in the fashion industry co owning the fashion boutique d a s h and launching several fragrances and clothing collections










15 !!! 0.24 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began



























This already does better, ranking one of our relevant documents first. The other is still low in the ranking, at position 15.

Another issue we have at the moment is that all words matter equally, but intuition dictates that some words in the query (e.g. olympic) are more important than others (e.g. in). One way to address this is to weight the words according to their specificity, and a concrete implementation is to use inverse document frequency (IDF).

In [37]:
IDF = {}
DF = {}

for t in terms:
    DF[t] = len([1 for sent in sent_words_lower if t in sent])
    IDF[t] = 1 / float(DF[t] + 1)
In [62]:
for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[:10]:
    print(IDF_t)

print("...")

for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[-10:]:
    print(IDF_t)
(u'the', 0.03225806451612903)
(u'and', 0.041666666666666664)
(u'in', 0.043478260869565216)
(u'a', 0.047619047619047616)
(u'kim', 0.07692307692307693)
(u'of', 0.08333333333333333)
(u'was', 0.08333333333333333)
(u'to', 0.1)
(u'series', 0.1)
(u'for', 0.1)
...
(u'cheban', 0.5)
(u'friends', 0.5)
(u'died', 0.5)
(u'vice', 0.5)
(u'2006', 0.5)
(u'2004', 0.5)
(u'time', 0.5)
(u'2008', 0.5)
(u'original', 0.5)
(u'simpson', 0.5)
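
The weights printed above follow directly from the definition: the value 0.0323 for "the" is 1/31, meaning "the" occurs in 30 of the 41 sentences, while a word that occurs in a single sentence gets 1/2. The sketch below recomputes document frequency and this IDF on a made-up three-sentence corpus. It also shows the logarithmic form log(N / DF) that is common in the literature; that variant is included only for comparison and is not what this notebook uses.

```python
import math

# Hypothetical miniature corpus (already tokenized and lowercased).
toy_docs = [
    "the couple divorced in 1991".split(),
    "kris married former olympic champion bruce jenner born 1949 in 1991".split(),
    "keeping up with the kardashians premiered on october 14 2007".split(),
]

vocab = sorted(set(w for doc in toy_docs for w in doc))

# Document frequency: in how many documents does each term occur?
df = {t: sum(1 for doc in toy_docs if t in doc) for t in vocab}

# The notebook's weighting: 1 / (DF + 1).
idf_notebook = {t: 1.0 / (df[t] + 1) for t in vocab}

# A common alternative (an assumption here, not the notebook's choice).
idf_log = {t: math.log(float(len(toy_docs)) / df[t]) for t in vocab}

for t in ["the", "in", "olympic"]:
    print(t, df[t], idf_notebook[t], idf_log[t])
```

Under both forms, rarer terms receive larger weights; they differ in how quickly the weight falls off for frequent terms.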

In [39]:
##TF-IDF weights

def doc_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v) * IDF[v]
    return d

def query_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v) * IDF[v]
    return d
In [65]:
print_results(run_search("the olympic champion in kardashians",cos_measure),relevant_docs=relevant_docs_champion)
1 !!! 0.40 kris married former olympic champion bruce jenner born 1949 in 1991

2     0.40 the couple divorced in 1991

3     0.38 rob launched the sock line arthur george in 2012 and was involved in a relationship with singer adrienne bailon in the second and third seasons

4     0.34 in 2012 she served as a co host during the second season of the american version of the x factor

5     0.33 since the series premiere the kardashian sisters have established careers in the fashion industry co owning the fashion boutique d a s h and launching several fragrances and clothing collections










15 !!! 0.24 kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began



























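With TF-IDF and cosine similarity, the intended result is that both relevant documents are ranked first and second in the returned list. Neat! Note, however, that the output shown for In [65] is identical to the TF-only output of In [64]: the cell numbers suggest that the TF vectorizers (In [48]) were executed after the TF-IDF ones (In [39]), so the cells must be re-run in document order for the TF-IDF weights to take effect.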
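
To see how IDF weighting changes cosine scores, here is a self-contained sketch on a made-up three-sentence corpus. It uses the notebook's 1 / (DF + 1) weighting on both query and document, but with sparse dictionaries and illustrative helper names (`tf_vec`, `tfidf_vec`, `cosine`). The numbers apply to this toy corpus only and are not the notebook's results.

```python
import math
from collections import Counter

toy_docs = [
    "the couple divorced in 1991".split(),
    "kris married former olympic champion bruce jenner born 1949 in 1991".split(),
    "keeping up with the kardashians premiered on october 14 2007".split(),
]
query = "the olympic champion in kardashians".split()

# Document frequency and the notebook's IDF, 1 / (DF + 1).
df = Counter(t for doc in toy_docs for t in set(doc))
idf = {t: 1.0 / (df[t] + 1) for t in df}

def tf_vec(words):
    return {t: float(c) for t, c in Counter(words).items()}

def tfidf_vec(words):
    # Terms outside the toy vocabulary are dropped.
    return {t: c * idf[t] for t, c in Counter(words).items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    denom = (math.sqrt(sum(w * w for w in a.values())) *
             math.sqrt(sum(w * w for w in b.values())))
    return dot / denom if denom else 0.0

tf_scores = [cosine(tf_vec(query), tf_vec(d)) for d in toy_docs]
tfidf_scores = [cosine(tfidf_vec(query), tfidf_vec(d)) for d in toy_docs]

for doc, tf, tfidf in zip(toy_docs, tf_scores, tfidf_scores):
    print("{:.3f} {:.3f} {}".format(tf, tfidf, " ".join(doc)))
```

With plain TF, the sentence about the Olympic champion leads "the couple divorced in 1991" only marginally; with TF-IDF, the matches on "the" and "in" count for much less and the gap widens clearly.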