Python: module scrapeRateProfs

scrapeRateProfs

Python module for scraping reviews from RateMyProfessor.

This module scrapes reviews of professors at a given university from ratemyprofessor.com. It does this in the following steps:
    Step 1: Get the list of professors starting with each letter.
    Step 2: Get the list of all reviews for each professor.
    Step 3: Get each review from that list.

It can also be interrupted and (effectively) resume from where it stopped without having to redownload all the previous files. It does this by performing optimizations such as storing the webpages it has downloaded and compacting the downloaded pages into a format that is easy to process.

Usage:
    scrapeRateProfs.py [-h] -sid SchoolID [-delay DELAY] -o OUTPUT -path PATH

Inputs:
    -h, --help      show this help message and exit
    -sid SchoolID   ID of the school on RateMyProf
    -delay DELAY    Amount of time to pause after downloading a website
    -o OUTPUT       Path to output file for reviews
    -path PATH      Directory where the webpages should be downloaded

Key Outputs:
    - TSV file containing the review information.
    - TSV file containing the aggregate information for a professor.
    - Condensed set of information downloaded.

Formats are provided in the accompanying README.
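The usage line above could be produced by an argparse setup along the following lines. This is a minimal sketch, not the module's actual code: the function name build_parser, the float type and 1.0 default for -delay, and the dest name output are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser matching the documented usage line (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="scrapeRateProfs.py",
        description="Scrape professor reviews for one school.",
    )
    parser.add_argument("-sid", metavar="SchoolID", required=True,
                        help="ID of the school on RateMyProf")
    # Assumption: the delay is given in seconds and defaults to 1.0.
    parser.add_argument("-delay", metavar="DELAY", type=float, default=1.0,
                        help="Amount of time to pause after downloading a website")
    # Assumption: the value is stored under the name "output".
    parser.add_argument("-o", metavar="OUTPUT", required=True, dest="output",
                        help="Path to output file for reviews")
    parser.add_argument("-path", metavar="PATH", required=True,
                        help="Directory where the webpages should be downloaded")
    return parser
```

Calling build_parser().parse_args() with -sid, -o, and -path supplied returns a namespace with the attributes sid, delay, output, and path; omitting any of the three required options makes argparse exit with a usage error.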

Modules

argparse
bisect
os
sys
time
urllib2

Functions


downloadToFile(url, fileName, force=False)
Downloads url to file

Inputs:
    - url      : Url to be downloaded
    - fileName : Name of the file to write to
    - force    : Optional boolean argument which, if true, will overwrite the file even if it exists

Returns:
    - Pair indicating whether the file was downloaded, and a list of the contents of the file split up line-by-line
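The download-or-reuse behaviour described for downloadToFile is what makes an interrupted run cheap to resume. A minimal Python 3 sketch of that idea follows, assuming the cache check is a simple file-existence test. The snake_case names, the fetch_url helper, and the extra fetch parameter are hypothetical and are not part of the documented signature.

```python
import os
import urllib.request


def fetch_url(url):
    """Hypothetical helper: return the decoded body of url."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def download_to_file(url, file_name, force=False, fetch=fetch_url):
    """Return (downloaded, lines), reusing file_name unless force is true."""
    if os.path.exists(file_name) and not force:
        # A saved copy exists: skip the network and read it back.
        with open(file_name, encoding="utf-8") as handle:
            return False, handle.read().splitlines()
    # No saved copy (or force was requested): fetch and store the page.
    content = fetch(url)
    with open(file_name, "w", encoding="utf-8") as handle:
        handle.write(content)
    return True, content.splitlines()
```

The first element of the returned pair lets a caller decide whether to pause before the next request: a page served from disk does not need the politeness delay.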

getAllReviews(sid, outF)
The main function to get all the reviews from a school.

getFileContentFromWeb(url)
Downloads data from a website

Inputs:
    - url : Url to be downloaded

Returns:
    - Content of url
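The module list shows urllib2, which exists only in Python 2; in Python 3 the same functionality lives in urllib.request. A rough Python 3 equivalent of getFileContentFromWeb might look like the sketch below. The snake_case name, the timeout parameter, the User-Agent header, and the UTF-8 decoding are all assumptions, not details taken from the module.

```python
import urllib.request


def get_file_content_from_web(url, timeout=30):
    """Return the body of url as text (illustrative sketch)."""
    # Assumption: a browser-like User-Agent is sent with the request.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: pages are decoded as UTF-8, replacing invalid bytes.
        return response.read().decode("utf-8", errors="replace")
```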

getLinksFromList(plContents)
Extracts the links to the individual review pages of each professor

Inputs:
    - plContents : Pruned contents of the professor list file

Returns:
    - Triplet of arrays: the professor names, the links to their review pages, and boolean flags indicating whether each professor has any reviews

getReviewsForProf(rurl, rname)
Gets the reviews and aggregate statistics for a specific professor.

Inputs:
    - rurl  : Link to the review page for a professor
    - rname : Name of the professor (this will be unique)

pruneProfListFile(plContents, fileName)
Prunes the professor list file contents from the entire webpage down to only what is required

Inputs:
    - plContents : Contents of the professor list webpage
    - fileName   : Name of the file to write the output to

Returns:
    - Pruned content from the list page containing only information about the links to each individual professor's review page

pruneProfReviewFile(prContents, fileName)
Prunes the review page for a particular professor down to only what is required, namely the stats and the reviews

Inputs:
    - prContents : Contents of the review webpage for a particular professor
    - fileName   : Name of the file to write the output to

Returns:
    - Pruned contents of the review page, which include the aggregate details of all reviews as well as all individual reviews