------------------------- README File For RateProf-Scrape -------------------------

Karthik Raman
Version 1.0
31/08/2012
http://www.cs.cornell.edu/~karthik/projects/rateprof-scrape/index.html

------------------------- INTRODUCTION -------------------------

RateProf-Scrape is a Python class for scraping reviews of professors at a given university from ratemyprofessors.com. It collects both aggregate information and detailed review scores (along the 4 different axes the website provides).

It works in the following steps:
Step 1: Get the list of professors starting with each letter.
Step 2: Get the list of all reviews for each professor.
Step 3: Get each review from that list.

The scraper can be interrupted and (effectively) resumed from where it stopped, without having to re-download all the previous pages. It does this by storing the webpages it has downloaded and compacting them into a format that is easy to process.

Note that no personal information about the reviewers, such as their account name or address, is scraped.

------------------------- COMPILING -------------------------

RateProf-Scrape works in Windows, Linux and Mac environments.

**NOTE** RateProf-Scrape requires Python version 2.7 or newer in order to run properly.
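The interrupt-and-resume behaviour described above boils down to checking the download directory before hitting the network. A minimal sketch of the idea (the function name, file-naming scheme and `download` hook here are illustrative, not the module's actual API):

```python
import os
import tempfile

def fetch_cached(url, cache_dir, download):
    """Return the page at `url`, downloading it only if no cached copy
    exists in `cache_dir`; otherwise the copy on disk is reused."""
    # Derive a file name from the URL (illustrative scheme only).
    fname = os.path.join(cache_dir, url.replace("/", "_").replace(":", "_"))
    if os.path.exists(fname):
        with open(fname) as f:      # cached on a previous run: skip the network
            return f.read()
    page = download(url)            # in the real script: a urllib fetch
    with open(fname, "w") as f:     # save so an interrupted run can resume here
        f.write(page)
    return page

# Demonstrate the resume behaviour with a stand-in download function.
cache = tempfile.mkdtemp()
calls = []
def fake_download(url):
    calls.append(url)
    return "<html>professor list</html>"

page1 = fetch_cached("http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=298",
                     cache, fake_download)
page2 = fetch_cached("http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=298",
                     cache, fake_download)
```

On the second call the page is served from disk, so the network is hit only once; this is why an interrupted run does not redo earlier downloads.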
You can download the latest version of Python at http://www.python.org/download/

------------------------- INSTALLATION -------------------------

If you want to install this module in your directory for third-party Python modules, then run:

python setup.py install

------------------------- RUNNING -------------------------

Usage: scrapeRateProfs.py [-h] -sid SchoolID [-delay DELAY] -o OUTPUT -path PATH

Inputs:
  -h, --help     Show this help message and exit
  -sid SchoolID  ID of the school on RateMyProf
  -delay DELAY   Amount of time to pause after downloading a webpage
  -o OUTPUT      Path to the output file for reviews
  -path PATH     Directory where the webpages should be downloaded

------------------------- INPUT DATA FORMAT -------------------------

The script takes its input via the command line, as described above. The key input is the school ID on ratemyprofessors.com. For example, reviews for faculty from Cornell University are found at http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=298, so Cornell University has the sid "298". Given just this ID, the code will work for any institution with review information available on the ratemyprofessors.com website.

------------------------- OUTPUT DATA FORMAT -------------------------

The code produces 2 key TSV output files:

a) The ".aggreg" file contains the aggregate information for each faculty member. The file is sorted alphabetically by last name (though some faculty with no reviews will have their last names listed before their first names). Each line has the following tab-separated columns:

Index  Name  Total # Of Reviews  Department  Average Quality  Average Helpfulness  Average Clarity  Average Easiness

b) The ".review" file contains all the individual reviews for each faculty member. Reviews are ordered by date for each faculty member (with faculty sorted alphabetically by last name as before). Each line has the following tab-separated columns:

ReviewIndex  Faculty Name  Faculty Dept.  Review ID for this faculty  Review Date  Class for which Review was written  Quality Rating  Helpfulness Rating  Clarity Rating  Easiness Rating  Reviewer Interest  Review Text

Note that the Review Text may contain Windows newline characters (seen as Ctrl+M in vim). These can easily be replaced within vim, using sed, or with other similar tools.

------------------------- CONTENTS -------------------------

The source distribution includes the following files:
1. README.txt          : This readme file.
2. LICENSE.txt         : License under which the software is released.
3. scrapeRateProfs.py  : The Python module.
4. setup.py            : The setup file.
5. DOCUMENTATION.html  : Detailed documentation of the code.

There is also a Windows binary .exe file available (though it is untested).

------------------------- SAMPLE USAGE -------------------------

To run for Cornell University with a delay of 2, with output files prefixed CornellU.tsv, and with intermediate files written into the WebPages directory:

python scrapeRateProfs.py -sid 298 -path WebPages/ -o CornellU.tsv -delay 2

------------------------- FURTHER DOCUMENTATION -------------------------

Please see the HTML file for documentation of the different functions.
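The Ctrl+M cleanup mentioned in the output-format section can also be done while post-processing the ".review" file in Python. A minimal sketch, where the sample row and its field values are made up purely for illustration:

```python
def clean_review_line(line):
    """Split one ".review" line into its 12 tab-separated fields and
    strip stray carriage returns (the Ctrl+M characters) from the
    free-text Review Text field."""
    fields = line.rstrip("\n").split("\t")
    fields[-1] = fields[-1].replace("\r", " ").strip()
    return fields

# Made-up example row in the layout described above.
sample = ("1\tJohn Doe\tComputer Science\t1\t01/09/2012\tCS101\t"
          "4.0\t4.5\t3.5\t3.0\t4\tGreat lectures.\rClear grading.\n")
fields = clean_review_line(sample)
```

This keeps the tab structure intact and only touches the final field, so numeric columns can still be split and parsed directly.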