Introduction

HotelRev-Scrape is a lightweight Python-based tool for scraping hotel review data (date, rating and review text) from Tripadvisor or Orbitz, for all hotels in (and close to) each city in a given list of cities in a US state.

It works in multiple steps: it first gets the list of URLs for all hotels in a city, then gets all the review pages for each hotel, and finally scrapes the individual reviews from those pages.
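
The three steps can be pictured as nested loops. The following is only a minimal sketch, not the tool's actual code: the function name scrape_city and its three callable parameters are hypothetical stand-ins for the real page-fetching and parsing logic, and the data in the demo is made up.

```python
def scrape_city(city, list_hotel_urls, list_review_pages, extract_reviews):
    """Yield every review for every hotel found for one city.

    The three callables are hypothetical placeholders:
      list_hotel_urls(city)         -> iterable of hotel URLs
      list_review_pages(hotel_url)  -> iterable of review-page URLs
      extract_reviews(page_url)     -> iterable of reviews
    """
    for hotel_url in list_hotel_urls(city):
        for page_url in list_review_pages(hotel_url):
            for review in extract_reviews(page_url):
                yield review


# Made-up lookup tables stand in for real web requests.
hotels = {"Springfield": ["hotel-1", "hotel-2"]}
pages = {"hotel-1": ["hotel-1/p1", "hotel-1/p2"], "hotel-2": ["hotel-2/p1"]}
reviews = {"hotel-1/p1": ["review A"], "hotel-1/p2": ["review B"], "hotel-2/p1": ["review C"]}

for review in scrape_city("Springfield", hotels.get, pages.get, reviews.get):
    print(review)
```

Passing the steps in as functions keeps the sketch runnable without network access.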

If interrupted, it resumes from where it stopped, without re-downloading files it has already fetched.
Detailed documentation of the code is available here.
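
One common way to get this kind of resume behaviour is to save each downloaded page to disk and skip any page that is already there. The sketch below illustrates that idea only; it is an assumption about the general technique, not a description of how HotelRev-Scrape is implemented. The name fetch_cached, the hash-based file names and the download parameter are all hypothetical, and the sketch uses Python 3 calls (os.replace, exist_ok).

```python
import hashlib
import os
import tempfile


def fetch_cached(url, cache_dir, download):
    """Return the page body for url, calling download(url) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, keyed by a hash of the URL.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    body = download(url)
    # Write to a temporary file first, then rename, so an interruption never
    # leaves a half-written page that would later look like a complete one.
    tmp_path = path + ".part"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(body)
    os.replace(tmp_path, path)
    return body


# Demo with a fake downloader that records how often it is called.
calls = []


def fake_download(url):
    calls.append(url)
    return "<html>page for %s</html>" % url


cache_dir = tempfile.mkdtemp()
fetch_cached("http://example.com/hotel/1", cache_dir, fake_download)
fetch_cached("http://example.com/hotel/1", cache_dir, fake_download)
print(len(calls))  # the second call is served from disk
```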

Please contact the author (at the above address) for questions or to report bugs.

Download

Please read the license under which this code is distributed before using it.
Download: [GZ] [ZIP].

There is also a Windows executable for installation (untested).

These packages contain the following files:
1. README.txt : The readme file for the project.
2. LICENSE.txt : License under which software is released.
3. sample_cities.txt : Sample input file to indicate data format.
4. scrapeHotelReviewData.py : The Python module.
5. setup.py : The setup file.
6. DOCUMENTATION.html : The file containing detailed documentation of the code.

Compiling and Installation

HotelRev-Scrape works in Windows, Linux and Mac environments. It requires Python version 2.7 or newer to run properly.

To install this module in your directory for third-party Python modules, run the following from within the directory containing the code:
          python setup.py install

To run the code, run:
python scrapeHotelReviewData.py -state STATE -cities CITIES [-delay DELAY] -site SITE -o OUTPUT -path PATH

For more details see the readme.

Usage


Data Formats

Input: The tool takes as input a text file listing the cities, with one city name per line.

A sample file is available.
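
As an illustration of the input format, the following sketch reads such a file into a list. The function name read_cities is hypothetical, the city names are made-up sample data, and skipping blank lines and trimming whitespace are assumptions rather than documented behaviour of the tool.

```python
import io


def read_cities(handle):
    """Return the city names from a file-like object, one name per line."""
    cities = []
    for line in handle:
        name = line.strip()
        if name:  # assumption: blank lines are ignored
            cities.append(name)
    return cities


# An in-memory file stands in for a real cities file.
sample = io.StringIO("Ithaca\nNew York\n\nBuffalo\n")
print(read_cities(sample))  # ['Ithaca', 'New York', 'Buffalo']
```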

Output: A TSV (tab-separated) file containing one review per line. Reviews are grouped by hotel. The format is:
Hotel Name  City  Date  Rating  Review-Text  Hotel-Address

Note that Review-Text contains no newlines: each newline in the original review is replaced with "-newline-".
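
To read the output back, each line can be split on tabs and the "-newline-" marker turned back into a real newline. This is a minimal sketch under the assumptions that every line has exactly the six columns listed above, in that order, and that there is no header row. The names Review and parse_review_line are hypothetical, and the sample line is made up.

```python
from collections import namedtuple

# Field names mirror the documented column order.
Review = namedtuple(
    "Review",
    ["hotel_name", "city", "date", "rating", "text", "hotel_address"],
)


def parse_review_line(line):
    """Split one output line into a Review and restore newlines in the text."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    review = Review(*fields)
    return review._replace(text=review.text.replace("-newline-", "\n"))


line = "Grand Hotel\tIthaca\tAug 1, 2012\t4\tNice stay.-newline-Would return.\t1 Main St\n"
review = parse_review_line(line)
print(review.hotel_name)
print(review.text)
```

All fields are kept as strings, since the exact date and rating formats are not specified here.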

FAQ

Q) Do you support scraping from other travel websites which have reviews?
A) Currently the only supported websites are Tripadvisor and Orbitz.

Q) I found some hotels from cities other than those specified. Is this a bug?
A) No, this is not a bug. If the review website displays hotels from other cities, the tool will scrape reviews from those as well. However, since the output contains the hotel address and city, such reviews can be filtered out.
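
A filter of this kind could look like the sketch below. It assumes the column order given under Data Formats (City is the second column) and compares city names case-insensitively, which is a choice made for this example. The function name filter_by_city and the sample lines are hypothetical.

```python
def filter_by_city(lines, wanted_cities):
    """Yield only the output lines whose City column is in wanted_cities."""
    wanted = {city.strip().lower() for city in wanted_cities}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Index 1 is the City column in the documented column order.
        if len(fields) >= 2 and fields[1].strip().lower() in wanted:
            yield line


# Made-up sample lines in the documented six-column layout.
lines = [
    "Grand Hotel\tIthaca\tAug 1, 2012\t4\tNice stay.\t1 Main St\n",
    "Lake Inn\tCortland\tJul 9, 2012\t3\tFine.\t2 Lake Rd\n",
]
for kept in filter_by_city(lines, ["Ithaca"]):
    print(kept, end="")
```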

Q) The tool could not find the hotels in a city from Orbitz. Why?
A) When the city description is ambiguous and there are multiple cities with that name, the tool cannot proceed because it does not know which of the candidates to choose. To avoid this problem, it is best to be as precise as possible with the city names.

 

Last Edited on Sep 4th, 2012