CS 5150 Software Engineering: Project Suggestion

CS 5150
Software Engineering
Fall 2009

Project Suggestion:
Social Network Applied Perception (SNAP) Search Engine

Client

Stephen Purpura, Cornell Information Science
email: sp559@cs.cornell.edu

Contact for this project

Jiyoung Park is forming a team of people interested in this project. His email address is jp757@cornell.edu and his phone number is 607-793-0465.

Objective

SNAP is a research search engine built on top of the open source Lucene system to support experiments with modifications to traditional information retrieval indexing and user experience. The research goals of SNAP include experimenting with both the manner in which users specify search concepts to a search engine and the underlying indexing system that supports search.

For an interested CS 5150 team, we would like to extend SNAP to work with data crawled from Facebook, Twitter, and (time allowing) a limited collection of blogs or Wikipedia data. In addition, the team will go through at least two iterations of user experience testing for two components of the system: the UI for managing crawls and the UI for managing expert tagging. The system will be deployed on a "production" Amazon Elastic Compute Cloud system (to be provided to the development team), and the team will need to coordinate at least a few "deployments" during the release cycle to facilitate testing.

The Facebook and Twitter crawling projects are non-trivial: the data must be anonymized to comply both with reasonable treatment of privacy for human subjects and with the terms of service agreements of each service.

Harvesting from Facebook and Twitter

The proposed addition to SNAP will (given a list of Facebook or Twitter usernames):

(1) Crawl all posts (including comments and follow-ups), extracting them into flat text files and a MySQL database so that they can be indexed by Lucene. The file/database layout needs to be designed.

(2) Extract personal information about the person posting each post (as allowed by the service) for linkage to each of the text files created in the first step.

(3) During extraction, data will be anonymized following a procedure to be provided.
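
As a rough illustration of how steps (1) and (3) might fit together, the following Python sketch writes each post to a flat text file and records it in a database under a pseudonymous author ID. Everything here is an assumption for illustration only: the table layout, the file naming scheme, the function names, and the secret key are hypothetical; SQLite stands in for MySQL so the example is self-contained; and the keyed-hash pseudonymization is merely a placeholder for the anonymization procedure that will be provided. No crawling or service API calls are shown.

```python
import hashlib
import hmac
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical secret key; a real deployment would load it from protected
# configuration and follow the anonymization procedure provided by the client.
SECRET_KEY = b"replace-with-project-secret"


def pseudonymize(username, key=SECRET_KEY):
    """Map a username to a stable pseudonym with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def init_db(conn):
    """Create a hypothetical posts table (SQLite standing in for MySQL)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               service          TEXT NOT NULL,
               post_id          TEXT NOT NULL,
               author_pseudonym TEXT NOT NULL,
               parent_post_id   TEXT,
               text_path        TEXT NOT NULL,
               PRIMARY KEY (service, post_id)
           )"""
    )


def store_post(conn, out_dir, service, post_id, username, text,
               parent_post_id=None):
    """Write the post text to a flat file and record it in the database.

    Only the pseudonym is stored, never the original username.
    Comments and follow-ups point at their parent via parent_post_id.
    """
    pseudonym = pseudonymize(username)
    path = Path(out_dir) / f"{service}_{post_id}.txt"
    path.write_text(text, encoding="utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)",
        (service, post_id, pseudonym, parent_post_id, str(path)),
    )
    conn.commit()
    return pseudonym


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    with tempfile.TemporaryDirectory() as out_dir:
        store_post(conn, out_dir, "twitter", "1", "alice", "An example post.")
        store_post(conn, out_dir, "twitter", "2", "bob", "An example reply.",
                   parent_post_id="1")
        for row in conn.execute("SELECT * FROM posts ORDER BY post_id"):
            print(row)
```

Keeping one text file per post lets an indexer such as Lucene read the files directly, while the database row carries the linkage between a post, its parent, and its (pseudonymous) author. Whether a keyed hash is acceptable depends on the anonymization procedure and on each service's terms of service.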

Ideal Candidates

This project is ideal for people interested in information retrieval, human-computer interaction research, and gaining experience working in an Amazon Elastic Compute Cloud environment. Lucene is Java-based, and other existing modules of SNAP are written in Perl, Python, and C. If the team can make a rational argument to support it, we can examine expanding the project to include Hadoop processing as a final step. The team may also need to interface with an iPhone app development project that is integrating with SNAP.

Wider application

The implementation of the system will eventually be released as open source software. It will initially be used as a research platform at Cornell.

wya@cs.cornell.edu
Last changed: September 2009