CS6742 Fall 2014: Assignment 1
This page last modified Wed September 3, 2014 8:35 PM. Important updates will be posted to Piazza.
Your task: Propose a research idea related to one of the readings below and execute a pilot empirical study using one of the listed datasets. Most crucial to me is that (a) the idea is interesting, and (b) your pilot empirical study demonstrates that you can quickly evaluate feasibility and estimate the likely chance of an interesting result. Note that your final course project does not have to be related in any way to your proposal for this assignment.
You should post your proposal to Piazza as early as possible (preferably far ahead of the deadlines below), for two reasons. First, I (and, I hope, your classmates) will give you public feedback on your idea — indeed, multiple rounds of it if you like — to help you refine or adapt it as appropriate. Second, you are encouraged to work in groups, so early posting will facilitate linking up with classmates having similar interests.
After posting your proposal, you should continue to monitor and participate in the Piazza forum. After all, your classmates have read the same papers and are using the same data, so we have a lot of common ground. Example things to post: feedback on other people's proposals; some oddity of the datasets that it's worth alerting others to; unexpected early results that are interesting or that you need help interpreting.
Basically, I'd like us to act as a team; we're all in this together!
These readings were chosen because they are short, accessible, and thought-provoking, and together represent a pretty wide range of possibilities.
- Amazon Fine Foods reviews from the Stanford SNAP lab
- Vikram Rao Sudarshan notes, "some lines [have] somehow spilled from the line above them. Below are approximate line numbers (+/- 10 lines): 847777, 1593769, 4845258, 1711786, 2554502, 3396184, 3160642. They can be merged with the line above them."
- Since the Amazon public API was used to crawl this data, it's not clear that it really contains all reviews from the given time period.
- Slashdot portion of the British Columbia Conversation Corpora (BC3) Blog Corpus (https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3.html).
- As stated in the README, some comments are missing. For example, compare the original Slashdot tree http://slashdot.org/story/09/05/25/212203/public-notices-going-online-not-in-newspapers with the file slashdot_part_1/09_05_25_212203.instancedata.txt (notice the correspondence between the file name and the URL). This makes the dataset less than ideal for full conversation-structure analysis, but it should be fine for our pilot-study purposes. The CAW 2.0 dataset may have more complete Slashdot conversation trees, but it has fewer of them.
- The "semanticweb.org" URLs given in the files appear to be broken.
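For the spilled lines in the Amazon Fine Foods file noted above, here is a minimal Python sketch of one way to merge them back into the line above. It rests on assumptions that are not stated on this page and that you should check against the data: that every legitimate non-blank line starts with a field name beginning with "product/" or "review/", and that blank lines separate reviews. The function name and the prefix tuple are made up for illustration.

```python
from typing import Iterable, Iterator, Optional

# Assumed field prefixes (hypothetical; verify against the actual file).
FIELD_PREFIXES = ("product/", "review/")


def merge_spilled_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines without trailing newlines, appending any non-blank line
    that lacks a known field prefix to the non-blank line above it."""
    previous: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        is_field = line.startswith(FIELD_PREFIXES)
        is_blank = line.strip() == ""
        can_merge = previous is not None and previous.strip() != ""
        if can_merge and not is_field and not is_blank:
            # Spilled fragment: glue it onto the preceding line.
            previous = previous + " " + line.strip()
            continue
        if previous is not None:
            yield previous
        previous = line
    if previous is not None:
        yield previous


if __name__ == "__main__":
    sample = [
        "product/productId: B001\n",
        "review/text: good tea,\n",
        "but pricey\n",
        "\n",
        "product/productId: B002\n",
    ]
    for merged in merge_spilled_lines(sample):
        print(merged)
```

Because the function works on any iterable of strings, you can pass it an open file handle directly. A fragment that follows a blank line is left untouched rather than guessed at, so it is worth checking the approximate line numbers listed above by eye afterwards.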
All deadlines are at 5:00pm on the day listed, except, of course, for in-class activities.
Thursday Sep 4 extended to Friday Sep 5th at 2:30p.m. (but do this well beforehand!):
- Post study idea(s) to Piazza, as an individual "Question" (not a "Note", because it's easier for me to track whether I've replied yet that way). My length expectation is 3+ paragraphs (they can be short paragraphs). Be as detailed as possible while remaining sensible. In the ideal case, you'll already have peeked at the data to make sure your idea is going to be feasible.
- If persons A, B, and C have already decided to work together, then A should post the "question", and B and C should each individually respond on Piazza to A's "question", each saying that they've agreed to work together; this way, I can tell who has finished this part of the assignment.
- If, subsequently, D and E want to join forces with you because your proposals are similar, please arrange to do so among yourselves, if the arrangement seems workable. This is why the deadline for CMS group formation is a bit later than the proposal submission deadline.
- Send an email to my administrative assistant Ms. Maria Witlox (firstname.lastname@example.org), subject line “CS6742 CMS registration request”, giving your Cornell NetID in the body, so we can add you to CMS. This means your official Cornell NetID, which should be your initials followed by a number, e.g., LJL2, NOT your Cornell ID number or your preferred email address.
- Monday Sep 8: form groups on CMS. CMS group formation requires invitations and acceptance of invitations, i.e., action by two people per person added. You may form groups of any size. I need this information to schedule the group presentations.
- Tuesday Sep 9: be prepared to informally discuss how things are going during class time. For example, any preliminary observations about the data?
- Friday Sep 12: Submit a report on CMS. One group = one submission on CMS.
Required information: the overall research problem you proposed; how it relates to the reading(s) (this description should provide evidence that you read the relevant parts of the readings carefully enough); proposed techniques (if applicable); processing and selection of data; results (probably preliminary, possibly negative); what you learned; and, if there is more than one person in your group, a list of the roles that each member played. If you collaborated a bit with people outside your group, which can happen if group memberships shift, acknowledge those other people in the writeup.
- Tuesday Sep 16: Group presentations. You can bring handouts (often most effective for discussions, since people can refer to things out of order) or project slides off a laptop. If the latter, bring a spare copy of your presentation on a flash drive and email me a copy.
Academic Integrity
Academic and scientific integrity compels one to properly attribute to others any work, ideas, or phrasing that one did not create oneself. To do otherwise is fraud.
We emphasize certain points here. As you can see above, talking to and helping others is strongly encouraged. The easiest rule of thumb is, acknowledge the work and contributions and ideas and words and wordings of others. Do not copy or slightly reword portions of papers, Wikipedia articles, textbooks, other students' work, something you heard from a talk or a conversation, or anything else, really, without acknowledging your sources. See http://www.cs.cornell.edu/courses/cs6742/2011sp/handouts/ack-others.pdf and http://www.theuniversityfaculty.cornell.edu/AcadInteg/ for more information and useful examples.