CS6742 Fall : Assignment 1
This page was last modified Mon August 24, 2015 6:15 PM. Important updates will be posted to Piazza.
Your task: Propose a research idea related to one of the readings below and execute a pilot empirical study using one of the listed datasets. Most crucial is that (a) your idea is interesting, and (b) your pilot empirical study demonstrates that you can quickly evaluate feasibility and estimate the chances of an interesting result.
It is neither required nor expected that your proposal for this assignment will relate to your final course project.
Please strive to post your proposal well in advance of the actual due date (a suggested goal: ), for two reasons. First, you will receive public feedback on your idea, ideally from your classmates as well (indeed, multiple rounds of it if you like), to help you refine or adapt it. Second, you are encouraged to work in groups, and early posting will facilitate linking up with classmates having similar interests.
After posting your proposal, continue to monitor and participate in the Piazza forum. After all, your classmates have read the same papers and are using the same data, so we have a lot of common ground. Example things to post: feedback on other people's proposals; some oddity of the datasets you've found that it's worth alerting others to; unexpected early results that are interesting or that you need help interpreting.
Basically, the goal is for all of us to act as a team; we're all in this together!
Readings
These readings were chosen because they are thought-provoking, accessible, short, and together represent a wide range of possibilities.
Datasets
- Amazon Fine Foods reviews from the Stanford SNAP lab
  - Vikram Rao Sudarshan (2014) noted, "some lines somehow spilled from the line above them. Below are approximate line numbers (+/- 10 lines): 847777, 1593769, 4845258, 1711786, 2554502, 3396184, 3160642. They can be merged with the line above them."
  - Since the Amazon public API was used to crawl this data, it may not contain all reviews from the given time period.
- Slashdot portion of the British Columbia Conversation Corpora BC3-Blog Corpus (https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3.html)
  - As stated in the README, some comments are missing. For example, compare the original Slashdot tree (http://slashdot.org/story/09/05/25/212203/public-notices-going-online-not-in-newspapers) with the file slashdot_part_1/09_05_25_212203.instancedata.txt (notice the correspondence between file name and URL). This makes the dataset less than ideal for representing full conversations, but it should be fine for our pilot-study purposes. Although the CAW 2.0 dataset (http://caw2.barcelonamedia.org/?page_id=98) may have more complete Slashdot conversation trees, it has fewer of them.
  - The "semanticweb.org" URLs given in the files appear to be broken.
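Two of the dataset notes above lend themselves to small helpers, sketched here in Python under stated assumptions rather than as a definitive cleanup procedure. The first merges "spilled" lines in the Amazon file back into the line above them; it assumes every genuine line is either blank or starts with a field prefix such as "review/text: " or "product/productId: " (verify this against your copy of the data). The second rebuilds the leading part of a Slashdot story URL from a BC3 file name, following the correspondence noted above; the story-title part of the URL cannot be recovered from the file name. The names merge_spilled_lines, story_url_prefix, FIELD_RE, and FILE_RE are hypothetical.

```python
import os
import re

# ASSUMPTION: genuine lines in the Amazon file are blank or start with a
# field prefix such as "review/text: " or "product/productId: ".
FIELD_RE = re.compile(r"^(product|review)/\w+: ?")

# File names like "09_05_25_212203.instancedata.txt" (pattern taken from the
# single example in the notes above; other files may differ).
FILE_RE = re.compile(r"^(\d{2})_(\d{2})_(\d{2})_(\d+)\.instancedata\.txt$")


def merge_spilled_lines(lines):
    """Yield lines, appending any non-blank line without a field prefix
    to the line above it (joined with a single space)."""
    pending = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        is_regular = line == "" or FIELD_RE.match(line) is not None
        if is_regular or pending is None or pending == "":
            # Start a new output line; emit the previous one first.
            if pending is not None:
                yield pending
            pending = line
        else:
            # Spilled text: merge it into the line above.
            pending = pending + " " + line
    if pending is not None:
        yield pending


def story_url_prefix(path):
    """Map a BC3 Slashdot file path to the start of its story URL."""
    match = FILE_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: " + path)
    return "http://slashdot.org/story/" + "/".join(match.groups()) + "/"


if __name__ == "__main__":
    sample = ["review/summary: Good\n", "review/text: Tasty and\n", "cheap\n"]
    for merged in merge_spilled_lines(sample):
        print(merged)
    print(story_url_prefix("slashdot_part_1/09_05_25_212203.instancedata.txt"))
```

Since merge_spilled_lines accepts any iterable of strings, it can be applied directly to an open file object. Comparing line counts before and after merging against the seven approximate line numbers reported above is a quick sanity check that the prefix assumption holds.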
Due dates
All deadlines refer to 5:00pm unless otherwise specified.
- : Send an email to the administrative assistant, with subject line “CS6742 CMS registration request”, giving your Cornell NetID in the body (not your Cornell ID number, not your preferred email address, but your official Cornell NetID, which should be your initials followed by a number, e.g., LJL2), so we can add you to CMS.
- (Note the earlier-than-5pm deadline, and, as mentioned in the "Your task" description above, aim for an earlier date of ):
  - Post study idea(s) to Piazza using the folder (topic) "Assignment 1". Post it as an individual "Question" (not a "Note"), which makes it easier to track whether a reply has been posted yet. Choose a title for your "Question" that describes your project idea (e.g., "identifying reviewers with nefarious schemes" as opposed to "four random ideas"). The length expectation is 3+ paragraphs; these paragraphs don't have to be long. Be as detailed as possible while remaining sensible. In the ideal case, you'll already have peeked at the data to make sure your idea is going to be feasible.
  - If persons A, B, and C have already decided to work together, then A should post the "Question", and B and C should each individually post a response to A's "Question" stating that they've agreed to work together. This way, it is clear who has finished this part of the assignment.
  - If, subsequently, D and E want to join forces with A, B, and C because your proposals are similar, please arrange to do so among yourselves. The deadline for CMS group formation is a bit later than the proposal submission deadline precisely to allow for this possibility.
- : Form groups on CMS. CMS group formation requires invitations and acceptance of invitations via the system, i.e., action by two people per person added; please check the CMS documentation for more details. You may form groups of any size. The group information from CMS is needed to schedule the group presentations.
- in class: be prepared to informally discuss how things are going. For example, any preliminary observations about the data? No formal presentation materials are required.
- : Submit a project report on CMS. One group = one submission on CMS.
Required information: the overall research problem you proposed; relation of your research problem to the reading(s) (this description should provide evidence that you read the relevant parts of the readings carefully enough); proposed techniques; steps employed to process/clean/select data; results (probably preliminary, possibly negative); what you learned; a list of the roles that each member of the group played, if there is more than one person in your group. If you collaborated a bit with people outside your group, acknowledge those other people by name and explain their contribution in the writeup.
- in class: Group presentations. You can bring handouts (often most effective for discussions, since people can refer to things out of order) or project slides off a laptop. If the latter, bring a spare copy of your presentation on a flash drive and email a copy.
Academic Integrity
Academic and scientific integrity compels one to properly attribute to others any work, ideas, or phrasing that one did not create oneself. To do otherwise is fraud.
We emphasize certain points here. As you can see above, talking to and helping others is strongly encouraged. You may also, with attribution, use code from other sources. The easiest rule of thumb is: acknowledge the work, contributions, ideas, words, and wordings of others. Do not copy or slightly reword portions of papers, Wikipedia articles, textbooks, other students' work, Stack Overflow answers, something you heard in a talk or a conversation or saw on the Internet, or anything else, really, without acknowledging your sources. See http://www.cs.cornell.edu/courses/cs6742/2011sp/handouts/ack-others.pdf and http://www.theuniversityfaculty.cornell.edu/AcadInteg/ for more information and useful examples.
This is not to say that you can receive course credit for work that is not your own (e.g., taking someone else's report and putting your name at the top, next to the other person(s)' names). However, violations of academic integrity (e.g., fraud) undergo the academic-integrity hearing process on top of any grade penalties imposed, whereas not following the rules of the assignment only risks grade penalties.