This colored text indicates that this is A1
Task: Propose a research idea related to one
of the readings below and execute a pilot empirical study using one of the
listed datasets. Most crucial to me is that (a) your idea
is interesting, and (b) your pilot empirical study demonstrates that you can
quickly evaluate feasibility and estimate the chances of an interesting result.
It is neither required nor expected that your proposal for this assignment
will relate to your final course project.
Please strive to post your proposal well in advance of the actual due date
(a suggested goal: Tuesday Aug. 29, 11:59pm), for two reasons.
First, I (and, I hope, your classmates)
need time to be able to post useful replies and feedback
— indeed, perhaps more than one round, time permitting —
to help you refine or adapt it. Second, you are encouraged to work in groups,
and early posting will facilitate linking up with classmates having similar interests.
After posting your proposal, continue to monitor and participate on the course discussion site.
After all, your classmates have read the same papers and are using the same data,
so we have a lot of common ground.
Example things to post: feedback on other people's proposals;
some oddity of the datasets you've found that is worth alerting others to;
unexpected early results that are interesting or that you need help interpreting.
Basically, I would like us all to act as a team; we're all in this together!
The two required readings
- Excerpts from anaesthetica's “Attacked from within”
- Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil, 2016.
Conversational flow in Oxford-style debates.
These readings were chosen because they are thought-provoking, accessible, short,
and together represent a wide range of possibilities.
The two datasets — you are required to use one
- Cornell ChangeMyView data, November 2016 version
- README for the January 2016 version — still mostly applicable, since the file format did not change.
- Discussion and example code
- Optional reading: the original paper
in which this dataset was introduced, Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee, 2016,
Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions,
WWW, pp. 613–624.
- Miscellaneous notes: the reason I chose this dataset, rather than
the dataset associated with the Zhang et al. reading, is that it has more types of information in it, and
so might be conducive to a wider variety of exploratory projects.
- Slashdot portion of the British Columbia Conversation Corpora BC3-Blog Corpus
- README, constructed
from the paper in which this dataset was introduced, viz., Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq R. Joty, 2011,
Exploiting conversational features to detect high-quality blog comments,
the Canadian Conference on AI, pp. 122–127.
[official link] [author-posted version]
- As stated in the README, some comments are missing. For example, compare the
original Slashdot tree
with the file slashdot_part_1/09_05_25_212203.instancedata.txt (notice the correspondence between file name and URL). This makes the dataset less than ideal for representing full conversations, but it should be fine for our pilot-study purposes.
- The "semanticweb.org" URLs given in the files appear to be broken.
- Miscellaneous notes:
(i) Here is the full UBC Conversation Corpora website.
(ii) Why the UBC corpus and not the CAW 2.0 dataset? While the latter may have more complete Slashdot conversation trees, it has fewer of them.
(iii) I am aware of this criticism of both the CAW
2.0 dataset and of looking at Slashdot in general.
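The file-name/URL correspondence noted above for the Slashdot files can be sketched roughly as follows. This is a minimal illustration, assuming that the four underscore-separated numeric fields in a file name such as 09_05_25_212203.instancedata.txt correspond, in order, to slash-separated components of the story URL. The function name is hypothetical, and the host and any URL prefix are deliberately left out because they are not given here; check a few files against the README before relying on this.

```python
from pathlib import PurePosixPath


def story_path_from_filename(filename: str) -> str:
    """Map a BC3 Slashdot file name to the (assumed) story path fragment.

    Assumption: the four underscore-separated numeric fields of the file
    name appear, in the same order, as slash-separated URL components.
    """
    # Keep only the base name, then drop everything after the first dot
    # (".instancedata.txt" in the example file name).
    stem = PurePosixPath(filename).name.split(".", 1)[0]
    parts = stem.split("_")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected file name format: {filename!r}")
    return "/".join(parts)


if __name__ == "__main__":
    example = "slashdot_part_1/09_05_25_212203.instancedata.txt"
    print(story_path_from_filename(example))
```

A helper like this can make it easier to spot-check a data file against its original comment tree when looking for missing comments.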
Teamwork is encouraged.
Groups of any size can be formed, where each group jointly submits a single project report
at the end on the official course management system, CMS. However, each person remains
individually responsible for posting feedback on other people's/groups' proposals.
There are further notes on how to find/work as a group below.
All deadlines refer to 5:00pm unless otherwise specified.
- Friday Aug. 25:
- Sign in to https://blogs.cornell.edu
using your Cornell NetID and password. Click on "Your Profile" and
choose a nickname (which can be your real name or your first name)
and, via the pull-down menu that says "Display Name
Publicly As", a display name. (The "nickname" you entered will
be one of the options.)
- Send an email to email@example.com with subject line "CS/IS 6742 account request"
containing all the following information.
Once you send this email, you will be (manually) given access to
the course discussion site and, if not already in the system, CMS.
- Your Cornell NetID (example: LJL2)
- Name you prefer to be referred to by in this class
(example: I prefer to be called "Lillian". Some other
"Lillian"s prefer to be called "Lil",
but not me.)
- The display name you entered at the course discussion site
- Your goals for taking this course
- What background you have, including but not limited to how you
satisfy the three prerequisites ((a) CS 2110 or equivalent
programming experience; (b) a course in artificial intelligence or
any relevant subfield (e.g., NLP, information retrieval,
machine learning, Cornell CS courses numbered 47xx or 67xx);
(c) proficiency with using machine learning tools (e.g., fluency
at training an SVM, comfort with assessing a classifier’s
performance using cross-validation))
- Friday Sept. 1, 2:30pm (Note the earlier-than-5pm deadline, and, as mentioned in the "Task" description above, aim for an earlier date of Tuesday Aug. 29, 11:59pm): Post pilot-study idea(s) to
the course discussion site. (Look for the "+ New"
item in the admin bar across the very top of the page and select "Post".)
- Feel free to make scratch or rough-draft posts to get used to the
discussion-site interface! But use the Edit→"Move to Trash" functionality
to get rid of scratch posts.
- Each idea should be a separate post, to keep discussions organized.
- Choose a title that summarizes your project idea (e.g., “Identifying reviewers with nefarious schemes” as opposed to “Random ideas”).
- Length expectation: 3+ paragraphs. Be as detailed as possible while remaining sensible. In the ideal case, you'll already have peeked at the data to make sure your idea is going to be feasible.
- Add the category "A1" before publishing your post (the category selection
choices are on the right-hand sidebar, underneath the "Publish" button).
- If persons A, B, and C have already decided to work together, then A should make the idea post, and B and C should each individually post a reply to A's post stating that they've agreed to work together. This way, I can tell who has finished this part of the assignment.
- If, subsequently, D and E want to join forces with A, B and C because your proposals are similar, please arrange to do so among yourselves. The deadline for CMS group formation is a bit later than the proposal submission deadline precisely to allow for this possibility.
- Monday Sept. 4: Form groups on CMS. CMS group formation requires invitations and acceptance of invitations via the system, i.e., action by two people per person added; please check the official CMS documentation or this more graphically-oriented guide for instructions. I need the group information from CMS to schedule the group presentations.
- Tuesday Sept. 5 in class:
- Check back on course discussion site
for any comments on your proposal, and add, as replies, any suggestions
you have on other people's proposals. Ideally, you will continually
monitor the site for updates to your or other people's proposals.
- Be prepared to informally discuss how things are going. For example, any preliminary observations about the data? No formal presentation materials are required.
- Friday Sept. 8: Submit a project report on CMS. One group = one CMS submission: any
group member can upload a version, which will overwrite any previous versions
by any other members of the group.
Required information: (a) the overall research problem you proposed; (b) relation
of your research problem to the reading(s) (this description should provide
evidence that you read the relevant parts of the readings carefully);
(c) proposed techniques; steps employed to process/clean/select data;
(d) results (probably preliminary, possibly negative); (e) what you learned;
(f) a list of the roles that each member of the group played, if there is more than one person in your group.
(g) If you collaborated a bit with people outside your group, acknowledge those
other people by name and explain their contribution in the writeup.
- Tuesday Sept. 12, in class: Group presentations. You can bring handouts (often most effective for discussions, since people can refer to things out of order) or project slides off a laptop. If the latter, bring a spare copy of your presentation on a flash drive and email me a copy.
Academic Integrity
Academic and scientific integrity compels one to properly attribute to
others any work, ideas, or phrasing that one did not create oneself. To do otherwise is fraud.
Certain points deserve emphasis here.
In this class, talking to and helping others is strongly encouraged.
You may also, with attribution, use code from other sources.
The easiest rule of thumb is, acknowledge the work and contributions and ideas and words and wordings of others.
Do not copy or slightly reword portions of papers, Wikipedia articles, textbooks, other students' work, Stack Overflow answers,
something you heard from a talk or a conversation or saw on the Internet,
or anything else, really, without acknowledging your sources.
See "Acknowledging the Work of Others" in
The Essential Guide to Academic Integrity at Cornell
for more information and useful examples.
This is not to say that you can receive course credit for work that is not your own —
e.g., taking someone else's report and putting your name at the top, next to the other person(s)' names.
However, violations of academic integrity (e.g., fraud) undergo the academic-integrity hearing process on
top of any grade penalties imposed,
whereas not following the rules of the assignment “only” risks grade penalties.