Wikipedia Talk Page Conversations Corpus v1.01 (released September 2012)

Distributed together with:

"Echoes of power: Language effects and power differences in social interaction"
Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg
WWW 2012

NOTE: If you have results to report on this corpus, please send an email to cristian@cs.cornell.edu so we can add you to our list of people using this data. Thanks!

Contents of this README:

    A) Brief description
    B) Files description
    C) Contact

A) Brief description:

This corpus contains a collection of conversations from Wikipedia editors' talk pages (http://en.wikipedia.org/wiki/Wikipedia:Talk_page_guidelines) with metadata:

- 391,294 utterances making up 125,292 conversations
- involving a total of 30,732 editors
- taking place on 5,657 talk pages (and their archives)

The metadata includes:

- editor's status
- time of editor's status change
- utterance timestamp
- editor's gender
- editor's number of edits
- structure of the conversations

This data was collected in August 2011.

B) Files description

<===> wikipedia.talkpages.conversations.txt

This file contains 125,292 conversations separated by blank lines; each line in a conversation corresponds to an utterance and has the following format (the field separator is "+++$+++"):

UTTERANCE_ID +++$+++ USER +++$+++ TALKPAGE_USER +++$+++ CONVERSATION_ROOT +++$+++ REPLY_TO +++$+++ TIMESTAMP +++$+++ TIMESTAMP_UNIXTIME +++$+++ CLEAN_TEXT +++$+++ RAW_TEXT

where:

    UTTERANCE_ID        unique id of the utterance
    USER                the username of the Wikipedia editor who wrote this utterance (empty string if the username could not be parsed)
    TALKPAGE_USER       the username of the Wikipedia editor on whose page this conversation took place
    CONVERSATION_ROOT   the id of the initial post (utterance) in this conversation; this can be used as a unique conversation id
    REPLY_TO            the id of the utterance to which this utterance is a reply; this is not necessarily the utterance in the previous line (see "Note on the structure of conversations" below). -1 indicates that the reply structure could not be recovered.
    TIMESTAMP           the timestamp of this utterance in the format "yyyy-mm-dd hh:mm:ss" ("-1" if the timestamp could not be parsed)
    TIMESTAMP_UNIXTIME  the timestamp of this utterance in unix time format ("-1" if the timestamp could not be parsed)
    CLEAN_TEXT          cleaned version of the utterance (e.g., signature removed, conversational indentation removed)
    RAW_TEXT            raw version of the utterance

Notes:

<> Note on the structure of conversations:

Wikipedia talk pages support nested conversations, and an utterance is not necessarily a reply to the immediately preceding utterance. For example, in the following conversation:

    B: original post 1:00AM
        A: first comment 1:01AM
            B: comment (reply to A's first comment) 1:03AM
        A: second comment 1:02AM

A's second comment is a reply to B's original post; it is not a reply to the temporally preceding comment (A's first comment), nor is it a reply to the visually preceding comment (B's comment).

Due to inconsistencies in the formatting employed, it is sometimes not possible to confidently reconstruct the structure of a conversation. In those cases the REPLY_TO field is set to -1. We are conservative in our treatment of this issue; for example, in the case of a double-indented reply such as:

    http://en.wikipedia.org/wiki/User_talk:AnonEMouse/Archive_15#PD-RoM

the REPLY_TO field is set to -1.

<> A user can reply to his/her own posts.

<> Many edits are unregistered edits (http://en.wikipedia.org/wiki/Wikipedia:Welcome_unregistered_editing); in this case the USER is a public IP address.

<> Due to many formatting inconsistencies, the parsing and cleaning of the utterances might not be perfect (e.g., for some of the utterances the username could not be confidently recovered from the signature). Note that the RAW_TEXT (which includes markup and the signature) is included for further processing.
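
The format described above can be read with a few lines of Python. The following is only a minimal sketch, not code distributed with the corpus. The function names (parse_utterance, read_conversations, children_by_parent), the lowercase field keys, and the sample ids and texts are all hypothetical. It also assumes that whitespace around the separator can be stripped and that each utterance sits on a single line. This README does not say what REPLY_TO holds for the initial post, so the sample uses -1 there as a placeholder.

```python
from collections import defaultdict

FIELD_SEP = "+++$+++"
# Field order as listed in this README; the lowercase keys are our own choice.
FIELDS = [
    "utterance_id", "user", "talkpage_user", "conversation_root", "reply_to",
    "timestamp", "timestamp_unixtime", "clean_text", "raw_text",
]


def parse_utterance(line):
    """Split one utterance line into a dict keyed by field name."""
    # Limit the number of splits so any separator-like text in RAW_TEXT survives.
    parts = [p.strip() for p in line.rstrip("\n").split(FIELD_SEP, len(FIELDS) - 1)]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def read_conversations(lines):
    """Yield conversations (lists of utterance dicts) separated by blank lines."""
    current = []
    for line in lines:
        if line.strip():
            current.append(parse_utterance(line))
        elif current:
            yield current
            current = []
    if current:
        yield current


def children_by_parent(conversation):
    """Map each utterance id to the ids of its direct replies.

    Utterances whose REPLY_TO is "-1" (structure not recovered) are skipped.
    """
    children = defaultdict(list)
    for utt in conversation:
        if utt["reply_to"] != "-1":
            children[utt["reply_to"]].append(utt["utterance_id"])
    return dict(children)


# Made-up sample mirroring the nested example in the note above, followed by a
# second one-utterance conversation with an unparsed (empty) USER.
SAMPLE = """\
u1 +++$+++ B +++$+++ B +++$+++ u1 +++$+++ -1 +++$+++ 2011-01-01 01:00:00 +++$+++ 1293843600 +++$+++ original post +++$+++ original post [[User:B|B]]
u2 +++$+++ A +++$+++ B +++$+++ u1 +++$+++ u1 +++$+++ 2011-01-01 01:01:00 +++$+++ 1293843660 +++$+++ first comment +++$+++ :first comment [[User:A|A]]
u3 +++$+++ B +++$+++ B +++$+++ u1 +++$+++ u2 +++$+++ 2011-01-01 01:03:00 +++$+++ 1293843780 +++$+++ comment +++$+++ ::comment [[User:B|B]]
u4 +++$+++ A +++$+++ B +++$+++ u1 +++$+++ u1 +++$+++ 2011-01-01 01:02:00 +++$+++ 1293843720 +++$+++ second comment +++$+++ :second comment [[User:A|A]]

u5 +++$+++  +++$+++ B +++$+++ u5 +++$+++ -1 +++$+++ -1 +++$+++ -1 +++$+++ hello +++$+++ hello
"""

conversations = list(read_conversations(SAMPLE.splitlines()))
reply_map = children_by_parent(conversations[0])
# reply_map == {"u1": ["u2", "u4"], "u2": ["u3"]}
```

To run this on the real file, pass an open file handle to read_conversations instead of SAMPLE.splitlines(). The file's character encoding is not stated in this README, so it has to be chosen by the reader.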
----------

<===> wikipedia.talkpages.admins.txt

List of users in our data with administrator status (http://en.wikipedia.org/wiki/Wikipedia:Administrators) at the time this data was collected. The date when this status was gained through a Request for Adminship election process (http://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship) is indicated in yyyy-mm-dd format (missing dates are indicated with NA).

----------

<===> wikipedia.talkpages.userinfo.txt

Metadata for 26,397 users involved in the conversations. Each line corresponds to a user and has the following format (the field separator is "+++$+++"):

USER +++$+++ EDIT_COUNT +++$+++ GENDER +++$+++ NUMERICAL_ID

where:

    USER          the username of the Wikipedia editor
    EDIT_COUNT    the number of edits this editor contributed to Wikipedia
    GENDER        the self-declared gender of this editor: female/male/unknown
    NUMERICAL_ID  unique Wikipedia numerical id of this editor

Note: metadata is not available for unregistered users (http://en.wikipedia.org/wiki/Wikipedia:Welcome_unregistered_editing).

----------

C) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)

This material is based upon work supported in part by the National Science Foundation under grant IIS-0910664. Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.