The datafiles in these two directories constitute the Slashdot portion of the British Columbia Conversation Corpora (BC3) Blog Corpus, https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3.html. A topic-annotated subset is described as "forthcoming" on that webpage.

The description below of this (sub)corpus is a verbatim copy of Sections 2.1 and 2.2 of

Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq R. Joty. Exploiting conversational features to detect high-quality blog comments. Canadian Conference on AI, pages 122–127, 2011.
http://www.cs.ubc.ca/~carenini/PAPERS/slashcrf_CCAI_FINAL.pdf

Excerpted by Lillian Lee - this README does not occur in the original U. British Columbia data distribution. I do not know why the corpus is split into these two directories; the sizes don't match the description of a train/test split.

...............

2.1 The Slashdot Corpus

We compiled a new corpus comprised of articles and their subsequent user comments from the science and technology news aggregation website Slashdot (http://slashdot.org). This site was chosen for several reasons. Comments on Slashdot are moderated by users of the site, meaning that each comment has a score from -1 to +5 indicating the total score of moderations assigned, with each moderator able to modify the score of a given comment by +1 or -1. Furthermore, each moderation assigns a classification to the comment: for good comments, the classes are: Interesting, Insightful, Informative and Funny. For bad comments, the classes are: Flame-bait, Troll, Off-Topic and Redundant. Since the goal of this work was to identify high-quality comments, most of our experiments were conducted with comments grouped into GOOD and BAD.

Slashdot comments are displayed in a threaded conversation-tree type layout. Users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure.
This conversational structure allows us to use Conversational Features in our classification approach (see Sect. 2.3).

Some comments were not successfully crawled, which meant that some comments in the corpus referred to parent comments which had not been collected. In order to prevent this, comments whose parents were missing were excluded from the corpus. After this cleanup, the collection totalled 425,853 comments on 4320 articles.

2.2 Transformation into Sequences

As mentioned above, Slashdot commenters can reply directly to other comments, forming several tree-like conversations for each article. This creates a problem for our use of Linear-Chain CRFs, which require linear sequences. In order to solve this problem, each conversation tree is transformed into multiple Threads, one for each leaf-comment in the tree. The Thread is the sequence of comments from the root comment to the leaf comment. Each Thread is then treated as a separate sequence by the classifier.

One consequence of this is that any comment with more than one reply will occur multiple times in the training or testing set. This makes some intuitive sense for training, as comments higher in the conversation tree are likely more important to the conversation as a whole, as the earlier a comment appears in the thread the greater effect it has on the course of conversation down-thread. We describe the process of re-merging these comment threads, and investigate the effect this has on accuracy, in Sect. 3.3.
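
...............

The tree-to-Thread transformation of Sect. 2.2 can be illustrated with a small sketch. It is not taken from the paper or from the original data distribution. The function name tree_to_threads and the parent_of mapping (comment id to parent id, with None marking a root comment) are hypothetical and do not reflect the actual file format of the corpus. The sketch assumes every non-root comment's parent is present, which matches the cleanup described in Sect. 2.1.

```python
from collections import defaultdict


def tree_to_threads(parent_of):
    """Return one root-to-leaf Thread per leaf comment.

    parent_of is a hypothetical mapping: comment id -> parent id,
    with None as the parent of a root comment. Every non-None parent
    is assumed to be a key of parent_of (no orphaned comments).
    """
    # Index the replies of each comment so leaves can be recognised.
    children = defaultdict(list)
    for cid, pid in parent_of.items():
        if pid is not None:
            children[pid].append(cid)

    threads = []
    for cid in parent_of:
        if cid in children:  # has at least one reply, so not a leaf
            continue
        # Walk from the leaf up to the root, then reverse the path.
        path = []
        node = cid
        while node is not None:
            path.append(node)
            node = parent_of[node]
        threads.append(path[::-1])
    return threads


# Toy conversation: "a" is the root, "b" and "c" reply to "a", "d" replies to "b".
example = {"a": None, "b": "a", "c": "a", "d": "b"}
threads = tree_to_threads(example)
```

In this toy tree the leaves are "c" and "d", so two Threads are produced, and the root comment "a", which has two replies, appears in both of them. This is the duplication effect that the last paragraph of Sect. 2.2 describes.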