This is version 1.0 of the multimodal British Library dataset to accompany

@inproceedings{hessel2018concreteness,
  title={Quantifying the visual concreteness of words and topics in multimodal datasets},
  author={Hessel, Jack and Mimno, David and Lee, Lillian},
  booktitle={NAACL},
  year={2018}
}

If you encounter any bugs or have feedback, please contact:
jhessel@cs.cornell.edu

This data was derived from the digital books collection released by the
British Library. All data was released to the public domain (CC0 1.0); the
details of the license can be found here:
https://creativecommons.org/publicdomain/mark/1.0/

The original download links can be found here:

The dataset's homepage is: https://data.bl.uk/digbks/
The text of the books is from: https://data.bl.uk/digbks/db14.html
The "plates" images are from: https://data.bl.uk/digbks/db19.html
The "medium" images are from: https://data.bl.uk/digbks/db18.html

@misc{British_Library_Labs_2016,
  title={Digitised Books},
  publisher={British Library},
  author={British Library Labs},
  howpublished={\url{https://data.bl.uk/digbks/}},
  year={2016}
}

If you use this dataset, please consider citing the British Library
Digitised Books dataset and/or the above NAACL paper describing the
preprocessing.

The dataset included here consists of 404765 images, each coupled with the
OCR text of the +/- 3 pages surrounding it, i.e., if an image appeared on
page 17, all OCR text from pages 14, 15, 16, 17, 18, 19, and 20 would be
included.

The following files are included:

- bl_image_release.zip: a zipfile containing 404765 jpg files.
  Many of these images have been resized so that the largest axis is no
  more than 600 pixels (the original aspect ratios are preserved).

- bl_images.tsv.bz2: a tsv file with 404765 lines, where each line
  corresponds to an image with zero indexing, i.e., the information on
  line 10 corresponds to 9.jpg. In total, 16391 volumes are represented.
  The columns are:
    - the first column is the title of the book in which an image
      appeared (used for the book-level holdout described in the NAACL
      paper)
    - the second column is the original path of the image in the British
      Library data release, which, among other things, indicates the year
      of the image and whether it was a "plate" or a "medium" image.

- bl_text.txt.bz2: a text file with 404765 lines containing the OCR text
  of the British Library data. Like bl_images.tsv.bz2, each line
  corresponds to an image with zero indexing, i.e., the information on
  line 10 corresponds to 9.jpg.
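
The line-to-image alignment described above can be sketched in Python as
follows. This is a minimal sketch, not part of the release: the function
names iter_records and page_window are hypothetical, the files are assumed
to be UTF-8 encoded with "\n" line endings, and the location of the jpg
files inside bl_image_release.zip is not assumed (only the N.jpg naming
stated above is used).

```python
import bz2


def page_window(page, radius=3):
    """Return the page numbers whose OCR text accompanies an image on `page`.

    With the default radius of 3, an image on page 17 maps to pages 14..20,
    matching the +/- 3 page window described in this README.
    """
    return list(range(page - radius, page + radius + 1))


def iter_records(tsv_path="bl_images.tsv.bz2", text_path="bl_text.txt.bz2"):
    """Yield (image_name, book_title, original_path, ocr_text) per image.

    Both files are read in lockstep. The zero-based line index gives the
    image name, so the 10th line of each file corresponds to "9.jpg".
    Assumption: both files are UTF-8 text with "\n" line endings.
    """
    # newline="\n" splits lines only on "\n", so stray "\r" characters in
    # the OCR text cannot shift the alignment between the two files.
    with bz2.open(tsv_path, "rt", encoding="utf-8", newline="\n") as tsv_file, \
         bz2.open(text_path, "rt", encoding="utf-8", newline="\n") as text_file:
        for index, (row, text) in enumerate(zip(tsv_file, text_file)):
            # Column 1: book title; column 2: original British Library path.
            title, original_path = row.rstrip("\n").split("\t")[:2]
            yield f"{index}.jpg", title, original_path, text.rstrip("\n")


# Example usage (requires the two .bz2 files in the working directory):
#
#   for image_name, title, original_path, ocr_text in iter_records():
#       print(image_name, title, original_path, len(ocr_text))
#       break
```

Grouping the yielded records by book title is one way to build the
book-level holdout mentioned above, since all images from the same volume
share the same first-column value.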