SoTE: Data Mining

Boston College Attribution

This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!

Free Sources

Below are open access databases and repositories that can be used for text and data mining.

| Resource | Provider | Description |
| --- | --- | --- |
| arXiv | Cornell University | Open access to 1,153,908 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Bulk Access |
| BioMed Central | Springer Science+Business Media | Over 250,000 full-text, peer-reviewed articles are available for text and data mining. |
| Chronicling America: Historical American Newspapers | Library of Congress | Collection of digitized historical newspapers from 1836-1922. OCR batch downloads. |
| | Brigham Young University | Compiled by Prof. Mark Davies, Linguistics, at Brigham Young University. Multiple corpora are available for analysis, for English as well as Spanish and Portuguese. |
| Digital Public Library of America | DPLA | Data is available for bulk download in JSON files. More information about the Database export files |
| Google Books | Google | Ngram Viewer: from 1800 to 2000 |
| Google Books BYU View | Brigham Young University | Created by Prof. Mark Davies, Linguistics, at Brigham Young University, this compares the Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus in NGrams. |
| HathiTrust Digital Library | HathiTrust | Its corpus is available for research purposes. Learn more about this by visiting the HathiTrust Research Center page |
| Internet Archive & Open Library | Internet Archive | Offers over 10,000,000 fully accessible books and texts. Instructions for downloading in bulk |
| MSU Libraries Humanities Data | Michigan State University | Includes but is not limited to digitized and born-digital text, audio, images, moving images, and the metadata that describes them, with particular strength in text and audio data. |
| PLOS | Public Library of Science | Provides access to its peer-reviewed articles. |
| Project Gutenberg | Project Gutenberg | The first producer of free electronic books (ebooks); its catalog includes nearly 30,000 free books and over 100,000 titles. Here is the Project's Terms of Use |
| PubMed Central: Databases and Text Mining Tools | NCBI | Multiple text mining tools to analyze not only scholarly publications but also other types of biomedical resources, such as Electronic Health Records. |
| University of Oxford Text Archive | University of Oxford | A repository of digital literary and linguistic resources for research and teaching in higher education. |
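
Once a text from one of the sources above has been downloaded, a first text mining step is often a simple n-gram count. The following is a minimal sketch only, assuming a plain-text document is already available locally. The `sample` string is a placeholder rather than data from any listed source, and the `tokenize` and `ngram_counts` helpers are hypothetical names introduced for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Placeholder text; in practice this would be read from a downloaded file.
sample = "The quick brown fox jumps over the lazy dog. The quick brown cat sleeps."

tokens = tokenize(sample)
bigrams = ngram_counts(tokens, n=2)

# Show the most frequent two-word sequences.
print(bigrams.most_common(3))
```

The same functions could be applied to a larger corpus by reading each file and summing the resulting counters. Check each provider's terms of use before downloading in bulk.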