This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!
Below are open access databases and repositories which can be used for text and data mining.
Resource | Provider | Description |
---|---|---|
arXiv | Cornell University | Open access to 1,153,908 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Bulk access is available. |
BioMed Central | Springer Science+Business Media | Over 250,000 full-text, peer-reviewed articles are available for text and data mining. |
Chronicling America: Historical American Newspapers | Library of Congress | Collection of digitized historical newspapers from 1836-1922. OCR text is available as batch downloads. |
Corpus.byu.edu | Brigham Young University | Multiple corpora compiled by Prof. Mark Davies (Linguistics, Brigham Young University) are available for analysis in English, Spanish, and Portuguese. |
Digital Public Library of America | DPLA | Data is available for bulk download in JSON files. More information is available about the database export files. |
Google Books | Google | Ngram Viewer: from 1800 to 2000 |
Google Books BYU View | Brigham Young University | Created by Prof. Mark Davies (Linguistics, Brigham Young University), this tool compares n-grams across the Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus. |
HathiTrust Digital Library | HathiTrust | Its corpus is available for research purposes. Learn more on the HathiTrust Research Center page. |
Internet Archive & Open Library | Internet Archive | Offers over 10,000,000 fully accessible books and texts. Instructions for downloading in bulk are available. |
MSU Libraries Humanities Data | Michigan State University | Includes but is not limited to digitized and born digital text, audio, images, moving images, and the metadata that describes them, with particular strength in text and audio data. |
PLOS | Public Library of Science | Provides access to its peer-reviewed articles. |
Project Gutenberg | Project Gutenberg | The first producer of free electronic books (ebooks); its catalog includes nearly 30,000 free books and over 100,000 titles. See the Project's Terms of Use. |
PubMed Central: Databases and Text Mining Tools | NCBI | Multiple text mining tools for analyzing not only scholarly publications but also other types of biomedical resources, such as electronic health records. |
University of Oxford Text Archive | University of Oxford | A repository of digital literary and linguistic resources for research and teaching in higher education. |
© Copyright 2024 National University. All Rights Reserved.