SoTE: Data Mining

Boston College Attribution

This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!



The following is a sampling of projects that use text and data mining methods. Many of these projects also apply other computational and quantitative methods, as well as visualizations.

America's Public Bible (Lincoln Mullen)

  • Explores trends in, and the frequency of, biblical quotations in newspapers mined from Chronicling America: Historic American Newspapers (Library of Congress)

Early Modern Print: Text Mining Early Printed English (Washington University in St. Louis)

  • Introduces and explains the tools and visualizations made possible by the EEBO Text Creation Partnership (EEBO-TCP), which provides XML/SGML-encoded transcriptions of early printed books in Early English Books Online
  • Presents examples with tools and visualizations, such as the N-gram browser and corpus analysis of English print culture before 1700 in EEBO-TCP

Martha Ballard's Diary (Cameron Blevins) 

  • Text mining, analysis, and topic modeling of the diary (1785-1812) of Martha Ballard, who documented daily life as a midwife in Maine

Mining Biodiversity

  • Uses text mining methods to produce semantic metadata and an index, together with visualization, crowdsourcing, and social media, to provide enhanced access to Biodiversity Heritage Library documents

Mining the Dispatch (Digital Scholarship Lab, University of Richmond)

  • Text mines the full run of the Richmond Daily Dispatch from November 1860 to April 1865 and presents topic models based on prominent topics found within the newspaper's articles
  • Charts display each topic's proportion by month across all articles containing that topic; article transcriptions can be viewed with additional topics identified

Robots Reading Vogue (Yale University)

  • Data mining of over 400,000 pages of Vogue magazine, with topic modeling, n-grams, and color analysis applied

Viral Texts (Northeastern University)

  • Explores the reprinting of texts in nineteenth-century United States newspapers and journals, drawing from the Chronicling America: Historic American Newspapers collection at the Library of Congress
  • Develops computational linguistics tools to analyze newspaper content using text mining, data visualization, and other techniques


HathiTrust Research Center - texts and tools for analysis, mining, and visualization

Text Mining the Novel (NovelTM) - Large scale cross-cultural study of the novel using quantitative methods

Uses of Scale in Literary Study - Aims to demonstrate new methodologies, reduce barriers to entry for scholars, and share resources for normalizing large collections of texts


Ben M. Schmidt - text mining and data visualization with a focus on history, politics, and current media and social issues

Image Mining (Miriam Posner) - materials and a post on image and text mining (with B. Schmidt) for a medical history workshop at the National Library of Medicine. Posner also writes on a variety of digital humanities topics and tools

Matthew L. Jockers - exploration of text mining and sentiment analysis with examples and documentation

Ted Underwood - text mining and modeling of eighteenth- and nineteenth-century literary texts

Tidy Topic Modeling (Julia Silge & David Robinson) - explores using tidy text principles to create topic models on works by Dickens, Wells, Verne, and Austen