LibGuides: SoTE: Data Mining: Overview

This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!

What is text and data mining?

Text and data mining (TDM) is the computational analysis of vast quantities of digital information, whether free-form natural language text or structured data.

Using specialized software, researchers can extract data, identify trends, look for patterns and better understand the relationships of terms within and between documents. Analysis might focus on word frequency, words that frequently appear near each other, contextual information for key words, common phrases and other patterns.
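As a rough illustration of the analyses described above, the following minimal Python sketch counts word frequencies, counts adjacent word pairs, and shows a key word in its surrounding context. It uses only the standard library. The function names, the sample sentence, and the simple tokenizer are assumptions made for illustration and are not taken from any particular text mining tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, top=5):
    """Return the most common words with their counts."""
    return Counter(tokens).most_common(top)


def bigram_counts(tokens, top=5):
    """Return the most common pairs of adjacent words."""
    return Counter(zip(tokens, tokens[1:])).most_common(top)


def keyword_in_context(tokens, keyword, window=3):
    """List each occurrence of keyword with up to `window` words on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, used only to exercise the functions.
sample = (
    "The library licenses the database. The library negotiates mining rights "
    "so researchers can mine the database."
)
tokens = tokenize(sample)

print(word_frequencies(tokens, top=3))
print(bigram_counts(tokens, top=3))
for line in keyword_in_context(tokens, "database"):
    print(line)
```

Real projects typically add steps this sketch omits, such as removing very common words and handling punctuation, hyphenation, and spelling variants.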

Materials to be analyzed range from websites (such as publicly available Facebook posts) and 16th-century manuscripts to DNA sequences and old newspapers.

This graphic, constructed using Voyant, shows the frequency of terms in the novel Agnes Grey by Anne Brontë.

Policies for Mining Licensed Content

If you wish to undertake a text or data mining project with content from the Libraries’ licensed databases, please contact a librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, it is possible to actively negotiate text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire NU community.

Please also see our Best Practice Tips for mining licensed databases.

Learn More

Data Mining and Text Analysis
From the Intro to Digital Humanities Libguide (UCLA Center for Digital Humanities)
Text Mining and Scholarly Publishing
(Jonathan Clark, Publishing Research Consortium, 2012)
Text and Data Mining in the Humanities and Social Sciences -- Strategies and Tools (July 29, 2015)
Webinar from the Center for Research Libraries.
Seven Ways Humanists are Using Computers to Understand Text
Glossary of Digital Humanities Terms
Text and Data Mining and Fair Use
This ARL Issue Brief makes the case for TDM as a fair use and recaps recent court cases that support it.