Data Mining

This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!

What is text and data mining?

Text and data mining (TDM) is the computational analysis of vast quantities of digital information, whether free-form natural language text or structured data. 

Using specialized software, researchers can extract data, identify trends, look for patterns and better understand the relationships of terms within and between documents. Analysis might focus on word frequency, words that frequently appear near each other, contextual information for key words, common phrases and other patterns. 
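As a rough illustration of the kinds of analysis described above, the following Python sketch counts word frequencies, counts words that appear near each other, and lists the context around a key word. It is a minimal example rather than a recommended tool: the function names, the window sizes, and the sample sentence are all hypothetical, and real projects typically rely on dedicated software such as Voyant.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of `keyword` with up to `width` words on either side."""
    lines = []
    for i, word in enumerate(tokens):
        if word == keyword:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


# Hypothetical sample text, invented for illustration only.
sample = "The governess taught the children. The children ignored the governess."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(cooccurrences(tokens).most_common(3))
print(keyword_in_context(tokens, "children"))
```

The same three functions could be applied to a full text file by reading its contents into `sample`; the window and width values would likely need tuning for longer documents.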

Materials to be analyzed range from websites (such as publicly available Facebook posts) and 16th-century manuscripts to DNA sequences and old newspapers.

This is a graphic analysis, constructed using Voyant, of the frequency of terms in the novel Agnes Grey by Anne Brontë.

Policies for Mining Licensed Content

If you wish to undertake a text or data mining project with content from the Libraries’ licensed databases, please contact a librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, it is possible to actively negotiate text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire NU community.

Please also see our Best Practice Tips for mining licensed databases.
