SoTE: Data Mining

Boston College Attribution

This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!

Tools & Techniques

The languages, tools, and methods listed below represent a small fraction of the resources available to those interested in text and data mining and analysis. Although they are grouped under specific headings, many are useful for more than one type of task.

Extracting & Scraping

Beautiful Soup - Python library for web scraping; parses HTML and XML so that scripts can extract data from pages automatically and in bulk

R - a language and environment for statistical computing and graphics that enables data manipulation, calculation, and graphical display (available via BC Citrix server)

RegEx (regular expressions) - define and search for patterns in data or text, including find-and-replace operations; also useful for cleaning messy data

Tabula - extract data tables from PDF files

Web Scraper - Chrome browser extension for extracting data from web pages
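
As a small illustration of the RegEx entry above, the sketch below uses Python's built-in re module to search for a pattern and to clean messy text with find-and-replace operations. The sample records, the "b. YYYY" date convention, and the function names birth_year and clean are all made up for this example; real data will usually need different patterns.

```python
import re

# Hypothetical messy records; names and formats are invented for illustration.
records = [
    "Smith,  John   (b. 1901)",
    "DOE, Jane (b.1899)",
    "  O'Neill ,Mary (b. 1923) ",
]

# Search pattern: a four-digit year that follows "b." with optional spaces.
YEAR = re.compile(r"b\.\s*(\d{4})")


def birth_year(record):
    """Return the birth year as an int, or None if no year is found."""
    match = YEAR.search(record)
    return int(match.group(1)) if match else None


def clean(record):
    """Normalize spacing using find-and-replace operations."""
    record = record.strip()
    record = re.sub(r"\s+", " ", record)       # collapse runs of whitespace
    record = re.sub(r"\s*,\s*", ", ", record)  # no space before a comma, one after
    record = re.sub(r"b\.\s*", "b. ", record)  # exactly one space after "b."
    return record


cleaned = [clean(r) for r in records]
years = [birth_year(r) for r in records]

for line, year in zip(cleaned, years):
    print(line, "->", year)
```

The same patterns can generally be reused in other tools that accept regular expressions, although syntax details vary slightly between regex flavors.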


Cleaning & Processing

Lexos - integrated workflow of pre-processing, analysis, and visualization tools for finding and exploring patterns in texts

OpenRefine - tool for working with messy data; clean, transform, reconcile, normalize, and extend data; compatible with expression languages (e.g. GREL, Jython)

Stanford Parser - probabilistic natural language parser

Stanford Part-of-Speech Tagger - assigns parts of speech to words or tokens
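
To give a sense of what normalizing messy data can involve, here is a minimal Python sketch that groups spelling variants of the same name under a shared key. It is only loosely inspired by the kind of key-based normalization that data-cleaning tools perform and is not the implementation used by any tool listed above; the function names fingerprint and group_variants and the sample values are assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalization key; values with the same key are likely variants."""
    # Decompose accented characters, then drop the combining marks.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    value = value.strip().lower()
    # Remove ASCII punctuation, split on whitespace, then sort and de-duplicate.
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


def group_variants(values):
    """Map each key to the list of original values that share it."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


# Hypothetical messy column values.
names = [
    "Boston College",
    "boston college.",
    "College, Boston",
    "Bostón College",
    "Boston University",
]
groups = group_variants(names)

for key, variants in groups.items():
    print(key, "<-", variants)
```

Because the key ignores case, punctuation, accents, and word order, it can also merge values that are genuinely different, so the resulting groups are best reviewed by hand before any values are overwritten.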


Analysis & Visualization 

AntConc - corpus analysis toolkit for text analysis and creation of concordances

Gephi - visualization toolkit for exploring graphs and networks (available in Digital Studio, O'Neill second floor)

Mallet - Java-based toolkit for statistical natural language processing, including tasks such as document classification, clustering, topic modeling, and information extraction

Textexture - visualize texts as a network

Voyant - suite of web-based tools for text reading, analysis, and visualization
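
For readers unfamiliar with concordances, the Python sketch below builds a simple keyword-in-context listing: every occurrence of a word, shown with a few words on either side. It is a rough illustration of the idea only and says nothing about how the tools above work internally; the tokenizer, the function names, and the sample sentence are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[A-Za-z']+", text.lower())


def concordance(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text.
sample = (
    "The whale surfaced. Nobody on deck had seen the whale before, "
    "and the whale was gone again."
)
hits = concordance(sample, "whale", window=2)

for left, word, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12} [{word}] {right}")
```

Dedicated corpus tools add sorting, frequency lists, and support for large collections, which this sketch does not attempt.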

Tool Directories

DH Toychest - extensive list of digital humanities tools, including those for text extraction, analysis, mining, and visualization

DiRT Directory - general directory of digital humanities tools with descriptions and metadata, including fields such as development status, cost, platforms, and categories

TAPoR - directory of tools specifically used for text analysis, retrieval, and visualization


Tutorials

Codecademy - learn to code in Python for data extraction and manipulation

Basic Text Mining in R - tutorial on text mining with R

Programming Historian - learn how to extract, clean, manipulate, and transform data; also includes lessons on topic modeling and text analysis 

Scikit Tutorial - learn how to use scikit-learn to analyze topics within a collection of texts

Text Analysis Tutorial - tutorials on how to use topic models for quantitative text analysis in the humanities and social sciences