This page was copied and adapted from the Boston College Libraries Text & Data Mining Guide under a Creative Commons Attribution 4.0 License. Our thanks to Boston College for developing this excellent resource and sharing it under the license!
The languages, tools, and methods listed below represent a very small portion of resources available to those interested in text and data mining and analysis. Although they are categorized under specific headings, many are not limited to one type of task.
Extracting & Scraping
Beautiful Soup - Python library used for web-scraping
Import.io - browser-based data extractor that can be used for automatic and bulk extraction and for running APIs
R - a language and environment for statistical computing and graphics that enables data manipulation, calculation, and graphical display (available via BC Citrix server)
RegEx - define and search for patterns in data or text using find and replace operations; also useful for cleaning messy data
Tabula - extract data tables from PDF files
Web Scraper - Chrome browser extension for extracting data from web pages
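As a small illustration of the RegEx entry above, the following Python sketch uses only the standard `re` module to search for patterns and to clean text with find-and-replace. The sample text and both patterns are invented for this example. They are deliberately simple and would need tightening for real-world data.

```python
import re

# Hypothetical messy input, invented for this example.
raw = """
Contact:  Jane.Doe@example.org   (updated 2021-03-04)
contact: john_smith@example.com (updated 2019-11-30)
"""

# Search: simple patterns for email-like strings and ISO-style dates.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

emails = EMAIL.findall(raw)   # list of matched email strings
dates = DATE.findall(raw)     # list of (year, month, day) tuples

# Find and replace: collapse runs of spaces or tabs into one space,
# then rewrite each date from YYYY-MM-DD to DD/MM/YYYY.
cleaned = re.sub(r"[ \t]+", " ", raw.strip())
cleaned = DATE.sub(r"\3/\2/\1", cleaned)

print(emails)
print(dates)
print(cleaned)
```

The same search and replace patterns can usually be reused, with minor syntax differences, in text editors and in tools that accept regular expressions.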
Cleaning & Processing
Lexos - integrated workflow of pre-processing, analysis, and visualization tools for finding and exploring patterns in texts
OpenRefine - tool for working with messy data; clean, transform, reconcile, normalize, extend data; compatible with expression languages (e.g. GREL, Jython)
Stanford Parser - probabilistic natural language parser
Stanford Part-of-Speech Tagger - assigns parts of speech to words or tokens
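To give a sense of what normalizing messy data can involve, here is a minimal Python sketch loosely modeled on the key-collision ("fingerprint") idea behind OpenRefine's clustering feature. It is a simplified approximation written for this page, not OpenRefine's own code. The function names `fingerprint` and `cluster` and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so near-duplicate spellings collide."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, trim, and remove ASCII punctuation.
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, and sort so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample values, invented for this example.
names = [
    "Boston College",
    "boston  college.",
    "College, Boston",
    "Harvard University",
    "Bóston College",
]

print(cluster(names))
```

Each resulting group is a set of candidate duplicates for a person to review and merge. The sketch only suggests matches and does not decide which spelling is correct.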
Analysis & Visualization
AntConc - corpus analysis toolkit for text analysis and creation of concordances
Gephi - visualization toolkit for exploring graphs and networks (available in Digital Studio, O'Neill second floor)
Mallet - Java-based toolkit for statistical natural language processing, including tasks such as document classification, clustering, topic modeling, and information extraction
Textexture - visualize texts as a network
Voyant - suite of web-based tools for text reading, analysis, and visualization
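For readers unfamiliar with concordances, which AntConc and similar tools produce, the following Python sketch builds a basic keyword-in-context (KWIC) listing using only the standard library. It is a toy illustration, not how AntConc itself is implemented. The `concordance` function, its `width` parameter, and the sample text are assumptions made for this example.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, match, right context) for each keyword occurrence."""
    # Crude tokenizer: runs of word characters; punctuation is discarded.
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, invented for this example.
sample = (
    "Text mining finds patterns in text. "
    "Cleaning the text first makes mining easier."
)

hits = concordance(sample, "text")
for left, keyword, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25} | {keyword} | {right}")
```

Lining up every occurrence of a word with its neighbors makes recurring phrases and usage patterns easier to spot than reading the text linearly.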
Tool Directories
DH Toychest - extensive list of digital humanities tools, including those for text extraction, analysis, mining, and visualization
DiRT Directory - general directory of digital humanities tools with descriptions and metadata, including fields such as development status, cost, platforms, and categories
TAPoR - directory of tools specifically used for text analysis, retrieval, and visualization
Tutorials
Codecademy - learn to code in Python for data extraction and manipulation
Basic Text Mining in R - tutorial on text mining with R
Programming Historian - learn how to extract, clean, manipulate, and transform data; also includes lessons on topic modeling and text analysis
Scikit Tutorial - learn how to use scikit to analyze topics within a collection of texts
Text Analysis Tutorial - tutorials on how to use topic models for quantitative text analysis in the humanities and social sciences
© Copyright 2024 National University. All Rights Reserved.