
Data Science Ph.D Program

A website for the Data Science students in the Doctorate Program

Data Banks Introduction

Data Banks

Databanks play a key role in data science by providing structured data collections that enable meaningful insights and breakthroughs. Each type of databank—academic, generalist, specialized, open, and proprietary—serves different needs and shapes how data is accessed and applied.

Academic databanks, often used in research, help advance fields by making valuable data widely available for analysis. Generalist databanks support cross-disciplinary projects and encourage open sharing, which helps researchers from various areas build on each other’s work. Specialized databanks cater to specific fields, offering datasets with deeper focus and precision, often tailored to support advancements in areas like healthcare, environmental science, and technology.

Open databanks expand access to data, supporting transparency and the sharing of knowledge on a broader scale, which is essential for collaborative and reproducible research. Proprietary databanks, on the other hand, provide exclusive datasets that drive decision-making in business, finance, and industry, offering insights not easily available elsewhere.

The availability of these databanks in data science means that researchers, developers, and analysts can access high-quality data to test models, verify hypotheses, and uncover patterns. The diversity among databanks enriches the field, allowing data scientists to choose resources that best fit the nature and needs of their projects, ultimately pushing the boundaries of what can be achieved with data.

More Data Banks for your consideration

Data Sets and Licenses

When using datasets for research and analysis, it is essential to understand the bodies that provide dataset licenses and the rules they enforce. These organizations establish guidelines to ensure the ethical and legal use of data. 

When you cite a dataset, include its author, link, and license in your manuscript, for example in the Abstract, Chapter 3, Chapter 4, and the references. Licenses differ in what they permit: some let you use a dataset for research but not download it; others let you use it for research but not publish it, share it, or reuse it in subsequent research. Research and products derived from licensed datasets often cannot be patented, and in other cases they carry specific legal and financial obligations.
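The citation requirement above can be sketched as a small record that refuses to render when the author, link, or license is missing. This is a minimal illustration, not a formal citation style: the class name, field names, output layout, and example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical record holding the details the guide asks for."""
    author: str
    title: str
    link: str
    license: str

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left blank."""
        required = {"author": self.author, "link": self.link, "license": self.license}
        return [name for name, value in required.items() if not value.strip()]

    def render(self) -> str:
        """Render a one-line citation; the layout is illustrative only."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"citation is missing: {', '.join(missing)}")
        return f"{self.author}. {self.title}. {self.link}. License: {self.license}."


# Made-up example values, not a real dataset.
citation = DatasetCitation(
    author="Jane Doe",
    title="Example Survey Dataset",
    link="https://example.org/dataset",
    license="CC BY 4.0",
)
print(citation.render())
```

Keeping one such record per dataset makes it easier to repeat the same author, link, and license consistently in each manuscript section that mentions the data.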

Data Sets and Licenses
Governing Body: Creative Commons (CC)
License Types: CC0, CC BY, CC BY-SA, CC BY-ND, CC BY-NC, CC BY-NC-SA, CC BY-NC-ND
Rules:
Attribution: Users must give appropriate credit, provide a link to the license, and indicate if changes were made.
ShareAlike: Derivative works must be licensed under identical terms.
NonCommercial: Use is limited to non-commercial purposes.
NoDerivs: No derivatives or adaptations of the work are permitted.
Website: Creative Commons

Governing Body: Open Data Commons (ODC)
License Types: Open Database License (ODbL), Attribution License (ODC-By), Public Domain Dedication and License (PDDL)
Rules:
ODbL: Allows use, modification, and sharing of databases while requiring the same open use for derivatives.
ODC-By: Requires attribution for the use of the data.
PDDL: Places data in the public domain, allowing free use without any restrictions.
Website: Open Data Commons

Governing Body: Government Data Licenses
License Types: Specific to each government’s open data policies, often similar to public domain or CC licenses.
Rules:
U.S. Government: Generally, data is released into the public domain, but specific terms may apply depending on the agency.
European Union: EU Open Data Portal follows a similar approach, ensuring data can be freely used, modified, and shared.
Website: Data.gov, EU Open Data Portal

Governing Body: Academic and Research Institutions
License Types: Varies by institution, often using CC licenses or custom agreements.
Rules:
Attribution: Proper citation of the data source.
Usage Restrictions: Terms may specify non-commercial use or restrictions on redistribution.
Ethical Use: Compliance with ethical guidelines for data use, particularly involving human subjects.
Website: Harvard Dataverse, ICPSR

Governing Body: Industry and Commercial Licenses
License Types: Proprietary licenses, often with strict terms and conditions.
Rules:
Commercial Use: Terms may restrict use to non-commercial purposes or require payment for commercial use.
Redistribution: Limits on sharing data with third parties.
Data Protection: Compliance with data protection laws and privacy regulations.
Website: Kaggle, Google Dataset Search
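The Creative Commons codes in the table are built from a few elements (BY, SA, NC, ND), so a code can be read off mechanically. The sketch below maps a code to the rules listed above. The function name and the parsing approach are illustrative assumptions, and the output is a reading aid rather than legal advice.

```python
# Element codes that appear in Creative Commons license names,
# paired with the rule summaries from the table.
ELEMENTS = {
    "BY": "Attribution: give credit, link to the license, indicate changes",
    "SA": "ShareAlike: license derivative works under identical terms",
    "NC": "NonCommercial: non-commercial use only",
    "ND": "NoDerivs: no derivatives or adaptations",
}


def license_rules(code: str) -> list[str]:
    """Translate a code such as 'CC BY-NC-SA' into the rules it carries."""
    normalized = code.upper().replace("(", "").replace(")", "").strip()
    if normalized == "CC0":
        return []  # public-domain dedication: none of the four conditions apply
    tokens = normalized.split()
    if len(tokens) < 2 or tokens[0] != "CC":
        raise ValueError(f"not a Creative Commons code: {code!r}")
    # tokens[1] holds the elements; a trailing version such as '4.0' is ignored.
    parts = tokens[1].split("-")
    unknown = [part for part in parts if part not in ELEMENTS]
    if unknown:
        raise ValueError(f"unknown license element(s): {unknown}")
    return [ELEMENTS[part] for part in parts]


for rule in license_rules("CC BY-NC-SA"):
    print(rule)
```

The sketch only decodes the name. The full license text on the Creative Commons website remains the authoritative source for what a given version permits.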

  1. Academic Databanks: These databanks, such as ICPSR and Academic Torrents, are geared toward research and often contain datasets for fields like social sciences, economics, and health. They support data preservation and sharing among researchers and are frequently accessible through educational institutions.
  2. Generalist Databanks: Databanks like Figshare and Zenodo are open-access and multidisciplinary, allowing researchers from any field to upload, share, and cite datasets. These repositories promote collaboration across disciplines and support open data policies by enabling wide access.
  3. Specialized Databanks: Focused on specific fields, specialized databanks provide highly curated datasets. IEEE Data Port, for example, offers datasets in engineering and technology, while NIH repositories focus on biomedical and clinical data, enhancing the depth and relevance of data in these fields.
  4. Open Databanks: Open databanks, such as Google Dataset Search and Registry of Open Data on AWS, facilitate public access to data by aggregating or directly hosting datasets. These databanks support transparency and open science, allowing users to find and access data without a subscription, although each dataset remains subject to its own license.
  5. Proprietary or Subscription-Based Databanks: Databanks like certain sections of IEEE or NASDAQ Data Link require institutional or paid access, offering exclusive, often high-value datasets for professional or corporate use, particularly in fields like finance, engineering, and business analytics.

Data Banks and Data Resources

Most Common Dataset Sources:

UCI: https://archive.ics.uci.edu/datasets

Kaggle: https://www.kaggle.com/datasets

OpenML: https://www.openml.org/search?type=data&sort=runs&status=active

Google Dataset Search

Meta search engine for datasets from journals, government agencies, and publishers.

Allen Institute for AI (AI2)

NLP, commonsense reasoning, and question answering datasets (e.g., SciFact and ARC, the AI2 Reasoning Challenge).

Academic Torrents
A distributed system for sharing enormous datasets, fostering collaboration, and facilitating access to academic data and research materials.

BMIC Home
A collection of repositories supported by NIH, aimed at promoting data sharing and advancing biomedical research.

Data Asset eXchange
A platform providing access to curated datasets designed to help developers and data scientists build AI models and applications.

Data Excellence. Research Impact.
Provides access to a vast archive of social science data for research and instruction, supporting data preservation and sharing.

Datasets
A collection of ready-to-use datasets for machine learning and data science, covering a wide range of applications.

Dataset Search
A tool that enables users to find datasets stored across the web, making data discovery easy and comprehensive.

DBpedia
A crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.

Dryad
An open-source repository for research data, providing a platform for researchers to publish and share datasets across various scientific disciplines.

European Data
Provides access to a wide range of data from EU institutions and bodies, supporting transparency and enabling data reuse.

Figshare
A repository where users can make all their research outputs available in a citable, shareable, and discoverable manner.

A Global Clinical Research Data Sharing Platform
An organization dedicated to sharing clinical research data globally, promoting transparency and collaboration in medical research.

The Global Health Observatory
Provides access to health-related data, supporting global health research and policymaking.

Google Public Data Explorer
Allows users to explore large public-interest datasets, visualize the data, and generate interactive charts and maps.

Harvard Dataverse
An open-source repository for sharing, citing, and preserving research data across all scientific disciplines.

Hugging Face Datasets

Datasets for NLP, vision, audio, and multimodal tasks, including benchmarks such as SQuAD, IMDB, and Common Voice.

IEEE Data Port
A valuable resource for researchers, offering a repository for datasets in a variety of technical fields, enhancing data sharing and collaboration.

List of Datasets for Machine-Learning Research
This Wikipedia page provides a comprehensive list of datasets widely used in machine learning research, including descriptions and links to datasets across various domains such as computer vision, natural language processing, and more.

Nasdaq Data Link
Provides financial and economic data, offering a comprehensive resource for market researchers and financial analysts.

NIH-Supported Data Sharing Resources: Domain Specific Repositories
Lists repositories specific to certain domains supported by NIH, facilitating data sharing and preservation within specialized fields.

NIH-Supported Data Sharing Resources: Generalist Repositories
Provides a list of generalist repositories supported by NIH, designed for broad data sharing and accessibility across disciplines.

NAIRR Pilot

The National Artificial Intelligence Research Resource (NAIRR) Pilot aims to connect U.S. researchers and educators to the computational, data, and training resources needed to advance AI research and research that employs AI. Federal agencies are collaborating with government-supported and non-governmental partners to implement the Pilot as a preparatory step toward an eventual full NAIRR implementation.

NASA Open Data portal

OSF
A platform to support researchers in managing their projects, sharing data, and collaborating openly with the global research community.

Our Data
Provides access to datasets used in FiveThirtyEight's data journalism articles, covering a wide range of topics including politics, sports, and science.

The Qualitative Data Repository
Provides a repository for storing and sharing qualitative data, supporting researchers in the social sciences.

Zenodo (Recent Uploads)
An open-access repository developed by CERN, enabling researchers to share and preserve data and publications.

Registry of Open Data on AWS
Hosts a variety of public datasets, making it easier to find, access, and use open data in the AWS cloud.

Research Process: Datasets
Provides a curated collection of datasets available through the National University Library, supporting research across various disciplines with access to high-quality, reliable data sources.

Share Your Research Data
An open-access data repository that enables researchers to make their data discoverable, shareable, and citable.

United States Census Bureau
The U.S. Census Bureau provides a vast array of demographic, economic, and social data about the United States, supporting research and policymaking across multiple sectors.

World Bank Open Data

Economic, health, and development indicators for over 200 countries.
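Indicators like these are typically retrieved programmatically. The sketch below only builds a request URL and makes no network call. It assumes the World Bank Indicators API v2 URL pattern and uses SP.POP.TOTL (total population) as an example; the helper name and parameter choices are illustrative, and the pattern should be checked against the current API documentation before use.

```python
from urllib.parse import urlencode

# Assumed base of the World Bank Indicators API (v2).
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country: str, indicator: str, start_year: int, end_year: int) -> str:
    """Build a JSON request URL for one indicator, one country, and a year range."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": 100},
        safe=":",  # keep the year range readable instead of percent-encoding the colon
    )
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


# Example: total population for Brazil, 2015 through 2020.
print(indicator_url("BR", "SP.POP.TOTL", 2015, 2020))
```

The resulting URL can be passed to any HTTP client. As with every source on this page, record the license and citation details that accompany the data you download.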

Computer Vision and Audio Data Sources

ImageNet

https://www.image-net.org

Massive database of labeled images used in deep learning.

COCO (Common Objects in Context)

https://cocodataset.org

Widely used for object detection, segmentation, and captioning tasks.

Open Images Dataset

https://storage.googleapis.com/openimages/web/index.html

Annotated image dataset released by Google, used for large-scale image recognition.

LibriSpeech

http://www.openslr.org/12

ASR benchmark dataset built from audiobooks.

Common Voice by Mozilla

https://commonvoice.mozilla.org

Open-source voice dataset in multiple languages.

 

HathiTrust Research Center

The HathiTrust Research Center has expanded its services to support computational research on the entire HathiTrust collection, one of the world’s largest digital libraries. The collection includes over 14 million digitized volumes, among them more than 7 million books, 725,000 US federal government documents, and 350,000 serial publications. Previously, the Center supported analysis of only the public domain subset of the collection. Researchers can now explore the entire collection and run an algorithm against all 14 million volumes. The change was piloted in 2016, with wider availability expected in 2017.

https://htrc.atlassian.net/wiki/spaces/COM/overview?mode=global