Data Science Ph.D. Program

A website for the Data Science students in the Doctorate Program

Program Essentials

Organize your references using a recommended tool, such as RefWorks, Mendeley, Zotero, or EndNote. Share your reference storage folder with both your chair and SME (subject matter expert). This ensures they can access your sources and provide informed feedback on your literature review. 

Effectively storing and organizing references is crucial for any research project, including a data science dissertation. Proper reference management ensures all sources are easily accessible, correctly cited, and systematically organized. Here are some common methods for storing references and their advantages and disadvantages. 

Reference Management Software (EndNote, Zotero, Mendeley, RefWorks) 

Advantages: 

- Automated citation and bibliography generation. 

- Easy organization with tagging, folders, and search functions. 

- Integration with word processing software for seamless writing and citing. 

- Ability to import references directly from academic databases. 

Disadvantages: 

- Learning curve to master software features. 

- Potential for software costs, although some tools offer free versions. 

- Risk of data loss if not regularly backed up or synced. 

Spreadsheets (Excel, Google Sheets) 

Advantages: 

- Customizable format to suit personal preferences and specific needs. 

- Free to use and widely accessible. 

- Easy to sort and filter references based on various criteria. 

Disadvantages: 

- Time-consuming manual entry and updating process. 

- Limited advanced features for citation and bibliography management. 

- Higher potential for human error in data entry. 

Manual Filing Systems (Physical folders, digital folders) 

Advantages: 

- No need to learn new software tools. 

- Can be organized in a way that makes sense to the researcher. 

- Useful for managing hard-copy articles and physical books. 

Disadvantages: 

- Difficulty in quickly searching and retrieving specific references. 

- Lack of integration with word processing software for automatic citations. 

- Takes up physical space and can become unwieldy with many references. 
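As a small illustration of the spreadsheet method, references kept as plain rows (as they would be in Excel or Google Sheets) can also be sorted and filtered programmatically. The entries below are invented for illustration only:

```python
# References stored as flat rows, mirroring a spreadsheet layout.
# All author names, years, and titles here are hypothetical examples.
references = [
    {"author": "Smith", "year": 2021, "title": "Deep Learning Methods", "tag": "ml"},
    {"author": "Jones", "year": 2019, "title": "Survey Design", "tag": "methods"},
    {"author": "Lee", "year": 2023, "title": "Transformer Models", "tag": "ml"},
]

# Sort by year, newest first -- the "easy to sort and filter" advantage.
by_year = sorted(references, key=lambda r: r["year"], reverse=True)

# Filter down to a single topic tag.
ml_refs = [r for r in references if r["tag"] == "ml"]

for ref in by_year:
    print(f'{ref["author"]} ({ref["year"]}). {ref["title"]}.')
```

This is the same sorting and filtering a spreadsheet offers, but scripted, which becomes useful once the reference list grows past a few dozen entries.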


Your Options

EndNote: https://www.endnote.com/ 

Zotero: https://www.zotero.org/ 

Mendeley: https://www.mendeley.com/ 

RefWorks: https://refworks.proquest.com/ 

Google Drive: https://drive.google.com/ - for a simple folder-based approach 

Microsoft OneDrive: https://onedrive.live.com/ - for a simple folder-based approach 

Data Science Research

Data science research differs from other fields in its focus on data-driven methods to solve problems and uncover patterns. Unlike traditional research, which often relies on controlled experiments or theoretical models, data science uses large datasets, algorithms, and statistical tools to generate insights. The field is highly interdisciplinary, combining elements of computer science, mathematics, and domain-specific knowledge to analyze and predict real-world outcomes. Data science also emphasizes practical applications, aiming to create models that can be directly used for decision-making in the healthcare, finance, and technology industries. Its rapid pace of change, driven by advances in machine learning and AI, also sets it apart, making adaptability and continuous learning essential for researchers.

This page ensures that you have all the required material for your study and dissertation in one place. Note that several common documents and templates are inappropriate for our studies. Read the information provided carefully.

Data Collection versus Data Acquisition

If you are acquiring data, one of the most important considerations is identifying an appropriate dataset. Deciding whether to collect new data or acquire existing datasets is a critical step in the research process. Each approach has advantages and challenges that must be carefully weighed against the research objectives, available resources, and the nature of the study. 

Data Collection 

Data collection involves gathering raw data firsthand through surveys, experiments, interviews, and observations. This approach allows researchers to tailor the data to their research questions and objectives, ensuring it is relevant and directly applicable. One of the primary advantages of data collection is the control it gives researchers over the quality and scope of the data (Creswell & Creswell, 2017). Researchers can design their data collection processes to minimize biases and errors, thereby enhancing the validity and reliability of the results. 

However, data collection can be time-consuming and resource-intensive. It often requires substantial planning, coordination, and financial investment, especially for large-scale studies or complex experimental designs. Additionally, ethical considerations such as obtaining informed consent and ensuring participant confidentiality must be meticulously managed (Patten & Newhart, 2017). 

Data Acquisition 

Data acquisition involves obtaining datasets that have already been collected and preprocessed by other researchers, organizations, or institutions. This approach can significantly reduce the time and cost associated with data gathering, allowing researchers to focus more on data analysis and interpretation (Kitchin, 2014). Existing datasets can often be sourced from repositories, government databases, academic institutions, or commercial entities and are particularly useful when large-scale data is required or the research timeline is constrained. 

The main challenge with data acquisition is ensuring the relevance and suitability of the dataset for specific research questions. Researchers must critically evaluate the quality of the data, including how it was collected and processed and any inherent biases (Waller & Fawcett, 2013). Additionally, pre-existing data often comes with restrictions on how it can be used, shared, or published, necessitating careful adherence to licensing agreements and ethical guidelines. 
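As one concrete example of evaluating an acquired dataset, a quick missing-value audit can flag quality problems before analysis begins. This is a minimal sketch with invented records, not a complete quality-assessment workflow:

```python
# Hypothetical acquired records; a real study would load the actual dataset.
records = [
    {"age": 34, "income": 52000, "region": "west"},
    {"age": None, "income": 61000, "region": "east"},
    {"age": 29, "income": None, "region": "south"},
]

def missing_rates(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(1 for r in rows if r.get(col) is None) / len(rows)
        for col in columns
    }

print(missing_rates(records))
```

High missing-value rates in a key variable are exactly the kind of inherent limitation the paragraph above warns about, and they should be documented before committing to a dataset.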



CITI Training 

Researchers must complete the most current online training through the Collaborative Institutional Training Initiative (CITI) program. To review the requirements for CITI training, please visit the CITI Training page in the library. The CITI program provides comprehensive training in research ethics, compliance, and best practices, essential for conducting ethical and responsible research. This training is crucial in preparing for your dissertation and ensuring that your study meets the highest ethical standards. 

The CITI program covers various topics, including human subjects research, data management, and ethical considerations. Completing this training ensures you are well-versed in the ethical guidelines and regulatory requirements governing research. This knowledge is vital for protecting the rights and welfare of research participants and maintaining the integrity of your study. 

You should complete the CITI training before the end of this course. This timeline ensures that you are prepared to engage ethically with your research participants and handle data responsibly from the very beginning of your dissertation process. Early completion of the training also aligns with the pre-approval process for the Institutional Review Board (IRB), allowing you to avoid delays in securing IRB approval. 

Importance of Early IRB Approval for Data Science Students

For data science studies, particularly those involving sensitive data or human subjects, securing IRB approval early in the dissertation process is imperative. At National University, the IRB pre-approval process should be completed by the end of the first course in your dissertation sequence. Early approval helps prevent delays and unnecessary changes to your research plans. Delays in securing IRB approval can significantly impact your research timeline, potentially leading to revisions that could have been avoided with early consultation. 

The IRB approval process for our program should be completed during the FIRST DIS Course.

Please visit IRB HOME in NU Library for a complete explanation of the steps you must take.

Start here: Institutional Review Board (IRB): Get Started with IRB. Start early.


Important Note: Special IRB Processes from Organizations like the CDC 

In addition to the standard IRB process, certain research studies may require additional approval from specialized IRBs, such as those managed by the Centers for Disease Control and Prevention (CDC). These special IRBs are necessary when research involves high-risk populations, sensitive health data, or specific regulatory requirements. 

CDC IRB Process

  • Specialized Protocols: Research involving public health data, infectious diseases, or high-risk populations often requires the CDC's IRB approval. This board has additional expertise in handling complex health-related ethical issues.
  • Stringent Requirements: The CDC IRB imposes stringent requirements to ensure the highest ethical standards. Researchers must provide comprehensive details about their study design, data management, and participant protection measures.
  • Coordination with Institutional IRBs: Often, researchers must coordinate between their institutional IRB and the CDC IRB. This dual approval process ensures that both local and federal regulations are met.
  • Continuous Monitoring: The CDC IRB may require ongoing monitoring and reporting throughout the research project to ensure compliance with ethical standards and address any emerging issues promptly. 

Templates for Data Science Students

The templates are divided by CMP course and DIS sequence. Each is accompanied by an explanation document (Guidelines). Scroll down to reach the files.

Preparing and Sharing Your GitHub or Google Colab Folder

Utilizing cloud storage for code in a PhD study offers several advantages, particularly for projects that necessitate collaboration, scalability, and secure data management. Storing code in the cloud enables seamless access across multiple devices, ensuring that PhD candidates, advisors, and collaborators can access the latest versions of the code from anywhere with an internet connection. This accessibility supports more flexible workflows and reduces dependency on a single device or location, which can be especially useful during fieldwork, travel, or unexpected equipment failures.

Cloud storage also enhances security and version control. Cloud-hosted platforms like GitHub and GitLab track changes made to the code over time, making it easy to revert to previous versions if errors occur or to review how the code has evolved. This also enables transparent documentation of the research process, which is crucial for reproducibility and accountability in academic research.
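One lightweight habit that supports this kind of version tracking is recording a checksum of each data file alongside the code, so collaborators can confirm they are working with the exact version that was analyzed. A minimal sketch (the file name below is hypothetical):

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a small temporary file standing in for a dataset.
demo = Path("survey_responses.csv")
demo.write_text("id,score\n1,4\n2,5\n")
print(file_checksum(demo))
demo.unlink()  # clean up the demo file
```

Committing the digest to the repository (for example in a README or a manifest file) pins the dataset version without committing the data itself, which matters when licensing forbids redistribution.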

Additionally, cloud storage offers scalable resources for running code on larger datasets or more complex models, which may be limited on local machines. Many cloud platforms integrate with powerful computing resources, enabling PhD researchers to leverage them as needed without having to purchase expensive hardware.

Google Colab 

Google Colab. (n.d.). Introduction to Google Colab. Retrieved from https://colab.research.google.com/ 

Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units). Colab is especially well suited to machine learning, data science, and education. 
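For instance, a notebook that should behave differently on Colab (say, mounting Google Drive only there) can detect the Colab runtime. Checking `sys.modules` for the `google.colab` package is a common idiom for this; a minimal sketch:

```python
import sys

def in_colab() -> bool:
    """Return True when the code is executing in a Google Colab runtime."""
    # Colab preloads the google.colab package into every notebook session,
    # so its presence in sys.modules distinguishes Colab from a local machine.
    return "google.colab" in sys.modules

print("Running in Colab:", in_colab())
```

A guard like this lets the same notebook run unchanged both in Colab and on a local machine, which helps keep the analysis portable across the environments described above.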

GitHub Guides  

GitHub. (n.d.). GitHub Tutorial: Getting Started with GitHub. Retrieved from GitHub Guides 

Git started on your first repository in the third installment of GitHub for Beginners. Discover the essential features and settings to manage your projects effectively. 

Posit 

Posit PBC. (n.d.). Posit Cloud Documentation. Retrieved from https://docs.posit.co/cloud/ 

Posit makes it easy to deploy open-source data science work across the enterprise safely and securely. Share Jupyter notebooks, Plotly dashboards, or interactive applications built with popular R and Python frameworks. You may want to review the documentation on setting up a Posit Cloud Account. 

RPubs 

RPubs is a publishing platform created by RStudio, designed to make sharing R Markdown documents simple and accessible. It allows data scientists, statisticians, and researchers to publish their R analyses, visualizations, and reports online in an easy-to-view, shareable format. Users can generate HTML reports directly from R Markdown and upload them to RPubs with a few clicks, making it ideal for sharing data science projects, tutorials, or reproducible research with peers and the public. This open access to R Markdown content promotes knowledge sharing, collaborative learning, and greater transparency in data-driven research, supporting the growth of the R and data science communities.

The Importance of Replicability, Reproducibility, and Generalizability in Science

Replicability refers to the ability of a study to be repeated with the same methodology and produce the same results. It is fundamental to the scientific method because it validates the reliability and consistency of research findings. When results are replicable, it builds confidence in the study's methods and conclusions (Goodman et al., 2016). 

Reproducibility involves achieving the same results using the original data and analysis code. It ensures the analysis is accurate and error-free and confirms the integrity of the computational procedures used (Peng, 2011). 
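A minimal illustration of this idea: fixing the random seed makes a stochastic analysis, here a toy bootstrap estimate, return identical numbers on every run, so others can reproduce the results exactly from the same code:

```python
import random

def bootstrap_mean(data, n_resamples, seed):
    """Mean of bootstrap-resampled means, deterministic for a fixed seed."""
    rng = random.Random(seed)  # local generator: no hidden global state
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)

data = [2.1, 3.5, 4.0, 2.8, 3.3]
run1 = bootstrap_mean(data, n_resamples=200, seed=42)
run2 = bootstrap_mean(data, n_resamples=200, seed=42)
print(run1 == run2)  # identical results across runs
```

Reporting the seed (and library versions) alongside the analysis code is a small step that makes this kind of exact reproduction possible for reviewers.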

Generalizability refers to the extent to which a study's findings can be applied to broader populations or different contexts. This is crucial for determining the relevance and applicability of research outcomes to real-world settings (Shadish et al., 2002). 

Achieving Replicability, Reproducibility, and Generalizability in Data Science and AI

In data science and AI, achieving these three aspects is essential for advancing knowledge and technology.  

  • Clear Documentation: Researchers must document their methodologies, including the data collection process, preprocessing steps, algorithms used, and parameter settings (Stodden, 2010).

  • Open Data and Code: Sharing data and code publicly enables other researchers to replicate the study. Platforms like GitHub and repositories like Zenodo facilitate this sharing (Sandve et al., 2013). 

  • Version Control: Version control systems like Git ensure that the exact versions of code and data used in the analysis are preserved and accessible (Ram, 2013).

  • Computational Environments: Containerization tools like Docker can encapsulate the computational environment, ensuring the code runs consistently across different systems (Boettiger, 2015). 

  • Diverse Datasets: Using varied datasets during model training can help ensure that the findings are not limited to specific data characteristics and can be applied to broader contexts (Bengio, 2012). 

  • Cross-Validation: Techniques like cross-validation help test the model on different data subsets, enhancing the robustness and generalizability of the results (Kohavi, 1995). 
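The cross-validation idea above can be sketched without any external libraries. Here the "model" is simply the training-set mean, standing in for a real learner; the structure of the fold loop is the same with any model:

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k folds over n items."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i not in test]
        yield train, test
        start += size

def cross_val_error(y, k=5):
    """Mean absolute error of predicting the training mean, over k folds."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)
        errors.extend(abs(y[i] - prediction) for i in test)
    return sum(errors) / len(errors)

y = [3.0, 1.5, 2.2, 4.1, 2.9, 3.6, 1.8, 2.4]
print(cross_val_error(y, k=4))
```

Because every observation serves as test data exactly once, the resulting error estimate is less sensitive to any single lucky or unlucky split, which is the robustness benefit cited above. In practice, library implementations (e.g., scikit-learn's `KFold`) add shuffling and stratification on top of this basic scheme.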

Best Practices

Replicability
Do's: Document all steps clearly; share data and code.
Don'ts: Keep methodologies vague; use proprietary data without sharing.

Reproducibility
Do's: Use version control; utilize containerization tools.
Don'ts: Rely on local environments only; ignore code dependencies.

Generalizability
Do's: Train on diverse datasets; use cross-validation.
Don'ts: Overfit to specific data; ignore external validity testing.

Data Documentation
Do's: Provide metadata for datasets; describe preprocessing.
Don'ts: Omit data cleaning steps; use undocumented datasets.

Algorithm Transparency
Do's: Explain algorithm choices; share parameter settings.
Don'ts: Use "black-box" approaches; hide model configurations.

Open Science
Do's: Publish in open-access journals; participate in peer review.
Don'ts: Restrict access to results; avoid peer scrutiny.

CMP 9701 v3 - Pre Prospectus and Guidelines for Data Science Students Documents

DIS Sequence - Dissertation Template and Guidelines for Data Science Students in the Dissertation Courses
