Data Science Ph.D. Program

A website for the Data Science students in the Doctorate Program

Program Essentials

Organize your references using a recommended tool, such as RefWorks, Mendeley, Zotero, or EndNote. Share your reference storage folder with both your chair and SME (subject matter expert). This ensures they can access your sources and provide informed feedback on your literature review. 

Effectively storing and organizing references is crucial for any research project, including a data science dissertation. Proper reference management ensures all sources are easily accessible, correctly cited, and systematically organized. Here are some common methods for storing references and their advantages and disadvantages. 

Reference Management Software (EndNote, Zotero, Mendeley, RefWorks) 

Advantages: 

- Automated citation and bibliography generation. 

- Easy organization with tagging, folders, and search functions. 

- Integration with word processing software for seamless writing and citing. 

- Ability to import references directly from academic databases. 

Disadvantages: 

- Learning curve to master software features. 

- Potential for software costs, although some tools offer free versions. 

- Risk of data loss if not regularly backed up or synced. 

Spreadsheets (Excel, Google Sheets) 

Advantages: 

- Customizable format to suit personal preferences and specific needs. 

- Free to use and widely accessible. 

- Easy to sort and filter references based on various criteria. 

Disadvantages: 

- Time-consuming manual entry and updating process. 

- Limited advanced features for citation and bibliography management. 

- Higher potential for human error in data entry. 

Manual Filing Systems (Physical folders, digital folders) 

Advantages: 

- No need to learn new software tools. 

- Can be organized in a way that makes sense to the researcher. 

- Useful for managing hard-copy articles and physical books. 

Disadvantages: 

- Difficulty in quickly searching and retrieving specific references. 

- Lack of integration with word processing software for automatic citations. 

- Takes up physical space and can become unwieldy with many references. 
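As a small illustration of the spreadsheet method, references kept as plain rows (as they would be in Excel or Google Sheets) can also be sorted and filtered programmatically. The entries below are invented for illustration only:

```python
# References stored as flat rows, mirroring a spreadsheet layout.
# All author names, years, and titles here are hypothetical examples.
references = [
    {"author": "Smith", "year": 2021, "title": "Deep Learning Methods", "tag": "ml"},
    {"author": "Jones", "year": 2019, "title": "Survey Design", "tag": "methods"},
    {"author": "Lee", "year": 2023, "title": "Transformer Models", "tag": "ml"},
]

# Sort by year, newest first -- the "easy to sort and filter" advantage.
by_year = sorted(references, key=lambda r: r["year"], reverse=True)

# Filter down to a single topic tag.
ml_refs = [r for r in references if r["tag"] == "ml"]

for ref in by_year:
    print(f'{ref["author"]} ({ref["year"]}). {ref["title"]}.')
```

This is the same sorting and filtering a spreadsheet offers, but scripted, which becomes useful once the reference list grows past a few dozen entries.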


Your Options

EndNote: https://www.endnote.com/ 

Zotero: https://www.zotero.org/ 

Mendeley: https://www.mendeley.com/ 

RefWorks: https://refworks.proquest.com/ 

Google Drive: https://drive.google.com/ - for a simple folder-based approach 

Microsoft OneDrive: https://onedrive.live.com/ - for a simple folder-based approach 

Data Science Research

Data science research differs from other fields in its focus on data-driven methods to solve problems and uncover patterns. Unlike traditional research, which often relies on controlled experiments or theoretical models, data science uses large datasets, algorithms, and statistical tools to generate insights. The field is highly interdisciplinary, combining elements of computer science, mathematics, and domain-specific knowledge to analyze and predict real-world outcomes. Data science also emphasizes practical applications, aiming to create models that can be directly used for decision-making in the healthcare, finance, and technology industries. Its rapid pace of change, driven by advances in machine learning and AI, also sets it apart, making adaptability and continuous learning essential for researchers.

This page ensures that you have all the required material for your study and dissertation in one place. Note that several common documents and templates are inappropriate for our studies. Read the information provided carefully.

Data Collection versus Data Acquisition

If you are acquiring data, one of the most important considerations is identifying an appropriate dataset. Deciding whether to collect new data or acquire existing datasets is a critical step in the research process. Each approach has advantages and challenges that must be carefully weighed against the research objectives, available resources, and the nature of the study. 

Data Collection 

Data collection involves gathering raw data firsthand through surveys, experiments, interviews, and observations. This approach allows researchers to tailor the data to their research questions and objectives, ensuring it is relevant and directly applicable. One of the primary advantages of data collection is the control it gives researchers over the quality and scope of the data (Creswell & Creswell, 2017). Researchers can design their data collection processes to minimize biases and errors, thereby enhancing the validity and reliability of the results. 

However, data collection can be time-consuming and resource-intensive. It often requires substantial planning, coordination, and financial investment, especially for large-scale studies or complex experimental designs. Additionally, ethical considerations such as obtaining informed consent and ensuring participant confidentiality must be meticulously managed (Patten & Newhart, 2017). 

Data Acquisition 

Data acquisition involves obtaining datasets that have already been collected and preprocessed by other researchers, organizations, or institutions. This approach can significantly reduce the time and cost associated with data gathering, allowing researchers to focus more on data analysis and interpretation (Kitchin, 2014). Existing datasets can often be sourced from repositories, government databases, academic institutions, or commercial entities and are particularly useful when large-scale data is required or the research timeline is constrained. 

The main challenge with data acquisition is ensuring the relevance and suitability of the dataset for specific research questions. Researchers must critically evaluate the quality of the data, including how it was collected and processed and any inherent biases (Waller & Fawcett, 2013). Additionally, pre-existing data often comes with restrictions on how it can be used, shared, or published, necessitating careful adherence to licensing agreements and ethical guidelines. 
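As one concrete example of evaluating an acquired dataset, a quick missing-value audit can flag quality problems before analysis begins. This is a minimal sketch with invented records, not a complete quality-assessment workflow:

```python
# Hypothetical acquired records; a real study would load the actual dataset.
records = [
    {"age": 34, "income": 52000, "region": "west"},
    {"age": None, "income": 61000, "region": "east"},
    {"age": 29, "income": None, "region": "south"},
]

def missing_rates(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(1 for r in rows if r.get(col) is None) / len(rows)
        for col in columns
    }

print(missing_rates(records))
```

High missing-value rates in a key variable are exactly the kind of inherent limitation the paragraph above warns about, and they should be documented before committing to a dataset.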



CITI Training 

Researchers must complete the most current online training through the Collaborative Institutional Training Initiative (CITI) program. To review the requirements for CITI training, please visit the CITI Training page in the library. The CITI program provides comprehensive training in research ethics, compliance, and best practices, essential for conducting ethical and responsible research. This training is crucial in preparing for your dissertation and ensuring that your study meets the highest ethical standards. 

The CITI program covers various topics, including human subjects research, data management, and ethical considerations. Completing this training ensures you are well-versed in the ethical guidelines and regulatory requirements governing research. This knowledge is vital for protecting the rights and welfare of research participants and maintaining the integrity of your study. 

You should complete the CITI training before the end of this course. This timeline ensures that you are prepared to engage ethically with your research participants and handle data responsibly from the very beginning of your dissertation process. Early completion of the training also aligns with the pre-approval process for the Institutional Review Board (IRB), allowing you to avoid delays in securing IRB approval. 

Importance of Early IRB Approval for Data Science Students

For data science studies, particularly those involving sensitive data or human subjects, securing IRB approval early in the dissertation process is imperative. At National University, the IRB pre-approval process should be completed by the end of the first course in your dissertation sequence. Early approval helps prevent delays and unnecessary changes to your research plans. Delays in securing IRB approval can significantly impact your research timeline, potentially leading to revisions that could have been avoided with early consultation. 

The IRB approval process for our program should be completed during the FIRST DIS Course.

Please visit IRB HOME in NU Library for a complete explanation of the steps you must take.

Start here: Institutional Review Board (IRB): Get Started with IRB. Start early.


Important Note: Special IRB Processes from Organizations like the CDC 

In addition to the standard IRB process, certain research studies may require additional approval from specialized IRBs, such as those managed by the Centers for Disease Control and Prevention (CDC). These special IRBs are necessary when research involves high-risk populations, sensitive health data, or specific regulatory requirements. 

CDC IRB Process

  • Specialized Protocols: Research involving public health data, infectious diseases, or high-risk populations often requires the CDC's IRB approval. This board has additional expertise in handling complex health-related ethical issues.
  • Stringent Requirements: The CDC IRB imposes stringent requirements to ensure the highest ethical standards. Researchers must provide comprehensive details about their study design, data management, and participant protection measures.
  • Coordination with Institutional IRBs: Often, researchers must coordinate between their institutional IRB and the CDC IRB. This dual approval process ensures that both local and federal regulations are met.
  • Continuous Monitoring: The CDC IRB may require ongoing monitoring and reporting throughout the research project to ensure compliance with ethical standards and address any emerging issues promptly. 

Templates for Data Science Students

The templates are divided by CMP course and DIS sequence. Each is accompanied by an explanation document (Guidelines). Scroll down to reach the files.

Preparing and Sharing Your GitHub or Google Colab Folder

Utilizing cloud storage for code in a PhD study offers several advantages, particularly for projects that necessitate collaboration, scalability, and secure data management. Storing code in the cloud enables seamless access across multiple devices, ensuring that PhD candidates, advisors, and collaborators can access the latest versions of the code from anywhere with an internet connection. This accessibility supports more flexible workflows and reduces dependency on a single device or location, which can be especially useful during fieldwork, travel, or unexpected equipment failures.

Cloud storage also enhances security and version control. Cloud-hosted platforms like GitHub and GitLab track changes made to the code over time, making it easy to revert to previous versions if errors occur or to review how the code has evolved. This also enables transparent documentation of the research process, which is crucial for reproducibility and accountability in academic research.
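One lightweight habit that supports this kind of version tracking is recording a checksum of each data file alongside the code, so collaborators can confirm they are working with the exact version that was analyzed. A minimal sketch (the file name below is hypothetical):

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a small temporary file standing in for a dataset.
demo = Path("survey_responses.csv")
demo.write_text("id,score\n1,4\n2,5\n")
print(file_checksum(demo))
demo.unlink()  # clean up the demo file
```

Committing the digest to the repository (for example in a README or a manifest file) pins the dataset version without committing the data itself, which matters when licensing forbids redistribution.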

Additionally, cloud storage offers scalable resources for running code on larger datasets or more complex models, which may be limited on local machines. Many cloud platforms integrate with powerful computing resources, enabling PhD researchers to leverage them as needed without having to purchase expensive hardware.

Google Colab 

Google Colab. (n.d.). Introduction to Google Colab. Retrieved from https://colab.research.google.com/ 

Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units). Colab is especially well suited to machine learning, data science, and education. 
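For instance, a notebook that should behave differently on Colab (say, mounting Google Drive only there) can detect the Colab runtime. Checking `sys.modules` for the `google.colab` package is a common idiom for this; a minimal sketch:

```python
import sys

def in_colab() -> bool:
    """Return True when the code is executing in a Google Colab runtime."""
    # Colab preloads the google.colab package into every notebook session,
    # so its presence in sys.modules distinguishes Colab from a local machine.
    return "google.colab" in sys.modules

print("Running in Colab:", in_colab())
```

A guard like this lets the same notebook run unchanged both in Colab and on a local machine, which helps keep the analysis portable across the environments described above.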

GitHub Guides  

GitHub. (n.d.). GitHub Tutorial: Getting Started with GitHub. Retrieved from GitHub Guides 

Git started on your first repository in the third installment of GitHub for Beginners. Discover the essential features and settings to manage your projects effectively. 

Posit 

Posit PBC. (n.d.). Posit Cloud Documentation. Retrieved from https://docs.posit.co/cloud/ 

Posit makes it easy to deploy open-source data science work across the enterprise safely and securely. Share Jupyter notebooks, Plotly dashboards, or interactive applications built with popular R and Python frameworks. You may want to review the documentation on setting up a Posit Cloud Account. 

RPubs 

RPubs is a publishing platform created by RStudio, designed to make sharing R Markdown documents simple and accessible. It allows data scientists, statisticians, and researchers to publish their R analyses, visualizations, and reports online in an easy-to-view, shareable format. Users can generate HTML reports directly from R Markdown and upload them to RPubs with a few clicks, making it ideal for sharing data science projects, tutorials, or reproducible research with peers and the public. This open access to R Markdown content promotes knowledge sharing, collaborative learning, and greater transparency in data-driven research, supporting the growth of the R and data science communities.

The Importance of Replicability, Reproducibility, and Generalizability in Science

Replicability refers to the ability of a study to be repeated with the same methodology and produce the same results. It is fundamental to the scientific method because it validates the reliability and consistency of research findings. When results are replicable, it builds confidence in the study's methods and conclusions (Goodman et al., 2016). 

Reproducibility involves achieving the same results using the original data and analysis code. It ensures the analysis is accurate and error-free and confirms the integrity of the computational procedures used (Peng, 2011). 
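A minimal illustration of this idea: fixing the random seed makes a stochastic analysis, here a toy bootstrap estimate, return identical numbers on every run, so others can reproduce the results exactly from the same code:

```python
import random

def bootstrap_mean(data, n_resamples, seed):
    """Mean of bootstrap-resampled means, deterministic for a fixed seed."""
    rng = random.Random(seed)  # local generator: no hidden global state
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)

data = [2.1, 3.5, 4.0, 2.8, 3.3]
run1 = bootstrap_mean(data, n_resamples=200, seed=42)
run2 = bootstrap_mean(data, n_resamples=200, seed=42)
print(run1 == run2)  # identical results across runs
```

Reporting the seed (and library versions) alongside the analysis code is a small step that makes this kind of exact reproduction possible for reviewers.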

Generalizability refers to the extent to which a study's findings can be applied to broader populations or different contexts. This is crucial for determining the relevance and applicability of research outcomes to real-world settings (Shadish et al., 2002). 

Achieving Replicability, Reproducibility, and Generalizability in Data Science and AI

In data science and AI, achieving these three aspects is essential for advancing knowledge and technology.  

  • Clear Documentation: Researchers must document their methodologies, including the data collection process, preprocessing steps, algorithms used, and parameter settings (Stodden, 2010).

  • Open Data and Code: Sharing data and code publicly enables other researchers to replicate the study. Platforms like GitHub and repositories like Zenodo facilitate this sharing (Sandve et al., 2013). 

  • Version Control: Version control systems like Git ensure that the exact versions of code and data used in the analysis are preserved and accessible (Ram, 2013).

  • Computational Environments: Containerization tools like Docker can encapsulate the computational environment, ensuring the code runs consistently across different systems (Boettiger, 2015). 

  • Diverse Datasets: Using varied datasets during model training can help ensure that the findings are not limited to specific data characteristics and can be applied to broader contexts (Bengio, 2012). 

  • Cross-Validation: Techniques like cross-validation help test the model on different data subsets, enhancing the robustness and generalizability of the results (Kohavi, 1995). 
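The cross-validation idea above can be sketched without any external libraries. Here the "model" is simply the training-set mean, standing in for a real learner; the structure of the fold loop is the same with any model:

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k folds over n items."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i not in test]
        yield train, test
        start += size

def cross_val_error(y, k=5):
    """Mean absolute error of predicting the training mean, over k folds."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)
        errors.extend(abs(y[i] - prediction) for i in test)
    return sum(errors) / len(errors)

y = [3.0, 1.5, 2.2, 4.1, 2.9, 3.6, 1.8, 2.4]
print(cross_val_error(y, k=4))
```

Because every observation serves as test data exactly once, the resulting error estimate is less sensitive to any single lucky or unlucky split, which is the robustness benefit cited above. In practice, library implementations (e.g., scikit-learn's `KFold`) add shuffling and stratification on top of this basic scheme.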

Best Practices

Replicability
Do's: Document all steps clearly; share data and code.
Don'ts: Keep methodologies vague; use proprietary data without sharing.

Reproducibility
Do's: Use version control; utilize containerization tools.
Don'ts: Rely on local environments only; ignore code dependencies.

Generalizability
Do's: Train on diverse datasets; use cross-validation.
Don'ts: Overfit to specific data; ignore external validity testing.

Data Documentation
Do's: Provide metadata for datasets; describe preprocessing.
Don'ts: Omit data cleaning steps; use undocumented datasets.

Algorithm Transparency
Do's: Explain algorithm choices; share parameter settings.
Don'ts: Use "black-box" approaches; hide model configurations.

Open Science
Do's: Publish in open-access journals; participate in peer review.
Don'ts: Restrict access to results; avoid peer scrutiny.

CMP 9701 v3 - Pre Prospectus and Guidelines for Data Science Students Documents

DIS Sequence - Dissertation Template and Guidelines for Data Science Students in the Dissertation Courses
