Organize your references using a recommended method, such as RefWorks, Mendeley, Zotero, or EndNote. Share your reference storage folder with both your chair and SME (subject matter expert). This ensures they can access your sources and provide informed feedback on your literature review.
Effectively storing and organizing references is crucial for any research project, including a data science dissertation. Proper reference management ensures all sources are easily accessible, correctly cited, and systematically organized. Here are some common methods for storing references and their advantages and disadvantages.
| Method | Advantages | Disadvantages |
|---|---|---|
| Reference Management Software (EndNote, Zotero, Mendeley, RefWorks) | Automated citation and bibliography generation; easy organization with tagging, folders, and search functions; integration with word processing software for seamless writing and citing; ability to import references directly from academic databases | Learning curve to master software features; potential software costs, although some tools offer free versions; risk of data loss if not regularly backed up or synced |
| Spreadsheets (Excel, Google Sheets) | Customizable format to suit personal preferences and specific needs; free to use and widely accessible; easy to sort and filter references based on various criteria | Time-consuming manual entry and updating; limited features for citation and bibliography management; higher potential for human error in data entry |
| Manual Filing Systems (physical or digital folders) | No new software tools to learn; can be organized in a way that makes sense to the researcher; useful for managing hard-copy articles and physical books | Difficult to search quickly and retrieve specific references; no integration with word processing software for automatic citations; takes up physical space and can become unwieldy with many references |
EndNote: (https://www.endnote.com/)
Zotero: (https://www.zotero.org/)
Mendeley: (https://www.mendeley.com/)
RefWorks: (https://refworks.proquest.com/)
Google Drive: (https://drive.google.com/) - for a simple folder-based approach
Microsoft OneDrive: (https://onedrive.live.com/) - for a simple folder-based approach
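If you prefer the spreadsheet approach shown in the table above, a small script can preserve its main advantage (sorting and filtering) while producing a file that is easy to share from Google Drive or OneDrive. The following is a minimal sketch, assuming pandas is available; the column names, example entries, and file name are illustrative only, not a required format.

```python
# A minimal sketch of the "Spreadsheets" method: track references in a table,
# sort and filter them, and export a CSV to a shared folder for your committee.
import pandas as pd

# Hypothetical columns; adjust to your own citation needs.
references = pd.DataFrame(
    [
        {"authors": "Peng, R. D.", "year": 2011,
         "title": "Reproducible research in computational science",
         "tags": "reproducibility"},
        {"authors": "Kitchin, R.", "year": 2014,
         "title": "The Data Revolution", "tags": "data acquisition"},
    ]
)

# Sort and filter -- the main advantage of the spreadsheet method.
recent = references[references["year"] >= 2014].sort_values("year", ascending=False)

# Save to a folder you share with your chair and SME (e.g., Google Drive or OneDrive).
recent.to_csv("dissertation_references.csv", index=False)
print(recent[["authors", "year", "title"]])
```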
Data science research differs from other fields in its focus on data-driven methods to solve problems and uncover patterns. Unlike traditional research, which often relies on controlled experiments or theoretical models, data science uses large datasets, algorithms, and statistical tools to generate insights. The field is highly interdisciplinary, combining elements of computer science, mathematics, and domain-specific knowledge to analyze and predict real-world outcomes. Data science also emphasizes practical applications, aiming to create models that can be used directly for decision-making in the healthcare, finance, and technology industries. Its rapid pace of change, driven by advances in machine learning and AI, also sets it apart, making adaptability and continuous learning essential for researchers.
This page ensures that you have all the required material for your study and dissertation in one place. Note that several common documents and templates are inappropriate for our studies. Read the information provided carefully.
Data Collection versus Data Acquisition
If you are acquiring data, one of the most important considerations is identifying an appropriate dataset. Deciding whether to collect new data or acquire existing datasets is a critical step in the research process. Each approach has advantages and challenges that must be carefully weighed against the research objectives, available resources, and the nature of the study.
Data Collection
Data collection involves gathering raw data firsthand through surveys, experiments, interviews, and observations. This approach allows researchers to tailor the data to their research questions and objectives, ensuring it is relevant and directly applicable. One of the primary advantages of data collection is the control it gives researchers over the quality and scope of the data (Creswell & Creswell, 2017). Researchers can design their data collection processes to minimize biases and errors, thereby enhancing the validity and reliability of the results.
However, data collection can be time-consuming and resource-intensive. It often requires substantial planning, coordination, and financial investment, especially for large-scale studies or complex experimental designs. Additionally, ethical considerations such as obtaining informed consent and ensuring participant confidentiality must be meticulously managed (Patten & Newhart, 2017).
Data Acquisition
Data acquisition involves obtaining datasets that have already been collected and preprocessed by other researchers, organizations, or institutions. This approach can significantly reduce the time and cost associated with data gathering, allowing researchers to focus more on data analysis and interpretation (Kitchin, 2014). Existing datasets can often be sourced from repositories, government databases, academic institutions, or commercial entities and are particularly useful when large-scale data is required or the research timeline is constrained.
The main challenge with data acquisition is ensuring the relevance and suitability of the dataset for the specific research questions. Researchers must critically evaluate the quality of the data, including how it was collected and processed and any inherent biases (Waller & Fawcett, 2013). Additionally, pre-existing data often comes with restrictions on how it can be used, shared, or published, necessitating careful adherence to licensing agreements and ethical guidelines.
For data science studies, particularly those involving sensitive data or human subjects, securing IRB approval early in the dissertation process is imperative. At National University, the IRB pre-approval process should be completed by the end of the first course in your dissertation sequence. Early approval helps prevent delays and unnecessary changes to your research plans. Delays in securing IRB approval can significantly impact your research timeline, potentially leading to revisions that could have been avoided with early consultation.
Researchers must complete the most current online training through the Collaborative Institutional Training Initiative (CITI) program. To review the requirements for CITI training, please visit the CITI Training page in the library. The CITI program provides comprehensive training in research ethics, compliance, and best practices, essential for conducting ethical and responsible research. This training is crucial in preparing for your dissertation and ensuring that your study meets the highest ethical standards.
The CITI program covers various topics, including human subjects research, data management, and ethical considerations. Completing this training ensures you are well-versed in the ethical guidelines and regulatory requirements governing research. This knowledge is vital for protecting the rights and welfare of research participants and maintaining the integrity of your study.
You should complete the CITI training before the end of this course. This timeline ensures that you are prepared to engage ethically with your research participants and handle data responsibly from the very beginning of your dissertation process. Early completion of the training also aligns with the pre-approval process for the Institutional Review Board (IRB), allowing you to avoid delays in securing IRB approval.
The IRB approval process for our program should be completed during the FIRST DIS Course.
Please visit IRB HOME in the NU Library for a complete explanation of the steps you must take.
Start here: Institutional Review Board (IRB): Get Started with IRB, and start early.
In addition to the standard IRB process, certain research studies may require additional approval from specialized IRBs, such as those managed by the Centers for Disease Control and Prevention (CDC). These special IRBs are necessary when research involves high-risk populations, sensitive health data, or specific regulatory requirements.
The templates are divided by CMP course and DIS sequence. Each is accompanied by an explanation document (Guidelines). Scroll down to reach the files.
Preparing and Sharing Your GitHub or Google Colab Folder
Utilizing cloud storage for code in a PhD study offers several advantages, particularly for projects that necessitate collaboration, scalability, and secure data management. Storing code in the cloud enables seamless access across multiple devices, ensuring that PhD candidates, advisors, and collaborators can access the latest versions of the code from anywhere with an internet connection. This accessibility supports more flexible workflows and reduces dependency on a single device or location, which can be especially useful during fieldwork, travel, or unexpected equipment failures.
Cloud storage also enhances security and version control. Cloud-hosted platforms such as GitHub and GitLab track changes made to the code over time, making it easy to revert to previous versions if errors occur or to review how the code has evolved. This also enables transparent documentation of the research process, which is crucial for reproducibility and accountability in academic research.
Additionally, cloud storage offers scalable resources for running code on larger datasets or more complex models than a local machine can handle. Many cloud platforms integrate with powerful computing resources, enabling PhD researchers to leverage these resources as needed without purchasing expensive hardware.
Google Colab. (n.d.). Introduction to Google Colab. Retrieved from Google Colab
Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units). Colab is especially well suited to machine learning, data science, and education.
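If you plan to rely on Colab's GPUs or TPUs, it is worth confirming that the runtime actually has an accelerator attached before starting long-running jobs (Runtime > Change runtime type). The snippet below is a minimal sketch assuming TensorFlow, which Colab provides by default; an equivalent check could be done with other frameworks.

```python
# A minimal sketch: confirm whether the current Colab runtime has a GPU attached.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print("GPU available:", gpus)
else:
    print("No GPU detected; the notebook will fall back to CPU.")
```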
GitHub. (n.d.). GitHub Tutorial: Getting Started with GitHub. Retrieved from GitHub Guides
Git started on your first repository in the third installment of GitHub for Beginners. Discover the essential features and settings to manage your projects effectively.
Posit PBC. (n.d.). Posit Cloud Documentation. Retrieved from https://docs.posit.co/cloud/
Posit makes it easy to deploy open-source data science work across the enterprise safely and securely. Share Jupyter notebooks, Plotly dashboards, or interactive applications built with popular R and Python frameworks. You may want to review the documentation on setting up a Posit Cloud Account.
RPubs is a publishing platform created by RStudio designed to make sharing R Markdown documents simple and accessible. It allows data scientists, statisticians, and researchers to publish their R analyses, visualizations, and reports online in an easy-to-view and share format. Users can generate HTML reports directly from R Markdown and upload them to RPubs with a few clicks, making it ideal for sharing data science projects, tutorials, or reproducible research with peers and the public. This open access to R Markdown content promotes knowledge sharing, collaborative learning, and greater transparency in data-driven research, supporting the growth of the R and data science communities.
The Importance of Replicability, Reproducibility, and Generalizability in Science
Replicability refers to the ability of a study to be repeated with the same methodology and produce the same results. It is fundamental to the scientific method because it validates the reliability and consistency of research findings. When results are replicable, it builds confidence in the study's methods and conclusions (Goodman et al., 2016).
Reproducibility involves achieving the same results using the original data and analysis code. It ensures the analysis is accurate and error-free and confirms the integrity of the computational procedures used (Peng, 2011).
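In computational work, a common first step toward reproducibility is fixing the random seeds used in the analysis and recording the versions of the libraries involved. The snippet below is a sketch of one widely used convention, not a prescribed method; the seed value and libraries shown are illustrative.

```python
# A minimal sketch of a reproducibility convention: fix random seeds and report
# library versions alongside the results, so others can rerun the same code on
# the same data and match the output.
import random
import numpy as np

SEED = 42  # the value is arbitrary; what matters is that it is fixed and documented
random.seed(SEED)
np.random.seed(SEED)

print("Seed:", SEED, "| NumPy version:", np.__version__)
```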
Generalizability refers to the extent to which a study's findings can be applied to broader populations or different contexts. This is crucial for determining the relevance and applicability of research outcomes to real-world settings (Shadish et al., 2002).
Achieving Replicability, Reproducibility, and Generalizability in Data Science and AI
In data science and AI, achieving these three aspects is essential for advancing knowledge and technology.
Clear Documentation: Researchers must document their methodologies, including the data collection process, preprocessing steps, algorithms used, and parameter settings (Stodden, 2010).
Open Data and Code: Sharing data and code publicly enables other researchers to replicate the study. Platforms like GitHub and repositories like Zenodo facilitate this sharing (Sandve et al., 2013).
Version Control: Version control systems like Git ensure that the exact versions of code and data used in the analysis are preserved and accessible (Ram, 2013).
Computational Environments: Containerization tools like Docker can encapsulate the computational environment, ensuring the code runs consistently across different systems (Boettiger, 2015).
Diverse Datasets: Using varied datasets during model training can help ensure that the findings are not limited to specific data characteristics and can be applied to broader contexts (Bengio, 2012).
Cross-Validation: Techniques like cross-validation help test the model on different data subsets, enhancing the results' robustness and generalizability (Kohavi, 1995).
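As an illustration of the cross-validation point above, the sketch below uses scikit-learn with a built-in dataset and a simple classifier; both are stand-ins for your own data and model, and the number of folds is an arbitrary choice.

```python
# A minimal sketch of k-fold cross-validation: accuracy is estimated on several
# held-out subsets rather than a single train/test split, giving a better sense
# of how well the model generalizes.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
model = LogisticRegression(max_iter=5000)    # stand-in model

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```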
Best Practices
| Best Practice | Do's | Don'ts |
|---|---|---|
| Replicability | Document all steps clearly | Keep methodologies vague |
| Reproducibility | Use version control | Rely on local environments only |
| Generalizability | Train on diverse datasets | Overfit to specific data |
| Data Documentation | Provide metadata for datasets | Omit data cleaning steps |
| Algorithm Transparency | Explain algorithm choices | Use "black-box" approaches |
| Open Science | Publish in open-access journals | Restrict access to results |