This guide contains all of the ASC's statistics resources. If you do not see a topic, suggest it through the suggestion box on the Statistics home page.


When testing parametric assumptions, it is important to understand that assumptions are specific to the test being conducted. You can use our test guides to identify the assumptions specific to the test you're conducting. Then, consider joining the SPSS Group Session for your specific test or scheduling an individual session to learn how to test the assumptions, run the test, and interpret the output.

Use the tabs on this page to learn about the following assumptions:

- normality
- outliers
- homogeneity of variance
- homoscedasticity
- linearity
- multicollinearity

The assumption of normality indicates that continuous data is *approximately* normally distributed. I emphasize approximately because the data does not need to be perfectly normal in order to meet this assumption. What specifically needs to be approximately normal varies from test to test. Here is how this assumption applies to common tests:

**T-Tests (Independent & Dependent):** The dependent variable is approximately normally distributed *for each group* of the independent variable.

**ANOVA Family (including MANOVA and ANCOVA):** The dependent variable is approximately normally distributed *for each group*.

**Regression (Simple and Multiple):** The residuals are approximately normally distributed.

**How to Evaluate Normality**

There are multiple statistics and graphs that you can consider when evaluating normality. You can choose which one(s) you want to use for your course or research. This guide will introduce each option for your consideration.

*Skewness & Kurtosis*

**Skewness** refers to the asymmetry of the distribution, which shows up in the location of its peak. In a normal distribution, the peak is in the middle of the distribution. In a skewed distribution, the peak is off to one side, with one tail longer than the other. Consider these distributions:

Negative skewness coefficients indicate the distribution is skewed left (or negatively skewed). Conversely, positive coefficients indicate the distribution is skewed right (or positively skewed). The distribution is not skewed at all if the coefficient is equal to 0, so when you're evaluating this statistic, you want a skewness value that is close to 0. **A coefficient between -.5 and +.5 can be considered approximately normal.** Coefficients between .5 and 1 (positive or negative) are moderately skewed. Coefficients beyond 1 (positive or negative) indicate the distribution is highly skewed.

You can also convert the skewness statistic to a *z*-score to determine the significance of the skewness. To do this, simply divide the skewness statistic from the output by its standard error from the output. Here is the formula for standardizing this value:

$$z_{\text{skewness}} = \frac{\text{skewness}}{SE_{\text{skewness}}}$$

You can compare your obtained *z*-score to the critical *z* values of -2.58 and +2.58. If your obtained *z*-score falls between the two cutoffs (closer to 0), then you can infer that the distribution is approximately normal with respect to skewness. If your obtained *z*-score falls beyond either cutoff, the distribution is likely skewed too much to meet the assumption of normality.
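The calculation above can be sketched in Python for readers who want to check their output by hand. This is a minimal sketch, not the SPSS implementation: it assumes the sample-adjusted skewness formula and the standard error formula that statistics packages commonly report, and the `scores` data and the `skewness_z` function name are hypothetical.

```python
import math

def skewness_z(values):
    """Return (skewness, standard error, z) for a list of numbers.

    Assumes the sample-adjusted skewness coefficient and its usual
    standard error; values must not all be identical.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample adjustment
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return skew, se, skew / se

scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]  # hypothetical data with one large value
skew, se, z = skewness_z(scores)
print(f"skewness = {skew:.3f}, SE = {se:.3f}, z = {z:.3f}")
print("within +/-2.58" if abs(z) < 2.58 else "beyond +/-2.58: likely too skewed")
```

If you already have the skewness and standard error from your output, only the final division is needed.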

**Kurtosis** refers to how narrow or wide the distribution is. Higher coefficients indicate a taller, skinnier distribution. Consider this visual to understand what kurtosis is measuring:

A normal distribution has a kurtosis of 3. However, SPSS reports the *excess kurtosis*, or the amount by which the kurtosis differs from 3, so a normal distribution has an excess kurtosis of 0. Coefficients less than 0 indicate the distribution is flatter than a normal distribution. Coefficients greater than 0 indicate the distribution is taller than a normal distribution. **A kurtosis value between -1 and +1 can be considered approximately normal**. In some cases, values between -2 and +2 may be acceptable. Choose what you feel is best, and be sure to justify your decision.

You can also convert the kurtosis statistic to a *z*-score to determine the significance of the kurtosis. To do this, simply divide the kurtosis statistic from the output by its standard error from the output. Here is the formula for standardizing this value:

$$z_{\text{kurtosis}} = \frac{\text{kurtosis}}{SE_{\text{kurtosis}}}$$

You can compare your obtained *z*-score to the critical *z* values of -2.58 and +2.58. If your obtained *z*-score falls between the two cutoffs (closer to 0), then you can infer that the distribution has approximately normal kurtosis. If your obtained *z*-score falls beyond either cutoff, the distribution likely has too much kurtosis to meet the assumption of normality.
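The same check can be sketched for kurtosis. As before, this is an illustration under stated assumptions rather than the SPSS implementation: it assumes the sample-adjusted excess kurtosis formula and its usual standard error, and the `kurtosis_z` name and the `ratings` data are hypothetical.

```python
import math

def kurtosis_z(values):
    """Return (excess kurtosis, standard error, z) for a list of numbers.

    Assumes the sample-adjusted excess kurtosis and its usual standard
    error; values must not all be identical.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    g2 = m4 / m2 ** 2 - 3                          # unadjusted excess kurtosis
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return kurt, se, kurt / se

ratings = [1, 2, 3, 4, 5]  # hypothetical, evenly spread (flat) data
kurt, se, z = kurtosis_z(ratings)
print(f"excess kurtosis = {kurt:.3f}, SE = {se:.3f}, z = {z:.3f}")
print("within +/-2.58" if abs(z) < 2.58 else "beyond +/-2.58: too much kurtosis")
```

Note that the coefficient here (-1.2) falls outside the -1 to +1 guideline while the *z*-score (-0.6) is well inside the cutoffs. With very small samples the standard error is large, so the two criteria can disagree.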

*K-S Test & Shapiro-Wilk*

There are two statistical tests that can be conducted to test the normality of a distribution: Kolmogorov-Smirnov (K-S) and Shapiro-Wilk. The **K-S Test** is appropriate for sample sizes of 50 or more. The **Shapiro-Wilk Test** is used when the sample size is less than 50.

Both tests have a null hypothesis that the distribution is normal. Therefore, non-significance (*p* > .05) on these tests supports that the assumption of normality is met. Significant (*p* < .05) results suggest the data is not normally distributed. However, one should use caution before taking the results of these tests at face value, since the assumption is only that the data is *approximately* normal. With large samples in particular, these tests can flag small departures from normality that have little practical impact.
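For readers working outside SPSS, the decision rule above can be sketched with SciPy. This is an illustration only and assumes SciPy is installed. The two samples are constructed for the example, and `normality_tests` is a hypothetical helper name. One caveat: SPSS reports its K-S test with a Lilliefors significance correction, while `scipy.stats.kstest` with a mean and standard deviation estimated from the sample applies no such correction, so the K-S *p*-values will not match SPSS output.

```python
import math
from statistics import NormalDist, mean, stdev
from scipy import stats

def normality_tests(values):
    """Run Shapiro-Wilk and an uncorrected K-S test against a normal curve."""
    _, shapiro_p = stats.shapiro(values)
    # K-S against a normal distribution with the sample's own mean and SD
    _, ks_p = stats.kstest(values, "norm", args=(mean(values), stdev(values)))
    return {"shapiro_p": shapiro_p, "ks_p": ks_p}

n = 30
# Constructed samples: evenly spaced quantiles of a normal and an exponential curve
normal_like = [NormalDist(50, 10).inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
right_skewed = [-10 * math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]

for label, sample in [("normal-like", normal_like), ("right-skewed", right_skewed)]:
    result = normality_tests(sample)
    verdict = "assumption supported" if result["shapiro_p"] > 0.05 else "assumption questionable"
    print(f"{label}: Shapiro-Wilk p = {result['shapiro_p']:.3f} ({verdict})")
```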

*Histogram*

A histogram is a graph that shows how continuous data is distributed. A visual examination of a histogram can be used to support a decision about the approximate normality of the data. Most programs that can generate a histogram can also impose a normal curve over the graph to aid in interpreting the normality. The more the bars align to the curve, the better it fits the distribution. Consider the following examples:

**Approximately Normal Distribution**

**Skewed Distribution**

*Normal Q-Q Plots*

A normal Q-Q plot is an excellent visual to consider when making a determination about the normality of the dataset. The line represents a normal distribution. The dots represent the data. The closer the dots are to the line, the closer the data is to being normally distributed.

**Approximately Normal**

**Skewed Distribution**
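To make the logic of a normal Q-Q plot concrete, the sketch below computes the coordinates that such a plot displays: each observed value paired with the value a normal distribution would predict at the same rank. It is a minimal sketch under assumptions. The plotting position `(i - 0.5) / n` is one common choice and may differ from what your software uses, and the function names and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observation with its expected value under normality."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    return [
        (reference.inv_cdf((i - 0.5) / n), observed)  # (expected, observed)
        for i, observed in enumerate(ordered, start=1)
    ]

def largest_gap(points):
    """Largest vertical distance between a dot and the reference line."""
    return max(abs(observed - expected) for expected, observed in points)

symmetric = [1, 2, 3, 4, 5]   # hypothetical, evenly spread data
skewed = [1, 1, 2, 2, 30]     # hypothetical data with a long right tail

for label, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{label}: largest gap from the line = {largest_gap(qq_points(data)):.2f}")
```

Dots that sit on the line have an expected value equal to the observed value, so a smaller gap corresponds to the "closer to the line" pattern described above.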

An **outlier** is an unusually large or small value in a dataset for numerical (interval or ratio) data. How to check for outliers depends on the statistical test being conducted. Here are some considerations based on common tests:

**T-Tests:** Each group should have no outliers in the dependent variable.

**ANOVA Family:** The dependent variable should have no outliers when assessed in groups of the independent variable(s).

**Regression:** There should be no outliers in the residuals.

**How to Check for Outliers**

There are a few options that you can consider when checking for outliers. You can choose the one that is appropriate for your course or research. This guide will introduce each option for your consideration.

*Boxplot*

A boxplot is a great way of identifying outliers for t-tests or ANOVAs, as you can create a plot based on groups to evaluate the data. Boxplots also clearly identify outliers using Tukey's Method. This method distinguishes outliers using the interquartile range to compute limits to the data. Any values that exceed those limits are flagged as outliers in the boxplot. The boxplot below depicts the distribution of prices based on vehicle type. The dots above each box-and-whisker plot denote outliers in the data.
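Tukey's Method can be sketched in a few lines of Python. The sketch assumes the conventional multiplier of 1.5 times the interquartile range, which is the usual boxplot rule. Quartile definitions vary between programs, so the limits here may differ slightly from those in your software. The vehicle prices and the `tukey_outliers` name are hypothetical.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return (outliers, (lower limit, upper limit)) using the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles via Python's default method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

# Hypothetical prices (in thousands) grouped by vehicle type
prices_by_type = {
    "sedan": [18, 20, 21, 22, 23, 24, 25, 26, 60],
    "truck": [30, 32, 33, 35, 36, 38, 39, 41, 42],
}

for vehicle_type, prices in prices_by_type.items():
    flagged, (lower, upper) = tukey_outliers(prices)
    print(f"{vehicle_type}: limits = ({lower}, {upper}), outliers = {flagged}")
```

Checking each group separately mirrors how a grouped boxplot evaluates outliers for a t-test or ANOVA.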

*Scatterplot*

If you have two continuous variables, a scatterplot is an effective way of identifying outliers. This method is well-suited to a correlation or simple linear regression analysis, as it allows you to identify unusual points in the data. The scatterplot below depicts the relationship between patient age and hospital length of stay (LOS). There are two points ([20,90] and [25,71]) that need further scrutiny to determine if they deviate from the rest of the data enough to be considered outliers.

*Standardized Residuals*

In multiple regression models, it is better to examine the residuals directly for outliers. There are different approaches to doing this. This guide will introduce the most common approach: standardized residuals. When running a multiple regression using SPSS, you can click on the **Statistics** button in the regression dialogue window to access the option to include the output needed for assessing for outliers. In the *Statistics* options, select the **Casewise diagnostics** box, as shown below.

As shown above, selecting this option will prompt SPSS to include an output table flagging outliers whose standardized residual exceeds three (3) standard deviations in either direction. You can adjust this value if desired, but three is a commonly accepted cutoff. If there are outliers that meet this criterion in the dataset, you will receive a table like the following in your output:
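The idea behind this casewise check can be sketched in Python. To keep the example short, the sketch assumes a simple regression with one predictor, whereas the text above concerns multiple regression. It also assumes that a standardized residual is the raw residual divided by the standard error of the estimate. The data and the `standardized_residuals` name are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Fit y = intercept + slope * x and return each residual divided by
    the standard error of the estimate, sqrt(SSE / (n - 2))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    see = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / see for e in residuals]

# Hypothetical data: 20 cases following y = 2x with small alternating noise
x = list(range(1, 21))
y = [2 * xi + (1 if xi % 2 == 0 else -1) for xi in x]
y[9] += 30  # make case 10 unusually high

z_resid = standardized_residuals(x, y)
flagged = [(case, round(value, 2))
           for case, value in enumerate(z_resid, start=1)
           if abs(value) > 3]
print("cases beyond 3 standard deviations:", flagged)
```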

The homogeneity of variances assumption is most commonly tested using Levene's Test for Homogeneity. This assumption applies to analyses with independent groups, including the Independent Samples T-Test and the ANOVA family of tests. Levene's Test has a null hypothesis that the variances of the groups being compared are equal. Significant results (*p* < .05) on this test suggest that the variances are *significantly different*, thus violating the assumption.
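Outside SPSS, the same decision rule can be sketched with SciPy's `scipy.stats.levene`. This is an illustration that assumes SciPy is installed, and the three groups are hypothetical. SciPy centers on the median by default, so `center="mean"` is passed here to use the mean-based version of the test.

```python
from scipy import stats

# Hypothetical scores for three independent groups
group_a = [1, 2, 3, 4, 5]
group_b = [11, 12, 13, 14, 15]    # same spread as group_a, different mean
group_c = [-40, -20, 0, 20, 40]   # much wider spread than group_a

def levene_decision(*groups, alpha=0.05):
    """Return (W statistic, p-value, True if equal variances can be assumed)."""
    w, p = stats.levene(*groups, center="mean")
    return w, p, p > alpha

for label, pair in [("a vs b", (group_a, group_b)), ("a vs c", (group_a, group_c))]:
    w, p, equal_ok = levene_decision(*pair)
    verdict = "equal variances assumed" if equal_ok else "assumption violated"
    print(f"{label}: W = {w:.2f}, p = {p:.3f} -> {verdict}")
```

A difference in group means alone does not affect the test, which is why the first comparison supports the assumption even though the groups are far apart.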

**Independent Samples T-Test**

When conducting an Independent Samples T-Test using SPSS, Levene's Test is automatically included in the output table for the test. Please refer to the T-Test guide if you're unsure of how to run that test in SPSS. When done correctly, you will see the following in your output:

When assessing the assumption, you're focusing on the area marked in red above. To interpret the results of the test, read the **Sig.** value in this space and apply the general rules for making a decision about the null hypothesis. In the test above, the *p*-value (.766) is greater than .05, so the results are *not significant*. This means that you can assume equal variances.

**ANOVA Family**

You have the option to include a test for homogeneity when conducting any of the ANOVA Family tests (i.e., ANOVA, MANOVA, ANCOVA). It is not automatically included, so you need to select the option in the settings before running the test. In the One-Way ANOVA dialogue window, you will select **Options** and check the box next to **Homogeneity of variance test** as shown below:

Checking this box will prompt SPSS to include the **Tests of Homogeneity of Variances** table in the output, as shown below.

In most cases, you will be reading across the top row (Based on Mean) to interpret the results. You are again looking at the **Sig.** value in that row to determine if the assumption is met.

This same table can be generated using the **General Linear Model** analysis in SPSS, which is used for any ANOVA that is not a One-Way ANOVA. The checkbox for including this test in the output is still found on the **Options** button in the dialogue window, as shown below:

Multicollinearity is an assumption that must be tested in regression analyses with multiple predictor (independent) variables. This assumption states that the predictor variables are not highly correlated with each other. Having too much overlap in predictors can muddle the results of the analysis. There are two analyses to consider when evaluating this assumption: correlation coefficients and Tolerance/VIF values.

**Correlation Coefficient**

When setting up the regression analysis, you can check the box next to **Descriptives** in the **Statistics** settings window, as shown below:

Checking this box will prompt SPSS to include a correlation matrix as part of the output, as shown below. The area marked in red on this matrix indicates where you would look to evaluate the assumption. You're looking for correlation coefficient values that exceed .70 (+.70 or -.70) between the predictors in the model. Since *Final Exam* is the outcome variable in the model, it does not need to be considered when exploring the assumption of multicollinearity, as the assumption speaks to the relationship between the predictors only.

**Tolerance/VIF Values**

Another setting to include when setting up the regression analysis is **Collinearity diagnostics**. This is found in the same settings box as the *Descriptives* mentioned above and shown below.

Checking this box will prompt SPSS to include the **Tolerance** and **VIF** values in the Coefficients output table, as shown below.

You need only consider the *Tolerance* or *VIF* values in the table, not both. If considering **Tolerance**, values less than 0.1 indicate the assumption has been violated. If considering **VIF**, values greater than 10 indicate the assumption has been violated.
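The link between the correlation, Tolerance, and VIF can be sketched for the simplest case of two predictors. With only two predictors, Tolerance equals 1 minus the squared correlation between them, and VIF is the reciprocal of Tolerance. With more predictors, each Tolerance value instead comes from regressing that predictor on all of the others. The predictor data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinearity_two_predictors(x1, x2):
    """Return (r, tolerance, vif) for a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2
    if tolerance == 0:
        raise ValueError("predictors are perfectly correlated")
    return r, tolerance, 1 / tolerance

# Hypothetical predictor scores for five students
quiz_average = [1, 2, 3, 4, 5]
homework_average = [2, 1, 4, 3, 5]

r, tolerance, vif = collinearity_two_predictors(quiz_average, homework_average)
print(f"r = {r:.2f}, Tolerance = {tolerance:.2f}, VIF = {vif:.2f}")
print("correlation exceeds .70" if abs(r) > 0.70 else "correlation within .70")
print("VIF exceeds 10" if vif > 10 else "VIF within 10")
```

In this example the correlation (.80) exceeds the .70 guideline while the VIF (about 2.78) is well under 10, so the two approaches can lead to different conclusions. Report which criterion you used.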

Homoscedasticity is another assumption pertaining to the homogeneity of variances but specific to relationship tests like correlation and regression. The homoscedasticity assumption states that the variance of the residuals is roughly equal across all predicted values. This assumption is tested by visually examining a scatterplot depicting the relationship between the studentized residuals and the unstandardized predicted values.

Below are three different scatterplots. The leftmost graph depicts what your graph should look like if the assumption is met. Notice how the points around the line maintain relatively the same distance from the line along the entire length of the line. The center graph shows linear heteroscedasticity. Notice how the points start to fan out the further down the line you get. The rightmost graph depicts non-linear heteroscedasticity. Notice fanning out on both ends of the line.

An assumption of most correlation and regression analyses is that of linearity. This assumption is tested by visually examining a scatterplot depicting the relationship between the studentized residuals and the unstandardized predicted values. When the pattern of dots in the scatterplot moves in a roughly straight line, it is a linear relationship. Consider the scatterplots on the correlation guide for examples of linear relationships. Consider the following visual comparisons:

In a correlation or simple linear regression, you're evaluating the relationship between the two variables. In regression analyses involving multiple predictor variables, you must evaluate linearity between the outcome and each predictor as well as the linearity of the overall model.

- Last Updated: Jul 16, 2024 11:19 AM
- URL: https://resources.nu.edu/statsresources