
Statistics Resources

This guide contains all of the ASC's statistics resources. If you do not see a topic, suggest it through the suggestion box on the Statistics home page.

Parametric Assumptions

When testing parametric assumptions, it is important to understand that assumptions are specific to the test being conducted. You can use our test guides to identify the assumptions specific to the test you're conducting. Then, consider joining the SPSS Group Session for your specific test or scheduling an individual session to learn how to test the assumptions, run the test, and interpret the output.

Use the tabs on this page to learn about the following assumptions:

  • normality
  • outliers
  • homogeneity of variance
  • homoscedasticity
  • linearity
  • multicollinearity

Normality

The assumption of normality indicates that continuous data is approximately normally distributed. I emphasize approximately because the data does not need to be perfectly normal in order to meet this assumption. What specifically needs to be approximately normal varies from test to test. Here is how this assumption applies to common tests:

T-Tests: For an Independent Samples T-Test, the dependent variable is approximately normally distributed for each group of the independent variable. For a Dependent (Paired) Samples T-Test, the difference scores are approximately normally distributed.

ANOVA Family (including MANOVA and ANCOVA): The dependent variable is approximately normally distributed for each group.

Regression (Simple and Multiple): The residuals are approximately normally distributed.


How to Evaluate Normality

There are multiple statistics and graphs that you can consider when evaluating normality. You can choose which one(s) you want to use for your course or research. This guide will introduce each option for your consideration.

Skewness & Kurtosis

Skewness refers to the asymmetry of a distribution. In a normal distribution, the peak is in the middle of the distribution and the two tails mirror each other. In a skewed distribution, the peak is off to one side, with one tail longer than the other. Consider these distributions:

image showing a negatively skewed distribution with the peak on the right, a normal distribution with the peak in the middle, and a positively skewed distribution with the peak on the left

Negative skewness coefficients indicate the distribution is skewed left (negatively skewed); positive coefficients indicate the distribution is skewed right (positively skewed). A coefficient of 0 indicates no skew at all, so when evaluating this statistic, you want a skewness value close to 0. A coefficient between -.5 and +.5 can be considered approximately normal. Coefficients between .5 and 1 (positive or negative) indicate moderate skew, and coefficients beyond +1 or -1 indicate the distribution is highly skewed.

You can also convert the skewness statistic to a z-score to determine whether the skewness is statistically significant. To do this, divide the skewness statistic by its standard error (both appear in the SPSS output). Here is the formula for standardizing this value:

z = skewness / standard error of skewness

You can compare your obtained z-score to the critical values of -2.58 and +2.58. If your obtained z-score falls between these cutoffs (closer to 0), you can infer that the distribution is approximately normal with respect to skewness. If your obtained z-score falls outside the critical values, the distribution is likely too skewed to meet the assumption of normality.
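If you would like to reproduce this calculation outside of SPSS, here is a minimal sketch in Python using numpy and scipy. The scores variable is hypothetical data; the standard-error formula shown is the one commonly used for sample skewness.

```python
import numpy as np
from scipy import stats

scores = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12])  # hypothetical data

n = len(scores)
skew = stats.skew(scores, bias=False)  # sample skewness, matching the statistic SPSS reports
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))  # standard error of skewness

z = skew / se_skew
print(f"skewness = {skew:.3f}, SE = {se_skew:.3f}, z = {z:.3f}")
# |z| < 2.58 -> skewness is within the range expected for an approximately normal distribution
```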

Kurtosis refers to how heavy the tails of the distribution are relative to a normal distribution, which shows up visually in how narrow or wide the distribution looks. Higher coefficients indicate a taller, skinnier distribution with heavier tails. Consider this visual to understand what kurtosis is measuring:

graph comparing a distribution with positive kurtosis, normal kurtosis, and negative kurtosis

A normal distribution has a kurtosis coefficient of 3. However, SPSS reports the excess kurtosis, or the amount by which the kurtosis differs from 3, so a value of 0 corresponds to normal kurtosis. Coefficients less than 0 indicate the distribution is flatter than a normal distribution; coefficients greater than 0 indicate the distribution is taller than a normal distribution. An excess kurtosis value between -1 and +1 can be considered approximately normal. In some cases, values between -2 and +2 may be acceptable. Choose the cutoff you feel is best, and be sure to justify your decision.

You can also convert the kurtosis statistic to a z-score to determine whether the kurtosis is statistically significant. To do this, divide the kurtosis statistic by its standard error (both appear in the SPSS output). Here is the formula for standardizing this value:

z = kurtosis / standard error of kurtosis

You can compare your obtained z-score to the critical values of -2.58 and +2.58. If your obtained z-score falls between these cutoffs (closer to 0), you can infer that the distribution has approximately normal kurtosis. If your obtained z-score falls outside the critical values, the distribution likely has too much (or too little) kurtosis to meet the assumption of normality.
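The same kind of check can be sketched in Python for kurtosis. Again, the scores variable is hypothetical, and the standard-error formula is the standard one for sample kurtosis (derived from the standard error of skewness).

```python
import numpy as np
from scipy import stats

scores = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12])  # hypothetical data

n = len(scores)
kurt = stats.kurtosis(scores, fisher=True, bias=False)  # excess kurtosis, as SPSS reports it
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))  # standard error of kurtosis

z = kurt / se_kurt
print(f"excess kurtosis = {kurt:.3f}, SE = {se_kurt:.3f}, z = {z:.3f}")
# |z| < 2.58 -> kurtosis is within the range expected for an approximately normal distribution
```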

K-S Test & Shapiro-Wilk

Two statistical tests can be conducted to test the normality of a distribution: the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk test. The K-S test is typically used for sample sizes larger than 50, while the Shapiro-Wilk test is preferred for samples of 50 or fewer.

output table showing the results of the K-S Test and Shapiro-Wilk

Both tests have a null hypothesis that the distribution is normal. Therefore, a non-significant result (p > .05) on these tests supports that the assumption of normality is met, while a significant result (p < .05) suggests the data is not normally distributed. However, use caution before taking the results of these tests at face value: the assumption only requires approximate normality, and with large samples these tests can flag even trivial departures from normality as significant.
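Both tests can also be run outside of SPSS. Below is a minimal sketch in Python; the scores data is hypothetical. Note that SPSS's K-S test in the Explore procedure applies the Lilliefors significance correction, so the closest match in Python is the lilliefors function from statsmodels rather than a plain K-S test.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

scores = np.random.default_rng(1).normal(loc=50, scale=10, size=120)  # hypothetical data

# Shapiro-Wilk (generally preferred for samples of 50 or fewer)
w_stat, w_p = stats.shapiro(scores)

# Kolmogorov-Smirnov with the Lilliefors correction (closest analogue to SPSS's K-S test)
ks_stat, ks_p = lilliefors(scores, dist='norm')

print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {w_p:.3f}")
print(f"K-S (Lilliefors): D = {ks_stat:.3f}, p = {ks_p:.3f}")
# p > .05 on either test is consistent with approximate normality
```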

Histogram

A histogram is a graph that shows how continuous data is distributed. A visual examination of a histogram can be used to support a decision about the approximate normality of the data. Most programs that can generate a histogram can also overlay a normal curve on the graph to aid in interpreting normality. The more closely the bars follow the curve, the closer the data is to a normal distribution. Consider the following examples:

histogram depicting an approximately normal distribution (Approximately Normal Distribution)

histogram depicting a right-skewed distribution (Skewed Distribution)

Normal Q-Q Plots

A normal Q-Q plot is an excellent visual to consider when making a determination about the normality of the dataset. The line represents a normal distribution. The dots represent the data. The closer the dots are to the line, the closer the data is to being normally distributed.

q-q plot of a normal distribution (Approximately Normal)

q-q plot of a skewed distribution (Skewed Distribution)
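Both of these visual checks, a histogram with a normal curve overlaid and a normal Q-Q plot, can also be produced outside of SPSS. Here is a minimal sketch in Python using matplotlib and scipy; the scores data is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

scores = np.random.default_rng(1).normal(loc=50, scale=10, size=200)  # hypothetical data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with a normal curve overlaid (density scale so the bars and curve match)
ax1.hist(scores, bins=20, density=True, edgecolor='black')
x = np.linspace(scores.min(), scores.max(), 200)
ax1.plot(x, stats.norm.pdf(x, loc=scores.mean(), scale=scores.std(ddof=1)))
ax1.set_title('Histogram with normal curve')

# Normal Q-Q plot: points close to the reference line suggest approximate normality
stats.probplot(scores, dist='norm', plot=ax2)
ax2.set_title('Normal Q-Q plot')

plt.tight_layout()
plt.show()
```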

Outliers

An outlier is an unusually large or small value in a dataset for numerical (interval or ratio) data. How to check for outliers depends on the statistical test being conducted. Here are some considerations based on common tests:

T-Tests: Each group should have no outliers in the dependent variable.

ANOVA Family: The dependent variable should have no outliers when assessed within each group of the independent variable(s).

Regression: There should be no outliers in the residuals.


How to Check for Outliers

There are a few options that you can consider when checking for outliers. You can choose the one that is appropriate for your course or research. This guide will introduce each option for your consideration.

Boxplot

A boxplot is a great way of identifying outliers for t-tests or ANOVAs, as you can create a plot based on groups to evaluate the data. Boxplots also clearly identify outliers using Tukey's method, which uses the interquartile range (IQR) to compute limits for the data (1.5 times the IQR below the first quartile and above the third quartile). Any values beyond those limits are flagged as outliers in the boxplot. The boxplot below depicts the distribution of prices based on vehicle type. The dots above each box-and-whisker plot denote outliers in the data.

boxplot showing the price distributions based on vehicle type with outliers flagged in the chart
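If you want to apply Tukey's method directly, here is a minimal sketch in Python. The prices values are hypothetical and stand in for the prices of a single vehicle type; the fences are the same 1.5 × IQR limits a boxplot uses to flag outliers.

```python
import numpy as np

prices = np.array([18.5, 21.0, 22.4, 23.1, 24.0, 25.5, 26.2, 27.8, 29.0, 44.9])  # hypothetical prices

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences, the same rule a boxplot uses

outliers = prices[(prices < lower) | (prices > upper)]
print(f"IQR = {iqr:.2f}, fences = ({lower:.2f}, {upper:.2f}), outliers = {outliers}")
```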

Scatterplot

If you have two continuous variables, a scatterplot is an effective way of identifying outliers. This method is well-suited to a correlation or simple linear regression analysis, as it allows you to identify unusual points in the data. The scatterplot below depicts the relationship between patient age and hospital length of stay (LOS). There are two points ([20, 90] and [25, 71]) that need further scrutiny to determine whether they deviate from the rest of the data enough to be considered outliers.

scatterplot with line of best fit that can be used to identify unusual values in the relationship
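Here is a minimal sketch in Python of this kind of scatterplot with a line of best fit. The age and los values are hypothetical, loosely mimicking the two unusual points mentioned above.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data: patient age and hospital length of stay (LOS)
age = np.array([34, 45, 51, 62, 70, 75, 80, 20, 25])
los = np.array([3, 4, 5, 6, 8, 9, 11, 90, 71])

fig, ax = plt.subplots()
ax.scatter(age, los)

# least-squares line of best fit
slope, intercept = np.polyfit(age, los, 1)
xs = np.linspace(age.min(), age.max(), 2)
ax.plot(xs, slope * xs + intercept)

ax.set_xlabel('Age')
ax.set_ylabel('LOS (days)')
plt.show()
```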

Standardized Residuals

In multiple regression models, it is better to examine the residuals directly for outliers. There are different approaches to doing this; this guide will introduce the most common one: standardized residuals. When running a multiple regression in SPSS, click the Statistics button in the regression dialogue window to access the options for this output, and select the Casewise diagnostics box, as shown below.

view of the statistics settings in SPSS showing the Casewise diagnostics option

As shown above, selecting this option will prompt SPSS to include an output table flagging outliers whose standardized residual exceeds three (3) standard deviations in either direction. You can adjust this value if desired, but three is a commonly accepted cutoff. If there are outliers that meet this criterion in the dataset, you will receive a table like the following in your output:

Casewise Diagnostics output table showing one outlier
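Outside of SPSS, the same check can be sketched with statsmodels by computing standardized residuals (the residual divided by the standard error of the estimate, which mirrors SPSS's ZRESID) and flagging cases beyond three in either direction. The data frame and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# hypothetical data: one outcome (y) and two predictors (x1, x2)
rng = np.random.default_rng(4)
df = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
df['y'] = 2 + 1.5 * df['x1'] - 0.8 * df['x2'] + rng.normal(size=100)
df.loc[0, 'y'] += 15  # plant one unusual case so something gets flagged

X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()

# standardized residuals: raw residual / standard error of the estimate
std_resid = model.resid / np.sqrt(model.mse_resid)
flagged = df.index[np.abs(std_resid) > 3]
print("Cases with |standardized residual| > 3:", list(flagged))
```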

Homogeneity of Variance

The homogeneity of variances assumption is most commonly tested using Levene's Test for Homogeneity of Variance. This assumption applies to analyses with independent groups, including the Independent Samples T-Test and the ANOVA family of tests. Levene's Test has a null hypothesis that the variances of the groups being compared are equal. Significant results on this test suggest that the variances are significantly different, thus violating the assumption.

Independent Samples T-Test

When conducting an Independent Samples T-Test using SPSS, Levene's Test is automatically included in the output table for the test. Please refer to the T-Test guide if you're unsure of how to run that test in SPSS. When done correctly, you will see the following in your output:

view of the Independent Samples Test output with a red box around the Levene's Test portion of the output

When assessing the assumption, focus on the area marked in red above and interpret the Sig. value in that space, following the general rules for making a decision about the null hypothesis. In the output above, the p-value (.766) is greater than .05, so the result is not significant. This means that I can assume equal variances.
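Levene's test can also be run directly in Python with scipy; a minimal sketch is below. The two groups are hypothetical, and center='mean' matches the mean-based version of the test that SPSS reports. The same call accepts more than two groups for ANOVA designs.

```python
from scipy import stats

# hypothetical scores for two independent groups
group_a = [23, 25, 27, 29, 30, 31, 33, 35]
group_b = [22, 24, 26, 28, 31, 32, 34, 36]

stat, p = stats.levene(group_a, group_b, center='mean')
print(f"Levene's test: W = {stat:.3f}, p = {p:.3f}")
# p > .05 -> the variances can be treated as approximately equal (assumption met)
```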

ANOVA Family

You have the option to include a test for homogeneity when conducting any of the ANOVA Family tests (i.e., ANOVA, MANOVA, ANCOVA). It is not automatically included, so you need to select the option in the settings before running the test. In the One-Way ANOVA dialogue window, you will select Options and check the box next to Homogeneity of variance test as shown below: 

view of the Options in the One-Way ANOVA dialogue box

Checking this box will prompt SPSS to include the Tests of Homogeneity of Variances table in the output, as shown below.

view of the Tests of Homogeneity of Variances table generated in SPSS

In most cases, you will be reading across the top row (Based on Mean) to interpret the results. You are again looking at the Sig. value in that row to determine if the assumption is met.

This same table can be generated using the General Linear Model analysis in SPSS, which is used for any ANOVA that is not a One-Way ANOVA. The checkbox for including this test in the output is still found on the Options button in the dialogue window, as shown below:

view of the General Linear Model Univariate Options with the Homogeneity tests box checked

Homoscedasticity

Homoscedasticity is another assumption pertaining to the homogeneity of variances, but it is specific to relationship tests like correlation and regression. The homoscedasticity assumption states that the variance of the residuals is roughly equal across all predicted values. This assumption is tested by visually examining a scatterplot depicting the relationship between the studentized residuals and the unstandardized predicted values.

Below are three different scatterplots. The leftmost graph depicts what your graph should look like if the assumption is met: the points stay at roughly the same distance from the line along its entire length. The center graph shows linear heteroscedasticity: the points fan out the further along the line you go. The rightmost graph depicts non-linear heteroscedasticity: the points fan out at both ends of the line.

three scatterplots depicting homoscedastic and heteroscedastic distributions
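Here is a minimal sketch in Python of that residuals-versus-predicted scatterplot using statsmodels. The data frame and variable names are hypothetical, and the studentized residuals here are the internally studentized ones (the closest match to SPSS's SRESID).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# hypothetical data: one outcome (y) and two predictors (x1, x2)
rng = np.random.default_rng(5)
df = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
df['y'] = 3 + 2 * df['x1'] + df['x2'] + rng.normal(size=100)

X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()

stud_resid = model.get_influence().resid_studentized_internal  # studentized residuals
fitted = model.fittedvalues                                    # unstandardized predicted values

plt.scatter(fitted, stud_resid)
plt.axhline(0)
plt.xlabel('Unstandardized predicted values')
plt.ylabel('Studentized residuals')
plt.show()
# an even band of points around 0, with no fanning, supports homoscedasticity
```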

Linearity

An assumption of most correlation and regression analyses is that of linearity. This assumption is tested by visually examining a scatterplot depicting the relationship between the studentized residuals and the unstandardized predicted values. When the pattern of dots in the scatterplot moves in a roughly straight line, the relationship is linear. Consider the scatterplots on the correlation guide for examples of linear relationships, along with the following visual comparisons:

graphs depicting linear and non-linear distributions

In a correlation or simple linear regression, you're evaluating the relationship between the two variables. In regression analyses involving multiple predictor variables, you must evaluate linearity between the outcome and each predictor as well as the linearity of the overall model.
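One way to look at linearity between the outcome and each predictor outside of SPSS is with partial regression (added-variable) plots, which statsmodels can draw for a fitted model. The sketch below uses hypothetical data and variable names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# hypothetical data: one outcome (y) and two predictors (x1, x2)
rng = np.random.default_rng(6)
df = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
df['y'] = 1 + 2 * df['x1'] - df['x2'] + rng.normal(size=100)

X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()

# one partial regression plot per predictor; roughly straight patterns support linearity
fig = sm.graphics.plot_partregress_grid(model)
fig.tight_layout()
plt.show()
```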

Multicollinearity

Multicollinearity is an assumption that must be tested in regression analyses with multiple predictor (independent) variables. This assumption states that the predictor variables are not highly correlated with each other. Having too much overlap in predictors can muddle the results of the analysis. There are two analyses to consider when evaluating this assumption: correlation coefficients and Tolerance/VIF values.

Correlation Coefficient

When setting up the regression analysis, you can check the box next to Descriptives in the Statistics settings window, as shown below:

view of the linear regression statistics settings with the Descriptives box checked

Checking this box will prompt SPSS to include a correlation matrix as part of the output, as shown below. The area marked in red on this matrix indicates where you would look to evaluate the assumption. You're looking for correlation coefficients between the predictors that exceed .70 in absolute value (beyond +.70 or -.70), which would suggest a multicollinearity problem. Since Final Exam is the outcome variable in the model, it does not need to be considered when exploring the assumption of multicollinearity, as the assumption speaks to the relationship between the predictors only.

view of the correlation matrix with the area of interest marked in red
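The same check can be sketched in Python with pandas by correlating only the predictor columns. The data and the column names (Quiz1, Quiz2, Hours) below are hypothetical; the outcome variable is deliberately left out.

```python
import numpy as np
import pandas as pd

# hypothetical predictor scores; the outcome (e.g., Final Exam) is not included
predictors = pd.DataFrame({
    'Quiz1': [78, 85, 91, 64, 72, 88, 95, 70, 83, 77],
    'Quiz2': [74, 88, 93, 60, 70, 85, 97, 68, 80, 75],
    'Hours': [5, 8, 9, 3, 4, 7, 10, 4, 6, 5],
})

corr = predictors.corr()
print(corr.round(2))

# flag predictor pairs whose correlation exceeds .70 in absolute value
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # look at each pair only once
pairs = corr.where(mask).stack()
print(pairs[pairs.abs() > 0.70])
```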

Tolerance/VIF Values

Another setting to include when setting up the regression analysis is Collinearity diagnostics. This is found in the same settings box as the Descriptives mentioned above and shown below.

view of the linear regression statistics options with the Collinearity Diagnostics options selected

Checking this box will prompt SPSS to include the Tolerance and VIF values in the Coefficients output table, as shown below.

view of the Coefficients table with the Collinearity Statistics area marked in red

You need only consider the Tolerance or VIF values in the table, not both, since Tolerance is simply 1 divided by VIF. If considering Tolerance, values less than 0.1 indicate the assumption has been violated. If considering VIF, values greater than 10 indicate the assumption has been violated.
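Tolerance and VIF can also be computed outside of SPSS with statsmodels, as in the sketch below. The predictor data and column names are the same hypothetical ones used above; a constant is added because VIF should be computed on the model's design matrix.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical predictor scores; the outcome is not included
predictors = pd.DataFrame({
    'Quiz1': [78, 85, 91, 64, 72, 88, 95, 70, 83, 77],
    'Quiz2': [74, 88, 93, 60, 70, 85, 97, 68, 80, 75],
    'Hours': [5, 8, 9, 3, 4, 7, 10, 4, 6, 5],
})

X = sm.add_constant(predictors)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],  # skip the constant
    index=predictors.columns,
)
print("VIF:\n", vif.round(2))
print("Tolerance:\n", (1 / vif).round(3))  # Tolerance is simply 1 / VIF
# VIF > 10 (equivalently, Tolerance < 0.1) suggests the assumption is violated
```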