
A Brief Note on Selecting and Reporting the Right Statistical Test


Selecting the appropriate statistical technique for hypothesis testing hinges on both the study design and the nature of the data. This cheat sheet encapsulates the most prevalent pairings. If your intended combination is not covered, consulting an expert is advisable. While there are myriad techniques for each pairing, this guide focuses on those most often employed in Human-Computer Interaction (HCI) research. Their inclusion is based on their simplicity and/or their reliability, as they tend to provide the highest power for a specified significance level. While I am not a trained statistician, I frequently incorporate statistics into my research. I welcome any feedback or comments; please don't hesitate to reach out via email.

If the information on this page proves helpful, or if you wish to reinforce to reviewers that you have selected the appropriate statistical tests, kindly consider referencing this page using the citation provided below.

Ahmed Sabbir Arif. 2017. A Brief Note on Selecting and Reporting the Right Statistical Test. University of California, Merced, United States. https://www.theiilab.com/notes/HypothesisTesting.html


Appropriate Statistical Tests

Data Type | Normality Assumption | Sphericity Assumption | Statistical Test
Nominal Data | N/A | N/A | Non-parametric Test
Ordinal Data | N/A | N/A | Non-parametric Test
Interval Data | Yes | Yes | Parametric Test
Interval Data | Yes | No | Non-parametric Test
Interval Data | No | N/A | Non-parametric Test
Ratio Data | Yes | Yes | Parametric Test
Ratio Data | Yes | No | Non-parametric Test
Ratio Data | No | N/A | Non-parametric Test
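
The table can be read as a simple decision rule. The following sketch encodes it in Python; the function name and labels are illustrative, not part of the original note.

    # Illustrative encoding of the table above; names and labels are hypothetical.
    def test_family(data_type: str, normal: bool = False, spherical: bool = False) -> str:
        """Return the family of statistical test suggested by the table."""
        if data_type in ("nominal", "ordinal"):
            return "non-parametric"
        if data_type in ("interval", "ratio"):
            # Parametric tests require both normality and sphericity to hold.
            return "parametric" if (normal and spherical) else "non-parametric"
        raise ValueError(f"unknown data type: {data_type}")

    print(test_family("interval", normal=True, spherical=True))  # parametric
    print(test_family("ordinal"))                                # non-parametric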

Normality Tests

A normality test evaluates the null hypothesis, denoted \( H_0 \), that a data sample originates from a normally distributed population. Four widely used tests are listed below; a short SciPy sketch follows the list.

  1. Shapiro-Wilk test
  2. Anderson-Darling test
  3. D'Agostino's K-squared test
  4. Martinez-Iglewicz test
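
As a minimal sketch, assuming SciPy is installed, the first three tests can be run as follows; the Martinez-Iglewicz test is not available in SciPy, and the sample below is invented.

    # Normality checks on an invented sample of 30 task-completion times (ms).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=500, scale=50, size=30)

    w, p_shapiro = stats.shapiro(sample)            # Shapiro-Wilk
    k2, p_dagostino = stats.normaltest(sample)      # D'Agostino's K-squared
    anderson = stats.anderson(sample, dist="norm")  # Anderson-Darling

    print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_shapiro:.3f}")
    print(f"D'Agostino: K^2 = {k2:.3f}, p = {p_dagostino:.3f}")
    print(f"Anderson-Darling: A^2 = {anderson.statistic:.3f}, "
          f"5% critical value = {anderson.critical_values[2]:.3f}")

A \( p \)-value above .05 (or, for the Anderson-Darling test, a statistic below the critical value) means the test fails to reject \( H_0 \); that is, there is no evidence that the sample departs from normality.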

How to Report: Examples

  1. The Shapiro-Wilk test indicated that the sample likely originates from a normally distributed population.
  2. Using the Martinez-Iglewicz test, it was determined that the residuals of the response variable follow a normal distribution.

Non-Parametric Tests

Non-parametric tests are typically applied to data from nominal categories, questionnaires, and rating scales.

Variable | Study Design | Independent Variable: 2 Conditions | Independent Variable: 2+ Conditions
Nominal | Within-subjects (Correlated Observations) | McNemar's Test | Chi-Squared (\( \chi^2 \)) Test
Nominal | Between-subjects (Independent Observations) | Chi-Squared (\( \chi^2 \)) Test or Fisher's Exact Test | Chi-Squared (\( \chi^2 \)) Test
Ordinal and Non-Normal Quantitative | Within-subjects (Correlated Observations) | Wilcoxon Signed-Rank Test | Friedman Test
Ordinal and Non-Normal Quantitative | Between-subjects (Independent Observations) | Mann-Whitney \( U \) Test | Kruskal-Wallis Test
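
A minimal sketch of running these tests with SciPy; the arrays below are invented for illustration, and McNemar's Test is not part of SciPy (statsmodels provides one).

    # Invented error counts from the same eight participants under three techniques.
    import numpy as np
    from scipy import stats

    a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
    b = np.array([10, 13, 12, 11, 12, 13, 11, 12])
    c = np.array([11, 12, 10, 11, 12, 12, 11, 13])

    # Within-subjects, 2 conditions: Wilcoxon Signed-Rank Test
    print(stats.wilcoxon(a, b))

    # Within-subjects, 2+ conditions: Friedman Test
    print(stats.friedmanchisquare(a, b, c))

    # Between-subjects, 2 conditions: Mann-Whitney U Test
    print(stats.mannwhitneyu(a, b))

    # Between-subjects, 2+ conditions: Kruskal-Wallis Test
    print(stats.kruskal(a, b, c))

    # Nominal, between-subjects: Chi-Squared and Fisher's Exact Test on a 2x2 table
    table = np.array([[20, 10],
                      [12, 18]])
    chi2, p, df, expected = stats.chi2_contingency(table)
    print(f"chi^2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
    print(stats.fisher_exact(table))

    # McNemar's Test (paired nominal data) is available in statsmodels:
    # statsmodels.stats.contingency_tables.mcnemar(table)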

How to Report: Examples

Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.

  1. A [name of the test] identified a significant difference between the [independent variables].
  2. There was a significant effect of [independent variable] on [dependent variable].
  3. A [name of the test] identified a significant effect of [independent variable] on [dependent variable].
  4. A [name of the test] on the data identified significance with respect to the [dependent variable].

Statistical Procedure | Example
McNemar's Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \)
Chi-Squared Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \)
Fisher's Exact Test | \( p \lt .001 \)
Wilcoxon Signed-Rank Test | \( z = -3.06, \textit{df} = 2, p \lt .005 \)
Friedman Test | \( \chi^2 = 20.67, \textit{df} = 2, p \lt .0005 \)
Mann-Whitney \( U \) Test | \( U = 75.5, z = -0.523, p \lt .05 \)
Kruskal-Wallis Test | \( H = 4.61, \textit{df} = 1, p \lt .05 \)

Traditionally, \( p \)-values are reported relative to one of the following conventional thresholds: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are noted only when the result is borderline, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point.
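
As an illustrative helper (not from the original note), the convention above can be encoded as a small function that maps a computed \( p \)-value to the reported string:

    # Illustrative helper: map a p-value to the conventional reporting string.
    def report_p(p: float) -> str:
        thresholds = [.0001, .0005, .001, .005, .01, .05]
        for t in thresholds:
            if p < t:
                return f"p < {str(t).lstrip('0')}"
        # Borderline results (e.g., p = .051) are reported exactly.
        return f"p = {f'{p:.3f}'.lstrip('0')}"

    print(report_p(0.00032))  # p < .0005
    print(report_p(0.023))    # p < .05
    print(report_p(0.051))    # p = .051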

Parametric Tests

Parametric tests are usually performed on normally distributed quantitative data.

Variable | Study Design | Independent Variable: 2 Conditions | Independent Variable: 2+ Conditions
Normal Quantitative | Within-subjects (Correlated Observations) | Paired Samples T-test | Repeated-measures ANOVA or Mixed-design ANOVA
Normal Quantitative | Between-subjects (Independent Observations) | Independent Samples T-test | Between-subjects ANOVA
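
A minimal sketch of the two t-tests above with SciPy, using invented words-per-minute (WPM) measurements; ANOVA examples follow the next table.

    # Invented text-entry speeds (WPM) for eight participants with two keyboards.
    import numpy as np
    from scipy import stats

    wpm_a = np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3, 37.8, 40.6])
    wpm_b = np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1, 34.9, 38.2])

    # Within-subjects, 2 conditions: Paired Samples T-test
    t, p = stats.ttest_rel(wpm_a, wpm_b)
    print(f"t = {t:.2f}, df = {len(wpm_a) - 1}, p = {p:.4f}")

    # Between-subjects, 2 conditions: Independent Samples T-test
    t, p = stats.ttest_ind(wpm_a, wpm_b)
    print(f"t = {t:.2f}, df = {len(wpm_a) + len(wpm_b) - 2}, p = {p:.4f}")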

Different Types of Analysis of Variance (ANOVA) & When to Use Them

It is generally not advised to use more than two independent variables in total, as the number of conditions grows quickly and becomes difficult to manage. The following table lists the appropriate parametric test for each combination of within-subjects and between-subjects independent variables.

Rows give the number of within-subjects (correlated observations) independent variables; columns give the number of between-subjects (independent observations) independent variables.

Within-subjects IVs | Between-subjects IVs: 0 | Between-subjects IVs: 1 | Between-subjects IVs: 2 | Between-subjects IVs: 2+
0 | NA | One-way Between-subjects ANOVA | Two-way Between-subjects ANOVA | MANOVA
1 | One-way Repeated-measures ANOVA | Two-way Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
2 | Two-way Repeated-measures ANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
2+ | MANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
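
A minimal sketch, assuming SciPy for a one-way between-subjects ANOVA and statsmodels' AnovaRM for a one-way repeated-measures ANOVA (mixed designs require a more capable package); the data are invented.

    # Invented WPM measurements for six participants in three conditions (A, B, C).
    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    g1 = [38.2, 41.5, 35.9, 44.1, 39.7, 42.3]
    g2 = [35.1, 39.8, 33.2, 41.0, 36.5, 40.1]
    g3 = [36.0, 38.4, 34.1, 39.9, 37.2, 38.8]

    # One-way between-subjects ANOVA: each list is an independent group
    f, p = stats.f_oneway(g1, g2, g3)
    print(f"F = {f:.2f}, p = {p:.4f}")

    # One-way repeated-measures ANOVA: the same six participants in every condition
    long = pd.DataFrame({
        "participant": np.repeat(np.arange(6), 3),
        "condition": np.tile(["A", "B", "C"], 6),
        "wpm": np.column_stack([g1, g2, g3]).ravel(),
    })
    result = AnovaRM(long, depvar="wpm", subject="participant", within=["condition"]).fit()
    print(result.anova_table)  # F value, numerator/denominator df, and p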

How to Report: Examples

Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.

  1. There was a significant effect of [independent variable] on [dependent variable].
  2. A(n) [name of the test] identified a significant effect of [independent variable] on [dependent variable].

Statistical Procedure | Example
Paired or Independent-samples T-test | \( t_{54} = 5.43, p \lt .001 \) or \( t = 5.43, \textit{df} = 54, p \lt .001 \)
All Variants of ANOVA | \( F_{1,11} = 38.65, p \lt .0001 \) or \( F(1,11) = 38.65, p \lt .0001 \)

Traditionally, \( p \)-values are reported relative to one of the following conventional thresholds: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are noted only when the result is borderline, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point. The subscript values represent the \( \textit{df} \).

Effect Size

It is advisable to include the effect size when presenting statistically significant results, such as \( F_{1,11} = 38.6, p < .0001, \eta^2 = 0.4 \). Effect size serves as a quantitative indicator of the magnitude of a particular phenomenon, aimed at answering a specific research question. Within the realm of statistical tests, it gauges the intensity of the association between independent and dependent variables, representing it on a numerical scale (Arif, 2021).

The following outlines prevalent effect size measures for common statistical significance tests, the equations used to compute them, and guidelines for interpreting the resulting values.

Non-parametric procedures

McNemar's Test: Odds Ratio
\( OR = \max \bigl( \frac{A}{B}, \frac{B}{A} \bigr) \)
\( A \) is the probability (count) of the event occurring
\( B \) is the probability (count) of the event not occurring
Interpretation: \( OR \) ranges from \( 0 \) to \( \infty \); \( OR = 1.68 \) small, \( OR = 3.47 \) medium, \( OR = 6.71 \) large (Chen et al., 2010)

Chi-Squared Test: Phi Coefficient
\( \phi = \sqrt{ \frac{ \chi^2 }{n} } \)
\( n \) is the number of observations
Interpretation: \( \phi \) ranges from \( 0 \) to \( 1 \); \( \phi = 0.1 \) small, \( \phi = 0.3 \) medium, \( \phi = 0.5 \) large (Cohen, 1988)

Fisher's Exact Test: Odds Ratio
Computed and interpreted exactly as for McNemar's Test above (Chen et al., 2010)

Wilcoxon Signed-Rank Test: Pearson's Correlation Coefficient
\( r = \frac{n \sum{xy} - (\sum{x}) (\sum{y}) }{\sqrt{[ n \sum{x^2} - (\sum{x})^2 ] [ n \sum{y^2} - (\sum{y})^2 ] } } \)
\( x \) and \( y \) are the variables
\( n \) is the total number of pairs
Alternative: \( r = \frac{\sum{z_x z_y}}{n} \), where \( z_x \) and \( z_y \) are the \( z \) scores
Interpretation: \( r \) ranges from \( -1 \) to \( 1 \); \( r = 0.1 \) small, \( r = 0.3 \) medium, \( r = 0.5 \) large (Cohen, 1988)

Friedman Test: Kendall's Coefficient of Concordance
\( W = \frac{12S}{m^2 (n^3 - n)} \)
\( S \) is the sum of squared deviations
\( m \) is the number of judges (raters)
\( n \) is the total number of objects being ranked
Interpretation: \( W \) ranges from \( 0 \) to \( 1 \); \( W = 0.1 \) small, \( W = 0.3 \) medium, \( W = 0.5 \) large (Cohen, 1988)

Mann-Whitney \( U \) Test: Rank-Biserial Correlation
\( r = 1 - \frac{2U}{n_1 n_2} \)
\( U \) is the smaller of the two \( U \) values
\( n_1 \) is the number of observations in group 1
\( n_2 \) is the number of observations in group 2
Interpretation: \( r \) ranges from \( 0 \) to \( 1 \); \( r = 0.56 \) small, \( r = 0.64 \) medium, \( r = 0.71 \) large (Ruscio, 2008)

Kruskal-Wallis Test: Eta-Squared
\( \eta^2 = \frac{ H - k + 1 }{ n - k } \)
\( H \) is the \( H \) score obtained in the test
\( k \) is the number of groups
\( n \) is the number of observations
Interpretation: \( \eta^2 \) ranges from \( 0.01 \) to \( 1 \); \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b)

Parametric procedures

Paired or Independent-samples T-test: Cohen's \( d \)
\( d = \frac{\bar{x}_1 - \bar{x}_2}{s} \)
\( \bar{x}_1 - \bar{x}_2 \) is the difference between the two means
\( s \) is the pooled standard deviation
Interpretation: \( d \) ranges from \( 0.01 \) to \( \infty \); \( d = 0.2 \) small, \( d = 0.5 \) medium, \( d = 0.8 \) large (Cohen, 1992a)

All Variants of ANOVA: Eta-Squared
\( \eta^2 = \frac{SS_{Treatment}}{SS_{Total}} \)
\( SS_{Treatment} \) is the sum of squares for the effect of interest
\( SS_{Total} \) is the total sum of squares for all effects, interactions, and errors
Interpretation: \( \eta^2 \) ranges from \( 0.01 \) to \( 1 \); \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b)
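
A minimal sketch of computing a few of these measures directly from raw data with NumPy and SciPy; the samples are invented for illustration.

    # Invented measurements for two independent groups.
    import numpy as np
    from scipy import stats

    a = np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3, 37.8, 40.6])
    b = np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1, 34.9, 38.2])
    n1, n2 = len(a), len(b)

    # Cohen's d with the pooled standard deviation
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    print(f"Cohen's d = {(a.mean() - b.mean()) / pooled_sd:.2f}")

    # Eta-squared for a one-way between-subjects ANOVA: SS_treatment / SS_total
    groups = [a, b]
    grand_mean = np.concatenate(groups).mean()
    ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((np.concatenate(groups) - grand_mean) ** 2).sum()
    print(f"eta-squared = {ss_treatment / ss_total:.2f}")

    # Rank-biserial correlation from a Mann-Whitney U test: r = 1 - 2U / (n1 * n2)
    u1 = stats.mannwhitneyu(a, b).statistic
    u = min(u1, n1 * n2 - u1)  # the smaller of the two U values
    print(f"rank-biserial r = {1 - 2 * u / (n1 * n2):.2f}")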

Post-hoc Multiple Comparisons Tests

While a global hypothesis test indicates whether there is an overall difference between groups, a post-hoc test identifies which specific groups (or conditions) differ. Post-hoc analyses are conducted after the experiment has concluded.

It is customary to conduct multiple comparisons tests only if the null hypothesis \( H_0 \) concerning homogeneity is rejected. Notably, Hsu (1996, p. 177) posits that the outcomes of most multiple comparisons tests remain valid even if the global hypothesis test does not identify an overall statistically significant difference in group means. An exception is the Fisher LSD (least significant difference) test, which is infrequently used in contemporary research; it operates under the premise that \( H_0 \) of homogeneity has been rejected. Nonetheless, a post-hoc analysis is unlikely to identify statistical significance when the global test does not find overall significance.

Statistical Procedure | Post-hoc Multiple-Comparison Test
Chi-Squared Test | The Bonferroni Procedure
Friedman Test | Games-Howell Test
Kruskal-Wallis Test | Dunn's Test
ANOVA | Tukey-Kramer Test, Newman-Keuls Method, or the Bonferroni Procedure
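
A minimal sketch of two of these procedures, assuming SciPy 1.8 or later for tukey_hsd; here the Bonferroni procedure is applied to pairwise Mann-Whitney \( U \) tests on invented data.

    # Invented WPM measurements for three independent groups.
    from itertools import combinations
    import numpy as np
    from scipy import stats

    groups = {
        "A": np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3]),
        "B": np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1]),
        "C": np.array([36.0, 38.4, 34.1, 39.9, 37.2, 38.8]),
    }

    # Tukey's HSD after a between-subjects ANOVA (scipy.stats.tukey_hsd, SciPy >= 1.8)
    res = stats.tukey_hsd(*groups.values())
    print(res.pvalue)  # matrix of pairwise p-values in the order A, B, C

    # The Bonferroni procedure: run each pairwise test, then multiply its p-value
    # by the number of comparisons (equivalently, compare against alpha / m).
    pairs = list(combinations(groups, 2))
    for name1, name2 in pairs:
        p = stats.mannwhitneyu(groups[name1], groups[name2]).pvalue
        print(name1, name2, f"adjusted p = {min(1.0, p * len(pairs)):.3f}")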

How to Report: Examples

Model your statements after the examples below.

  1. A post-hoc [name of the test] revealed that the [condition/s] and [condition/s] differed significantly at p < .05.
  2. A post-hoc [name of the test] identified the following significantly different groups: [condition/s], [condition/s], ….
  3. A post-hoc [name of the test] suggested that [condition/s] performed significantly better than [condition/s] in terms of [dependent variable/s].
  4. A post-hoc [name of the test] suggested that [condition/s] was/were significantly different from [condition/s].
