
A Brief Note on Selecting and Reporting the Right Statistical Test


Selecting the appropriate statistical technique for hypothesis testing hinges on both the study design and the nature of the data. This cheat sheet encapsulates the most prevalent pairings. If your intended combination is not covered, consulting an expert is advisable. While there are myriad techniques for each pairing, this guide focuses on those most often employed in Human-Computer Interaction (HCI) research. Their inclusion is based on their simplicity and/or their reliability, as they tend to provide the highest power for a specified significance level. While I am not a trained statistician, I frequently incorporate statistics into my research. I welcome any feedback or comments; please don't hesitate to reach out via email.

If the information on this page proves helpful, or if you wish to reinforce to reviewers that you have selected the appropriate statistical tests, kindly consider referencing this page using the citation provided below.

Ahmed Sabbir Arif. 2017. A Brief Note on Selecting and Reporting the Right Statistical Test. University of California, Merced, United States. https://www.theiilab.com/notes/HypothesisTesting.html


Appropriate Statistical Tests

Data Type | Normality Assumption | Sphericity Assumption | Statistical Test
Nominal Data | N/A | N/A | Non-parametric Test
Ordinal Data | N/A | N/A | Non-parametric Test
Interval Data | Yes | Yes | Parametric Test
Interval Data | Yes | No | Non-parametric Test
Interval Data | No | N/A | Non-parametric Test
Ratio Data | Yes | Yes | Parametric Test
Ratio Data | Yes | No | Non-parametric Test
Ratio Data | No | N/A | Non-parametric Test
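
The table can be read as a simple decision rule. The following sketch encodes it in Python; the function name and labels are illustrative, not part of the original note.

    # Illustrative encoding of the table above; names and labels are hypothetical.
    def test_family(data_type: str, normal: bool = False, spherical: bool = False) -> str:
        """Return the family of statistical test suggested by the table."""
        if data_type in ("nominal", "ordinal"):
            return "non-parametric"
        if data_type in ("interval", "ratio"):
            # Parametric tests require both normality and sphericity to hold.
            return "parametric" if (normal and spherical) else "non-parametric"
        raise ValueError(f"unknown data type: {data_type}")

    print(test_family("interval", normal=True, spherical=True))  # parametric
    print(test_family("ordinal"))                                # non-parametric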

Normality Tests

A normality test evaluates the null hypothesis, denoted \( H_0 \), that a data sample originates from a normally distributed population. Four widely used tests are listed below; a short SciPy sketch follows the list.

  1. Shapiro-Wilk test
  2. Anderson-Darling test
  3. D'Agostino's K-squared test
  4. Martinez-Iglewicz test
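
As a minimal sketch, assuming SciPy is installed, the first three tests can be run as follows; the Martinez-Iglewicz test is not available in SciPy, and the sample below is invented.

    # Normality checks on an invented sample of 30 task-completion times (ms).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=500, scale=50, size=30)

    w, p_shapiro = stats.shapiro(sample)            # Shapiro-Wilk
    k2, p_dagostino = stats.normaltest(sample)      # D'Agostino's K-squared
    anderson = stats.anderson(sample, dist="norm")  # Anderson-Darling

    print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_shapiro:.3f}")
    print(f"D'Agostino: K^2 = {k2:.3f}, p = {p_dagostino:.3f}")
    print(f"Anderson-Darling: A^2 = {anderson.statistic:.3f}, "
          f"5% critical value = {anderson.critical_values[2]:.3f}")

A \( p \)-value above .05 (or, for the Anderson-Darling test, a statistic below the critical value) means the test fails to reject \( H_0 \); that is, there is no evidence that the sample departs from normality.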

How to Report: Examples

  1. The Shapiro-Wilk test indicated that the sample likely originates from a normally distributed population.
  2. Using the Martinez-Iglewicz test, it was determined that the residuals of the response variable follow a normal distribution.

Non-Parametric Tests

Non-parametric tests are typically applied to data from nominal categories, questionnaires, and rating scales.

Variable | Study Design | Independent Variable: 2 Conditions | Independent Variable: 2+ Conditions
Nominal | Within-subjects (Correlated Observations) | McNemar's Test | Chi-Squared (\( \chi^2 \)) Test
Nominal | Between-subjects (Independent Observations) | Chi-Squared (\( \chi^2 \)) Test or Fisher's Exact Test | Chi-Squared (\( \chi^2 \)) Test
Ordinal and Non-Normal Quantitative | Within-subjects (Correlated Observations) | Wilcoxon Signed-Rank Test | Friedman Test
Ordinal and Non-Normal Quantitative | Between-subjects (Independent Observations) | Mann-Whitney \( U \) Test | Kruskal-Wallis Test
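
A minimal sketch of running these tests with SciPy; the arrays below are invented for illustration, and McNemar's Test is not part of SciPy (statsmodels provides one).

    # Invented error counts from the same eight participants under three techniques.
    import numpy as np
    from scipy import stats

    a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
    b = np.array([10, 13, 12, 11, 12, 13, 11, 12])
    c = np.array([11, 12, 10, 11, 12, 12, 11, 13])

    # Within-subjects, 2 conditions: Wilcoxon Signed-Rank Test
    print(stats.wilcoxon(a, b))

    # Within-subjects, 2+ conditions: Friedman Test
    print(stats.friedmanchisquare(a, b, c))

    # Between-subjects, 2 conditions: Mann-Whitney U Test
    print(stats.mannwhitneyu(a, b))

    # Between-subjects, 2+ conditions: Kruskal-Wallis Test
    print(stats.kruskal(a, b, c))

    # Nominal, between-subjects: Chi-Squared and Fisher's Exact Test on a 2x2 table
    table = np.array([[20, 10],
                      [12, 18]])
    chi2, p, df, expected = stats.chi2_contingency(table)
    print(f"chi^2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
    print(stats.fisher_exact(table))

    # McNemar's Test (paired nominal data) is available in statsmodels:
    # statsmodels.stats.contingency_tables.mcnemar(table)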

How to Report: Examples

Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.

  1. A [name of the test] identified a significant difference between the [independent variables].
  2. There was a significant effect of [independent variable] on [dependent variable].
  3. A [name of the test] identified a significant effect of [independent variable] on [dependent variable].
  4. A [name of the test] on the data identified significance with respect to the [dependent variable].

Statistical Procedure | Example
McNemar's Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \)
Chi-Squared Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \)
Fisher's Exact Test | \( p \lt .001 \)
Wilcoxon Signed-Rank Test | \( z = -3.06, \textit{df} = 2, p \lt .005 \)
Friedman Test | \( \chi^2 = 20.67, \textit{df} = 2, p \lt .0005 \)
Mann-Whitney \( U \) Test | \( U = 75.5, z = -0.523, p \lt .05 \)
Kruskal-Wallis Test | \( H = 4.61, \textit{df} = 1, p \lt .05 \)

Traditionally, \( p \)-values are reported relative to one of the following conventional thresholds: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are noted only when the result is borderline, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point.
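
As an illustrative helper (not from the original note), the convention above can be encoded as a small function that maps a computed \( p \)-value to the reported string:

    # Illustrative helper: map a p-value to the conventional reporting string.
    def report_p(p: float) -> str:
        thresholds = [.0001, .0005, .001, .005, .01, .05]
        for t in thresholds:
            if p < t:
                return f"p < {str(t).lstrip('0')}"
        # Borderline results (e.g., p = .051) are reported exactly.
        return f"p = {f'{p:.3f}'.lstrip('0')}"

    print(report_p(0.00032))  # p < .0005
    print(report_p(0.023))    # p < .05
    print(report_p(0.051))    # p = .051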

Parametric Tests

Parametric tests are usually performed on normally distributed quantitative data.

Variable | Study Design | Independent Variable: 2 Conditions | Independent Variable: 2+ Conditions
Normal Quantitative | Within-subjects (Correlated Observations) | Paired Samples T-test | Repeated-measures ANOVA or Mixed-design ANOVA
Normal Quantitative | Between-subjects (Independent Observations) | Independent Samples T-test | Between-subjects ANOVA
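
A minimal sketch of the two t-tests above with SciPy, using invented words-per-minute (WPM) measurements; ANOVA examples follow the next table.

    # Invented text-entry speeds (WPM) for eight participants with two keyboards.
    import numpy as np
    from scipy import stats

    wpm_a = np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3, 37.8, 40.6])
    wpm_b = np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1, 34.9, 38.2])

    # Within-subjects, 2 conditions: Paired Samples T-test
    t, p = stats.ttest_rel(wpm_a, wpm_b)
    print(f"t = {t:.2f}, df = {len(wpm_a) - 1}, p = {p:.4f}")

    # Between-subjects, 2 conditions: Independent Samples T-test
    t, p = stats.ttest_ind(wpm_a, wpm_b)
    print(f"t = {t:.2f}, df = {len(wpm_a) + len(wpm_b) - 2}, p = {p:.4f}")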

Different Types of Analysis of Variance (ANOVA) & When to Use Them

It is generally not advised to use more than two independent variables in total, as the number of conditions grows quickly and becomes difficult to manage. The following table lists the appropriate parametric test for each combination of within-subjects and between-subjects independent variables.

Rows give the number of within-subjects (correlated observations) independent variables; columns give the number of between-subjects (independent observations) independent variables.

Within-subjects IVs | Between-subjects IVs: 0 | Between-subjects IVs: 1 | Between-subjects IVs: 2 | Between-subjects IVs: 2+
0 | NA | One-way Between-subjects ANOVA | Two-way Between-subjects ANOVA | MANOVA
1 | One-way Repeated-measures ANOVA | Two-way Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
2 | Two-way Repeated-measures ANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
2+ | MANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA
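
A minimal sketch, assuming SciPy for a one-way between-subjects ANOVA and statsmodels' AnovaRM for a one-way repeated-measures ANOVA (mixed designs require a more capable package); the data are invented.

    # Invented WPM measurements for six participants in three conditions (A, B, C).
    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    g1 = [38.2, 41.5, 35.9, 44.1, 39.7, 42.3]
    g2 = [35.1, 39.8, 33.2, 41.0, 36.5, 40.1]
    g3 = [36.0, 38.4, 34.1, 39.9, 37.2, 38.8]

    # One-way between-subjects ANOVA: each list is an independent group
    f, p = stats.f_oneway(g1, g2, g3)
    print(f"F = {f:.2f}, p = {p:.4f}")

    # One-way repeated-measures ANOVA: the same six participants in every condition
    long = pd.DataFrame({
        "participant": np.repeat(np.arange(6), 3),
        "condition": np.tile(["A", "B", "C"], 6),
        "wpm": np.column_stack([g1, g2, g3]).ravel(),
    })
    result = AnovaRM(long, depvar="wpm", subject="participant", within=["condition"]).fit()
    print(result.anova_table)  # F value, numerator/denominator df, and p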

How to Report: Examples

Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.

  1. There was a significant effect of [independent variable] on [dependent variable].
  2. A(n) [name of the test] identified a significant effect of [independent variable] on [dependent variable].

Statistical Procedure | Example
Paired or Independent-samples T-test | \( t_{54} = 5.43, p \lt .001 \) or \( t = 5.43, \textit{df} = 54, p \lt .001 \)
All Variants of ANOVA | \( F_{1,11} = 38.65, p \lt .0001 \) or \( F(1,11) = 38.65, p \lt .0001 \)

Traditionally, \( p \)-values are reported relative to one of the following conventional thresholds: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are noted only when the result is borderline, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point. The subscript values represent the \( \textit{df} \).

Effect Size

It is advisable to include the effect size when presenting statistically significant results, such as \( F_{1,11} = 38.6, p < .0001, \eta^2 = 0.4 \). Effect size serves as a quantitative indicator of the magnitude of a particular phenomenon, aimed at answering a specific research question. Within the realm of statistical tests, it gauges the intensity of the association between independent and dependent variables, representing it on a numerical scale (Arif, 2021).

The following outlines prevalent effect size measures for common statistical significance tests, the equations used to compute them, and guidelines for interpreting the resulting values.

Non-parametric procedures

McNemar's Test: Odds Ratio
\( OR = \max \bigl( \frac{A}{B}, \frac{B}{A} \bigr) \)
\( A \) is the probability (count) of the event occurring
\( B \) is the probability (count) of the event not occurring
Interpretation: \( OR \) ranges from \( 0 \) to \( \infty \); \( OR = 1.68 \) small, \( OR = 3.47 \) medium, \( OR = 6.71 \) large (Chen et al., 2010)

Chi-Squared Test: Phi Coefficient
\( \phi = \sqrt{ \frac{ \chi^2 }{n} } \)
\( n \) is the number of observations
Interpretation: \( \phi \) ranges from \( 0 \) to \( 1 \); \( \phi = 0.1 \) small, \( \phi = 0.3 \) medium, \( \phi = 0.5 \) large (Cohen, 1988)

Fisher's Exact Test: Odds Ratio
Computed and interpreted exactly as for McNemar's Test above (Chen et al., 2010)

Wilcoxon Signed-Rank Test: Pearson's Correlation Coefficient
\( r = \frac{n \sum{xy} - (\sum{x}) (\sum{y}) }{\sqrt{[ n \sum{x^2} - (\sum{x})^2 ] [ n \sum{y^2} - (\sum{y})^2 ] } } \)
\( x \) and \( y \) are the variables
\( n \) is the total number of pairs
Alternative: \( r = \frac{\sum{z_x z_y}}{n} \), where \( z_x \) and \( z_y \) are the \( z \) scores
Interpretation: \( r \) ranges from \( -1 \) to \( 1 \); \( r = 0.1 \) small, \( r = 0.3 \) medium, \( r = 0.5 \) large (Cohen, 1988)

Friedman Test: Kendall's Coefficient of Concordance
\( W = \frac{12S}{m^2 (n^3 - n)} \)
\( S \) is the sum of squared deviations
\( m \) is the number of judges (raters)
\( n \) is the total number of objects being ranked
Interpretation: \( W \) ranges from \( 0 \) to \( 1 \); \( W = 0.1 \) small, \( W = 0.3 \) medium, \( W = 0.5 \) large (Cohen, 1988)

Mann-Whitney \( U \) Test: Rank-Biserial Correlation
\( r = 1 - \frac{2U}{n_1 n_2} \)
\( U \) is the smaller of the two \( U \) values
\( n_1 \) is the number of observations in group 1
\( n_2 \) is the number of observations in group 2
Interpretation: \( r \) ranges from \( 0 \) to \( 1 \); \( r = 0.56 \) small, \( r = 0.64 \) medium, \( r = 0.71 \) large (Ruscio, 2008)

Kruskal-Wallis Test: Eta-Squared
\( \eta^2 = \frac{ H - k + 1 }{ n - k } \)
\( H \) is the \( H \) score obtained in the test
\( k \) is the number of groups
\( n \) is the number of observations
Interpretation: \( \eta^2 \) ranges from \( 0.01 \) to \( 1 \); \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b)

Parametric procedures

Paired or Independent-samples T-test: Cohen's \( d \)
\( d = \frac{\bar{x}_1 - \bar{x}_2}{s} \)
\( \bar{x}_1 - \bar{x}_2 \) is the difference between the two means
\( s \) is the pooled standard deviation
Interpretation: \( d \) ranges from \( 0.01 \) to \( \infty \); \( d = 0.2 \) small, \( d = 0.5 \) medium, \( d = 0.8 \) large (Cohen, 1992a)

All Variants of ANOVA: Eta-Squared
\( \eta^2 = \frac{SS_{Treatment}}{SS_{Total}} \)
\( SS_{Treatment} \) is the sum of squares for the effect of interest
\( SS_{Total} \) is the total sum of squares for all effects, interactions, and errors
Interpretation: \( \eta^2 \) ranges from \( 0.01 \) to \( 1 \); \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b)
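
A minimal sketch of computing a few of these measures directly from raw data with NumPy and SciPy; the samples are invented for illustration.

    # Invented measurements for two independent groups.
    import numpy as np
    from scipy import stats

    a = np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3, 37.8, 40.6])
    b = np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1, 34.9, 38.2])
    n1, n2 = len(a), len(b)

    # Cohen's d with the pooled standard deviation
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    print(f"Cohen's d = {(a.mean() - b.mean()) / pooled_sd:.2f}")

    # Eta-squared for a one-way between-subjects ANOVA: SS_treatment / SS_total
    groups = [a, b]
    grand_mean = np.concatenate(groups).mean()
    ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((np.concatenate(groups) - grand_mean) ** 2).sum()
    print(f"eta-squared = {ss_treatment / ss_total:.2f}")

    # Rank-biserial correlation from a Mann-Whitney U test: r = 1 - 2U / (n1 * n2)
    u1 = stats.mannwhitneyu(a, b).statistic
    u = min(u1, n1 * n2 - u1)  # the smaller of the two U values
    print(f"rank-biserial r = {1 - 2 * u / (n1 * n2):.2f}")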

Post-hoc Multiple Comparisons Tests

While a global hypothesis test indicates whether there is an overall difference between groups, a post-hoc test identifies which specific groups (or conditions) differ. Post-hoc analyses are conducted after the experiment has concluded.

It is customary to conduct multiple comparisons tests only if the null hypothesis \( H_0 \) concerning homogeneity is rejected. Notably, Hsu (1996, p. 177) posits that the outcomes of most multiple comparisons tests remain valid even if the global hypothesis test does not identify an overall statistically significant difference in group means. An exception is the Fisher LSD (least significant difference) test, which is infrequently used in contemporary research; it operates under the premise that \( H_0 \) of homogeneity has been rejected. Nonetheless, a post-hoc analysis is unlikely to identify statistical significance when the global test does not find overall significance.

Statistical Procedure | Post-hoc Multiple-Comparison Test
Chi-Squared Test | The Bonferroni Procedure
Friedman Test | Games-Howell Test
Kruskal-Wallis Test | Dunn's Test
ANOVA | Tukey-Kramer Test, Newman-Keuls Method, or the Bonferroni Procedure
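
A minimal sketch of two of these procedures, assuming SciPy 1.8 or later for tukey_hsd; here the Bonferroni procedure is applied to pairwise Mann-Whitney \( U \) tests on invented data.

    # Invented WPM measurements for three independent groups.
    from itertools import combinations
    import numpy as np
    from scipy import stats

    groups = {
        "A": np.array([38.2, 41.5, 35.9, 44.1, 39.7, 42.3]),
        "B": np.array([35.1, 39.8, 33.2, 41.0, 36.5, 40.1]),
        "C": np.array([36.0, 38.4, 34.1, 39.9, 37.2, 38.8]),
    }

    # Tukey's HSD after a between-subjects ANOVA (scipy.stats.tukey_hsd, SciPy >= 1.8)
    res = stats.tukey_hsd(*groups.values())
    print(res.pvalue)  # matrix of pairwise p-values in the order A, B, C

    # The Bonferroni procedure: run each pairwise test, then multiply its p-value
    # by the number of comparisons (equivalently, compare against alpha / m).
    pairs = list(combinations(groups, 2))
    for name1, name2 in pairs:
        p = stats.mannwhitneyu(groups[name1], groups[name2]).pvalue
        print(name1, name2, f"adjusted p = {min(1.0, p * len(pairs)):.3f}")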

How to Report: Examples

Model your statements after the examples below.

  1. A post-hoc [name of the test] revealed that the [condition/s] and [condition/s] differed significantly at p < .05.
  2. A post-hoc [name of the test] identified the following significantly different groups: [condition/s], [condition/s], ….
  3. A post-hoc [name of the test] suggested that [condition/s] performed significantly better than [condition/s] in terms of [dependent variable/s].
  4. A post-hoc [name of the test] suggested that [condition/s] was/were significantly different from [condition/s].
