Kruskal–Wallis
The Kruskal–Wallis test by ranks (also known as the Kruskal–Wallis H test or one-way ANOVA on ranks) is a non-parametric method for testing whether samples originate from the same distribution, and is useful especially when the normality assumptions of ANOVA may not hold. Although the test formally compares entire populations, it is designed to be sensitive to unequal means.
Let $ n_{i} (i = 1, 2, \ldots, k) $ denote the sample size of each of the $k$ groups (i.e., samples) in the data, and let $ N = \sum_{i} n_{i} $ be the combined sample size. Rank the combined sample, then compute $ R_{i} $, the sum of the ranks for group $i$. The Kruskal–Wallis statistic is then:
$$ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_{i}^{2}}{n_{i}} - 3(N+1) $$
Under the null hypothesis of identical populations, this statistic approximately follows a chi-squared distribution with $k-1$ degrees of freedom. Each of the $ n_{i} $ should be at least 5 for the approximation to be valid.
Finally, the p-value is approximated by $ Pr(\chi _{k-1}^{2}\geq H) $.
If some $ n_{i} $ values are small (less than 5), the probability distribution of H can differ substantially from this chi-squared distribution. Given a table of the chi-squared distribution, the critical value, $ \chi _{\alpha ;k-1}^{2} $, can be found by entering the table at $k-1$ degrees of freedom and looking under the desired significance (alpha) level.
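The formula above can be verified directly in R on a small toy example (the three groups below are invented values for illustration only; with five observations per group and no ties the manual statistic matches base R's `kruskal.test`):

```r
# toy data: three hypothetical groups (values invented for illustration)
g1 <- c(6.4, 6.8, 7.2, 8.3, 8.4)
g2 <- c(2.5, 3.7, 4.9, 5.4, 5.9)
g3 <- c(1.3, 4.1, 4.5, 5.1, 7.7)
x   <- c(g1, g2, g3)
grp <- rep(1:3, each = 5)

N  <- length(x)
r  <- rank(x)                       # rank the combined sample
Ri <- tapply(r, grp, sum)           # R_i: rank sum per group
ni <- tapply(r, grp, length)        # n_i: group sizes
H  <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)
p  <- pchisq(H, df = 3 - 1, lower.tail = FALSE)  # chi-squared approximation

# with no tied values, this agrees with base R's built-in test
kruskal.test(list(g1, g2, g3))
```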
If the statistic is not significant, there is no evidence of stochastic dominance between the samples. If the test is significant, however, at least one sample stochastically dominates another. A researcher may then run contrasts between individual sample pairs, or post hoc comparisons using Dunn's test, which (1) employs the same rankings as the Kruskal–Wallis test, and (2) employs the pooled variance implied by the Kruskal–Wallis null hypothesis to determine which sample pairs differ significantly.
Example
The following example considers atmospheric CO2 concentrations [ppm] in a given city. Three weather stations provide values over a semester and are compared with one another. Then, the main station's values are compared with those of a secondary device that stores data at a different resolution.
The comparison asks whether the stations provide similar information or differ significantly (i.e., whether the p-value derived from H falls below the chosen significance level).
# setup libraries
library(ggplot2)
library(reshape2)
library(ggthemes)
library(stargazer)
The data from the three stations is loaded. The date column, while useful to verify the source of the information, is not relevant to the analysis and is dropped.
StationsCity <- read.csv("myData/StationCity.csv", row.names=1)
dplyr::tbl_df(StationsCity)
StationsCity$Date <- NULL
| | Date | Station.GBMcity | Station.GBMnorth | Station.GBMsouth |
|---|---|---|---|---|
| 1 | 01/01/2013 | 377.04 | 375.71 | 338.48 |
| 2 | 02/01/2013 | 375.52 | 374.66 | 312.75 |
| 3 | 03/01/2013 | 380.69 | 367.69 | 327.79 |
| 4 | 04/01/2013 | 379.21 | 370.54 | 299.91 |
| 5 | 05/01/2013 | 377.12 | 372.72 | 307.83 |
| 6 | 06/01/2013 | 378.38 | 367.80 | 330.19 |
| 7 | 07/01/2013 | 379.44 | 355.17 | 316.36 |
| 8 | 08/01/2013 | 376.07 | 373.47 | 307.41 |
| 9 | 09/01/2013 | 377.81 | 371.22 | 331.21 |
StationsCity <- melt(StationsCity)
StationsCity$case <- seq(1:(dim(StationsCity)[1]/3))
colnames(StationsCity) <- c('station','value','case')
pStation <- ggplot(data=StationsCity, aes(x=case, y=value, colour=station)) +
geom_point() + geom_line(alpha=0.25) +
geom_hline(yintercept = median(StationsCity$value), linetype='dashed')
pStation + theme_wsj() + scale_colour_wsj("colors6", "")
pStationb <- ggplot(data=StationsCity, aes(value, colour=station)) +
geom_density() +
geom_vline(xintercept = median(StationsCity$value), linetype='dashed')
pStationb + theme_wsj() + scale_colour_wsj("colors6", "")
testStations <- kruskal.test(value ~ station, data = StationsCity)
testStations
Kruskal-Wallis rank sum test
data: value by station
Kruskal-Wallis chi-squared = 359.69, df = 2, p-value < 2.2e-16
StationsCity <- read.csv("myData/StationCity.csv", row.names=1)
StationOld <- read.csv("myData/StationCityOld.csv", row.names=1)
dplyr::tbl_df(StationOld)
| | value | station | week |
|---|---|---|---|
| 1 | 366.86 | Station.GBMcity-PG | 07/01/2013 |
| 2 | 371.20 | Station.GBMcity-PG | 14/01/2013 |
| 3 | 375.63 | Station.GBMcity-PG | 21/01/2013 |
| 4 | 377.99 | Station.GBMcity-PG | 28/01/2013 |
| 5 | 373.34 | Station.GBMcity-PG | 04/02/2013 |
| 6 | 368.65 | Station.GBMcity-PG | 11/02/2013 |
| 7 | 360.67 | Station.GBMcity-PG | 18/02/2013 |
| 8 | 368.29 | Station.GBMcity-PG | 25/02/2013 |
| 9 | 369.09 | Station.GBMcity-PG | 04/03/2013 |
StationsCity <- StationsCity[,c('Date','Station.GBMcity')]
StationsCity$station <- 'GBMcity'
StationsCity$case <- rownames(StationsCity)
StationsCity$Date <- NULL
colnames(StationsCity) <- c('value','station','case')
StationOld$week <- NULL
StationOld$case <- rownames(StationOld)
StationOld$station <- 'GBMcity.B'
StationSame <- rbind(StationsCity,StationOld)
StationSame$value <- as.numeric(StationSame$value)
StationSame$station <- as.factor(StationSame$station)
StationSame$case <- as.integer(StationSame$case)
pStation <- ggplot(data=StationSame, aes(x=case, y=value, colour=station)) +
geom_point() + geom_line(alpha=0.25) +
geom_hline(yintercept = median(StationSame$value), linetype='dashed')
pStation + theme_wsj() + scale_colour_wsj("colors6", "")
pStationSec <- ggplot(data=StationSame, aes(value, colour=station)) +
geom_density() +
geom_vline(xintercept = median(StationSame$value), linetype='dashed')
pStationSec + theme_wsj() + scale_colour_wsj("colors6", "")
testStations <- kruskal.test(value ~ station, data = StationSame)
testStations
Kruskal-Wallis rank sum test
data: value by station
Kruskal-Wallis chi-squared = 0.0017473, df = 1, p-value = 0.9667
To sum up…
- Each group contains far more than 5 observations, so the chi-squared approximation for H is valid.
- Test 1 yields a p-value < 2.2e-16.
- Test 2 yields a p-value of 0.9667.
- Therefore, Test 2 maintains the relationship between the two devices, so the null hypothesis H0 is not rejected, whereas Test 1 is highly significant and H0 is rejected; as the plots show, this is driven by one station (Station.GBMsouth) whose values differ markedly from the other two.
Now, how is the test computed?
# re-create the long-format data with the three stations
StationsCity <- read.csv("myData/StationCity.csv", row.names=1)
StationsCity$Date <- NULL
StationsCity <- melt(StationsCity)
StationsCity$case <- seq(1:(dim(StationsCity)[1]/3))
colnames(StationsCity) <- c('station','value','case')
StationsCity
# rank the combined sample and spread the ranks into one column per station
StationsCityRanked <- StationsCity
StationsCityRanked$value <- rank(StationsCity$value)
StationsCityRanked <- dcast(StationsCityRanked, case ~ station)
StationsCityRanked$case <- NULL
StationsCity (long format, first and last rows):

| | station | value | case |
|---|---|---|---|
| 1 | Station.GBMcity | 377.04 | 1 |
| 2 | Station.GBMcity | 375.52 | 2 |
| 3 | Station.GBMcity | 380.69 | 3 |
| 4 | Station.GBMcity | 379.21 | 4 |
| 5 | Station.GBMcity | 377.12 | 5 |
| 6 | Station.GBMcity | 378.38 | 6 |
| … | … | … | … |
| 178 | Station.GBMcity | 375.82 | 178 |
| 179 | Station.GBMcity | 376.83 | 179 |
| 180 | Station.GBMcity | 374.03 | 180 |
| 181 | Station.GBMcity | 366.10 | 181 |
| 182 | Station.GBMcity | 373.90 | 182 |

StationsCityRanked (ranks of the combined sample, wide format):

| | Station.GBMcity | Station.GBMnorth | Station.GBMsouth |
|---|---|---|---|
| 1 | 485 | 465 | 158 |
| 2 | 463 | 444 | 45 |
| 3 | 525 | 312 | 107 |
| 4 | 512 | 376 | 14 |
| 5 | 486 | 409 | 32 |
| 6 | 506 | 315 | 120 |
| … | … | … | … |
| 178 | 467 | 541 | 161 |
| 179 | 482 | 360 | 1 |
| 180 | 431 | 253.5 | 40 |
| 181 | 277 | 309 | 53 |
| 182 | 427 | 499 | 207 |
sumRanks <- colSums(StationsCityRanked)     # R_i: rank sum per station
countGroups <- table(StationsCity$station)  # n_i: observations per station
N <- dim(StationsCity)[1]                   # combined sample size

H <- ( 12 / (N * (N + 1)) ) *
  ( (sumRanks[1]^2) / countGroups[1] +
    (sumRanks[2]^2) / countGroups[2] +
    (sumRanks[3]^2) / countGroups[3] ) -
  3 * (N + 1)
H

df <- 3 - 1                                 # k - 1 degrees of freedom
crit <- qchisq(0.95, df)                    # upper-tail 5% critical value
crit
H > crit                                    # TRUE: H is significant, reject H0
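Note that `kruskal.test` additionally applies a correction for ties, dividing H by $1 - \sum_j (t_j^3 - t_j)/(N^3 - N)$, where $t_j$ is the number of observations sharing the j-th tied value; with few ties (the 253.5 rank above suggests a single tied pair) the difference is negligible. A sketch of that correction, reusing the `H` and `StationsCity` objects from above:

```r
# tie correction (sketch): kruskal.test divides H by this factor
ties <- table(StationsCity$value)          # sizes of groups of tied values
N <- dim(StationsCity)[1]
C <- 1 - sum(ties^3 - ties) / (N^3 - N)    # equals 1 when there are no ties
H / C                                      # tie-corrected statistic
```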
Similarly, with the other dataset:
StationSame
maxValue <- max(StationSame$value)
minValue <- min(StationSame$value)
sizeVal <- dim(StationSame)[1]
StationSameRanked <- StationSame
StationSameRanked$value <- rank(StationSame$value)
StationSameRanked <- dcast(StationSameRanked, case ~ station)
StationSameRanked$case <- NULL
sumRanks <- colSums(StationSameRanked, na.rm = TRUE)  # R_i per device
countGroups <- table(StationSame$station)             # n_i per device
N <- dim(StationSame)[1]                              # combined sample size

H <- ( 12 / (N * (N + 1)) ) *
  ( (sumRanks[1]^2) / countGroups[1] +
    (sumRanks[2]^2) / countGroups[2] ) -
  3 * (N + 1)
H

df <- 2 - 1                                           # k - 1 degrees of freedom
crit <- qchisq(0.95, df)                              # upper-tail 5% critical value
crit
H > crit                                              # FALSE: H0 is not rejected
Finally, the following figure plots the density of the chi-squared distribution for several degrees of freedom. In the examples, H1 (359.69) far exceeds the 5% critical value for df = 3 − 1, so H0 is rejected, whereas H2 (0.0017) falls well below the critical value for df = 2 − 1, so H0 is not rejected.
dfDist <- c(1, 2, 5, 10)
labels <- paste('df =',dfDist)
colpal <- c('tomato','green','cyan','gold')
j = 1
plot(NULL, xlim=c(0,5), ylim=c(0,1), xlab='chi-sq', ylab='density')
for(i in dfDist){
curve(dchisq(x,i), xlim=c(0,10), col=colpal[j], lwd=1, add=TRUE)
j <- j + 1
}
legend("topright", bty='n', text.width=1, labels, lwd=2.5,
lty=c(1, 1, 1, 1), col=colpal)
abline(h=0,lty=3)
abline(v=0,lty=3)