Kendall rank correlation coefficient
Statistic for rank correlation / From Wikipedia, the free encyclopedia
"Tau-a" redirects here. For the astronomical radio source, see Taurus A.
"Tau coefficient" redirects here. Not to be confused with Tau distribution.
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]
Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of observations of the joint random variables X and Y, such that all the values of $(x_i)$ and $(y_i)$ are unique (ties are neglected for simplicity). Any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$, where $i < j$, are said to be concordant if the sort order of $(x_i, x_j)$ and $(y_i, y_j)$ agrees: that is, if either both $x_i > x_j$ and $y_i > y_j$ hold or both $x_i < x_j$ and $y_i < y_j$; otherwise they are said to be discordant.

The Kendall τ coefficient is defined as

$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\binom{n}{2}},$$

where $\binom{n}{2} = \frac{n(n-1)}{2}$ is the binomial coefficient for the number of ways to choose two items from n items.
The number of discordant pairs is equal to the inversion number that permutes the y-sequence into the same order as the x-sequence.
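The definitions above can be checked with a short brute-force sketch (plain Python, hypothetical data): it counts concordant and discordant pairs directly, and also verifies that, once the data are sorted by x, the number of discordant pairs equals the inversion number of the y-sequence.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Brute-force Kendall tau: count concordant and discordant pairs."""
    n = len(xs)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            nc += 1      # concordant: sort orders agree
        elif s < 0:
            nd += 1      # discordant: sort orders disagree
    return (nc - nd) / (n * (n - 1) / 2), nd

def inversions(seq):
    """Number of pairs (i, j), i < j, with seq[i] > seq[j]."""
    return sum(1 for i, j in combinations(range(len(seq)), 2) if seq[i] > seq[j])

xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 5, 4]          # xs already sorted, so ys is the permuted sequence
tau, nd = kendall_tau(xs, ys)
assert nd == inversions(ys)   # discordant pairs = inversion number of the y-sequence
```

Here nd is 3 (the inversions of [3, 1, 2, 5, 4]), so τ = (7 − 3)/10 = 0.4.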
Properties
The denominator is the total number of pair combinations, so the coefficient must be in the range −1≤τ≤1.
If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
If X and Y are independent and not constant, then the expectation of the coefficient is zero.
An explicit expression for Kendall's rank coefficient is $\tau = \frac{2}{n(n-1)} \sum_{i < j} \operatorname{sgn}(x_i - x_j)\operatorname{sgn}(y_i - y_j)$.
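As a sanity check, the sketch below (plain Python, hypothetical tie-free data) compares the explicit sign-function expression against the concordant-minus-discordant definition:

```python
from itertools import combinations

def sgn(v):
    return (v > 0) - (v < 0)

def tau_pairs(xs, ys):
    # Definition: (concordant - discordant) / C(n, 2)
    n = len(xs)
    diff = sum(sgn((xs[i] - xs[j]) * (ys[i] - ys[j]))
               for i, j in combinations(range(n), 2))
    return diff / (n * (n - 1) / 2)

def tau_explicit(xs, ys):
    # Explicit expression: 2/(n(n-1)) * sum over i<j of sgn(x_i - x_j) sgn(y_i - y_j)
    n = len(xs)
    return 2 / (n * (n - 1)) * sum(
        sgn(xs[i] - xs[j]) * sgn(ys[i] - ys[j])
        for i, j in combinations(range(n), 2))

xs = [12, 2, 1, 12.5, 7]
ys = [1, 4, 7, 1.5, 10]
assert abs(tau_pairs(xs, ys) - tau_explicit(xs, ys)) < 1e-12
```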
Hypothesis test
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y or the distribution of (X,Y).
Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance $\frac{2(2n+5)}{9n(n-1)}$.[4]
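For small n the exact null distribution can be obtained by enumeration. The sketch below (plain Python) enumerates all permutations for n = 4 and checks that τ has mean zero and variance 2(2n+5)/(9n(n−1)) under the null:

```python
from itertools import permutations, combinations

def tau_of_perm(p):
    # With x sorted, discordant pairs are exactly the inversions of the y-permutation,
    # and tau = 1 - 4*nd / (n(n-1)).
    n = len(p)
    nd = sum(1 for i, j in combinations(range(n), 2) if p[i] > p[j])
    return 1 - 4 * nd / (n * (n - 1))

n = 4
taus = [tau_of_perm(p) for p in permutations(range(n))]
mean = sum(taus) / len(taus)
var = sum(t * t for t in taus) / len(taus)   # mean is 0, so this is the variance

assert abs(mean) < 1e-12
assert abs(var - 2 * (2 * n + 5) / (9 * n * (n - 1))) < 1e-12
```

For n = 4 the variance is 13/54 ≈ 0.2407, matching the formula.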
The following proof is from Valz & McLeod (1990;[5] 1995[6]).
Proof
WLOG, we reorder the data pairs so that $x_1 < x_2 < \cdots < x_n$. By the assumption of independence, the order of $y_1, \ldots, y_n$ is a permutation sampled uniformly at random from $S_n$, the permutation group on $\{1, \ldots, n\}$.

For each permutation, its unique inversion code is $l_0 l_1 \cdots l_{n-1}$, where $l_i$ is the number of entries to the left of position $i$ that exceed the entry at position $i$, so that each $l_i$ is in the range $0 \le l_i \le i$. Sampling a permutation uniformly is equivalent to sampling an $l$-inversion code uniformly, which is equivalent to sampling each $l_i$ uniformly and independently.
Then, since the number of discordant pairs equals the total number of inversions $\sum_i l_i$, we have
$$\tau = 1 - \frac{4\sum_i l_i}{n(n-1)}, \qquad \operatorname{E}[\tau] = 1 - \frac{4\operatorname{E}\left[\sum_i l_i\right]}{n(n-1)}, \qquad \operatorname{V}[\tau] = \frac{16}{n^2(n-1)^2}\operatorname{V}\left[\sum_i l_i\right].$$

Since each $l_i$ is a uniform random variable on $\{0, 1, \ldots, i\}$, we have $\operatorname{E}[l_i] = \frac{i}{2}$ and $\operatorname{V}[l_i] = \frac{(i+1)^2 - 1}{12} = \frac{i(i+2)}{12}$. The expectation is $\operatorname{E}\left[\sum_i l_i\right] = \frac{1}{2}\sum_{i=0}^{n-1} i = \frac{n(n-1)}{4}$, giving $\operatorname{E}[\tau] = 0$. By independence of the $l_i$, and using the sum of squares formula,
$$\operatorname{V}\left[\sum_i l_i\right] = \sum_{i=0}^{n-1} \frac{i(i+2)}{12} = \frac{1}{12}\left(\frac{(n-1)n(2n-1)}{6} + n(n-1)\right) = \frac{n(n-1)(2n+5)}{72},$$
so that $\operatorname{V}[\tau] = \frac{2(2n+5)}{9n(n-1)}$.
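The inversion-code step can be verified directly for a small n. The sketch below (plain Python) computes the code $l_0 l_1 \cdots l_{n-1}$ of every permutation of $\{0, \ldots, 3\}$ and checks that the map is a bijection onto the product of ranges $[0, i]$ and that the total inversion count equals $\sum_i l_i$:

```python
from itertools import permutations, combinations, product

def inversion_code(p):
    """l_i = number of entries left of position i that exceed p[i]; 0 <= l_i <= i."""
    return tuple(sum(1 for j in range(i) if p[j] > p[i]) for i in range(len(p)))

n = 4
codes = {}
for p in permutations(range(n)):
    c = inversion_code(p)
    assert all(0 <= c[i] <= i for i in range(n))       # code lies in the allowed ranges
    inv = sum(1 for i, j in combinations(range(n), 2) if p[i] > p[j])
    assert sum(c) == inv                               # total inversions = sum of the code
    codes[c] = p

# permutation -> code is a bijection onto the full product of ranges [0, i]
assert len(codes) == 24
assert set(codes) == set(product(*[range(i + 1) for i in range(n)]))
```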
Asymptotic normality—In the limit $n \to \infty$, $z = \frac{\tau}{\sqrt{\operatorname{V}[\tau]}} = \frac{3\tau\sqrt{n(n-1)}}{\sqrt{2(2n+5)}}$ converges in distribution to the standard normal distribution.
Proof
This follows from a result of Hoeffding (1948) on the asymptotic normality of U-statistics ("A class of statistics with asymptotically normal distribution").[7]
Case of standard normal distributions
If $(x_1, y_1), (x_2, y_2), \ldots$ are IID samples from the same jointly normal distribution with a known Pearson correlation coefficient $\rho$, then the expectation of the Kendall rank correlation has a closed-form formula.[8]
Greiner's equality—If $(X, Y)$ are jointly normal, with correlation $\rho$, then

$$\operatorname{E}[\tau] = \frac{2}{\pi} \arcsin(\rho).$$
In this notation, we see that the number of concordant pairs, $n_c$, is equal to the number of difference vectors $(x_i - x_j, y_i - y_j)$, for $i < j$, that fall in the subset $\{(x, y) : xy > 0\}$ of $\mathbb{R}^2$. That is,
$$n_c = \sum_{i<j} \mathbf{1}\left[(x_i - x_j)(y_i - y_j) > 0\right].$$

Thus,
$$\operatorname{E}[\tau] = \frac{2\operatorname{E}[n_c]}{\binom{n}{2}} - 1 = \frac{2}{\binom{n}{2}} \sum_{i<j} \Pr\left((x_i - x_j)(y_i - y_j) > 0\right) - 1.$$

Since each $(x_i, y_i)$ is an IID sample of the jointly normal distribution, the pairing does not matter, so each term in the summation is exactly the same, and so
$$\operatorname{E}[\tau] = 2\Pr\left((x_1 - x_2)(y_1 - y_2) > 0\right) - 1,$$
and it remains to calculate the probability. We do this by repeated affine transformations.

First normalize $X$ and $Y$ by subtracting the mean and dividing by the standard deviation. Since this preserves the sign of every difference, it does not change $\Pr\left((x_1 - x_2)(y_1 - y_2) > 0\right)$. This gives us
$$(x_i, y_i) = \left(z_{i,1},\; \rho z_{i,1} + \sqrt{1 - \rho^2}\, z_{i,2}\right),$$
where $(z_{i,1}, z_{i,2})$ is sampled from the standard normal distribution on $\mathbb{R}^2$.

Thus,
$$(x_1 - x_2,\; y_1 - y_2) = \sqrt{2}\left(w_1,\; \rho w_1 + \sqrt{1 - \rho^2}\, w_2\right),$$
where the vector $(w_1, w_2)$ is still distributed as the standard normal distribution on $\mathbb{R}^2$. It remains to perform some unenlightening but routine linear algebra and trigonometry, which can be skipped over.

Thus, $(x_1 - x_2)(y_1 - y_2) > 0$ iff
$$(w_1, w_2) \in \left\{ (w_1, w_2) : w_1\left(\rho w_1 + \sqrt{1 - \rho^2}\, w_2\right) > 0 \right\},$$
where the subset on the right is a "squashed" version of two quadrants, namely the preimage of $\{(x, y) : xy > 0\}$ under the linear map above. Since the standard normal distribution is rotationally symmetric, we need only calculate the angle spanned by each squashed quadrant.

The first quadrant, $\{(x, y) : x > 0, y > 0\}$, is the sector bounded by the two rays $\{(x, 0) : x > 0\}$ and $\{(0, y) : y > 0\}$. Its preimage is the sector bounded by the ray in the direction $\left(\sqrt{1 - \rho^2}, -\rho\right)$ and the positive $w_2$-axis. The first ray makes an angle $\arcsin \rho$ below the horizontal axis, and the second lies along the vertical axis, so the sector spans an angle of $\frac{\pi}{2} + \arcsin \rho$.

Together, the two transformed quadrants (the third quadrant $\{(x, y) : x < 0, y < 0\}$ maps to the antipodal sector) span an angle of $\pi + 2\arcsin \rho$, so
$$\Pr\left((x_1 - x_2)(y_1 - y_2) > 0\right) = \frac{\pi + 2\arcsin \rho}{2\pi} = \frac{1}{2} + \frac{\arcsin \rho}{\pi},$$
and therefore
$$\operatorname{E}[\tau] = 2\left(\frac{1}{2} + \frac{\arcsin \rho}{\pi}\right) - 1 = \frac{2}{\pi} \arcsin \rho.$$
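Greiner's equality can be spot-checked numerically. The sketch below (plain Python, fixed seed, loose Monte Carlo tolerance) draws jointly normal pairs with ρ = 0.5, for which E[τ] = (2/π)arcsin(0.5) = 1/3, and compares the sample Kendall τ against this value:

```python
import math
import random
from itertools import combinations

random.seed(0)
rho = 0.5
n = 400

pts = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    # (x, y) jointly standard normal with correlation rho
    pts.append((z1, rho * z1 + math.sqrt(1 - rho * rho) * z2))

nc = nd = 0
for (x1, y1), (x2, y2) in combinations(pts, 2):
    s = (x1 - x2) * (y1 - y2)
    nc += s > 0
    nd += s < 0
tau = (nc - nd) / (n * (n - 1) / 2)

expected = 2 / math.pi * math.asin(rho)   # Greiner's equality: 1/3 for rho = 0.5
assert abs(tau - expected) < 0.1          # loose tolerance for a 400-point sample
```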