Ball divergence
Nonparametric two-sample test method
Ball Divergence (BD) is a nonparametric two-sample statistic that quantifies the discrepancy between two probability measures $\mu$ and $\nu$ on a metric space $(V, \rho)$.[1] It is defined by integrating the squared difference of the measures over all closed balls in $V$. Let $\bar{B}(u, r) = \{x \in V : \rho(u, x) \le r\}$ be the closed ball of radius $r$ centered at $u$. Equivalently, one may set $r = \rho(u, v)$ and write $\bar{B}(u, v) = \bar{B}(u, \rho(u, v))$. The Ball Divergence is then defined by

$$D(\mu, \nu) = \iint_{V \times V} \bigl[\mu(\bar{B}(u, \rho(u, v))) - \nu(\bar{B}(u, \rho(u, v)))\bigr]^2 \,\bigl[\mu(du)\,\mu(dv) + \nu(du)\,\nu(dv)\bigr].$$

This measure can be seen as an integral of Cramér's distance over all possible pairs of points. By summing squared differences of $\mu$ and $\nu$ over balls of all scales, BD captures both global and local discrepancies between distributions, yielding a robust, scale-sensitive comparison. Moreover, since BD is defined as the integral of a squared measure difference, it is always non-negative, and $D(\mu, \nu) = 0$ if and only if $\mu = \nu$.
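As a simple illustration of the definition, take two point masses $\mu = \delta_a$ and $\nu = \delta_b$ with $a \neq b$. The product measure $\mu \otimes \mu$ concentrates on the single pair $(a, a)$, for which $\bar{B}(a, \rho(a, a)) = \{a\}$ and hence $\mu(\{a\}) - \nu(\{a\}) = 1$; the $\nu \otimes \nu$ part contributes another $1$ by symmetry, so $D(\delta_a, \delta_b) = 2$, the largest value the divergence can attain, while $D(\delta_a, \delta_a) = 0$.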
Testing for equal distributions
A sample version of the Ball Divergence can be constructed as follows. For convenience, decompose the Ball Divergence into two parts,

$$A = \iint_{V \times V} \bigl[\mu(\bar{B}(u, \rho(u, v))) - \nu(\bar{B}(u, \rho(u, v)))\bigr]^2 \,\mu(du)\,\mu(dv)$$

and

$$C = \iint_{V \times V} \bigl[\mu(\bar{B}(u, \rho(u, v))) - \nu(\bar{B}(u, \rho(u, v)))\bigr]^2 \,\nu(du)\,\nu(dv).$$

Thus $D(\mu, \nu) = A + C$.
Let $\delta(x, y, z) = I\bigl(z \in \bar{B}(x, \rho(x, y))\bigr)$ indicate whether the point $z$ lies in the closed ball $\bar{B}(x, \rho(x, y))$. Given two independent samples $\{X_1, \ldots, X_n\}$ from $\mu$ and $\{Y_1, \ldots, Y_m\}$ from $\nu$, define

$$A^X_{ij} = \frac{1}{n} \sum_{u=1}^{n} \delta(X_i, X_j, X_u), \qquad A^Y_{ij} = \frac{1}{m} \sum_{v=1}^{m} \delta(X_i, X_j, Y_v),$$

where $A^X_{ij}$ is the proportion of the sample from the probability measure $\mu$ located in the ball $\bar{B}(X_i, \rho(X_i, X_j))$ and $A^Y_{ij}$ is the proportion of the sample from the probability measure $\nu$ located in the same ball. Similarly,

$$C^X_{kl} = \frac{1}{n} \sum_{u=1}^{n} \delta(Y_k, Y_l, X_u), \qquad C^Y_{kl} = \frac{1}{m} \sum_{v=1}^{m} \delta(Y_k, Y_l, Y_v)$$

are the proportions of the samples from $\mu$ and $\nu$, respectively, located in the ball $\bar{B}(Y_k, \rho(Y_k, Y_l))$. The sample versions of $A$ and $C$ are then

$$A_{n,m} = \frac{1}{n^2} \sum_{i,j=1}^{n} \bigl(A^X_{ij} - A^Y_{ij}\bigr)^2, \qquad C_{n,m} = \frac{1}{m^2} \sum_{k,l=1}^{m} \bigl(C^X_{kl} - C^Y_{kl}\bigr)^2.$$
Finally, the sample Ball Divergence is

$$D_{n,m} = A_{n,m} + C_{n,m}.$$
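The sample statistic above can be computed directly from pairwise distances. The following is a minimal sketch assuming the Euclidean metric; the name sample_ball_divergence and the vectorised layout are illustrative choices, not an established library interface.

import numpy as np

def sample_ball_divergence(X, Y):
    # Sample Ball Divergence D_{n,m} between two samples X (n x d) and
    # Y (m x d), following the decomposition D_{n,m} = A_{n,m} + C_{n,m}.
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    # Pairwise Euclidean distances within and across the two samples.
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # n x n
    dYY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)  # m x m
    dXY = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # n x m
    # A^X_{ij} and A^Y_{ij}: proportions of each sample inside the closed
    # ball centred at X_i with radius rho(X_i, X_j).
    AX = (dXX[:, None, :] <= dXX[:, :, None]).mean(axis=2)
    AY = (dXY[:, None, :] <= dXX[:, :, None]).mean(axis=2)
    # C^X_{kl} and C^Y_{kl}: proportions inside the closed ball centred
    # at Y_k with radius rho(Y_k, Y_l).
    CX = (dXY.T[:, None, :] <= dYY[:, :, None]).mean(axis=2)
    CY = (dYY[:, None, :] <= dYY[:, :, None]).mean(axis=2)
    return ((AX - AY) ** 2).mean() + ((CX - CY) ** 2).mean()

The broadcasting mirrors the formulas directly and uses memory of order $n^3$; a loop-based or sorted-radius implementation scales better for large samples.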
It can be proved that $D_{n,m}$ is a consistent estimator of the Ball Divergence $D(\mu, \nu)$. Moreover, if $n/(n+m) \to \tau$ for some $\tau \in (0, 1)$, then under the null hypothesis $\mu = \nu$ the rescaled statistic $\tfrac{nm}{n+m} D_{n,m}$ converges in distribution to a mixture of chi-squared distributions, whereas under the alternative hypothesis $D_{n,m}$, suitably centered and scaled, converges to a normal distribution.[1]
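In practice, because the weights of the limiting chi-squared mixture are typically unknown, the test is often calibrated by a permutation procedure. A minimal sketch, reusing the hypothetical sample_ball_divergence function from the sketch above:

import numpy as np

def ball_divergence_permutation_test(X, Y, n_permutations=199, seed=None):
    # Permutation p-value for H0: the two samples come from the same distribution.
    rng = np.random.default_rng(seed)
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    observed = sample_ball_divergence(X, Y)
    pooled, n = np.vstack([X, Y]), len(X)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        # Randomly reassign the pooled points to the two groups.
        if sample_ball_divergence(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            count += 1
    # Add-one correction keeps the permutation p-value valid and positive.
    return (count + 1) / (n_permutations + 1)

For example, with one sample drawn from a standard normal and the other from a shifted normal, the returned p-value will typically be small for moderate sample sizes.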
Properties
1. The square root of Ball Divergence is a symmetric divergence but not a metric, because it does not satisfy the triangle inequality.
2. It can be shown that Ball Divergence, the energy distance test,[2] and the maximum mean discrepancy (MMD)[3] are unified within the variogram framework; for details, see Remark 2.4 of the original paper.[1]
Homogeneity test
Ball Divergence admits a straightforward extension to the $K$-sample setting. Suppose $\mu_1, \ldots, \mu_K$ are probability measures on a Banach space $V$ (with metric $\rho(u, v) = \|u - v\|$). Define the $K$-sample BD by

$$D(\mu_1, \ldots, \mu_K) = \sum_{1 \le s < t \le K} D(\mu_s, \mu_t).$$

It then follows from Theorems 1 and 2 that $D(\mu_1, \ldots, \mu_K) = 0$ if and only if $\mu_1 = \mu_2 = \cdots = \mu_K$.
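A natural empirical counterpart simply sums the pairwise two-sample statistics over all groups; a brief sketch, again reusing the hypothetical sample_ball_divergence helper defined earlier:

from itertools import combinations

def k_sample_ball_divergence(samples):
    # samples: list of arrays, one (n_k x d) array per group.
    # Empirical K-sample statistic as the sum of all pairwise sample BDs.
    return sum(sample_ball_divergence(S, T) for S, T in combinations(samples, 2))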
By employing closed balls to define a metric distribution function, one obtains an alternative homogeneity measure.[4]
Given a probability measure $\mu$ on a metric space $(V, \rho)$, its metric distribution function is defined by

$$F_\mu(u, v) = \mu\bigl(\bar{B}(u, \rho(u, v))\bigr),$$

where $\bar{B}(u, r) = \{x \in V : \rho(u, x) \le r\}$ is the closed ball of radius $r$ centered at $u$, and $u, v \in V$.
If $X_1, \ldots, X_n$ are i.i.d. draws from $\mu$, the empirical version is

$$\hat{F}_n(u, v) = \frac{1}{n} \sum_{i=1}^{n} I\bigl(X_i \in \bar{B}(u, \rho(u, v))\bigr).$$
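As an illustration, the empirical metric distribution function has a one-line estimate under the Euclidean metric; the name empirical_mdf below is an illustrative choice rather than a standard API.

import numpy as np

def empirical_mdf(X, u, v):
    # Proportion of the sample X (n x d) inside the closed ball centred at u
    # with radius rho(u, v); this estimates F_mu(u, v).
    X = np.asarray(X, dtype=float)
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    radius = np.linalg.norm(u - v)
    return float(np.mean(np.linalg.norm(X - u, axis=1) <= radius))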
Based on the metric distribution function, the homogeneity measure based on the MDF, also called the metric Cramér–von Mises (MCVM) measure, is

$$\mathrm{MCVM}(\mu_k, \nu) = \iint_{V \times V} \bigl[F_{\mu_k}(u, v) - F_{\nu}(u, v)\bigr]^2 \,\nu(du)\,\nu(dv),$$

where $\nu = \sum_{k=1}^{K} w_k \mu_k$ is the mixture of $\mu_1, \ldots, \mu_K$ with weights $w_k > 0$ and $\sum_{k=1}^{K} w_k = 1$. The overall MCVM is then

$$\mathrm{MCVM}(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} w_k \, \mathrm{MCVM}(\mu_k, \nu).$$
The empirical MCVM is given by

$$\widehat{\mathrm{MCVM}} = \sum_{k=1}^{K} w_k \, \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl[\hat{F}_{\mu_k}(Z_i, Z_j) - \hat{F}_{\nu}(Z_i, Z_j)\bigr]^2,$$

where $Z_1, \ldots, Z_N$ is an i.i.d. sample from $\nu$, and $\hat{F}_{\mu_k}$ and $\hat{F}_{\nu}$ are the empirical metric distribution functions of the $k$-th group and of the pooled sample, respectively. A practical choice for the associated tuning parameter is the median of the squared pairwise distances among the sample points.
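The following is a compact sketch of the empirical statistic as written above, evaluating the empirical metric distribution functions on the pooled sample under the Euclidean metric; all names are illustrative, and the weighting $w_k = n_k / N$ is an assumption rather than a prescription from the cited work.

import numpy as np

def empirical_mcvm(samples):
    # samples: list of (n_k x d) arrays, one per group.
    samples = [np.asarray(S, dtype=float) for S in samples]
    Z = np.vstack(samples)                    # pooled sample of size N
    N = len(Z)
    weights = np.array([len(S) for S in samples]) / N  # assumed w_k = n_k / N
    # Pairwise distances among pooled points: rho(Z_i, Z_j) is the ball radius.
    dZZ = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # N x N

    def mdf_matrix(S):
        # F_hat[i, j] = proportion of points of S inside the closed ball
        # centred at Z_i with radius rho(Z_i, Z_j).
        dSZ = np.linalg.norm(Z[:, None, :] - S[None, :, :], axis=-1)  # N x |S|
        return (dSZ[:, None, :] <= dZZ[:, :, None]).mean(axis=2)      # N x N

    F_pool = mdf_matrix(Z)  # empirical MDF of the pooled (mixture) sample
    return sum(w * ((mdf_matrix(S) - F_pool) ** 2).mean()
               for w, S in zip(weights, samples))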
References