Friday, October 2, 2009

Hypothesis testing: the little-discussed 1-sample vs 2-sample test case

Frequentist methods in statistical inference - hypothesis testing (the Kolmogorov-Smirnov test):

Why is it needed?

The importance of statistical inference techniques in any field of CS that deals with analysing cause-and-effect relationships in data cannot be overstressed.
On one hand we have statistical inference techniques like hypothesis testing (the frequentist method). The other type of inference technique is Bayesian. Both have been critically acclaimed. This entry will discuss null hypothesis testing. I personally took a while to understand this idea. There are two or three related ideas to understand before I move on to null hypothesis testing, which I will outline in section 2 of this entry. As always, the blog will be corrected as it grows...
I don't yet fully understand when to prefer the Bayesian over the frequentist methods. Here are some typical situations where you would prefer the frequentist method (this does not mean that Bayesian methods won't do well in these cases)...
1) When the distribution of the data is unknown and the statistic undecidable
2) Don't know yet... Will let you know as soon as I do!

Hypothesis testing is quite useful if we wish to compare a sample to a target population and we wish to reject an estimate or "hypothesis" about the sample - for example, is the sample from a normal distribution? If we have a strong intuition about the nature of the target population, then we can make a reasonable assumption about its parameters and study those parameters in the sample to decide what the data is like.


Technical Jargon!!!
First we need to understand a few terms related to statistics.
1) Confidence level, confidence intervals and confidence limits. These are numbers that tell us how often a statistic should lie within a prescribed range. So a 95% confidence level means we expect the statistic to lie inside the confidence interval [a,b] 95% of the time. In terms of the Kolmogorov-Smirnov test this generally shows up as the complement of the significance level alpha described next (a 95% confidence level corresponds to alpha = 0.05).

2) Significance levels. Below is the best definition of significance level that I could come up with:
In hypothesis testing, the significance level is the criterion used for rejecting the null hypothesis. The significance level is used in hypothesis testing as follows: First, the difference between the results of the experiment and the null hypothesis is determined. Then, assuming the null hypothesis is true, the probability of a difference that large or larger is computed . Finally, this probability is compared to the significance level. If the probability is less than or equal to the significance level, then the null hypothesis is rejected and the outcome is said to be statistically significant. Traditionally, experimenters have used either the 0.05 level (sometimes called the 5% level) or the 0.01 level (1% level), although the choice of levels is largely subjective. The lower the significance level, the more the data must diverge from the null hypothesis to be significant. Therefore, the 0.01 level is more conservative than the 0.05 level. The Greek letter alpha (α) is sometimes used to indicate the significance level.


3) Sampling distribution: this is the distribution of the statistic that we gather. So if your statistic is a correlation, then the sampling distribution of the correlation is just what you get by computing the correlation over many samples and plotting those values in a frequency table.

4) Empirical distribution function - this is the function the Kolmogorov-Smirnov test is built on. For a sample of size n, its value at a point x is the fraction of sample values less than or equal to x, so it is a cumulative step function that jumps by 1/n at each observation.
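To make this concrete, here is a minimal Python sketch of an empirical distribution function (the sample values below are just an assumed example):

```python
import bisect

def edf(sample):
    """Empirical distribution function: F(x) = (# of sample values <= x) / n."""
    data = sorted(sample)
    n = len(data)
    def F(x):
        # bisect_right counts how many sorted values are <= x,
        # so F steps up by 1/n at each observation.
        return bisect.bisect_right(data, x) / n
    return F

F = edf([3, 1, 4, 1, 5])
print(F(0), F(1), F(3.5), F(10))   # 0.0 0.4 0.6 1.0
```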

The Kolmogorov-Smirnov test

There are two cases: the 2-sample and the 1-sample test. The former compares two unknown distributions. The latter compares the distribution of a sample to a known distribution. Remember that we compare statistics, not the raw data in the samples.

2-sample case - comparing two unknown distributions.

Step 1) Form a hypothesis H: F1 = F2 (or F1 > F2, or F1 < F2). There is an alternate hypothesis H' which is the complement of this.
Step 2) Calculate the empirical distribution function for a selected sample set X (say) and call it F1. Calculate the empirical distribution function for the other sample and call it F2.
Step 3) Select a significance level alpha. This is the threshold that we will compare the p-value to directly; it is a threshold area under the curve.
Step 4) Two-sample case: calculate the KS statistic corresponding to F1 and F2, i.e. the maximum absolute difference between the two empirical distribution functions. In the two-sample case you then magnify this statistic by multiplying it by the ratio sqrt(n1*n2/(n1+n2)); call the result K'. Now here is where the difference from the 1-sample case lies. In the 2-sample case you plug K' into the p-value equation:
\operatorname{Pr}(K\leq x)=1-2\sum_{i=1}^\infty (-1)^{i-1} e^{-2i^2 x^2}=\frac{\sqrt{2\pi}}{x}\sum_{i=1}^\infty e^{-(2i-1)^2\pi^2/(8x^2)}.
Here x is K'.

Step 5) Two-sample case: compare this directly to alpha (typically 0.05 or 0.01). The equation above gives an area under the curve, i.e. a probability, which tells you whether this Kolmogorov-Smirnov statistic is statistically significant or not. If Pr(K <= x) is large, the scaled statistic lies far out in the tail, i.e. it is an outlier under the null hypothesis. Beware that many software implementations report 1 - Pr(K <= x), the p-value, which is the more typical convention. alpha is defined as

\operatorname{Pr}(K\leq K_\alpha)=1-\alpha.\,

So for the p-value 1 - Pr(K <= x), the smaller it is, the stronger the evidence that the observed statistic is an outlier under H, and we reject H when it is less than or equal to alpha. Also notice that alpha appears as a plain number in the equation above; that works because the total area under the curve is 1.
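Putting steps 1-5 together, here is a rough Python sketch of the two-sample procedure. The two samples and alpha = 0.05 are assumed purely for illustration, and the p-value series is truncated at 100 terms; this is a sketch of the idea, not a production implementation.

```python
import bisect
import math
import random

def ks_2sample(sample1, sample2, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test following the steps above."""
    d1, d2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(d1), len(d2)

    # Steps 2 and 4: D = max |F1(x) - F2(x)| over all pooled data points.
    D = max(abs(bisect.bisect_right(d1, x) / n1 - bisect.bisect_right(d2, x) / n2)
            for x in d1 + d2)

    # Step 4: scale the statistic, then get the p-value from the Kolmogorov series:
    # p = 1 - Pr(K <= x) = 2 * sum_{i>=1} (-1)^(i-1) * exp(-2 * i^2 * x^2)
    x = math.sqrt(n1 * n2 / (n1 + n2)) * D
    p_value = 2 * sum((-1) ** (i - 1) * math.exp(-2 * i * i * x * x)
                      for i in range(1, 101))
    p_value = min(1.0, max(0.0, p_value))   # the truncated series can stray outside [0, 1] for tiny x

    # Step 5: compare the p-value to the significance level.
    return D, p_value, p_value <= alpha     # True in the last slot means "reject H: F1 = F2"

# Assumed example: one standard normal sample and one shifted normal sample.
a = [random.gauss(0.0, 1.0) for _ in range(200)]
b = [random.gauss(0.5, 1.0) for _ in range(200)]
print(ks_2sample(a, b))
```

If SciPy is available, scipy.stats.ks_2samp(a, b) computes the same statistic and a p-value, and is what I would reach for in practice.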


Step 4) One-sample case: calculate the Kolmogorov-Smirnov statistic K' corresponding to F1 and F2. F2 in this case is given to you: it is the known distribution you are testing against. Standard tables published in books tell you, for a given significance level alpha and sample size N, the critical KS statistic. Sample table:
Level of significance for D = maximum | F0(X) - Sn(X) |

| Sample size (N) | .20 | .15 | .10 | .05 | .01 |
|---|---|---|---|---|---|
| 1 | .900 | .925 | .950 | .975 | .995 |
| 2 | .684 | .726 | .776 | .842 | .929 |
| 3 | .565 | .597 | .642 | .708 | .828 |
| 4 | .494 | .525 | .564 | .624 | .733 |
| 5 | .446 | .474 | .510 | .565 | .669 |
| 6 | .410 | .436 | .470 | .521 | .618 |
| 7 | .381 | .405 | .438 | .486 | .577 |
| 8 | .358 | .381 | .411 | .457 | .543 |
| 9 | .339 | .360 | .388 | .432 | .514 |
| 10 | .322 | .342 | .368 | .410 | .490 |
| 11 | .307 | .326 | .352 | .391 | .468 |
| 12 | .295 | .313 | .338 | .375 | .450 |
| 13 | .284 | .302 | .325 | .361 | .433 |
| 14 | .274 | .292 | .314 | .349 | .418 |
| 15 | .266 | .283 | .304 | .338 | .404 |
| 16 | .258 | .274 | .295 | .328 | .392 |
| 17 | .250 | .266 | .286 | .318 | .381 |
| 18 | .244 | .259 | .278 | .309 | .371 |
| 19 | .237 | .252 | .272 | .301 | .363 |
| 20 | .231 | .246 | .264 | .294 | .356 |
| 25 | .210 | .220 | .240 | .270 | .320 |
| 30 | .190 | .200 | .220 | .240 | .290 |
| 35 | .180 | .190 | .210 | .230 | .270 |

Here we don't apply the p-value formula like we did before. The Kolmogorov-Smirnov statistic is looked up in the table above directly. So if we had chosen a significance level of 0.05 and N = 20, then the value of the KS statistic must be greater than 0.294 to reject the null hypothesis H.
The significance level is the probability figure (the area under the curve in the rejection region), and clearly it takes small values. The idea is that the statistic belongs to a statistically different population if it lies in that area. For example, if the area over an interval [a,b] is reduced from 0.5 to 0.1, the KS statistic will have to be more extreme to lie in it.
Remember: the smaller this value, the more leeway there is in the KS statistic for us to not reject the hypothesis H.
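For completeness, here is a rough Python sketch of the one-sample case. The standard normal null distribution, alpha = 0.05 and N = 20 are assumed purely for illustration; the 0.294 critical value is taken from the table above.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the hypothesised (known) distribution F0 - here a normal."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_1sample(sample, cdf):
    """D = maximum |F0(x) - Sn(x)| between the hypothesised CDF and the sample EDF."""
    data = sorted(sample)
    n = len(data)
    D = 0.0
    for i, x in enumerate(data):
        # The EDF jumps at x, so check the gap just before and just after the jump.
        D = max(D, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return D

# Assumed example: N = 20 draws that really are standard normal, so H should usually survive.
sample = [random.gauss(0.0, 1.0) for _ in range(20)]
D = ks_1sample(sample, normal_cdf)
critical = 0.294   # table value for alpha = 0.05, N = 20
print(D, "reject H" if D > critical else "do not reject H")
```

SciPy's scipy.stats.kstest(sample, 'norm') gives the same statistic along with a p-value, if you would rather not use the printed tables.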


[Figure: the area under a curve over an interval [a,b]]

In the 2-sample case this area is calculated using the formula above, but in the 1-sample case we use the tables.