Python:Advanced Predictive Analytics

Hypothesis testing

The concept discussed in the preceding section underlies a very important technique in statistics called hypothesis testing. In hypothesis testing, we assume a hypothesis (generally about the value of an estimator), called the null hypothesis, and check whether it holds by applying the properties of the normal distribution. It is tested against a competing hypothesis, called the alternate hypothesis.

Null versus alternate hypothesis

There is a catch in deciding which statement is the null hypothesis and which is the alternate hypothesis. The null hypothesis is the initial premise, something we assume to be true so far. The alternate hypothesis is something we aren't sure about and are proposing as an alternate premise (almost always contradictory to the null hypothesis), which might or might not be true.

So, when someone does quantitative research to calibrate the value of an estimator, the known value of the parameter is taken as the null hypothesis, while the newly found value (from the research) is taken as the alternate hypothesis. In our case of finding the mean age of Tamil Nadu, a researcher can claim, based on the rich demographic dividend of India, that the mean age should be less than 35. This can serve as the null hypothesis. If a new agency claims otherwise (that it is greater than 35), then that claim can be termed the alternate hypothesis.

Z-statistic and t-statistic

Assume that the value of the parameter assumed in the null hypothesis is Ao. Take a random sample of 100 or 1000 people or occurrences of the event and calculate the sample mean of the quantity of interest, such as mean age, mean delivery time for pizza, mean income, and so on. We can call it Am. According to the central limit theorem, the means of such random samples follow (approximately) a normal distribution.

The Z-statistic is calculated to convert a normally distributed variable (here, the sample mean of age) into one that follows the standard normal distribution. This is because the probability values for a variable following the standard normal distribution can be obtained from a precalculated table. The Z-statistic is given by the following formula:

$$Z = \frac{A_m - A_o}{\sigma / \sqrt{n}}$$

In the preceding formula, σ stands for the standard deviation of the population/occurrences of events and n is the number of people in the sample.
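As a minimal sketch, the Z-statistic can be computed in Python as follows. It assumes the usual form Z = (Am - Ao) / (σ / √n); the function name `z_statistic` and the numbers are hypothetical and chosen only for illustration.

```python
import math


def z_statistic(sample_mean, null_mean, population_sd, n):
    """Return Z = (Am - Ao) / (sigma / sqrt(n))."""
    # Standard error of the mean: sigma / sqrt(n)
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error


# Made-up numbers: Ao = 35, Am = 36, sigma = 10, n = 100
z = z_statistic(sample_mean=36.0, null_mean=35.0, population_sd=10.0, n=100)
print(z)
```

Note that a larger sample shrinks the standard error, so the same gap between Am and Ao yields a larger Z-value.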

Now, there can be two cases that can arise:

  • Z-test (normal distribution): The researcher knows the standard deviation of the parameter from past experience. A good example of this is the case of pizza delivery time; you will know the standard deviation from past experience.

    Ao (from the null hypothesis) and n are known. Am is calculated from the random sample. This kind of test is done when the standard deviation is known. It is called the Z-test because the statistic follows the normal distribution, and the standard normal value obtained from the preceding formula is called the Z-value.

  • t-test (Student-t distribution): The researcher doesn't know the standard deviation of the population. This might happen because no such data is available from historical experience, or because the number of people/events is too small to assume a normal distribution; the standard deviation then has to be estimated from the sample itself. Examples of such cases are students' marks in an exam, the age of a population, and so on. In this case, the standardized expression follows a distribution other than the normal distribution, called the Student-t distribution (with n-1 degrees of freedom). The standard value in this case is called the t-value and the test is called the t-test.

    The standard deviation can be estimated once the mean is estimated, provided the sample is large enough. Let us call the estimated standard deviation S; then S is estimated as follows, where the X_i are the individual observations in the sample:

    $$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - A_m)^2}{n - 1}}$$

    The t-statistic is calculated as follows:

    $$t = \frac{A_m - A_o}{S / \sqrt{n}}$$
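The t-statistic can be sketched in Python along the same lines. This example assumes the sample standard deviation S uses the n-1 denominator (which is what `statistics.stdev` computes) and that t = (Am - Ao) / (S / √n); the function name `t_statistic` and the marks data are made up for illustration.

```python
import math
import statistics


def t_statistic(sample, null_mean):
    """Return t = (Am - Ao) / (S / sqrt(n)) for a list of observations."""
    n = len(sample)
    sample_mean = statistics.mean(sample)   # Am
    s = statistics.stdev(sample)            # S, with the n-1 denominator
    return (sample_mean - null_mean) / (s / math.sqrt(n))


# Hypothetical exam marks; the null hypothesis says the mean mark is 65
marks = [62, 70, 58, 75, 66, 71, 64, 69]
t = t_statistic(marks, null_mean=65)
print(round(t, 3))
```

The resulting t-value would then be compared against a Student-t distribution with n-1 degrees of freedom rather than the standard normal table.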

The difference between the two cases, as you can see, is the distribution they follow. The first one follows a normal distribution and calculates a Z-value. The second one follows a Student-t distribution and calculates a t-value. These statistics, that is, the Z-statistic and the t-statistic, are the quantities that help us test our hypothesis.

Confidence intervals, significance levels, and p-values

Let us go back to the last chapter for a moment and remind ourselves about the cumulative probability distribution.

Fig. 4.1: A typical normal distribution with p-values

Let us have a look at the preceding figure, which shows a standard normal distribution. Suppose Z1 and Z2 are two Z-statistics corresponding to two values of the random variable, and p1 and p2 are the areas enclosed by the distribution curve to the right of those values. In other words, p1 is the probability that the random variable will take a value greater than Z1, and p2 is the probability that it will take a value greater than Z2.

If we represent the random variable by X, then we can write:

$$P(X > Z_1) = p_1, \qquad P(X > Z_2) = p_2$$

Also, since the sum of all the exclusive probabilities is always 1, we can write:

$$P(X \le Z_1) = 1 - p_1, \qquad P(X \le Z_2) = 1 - p_2$$
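These tail areas do not have to come from a printed table. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+) and treats p1 and p2 as the areas to the right of Z1 and Z2; the values of Z1 and Z2 are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def right_tail(z):
    """Area under the standard normal curve to the right of z, P(X > z)."""
    return 1.0 - std_normal.cdf(z)


# Hypothetical Z-values
z1, z2 = 1.0, 2.0
p1 = right_tail(z1)
p2 = right_tail(z2)

# The left and right areas around each value add up to 1
print(p1 + std_normal.cdf(z1))
print(p2 + std_normal.cdf(z2))
```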

For well-defined distributions, such as the normal distribution, one can define an interval in which the value of the random variable will lie with a given confidence level (read probability). This interval is called the confidence interval. For example, for a normal distribution with mean μ and standard deviation σ, the value of the random variable will lie in the interval [μ-3σ, μ+3σ] with about 99.7% probability. For any estimator (essentially a random variable) that follows a normal distribution, one can define a confidence interval once the confidence (or probability) level is decided. One can think of confidence intervals as thresholds of the accepted values to hold a null hypothesis as true. If the value of the estimator (random variable) lies in this range, the data are statistically consistent with the null hypothesis, and it is not rejected.
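A short sketch of both directions of this idea follows: the probability covered by an interval of ±k standard deviations, and the interval that corresponds to a chosen confidence level. The helper names `coverage` and `confidence_interval` are hypothetical, and the interval is assumed to be symmetric around the mean.

```python
from statistics import NormalDist


def coverage(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal distribution."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)


def confidence_interval(mu, sigma, confidence):
    """Symmetric interval around mu containing `confidence` of the probability."""
    tail = (1.0 - confidence) / 2.0           # area left in each tail
    z_crit = NormalDist().inv_cdf(1.0 - tail)
    return mu - z_crit * sigma, mu + z_crit * sigma


print(coverage(3))                           # probability within 3 sigma
print(confidence_interval(0.0, 1.0, 0.95))   # 95% interval, standard normal
```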

To define a confidence interval, one needs to define a confidence (or probability) level. This probability needs to be chosen by the researcher depending on the context. Let's call this p. Instead of defining this probability p, one generally defines (1-p), which is called the level of significance. Let us represent it by ß. This represents the probability of rejecting the null hypothesis when it is actually true. It is defined by the user for each test and is usually in the range 0.01-0.1.

An important concept to learn here is the probability value, or just the p-value, of a statistic. It is the probability that the random variable assumes a value greater than the calculated Z-value or t-value:

$$p\text{-value} = P(X > Z)$$

Fig. 4.2: A typical normal distribution with p-values and significance level

Now, this Z-value and the p-value have been obtained assuming that the null hypothesis is true. So, for the null hypothesis to be accepted, the Z-value has to lie outside the area enclosed by ß. In other words, for the null hypothesis to be accepted, the p-value has to be greater than the significance level, as shown in the preceding figure.

To summarize:

  • Accept the null hypothesis and reject the alternate hypothesis if p-value>ß
  • Accept the alternate hypothesis and reject the null hypothesis if p-value<ß

Different kinds of hypothesis test

Due to the symmetry and nature of the normal distribution, there are three kinds of possible hypothesis tests:

  • Left-tailed
  • Right-tailed
  • Two-tailed

Left-tailed: This is the case when the alternate hypothesis is of the "less-than" type. The hypothesis testing is done on the left tail of the distribution, hence the name. In this case, with Zß denoting the critical Z-value that cuts off an area ß in the left tail:

  • Accept the null hypothesis and reject the alternate hypothesis if p-value>ß or, equivalently, Z>Zß
  • Accept the alternate hypothesis and reject the null hypothesis if p-value<ß or, equivalently, Z<Zß

    Fig. 4.3: Left-tailed hypothesis testing

Right-tailed: This is the case when the alternate hypothesis is of the "greater-than" type. The hypothesis testing is done on the right tail of the distribution, hence the name. In this case, with Zß denoting the critical Z-value that cuts off an area ß in the right tail:

  • Accept the null hypothesis and reject the alternate hypothesis if p-value>ß or, equivalently, Z<Zß
  • Accept the alternate hypothesis and reject the null hypothesis if p-value<ß or, equivalently, Z>Zß

    Fig. 4.4: Right-tailed hypothesis testing

Two-tailed: This is the case when the alternate hypothesis is of the "not equal to" type; whether the value is less than or more than is not stated. It is just an OR operation over both kinds of tests, with the significance level ß split equally between the two tails. If either the left- or the right-tailed test rejects the null hypothesis, then it is rejected. The hypothesis testing is done on both tails of the distribution, hence the name.
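The three kinds of test differ only in which tail area is taken as the p-value. The sketch below shows one way this could be written; the function name `p_value` and the string labels for the tails are assumptions made for illustration, and the two-tailed case is taken as twice the smaller tail area, which is equivalent to testing each tail at half the significance level.

```python
from statistics import NormalDist


def p_value(z, tail):
    """p-value of a Z-statistic for a 'left', 'right', or 'two' tailed test."""
    left_area = NormalDist().cdf(z)   # P(X <= z)
    right_area = 1.0 - left_area      # P(X > z)
    if tail == "left":
        return left_area
    if tail == "right":
        return right_area
    if tail == "two":
        # Twice the smaller tail: same as comparing each tail with half the level
        return 2.0 * min(left_area, right_area)
    raise ValueError("tail must be 'left', 'right', or 'two'")


significance = 0.05
z = -1.8  # hypothetical Z-value
for kind in ("left", "right", "two"):
    p = p_value(z, kind)
    print(kind, round(p, 4), "reject null" if p < significance else "keep null")
```

With this made-up Z-value, the same statistic can lead to different decisions depending on the kind of test, which is why the alternate hypothesis should be fixed before looking at the data.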

A step-by-step guide to do a hypothesis test

So how does one accept one hypothesis and reject the other? There has to be a logical way to do this. Let us summarize and put to use whatever we have learned so far in this section. Here is a step-by-step guide to doing a hypothesis test:

  1. Define your null and alternate hypotheses. The null hypothesis is something that is already stated and is assumed to be true; call it Ho. Also, assume that the value of the parameter in the null hypothesis is Ao.
  2. Take a random sample of 100 or 1000 people/occurrences of events and calculate the value of the estimator (for example, the mean of the parameter, that is, mean age, mean delivery time for pizza, mean income, and so on). You can call it Am.
  3. Calculate the standard normal value, or Z-value as it is called, using this formula:

     $$Z = \frac{A_m - A_o}{\sigma / \sqrt{n}}$$

     In the preceding formula, σ is the standard deviation of the population or occurrences of events and n is the number of people in the sample.
  4. Compare the probability (p-value) associated with the Z-value calculated in step 3 with the significance level of the test to determine whether the null hypothesis will be accepted or rejected.
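The steps above can be sketched as a single function that starts from raw observations. This is only an illustration under stated assumptions: the population standard deviation is known, the test is right-tailed, and the function name, the returned dictionary keys, and the sample data are all hypothetical.

```python
import math
from statistics import NormalDist, mean


def z_test_right_tailed(sample, null_mean, population_sd, significance=0.05):
    """Run a right-tailed Z-test on raw observations with a known sigma."""
    n = len(sample)
    sample_mean = mean(sample)                                      # step 2: Am
    z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))  # step 3
    p = 1.0 - NormalDist().cdf(z)                                   # area right of Z
    return {"z": z, "p_value": p, "reject_null": p < significance}


# Step 1 (hypothetical): Ho says the mean is 20; Ha says it is greater than 20
observations = [22, 19, 24, 21, 23, 20, 25, 22, 21]
result = z_test_right_tailed(observations, null_mean=20, population_sd=3)
print(result)
```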

An example of a hypothesis test

Let us see an example of hypothesis testing now. A famous pizza place claims that their mean delivery time is 20 minutes with a standard deviation of 3 minutes. An independent market researcher claims that they are understating the numbers for market gains and that the mean delivery time is actually longer. To check this, he selected a random sample of 64 deliveries over a week and found that the mean is 21.2 minutes. Is his claim justified, or is the pizza place correct in its claim? Assume a significance level of 5%.

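First things first, let us define the null and alternate hypotheses:

    Ho: mean delivery time = 20 minutes (Ao = 20)
    Ha: mean delivery time > 20 minutes

Let us calculate the Z-value:

$$Z = \frac{21.2 - 20}{3 / \sqrt{64}} = \frac{1.2}{0.375} = 3.2$$

When we look up this Z-value in the standard normal table, we find that it has an area of 0.99931 to the left of it; hence, the area to the right is 1 - 0.99931 = 0.00069, which is less than 0.05.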

Hence, p-value<ß. Thus, the null hypothesis is rejected. This can be summarized in the following figure:

Fig. 4.5: Null Hypothesis is rejected because p-value<significance level

Hence, the researcher's claim that the mean delivery time is more than 20 minutes is statistically supported at the 5% significance level.
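The pizza example can be checked numerically with a few lines of Python. The sketch uses only the figures given in the example (Ao = 20, σ = 3, n = 64, Am = 21.2, significance level 5%) and also computes the critical Z-value for a right-tailed test, so that both ways of reaching the decision can be compared; the variable names are illustrative.

```python
import math
from statistics import NormalDist

null_mean = 20.0       # Ao, claimed by the pizza place
population_sd = 3.0    # sigma, claimed by the pizza place
n = 64                 # deliveries sampled by the researcher
sample_mean = 21.2     # Am, observed by the researcher
significance = 0.05

z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))
p_value = 1.0 - NormalDist().cdf(z)                   # area to the right of Z
z_critical = NormalDist().inv_cdf(1.0 - significance)  # right-tail cut-off

print(round(z, 2), round(p_value, 5), round(z_critical, 3))
print("reject null by p-value:", p_value < significance)
print("reject null by Z-value:", z > z_critical)
```

Both comparisons lead to the same decision, since a Z-value beyond the critical value is just another way of saying that the p-value is below the significance level.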