Statisticians have long debated how large a sample should be for any kind of advanced statistical analysis, and over time that debate has produced a number of methods for calculating it. The topic can get involved, so with readers in mind we will look at why sample size matters, show how to quickly arrive at a sample size for a particular case, and work through a practical, relatable example.
What’s the sample size fuss all about?
Suppose our data universe, the “population set” in statistical language, consists of 10 people and their respective ages (refer to Table 1 below).
In the above example, we see that as the sample size increases, the deviation of the sample mean from the population mean reduces. Over a large number of trials, the average deviation from smaller samples is higher than from larger samples, because larger samples tend towards the population. This gives the impression that, to arrive at accurate results, we should collect more and more data points, and that obtaining the full population data would be best. In reality that is rarely, if ever, possible because of data availability, collection, storage, and optimization issues.
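The effect can be reproduced with a minimal sketch. The ages below are hypothetical stand-ins, not the values from Table 1, and the helper name `mean_abs_deviation` is our own. For each sample size the code enumerates every possible sample drawn without replacement and averages the absolute gap between the sample mean and the population mean.

```python
from itertools import combinations

# Hypothetical ages for a 10-person population (NOT the values from Table 1).
ages = [23, 25, 31, 34, 38, 42, 47, 51, 56, 63]
population_mean = sum(ages) / len(ages)


def mean_abs_deviation(sample_size):
    """Average |sample mean - population mean| over every possible
    sample of the given size, drawn without replacement."""
    deviations = [
        abs(sum(sample) / sample_size - population_mean)
        for sample in combinations(ages, sample_size)
    ]
    return sum(deviations) / len(deviations)


for size in (2, 5, 8, 10):
    print(f"sample size {size:2d}: average deviation = {mean_abs_deviation(size):.2f}")
```

Under these assumed ages, the average deviation shrinks as the sample grows and reaches zero once the sample is the whole population.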
The data we see around us
One of the most common patterns in the data we see around us is the normal distribution (commonly known as the bell-shaped curve). In this distribution most of the data is concentrated around the mean and distributed symmetrically about it. Since this is the most frequently observed pattern in data, we explain how to estimate sample sizes when the data is normally distributed and we are running a test on it.
Chart 2 plots normally distributed data with just 1,000 data points, while Chart 3 uses 1,000,000 data points; clearly the latter is much closer to the normal distribution. This reiterates that the more data points we have, the more accurate our estimate.
Statistically it can be proved that approximately 99.7%, 95% and 68% of the area under such a normally distributed curve lies within 3, 2 and 1 sigma (standard deviation) of the mean respectively (Chart 4). Hence, for any test involving normally distributed data, this can ease the sample size calculation. Let us take a simple example to keep the article lucid.
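One way to put a number on "closer to the normal distribution" is sketched below. It is an illustration of our own, not how the charts were produced: it draws simulated standard normal values and measures the largest gap between the sample's empirical cumulative distribution and the theoretical normal one. The function name `max_cdf_gap` and the fixed seed are assumptions made for the example.

```python
import random
from statistics import NormalDist


def max_cdf_gap(n, seed=0):
    """Largest vertical gap between the empirical CDF of n simulated
    standard normal draws and the theoretical normal CDF."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    std_normal = NormalDist()
    gap = 0.0
    for i, x in enumerate(draws):
        cdf = std_normal.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, cdf - i / n, (i + 1) / n - cdf)
    return gap


for n in (1_000, 1_000_000):
    print(f"{n:>9,} points: largest gap from the normal curve = {max_cdf_gap(n):.5f}")
```

The gap for the larger sample should come out far smaller, which is the numerical counterpart of the smoother curve. The million-point run takes a few seconds in pure Python.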
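These areas can be checked directly from the normal cumulative distribution function. The short sketch below uses Python's standard library; the helper name `area_within` is our own.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def area_within(k_sigma):
    """Share of a normal distribution lying within k_sigma standard
    deviations of the mean."""
    return std_normal.cdf(k_sigma) - std_normal.cdf(-k_sigma)


for k in (1, 2, 3):
    print(f"within {k} sigma: {area_within(k):.2%}")
```

The three values come out at roughly 68.3%, 95.4% and 99.7%. The exact 95% band sits at about 1.96 sigma rather than 2.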
The mean test
Let’s set up an example. Over some years, the average number of daily visits to a website is known to be 10,000. Suddenly, over a period of time, the average is observed to have spiked to 10,250. Is this behavior just a random coincidence, or is it a significant shift? And with what sample size should we test the change?
What we want to determine here is whether the 10,250 average visit value belongs to the same population as the one with mean 10,000. The hypothesis being tested is: is the population mean actually 10,000?
Average Daily Website Visits – Population 10,000
Average Daily Website Visits – Sample 10,250
Standard Deviation of the Population (assumed known) – 4,034
Hypothesis to be tested
Null Hypothesis: Population Mean = 10,000
Alternative Hypothesis: Population Mean ≠ 10,000
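At the 95% confidence level, the critical z-value for a two-sided test is 1.96. Setting the test statistic equal to this value and solving for n:
1.96 = (10,250 – 10,000) / (4,034 / SQRT(n))
Hence, n = ((1.96 * 4,034) / 250) ^ 2
Hence, the required sample size, n ~ 1,000 data points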
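The same arithmetic can be written as a small reusable function. This is a sketch under the article's assumptions (known population standard deviation, two-sided test). The function name `sample_size_for_mean` and its parameter names are ours, and the critical value is taken from the normal distribution rather than hard-coded as 1.96.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Sample size n at which a shift of `margin` in the mean sits exactly
    on the two-sided critical value: n = (z * sigma / margin) ** 2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    return (z * sigma / margin) ** 2


n_exact = sample_size_for_mean(sigma=4034, margin=10_250 - 10_000)
n_required = math.ceil(n_exact)  # round up to a whole number of data points
print(f"n = {n_exact:.1f}, rounded up to {n_required}")
```

The result lands just above 1,000, in line with the figure above. Passing a higher `confidence` (say 0.99) or a smaller `margin` increases the required sample size.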
There are several other robust methods for computing an appropriate sample size, such as Mead’s resource equation, the CDF (cumulative distribution function), or the power of the test (i.e., the probability of not committing a Type II error), which increases with greater sample sizes. Explaining these methods may get a little tedious for non-statistical readers, hence they are not discussed in detail here.
The objective of all these methods is to minimize the errors that arise from estimating with samples, so that a sample can be treated as representative of the population with a high level of confidence and used to build models that are as accurate as possible. To know more, get in touch with our Predictive Analytics team and get your queries solved.
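For readers who want a quick feel for the power-based view without the theory, the sketch below reuses the website example. It relies on the standard power formula for a two-sided z-test with known standard deviation, which is our choice of illustration rather than something derived in this article. The function name `power_two_sided_z` and its defaults are assumptions.

```python
from statistics import NormalDist


def power_two_sided_z(n, sigma=4034, shift=250, alpha=0.05):
    """Probability that a two-sided z-test with n data points detects a
    true shift of `shift` in the mean (i.e. avoids a Type II error)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standardized_shift = shift * n ** 0.5 / sigma
    return (std_normal.cdf(-z_crit + standardized_shift)
            + std_normal.cdf(-z_crit - standardized_shift))


for n in (250, 1_000, 2_000, 4_000):
    print(f"n = {n:>5,}: power = {power_two_sided_z(n):.3f}")
```

Power rises steadily with the sample size. At about 1,000 data points it is roughly one half, so a test that should reliably detect a 250-visit shift would need a noticeably larger sample.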