A long time ago, there lived a beautiful Hypothesis. It could predict everything: the weather, the stock market, the World Wars. But no one believed it. Why? Because the Hypothesis was not validated.
Our world is very cruel to Hypotheses. They are tested time and again. They are judged. You will meet many Hypotheses in bars telling you that they feel used. They curse their parents, the Data Scientists, for putting them through rigorous tests. But the ones that come out the other end of these rigorous tests earn the title of ‘The Validated’.
If you are to be a Data Scientist, you will also need to test Hypotheses: some your own, and others that you wish to build on to make an amazing Mathematical Model.
But what exactly is a Hypothesis, and what is Hypothesis Testing?
Understanding Hypothesis
A Hypothesis is just an assumption, like the Earth is round or people who eat junk food are more likely to be obese. And Hypothesis Testing is just designing a test for this Hypothesis, carrying out the test, and determining whether the Hypothesis is correct or not.
The definition may be trivial, but the impact is not. Look at how Atomic theory developed. First, people believed the Atom was indivisible. Then came J.J. Thomson, destroying this Hypothesis by discovering the Electron and giving us the Plum Pudding Model. Then came Rutherford, and then Bohr, to provide us with the Classical Atomic Model where Electrons revolve in peace around a Nucleus. Finally, Schrödinger and his contemporaries gave us the Quantum Model. Over time, different Hypotheses were proposed, supported by the evidence of their day, and later replaced when proved incorrect.
And that is how any field progresses. New Theories are formed and tested, and the old ones are disproved. So, Hypotheses and Hypothesis Testing are not just mathematical concepts. They are analytical thinking put into practice, which makes them crucial for any Data Scientist to understand and apply.
Now that we understand how crucial Hypotheses and Hypothesis Testing are, let’s find out how to form one. It starts with asking an effective question that the Hypothesis will answer, like how many people will buy a soap if it smells like butterscotch ice cream, or what an Atom looks like. Then, you gather the information related to the question. A scientist would then analyze the information to form an answer to this initial question. This answer is the Hypothesis.
Null Hypothesis
Let’s understand this better with the Atomic Theory. The Greeks, and scientists before 1897, hypothesized that the Atom was an indivisible particle, and this can be called their Null Hypothesis. A Null Hypothesis is the one assumed to be true unless proved incorrect. As no one other than God was able to split the Atom at that time (and he did not tell others), the Hypothesis that the Atom was indivisible was assumed to be true.
In 1897, J.J. Thomson discovered the Electron. His experiment showed that the Atom has smaller parts, thus disproving the Null Hypothesis. This led to the acceptance of another Hypothesis by J.J. Thomson: the Atom is divisible, and it looks like a plum pudding with Electrons stuck inside. J.J. Thomson’s own student, Rutherford, later disproved this sweet Hypothesis.
Alternate Hypothesis
This illustrates the concept of an Alternate Hypothesis. If your experiment disproves your Null Hypothesis, the Alternate Hypothesis is accepted as the true one.
Since mathematicians and scientists like to write things in a different way, the Null Hypothesis is conventionally written as H0 and Alternate Hypothesis as H1.
Be careful while forming the Null Hypothesis. While surveying Texas, you could very well prove that the only language spoken is English. That could have a disastrous impact if you then put up promotion boards for your Chocolate Brownie across the world only in English. People from cultures where English is not spoken and chocolate is not eaten may think you are packaging poop.
Error
What do you call the above scenario? A mistake, stupidity, being a buffoon? Scientists have a better term for it. They call it an Error. They have also classified stupidity (I mean Error) into two classes:
Type I Error: A False Positive. The Null Hypothesis was correct, but you declared it incorrect.
Type II Error: A False Negative. The Null Hypothesis was incorrect, but you accepted it as correct.
It does not take a genius to figure out that business decisions based on an Erroneous Hypothesis won’t put you on the cover of a Business Magazine. A Type I Error means you make a change based on your Alternate Hypothesis when no change was needed. A Type II Error means you stick with the Null Hypothesis and miss a change you should have made.
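The two classes of Error can be written down as a tiny decision table. The Python sketch below is only an illustration: the function name `classify_outcome` and its arguments are made up for this example, and it uses the standard statistical names, Type I for a False Positive and Type II for a False Negative.

```python
def classify_outcome(null_is_true: bool, null_was_rejected: bool) -> str:
    """Label a test outcome, given the (normally unknown) truth about H0."""
    if null_is_true and null_was_rejected:
        # H0 was fine, but we threw it out anyway.
        return "Type I Error (False Positive)"
    if not null_is_true and not null_was_rejected:
        # H0 was wrong, but we kept it.
        return "Type II Error (False Negative)"
    return "Correct decision"


# Walk through all four combinations of truth and decision.
for null_is_true in (True, False):
    for null_was_rejected in (True, False):
        print(null_is_true, null_was_rejected,
              classify_outcome(null_is_true, null_was_rejected))
```

In real life you never get to pass in `null_is_true`, which is exactly why these Errors are hard to spot.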
Significance level and P-value
But let’s say you have your Hypothesis and have carried out your tests, and you have your result. How do you know whether you passed or failed? It is just like any other exam you took. If your marks are above a certain number, you pass; otherwise, you fail. That number, in the case of Hypothesis Testing, is determined by the Significance level and the P-value.
The first thing to keep in mind is that instead of trying to prove something, you accept the Null Hypothesis as correct until it is disproved, just like everybody is innocent until proven guilty. The Significance level is the probability of rejecting the Null Hypothesis when it is actually true, which is a risk you agree to accept before running the test. Known as alpha or α, it sets the strength of the evidence that must be present in your sample before you reject the Null Hypothesis and conclude that the effect is statistically significant. A common choice is α = 0.05.
In simpler terms, let’s say your Null Hypothesis is a rule for finding Oranges among fruits. H0 states that every fruit that is round and orange-colored is an Orange. You run your tests and find that some fruits have been incorrectly classified as Oranges, say around 30 out of 100. It looks like your Hypothesis was surely incorrect, but what if only 5 were misclassified? You need to define the threshold at which you decide that your Hypothesis was incorrect. That threshold comes from your Significance level, α.
So, you have your Significance level. Now, what is the P-value? It is the probability of getting a result at least as extreme as the one you observed, assuming the Null Hypothesis is true. Let us say that, following your Hypothesis, you pick up a fruit that fits the bill, and it turns out it is not an Orange (it is a tangerine). A few such mistakes could be plain bad luck. The P-value tells you how likely it is to see as many mistakes as you did (the 30 fails out of 100 in the previous example) if H0 were still a good rule. If the P-value is smaller than α, you reject the Null Hypothesis.
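Here is a minimal Python sketch of how a P-value and α could be compared for the Orange example. It rests on an assumption that is not in the article: H0 is taken to claim that the rule misclassifies 20% of fruits at most, and the test asks how likely it is to see this many mistakes or more at that rate. The 20% figure, α = 0.05, and the function name `binomial_p_value` are all hypothetical choices made for illustration.

```python
from math import comb


def binomial_p_value(mistakes: int, sample_size: int, claimed_error_rate: float) -> float:
    """One-sided exact P-value: the chance of seeing `mistakes` or more
    misclassified fruits if the true error rate is `claimed_error_rate`."""
    return sum(
        comb(sample_size, k)
        * claimed_error_rate ** k
        * (1 - claimed_error_rate) ** (sample_size - k)
        for k in range(mistakes, sample_size + 1)
    )


ALPHA = 0.05               # Significance level, chosen before the test (assumed)
CLAIMED_ERROR_RATE = 0.20  # hypothetical error rate claimed under H0

for mistakes in (5, 30):
    p_value = binomial_p_value(mistakes, 100, CLAIMED_ERROR_RATE)
    verdict = "reject H0" if p_value < ALPHA else "fail to reject H0"
    print(f"{mistakes} mistakes out of 100: P-value = {p_value:.4f} -> {verdict}")
```

With these assumed numbers, 30 mistakes out of 100 gives a P-value below α, so H0 is rejected, while 5 mistakes out of 100 does not.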
Confidence Interval
To know if your Hypothesis passed or failed, you compare your alpha and P-value and find out if your Null Hypothesis has what it takes to make it in the world. But what if your test result was a fluke? When you run the test again, you find that instead of 30, 70 fruits were identified incorrectly, and the next time, only 10 were incorrect. That is where the Confidence Interval comes in. It is a range of values, calculated from your sample, that is likely to contain the true value, so it tells you how far the result may move if you run the test again.
This ambiguity for Oranges occurred because you did not run the test on all available fruits around the globe but on a sample of them (let us say all fruits in France). If you ever tested all the fruits, you would be 100% correct and the owner of a very expensive test. To save time, effort, and money, we take only a sample and run the tests on it. All we need to do is make sure that the sample size (the number of items sampled) is large enough that the output of the test stays nearly the same. The amount of faith we have that the interval contains the true value is called the Confidence Level, for example 95%. A narrower Confidence Interval at the same Confidence Level means that you need to take a bigger sample.
That is most of what Hypothesis Testing is. You establish a Hypothesis. You run a test on the sample. You compare the result to what you were expecting and find out if you were correct. The rest is just inserting values into a preset mathematical formula (which you don’t have to write yourself or even calculate). When it comes to Statistics in Data Science, it is never as hard as it sounds.
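The link between sample size and the width of the interval can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a method taken from the article: it uses the simple normal-approximation interval for a proportion, keeps the observed mistake rate fixed at 30%, and the function name `proportion_confidence_interval` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(mistakes, sample_size, confidence_level=0.95):
    """Normal-approximation interval for the true mistake rate."""
    rate = mistakes / sample_size
    # z is roughly 1.96 for a 95% Confidence Level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    margin = z * sqrt(rate * (1 - rate) / sample_size)
    # Clamp to the valid range for a proportion.
    return max(0.0, rate - margin), min(1.0, rate + margin)


# Same observed mistake rate (30%), but growing sample sizes.
for sample_size in (100, 1_000, 10_000):
    mistakes = sample_size * 30 // 100
    low, high = proportion_confidence_interval(mistakes, sample_size)
    print(f"n = {sample_size:>6}: interval = ({low:.3f}, {high:.3f}), "
          f"width = {high - low:.3f}")
```

Each tenfold increase in the sample makes the interval noticeably narrower, which is the trade-off between certainty and the cost of testing more fruit.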
Remember to practice more on this at a place focused on Data Science. Cause you don’t need to know more than what you will use as a Data Scientist, or at least to sound smart.
Data Science with InfosecTrain
Consider taking a Data Science training course at InfosecTrain to learn about the information and abilities needed to land one of today’s hottest careers. The Data Science course is intended to have a 360-degree impact on your chances of success in the field. Stay focused, and you can accomplish any goal you want. We wish you the best in your journey!