Exploratory Data Analysis (EDA) Revision IN SHORT
All the topics in EDA for reference (links are attached for in-depth reference)
Hypothesis space
A hypothesis is a candidate model, and space means the set of possibilities. So the hypothesis space is the set of all models that a learning algorithm could produce for the given training dataset. It is defined by the model family and by every value its parameters can take.
Cost Function
The cost function measures how well a machine-learning model performs. Training aims to minimise the cost (regression-based and other error-minimising models) or, in reward-based models, to maximise an objective such as the reward.
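To make the two ideas above concrete, here is a minimal sketch. The data points, the choice of lines through the origin as the hypothesis space, and mean squared error as the cost function are all assumptions made for illustration.

```python
# Hypothetical training data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def mse(slope, xs, ys):
    """Mean squared error of the hypothesis y = slope * x."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# A tiny hypothesis space: lines through the origin with slopes 0.0, 0.5, ..., 4.0.
hypothesis_space = [i * 0.5 for i in range(9)]

# The cost function scores every candidate; "training" keeps the cheapest one.
costs = {slope: mse(slope, xs, ys) for slope in hypothesis_space}
best_slope = min(costs, key=costs.get)
print(best_slope, costs[best_slope])
```

A real learning algorithm searches a far larger (usually continuous) hypothesis space, but the principle is the same: the cost function ranks the candidates.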
Hypothesis Testing
A hypothesis is an assumption. In a hypothesis test, we state two or more mutually exclusive statements about a population and use sample data points to decide between them. As an example of mutually exclusive statements, suppose a population is recorded with two genders (male/female): a person recorded as male cannot also be recorded as female, and vice versa.
2 kinds of hypotheses: Null hypothesis and Alternate hypothesis
The null hypothesis suggests there is no statistically significant relationship between the two variables. The alternate hypothesis suggests there is some statistically significant relationship between the two variables.
For example, will senior citizens invest in fixed deposits (Yes/No)? The null hypothesis is that there is no relationship between the age of the customer and fixed deposit investment. The alternate hypothesis is that there is some kind of relationship between the age of the customer and fixed deposit investment.
Reject or Accept Null Hypothesis: The decision is based on the level of significance. The null hypothesis is rejected when the P value is at or below alpha or, equivalently, when the confidence interval excludes the value the null hypothesis claims. Strictly speaking, we never "accept" the null hypothesis; we only fail to reject it.
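The fixed deposit example can be sketched as a two-proportion z-test. The survey counts below are invented for illustration, and the z-test is only one reasonable choice of test, not necessarily the one a given statistical package would use.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical survey: 120 of 200 senior citizens invest, 90 of 200 others do.
z, p_value = two_proportion_z_test(120, 200, 90, 200)
alpha = 0.05
reject_null = p_value <= alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

With these made-up counts the P value falls below alpha, so the null hypothesis of "no relationship between age group and fixed deposit investment" would be rejected.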
Level of Significance or Significance Level (Alpha)
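The significance level is the probability of rejecting the null hypothesis when it is actually true, that is, of mistaking a chance pattern in the data for a real one.
The level of significance or significance level is denoted by alpha (SL = alpha) and is chosen before the test, commonly 0.05. Most statistical packages report a P value for each variable, which is then compared with alpha.
Suppose the P value of a variable in a dataset is 0.09. This means that, if there were truly no relationship between the independent variable and the dependent variable, a pattern at least this strong would still appear by chance about 9% of the time. It does not mean there is a 91% chance that a relationship exists. With alpha = 0.05, we would fail to reject the null hypothesis.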
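As a small sketch of how a P value of 0.09 is read against different significance levels (the function name `decide` is hypothetical):

```python
def decide(p_value, alpha=0.05):
    """Compare a P value with the chosen significance level."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# The same P value leads to different decisions under different alphas.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: {decide(0.09, alpha)}")
```

This is why alpha should be fixed before looking at the data: picking it afterwards lets the analyst choose the conclusion.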
Confidence Interval (Confidence Level = 1 - Alpha or 1 - Significance Level)
A confidence interval is a range of plausible values for a population parameter, and the confidence level tells us how much confidence we can place in the method that produced it.
For instance, a 95% confidence level means that if we repeat the same experiment or survey over and over again, about 95% of the intervals we compute will contain the true population value. For example, suppose a survey estimates, with a 95% confidence interval, the proportion of senior citizens in the country who reinvest their fixed deposits. If we repeated the survey with the same technique many times, about 95% of the resulting intervals would contain the true proportion.
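The "repeat the survey over and over" idea can be checked with a small simulation. The population mean, standard deviation, sample size, and the normal-approximation interval (z = 1.96) are assumptions chosen for this sketch.

```python
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0     # hypothetical population
SAMPLE_SIZE, REPEATS = 100, 1000
Z_95 = 1.96                         # normal critical value for 95% confidence

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.fmean(sample)
    margin = Z_95 * statistics.stdev(sample) / SAMPLE_SIZE ** 0.5
    # Count the repeats whose interval contains the true population mean.
    if mean - margin <= TRUE_MEAN <= mean + margin:
        covered += 1

coverage = covered / REPEATS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95; it is a property of the procedure across many repeats, not of any single interval.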
Factors affecting Confidence Interval
- Variation: If the values within a sample are similar to one another, the data has low variation, which leads to a narrow confidence interval. A sample with a lot of variation leads to a wider confidence interval.
- Sample Size: Small samples differ more from one sample set to the next, which leads to wider confidence intervals. Large samples are more similar to one another, which leads to narrower confidence intervals.
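Both factors show up directly in the margin of error of a confidence interval for a mean. A minimal sketch, assuming the usual normal-approximation formula z * sd / sqrt(n) with illustrative numbers:

```python
import math

def margin_of_error(sd, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * sd / math.sqrt(n)

# More variation widens the interval; a larger sample narrows it.
print(margin_of_error(sd=10, n=25))    # baseline
print(margin_of_error(sd=20, n=25))    # double the variation: twice as wide
print(margin_of_error(sd=10, n=100))   # four times the sample: half as wide
```

Note the square root: halving the width of the interval requires four times as many observations.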
P Value
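A P value is a probability value, so it always lies between 0 and 1. As a reminder of what a probability means, consider a laptop touchpad: if the probability of clicking exactly in the middle of the touchpad is 0.8, then out of 100 clicks about 80 will land in the middle. If the probability of clicking the top left corner is 0.1, then out of 100 clicks about 10 will land in the top left corner. In hypothesis testing, the P value is specifically the probability of observing data at least as extreme as our sample, assuming the null hypothesis is true.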
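Continuing the touchpad example, here is a sketch of an exact P value. The null hypothesis (middle clicks happen with probability 0.8) comes from the example above; the observed count of 70 middle clicks out of 100 is invented, and the one-sided binomial test is an assumed choice of test.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def p_value_at_most(k_observed, n, p_null):
    """One-sided P value: chance of k_observed or fewer successes under the null."""
    return sum(binom_pmf(k, n, p_null) for k in range(k_observed + 1))

# Null hypothesis: 80% of clicks land in the middle. Observed: 70 of 100.
p_value = p_value_at_most(70, 100, 0.8)
print(f"P value = {p_value:.4f}")
```

A small P value here says that 70 or fewer middle clicks would be unusual if the true rate really were 0.8, which is evidence against the null hypothesis.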