All the things I am writing good or bad
Data Analysis Tools - Week 2
Data for this study comes from the Gapminder World Dataset collected by the Gapminder Foundation. The dataset contains data collected from more than 200 countries/areas for more than 500 variables.
Below is the description of the variables:
Continents - The continent each country belongs to. Continents is the explanatory variable.
Female Employment Rate (variable code: femaleemployrate, Unit: Percentage) - Employed females (age > 15) as a percentage of the total female population. Female Employment Rate is the response variable.
Start with the import, then load the data:
import pandas as pd
data = pd.read_csv('gapminder.csv', low_memory=False)
I will use a URL to get the data online.
Join the two dataframes.
We create a dataframe sub from the merged dataframe df_outer.
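The merge step can be sketched as follows. This is a minimal illustration, not the original code: the toy values, the continent column, and the choice of country as the join key are assumptions, and the outer join is inferred only from the name df_outer.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are made up for illustration).
gapminder = pd.DataFrame({
    "country": ["A", "B", "C"],
    "femaleemployrate": [45.0, 72.5, None],
})
continents = pd.DataFrame({
    "country": ["A", "B", "D"],
    "continent": ["Asia", "Europe", "Africa"],  # assumed column name
})

# An outer merge keeps every country that appears in either table;
# values missing on one side become NaN.
df_outer = pd.merge(gapminder, continents, on="country", how="outer")

# Keep only the columns needed for the analysis.
sub = df_outer[["country", "continent", "femaleemployrate"]]
print(sub)
```

An outer join leaves NaN wherever a country is missing from one of the sources, which is why the NA values have to be dropped later.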
Since we are doing a Chi-Square Test of Independence, both the response variable and the explanatory variable have to be categorical.
Binning (categorizing) the response variable:
Here the response variable Female Employment Rate has values between 10 and 90. We categorize it into two categories:
Female Employment Rate less than 70: Categorized as 0 (Low Female Employment Rate)
Female Employment Rate greater than 70: Categorized as 1 (High Female Employment Rate)
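The binning rule above can be sketched like this. The helper categorize and the column name femaleemploy_cat are hypothetical, and treating a rate of exactly 70 as High is an assumption, since the rule above does not say which category that value falls in.

```python
import pandas as pd

# Toy rates for illustration only; None stands for a missing value.
rates = pd.DataFrame({"femaleemployrate": [12.5, 48.0, 69.9, 70.0, 83.2, None]})

def categorize(rate, cutoff=70):
    """Return 0 for a low rate, 1 for a high rate, NaN if the rate is missing."""
    if pd.isna(rate):
        return float("nan")
    # Assumption: a rate exactly equal to the cutoff counts as high.
    return 1 if rate >= cutoff else 0

rates["femaleemploy_cat"] = rates["femaleemployrate"].apply(categorize)
print(rates)
```

Keeping missing rates as NaN instead of forcing them into a category lets the later dropna step remove them.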
Select only the Continents and the categorized Female Employment Rate from the data set and drop the NA values.
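A minimal sketch of this selection step, on toy data. The column names continent and femaleemploy_cat are assumptions, not names taken from the original code.

```python
import pandas as pd

# Toy frame with some missing values (None) in both columns of interest.
sub = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", None, "Europe", "Africa"],
    "femaleemploy_cat": [0, 1, None, 1],
})

# Keep the two analysis columns, then drop any row with a missing value
# in either of them.
sub2 = sub[["continent", "femaleemploy_cat"]].dropna()
print(sub2)
```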
Null Hypothesis: There is no association between the Continents and Female Employment Rate.
Alternative Hypothesis: There is an association between the Continents and Female Employment Rate.
The p-value is 0.00011960497881348971 (< 0.05), so we can reject the null hypothesis: the Continents and Female Employment Rate are significantly associated and are not independent.
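A test of this kind can be run with pandas and SciPy roughly as follows. This is a sketch on made-up data, so it will not reproduce the p-value reported above; the column names are assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: three continents with 10 countries each.
toy = pd.DataFrame({
    "continent": ["Africa"] * 10 + ["Asia"] * 10 + ["Europe"] * 10,
    "femaleemploy_cat": [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8 + [0] * 9 + [1] * 1,
})

# Contingency table: response categories as rows, continents as columns.
ct = pd.crosstab(toy["femaleemploy_cat"], toy["continent"])
print(ct)

# chi2_contingency returns the test statistic, the p-value, the degrees
# of freedom, and the table of expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(ct)
print("chi-square:", chi2, "p-value:", p_value, "dof:", dof)
```

The degrees of freedom equal (rows - 1) x (columns - 1) of the contingency table.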
Since the explanatory variable has more than 2 levels and the statistical test is significant, we need to perform a post hoc test.
We need to perform 15 pairwise comparisons for the six levels of the explanatory variable (6 x 5 / 2 = 15).
Hence the significance threshold after the Bonferroni Adjustment is 0.05/15 = 0.003333.
Each pairwise p-value is compared against this adjusted threshold of 0.003333.
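The arithmetic behind the adjusted threshold can be checked with a few lines of standard-library Python. The six continent labels are an assumption, chosen because six groups are what yield 15 pairs.

```python
from itertools import combinations
from math import comb

# Assumed group labels; only the number of groups matters here.
levels = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]

# Number of unordered pairs among k groups is k * (k - 1) / 2.
n_comparisons = comb(len(levels), 2)
pairs = list(combinations(levels, 2))

# Bonferroni: divide the overall alpha by the number of comparisons.
alpha = 0.05
adjusted_alpha = alpha / n_comparisons

print(n_comparisons, round(adjusted_alpha, 6))
```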
The p-values for Comparison 11 (between Europe and Oceania: 0.001892) and Comparison 7 (between Asia and Oceania: 6.15627e-05) are less than the Bonferroni-adjusted threshold of 0.0033.
Thus we can reject the null hypothesis only for these two comparisons.
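The post hoc procedure can be sketched as a loop over every pair of groups, running a separate Chi-Square test on each pair and comparing its p-value to the Bonferroni threshold. The data below is made up and uses only three groups, so the comparison numbering and results do not correspond to those reported above; the column names are assumptions.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: three continents with 20 countries each.
toy = pd.DataFrame({
    "continent": ["Africa"] * 20 + ["Asia"] * 20 + ["Europe"] * 20,
    "femaleemploy_cat": [1] * 14 + [0] * 6 + [1] * 4 + [0] * 16 + [1] * 3 + [0] * 17,
})

groups = sorted(toy["continent"].unique())
pairs = list(combinations(groups, 2))
threshold = 0.05 / len(pairs)  # Bonferroni-adjusted threshold

results = {}
for a, b in pairs:
    # Restrict the data to the two groups being compared.
    pair_data = toy[toy["continent"].isin([a, b])]
    ct = pd.crosstab(pair_data["femaleemploy_cat"], pair_data["continent"])
    chi2, p, dof, _ = chi2_contingency(ct)
    results[(a, b)] = (p, bool(p < threshold))
    print(f"{a} vs {b}: p = {p:.6g}, reject null: {p < threshold}")
```

Only pairs whose p-value falls below the adjusted threshold count as significantly different.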