Daddy Awesome's Writings

All the things I am writing good or bad


Project maintained by daddyawesome. Hosted on GitHub Pages.

Chi-Square Test of Independence on Female Employment Rate by Continent

Data Analysis Tools - Week 2


Data

Data for this study comes from the Gapminder World Dataset collected by the Gapminder Foundation. The Gapminder World Dataset contains data collected from more than 200 countries/areas for more than 500 variables.

Description of Variables

The two variables used in this analysis are described below.

  1. Continents - The continent each country belongs to. Continents is the explanatory variable.

  2. Female Employment Rate (variable code: femaleemployrate, Unit: Percentage) - Employed females (age > 15) as a percentage of the total female population. Female Employment Rate is the response variable.

Start with the import

import pandas as pd

Load the Gapminder dataset

data = pd.read_csv('gapminder.csv', low_memory=False)

I will be using a URL to get the data online


Join the two dataframes
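A minimal sketch of the join step, on made-up data. The name `df_outer` comes from this post and suggests an outer join; the key column `country`, the `continent` column, and the toy values are my assumptions, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
gap = pd.DataFrame({
    "country": ["A", "B", "C"],
    "femaleemployrate": [35.0, 72.5, 51.2],
})
continents = pd.DataFrame({
    "country": ["A", "B", "D"],
    "continent": ["Asia", "Africa", "Europe"],
})

# An outer join keeps every country from both tables.
# Cells with no match on the other side become NaN.
df_outer = pd.merge(gap, continents, on="country", how="outer")
print(df_outer)
```

With an outer join, countries that appear in only one table survive the merge with missing values, which is why the NA values have to be dropped later.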


New DataFrame for Analysis

We create a dataframe sub from the merged dataframe df_outer.

Binning the variables

Since we are doing a Chi-Square test of independence, both the response variable and the explanatory variable have to be categorical.

Binning (categorizing) the response variable:

Here the response variable Female Employment Rate has values between 10 and 90. We categorize it into two categories.

Female Employment Rate less than 70 : Categorized as 0 (Low Female Employment Rate)

Female Employment Rate greater than 70 : Categorized as 1 (High Female Employment Rate)
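The binning can be sketched like this. The column name `femaleemploy_cat` and the toy values are hypothetical, and I am assuming a rate of exactly 70 goes into the high group, since the rule above does not say where the boundary falls.

```python
import numpy as np
import pandas as pd

# Made-up rates for illustration, including a missing value.
rates = pd.Series([12.0, 45.5, 69.9, 70.0, 83.1, np.nan], name="femaleemployrate")

def categorize(rate):
    """Return 0 for a low rate, 1 for a high rate, NaN when the rate is missing."""
    if pd.isna(rate):
        return np.nan
    # Assumption: the boundary value 70 is counted as high.
    return 1 if rate >= 70 else 0

femaleemploy_cat = rates.apply(categorize)
print(femaleemploy_cat.tolist())
```

Missing rates stay missing instead of being pushed into one of the two groups, so they can be dropped in the next step.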



Choosing the required variables

We select only the Continents and the categorized Female Employment Rate from the dataset and drop the NA values.
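A small sketch of this selection step on toy data. The column names `continent` and `femaleemploy_cat` and the result name `sub_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy version of the working dataframe; values are made up for illustration.
sub = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Africa", None, "Europe"],
    "femaleemployrate": [35.0, 72.5, 51.2, np.nan],
    "femaleemploy_cat": [0, 1, 0, np.nan],
})

# Keep only the two categorical columns, then drop rows missing either one.
sub_clean = sub[["continent", "femaleemploy_cat"]].dropna()
print(sub_clean)
```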

Chi Square Test of Independence

Hypothesis Testing

Null Hypothesis: There is no association between the Continents and Female Employment Rate.

Alternative Hypothesis: There is an association between the Continents and Female Employment Rate.


Chi Square Test of Independence


The p-value is 0.00011960497881348971 (< 0.05), so we reject the null hypothesis and conclude that the Continents and Female Employment Rate are significantly associated. They are not independent.
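One way to run a test like this is with a contingency table from `pd.crosstab` and `scipy.stats.chi2_contingency`. This is a sketch on made-up counts, not the Gapminder data, so it will not reproduce the p-value above; the column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 10 countries per continent with different shares of "high" (1).
sub = pd.DataFrame({
    "continent": ["Africa"] * 10 + ["Asia"] * 10 + ["Europe"] * 10,
    "femaleemploy_cat": [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8 + [1] * 1 + [0] * 9,
})

# Rows: response categories (0 = low, 1 = high). Columns: continents.
table = pd.crosstab(sub["femaleemploy_cat"], sub["continent"])
print(table)

# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected counts under independence.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.4f}, p-value = {p:.6f}, dof = {dof}")
```

The degrees of freedom are (rows - 1) × (columns - 1), so a 2 × 3 table like this toy one has 2.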

POST HOC Test

Since the explanatory variable has more than 2 levels and the statistical test is significant, we need to perform a post hoc test to find out which pairs of continents differ.

Bonferroni Adjustment

We need to perform 15 pairwise comparisons, which is the number of pairs among the six levels of the explanatory variable (6 × 5 / 2 = 15).

Hence the Bonferroni-adjusted significance level is 0.05/15 = 0.003333.

Each pairwise p-value has to be compared against this adjusted level of 0.003333.
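The pairwise comparisons can be sketched as a loop over every pair of continents. The counts below are made up for illustration, and the six continent names and column names are assumptions, so the p-values will not match the ones reported in this post.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Made-up (high, low) counts per continent, for illustration only.
counts = {
    "Africa": (12, 8),
    "Asia": (3, 17),
    "Europe": (5, 15),
    "North America": (6, 14),
    "Oceania": (9, 3),
    "South America": (4, 10),
}
rows = []
for continent, (n_high, n_low) in counts.items():
    rows += [(continent, 1)] * n_high + [(continent, 0)] * n_low
sub = pd.DataFrame(rows, columns=["continent", "femaleemploy_cat"])

# Every unordered pair of levels: 6 levels give 15 pairs.
pairs = list(combinations(sorted(sub["continent"].unique()), 2))
alpha_adjusted = 0.05 / len(pairs)  # Bonferroni: 0.05 / 15

results = []
for a, b in pairs:
    pair_df = sub[sub["continent"].isin([a, b])]
    table = pd.crosstab(pair_df["femaleemploy_cat"], pair_df["continent"])
    # For 2 x 2 tables SciPy applies Yates' continuity correction by default.
    chi2, p, dof, expected = chi2_contingency(table)
    results.append((a, b, p, bool(p < alpha_adjusted)))

for a, b, p, significant in results:
    print(f"{a} vs {b}: p = {p:.6f}, reject H0: {significant}")
```

Dividing 0.05 by the number of comparisons keeps the chance of at least one false positive across all 15 tests at or below 5%.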

The p-values for Comparison 11 (Between Europe and Oceania : 0.001892) and Comparison 7 (Between Asia and Oceania : 6.15627e-05) are less than the Bonferroni-adjusted significance level of 0.0033.

Thus we can reject the null hypothesis only for these two comparisons.