Regression Modeling in Practice
Test a Basic Linear Regression Model.
About
This notebook demonstrates a basic linear regression model using Google Colab.
Week 2: Test a Basic Linear Regression Model
We are now going to test a basic linear regression model to answer the question: what is the relationship between internet use rate and income per person in Asian countries?
Before running the model, we need to center the explanatory variable, income per person, by subtracting its mean so that the centered variable has a mean of 0.
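To illustrate what centering does to the fitted line, here is a minimal sketch on made-up numbers (the `income` and `internet` lists and the `fit_line` helper are hypothetical, not Gapminder values or part of the notebook). It fits a least-squares line by hand before and after centering: the slope should be unchanged, and the intercept should become the mean of the response.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical values for illustration only
income = [500.0, 1200.0, 3000.0, 8000.0, 20000.0]
internet = [5.0, 12.0, 25.0, 40.0, 70.0]

# Centering: subtract the mean from every value
income_centered = [v - mean(income) for v in income]

slope_raw, intercept_raw = fit_line(income, internet)
slope_ctr, intercept_ctr = fit_line(income_centered, internet)

print('slope (raw, centered):', slope_raw, slope_ctr)
print('intercept (raw, centered):', intercept_raw, intercept_ctr)
print('mean of response:', mean(internet))
```

Centering only shifts the x-axis, so it changes how the intercept is read (the predicted value at the mean income rather than at an income of 0) without changing the estimated association.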
Data
Data for this study come from the Gapminder World Dataset collected by the Gapminder Foundation. The Gapminder World Dataset contains data collected from more than 200 countries/areas for more than 500 variables.
Description of Variables
Below is a description of the variables.
- Internet Use Rate
The internet use rate of a country was collected by the World Bank in their World Development Indicators.
- Income per Person
Income per person is simply Gross Domestic Product per capita (the country’s total, country-wide income divided by the population).
- Continents (I will use this data to select the Asian countries from Gapminder)
First, start with the imports.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# df as dataframe from gapminder dataset
url = 'https://raw.githubusercontent.com/daddyawesome/Coursera_Capstone/master/data/gapminder.csv'
df = pd.read_csv(url)
df.head()
# df_continent as dataframe for teh continent data set
url2='https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv'
df_continent = pd.read_csv(url2)
df_continent = df_continent.rename(columns={'Country': 'country'})
df_continent.head()
df_outer = pd.merge(df_continent, df, on='country', how='outer')
df_outer.head()
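An outer merge keeps rows from both files, so a country whose name is spelled differently in the two sources will not match and ends up with a missing continent or missing indicators. A minimal sketch of how one might check for this, using tiny made-up frames (`continents` and `indicators` are hypothetical stand-ins for df_continent and df, and the values are invented):

```python
import pandas as pd

# Tiny hypothetical frames standing in for df_continent and df
continents = pd.DataFrame({
    'country': ['Japan', 'Burma (Myanmar)', 'India'],
    'Continent': ['Asia', 'Asia', 'Asia'],
})
indicators = pd.DataFrame({
    'country': ['Japan', 'Myanmar', 'India'],
    'internetuserate': [80.0, 1.0, 8.0],  # made-up values
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(continents, indicators, on='country', how='outer', indicator=True)

# Rows that were found in only one of the two files
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['country', '_merge']])
```

Rows that appear in only one file are later removed by dropna(), so a check like this shows which countries would be silently lost from the analysis.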
sub = df_outer[['Continent','incomeperperson','internetuserate']].dropna()
sub.head()
Set the variables to numeric
sub['internetuserate'] = pd.to_numeric(sub['internetuserate'], errors='coerce')
sub['incomeperperson'] = pd.to_numeric(sub['incomeperperson'], errors='coerce')
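As a small illustration of what errors='coerce' does, the sketch below converts a made-up text column (the `raw` series is hypothetical, not taken from the Gapminder file). Entries that cannot be parsed as numbers become NaN instead of raising an error, which is why a dropna() is still needed after the conversion.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, with a blank and a stray label
raw = pd.Series(['12.5', ' ', '40', 'n/a'])

# Unparseable entries are turned into NaN rather than raising an exception
clean = pd.to_numeric(raw, errors='coerce')

print(clean)
print('missing after coercion:', clean.isna().sum())
```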
We only need Asian countries, so we create another dataframe for them and name it sub_asia.
# Drop rows with missing values, then keep only the Asian countries
df_clean=sub.dropna()
sub_asia=df_clean[(df_clean['Continent']== 'Asia')]
sub_asia.head()
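# Center income per person by subtracting its mean
data_centered = sub_asia.copy()
data_centered['incomeperperson'] = data_centered['incomeperperson'] - data_centered['incomeperperson'].mean()
# The mean of the centered variable should be 0, up to floating-point error
print('Mean of centered incomeperperson:', data_centered['incomeperperson'].mean())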
scat1 = sns.regplot(x="incomeperperson", y="internetuserate", scatter=True, data=data_centered)
plt.xlabel('Income Per Person')
plt.ylabel('Internet Use Rate')
plt.title('Scatterplot for the Association Between Income Per Person and Internet Use Rate in Asia')
plt.show()
reg1 = smf.ols('internetuserate ~ incomeperperson', data=data_centered).fit()
reg1.summary()
Note that in the scatterplot most data points are clustered near, but below, 0 on the centered income axis, with a long tail reaching to the right. This is consistent with the mean being centered at 0, and it indicates that most countries have low incomes while a few have very high incomes.
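The pattern described above is what one would expect from right-skewed data: a few very large values pull the mean above the median, so most centered values fall below 0. A minimal sketch with made-up incomes (the `incomes` list is hypothetical, not Gapminder data):

```python
from statistics import mean, median

# Hypothetical right-skewed incomes: many low values, a few very high ones
incomes = [400, 600, 900, 1500, 2500, 4000, 30000, 45000]

# Center by subtracting the mean, then count the values that land below 0
centered = [v - mean(incomes) for v in incomes]
below_zero = sum(1 for v in centered if v < 0)

print('mean:', mean(incomes), 'median:', median(incomes))
print('points below 0 after centering:', below_zero, 'of', len(centered))
```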
The results indicate that income per person is significantly and positively associated with internet use rate in a country, according to the equation [internet use rate] = 0.002 * [income per person] + 30.362. Because income per person is centered, the intercept of 30.362 is the predicted internet use rate for a country at the mean income of the sample.
I suspect a straight line is not the best possible fit, since the relationship appears to be logarithmic in shape, but for the sake of this demonstration a linear fit is fine.
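To make the fitted equation concrete, the sketch below plugs a few values into it. The coefficients are the rounded ones reported above; the function name `predict_internet_use` and the example deviations are hypothetical and chosen only for illustration. Because the model was fit on centered income, the input is the difference between a country's income per person and the sample mean.

```python
# Coefficients as reported in the text (rounded in the source)
SLOPE = 0.002
INTERCEPT = 30.362

def predict_internet_use(centered_income):
    """Predicted internet use rate for an income given as a deviation from the sample mean."""
    return SLOPE * centered_income + INTERCEPT

# Deviations from the mean income: below the mean, at the mean, above the mean
for deviation in (-2000, 0, 5000):
    print(deviation, round(predict_internet_use(deviation), 3))
```

A deviation of 0 returns the intercept, and each additional 1,000 units of income per person adds about 2 to the predicted internet use rate under this linear fit.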