Correlações

Criamos umas sequências de dados artificiais ...

In [66]:
import numpy as np

# data (imaginary!)
N = 100 # sample dimension
X, Y  = [], [] # observables 
for n in range(N):
    x = np.random.normal(0,1)# random d's between -1 and 1
    y = 50 * x   + 30 * np.random.normal(0,1)
    X, Y = X + [x] , Y + [y]

Um scatter plot dos dados é obtido usando o método scatter da interface matplotlib.pyplot da livraria matplotlib

In [67]:
from matplotlib import pyplot as plt
ax = plt.scatter(X, Y, s=20, label='data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

O coeficiente de correlação de Pearson e o $p$-value podem ser calculados usando o método pearson do módulo scipy.stats

In [68]:
import scipy.stats as stat
R = stat.pearsonr(X,Y)
R
Out[68]:
(0.8585717029671287, 3.4353675521924076e-30)

O coeficiente de correlação de Spearman e o $p$-value podem ser calculados usando o método spearmanr do módulo scipy.stats

In [69]:
import scipy.stats as stat
rho = stat.spearmanr(X,Y)
rho
Out[69]:
SpearmanrResult(correlation=0.8362676267626762, pvalue=2.5496594516310923e-27)

Outra possibilidade é introduzir diretamente os dados como séries, e usar a função OLS (Ordinary Least Squares, ou seja, mínimos quadrados) do módulo statsmodels, que produz uma tabela de resultados mais completa.

Por exemplo, na segunte tabela, produzida por Galileo, foram registadas as alturas iniciais $x$ e as distâncias atingidas $y$ de um objeto lançado com a mesma força horizontal.

In [79]:
import statsmodels.api as sm
X = [100, 200, 300, 450, 600, 800, 1000] # valores das alturas iniciais
Y = [253, 337, 395, 451, 495, 534, 574 ] # valores das distâncias atingidas
model = sm.OLS(Y, sm.add_constant(X), missing='drop') # modelo Y=hX+b, os dados em faltas (com NANs) são ignorados
results = model.fit()
print(results.summary()) # print results inline
results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.913
Method:                 Least Squares   F-statistic:                     63.88
Date:                Sun, 10 Jan 2021   Prob (F-statistic):           0.000495
Time:                        18:36:04   Log-Likelihood:                -33.336
No. Observations:                   7   AIC:                             70.67
Df Residuals:                       5   BIC:                             70.56
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        269.4661     24.184     11.142      0.000     207.299     331.634
x1             0.3341      0.042      7.992      0.000       0.227       0.442
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.843
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.687
Skew:                          -0.553   Prob(JB):                        0.709
Kurtosis:                       1.935   Cond. No.                     1.10e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
/opt/anaconda3/lib/python3.7/site-packages/statsmodels/stats/stattools.py:71: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)
/opt/anaconda3/lib/python3.7/site-packages/statsmodels/stats/stattools.py:71: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)
Out[79]:
OLS Regression Results
Dep. Variable: y R-squared: 0.927
Model: OLS Adj. R-squared: 0.913
Method: Least Squares F-statistic: 63.88
Date: Sun, 10 Jan 2021 Prob (F-statistic): 0.000495
Time: 18:36:04 Log-Likelihood: -33.336
No. Observations: 7 AIC: 70.67
Df Residuals: 5 BIC: 70.56
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 269.4661 24.184 11.142 0.000 207.299 331.634
x1 0.3341 0.042 7.992 0.000 0.227 0.442
Omnibus: nan Durbin-Watson: 0.843
Prob(Omnibus): nan Jarque-Bera (JB): 0.687
Skew: -0.553 Prob(JB): 0.709
Kurtosis: 1.935 Cond. No. 1.10e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

O valor de $R^2$, assim como o $p$-value, parecem sugerir uma boa correlação linear, embora um gráfico sugere claramente uma lei diferente da lei linear.

In [81]:
from seaborn import regplot
ax3 = regplot(x=X, y=Y)

Mais uma possibilidade é, naturalmente, importar um ficheiro de dados num DataFrame...

Por exemplo o seguinte ficheiro, de Our World in Data, relaciona a pobreza (definida como a percentagem de população que vive com menos de 3.1 dolares por dia) e a expetativa de vida.

In [73]:
import pandas as pd
df = pd.read_csv('/Users/sal/Lectures/mqq/Data/poverty-vs-life-expectancy.csv')
df
Out[73]:
Entity Code Year Total population (Gapminder, HYDE & UN) Continent Life expectancy at birth, total (years) Poverty headcount ratio at $3.10 a day (2011 PPP) (% of population)
0 Abkhazia OWID_ABK 2015 NaN Asia NaN NaN
1 Afghanistan AFG 1800 3280000.0 NaN NaN NaN
2 Afghanistan AFG 1801 3280000.0 NaN NaN NaN
3 Afghanistan AFG 1802 3280000.0 NaN NaN NaN
4 Afghanistan AFG 1803 3280000.0 NaN NaN NaN
... ... ... ... ... ... ... ...
49529 Zimbabwe ZWE 2016 14030000.0 NaN 61.163 NaN
49530 Zimbabwe ZWE 2017 14237000.0 NaN NaN NaN
49531 Zimbabwe ZWE 2018 14439000.0 NaN NaN NaN
49532 Zimbabwe ZWE 2019 14645000.0 NaN NaN NaN
49533 Åland Islands ALA 2015 NaN Europe NaN NaN

49534 rows × 7 columns

In [74]:
df2000 = df[df["Year"] > 2013 ]
df2000
Out[74]:
Entity Code Year Total population (Gapminder, HYDE & UN) Continent Life expectancy at birth, total (years) Poverty headcount ratio at $3.10 a day (2011 PPP) (% of population)
0 Abkhazia OWID_ABK 2015 NaN Asia NaN NaN
215 Afghanistan AFG 2014 33371000.0 NaN 62.895 NaN
216 Afghanistan AFG 2015 34414000.0 Asia 63.288 NaN
217 Afghanistan AFG 2016 35383000.0 NaN 63.673 NaN
218 Afghanistan AFG 2017 36296000.0 NaN NaN NaN
... ... ... ... ... ... ... ...
49529 Zimbabwe ZWE 2016 14030000.0 NaN 61.163 NaN
49530 Zimbabwe ZWE 2017 14237000.0 NaN NaN NaN
49531 Zimbabwe ZWE 2018 14439000.0 NaN NaN NaN
49532 Zimbabwe ZWE 2019 14645000.0 NaN NaN NaN
49533 Åland Islands ALA 2015 NaN Europe NaN NaN

1640 rows × 7 columns

In [75]:
import statsmodels.api as sm
L = df2000.iloc[:,5].values # life expectancy
P = df2000.iloc[:,6].values # poverty
model = sm.OLS(L, sm.add_constant(P), missing='drop') # model L=hP+b, os dados em faltas (com NANs) são ignorados
results = model.fit()
print(results.summary()) # print results inline
results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.549
Model:                            OLS   Adj. R-squared:                  0.533
Method:                 Least Squares   F-statistic:                     35.23
Date:                Sun, 10 Jan 2021   Prob (F-statistic):           1.90e-06
Time:                        18:26:57   Log-Likelihood:                -84.550
No. Observations:                  31   AIC:                             173.1
Df Residuals:                      29   BIC:                             176.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         75.6603      0.923     82.015      0.000      73.774      77.547
x1            -0.2040      0.034     -5.936      0.000      -0.274      -0.134
==============================================================================
Omnibus:                        5.358   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.069   Jarque-Bera (JB):                4.099
Skew:                          -0.877   Prob(JB):                        0.129
Kurtosis:                       3.311   Cond. No.                         36.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[75]:
OLS Regression Results
Dep. Variable: y R-squared: 0.549
Model: OLS Adj. R-squared: 0.533
Method: Least Squares F-statistic: 35.23
Date: Sun, 10 Jan 2021 Prob (F-statistic): 1.90e-06
Time: 18:26:57 Log-Likelihood: -84.550
No. Observations: 31 AIC: 173.1
Df Residuals: 29 BIC: 176.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 75.6603 0.923 82.015 0.000 73.774 77.547
x1 -0.2040 0.034 -5.936 0.000 -0.274 -0.134
Omnibus: 5.358 Durbin-Watson: 1.989
Prob(Omnibus): 0.069 Jarque-Bera (JB): 4.099
Skew: -0.877 Prob(JB): 0.129
Kurtosis: 3.311 Cond. No. 36.0


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [76]:
from seaborn import regplot
ax3 = regplot(x=P, y=L)
In [ ]: