Python scikit-learn (metrics): difference between r2_score and explained_variance_score?

I noticed that ‘r2_score’ and ‘explained_variance_score’ are both built-in sklearn.metrics methods for regression problems.

I was always under the impression that r2_score is the percent variance explained by the model. How is it different from ‘explained_variance_score’?

When would you choose one over the other?

Thanks!

OK, look at this example:

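In [123]:
import numpy as np
from sklearn import metrics
#data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
#rounded to 12 decimals for display
print(round(metrics.explained_variance_score(y_true, y_pred), 12))
print(round(metrics.r2_score(y_true, y_pred), 12))
0.957173447537
0.948608137045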
In [124]:
#what explained_variance_score really is
1-np.cov(np.array(y_true)-np.array(y_pred))/np.cov(y_true)
Out[124]:
0.95717344753747324
In [125]:
#what r^2 really is
1-((np.array(y_true)-np.array(y_pred))**2).sum()/(4*np.array(y_true).std()**2)
Out[125]:
0.94860813704496794
In [126]:
#Notice that the mean residual is not 0
(np.array(y_true)-np.array(y_pred)).mean()
Out[126]:
-0.25
In [127]:
#if the predicted values are different, such that the mean residual IS 0:
y_pred=[2.5, 0.0, 2, 7]
(np.array(y_true)-np.array(y_pred)).mean()
Out[127]:
0.0
In [128]:
#With a zero mean residual, the two scores are identical
print(round(metrics.explained_variance_score(y_true, y_pred), 12))
print(round(metrics.r2_score(y_true, y_pred), 12))
0.982869379015
0.982869379015

So, when the mean residual is 0, they are the same. Which one to choose depends on your needs: is the mean residual supposed to be 0?
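
To make the comparison concrete without relying on the library, here is a minimal pure-Python sketch of the two definitions used in the session above. The helper names (mean, variance, r2, explained_variance) are made up for illustration and are not sklearn functions. They assume the population variance (divide by n), which matches the hand computations above.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance (divide by n), an assumption of this sketch.
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def r2(y_true, y_pred):
    # 1 - (mean squared residual) / (variance of the true values)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - mean([r * r for r in residuals]) / variance(y_true)

def explained_variance(y_true, y_pred):
    # 1 - (variance of the residuals) / (variance of the true values)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)

y_true = [3, -0.5, 2, 7]
biased_pred = [2.5, 0.0, 2, 8]    # mean residual is -0.25
unbiased_pred = [2.5, 0.0, 2, 7]  # mean residual is 0

for name, y_pred in [("biased", biased_pred), ("unbiased", unbiased_pred)]:
    ev = explained_variance(y_true, y_pred)
    r = r2(y_true, y_pred)
    print(name, round(ev, 6), round(r, 6))
# biased 0.957173 0.948608
# unbiased 0.982869 0.982869
```

The two scores differ only for the predictions whose residuals do not average to zero.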

Most of the answers I found (including here) emphasize the difference between R2 and Explained Variance Score, that is: the Mean Residual (i.e. the Mean of Error).

However, there is an important question left behind: why on earth do I need to consider the Mean of Error?


Refresher:

R2 is the Coefficient of Determination, which measures the amount of variation explained by the (least-squares) Linear Regression.

You can look at it from a different angle for the purpose of evaluating the predicted values of y like this:

Variance(actual_y) × R2 = Variance(predicted_y)

So intuitively, the closer R2 is to 1, the closer actual_y and predicted_y are to having the same variance (i.e. the same spread).
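
The identity above holds for an ordinary least-squares fit with an intercept, evaluated on the data it was fitted to. As a rough check, the sketch below fits a line by hand to made-up data and compares both sides. The x and y values and the helper names are illustrative assumptions, not part of the original answer.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance (divide by n).
    m = mean(xs)
    return mean([(v - m) ** 2 for v in xs])

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form least-squares line with an intercept.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
y_hat = [intercept + slope * a for a in x]

# R2 from the usual residual and total sums of squares.
ss_res = sum((b - h) ** 2 for b, h in zip(y, y_hat))
ss_tot = sum((b - my) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot

lhs = variance(y) * r2
rhs = variance(y_hat)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

For arbitrary predictions that do not come from such a fit, this equality is not guaranteed.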


As previously mentioned, the main difference is the Mean of Error; and if we look at the formulas, we find that’s true:

R2 = 1 - [(Sum of Squared Residuals / n) / Variance(y_actual)]

Explained Variance Score = 1 - [Variance(y_predicted - y_actual) / Variance(y_actual)]

in which:

Variance(y_predicted - y_actual) = (Sum of Squared Residuals / n) - (Mean Error)^2

So the only difference is that the Explained Variance Score subtracts the squared Mean Error from the mean squared residual used in the first formula! …But why?
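
As a quick sanity check, the variance of the errors equals the mean squared residual minus the squared mean error, so the gap between the two scores should be the squared mean error divided by the variance of the actual values. The sketch below verifies this on the numbers from the example earlier in the thread. The variable names are chosen for illustration, and population variance (divide by n) is assumed.

```python
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]
mean_sq_error = sum(e * e for e in errors) / n            # Sum of Squared Residuals / n
mean_error = sum(errors) / n                              # Mean Error
var_error = sum((e - mean_error) ** 2 for e in errors) / n

print(mean_sq_error, mean_error ** 2, var_error)  # 0.375 0.0625 0.3125

mean_y = sum(y_true) / n
var_y = sum((t - mean_y) ** 2 for t in y_true) / n

r2 = 1 - mean_sq_error / var_y
explained_variance = 1 - var_error / var_y
gap = explained_variance - r2
bias_term = mean_error ** 2 / var_y
print(gap, bias_term)  # equal up to floating-point rounding
```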


When we compare the R2 Score with the Explained Variance Score, we are basically checking the Mean Error; so if R2 = Explained Variance Score, that means: the Mean Error = Zero!

The Mean Error reflects the tendency of our estimator, that is: Biased vs. Unbiased Estimation.
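
One way to see the bias effect is to add a constant offset to predictions whose errors average to zero. The sketch below does that with hand-written versions of the two scores. The helper names and the offsets are assumptions made for illustration, and population variance (divide by n) is assumed throughout.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance (divide by n).
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def r2(y_true, y_pred):
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - mean([r * r for r in residuals]) / variance(y_true)

def explained_variance(y_true, y_pred):
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 7]  # errors average to zero

results = {}
for offset in (0.0, 1.0, 3.0):
    # A constant offset makes the predictions systematically too high.
    shifted = [p + offset for p in y_pred]
    results[offset] = (explained_variance(y_true, shifted), r2(y_true, shifted))
    print(offset, round(results[offset][0], 4), round(results[offset][1], 4))
```

The Explained Variance Score is unchanged by the offset, while R2 falls as the bias grows and becomes negative for the largest offset here. So R2 is the one to look at if systematic over- or under-prediction should count against the model.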