How to choose error metrics for Classification and Regression


First, we need to understand what error metrics are and why it is important to choose the right one for a machine learning model.

What are error metrics?

After building a machine learning model, we need to check its validity, i.e. how accurate our prediction or classification is. Evaluation or error metrics play an important role in deciding a model’s validity.

Why is it important to select the right one?

If we use the wrong error metric to check a model’s validity, then even if the model shows 99% accuracy or similar results, it will not be of any use. In such a case, even though training or testing accuracy may seem very high, the model will fail to provide appropriate results in a real-time application.

First of all, the question arises: why do we not use the same evaluation or error metrics for classification and regression models?

The answer is that the two produce different kinds of output.

For classification, the output is discrete, i.e. a class, while for regression the output is a continuous value, i.e. the predicted value. As the output type differs between the two problems, we need different metrics to evaluate them.

Error metrics for classification

To understand evaluation methods for classification, let’s look at some use cases first:

  • Suppose you have data on cancer patients and you have to predict whether a person has cancer or not.
  • Suppose you have to design a machine learning model to predict whether a day is a bad day to launch a satellite.
  • Suppose you have the Iris dataset and you have to classify which category a flower belongs to.

The first two are binary classification problems and the third one is multi-class classification.

For these kinds of problems, the following methods can be used for evaluation:

  • Confusion Matrix
  • Classification Accuracy
  • ROC Curve (Area under the curve)
  • F1 score

Confusion Matrix

It is used to evaluate a classification model by showing whether each class is identified correctly or not.

To begin with the confusion matrix, we first need to understand a few terms.

Let’s take the example of cancer detection, where we have to figure out whether a patient has cancer or not.

True Positive: If a patient actually has cancer and our machine learning model diagnoses the patient with cancer, it is a True Positive.

True Negative: If a patient does not have cancer and the model diagnoses the patient as negative, i.e. ‘no cancer’, it is a True Negative.

False Negative: If a patient has cancer but the model diagnoses ‘no cancer’, it is a False Negative. It means our negative detection (‘no cancer’) is wrong, i.e. false.

False Positive: If a patient does not have cancer but the model diagnoses the patient with cancer, it is a False Positive.

                Predicted: Yes     Predicted: No
Yes (Actual)    True Positive      False Negative
No (Actual)     False Positive     True Negative

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

# confusion_matrix expects the true labels first, then the predictions,
# so rows are actual classes and columns are predicted classes
cm = confusion_matrix(y_test, pred_y)
Output:  array([[153,  22],
                [ 35,  85]], dtype=int64)

# Transform to df for easier plotting
# (the label order must match the sorted class labels in y_test)
cm_df = pd.DataFrame(cm,
                     index = ['Survived','Not Survived'],
                     columns = ['Survived','Not Survived'])
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title('Logistic regression \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, pred_y)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
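
To make the four terms concrete, here is a minimal sketch in plain Python (not sklearn) that counts them by hand. The labels are toy values invented for illustration, and `confusion_counts` is a hypothetical helper name, not a library function.

```python
# Toy labels invented for illustration: 1 = has cancer, 0 = no cancer
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN by comparing each actual label with its prediction."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # has cancer, predicted cancer
        elif a == positive:
            fn += 1   # has cancer, predicted no cancer
        elif p == positive:
            fp += 1   # no cancer, predicted cancer
        else:
            tn += 1   # no cancer, predicted no cancer
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts(actual, predicted)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

Every data point falls into exactly one of the four cells, so the four counts always add up to the number of data points.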

Classification Accuracy

Classification accuracy is derived from the confusion matrix. It is the ratio of the sum of the diagonal elements of the confusion matrix to the sum of all its elements.


Accuracy = (TP + TN) / Total data points

Classification accuracy is available as accuracy_score in the sklearn library:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred_y)
Output:  0.8067796610169492  
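
As a rough check of the formula, the same number can be recovered by hand from the four counts in the confusion matrix shown earlier. This is only a sketch: it assumes 153 and 85 are the diagonal (correctly classified) counts and 35 and 22 are the off-diagonal ones.

```python
# Counts taken from the confusion matrix above
correct = 153 + 85      # diagonal elements: TP + TN
incorrect = 35 + 22     # off-diagonal elements: FP + FN

accuracy = correct / (correct + incorrect)
print(round(accuracy, 4))
```

The result agrees with the accuracy_score output to the displayed precision, which confirms that accuracy only looks at the diagonal and ignores which kind of mistake was made.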

Precision and Recall


Recall (Sensitivity/True Positive Rate)

Recall means: among all actual positive values, how many are correctly predicted as positive.


Recall: True Positive/(True Positive + False Negative)

Let’s understand it with an example.

As an illustration, suppose you have to predict whether the weather is good or bad for launching a satellite (you might have seen this in the movie “Mission Mangal”, where the team wanted to know when they could launch their satellite successfully).

In this case, a true positive means the model predicts good weather and the weather is actually good, so the satellite can be launched. If the weather is actually good but the model predicts that it will not be (a false negative), the launch may be postponed. That costs time, but it does not do much harm.

But if the weather is actually bad and the model predicts that it is a good time to launch (a false positive), a mishap may happen, and years of hard work will be wasted.

So the conclusion is that in some cases we cannot afford a false positive. Note that recall only counts false negatives, so a high recall says nothing about false positives. In a case like this one it is acceptable to give up some recall in order to keep false positives low, which is exactly what precision measures.

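from sklearn.metrics import recall_score
# average='weighted' weights each class's recall by its number of true samples,
# which makes the result equal to the overall accuracy
recall_score(y_test, pred_y, average='weighted')
 Output: 0.8067796610169492 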


Precision

Precision means: among all values predicted as positive, how many are actually positive.


Precision : True Positive/(True Positive + False Positive)

Let’s understand it with an example again.

Suppose your problem statement asks you to identify whether a patient has cancer. If a person actually has cancer and the model predicts cancer, it is a true positive.

If a person does not have cancer but the model detects it (a false positive), it will not do much harm apart from worrying that person.

But if a person actually has cancer and the model does not detect it (a false negative), it may cost them their life because they receive no treatment.

In such a case, false negatives are the costlier mistake, so it is more important to focus on recall than on precision.

from sklearn.metrics import precision_score
precision_score(y_test, pred_y, average='macro')
 Output: 0.8041111552992642 
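
The two formulas can also be worked through by hand. The sketch below is plain Python, not sklearn, and rests on one assumption: the four confusion-matrix counts reported earlier are arranged so that rows are actual classes and columns are predicted classes (153 and 22 in the first row, 35 and 85 in the second). Under that reading, the macro-averaged precision and the weighted recall come out the same as the sklearn outputs above.

```python
# Assumed layout: rows = actual class, columns = predicted class
cm = [[153, 22],
      [35, 85]]
n = len(cm)

row_sums = [sum(row) for row in cm]                                  # actual samples per class
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]       # predictions per class

# Treat each class in turn as the "positive" class
precision = [cm[i][i] / col_sums[i] for i in range(n)]   # TP / (TP + FP)
recall    = [cm[i][i] / row_sums[i] for i in range(n)]   # TP / (TP + FN)

# 'macro' = plain average over classes
macro_precision = sum(precision) / n
# 'weighted' = average weighted by the number of actual samples per class
weighted_recall = sum(r * s for r, s in zip(recall, row_sums)) / sum(row_sums)

print("per-class precision:", [round(p, 4) for p in precision])
print("per-class recall:   ", [round(r, 4) for r in recall])
print("macro precision:", round(macro_precision, 4))
print("weighted recall:", round(weighted_recall, 4))
```

Note that the per-class recall values differ noticeably from each other even though their weighted average equals the accuracy, so it is worth looking at the per-class numbers and not only the averaged one.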

Relationship between recall and precision

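from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Note: pred_y holds hard class labels here; passing predicted probabilities
# (for example from predict_proba) gives a curve with more threshold points
precision, recall, thresholds = precision_recall_curve(y_test, pred_y)
plt.title('precision recall relationship')
# Conventionally recall goes on the x-axis and precision on the y-axis
plt.plot(recall, precision, color='blue', label='precision-recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc = 'lower right')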
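
The trade-off between the two metrics is easiest to see by moving the decision threshold. The following is a minimal sketch in plain Python with made-up labels and scores (none of these values come from the dataset used above); `precision_recall_at` is a hypothetical helper name.

```python
# Toy data invented for illustration: true labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute precision and recall."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    # With no positive predictions there are no false positives; report 1.0 by convention
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (0.2, 0.5, 0.85):
    p, r = precision_recall_at(threshold)
    print("threshold", threshold, "precision", round(p, 3), "recall", round(r, 3))
```

With these toy values, a low threshold catches every positive but lets in false positives, while a high threshold removes the false positives at the cost of missing most positives.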

In short, which error metric to analyse depends on domain knowledge and the problem statement.

Receiver Operating Characteristics (ROC)

The ROC curve plots the true positive rate against the false positive rate at different classification thresholds.

To evaluate a binary classification model, we usually check the area under the ROC curve (AUC). The larger the area, the better the model distinguishes between the two classes. A high AUC means the predicted probabilities of the two classes are more separable.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# roc_curve returns the false positive rate and true positive rate at each threshold
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, pred_y)
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)
Output:  0.7913095238095239

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, color='red', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
# Diagonal reference line: a classifier that guesses at random
plt.plot([0, 1], [0, 1], linestyle='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
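
Because pred_y contains hard class labels, the ROC curve above has only one interior point, and its area can be checked by hand with the trapezoid rule. This sketch assumes the confusion-matrix counts from earlier are read as 85 true positives, 35 false negatives, 22 false positives and 153 true negatives; that reading is an assumption about which class is the positive one.

```python
# Assumed reading of the counts reported earlier
tp, fn = 85, 35
fp, tn = 22, 153

tpr = tp / (tp + fn)   # true positive rate (recall)
fpr = fp / (fp + tn)   # false positive rate

# The curve runs from (0, 0) through (fpr, tpr) to (1, 1)
points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]

# Trapezoid rule: sum the area of each segment under the curve
area = sum((x2 - x1) * (y1 + y2) / 2
           for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(area, 4))
```

Under this assumption the area agrees with the AUC reported above. With predicted probabilities in place of hard labels, the curve would have many more points and the AUC would generally be different.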

F1 score

The F1 score is the harmonic mean of precision and recall.


F1 = 2 * (precision * recall) / (precision + recall)

from sklearn.metrics import f1_score
# Same variables as in the earlier snippets: true labels first, then predictions
f1_score(y_test, pred_y, average='weighted')

How to decide whether to use accuracy_score or f1_score for evaluation?

accuracy_score can be used when the classes are roughly balanced and true positives and true negatives matter most. But if the data is imbalanced, or false negatives and false positives are the deciding factors, the F1 score should be used, as it takes both recall and precision into account.

from sklearn.metrics import classification_report
print(classification_report(y_test, pred_y))
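
To see why accuracy can mislead on imbalanced data, consider a minimal sketch in plain Python. The data is invented for illustration: 5 positive cases out of 100, and a model that always predicts the negative class.

```python
# Toy imbalanced data invented for illustration: 5 positives, 95 negatives
actual = [1] * 5 + [0] * 95
predicted = [0] * 100          # a "model" that always predicts the majority class

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

accuracy = (tp + tn) / len(actual)
# Guard against division by zero when nothing is predicted positive
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print("accuracy:", accuracy)
print("F1 score:", f1)
```

The accuracy looks excellent although the model never finds a single positive case, while the F1 score for the positive class drops to zero and exposes the problem.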

Evaluation methods for regression

  • Mean Absolute Error(MAE)
  • Mean Squared Error(MSE)
  • Root Mean Squared Error(RMSE)

Let’s understand one by one:

Mean Absolute Error(MAE)

It is the average of the absolute differences between the predicted values and the actual values of the target variable.

from sklearn.metrics import mean_absolute_error
mean_absolute_error(Y_test, y_pred)
Output:  3.7252449081714416 

Mean Squared Error(MSE)

If we simply averaged the raw differences, positive and negative errors would cancel each other out and the result would look better than it really is. Squaring each difference makes every error positive.

Therefore, mean squared error is the average of the squared errors, i.e. the squared differences between actual and predicted values.

from sklearn.metrics import mean_squared_error
mean_squared_error(Y_test, y_pred)
Output: 24.92238672931211 
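
Here is a small sketch in plain Python of how the sign of the errors affects the averages. The numbers are made up for illustration and are unrelated to the outputs above.

```python
# Toy values invented for illustration
actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

errors = [p - a for a, p in zip(actual, predicted)]   # [2.0, -2.0, 3.0, -3.0]

mean_error = sum(errors) / len(errors)                # signs cancel each other out
mae = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error

print("mean of raw errors:", mean_error)
print("MAE:", mae)
print("MSE:", mse)
```

The raw errors average to zero even though every single prediction is off, which is why both MAE (absolute value) and MSE (square) remove the sign before averaging.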

Root Mean Squared Error(RMSE)

Root Mean Squared Error is the square root of the mean squared error, which brings the metric back to the same units as the target variable.

RMSE is widely used and often preferred over the other methods because it squares the errors first, so larger errors get a bigger penalty. Whenever large errors are especially undesirable, RMSE is very useful.

import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE is the square root of the MSE
rmse = np.sqrt(mean_squared_error(Y_test, y_pred))
print('RMSE is {}'.format(rmse))
 Output: RMSE is 4.9922326397426
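
The extra penalty on large errors can be illustrated with a minimal sketch in plain Python. The two error lists are invented for illustration and are chosen so that both have the same MAE; `mae` and `rmse` here are small hand-written helpers, not sklearn functions.

```python
import math

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Toy errors invented for illustration: same total error, spread differently
errors_even  = [2.0, 2.0, 2.0, 2.0]   # every prediction is slightly off
errors_spiky = [0.0, 0.0, 0.0, 8.0]   # one prediction is badly off

print("MAE: ", mae(errors_even), mae(errors_spiky))
print("RMSE:", rmse(errors_even), rmse(errors_spiky))
```

MAE treats the two cases as equally good, while RMSE is twice as high for the case with a single large error.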

Note: Here I have focused on error metrics only; the accuracy I am getting here is not a perfect one. We can use pre-processing techniques, fine-tuning and different machine learning models, according to the problem given, to achieve the best results.

Here is the github link for detailed explanation and code:

Please let me know if you have any doubts or suggestions; I would be happy to discuss further.