
I am evaluating a model that predicts the presence or absence of a "characteristic" (for example, "there is a dog in this image") on several datasets. For every dataset, the system outputs the TP, TN, FP, and FN counts.

I would like a metric (or metrics) to judge how well the model is doing its job, but I realize that I cannot simply plot, say, the raw TP counts: the first dataset has 20 instances with the characteristic (there is a dog) while the second has only 10, so even a perfect model would reach only 10 TP on the second dataset.

I am thinking of calculating accuracy, precision, and recall for each dataset and for all datasets combined.

I have also run the model three times on each dataset, with small variations.

I am also investigating precision-recall curves, but it seems these are for different threshold values, and obviously I only have a single precision/recall pair per dataset.

Is there any good way to judge whether a model is "good"? Due to my inexperience I cannot come up with good judging criteria.

At first I thought of plotting the distribution of each count (TP, etc.) over all datasets. Then I thought of plotting a confusion matrix combining all datasets (I sketch that after the per-dataset output below). Any advice will be greatly appreciated.


As a simple fictitious example, I tried:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example fictitious data
datasets = {
    'datasetA': {'TP': 150, 'TN': 200, 'FP': 50, 'FN': 100, 'no_GT': 34},
    'datasetB': {'TP': 180, 'TN': 220, 'FP': 40, 'FN': 81, 'no_GT': 20},
    'datasetC': {'TP': 160, 'TN': 240, 'FP': 70, 'FN': 110, 'no_GT': 30},
    'datasetD': {'TP': 190, 'TN': 250, 'FP': 60, 'FN': 90, 'no_GT': 42},
}

def calculate_metrics(TP, TN, FP, FN):
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Aggregate counts
total_TP = sum(data['TP'] for data in datasets.values())
total_TN = sum(data['TN'] for data in datasets.values())
total_FP = sum(data['FP'] for data in datasets.values())
total_FN = sum(data['FN'] for data in datasets.values())

# Calculate overall metrics
overall_metrics = calculate_metrics(total_TP, total_TN, total_FP, total_FN)

# Calculate metrics for each dataset
metrics_df = pd.DataFrame({dataset: calculate_metrics(data['TP'], data['TN'], data['FP'], data['FN']) for dataset, data in datasets.items()})

# Add overall metrics
metrics_df['Overall'] = overall_metrics

print(metrics_df)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, (dataset, data) in enumerate(datasets.items()):
    # Build the 2x2 confusion matrix directly from the counts
    # (rows = true class, columns = predicted class, label order [0, 1]).
    cm = np.array([[data['TN'], data['FP']],
                   [data['FN'], data['TP']]])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f'Confusion Matrix - {dataset}')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('True')

plt.tight_layout()
plt.show()

and I get

           datasetA  datasetB  datasetC  datasetD   Overall
Accuracy   0.700000  0.767754  0.689655  0.745763  0.725696
Precision  0.750000  0.818182  0.695652  0.760000  0.755556
Recall     0.600000  0.689655  0.592593  0.678571  0.640905
F1 Score   0.666667  0.748441  0.640000  0.716981  0.693524

and the corresponding per-dataset confusion-matrix heatmaps (plot not reproduced here).
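
For the "combining all datasets" confusion matrix I mentioned above, this is the kind of sketch I had in mind (continuing the script above and pooling the total counts; rows are the true class, columns the predicted class):

# Continuing from the script above: pool all datasets into one overall confusion matrix.
overall_cm = np.array([[total_TN, total_FP],
                       [total_FN, total_TP]])

plt.figure(figsize=(5, 4))
sns.heatmap(overall_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - all datasets combined')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()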

  • I'd add to the metrics you are already using the ROC curve and the F1 score. – MarcoM, Jul 9 at 5:50
  • Thanks. I thought about it, but I read that the ROC curve plots the true positive rate against the false positive rate for different threshold values. It is this "different threshold values" part that I don't understand. I got the results of the model application as their TP, etc. There are no different threshold values, are there? (Sorry, I am a bit confused about this.) – Jul 9 at 5:53
  • The F1 score suffers from all the same issues as accuracy etc., see the links in my answer. The AUROC is better. @KansaiRobot: there absolutely are different thresholds here, it's just that they are often swept under the rug and implicitly set to 0.5, which is usually not a good choice, see this thread. You should either think carefully about your costs of "misclassification" and set your threshold accordingly, or use a model that performs well over many thresholds - through proper scoring rules. – Jul 9 at 6:40
  • Please also check an overview of measures: stats.stackexchange.com/q/586342/3277 – ttnphns, Jul 9 at 20:43
  • OK, when you say "model" you mean "binary classifier", e.g. one that predicts the existence or non-existence of a "characteristic" (e.g. "there is a dog in this image"). But to evaluate the classifier's performance, you have to tell us the relative cost of a false positive vs. a false negative. If this were a diagnostic for a rare but fatal condition, a low FN rate would be crucial but FPs would not be so bad. Without knowing how rare/common dogs are in your images and what significance that carries, we can't say. – smci, Jul 10 at 10:37

3 Answers


Do not use any of accuracy, precision, recall, or the F1 score. They all suffer from the same issues, especially - but not only - for "unbalanced" data: Why is accuracy not the best measure for assessing classification models?

Instead, use probabilistic predictions and assess these using proper scoring rules.
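
For illustration only (this assumes the model can output a per-image probability of "dog" rather than just a hard 0/1 label; the labels and probabilities below are simulated), a minimal sketch of two common proper scoring rules:

import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Simulated example: true labels and predicted probabilities for one dataset.
# In practice these come from the model's probabilistic output, not from the
# TP/TN/FP/FN counts (which have already thrown the probabilities away).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                        # 0 = no dog, 1 = dog
y_prob = np.clip(0.7 * y_true + rng.normal(0.2, 0.25, size=200), 0.01, 0.99)

print("Brier score:", brier_score_loss(y_true, y_prob))      # lower is better
print("Log loss:   ", log_loss(y_true, y_prob))               # lower is better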

  • Can you please expand more on why the F1 measure is not appropriate for unbalanced data? It is often mentioned as a good measure of a model against unbalanced data, such as here, here, here, and here. Obviously there is no perfect measure of success for a model, but F1 seems to be widely accepted, so I was surprised to see it shot down in the same list as accuracy. – Jul 9 at 15:05
  • @PoissonFish: the problem with the F1 (and indeed all F-beta scores) is quite analogous to the one with accuracy, and all the other KPIs that rely on "hard" 0-1 classifications. In a nutshell, they all assume a very specific cost structure for misclassifications - albeit a different one in each case, which is why they will usually not agree on which of multiple models is the best one. I'll be honest here: we get so many questions that are minor variations on this theme that I have gotten a bit tired of going through all this each time, and rather answer as I do here. – Jul 10 at 16:41
  • @PoissonFish Note that many people do not understand this issue and will post bad advice on LinkedIn, Twitter/X, Medium, etc., as I suspect Stephan Kolassa will agree. – Dave, Jul 10 at 16:43
  • @Dave and Stephan, thank you both for the explanation; I believe I am beginning to understand it better. I have reread the linked answer in that light and understand your points about the F1 score (and other similar measures). I am currently doing some work that involves the F1 measure and perhaps will change my approach based on this new (to me) information. Sorry for picking at this topic, but because of that work I had a vested interest in understanding your point of view. Thank you for the chance to change my mind! – Jul 10 at 16:49
  • @PoissonFish I have some links to related material in my profile. You may be interested in diving down the rabbit hole. – Dave, Jul 10 at 17:00

Which scoring rule you should use depends on the particular situation: how beneficial the TPs and TNs are, and how costly the FPs and FNs are.

These vary hugely by situation. For instance, if the event is "the Challenger shuttle will explode", then even a very small false-negative rate is really bad. OTOH, for some high-risk surgeries, even a small false-positive rate may be bad.

As so often happens, you need subject matter expertise to solve the problem.
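
As a sketch of how that subject-matter input can be used (the cost numbers below are made up): if one can attach rough relative costs to the two error types and the model outputs a probability p, the threshold that minimises expected cost is cost_FP / (cost_FP + cost_FN).

# Sketch: turn assumed misclassification costs into a decision threshold for a
# predicted probability p. Predict "positive" when the expected cost of doing so
# is lower:  (1 - p) * cost_FP < p * cost_FN  =>  p > cost_FP / (cost_FP + cost_FN).

def decision_threshold(cost_FP: float, cost_FN: float) -> float:
    """Probability threshold that minimises the expected misclassification cost."""
    return cost_FP / (cost_FP + cost_FN)

# Hypothetical costs: missing the event (FN) is 9x worse than a false alarm (FP).
print(decision_threshold(cost_FP=1.0, cost_FN=9.0))  # 0.1 -> predict positive whenever p > 0.1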


I will provide an answer coming from a different perspective. My background is in diagnostic tests (think Covid, pregnancy, etc.), not ML. But maybe that different angle can give you some insights.

The performance of your classifier (dog/no dog) is indeed fully specified by the 4 numbers you get (TP, TN, FP, FN). Every other statistic based on these 4 original numbers is just re-arranging them in various ways.

In the diagnostics domain, these 4 basic counts are re-arranged into 4 other metrics:

  • Sensitivity ($\frac{TP}{TP+FN}$: how good your model is at finding positives), aka Recall, Hit Rate, or TPR (true positive rate), with 1 - Sensitivity = Miss Rate or False Negative Rate (FNR).
  • Specificity ($\frac{TN}{TN+FP}$: how good your model is at finding negatives), aka TNR (true negative rate), with 1 - Specificity = Fall-out or False Positive Rate (FPR).
  • Positive Predictive Value (PPV) ($\frac{TP}{TP+FP}$: if your model gives you a positive result, how likely it is to be truly positive), aka Precision, with 1 - PPV = FDR (false discovery rate).
  • Negative Predictive Value (NPV) ($\frac{TN}{TN+FN}$: if your model gives you a negative result, how likely it is to be truly negative); it seems other fields do not have a synonym for this one, with 1 - NPV = FOR (false omission rate).

These 4 metrics are basically all you need (because you can regenerate TP, TN, FP, FN from them). Any other combination is then redundant; for example accuracy, negative likelihood ratio (LR-), positive likelihood ratio (LR+), F1, etc. do not add new information (some people may, however, prefer one over another).

As you can see from the various acronyms and synonyms, different fields use different terms, which only adds to the confusion without adding much value. But that is the current state. Now, you could pick another set of 4 numbers to summarize/represent TP, TN, FP, FN. I personally like this particular set of 4 (and you can call them by whatever your domain terminology is) because of their nice symmetry: sensitivity/specificity are ratios to the column totals of a confusion matrix, while PPV/NPV are ratios to the row totals; sensitivity/specificity do not depend on prevalence, while PPV/NPV do; sensitivity/PPV tell you about the positives (subjects or results), while specificity/NPV tell you about the negatives. Sensitivity/specificity tell you how good a job the classifier does at finding positives/negatives, while PPV/NPV tell you how much you can believe the positive/negative results.

Now, all these metrics are just observed values, so you should compute CIs for all 4 of them. A sensitivity or PPV value without a CI is not of much use.
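
A sketch of what that could look like in code (this uses the Wilson score interval; Clopper-Pearson or other interval methods would work just as well), applied to the question's fictitious datasetA counts:

from math import sqrt

def wilson_ci(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion successes/total."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half, centre + half)

def diagnostic_metrics(TP, TN, FP, FN):
    """Sensitivity, specificity, PPV and NPV, each with a 95% Wilson CI."""
    return {
        'Sensitivity': (TP / (TP + FN), wilson_ci(TP, TP + FN)),
        'Specificity': (TN / (TN + FP), wilson_ci(TN, TN + FP)),
        'PPV':         (TP / (TP + FP), wilson_ci(TP, TP + FP)),
        'NPV':         (TN / (TN + FN), wilson_ci(TN, TN + FN)),
    }

# The question's fictitious datasetA counts:
for name, (value, (lo, hi)) in diagnostic_metrics(TP=150, TN=200, FP=50, FN=100).items():
    print(f"{name}: {value:.3f} (95% CI {lo:.3f}-{hi:.3f})")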

Now, if you have results where you "tweaked" the models differently, you can indeed create a ROC plot. And you should also add the CI bounds on the plot. The "best" version of your model is the one closest to the top-left corner.
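
A sketch of such a plot (the three "runs" and their counts below are made up, and the CI bands are omitted for brevity): each model variant contributes one operating point in ROC space, i.e. its (FPR, TPR) pair.

import matplotlib.pyplot as plt

# Hypothetical counts for three "tweaked" runs of the model on one dataset.
variants = {
    'run 1': {'TP': 150, 'TN': 200, 'FP': 50, 'FN': 100},
    'run 2': {'TP': 170, 'TN': 190, 'FP': 60, 'FN': 80},
    'run 3': {'TP': 140, 'TN': 220, 'FP': 30, 'FN': 110},
}

for name, c in variants.items():
    tpr = c['TP'] / (c['TP'] + c['FN'])   # sensitivity
    fpr = c['FP'] / (c['FP'] + c['TN'])   # 1 - specificity
    plt.scatter(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance diagonal
plt.xlabel('False positive rate (1 - specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.title('Operating points in ROC space (closer to the top-left corner is better)')
plt.show()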

  • $\begingroup$ "The performance of your classifiers (dog/no dog) is indeed fully specified by the 4 metrics you get (TP, TN, FP, FN). Everything else is just re-arranging these 4 metrics in various ways." It depends very much on what you mean by "everything else". For instance, you cannot derive the Brier or the log score from these statistics - which stands to reason, since these values depend crucially on the threshold used, which is usually by default set to 0.5, which is often not a good idea. $\endgroup$ Commented Jul 9 at 21:12
  • 2
    $\begingroup$ Indeed, the Brier score requires a probabilistic assessment of the classification. A confusion matrix is te result of a dichotomous test. So one can not derive one metric from the other, or as you say, only poorly so. My point is that, when we stick to dichotomous tests, TN,TP,FN,FP tells you all you need, you can substitute a properly chosen set of 4 other numbers, but more than 4 is redundant. $\endgroup$
    – jginestet
    Commented Jul 9 at 21:37
  • $\begingroup$ Precisely. So perhaps you could change the "everything else" to "all other statistics that depend on this specific choice of threshold", just to be more precise? $\endgroup$ Commented Jul 10 at 7:24
