
I am evaluating a model that predicts the presence or absence of a "characteristic" (for example, "there is a dog in this image") on several datasets. For every dataset, the system outputs the TP, TN, FP, and FN counts.

I would like a metric (or metrics) to judge how well the model is doing its job, but I realize that I cannot simply plot, say, the raw TP counts: the first dataset has 20 instances with the characteristic (there is a dog) while the second has only 10, so even a perfect model would reach only 10 TP on the second dataset.

I am thinking of calculating accuracy, precision, and recall for each dataset and for all datasets combined.

I have also run the model three times on each dataset, with small variations.

I am also investigating precision-recall curves, but it seems these are for different threshold values, and obviously I only have a single precision/recall pair per dataset.

Is there any good way to judge whether a model is "good"? Due to my inexperience I cannot come up with good judging criteria.

At first I thought of plotting the distribution of each count (TP, etc.) over all datasets. Then I thought of plotting a confusion matrix combining all datasets (I sketch that after the per-dataset output below). Any advice will be greatly appreciated.


As a simple fictitious example, I tried:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example fictitious data
datasets = {
    'datasetA': {'TP': 150, 'TN': 200, 'FP': 50, 'FN': 100, 'no_GT': 34},
    'datasetB': {'TP': 180, 'TN': 220, 'FP': 40, 'FN': 81, 'no_GT': 20},
    'datasetC': {'TP': 160, 'TN': 240, 'FP': 70, 'FN': 110, 'no_GT': 30},
    'datasetD': {'TP': 190, 'TN': 250, 'FP': 60, 'FN': 90, 'no_GT': 42},
}

def calculate_metrics(TP, TN, FP, FN):
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Aggregate counts
total_TP = sum(data['TP'] for data in datasets.values())
total_TN = sum(data['TN'] for data in datasets.values())
total_FP = sum(data['FP'] for data in datasets.values())
total_FN = sum(data['FN'] for data in datasets.values())

# Calculate overall metrics
overall_metrics = calculate_metrics(total_TP, total_TN, total_FP, total_FN)

# Calculate metrics for each dataset
metrics_df = pd.DataFrame({dataset: calculate_metrics(data['TP'], data['TN'], data['FP'], data['FN']) for dataset, data in datasets.items()})

# Add overall metrics
metrics_df['Overall'] = overall_metrics

print(metrics_df)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, (dataset, data) in enumerate(datasets.items()):
    # Build the 2x2 confusion matrix directly from the counts
    # (rows = true class, columns = predicted class, label order [0, 1]).
    cm = np.array([[data['TN'], data['FP']],
                   [data['FN'], data['TP']]])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f'Confusion Matrix - {dataset}')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('True')

plt.tight_layout()
plt.show()

and I get

           datasetA  datasetB  datasetC  datasetD   Overall
Accuracy   0.700000  0.767754  0.689655  0.745763  0.725696
Precision  0.750000  0.818182  0.695652  0.760000  0.755556
Recall     0.600000  0.689655  0.592593  0.678571  0.640905
F1 Score   0.666667  0.748441  0.640000  0.716981  0.693524

and the corresponding per-dataset confusion-matrix heatmaps (plot not reproduced here).
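
For the "combining all datasets" confusion matrix I mentioned above, this is the kind of sketch I had in mind (continuing the script above and pooling the total counts; rows are the true class, columns the predicted class):

# Continuing from the script above: pool all datasets into one overall confusion matrix.
overall_cm = np.array([[total_TN, total_FP],
                       [total_FN, total_TP]])

plt.figure(figsize=(5, 4))
sns.heatmap(overall_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - all datasets combined')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()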

  • I'd add to the metrics you are already using the ROC curve and the F1 score. – MarcoM, Jul 9 at 5:50
  • Thanks. I thought about it, but I read that the ROC curve plots the true positive rate against the false positive rate for different threshold values. It is this "different threshold values" part that I don't understand. I got the results of the model application as their TP, etc. There are no different threshold values, are there? (Sorry, I am a bit confused about this.) – Jul 9 at 5:53
  • The F1 score suffers from all the same issues as accuracy etc., see the links in my answer. The AUROC is better. @KansaiRobot: there absolutely are different thresholds here, it's just that they are often swept under the rug and implicitly set to 0.5, which is usually not a good choice, see this thread. You should either think carefully about your costs of "misclassification" and set your threshold accordingly, or use a model that performs well over many thresholds - through proper scoring rules. – Jul 9 at 6:40
  • Please also check an overview of measures: stats.stackexchange.com/q/586342/3277 – ttnphns, Jul 9 at 20:43
  • OK, when you say "model" you mean "binary classifier", e.g. one that predicts the existence or non-existence of a "characteristic" (e.g. "there is a dog in this image"). But to evaluate the classifier's performance, you have to tell us the relative cost of a false positive vs. a false negative. If this were a diagnostic for a rare but fatal condition, a low FN rate would be crucial but FPs would not be so bad. Without knowing how rare/common dogs are in your images and what significance that carries, we can't say. – smci, Jul 10 at 10:37

3 Answers


Do not use any of accuracy, precision, recall, or the F1 score. They all suffer from the same issues, especially - but not only - for "unbalanced" data: Why is accuracy not the best measure for assessing classification models?

Instead, use probabilistic predictions and assess these using proper scoring rules.
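
For illustration only (this assumes the model can output a per-image probability of "dog" rather than just a hard 0/1 label; the labels and probabilities below are simulated), a minimal sketch of two common proper scoring rules:

import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Simulated example: true labels and predicted probabilities for one dataset.
# In practice these come from the model's probabilistic output, not from the
# TP/TN/FP/FN counts (which have already thrown the probabilities away).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                        # 0 = no dog, 1 = dog
y_prob = np.clip(0.7 * y_true + rng.normal(0.2, 0.25, size=200), 0.01, 0.99)

print("Brier score:", brier_score_loss(y_true, y_prob))      # lower is better
print("Log loss:   ", log_loss(y_true, y_prob))               # lower is better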

  • Can you please expand more on why the F1 measure is not appropriate for unbalanced data? It is often mentioned as a good measure of a model against unbalanced data, such as here, here, here, and here. Obviously there is no perfect measure of success for a model, but F1 seems to be widely accepted, so I was surprised to see it shot down in the same list as accuracy. – Jul 9 at 15:05
  • @PoissonFish: the problem with the F1 (and indeed all F-beta scores) is quite analogous to the one with accuracy, and all the other KPIs that rely on "hard" 0-1 classifications. In a nutshell, they all assume a very specific cost structure for misclassifications - albeit a different one in each case, which is why they will usually not agree on which of multiple models is the best one. I'll be honest here: we get so many questions that are minor variations on this theme that I have gotten a bit tired of going through all this each time, and rather answer as I do here. – Jul 10 at 16:41
  • @PoissonFish Note that many people do not understand this issue and will post bad advice on LinkedIn, Twitter/X, Medium, etc., as I suspect Stephan Kolassa will agree. – Dave, Jul 10 at 16:43
  • @Dave and Stephan, thank you both for the explanation; I believe I am beginning to understand it better. I have reread the linked answer in that light and understand your points about the F1 score (and other similar measures). I am currently doing some work that involves the F1 measure and perhaps will change my approach based on this new (to me) information. Sorry for picking at this topic, but because of that work I had a vested interest in understanding your point of view. Thank you for the chance to change my mind! – Jul 10 at 16:49
  • @PoissonFish I have some links to related material in my profile. You may be interested in diving down the rabbit hole. – Dave, Jul 10 at 17:00

Which scoring rule you should use depends on the particular situation: how beneficial the TPs and TNs are, and how costly the FPs and FNs are.

These vary hugely by situation. For instance, if the event is "the Challenger shuttle will explode", then even a very small false-negative rate is really bad. OTOH, for some high-risk surgeries, even a small false-positive rate may be bad.

As so often happens, you need subject matter expertise to solve the problem.
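
As a sketch of how that subject-matter input can be used (the cost numbers below are made up): if one can attach rough relative costs to the two error types and the model outputs a probability p, the threshold that minimises expected cost is cost_FP / (cost_FP + cost_FN).

# Sketch: turn assumed misclassification costs into a decision threshold for a
# predicted probability p. Predict "positive" when the expected cost of doing so
# is lower:  (1 - p) * cost_FP < p * cost_FN  =>  p > cost_FP / (cost_FP + cost_FN).

def decision_threshold(cost_FP: float, cost_FN: float) -> float:
    """Probability threshold that minimises the expected misclassification cost."""
    return cost_FP / (cost_FP + cost_FN)

# Hypothetical costs: missing the event (FN) is 9x worse than a false alarm (FP).
print(decision_threshold(cost_FP=1.0, cost_FN=9.0))  # 0.1 -> predict positive whenever p > 0.1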


I will provide an answer coming from a different perspective. My background is in diagnostic tests (think Covid, pregnancy, etc.), not ML. But maybe that different angle can give you some insights.

The performance of your classifier (dog/no dog) is indeed fully specified by the 4 numbers you get (TP, TN, FP, FN). Every other statistic based on these 4 original numbers is just re-arranging them in various ways.

In the diagnostics domain, these 4 basic counts are re-arranged into 4 other metrics:

  • Sensitivity ($\frac{TP}{TP+FN}$: how good your model is at finding positives), aka Recall, Hit Rate, or TPR (true positive rate), with 1 - Sensitivity = Miss Rate or False Negative Rate (FNR).
  • Specificity ($\frac{TN}{TN+FP}$: how good your model is at finding negatives), aka TNR (true negative rate), with 1 - Specificity = Fall-out or False Positive Rate (FPR).
  • Positive Predictive Value (PPV) ($\frac{TP}{TP+FP}$: if your model gives you a positive result, how likely it is to be truly positive), aka Precision, with 1 - PPV = FDR (false discovery rate).
  • Negative Predictive Value (NPV) ($\frac{TN}{TN+FN}$: if your model gives you a negative result, how likely it is to be truly negative); it seems other fields do not have a synonym for this one, with 1 - NPV = FOR (false omission rate).

These 4 metrics are basically all you need (because you can regenerate TP, TN, FP, FN from them). Any other combination is then redundant; for example accuracy, negative likelihood ratio (LR-), positive likelihood ratio (LR+), F1, etc. do not add new information (some people may, however, prefer one over another).

As you can see from the various acronyms and synonyms, different fields use different terms, which only adds to the confusion without adding much value. But that is the current state. Now, you could pick another set of 4 numbers to summarize/represent TP, TN, FP, FN. I personally like this particular set of 4 (and you can call them by whatever your domain terminology is) because of their nice symmetry: sensitivity/specificity are ratios to the column totals of a confusion matrix, while PPV/NPV are ratios to the row totals; sensitivity/specificity do not depend on prevalence, while PPV/NPV do; sensitivity/PPV tell you about the positives (subjects or results), while specificity/NPV tell you about the negatives. Sensitivity/specificity tell you how good a job the classifier does at finding positives/negatives, while PPV/NPV tell you how much you can believe the positive/negative results.

Now, all these metrics are just observed values, so you should compute CIs for all 4 of them. A sensitivity or PPV value without a CI is not of much use.
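
A sketch of what that could look like in code (this uses the Wilson score interval; Clopper-Pearson or other interval methods would work just as well), applied to the question's fictitious datasetA counts:

from math import sqrt

def wilson_ci(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion successes/total."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half, centre + half)

def diagnostic_metrics(TP, TN, FP, FN):
    """Sensitivity, specificity, PPV and NPV, each with a 95% Wilson CI."""
    return {
        'Sensitivity': (TP / (TP + FN), wilson_ci(TP, TP + FN)),
        'Specificity': (TN / (TN + FP), wilson_ci(TN, TN + FP)),
        'PPV':         (TP / (TP + FP), wilson_ci(TP, TP + FP)),
        'NPV':         (TN / (TN + FN), wilson_ci(TN, TN + FN)),
    }

# The question's fictitious datasetA counts:
for name, (value, (lo, hi)) in diagnostic_metrics(TP=150, TN=200, FP=50, FN=100).items():
    print(f"{name}: {value:.3f} (95% CI {lo:.3f}-{hi:.3f})")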

Now, if you have results where you "tweaked" the models differently, you can indeed create a ROC plot. And you should also add the CI bounds on the plot. The "best" version of your model is the one closest to the top-left corner.
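
A sketch of such a plot (the three "runs" and their counts below are made up, and the CI bands are omitted for brevity): each model variant contributes one operating point in ROC space, i.e. its (FPR, TPR) pair.

import matplotlib.pyplot as plt

# Hypothetical counts for three "tweaked" runs of the model on one dataset.
variants = {
    'run 1': {'TP': 150, 'TN': 200, 'FP': 50, 'FN': 100},
    'run 2': {'TP': 170, 'TN': 190, 'FP': 60, 'FN': 80},
    'run 3': {'TP': 140, 'TN': 220, 'FP': 30, 'FN': 110},
}

for name, c in variants.items():
    tpr = c['TP'] / (c['TP'] + c['FN'])   # sensitivity
    fpr = c['FP'] / (c['FP'] + c['TN'])   # 1 - specificity
    plt.scatter(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance diagonal
plt.xlabel('False positive rate (1 - specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.title('Operating points in ROC space (closer to the top-left corner is better)')
plt.show()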

  • $\begingroup$ "The performance of your classifiers (dog/no dog) is indeed fully specified by the 4 metrics you get (TP, TN, FP, FN). Everything else is just re-arranging these 4 metrics in various ways." It depends very much on what you mean by "everything else". For instance, you cannot derive the Brier or the log score from these statistics - which stands to reason, since these values depend crucially on the threshold used, which is usually by default set to 0.5, which is often not a good idea. $\endgroup$ Commented Jul 9 at 21:12
  • 2
    $\begingroup$ Indeed, the Brier score requires a probabilistic assessment of the classification. A confusion matrix is te result of a dichotomous test. So one can not derive one metric from the other, or as you say, only poorly so. My point is that, when we stick to dichotomous tests, TN,TP,FN,FP tells you all you need, you can substitute a properly chosen set of 4 other numbers, but more than 4 is redundant. $\endgroup$
    – jginestet
    Commented Jul 9 at 21:37
  • $\begingroup$ Precisely. So perhaps you could change the "everything else" to "all other statistics that depend on this specific choice of threshold", just to be more precise? $\endgroup$ Commented Jul 10 at 7:24
