Object detection sits underneath most of the interesting computer vision work being done right now: self-driving, surveillance, industrial QC, sports analytics, you name it. If you're training or evaluating these models, you need to know how to measure them properly. This guide is the practical reference: what each metric means, how it's calculated, and when to use which one. Precision, recall, IoU, AP, mAP. The whole toolkit.
The aim is to give you enough to read a benchmark table critically and to know which dial to turn when your model is underperforming. No equations beyond what's necessary, plenty of intuition about when each metric matters.
Why the metrics matter
Say you've trained a pedestrian detector. The obvious thing is to count the pedestrians it gets right. The problem: that ignores the pedestrians it missed, and the bicycles it called pedestrians. Both of those failures matter, and they matter differently depending on the application. Counting right answers doesn't capture either. The metrics in this guide do.
Beyond just measuring, the metrics tell you what to fix. A high-precision, low-recall model has a different problem from a low-precision, high-recall one, and you tune them differently. And when you're picking between models, the metrics let you make the choice on the dimension that actually matters for your use case. Safety-critical workloads weight recall heavily. Workloads that pay a cost per false alert weight precision heavily.
The vocabulary
Before the metrics themselves, the terms they're built from:
- True Positive (TP). The model predicted an object and there really is one in that location.
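- False Positive (FP). The model predicted an object but nothing's there, or it called the wrong class, or its box overlaps the real object too little to count. A second detection of an object that has already been matched is also an FP.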
- True Negative (TN). Not used for object detection in any practical way. A true negative would be every background region the model correctly left alone, and an image contains effectively unlimited such regions, so the count tells you nothing.
- False Negative (FN). An object that was actually there and the model missed it.
- Intersection over Union (IoU). How much the predicted box overlaps the ground truth box. Area of overlap divided by area of union. A detection counts as a TP only if its IoU with the matching ground truth exceeds a threshold (typically 0.5).
Everything else builds on top of these.
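To make the vocabulary concrete, here is a minimal sketch of IoU and of how it turns predictions into TP, FP and FN counts. It assumes boxes are (x1, y1, x2, y2) tuples, a single class, a single image, and greedy matching in descending confidence order. The function names and the sample boxes are made up for illustration, and whether the threshold comparison is strict or inclusive varies between benchmarks.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """Greedy one-to-one matching for one class in one image.

    predictions: list of (confidence, box); ground_truths: list of boxes.
    Returns (tp, fp, fn).
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = fp = 0

    # Highest-confidence predictions get first pick of the ground truth.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx

        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # background hit, poor overlap, or duplicate detection

    fn = len(ground_truths) - len(matched)  # objects nobody claimed
    return tp, fp, fn


# Illustrative data: two real objects, three predictions.
ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    (0.9, (1, 1, 10, 10)),     # good overlap with the first object
    (0.8, (0, 0, 9, 9)),       # duplicate of the same object
    (0.7, (50, 50, 60, 60)),   # nothing there
]
print(count_matches(predictions, ground_truths))
```

The duplicate box is the interesting case: it overlaps a real object well, but that object is already matched, so it counts against precision.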
The metrics that matter
Five metrics you actually need to know:
- Precision. Of the detections the model made, what fraction were correct. TP / (TP + FP). High precision means few false alarms.
- Recall. Of the objects actually present, what fraction did the model find. TP / (TP + FN). High recall means few misses.
- F1-Score. The harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall). A single number that punishes you for being weak on either axis.
- Average Precision (AP). The area under the precision-recall curve for a single class. It captures the precision-recall trade-off over all confidence thresholds, not just one.
- Mean Average Precision (mAP). The average AP across all classes. This is the headline number in almost every object detection benchmark.
Which of these you optimise depends on the application. Safety-critical: bias toward recall. Cost-per-alert: bias toward precision. Balanced: F1 or mAP.
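The first three metrics are simple arithmetic on the TP, FP and FN counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw detection counts."""
    # Guard against division by zero when there are no detections or no objects.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    # Harmonic mean: dominated by whichever of the two is smaller.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts: 80 correct detections, 20 false alarms, 40 misses.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Returning 0.0 for the empty cases is a choice made here for simplicity; some tools report these cases as undefined instead.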
Recall in more detail
Recall, sometimes called sensitivity, measures how good the model is at finding everything it should. High recall means few missed objects. For medical imaging, for safety monitoring, for anything where the cost of missing an object is high, recall is the metric to watch. Tumour detection is the canonical example: if the model misses the tumour, the patient pays the price.
The formula:
Recall = True Positives / (True Positives + False Negatives)
Where:
- True Positives (TP): objects correctly detected.
- False Negatives (FN): objects actually present that the model failed to detect.
Pushing recall up, typically by lowering the confidence threshold at which a detection is kept, usually pushes precision down. You're catching more real objects but you're also raising more false alarms. Where to land on that curve is a product decision, not a technical one.
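The trade-off is easiest to see by sweeping the confidence threshold over a fixed set of detections. The sketch below assumes each detection has already been labelled as a true or false positive by IoU matching. The function name, the scores and the ground-truth count are invented for illustration.

```python
def precision_recall_at(detections, num_ground_truth, threshold):
    """Precision and recall when only detections at or above `threshold` are kept.

    detections: list of (confidence, is_true_positive) pairs for one class.
    """
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp

    # Convention chosen here: with no detections there are no false alarms,
    # so precision is reported as 1.0.
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Illustrative detections for one class, with 5 real objects in the data.
detections = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False),
]
for threshold in (0.9, 0.5, 0.2):
    p, r = precision_recall_at(detections, num_ground_truth=5, threshold=threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.9 to 0.2 doubles recall while precision falls from perfect to a little over half.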

Worked example: AP and mAP
Quick worked example. Two-class detector: cars and pedestrians.
You evaluate on a held-out set and get the precision-recall curve for each class. Say the car curve gives AP = 0.85 and the pedestrian curve gives AP = 0.70.
mAP is just the average across classes:
mAP = (AP(car) + AP(pedestrian)) / 2 = (0.85 + 0.70) / 2 = 0.775
So this model scores 0.775 mAP. The fact that car AP is much higher than pedestrian AP tells you where the model is weaker, which is more useful than the headline number on its own.
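The same arithmetic in code, using the AP values from the example. The function name is hypothetical; the per-class dictionary is the point, since it keeps the breakdown next to the headline number.

```python
def mean_average_precision(ap_per_class):
    """Unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ap_per_class = {"car": 0.85, "pedestrian": 0.70}

map_score = mean_average_precision(ap_per_class)
weakest_class = min(ap_per_class, key=ap_per_class.get)

print(f"mAP = {map_score:.3f}")
print(f"weakest class: {weakest_class} (AP = {ap_per_class[weakest_class]:.2f})")
```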

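In practice, AP is computed with one of a few standard interpolation schemes. Pascal VOC 2007 used 11-point interpolation and later VOC editions used the full area under the curve. COCO samples precision at 101 recall points, then averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. Libraries handle this for you, but it's worth knowing the variant so you can compare like-for-like across benchmarks.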
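For intuition about what the libraries do, here is a minimal sketch of 101-point interpolated AP for one class at one IoU threshold. It assumes detections have already been labelled as true or false positives. The function name and inputs are illustrative, and this is not the reference implementation of any benchmark. A COCO-style score would repeat this at each IoU threshold and average the results.

```python
from bisect import bisect_left


def average_precision(detections, num_ground_truth, num_points=101):
    """Interpolated AP from (confidence, is_true_positive) pairs for one class."""
    if num_ground_truth == 0:
        return 0.0

    # Walk down the ranked list, recording precision and recall after each detection.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Precision envelope: at each point, take the best precision achievable
    # at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sample the envelope at evenly spaced recall levels in [0, 1].
    total = 0.0
    for k in range(num_points):
        recall_level = k / (num_points - 1)
        idx = bisect_left(recalls, recall_level)  # first point reaching this recall
        total += precisions[idx] if idx < len(precisions) else 0.0
    return total / num_points


# Illustrative input: one correct detection, one false alarm, two real objects.
print(average_precision([(0.9, True), (0.8, False)], num_ground_truth=2))
```

Recall levels the detector never reaches contribute zero precision, which is why a model that finds only half the objects cannot score much above 0.5 however clean its detections are.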
Moving the numbers
If your metrics are below where you need them, these are the levers that actually work:
- Data augmentation. Rotations, scaling, colour jitter, mosaic, mixup. Cheap to do, usually moves the needle.
- Balanced classes. If 95% of your training boxes are cars, the pedestrian AP will be terrible. Oversample, undersample, or weight the loss.
- Hyperparameter tuning. Learning rate, batch size, optimiser, plus inference-time settings such as the confidence threshold and the IoU threshold for NMS. A systematic grid or Bayesian search usually beats hand-tuning.
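- Ensembles. Combining predictions across multiple models almost always helps. For detectors that means merging overlapping boxes from the different models, not averaging raw outputs. Slow at inference, but a useful baseline for "what's possible".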
- Transfer learning. Start from a backbone pre-trained on ImageNet, or from a full detector pre-trained on COCO or similar. Almost always faster and better than training from scratch, especially with limited data.
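As one concrete instance of the class-balance lever, a common heuristic is to weight each class inversely to its frequency. The sketch below assumes you have a flat list of class labels, one per training box. The function name and the label counts are made up, and how the weights feed into the loss depends on your training framework.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights so that rare classes count for more in the loss.

    Uses total / (num_classes * class_count), which gives every class the
    same total weight across the dataset.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Illustrative label list matching the 95% cars example above.
labels = ["car"] * 95 + ["pedestrian"] * 5
weights = inverse_frequency_weights(labels)
print(weights)
```

With this split each pedestrian box is weighted 10.0 and each car box roughly 0.53. Weights this lopsided can destabilise training, so treat them as a starting point and not a fixed recipe.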

That's the working toolkit. Know the difference between precision and recall, know how IoU shapes what counts as a TP, know how AP and mAP roll those up into a single comparable number. Pick the metric that matches your use case, not the one that makes your model look best. And remember that the headline number is almost always less useful than the per-class breakdown, which is where the actual failures show up.
