Object detection used to be a craft skill. Get a copy of OpenCV, pick from Haar cascades, SIFT, or HOG, hand-tune the features for your specific use case, and hope your detector holds up when the lighting changes. The work was real but the results were fragile. Modern deep-learning vision models have largely replaced that workflow. The models are bigger and the training is heavier, but the outputs are much better and the development effort is lower.

Where traditional CV runs out

The old techniques had a specific shape. You'd pick a feature descriptor (HOG worked for pedestrians, SIFT for general object matching, Haar for faces), tune it for your scenario, and ship. In a controlled lab, this was fine. In the real world, it usually wasn't.

  • Didn't generalise. Each detector targeted specific object classes. Adding a new class meant designing a new feature pipeline from scratch.
  • Fragile to environment changes. Lighting, occlusion, viewpoint shifts: each one degraded the detector. Real-world deployment was a constant battle.
  • Feature engineering was expensive. Designing good handcrafted features is genuinely hard and takes domain expertise that's in short supply.
  • Not always fast. Some of the more sophisticated traditional methods were computationally heavy enough to rule out real-time use.
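
To make "handcrafted feature" concrete, here is a minimal sketch of the idea behind HOG: a gradient-orientation histogram for a single grayscale cell. It is a simplification for illustration (no block normalisation, no bin interpolation), and the function name and toy images are our own, not OpenCV's API.

```python
import math

def orientation_histogram(cell, bins=9):
    """Unsigned gradient-orientation histogram for one grayscale cell.

    `cell` is a list of equal-length rows of pixel intensities.
    Simplified on purpose: real HOG adds block normalisation and
    interpolates votes between neighbouring bins.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(cell), len(cell[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region, no vote
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# A vertical edge: dark on the left, bright on the right.
edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
# The same edge under dimmer lighting (intensities divided by four).
dim_edge = [[value // 4 for value in row] for row in edge]

bright_hist = orientation_histogram(edge)
dim_hist = orientation_histogram(dim_edge)
print(bright_hist[0], dim_hist[0])  # 2040.0 504.0
```

Both images vote into the same orientation bin, but the raw magnitude drops by roughly a factor of four when the scene gets darker. Any threshold tuned on the bright image misfires on the dim one unless you add normalisation by hand, which is exactly the kind of per-scenario tuning described above.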

These limitations are why large vision models took over the field. The deep-learning approach learns features automatically from data; the human effort goes into the architecture and the dataset rather than into hand-tuning feature descriptors.

What large vision models actually do differently

A modern object detector takes raw pixels in and outputs structured detections directly. The model learns the features it needs as a side effect of training; you don't write them. Train on enough data and the model becomes robust to the kinds of variations that broke handcrafted methods.
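
As a rough illustration of what "structured detections" look like in practice, the sketch below models a detection as a label, a confidence score, and a pixel box, then applies the confidence filter and non-maximum suppression (NMS) that many YOLO-style detectors run as post-processing. The `Detection` type, the thresholds, and the sample values are assumptions for the example; some detectors, DETR among them, are designed to skip NMS entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    label: str
    score: float   # confidence in [0, 1]
    box: tuple     # (x1, y1, x2, y2) in pixels

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, score_threshold=0.25, iou_threshold=0.5):
    """Drop low-confidence boxes, then greedily keep the best box per object."""
    candidates = sorted(
        (d for d in detections if d.score >= score_threshold),
        key=lambda d: d.score,
        reverse=True,
    )
    kept = []
    for cand in candidates:
        # Suppress a candidate only if it overlaps a kept box of the same class.
        if all(cand.label != k.label or iou(cand.box, k.box) < iou_threshold
               for k in kept):
            kept.append(cand)
    return kept

raw = [
    Detection("person", 0.92, (10, 10, 110, 210)),
    Detection("person", 0.85, (14, 12, 112, 208)),    # near-duplicate of the first
    Detection("person", 0.10, (300, 300, 320, 340)),  # below the score threshold
    Detection("car", 0.77, (200, 50, 400, 180)),
]
final = non_max_suppression(raw)
print([(d.label, d.score) for d in final])  # [('person', 0.92), ('car', 0.77)]
```

Nothing in this step depends on how the features were computed, which is the point: the learned model produces the candidates, and the surrounding code only has to deal with boxes and scores.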

Four practical advantages compared to the older approach:

  1. Features learn themselves. Engineering effort shifts from feature design to dataset curation and architecture choice.
  2. More resilient to environment. The same trained model handles lighting changes, occlusion, and viewpoint shifts that would have broken a Haar cascade.
  3. Scales with data. More training examples generally produce a better model. The ceiling is much higher than what handcrafted features can reach.
  4. Generalises across classes. A single backbone can be fine-tuned for many different detection tasks.

[Figure: Deep Learning Architecture]

The detector landscape today is mostly:

  • YOLO. The default choice for real-time work. Fast, accurate, easy to deploy. Multiple actively-maintained variants (v8, v9, v10).
  • Faster R-CNN. Two-stage detector. Higher accuracy at the cost of speed. Use when accuracy matters more than latency.
  • SSD. Single-shot, fast, lighter than YOLO in some configs. Useful on constrained hardware.
  • DETR. Transformer-based, end-to-end. Newer, less battle-tested than YOLO but interesting for the architectural simplicity.
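
The accuracy-versus-latency trade-off is easiest to settle by measuring on your own hardware. The harness below is a minimal sketch: `measure_latency` and `fake_detector` are hypothetical names, and the stand-in detector does no real inference. In practice you would pass in a callable that wraps whichever model you are evaluating.

```python
import statistics
import time

def measure_latency(detect, frame, warmup=3, runs=20):
    """Time a detector callable on one frame; report median latency and FPS."""
    for _ in range(warmup):
        detect(frame)  # warm-up runs are discarded (caches, lazy initialisation)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        detect(frame)
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    fps = 1.0 / median if median > 0 else float("inf")
    return {"median_ms": median * 1000.0, "fps": fps}

def fake_detector(frame):
    # Hypothetical stand-in for a real model call; it only sums pixel rows.
    return [sum(row) for row in frame]

frame = [[0] * 64 for _ in range(64)]
stats = measure_latency(fake_detector, frame)
print(f"{stats['median_ms']:.3f} ms per frame, {stats['fps']:.0f} FPS")
```

The median is used rather than the mean so that a single slow run does not skew the figure. For GPU models, make sure the callable waits for the device to finish before returning, or the timings will be optimistic.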

YOLOv7 in particular

YOLOv7 was a meaningful step forward in the YOLO line when it came out in 2022. Its authors, the group behind YOLOv4, added several architectural and training refinements that improved both speed and accuracy without making the model harder to deploy. It runs well on GPUs, CPUs, and edge devices, which makes it practical for a wide range of production scenarios.

What's new in v7 specifically:

  • Slimmer architecture. Less compute per inference at comparable accuracy.
  • "Bag of freebies" training tricks. Training-time improvements that don't slow down inference at all.
  • Planned re-parameterised convolution. Multi-branch blocks used during training are merged into single convolutions for inference, so the deployed model has fewer parameters and is faster and lighter to ship.
  • E-ELAN (Extended Efficient Layer Aggregation Networks). Better feature mixing across layers, modest accuracy gain.
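
Re-parameterisation is easier to trust once you see the arithmetic. The sketch below shows the general idea on a single channel: a 3x3 branch and a parallel 1x1 branch give the same output as one 3x3 convolution whose centre weight absorbs the 1x1 weight. This is a toy illustration under our own assumptions (pure Python, no batch-norm fusion, made-up kernel values), not YOLOv7's actual code.

```python
def conv2d(image, kernel):
    """Single-channel 'same' convolution with zero padding (odd square kernel)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:  # outside the image counts as zero
                        total += image[yy][xx] * kernel[i][j]
            out[y][x] = total
    return out

def merge_branches(kernel_3x3, kernel_1x1):
    """Fold a parallel 1x1 branch into the centre tap of a 3x3 kernel."""
    merged = [row[:] for row in kernel_3x3]
    merged[1][1] += kernel_1x1[0][0]
    return merged

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k3 = [[0.5, -1.0, 0.25], [1.0, 2.0, -0.5], [0.0, 0.75, -0.25]]
k1 = [[1.5]]

# Training-time form: two parallel branches whose outputs are summed.
branch_a = conv2d(image, k3)
branch_b = conv2d(image, k1)
two_branch = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(branch_a, branch_b)]

# Inference-time form: one merged 3x3 convolution.
single = conv2d(image, merge_branches(k3, k1))

max_diff = max(abs(a - b)
               for ra, rb in zip(two_branch, single)
               for a, b in zip(ra, rb))
print(max_diff < 1e-9)  # True: the two forms compute the same function
```

The two-branch form carries ten weights and needs two passes over the input; the merged form carries nine and needs one. Multiply that across every re-parameterised block in a network and you get the inference-time saving the bullet above describes, with no change to what the model computes.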

v7 is still in active use even though newer versions have shipped, because it hits a useful sweet spot for many deployments.

Detecting arbitrary objects with HUB

The classical workflow needs thousands of labelled examples per object class. That's a lot of human labelling work, especially for objects you only need to detect a few times a year. Generative AI changes the picture: you can synthesise training data for objects you can describe but don't have many real examples of.

GANs and diffusion models can produce realistic synthetic training images. Combined with a YOLO-style detector, the result is a system that can be trained on a new object class with very little manual labelling. The synthetic data covers most of the variation; real-world examples fill in the edge cases.
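
One reason synthetic data cuts labelling cost is that the generator can know where it put the object. The sketch below assumes a generation or compositing step that records the object's pixel box, and turns that box into the plain-text label line used by YOLO-style training code (class index, then box centre and size normalised to the image dimensions). The helper names are hypothetical and this is not HUB's actual pipeline.

```python
import random

def random_placement(obj_w, obj_h, img_w, img_h, rng):
    """Pick a top-left corner so the object fits fully inside the image."""
    x1 = rng.randint(0, img_w - obj_w)
    y1 = rng.randint(0, img_h - obj_h)
    return (x1, y1, x1 + obj_w, y1 + obj_h)

def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel box (x1, y1, x2, y2) to 'class cx cy w h', all normalised."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / image_width
    cy = (y1 + y2) / 2 / image_height
    w = (x2 - x1) / image_width
    h = (y2 - y1) / image_height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# A fixed box in a 640x480 image: centred, half the width and half the height.
label = to_yolo_label(0, (160, 120, 480, 360), 640, 480)
print(label)  # 0 0.500000 0.500000 0.500000 0.500000

# A randomly placed 100x80 object; the seed makes the run repeatable.
rng = random.Random(42)
box = random_placement(100, 80, 640, 480, rng)
random_label = to_yolo_label(0, box, 640, 480)
```

Each synthetic image then gets a matching `.txt` file with one such line per object, and no human has to draw a box.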

Securade HUB is our open-source implementation of this. It uses YOLOv7 as the detector and generative AI for the synthetic data, in an end-to-end pipeline from "I want to detect X" to a deployed model.

  • Describe what you want. Define the object class with a text prompt or a few example images.
  • Train. HUB handles the synthetic data generation and YOLO fine-tuning.
  • Deploy. Push the trained model to whatever runtime fits your use case.

The whole loop runs in hours rather than weeks. Useful for the long-tail object classes that traditional supervised learning was never going to be cost-effective for.

[Figure: Securade HUB interface]

Where object detection actually gets used

A non-exhaustive list of the production applications we see most often:

  • Autonomous vehicles. Pedestrians, vehicles, signs, road markings. Object detection is the perception layer.
  • Surveillance. Public spaces, suspicious activity, person tracking.
  • Medical imaging. Tumour detection, anomaly highlighting, radiology assistance.
  • Industrial automation. Defect detection on production lines, robotic assembly, quality assurance.
  • Retail analytics. Customer flow, product positioning, loss prevention.
  • Agriculture. Crop monitoring, pest detection, automated harvesting.

Large vision models have made object detection a much more practical capability than it used to be. The architecture choices are well understood, the tooling is mature, and the deployment patterns are documented. YOLOv7 specifically remains a solid default for real-time work, and the broader ecosystem is moving fast.

For custom-object detection without the labelling overhead, the generative-AI approach we've baked into HUB is worth a look. Hours from "I want to detect this" to a trained model is a different working pattern than the old "weeks of labelling" approach, and it's what makes per-site customisation actually feasible.