A typical mid-sized facility might have 200 cameras, each generating 25 frames per second. That's 5,000 frames a second of footage that nobody is actually watching in real time. The cameras record, the recordings get filed away, and most events never get reviewed unless something bad happens.
Custom deep-learning models change that math. The cameras stay the same. What changes is that a model sits behind them and watches the frames as they come in, flagging things you actually care about: weapons, intrusions, unattended bags, unsafe behaviour. This is a walkthrough of what's involved in building and deploying that kind of model, from the data work through to the deployment side.
It's written for engineers and security folks who want to know what they're getting into. The work is doable. It's also more about good data and good integration than it is about exotic architectures.
The basics of what's actually happening
AI threat detection is computer vision running on video frames. The model sees a frame, runs inference, and outputs predictions: bounding boxes, labels, confidence scores. If a prediction crosses a threshold, you fire an alert. The underlying technology is deep learning, which mostly just means convolutional neural networks plus the ecosystem that has built up around them over the past decade.
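As a rough illustration of the "prediction crosses a threshold, fire an alert" step, here is a minimal Python sketch. The `Detection` type, the class names, and the threshold values are all assumptions made up for the example; a real detector would hand you its own output format, and real thresholds come from validation data.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # class name predicted by the model
    confidence: float   # score in [0, 1]
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical per-class thresholds, for illustration only.
ALERT_THRESHOLDS = {"weapon": 0.60, "unattended_bag": 0.75}


def detections_to_alerts(detections, thresholds=ALERT_THRESHOLDS):
    """Keep only detections whose confidence meets their class threshold."""
    alerts = []
    for det in detections:
        threshold = thresholds.get(det.label)
        # Classes without a configured threshold never raise an alert.
        if threshold is not None and det.confidence >= threshold:
            alerts.append(det)
    return alerts


# One frame's worth of made-up model output.
frame_detections = [
    Detection("weapon", 0.82, (120, 40, 180, 110)),
    Detection("weapon", 0.35, (300, 200, 340, 260)),
    Detection("person", 0.99, (90, 10, 220, 400)),
]
alerts = detections_to_alerts(frame_detections)
```

Only the first detection becomes an alert here: the second is below the weapon threshold, and "person" has no threshold configured.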
Computer vision: making the camera useful
Computer vision covers a lot of ground: object detection, classification, segmentation, tracking, pose estimation. For threat detection you mostly care about object detection (finding things in the frame) and sometimes pose estimation (figuring out what people are doing).
CNNs are still the workhorse for this. They pull features out of the image at multiple scales and combine them into detections. There are newer transformer-based approaches that beat them on some benchmarks, but for production video at edge latency budgets, a good CNN is still where most deployments land.
Deep learning: where the data goes
A deep model is only as good as what you trained it on. That means labelled data. A lot of it. The accuracy of your detector is almost entirely a function of how representative and well-labelled your training set is. You can have the best architecture in the world; if your data is bad, your detector is bad.
Architecture matters too, but in a more limited way. The right architecture for your problem usually picks itself once you know your latency budget, your hardware, and your accuracy targets.
Building the thing, step by step
The actual workflow has four steps. None of them are glamorous. Most of the work, by hours, is in the first one.
Step 1: data
You need a dataset that covers the threats you want to detect, plus a lot of negative examples. If you're building a weapon detector, you need footage of weapons in your actual environments under your actual lighting. Footage from a different building under different lighting is not the same thing, and the model will struggle when it sees yours.
Augmentation helps stretch a small dataset. Rotations, crops, colour shifts, simulated lighting changes, all of these expand what the model has seen. It's not a substitute for real data, but it pushes a small dataset further than you'd expect.
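To make that concrete, here is a small sketch of two of those augmentations, a horizontal flip and a brightness shift, written in plain Python on a greyscale image stored as a list of pixel rows. The function names and the box convention (pixel edges, with `x_max` exclusive) are assumptions for the example; in practice you would use an image library, but the point that the labels have to move with the pixels is the same.

```python
import random


def horizontal_flip(image, boxes):
    """Mirror an image (list of pixel rows) and its boxes left to right."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # A box (x_min, y_min, x_max, y_max) must be mirrored with the pixels,
    # otherwise the label no longer covers the object.
    flipped_boxes = [(width - x_max, y_min, width - x_min, y_max)
                     for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, flipped_boxes


def shift_brightness(image, delta):
    """Add delta to every pixel and clamp to the 0-255 range."""
    return [[max(0, min(255, px + delta)) for px in row] for row in image]


def augment(image, boxes, rng):
    """Apply a random flip and a random brightness shift to one sample."""
    if rng.random() < 0.5:
        image, boxes = horizontal_flip(image, boxes)
    image = shift_brightness(image, rng.randint(-40, 40))
    return image, boxes


# Tiny 2x4 "image" with one box covering the leftmost column.
sample_image = [[10, 20, 30, 40], [50, 60, 70, 80]]
sample_boxes = [(0, 0, 1, 2)]
augmented_image, augmented_boxes = augment(sample_image, sample_boxes,
                                           random.Random(0))
```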
Then comes labelling. This is the part everyone underestimates. Labelling 10,000 frames accurately can easily be a week of someone's work, and more if the frames are busy. There are tools that help (CVAT, Roboflow, Label Studio), and there are commercial labelling services, but the labels still need someone who knows your domain to review them.
What to actually pay attention to:
- Dataset size: bigger and more diverse usually beats clever. Aim for thousands of labelled examples, not hundreds.
- Label quality: inconsistent labels make the model learn the wrong things. Have a written labelling standard and review samples.
- Privacy: if you're capturing real people, that's personal data. Treat it like it.
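Part of a labelling standard can be checked mechanically before a human reviews samples. The sketch below assumes a hypothetical annotation format (a dict with a `label` and a pixel `box`) and a made-up class list; adapt both to whatever your labelling tool exports.

```python
# Hypothetical class list taken from an imagined labelling standard.
ALLOWED_LABELS = {"weapon", "unattended_bag", "person"}


def find_label_problems(annotation, image_width, image_height,
                        allowed_labels=ALLOWED_LABELS):
    """Return a list of human-readable problems for one annotation dict."""
    problems = []
    label = annotation.get("label")
    if label not in allowed_labels:
        problems.append(f"unknown label: {label!r}")
    x_min, y_min, x_max, y_max = annotation["box"]
    if x_min >= x_max or y_min >= y_max:
        problems.append("box has zero or negative area")
    if x_min < 0 or y_min < 0 or x_max > image_width or y_max > image_height:
        problems.append("box extends outside the image")
    return problems


good = {"label": "weapon", "box": (10, 10, 50, 60)}
bad = {"label": "wepon", "box": (10, 10, 5, 60)}
good_problems = find_label_problems(good, 1920, 1080)
bad_problems = find_label_problems(bad, 1920, 1080)
```

Checks like these catch typos and broken boxes. They do not tell you whether the box is around the right object, which is why the sample review still matters.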
Step 2: pick a model and train it
Don't train from scratch. Start with a pre-trained backbone and fine-tune it on your data. The standard options are YOLO (fast, decent accuracy), SSD (balanced), and Faster R-CNN (higher accuracy, slower). For real-time video, YOLO variants are usually what you'll end up running.
Training itself is mostly babysitting. You set a learning rate, you start the loop, you watch the validation loss. If it overfits, you adjust regularisation or get more data. If it underfits, you train longer or try a bigger model. The cycle is repetitive, but the diagnostics are well understood and there's tooling for all of it.
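Most training frameworks ship their own version of these diagnostics, but the logic behind "watch the validation loss" is simple enough to sketch. The two helpers below are illustrative only: the names, the `patience` and `window` defaults, and the loss values are all assumptions.

```python
def should_stop_early(val_losses, patience=3, min_delta=0.0):
    """True when validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # "Improved" means beating the earlier best by more than min_delta.
    return recent_best > best_before - min_delta


def is_overfitting(train_losses, val_losses, window=3):
    """True when training loss fell but validation loss rose over `window` epochs."""
    if len(train_losses) <= window or len(val_losses) <= window:
        return False
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising


# Made-up loss curves: training keeps improving, validation turns around.
train_history = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_history = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
stop_now = should_stop_early(val_history)
overfit = is_overfitting(train_history, val_history)
```

With these curves both checks come back true, which is the usual cue to add regularisation or go and get more data.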
The three architectures you'll actually consider:
- YOLO: default choice for live video. The recent variants (v8, v9) are very good.
- SSD: still around, sometimes wins on specific hardware.
- Faster R-CNN: good if you have GPU budget and care about accuracy more than latency.
Step 3: evaluate honestly
Hold out a test set the model has never seen. Score it on precision (of the things you flagged, how many were real?), recall (of the things that were real, how many did you flag?), and F1 (the balance between the two). Look at the failure cases. Don't just look at the aggregate score.
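The three scores follow directly from counts of true positives, false positives, and false negatives on the held-out set. A minimal sketch, with made-up counts:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from detection counts."""
    flagged = true_positives + false_positives   # everything the model flagged
    real = true_positives + false_negatives      # everything that was really there
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / real if real else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts: 80 correct flags, 20 false alarms, 40 missed threats.
precision, recall, f1 = precision_recall_f1(80, 20, 40)
# precision is 0.80, recall is about 0.67, F1 is about 0.73
```

Running the same function per class or per camera, rather than once over the whole test set, is one simple way to find the failure cases hiding behind a decent aggregate score.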
If it doesn't pass, fix the data first. Adding more labelled examples of the cases the model is getting wrong almost always helps more than changing the architecture.
Step 4: deploy it next to the cameras
Once the model is good enough, you wire it up to your camera feeds, run inference, and send alerts. The deployment side has its own optimisations: quantising the model so it runs faster, pruning weights you don't need, picking the right hardware for the inference target.
From there, the alerts need to land somewhere useful: an access control system, a fire panel, a security ops dashboard. The integration layer is usually where deployments stall, so plan for it early.
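What the integration layer looks like depends entirely on the systems you are feeding, but the core is usually a small, stable alert payload plus something that routes it. The sketch below is one possible shape, not a standard: the field names, the `AlertRouter` class, and the camera ID are all hypothetical.

```python
import json
from datetime import datetime, timezone


def build_alert(camera_id, label, confidence, box, when=None):
    """Build a JSON-serialisable alert payload (field names are assumptions)."""
    when = when or datetime.now(timezone.utc)
    return {
        "camera_id": camera_id,
        "label": label,
        "confidence": round(confidence, 3),
        "box": list(box),
        "timestamp": when.isoformat(),
    }


class AlertRouter:
    """Send each alert to every handler registered for its label."""

    def __init__(self):
        self._handlers = {}

    def register(self, label, handler):
        self._handlers.setdefault(label, []).append(handler)

    def dispatch(self, alert):
        handlers = self._handlers.get(alert["label"], [])
        for handler in handlers:
            handler(json.dumps(alert))
        return len(handlers)  # number of downstream systems notified


# Stand-in for a real destination such as a security ops dashboard.
dashboard_inbox = []
router = AlertRouter()
router.register("weapon", dashboard_inbox.append)

alert = build_alert("cam-017", "weapon", 0.8234, (120, 40, 180, 110),
                    when=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
delivered = router.dispatch(alert)
```

In a real deployment the handlers would be HTTP calls, message-queue publishes, or vendor SDK calls; a plain list stands in for them here.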
Squeezing latency out of the pipeline
A live surveillance model with a 2-second inference delay is close to useless. You're trying to act on what is happening now, not on what happened a couple of seconds ago. So latency optimisation is part of the job, not an afterthought.
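It helps to put numbers on the budget. At the 25 frames per second used earlier, a new frame arrives every 40 ms, so a 2-second delay means the model is 50 frames behind. The sketch below does that arithmetic; the 10 ms inference time in the last line is a made-up figure, not a benchmark.

```python
def frame_interval_ms(fps):
    """Time between consecutive frames from one camera, in milliseconds."""
    return 1000.0 / fps


def frames_behind(latency_ms, fps):
    """How many new frames arrive while one result is still in flight."""
    return latency_ms / frame_interval_ms(fps)


def max_cameras_per_device(inference_ms, fps):
    """Cameras a single sequential inference worker can keep up with."""
    inferences_per_second = 1000.0 / inference_ms
    return int(inferences_per_second // fps)


budget_ms = frame_interval_ms(25)          # 40.0 ms per frame
backlog = frames_behind(2000, 25)          # 50.0 frames for a 2-second delay
cameras = max_cameras_per_device(10, 25)   # 4 cameras at an assumed 10 ms per frame
```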
Quantisation and pruning
Quantising converts your model's 32-bit floats to 8-bit integers. That's a 4x reduction in the memory the weights take up and usually a 2-3x speed-up on hardware with int8 support, with a small drop in accuracy. Post-training quantisation is the easy version; quantisation-aware training takes more effort but generally loses less accuracy.
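To show where the 4x comes from and why accuracy drops slightly, here is a toy per-tensor affine quantiser in plain Python. It is a simplified sketch, not how any particular toolchain does it: real frameworks calibrate on data and often quantise per channel. The weight values are made up.

```python
FLOAT32_BYTES = 4
INT8_BYTES = 1


def quantise_int8(values):
    """Map floats to int8 using a single scale and zero point."""
    lo = min(min(values), 0.0)   # widen the range so 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0   # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    quantised = [max(-128, min(127, round(v / scale) + zero_point))
                 for v in values]
    return quantised, scale, zero_point


def dequantise(quantised, scale, zero_point):
    """Recover approximate floats from int8 values."""
    return [(q - zero_point) * scale for q in quantised]


weights = [-0.62, -0.1, 0.0, 0.27, 0.91]
q_weights, scale, zero_point = quantise_int8(weights)
restored = dequantise(q_weights, scale, zero_point)

memory_ratio = FLOAT32_BYTES // INT8_BYTES   # 4x smaller per weight
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight stays within one quantisation step (`scale`), which is the "small drop in accuracy" in miniature.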
Pruning removes weights the model doesn't really need. Combined with quantisation, you can usually get a model down to a fraction of its training-time footprint without hurting performance much.
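There are several ways to decide which weights a model "doesn't really need". The sketch below uses magnitude pruning, one common choice, where the weights closest to zero are removed first. The function name and the example weights are assumptions for illustration.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(layer_weights, sparsity=0.5)
# pruned == [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

Zeroed weights only save time and memory if the runtime or hardware can exploit the sparsity, and the model usually needs a short fine-tune afterwards, so treat this as the idea rather than a full recipe.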
The right hardware
GPUs are the default. They handle the parallelism that CNNs are built around. Cheaper edge accelerators such as the NVIDIA Jetson Nano or the Coral USB Accelerator get you a long way at much lower power. Cloud TPUs are great for batch inference but rarely the right answer for live streams.
Putting the model near the cameras
Running inference on a box at the site, rather than shipping frames to a cloud cluster, takes the network round trip out of the critical path. For real-time use cases, that's not a nice-to-have; it's often the only way to hit the latency budget.
The things that will bite you
Three problems show up in every deployment.
False positives
A model that fires constantly is worse than no model at all, because the people on the receiving end stop reading the alerts. Set your detection threshold on the strict side, so that the alerts that do fire are mostly real, and accept that you'll miss some edge cases in exchange for trust. A human-in-the-loop review of borderline cases helps a lot, especially in the first few months when you're still learning where the model goes wrong.
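One way to pick that threshold is to sweep it over a labelled validation set and take the lowest value that still meets a precision target, which keeps recall as high as the target allows. The sketch below assumes you have (confidence, was_it_real) pairs from review; the data and the 0.9 target are made up.

```python
def precision_at(threshold, scored):
    """Precision of alerts at a given threshold, or None if nothing fires."""
    flagged = [is_real for score, is_real in scored if score >= threshold]
    return sum(flagged) / len(flagged) if flagged else None


def lowest_threshold_for_precision(scored, target_precision):
    """Lowest candidate threshold whose precision meets the target, or None."""
    for threshold in sorted({score for score, _ in scored}):
        precision = precision_at(threshold, scored)
        if precision is not None and precision >= target_precision:
            return threshold
    return None


# (model confidence, did a reviewer confirm it was a real threat?)
reviewed = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, False),
]
chosen = lowest_threshold_for_precision(reviewed, target_precision=0.9)
# chosen == 0.9: only the two highest-scoring detections fire, both real
```

The price shows up immediately: at that threshold the real threat scored 0.70 is missed, which is exactly the trade the paragraph above describes.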
Scaling
Going from 10 cameras to 1000 changes the engineering problem. You'll need to think about how you push model updates to all the edge devices, how you collect new training data from the fleet, and how you monitor the health of the model across sites.
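Fleet tooling varies a lot, but two of the questions above reduce to simple checks over a device inventory: which boxes are running an old model, and which have gone quiet. The sketch below assumes a hypothetical inventory dict with `model_version` and `last_heartbeat` (epoch seconds) per device; the device names, versions, and five-minute silence limit are illustrative.

```python
def devices_needing_update(fleet, target_version):
    """Device IDs whose deployed model differs from the target version."""
    return sorted(device for device, info in fleet.items()
                  if info["model_version"] != target_version)


def silent_devices(fleet, now, max_silence_s=300):
    """Device IDs that have not sent a heartbeat within max_silence_s seconds."""
    return sorted(device for device, info in fleet.items()
                  if now - info["last_heartbeat"] > max_silence_s)


# Made-up inventory for three edge boxes.
fleet = {
    "site-a-edge-1": {"model_version": "1.4.0", "last_heartbeat": 1_000_000},
    "site-a-edge-2": {"model_version": "1.3.2", "last_heartbeat": 999_990},
    "site-b-edge-1": {"model_version": "1.4.0", "last_heartbeat": 999_000},
}
outdated = devices_needing_update(fleet, target_version="1.4.0")
quiet = silent_devices(fleet, now=1_000_010)
```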
The ethics, which are not optional
You're building something that watches people. Bias in the training data shows up as bias in the detections. Privacy expectations vary by jurisdiction and by industry. Accountability for what the model decides has to sit with a person, not the model. None of this is decorative; people will ask you about it, and you should have answers.
Custom threat detection isn't easy, but none of the pieces are mysterious. Most of the work is in the data, some of it is in the deployment, and a smaller-than-you'd-think portion is in the model itself. If you get the data right and the integration right, the model becomes the easy part.
The tooling has matured a lot. The models have matured. The bottlenecks now are operational: labelling, fleet management, alert routing, and making sure the system stays honest as your environment changes.
If you want a starting point that already has the camera I/O, model serving, and alert routing wired up, take a look at securade.ai HUB on GitHub. It's open source.
