Off-the-shelf computer vision models are great at the common stuff. They can find people, vehicles, hard hats, fire extinguishers. What they cannot do is recognise the specific, weird, often dangerous stuff that exists only at your site. A missing wet-floor sign by the loading bay. A specific valve that needs to be in the closed position. A piece of equipment that should never be in this aisle. The long tail of workplace hazards is unique to your facility, and detecting it needs a custom model.

This is a walkthrough of how to build one. Capturing the data, labelling it well, picking the architecture, training, deploying, iterating. The whole loop. It's aimed at engineers who already have basic familiarity with computer vision and want a practical guide to making it work in a real workplace.

The same approach works whether you're trying to spot chemical spills, exposed cabling, forklifts in zones they shouldn't be in, or any other site-specific compliance issue.

Step 1: figure out what you actually need to detect

Don't start with the model. Start with the hazards. Walk the floor, talk to the safety team, pull the incident log for the last 12 months. The patterns you find are what you build the model around. The mistake teams make is starting from "let's detect everything"; the better start is "what are the three things we miss most often?"
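
As a rough illustration of the "three things we miss most often" question, the sketch below tallies a year of incidents by category. The log format and the category names are made up for the example; a real incident log will need its own parsing and clean-up first.

```python
from collections import Counter

# Hypothetical incident log: one (date, hazard_category) row per recorded incident.
incident_log = [
    ("2024-01-14", "blocked exit"),
    ("2024-02-02", "chemical spill"),
    ("2024-02-20", "blocked exit"),
    ("2024-03-11", "unattended ladder"),
    ("2024-04-05", "chemical spill"),
    ("2024-05-19", "blocked exit"),
    ("2024-06-23", "missing wet-floor sign"),
    ("2024-08-30", "chemical spill"),
    ("2024-09-12", "blocked exit"),
    ("2024-11-03", "unattended ladder"),
]


def top_hazards(log, n=3):
    """Return the n most frequently recorded hazard categories with their counts."""
    counts = Counter(category for _date, category in log)
    return counts.most_common(n)


for category, count in top_hazards(incident_log):
    print(f"{category}: {count} incidents")
```

Frequency is only one signal. A rare hazard with severe consequences may still deserve a slot, which is why the conversation with the safety team matters as much as the tally.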

A useful frame for finding your long tail:

  • Equipment-specific: hazards tied to particular machines or fixtures on your site. A custom valve, a one-off conveyor design, an older piece of kit that doesn't fit standard guards.
  • Environmental: blind spots, narrow passages, uneven floors, weird sightlines. The physical layout that introduces risk.
  • Process-related: hazards that show up in how work gets done. The way materials are stacked, the way two tools end up in the same hand, the specific moment of transition between shifts.
  • Compliance: blocked fire exits, missing safety guards, mislabelled storage, expired signage.

A few real long-tail hazards we've helped clients detect:

  • Wet-floor sign missing in a zone where it should be deployed during cleaning hours.
  • A specific machine's status light going red.
  • Unattended ladder leaning where a passing forklift could clip it.
  • Particular chemical spill (the colour and texture are recognisable).
  • Boxes stacked in front of an emergency exit.

Scope down to a handful. You can always add more detectors later.

Step 2: build the dataset

Once you know what to detect, you need a dataset. Photos and short clips of the hazards, in the actual conditions your model will see in production. Generic stock images won't cut it.

Collecting the images:

  • Take a lot of them, from varied angles. Different distances, different times of day, different cameras if you can. Variability in the training set is what makes the model robust on the floor.
  • Capture the variants. A spill on concrete looks different from a spill on metal. A blocked exit at midnight looks different from one at noon. Get coverage across the realistic range.
  • Mind the privacy. If workers are in the frame, follow your local data protection rules. Anonymise where appropriate. If your jurisdiction requires consent, get it before you start collecting.

Labelling the dataset:

Labelling is bounding-box work: draw a tight box around each instance of the hazard in each image, give it a class label. Tedious, but the quality of the labels is the ceiling for the model's performance, so do it carefully.

  • Use the right tools. LabelImg, CVAT, or VGG Image Annotator are all solid. CVAT is probably the most popular for team work.
  • Be consistent. Same labelling rules across the whole dataset. If two annotators label the same thing differently, the model learns the inconsistency. Write a labelling guide if more than one person is doing the work.
  • Handle the ambiguity explicitly. Decide upfront how you'll handle partial occlusion, edge cases, ambiguous overlap. Document your decisions so the labels stay consistent.
  • Augment. Rotations, flips, brightness shifts, crops, all multiply your effective dataset size cheaply. Don't go crazy; mirror only what's realistic in the production setting.
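
One detail of augmentation that is easy to get wrong: when the image changes geometrically, the bounding boxes have to change with it. Here is a minimal sketch of a horizontal flip, assuming boxes are stored as (x_min, y_min, x_max, y_max) in pixel coordinates and the image is a plain list of pixel rows. Both function names are illustrative; in practice an augmentation library would handle this for you.

```python
def hflip_image(pixels):
    """Mirror an image stored as a list of rows (each row is a list of pixels)."""
    return [row[::-1] for row in pixels]


def hflip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box around the image's vertical centre line."""
    x_min, y_min, x_max, y_max = box
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - x_max, y_min, image_width - x_min, y_max)


# A 200-pixel-wide image with one labelled hazard near the left edge.
original_box = (10, 20, 50, 80)
flipped_box = hflip_box(original_box, image_width=200)
print(flipped_box)
```

Whether a flip is realistic depends on the site. If the hazard includes readable signage, or the exit is always on the same wall from a fixed camera, a mirrored image shows the model something it will never see in production.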

Step 3: pick the architecture

The right model depends on your latency budget, accuracy requirements, and what hardware you'll run it on. Don't pick by reputation; pick by what your constraints demand.

The three usual suspects:

  • Faster R-CNN: two-stage detector, slower, often more accurate. Good if you have GPU budget and care about precise localisation.
  • YOLO (You Only Look Once): single-stage, fast, designed for live video. Recent generations (v8, v9, v10) are very good. Default for most real-time use cases.
  • SSD (Single Shot MultiBox Detector): single-stage, lighter than YOLO in some configs. Useful for embedded or low-power devices.

Choosing between them:

  • Real-time vs. accuracy. Live video means YOLO or SSD. Offline batch analysis means you can afford Faster R-CNN's slower inference.
  • Hardware budget. Edge device with a small GPU favours quantised YOLO. Cloud GPU server can run whatever you want.
  • Dataset size. Smaller datasets are more prone to overfitting; pick a model with fewer parameters or use heavier regularisation.

Always use transfer learning. Start with an ImageNet or COCO pre-trained backbone and fine-tune it on your custom data. That can cut training time from days to hours and usually gives better accuracy than training from scratch, especially with limited data.

Step 4: train it, evaluate honestly

Training is where you turn the dataset into a working model. The recipe is well known; the details matter.

How to set it up:

  • Split the data three ways. Train (70%), validation (15%), test (15%). The test set gets touched exactly once, at the end. Treat it like a treasured artefact.
  • Pick an optimiser and loss function. Adam with a learning rate around 1e-4 is a safe starting point. For loss, the standard combination of focal loss for classification and IoU-based loss for boxes works for most detectors.
  • Watch the metrics. Precision, recall, F1, mAP. Don't fixate on aggregate accuracy; the per-class breakdown is where problems hide.
  • Tune the hyperparameters. Learning rate matters most. Batch size, epochs, weight decay follow. Use the validation set for the tuning; never touch the test set.
  • Stop early. Watch validation loss; when it stops improving, stop training. Otherwise you're just memorising the training set.
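
The split and the early-stopping rule above can be sketched in a few lines of plain Python. This is a framework-agnostic illustration, not a recipe from any particular library: `split_dataset`, `EarlyStopping`, and the `patience` and `min_delta` parameters are names chosen for the example.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # whatever is left is the test set
    )


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


train_set, val_set, test_set = split_dataset(range(100))
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([1.0, 0.8, 0.81, 0.82, 0.83]):
    if stopper.should_stop(val_loss):
        print(f"stopping after epoch {epoch}")
        break
```

One caveat with a random split: frames cut from the same video clip are near-duplicates, so it is usually safer to split by clip or by capture session than by individual frame. Otherwise the validation numbers look better than they should.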

What the metrics actually mean:

  • Precision: when the model says "hazard", how often is it right. Low precision means too many false alarms.
  • Recall: when there's a real hazard, how often does the model catch it. Low recall means real hazards slip through.
  • F1-score: harmonic mean of precision and recall. Single number that captures both.
  • mAP (mean average precision): the standard object-detection metric. It averages the per-class average precision, and in the COCO variant also averages over a range of IoU thresholds.
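
To make these definitions concrete, here is a small sketch that computes IoU for a pair of boxes, and precision, recall and F1 from raw counts. It assumes boxes are (x_min, y_min, x_max, y_max) tuples and that detections have already been matched to ground truth to produce the true-positive, false-positive and false-negative counts. The function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 hazards caught, 2 false alarms, 4 hazards missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With 8 true positives, 2 false alarms and 4 misses, precision is 0.80, recall is about 0.67 and F1 is about 0.73. A detection typically counts as a true positive only if its IoU with a ground-truth box clears a threshold such as 0.5.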

Tuning the precision-recall trade-off: for safety-critical detectors, the trade-off matters a lot. Low recall means you miss the things that hurt people. Low precision means workers stop trusting the alerts because they fire too often. Lean toward higher recall for catastrophic hazards (missing PPE near operating machinery), tighter precision for nuisance alerts (compliance checks that are merely embarrassing).
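
In practice this trade-off is usually set through the confidence threshold. The sketch below picks the highest threshold that still meets a target recall on validation data, and reports the precision you pay for it. It assumes each validation detection has already been marked as a real hazard or a false alarm, and that confidence scores are distinct. The function name and the input format are hypothetical.

```python
def threshold_for_recall(scored, total_hazards, target_recall):
    """Find the highest confidence threshold whose recall meets the target.

    scored: list of (confidence, is_real_hazard) pairs for validation detections.
    total_hazards: number of real hazards in the validation set (must be > 0),
        including any the model never detected at all.
    Returns (threshold, precision, recall), or None if the target is unreachable.
    Assumes distinct confidence values; ties would need grouping.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    for confidence, is_real_hazard in ranked:
        if is_real_hazard:
            tp += 1
        else:
            fp += 1
        recall = tp / total_hazards
        if recall >= target_recall:
            return confidence, tp / (tp + fp), recall
    return None


validation = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, True),
]
# Five real hazards in total; one was never detected at any confidence.
print(threshold_for_recall(validation, total_hazards=5, target_recall=0.5))
print(threshold_for_recall(validation, total_hazards=5, target_recall=0.9))
```

A `None` result is useful information in itself: no threshold can reach the target recall, so the fix has to come from more data or a better model. Choose the threshold on the validation set, not the test set.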

Step 5: deploy and learn from real use

A model that scores well in evaluation can still struggle on the actual floor. Real cameras have different angles, lighting, and occlusion patterns than your test set. The deployment phase is where you find this out.

Getting it running:

  • Pick where it'll run. Local server, edge box, or cloud. The latency and hardware constraints from Step 3 apply.
  • Wire it to the cameras. A Python script that pulls frames from RTSP, runs inference, and emits events is enough to start. Production deployments use proper streaming pipelines (DeepStream, OpenCV plus a worker queue, etc.).
  • Build the alerting. Email, SMS, webhook into your incident management tool. Whichever the safety team will actually pay attention to.
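
A starter version of that script might look like the sketch below. It assumes OpenCV (`opencv-python`) is installed for the RTSP part. `detect` and `emit` are placeholders you would supply: `detect(frame)` returns a list of (label, confidence) pairs from whatever model you trained, and `emit(event)` delivers the alert. The per-label cooldown is an addition for the example, so that one lingering hazard doesn't send an alert on every frame.

```python
import time


def rtsp_frames(url):
    """Yield frames from an RTSP stream until it ends or a read fails."""
    import cv2  # imported lazily so the rest of the script works without OpenCV

    capture = cv2.VideoCapture(url)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            yield frame
    finally:
        capture.release()


def run_detector(frames, detect, emit, min_confidence=0.5,
                 cooldown_s=30.0, clock=time.monotonic):
    """Run detect() on each frame and emit at most one event per label per cooldown."""
    last_sent = {}  # label -> time of the last emitted event
    for frame in frames:
        now = clock()
        for label, confidence in detect(frame):
            if confidence < min_confidence:
                continue
            if label in last_sent and now - last_sent[label] < cooldown_s:
                continue  # still inside the cooldown window for this hazard
            last_sent[label] = now
            emit({"label": label, "confidence": confidence, "time": now})


# Dry run with fake frames and a fake detector, no camera needed.
events = []
run_detector(
    frames=["frame-1", "frame-2"],
    detect=lambda frame: [("blocked_exit", 0.9)],
    emit=events.append,
)
print(events)
```

Against a real camera you would pass `rtsp_frames("rtsp://...")` as `frames`, with your own stream URL. Keeping the frame source, the detector and the alert sink as separate arguments makes it easy to test the alerting logic without a camera, and to swap in a proper streaming pipeline later.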

Iterating from there:

  • Get worker feedback. The people on the floor will tell you what the model is missing and what it's flagging too often. Listen.
  • Look at the errors. Pull a sample of false positives and false negatives every week. The patterns will tell you where to focus.
  • Use active learning. Let the model surface the cases it's least confident about, label those, retrain. Way more efficient than labelling random new data.
  • Keep iterating. Sites change. New equipment shows up, layouts shift, lighting evolves. A static model becomes a stale model in 6-12 months.

Beyond initial deployment: active learning and continuous improvement

Deployment isn't the finish line. Your site keeps changing and new hazards keep appearing, so the model has to keep adapting. Active learning and a habit of continuous improvement are how you keep it effective.

Active learning, the underused technique:

Most teams just feed random new images into the next training round. Active learning is smarter: the model picks the images it's most uncertain about, and those go to the annotators. Same labelling cost, much better learning per labelled example.

  1. The model surfaces its uncertain cases. Images where confidence is borderline. Usually these are partial occlusion, unusual lighting, or novel compositions the model hasn't seen before.
  2. Annotators label those. Humans-in-the-loop, focused on the cases where the human input has the most teaching value.
  3. Retrain on the new batch. The model improves fastest on the boundaries it was struggling with.
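
Step 1 of that loop can be sketched as simple uncertainty sampling: rank unlabelled images by how close their detection confidences sit to the decision threshold, and send the closest ones to the annotators. This is one common selection rule among several, not the only way to do active learning. The function name, the input format and the 0.5 threshold are assumptions for the example.

```python
def select_for_labelling(predictions, budget, decision_threshold=0.5):
    """Pick the `budget` images whose detections are most borderline.

    predictions: dict mapping image id -> list of detection confidences.
    An image with no detections is scored as if its confidence were 0.0.
    """
    def margin(confidences):
        # Distance from the threshold of the most borderline detection.
        return min(
            (abs(c - decision_threshold) for c in confidences),
            default=decision_threshold,
        )

    ranked = sorted(predictions, key=lambda image_id: margin(predictions[image_id]))
    return ranked[:budget]


unlabelled = {
    "cam1_0001.jpg": [0.98],        # confident detection
    "cam1_0002.jpg": [0.52, 0.90],  # one borderline detection
    "cam2_0001.jpg": [],            # nothing detected
    "cam2_0002.jpg": [0.35],        # fairly uncertain
    "cam3_0001.jpg": [0.05],        # confident there's nothing much here
}
print(select_for_labelling(unlabelled, budget=2))
```

In this example the two images picked are `cam1_0002.jpg` and `cam2_0002.jpg`. A pure uncertainty rule will not find hazards the model confidently misses, so it is worth mixing in a small random sample and the worker-reported misses as well.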

Why it pays off:

  • Less labelling overhead. You label far fewer images for the same model improvement.
  • Faster model gains. The hard cases pull the model up faster than random easy ones would.
  • Adapts to new hazards. As your site changes, the active-learning loop catches the new patterns quickly.

Keeping the model honest over time:

A custom detector isn't a one-and-done project. To stay useful it needs ongoing care:

  • Track performance. Precision, recall, F1, mAP. If any drift, find out why.
  • Make feedback easy. Workers should have a one-click way to report a missed detection or a false alarm.
  • Schedule retrains. Monthly or quarterly. Each cycle incorporates new data and any architecture improvements.
  • Share what you learn. Across sites, across teams, across vendors. Patterns of safety failure are surprisingly portable.
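
Tracking performance can start very simply: compare the latest metrics against the numbers recorded at deployment and flag anything that has slipped. The sketch below assumes metrics are kept as plain dictionaries. The 0.05 tolerance is an arbitrary placeholder, not a recommended value.

```python
def drifted_metrics(baseline, current, tolerance=0.05):
    """Return metrics that have dropped more than `tolerance` below baseline.

    Both arguments map metric name -> value. The result maps each drifted
    metric to a (baseline_value, current_value) pair.
    """
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


at_deployment = {"precision": 0.90, "recall": 0.85}
this_month = {"precision": 0.89, "recall": 0.70}
print(drifted_metrics(at_deployment, this_month))
```

Here only recall is flagged. That is the prompt to pull recent false negatives and work out what changed on the floor.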

Building a custom hazard detector isn't conceptually hard. The pieces are all standard computer vision practice. What separates a working deployment from a stalled one is usually the unglamorous stuff: clean labelling, honest evaluation, willingness to iterate on the feedback that comes back from the floor.

Start with one hazard. Get it working end to end. Add the next one only when the first is stable. The teams that ship the best safety AI aren't the ones with the fanciest models; they're the ones who treat the model as part of an ongoing safety program rather than a one-off project.

Support our open-source projects by starring our GitHub repository: https://github.com/securade/hub. By starring the repo, you join a collaborative community that drives innovation in real-time AI surveillance solutions and shares insights on building robust computer vision systems.