Diagnosing and Rebuilding a Computer Vision Pipeline Under Real-World Constraints

TL;DR

Context: Computer vision challenge (Bosch) estimating age/gender from low-resolution, occluded CCTV footage
Problem: Initial pipeline with face detection, pose estimation, and super-resolution failed systematically
Intervention: Diagnosed failure modes, simplified architecture to person-centric approach, rebuilt tracking and prediction
Impact: Achieved ~82% gender classification accuracy on validation data and stable behavior on real CCTV footage by removing brittle dependencies

Intro

This project addressed a real-world computer vision problem posed by Bosch: estimating demographic attributes (age and gender) from low-resolution CCTV video under conditions of occlusion, motion blur, and poor lighting. The focus was not just model accuracy, but diagnosing why complex pipelines fail and designing architectures that remain stable under real-world constraints.

Problem

Extremely low-resolution faces in wide-angle CCTV footage
Initial pipeline (face detection → pose estimation → super-resolution → prediction) failed at multiple stages
Face detection failed completely; pose estimation and super-resolution did not generalize to the target domain
No clean labeled datasets aligned with CCTV characteristics

Intervention

Systematically evaluated each pipeline component and documented failure modes
Removed face detection, pose estimation, and super-resolution as unreliable dependencies
Redesigned around person-centric approach: person detection + tracking → direct prediction
Selected ByteTrack over DeepSORT for better occlusion handling
Implemented pseudo-labeling strategy to compensate for missing annotations

Impact

Achieved ~82% gender classification accuracy on validation data
Pipeline remained stable across occlusions and viewpoint changes
Temporal aggregation via tracking improved noisy per-frame predictions
Delivered complete end-to-end demo on real CCTV video (VIRAT dataset)

Why This Matters

Many ML projects fail not because models are weak, but because pipelines accumulate brittle dependencies. Knowing when to simplify an architecture, and having the discipline to remove components that don't work, is often more valuable than adding sophistication.

Technical Deep Dive (Optional)

This section expands on failure analysis, architectural redesign, and data strategy for readers who want technical depth.

Initial Pipeline & Systematic Failure Analysis

Proposed Pipeline

Person detection (YOLOv5)
Face detection
Pose estimation
Face super-resolution
Feature extraction
Age & gender prediction

Failure Modes Identified

| Component | Outcome |
|-----------|---------|
| Face detection | Failed (Haar, DLIB, HOG+SVM, CNN-based); insufficient facial detail |
| Pose estimation | OpenPose failed entirely on low-resolution inputs |
| Super-resolution | SRGAN/ESRGAN/SISN didn't generalize to CCTV domain |

Key Insight: Attempting to recover fine-grained facial features from degraded video introduced cascading failures. Each stage amplified errors from previous stages.

Architectural Redesign: Person-Centric Approach

New Pipeline

Video frames
Person detection + tracking
Age & gender prediction (person-level)

Rationale

Removed all brittle dependencies (face detection, pose, super-resolution)
Focused on person-level features observable at CCTV resolution
Leveraged temporal consistency via tracking

Result: With no face detection, pose estimation, or super-resolution stage left to fail, the pipeline became significantly more stable and reliable.

Tracking: ByteTrack vs DeepSORT

Requirement

Maintain consistent IDs across occlusions
Handle low-confidence detections
Enable temporal aggregation of predictions
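
The temporal aggregation requirement above can be illustrated with a minimal sketch. The write-up does not say which aggregation rule was used, so plain averaging of per-frame probabilities per track ID is an assumption here, as are the function name `aggregate_by_track` and the 0.5 decision threshold.

```python
from collections import defaultdict


def aggregate_by_track(frame_predictions, threshold=0.5):
    """Average per-frame probabilities for each track ID.

    frame_predictions: iterable of (track_id, p_positive) pairs, one per
    tracked person per frame, where p_positive is the classifier's
    probability for the positive class of a binary attribute.
    Returns {track_id: (mean_probability, is_positive)}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for track_id, p_positive in frame_predictions:
        sums[track_id] += p_positive
        counts[track_id] += 1

    aggregated = {}
    for track_id, total in sums.items():
        mean_p = total / counts[track_id]
        aggregated[track_id] = (mean_p, mean_p >= threshold)
    return aggregated
```

A single noisy frame (for example, one frame at 0.4 among frames at 0.9 and 0.8) no longer flips the label for that person, which is the effect stable track IDs are meant to enable.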

Evaluation

| Tracker | Strength | Limitation |
|---------|----------|------------|
| DeepSORT | Mature, widely used | Struggled with low-confidence detections |
| ByteTrack | Better occlusion handling, ID consistency | Newer, less documented |

Decision: ByteTrack was selected for its better occlusion handling and ID consistency on degraded video.
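
ByteTrack's handling of low-confidence detections comes from a two-stage association: high-score detections are matched to existing tracks first, and low-score detections are then offered only to the tracks that remain unmatched. The sketch below is a simplified illustration of that idea, not the project's code or the ByteTrack library API. It uses greedy IoU matching against each track's last box, whereas ByteTrack itself matches against Kalman-filter predictions with Hungarian assignment. The threshold values and all function names are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, min_iou):
    """Greedily pair tracks and detections by descending IoU.

    track_boxes: {track_id: box}; det_boxes: {det_index: box}.
    Returns {track_id: det_index}.
    """
    pairs = sorted(
        ((iou(tb, db), tid, di)
         for tid, tb in track_boxes.items()
         for di, db in det_boxes.items()),
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, di in pairs:
        if overlap < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or di in used:
            continue
        matches[tid] = di
        used.add(di)
    return matches


def associate(tracks, detections, high_thresh=0.6, low_thresh=0.1, min_iou=0.3):
    """Two-stage association in the spirit of ByteTrack.

    tracks: {track_id: last_box}; detections: list of (box, score).
    Returns (matches, new_track_candidates), where matches maps
    track_id -> detection index and new_track_candidates lists unmatched
    high-score detection indices.
    """
    high = {i: b for i, (b, s) in enumerate(detections) if s >= high_thresh}
    low = {i: b for i, (b, s) in enumerate(detections)
           if low_thresh <= s < high_thresh}

    # Stage 1: confident detections claim tracks first.
    first = greedy_match(tracks, high, min_iou)

    # Stage 2: low-score detections (often occluded or blurred people)
    # may only rescue tracks that stage 1 left unmatched.
    remaining = {tid: b for tid, b in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, min_iou)

    matched_high = set(first.values())
    new_track_candidates = [i for i in high if i not in matched_high]
    # Unmatched low-score detections are dropped as likely background.
    return {**first, **second}, new_track_candidates
```

In this scheme a partially occluded person whose detection score drops below the high threshold can still keep their track ID, while a low-score detection that overlaps no existing track never starts a new one.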

Data Strategy: Handling Missing Labels

Challenge

No labeled CCTV dataset with age/gender annotations
Existing datasets (RAP, VIRAT, PA-100K, SAIVT-SoftBio) had coarse or missing labels

Solution: Pseudo-Labeling

Used pretrained models to generate initial labels
Modeled each age label as a normal distribution (μ = predicted age, σ = 5 years)
Leveraged PA-100K and PETA datasets for knowledge transfer
Applied augmentation to compensate for limited data

Result: Enabled training despite imperfect annotations.
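
One way to use a normal-distribution age label is to turn it into a soft target over age bins, so a pseudo-label of 37 puts most of its weight on the bin containing 37 and some on its neighbors. The sketch below assumes that interpretation; the write-up does not spell out how the distribution was consumed. The bin edges and the function names are hypothetical. Only σ = 5 comes from the text.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def soft_age_label(predicted_age, bin_edges, sigma=5.0):
    """Probability mass of N(predicted_age, sigma^2) in each age bin.

    bin_edges: ascending edges, e.g. [0, 18, 30] defines bins [0, 18) and
    [18, 30). The masses are renormalized so they sum to 1 over the bins.
    """
    masses = [
        normal_cdf(hi, predicted_age, sigma) - normal_cdf(lo, predicted_age, sigma)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ]
    total = sum(masses)
    if total <= 0.0:
        raise ValueError("predicted_age is too far outside the bin range")
    return [m / total for m in masses]


# Hypothetical bin edges, chosen only for illustration.
AGE_BIN_EDGES = [0, 18, 30, 45, 60, 100]
label = soft_age_label(37.0, AGE_BIN_EDGES)
```

A soft target of this kind encodes the uncertainty of the pseudo-label directly, so a pretrained model's estimate that is off by a few years is penalized less than one that lands in a distant bin.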

Model Training

Gender Classification

Binary classifier on person-level crops
~82% validation accuracy
Near-perfect training accuracy; the gap to validation accuracy pointed to domain noise rather than a modeling failure

Age Estimation

Modeled as age-range classification
Converted to a single age estimate via probability-weighted averaging over bins
~66% validation accuracy on age ranges (reflected dataset ambiguity)
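
The weighted averaging over bins can be sketched as an expected value: each bin contributes a representative age weighted by its predicted probability. Using bin midpoints as the representative ages is an assumption, as are the bin edges in the usage note and the function name `expected_age`.

```python
def expected_age(bin_probs, bin_edges):
    """Probability-weighted average of bin midpoints.

    bin_probs: predicted probability per age bin (need not sum exactly to 1).
    bin_edges: ascending edges; len(bin_edges) must be len(bin_probs) + 1.
    """
    if len(bin_probs) != len(bin_edges) - 1:
        raise ValueError("need exactly one probability per bin")
    total = sum(bin_probs)
    if total <= 0.0:
        raise ValueError("probabilities must have positive total mass")
    midpoints = [(lo + hi) / 2.0 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    return sum(p * m for p, m in zip(bin_probs, midpoints)) / total
```

With hypothetical edges `[0, 18, 30, 45, 60, 100]` and probabilities `[0.0, 0.2, 0.6, 0.2, 0.0]`, the midpoints are 9, 24, 37.5, 52.5, and 80, and the estimate is 0.2 × 24 + 0.6 × 37.5 + 0.2 × 52.5 = 37.8 years.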

Real-World Validation

VIRAT Video Demo

Complete end-to-end pipeline demonstration
Person detection → tracking → prediction → visualization
Validated system feasibility on real CCTV footage

Outcome: The system performed robustly under occlusion, motion blur, and viewpoint changes.

Key Engineering Learnings

Simplicity beats complexity under constraints

Removing components improved reliability more than adding sophistication

Face-centric approaches fail at CCTV resolution

Person-level features more robust than facial attributes

Tracking enables temporal filtering

Aggregating predictions across frames reduced noise

Data strategy matters more than model architecture

Pseudo-labeling and augmentation had outsized impact

Completed as part of a Bosch problem statement focused on real-world computer vision system design.