TL;DR
- Context: Computer vision challenge (Bosch) estimating age/gender from low-resolution, occluded CCTV footage
- Problem: Initial pipeline with face detection, pose estimation, and super-resolution failed systematically
- Intervention: Diagnosed failure modes, simplified architecture to person-centric approach, rebuilt tracking and prediction
- Impact: Achieved ~82% gender classification accuracy on validation data and a pipeline that stayed stable on real CCTV footage after brittle dependencies were removed
Intro
This project addressed a real-world computer vision problem posed by Bosch: estimating demographic attributes (age and gender) from low-resolution CCTV video under conditions of occlusion, motion blur, and poor lighting. The focus was not just model accuracy, but diagnosing why complex pipelines fail and designing architectures that remain stable under real-world constraints.
Problem
- Extremely low-resolution faces in wide-angle CCTV footage
- Initial pipeline (face detection → pose estimation → super-resolution → prediction) failed at multiple stages
- Face detection failed completely; pose estimation and super-resolution didn't generalize to target domain
- No clean labeled datasets aligned with CCTV characteristics
Intervention
- Systematically evaluated each pipeline component and documented failure modes
- Removed face detection, pose estimation, and super-resolution as unreliable dependencies
- Redesigned around person-centric approach: person detection + tracking → direct prediction
- Selected ByteTrack over DeepSORT for better occlusion handling
- Implemented pseudo-labeling strategy to compensate for missing annotations
Impact
- Achieved ~82% gender classification accuracy on validation data
- Pipeline remained stable across occlusions and viewpoint changes
- Temporal aggregation via tracking improved noisy per-frame predictions
- Delivered complete end-to-end demo on real CCTV video (VIRAT dataset)
Why This Matters
Many ML projects fail not because models are weak, but because pipelines accumulate brittle dependencies. Knowing when to simplify an architecture, and having the discipline to remove components that don't work, is often more valuable than adding sophistication.
Technical Deep Dive (Optional)
This section expands on failure analysis, architectural redesign, and data strategy for readers who want technical depth.
Initial Pipeline & Systematic Failure Analysis
Proposed Pipeline
- Person detection (YOLOv5)
- Face detection
- Pose estimation
- Face super-resolution
- Feature extraction
- Age & gender prediction
Failure Modes Identified
| Component | Outcome |
|-----------|---------|
| Face detection | Failed (Haar, DLIB, HOG+SVM, CNN-based) due to insufficient facial detail |
| Pose estimation | OpenPose failed entirely on low-resolution inputs |
| Super-resolution | SRGAN/ESRGAN/SISN didn't generalize to CCTV domain |
Key Insight
Attempting to recover fine-grained facial features from degraded video introduced cascading failures. Each stage amplified errors from previous stages.
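One simple way to see why a long chain is fragile is to multiply per-stage success rates, assuming the stages fail independently. The sketch below uses made-up rates purely for illustration; they are not measurements from this project, and the stage names are only labels.

```python
from math import prod

# Hypothetical per-stage success rates, chosen only for illustration.
# These are NOT measured values from the project.
stage_success = {
    "person_detection": 0.90,
    "face_detection": 0.50,
    "pose_estimation": 0.60,
    "super_resolution": 0.70,
}


def chain_success(rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(rates)


full_chain = chain_success(stage_success.values())
person_only = chain_success([stage_success["person_detection"]])
print(f"full chain: {full_chain:.3f}, person-only: {person_only:.3f}")
```

Under these assumed numbers, the chain as a whole succeeds far less often than its weakest stage, which is the intuition behind removing stages rather than tuning them.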
Architectural Redesign: Person-Centric Approach
New Pipeline
- Video frames
- Person detection + tracking
- Age & gender prediction (person-level)
Rationale
- Removed all brittle dependencies (face detection, pose, super-resolution)
- Focused on person-level features observable at CCTV resolution
- Leveraged temporal consistency via tracking
Result
The pipeline no longer depended on stages that failed on CCTV input, which significantly improved stability and reliability.
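The temporal-consistency idea can be sketched as averaging per-frame classifier scores over each track ID. The write-up does not specify the exact aggregation rule, so the mean, the 0.5 threshold, and the function names below are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_by_track(frame_predictions):
    """Average per-frame scores for each track ID.

    frame_predictions: iterable of (track_id, probability) pairs, where
    probability is the per-frame classifier score for one class.
    Returns {track_id: mean probability}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for track_id, prob in frame_predictions:
        sums[track_id] += prob
        counts[track_id] += 1
    return {tid: sums[tid] / counts[tid] for tid in sums}


def track_labels(frame_predictions, threshold=0.5):
    """Assign one binary label per track from the averaged score."""
    means = aggregate_by_track(frame_predictions)
    return {tid: mean >= threshold for tid, mean in means.items()}


# Hypothetical per-frame scores: track 7 has one noisy frame (0.3).
preds = [(7, 0.8), (7, 0.3), (7, 0.9), (12, 0.2), (12, 0.4)]
labels = track_labels(preds)
```

Here the single low score on track 7 is outvoted by its other frames, so the track keeps one stable label instead of flickering frame to frame.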
Tracking: ByteTrack vs DeepSORT
Requirement
- Maintain consistent IDs across occlusions
- Handle low-confidence detections
- Enable temporal aggregation of predictions
Evaluation
| Tracker | Strength | Limitation |
|---------|----------|------------|
| DeepSORT | Mature, widely used | Struggled with low-confidence detections |
| ByteTrack | Better occlusion handling, ID consistency | Newer, less documented |
Decision
ByteTrack was selected for its better handling of occlusions and low-confidence detections on degraded video.
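ByteTrack's central idea is a two-stage association: match high-confidence detections to tracks first, then try the leftover tracks against low-confidence detections instead of discarding them. The sketch below is a simplified illustration, not the project's code or the ByteTrack library API. It uses greedy IoU matching in place of Hungarian assignment, skips Kalman-filter motion prediction, and the thresholds and function names are assumed.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair tracks with detections in order of descending IoU.

    tracks: {track_id: box}; detections: list of (box, score).
    Returns (matches {track_id: det_index}, unmatched track IDs,
    unmatched detection indices).
    """
    pairs = sorted(
        ((iou(tbox, det[0]), tid, di)
         for tid, tbox in tracks.items()
         for di, det in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used_d = {}, set()
    for overlap, tid, di in pairs:
        if overlap < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or di in used_d:
            continue
        matches[tid] = di
        used_d.add(di)
    unmatched_t = [tid for tid in tracks if tid not in matches]
    unmatched_d = [di for di in range(len(detections)) if di not in used_d]
    return matches, unmatched_t, unmatched_d


def associate(tracks, detections, high_thresh=0.5, low_thresh=0.1, min_iou=0.3):
    """Two-stage association: high-score detections first, then low-score ones."""
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if low_thresh <= d[1] < high_thresh]

    # Stage 1: confident detections against all existing tracks.
    first, left_tracks, new_dets = greedy_match(tracks, high, min_iou)

    # Stage 2: leftover tracks get a second chance with weak detections,
    # which is what keeps IDs alive through partial occlusion.
    remaining = {tid: tracks[tid] for tid in left_tracks}
    second, lost, _ = greedy_match(remaining, low, min_iou)

    assigned = {tid: high[di][0] for tid, di in first.items()}
    assigned.update({tid: low[di][0] for tid, di in second.items()})
    new_boxes = [high[di][0] for di in new_dets]  # candidates for new tracks
    return assigned, lost, new_boxes


# Hypothetical frame: track 2 is partly occluded, so its detection scores 0.2.
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 1, 11, 11), 0.9), ((21, 21, 31, 31), 0.2), ((50, 50, 60, 60), 0.8)]
assigned, lost, new_boxes = associate(tracks, detections)
```

In this example a tracker that dropped everything below the high threshold would lose track 2, while the second stage recovers it from the weak detection.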
Data Strategy: Handling Missing Labels
Challenge
- No labeled CCTV dataset with age/gender annotations
- Existing datasets (RAP, VIRAT, PA-100K, SAIVT-SoftBio) had coarse or missing labels
Solution: Pseudo-Labeling
- Used pretrained models to generate initial labels
- Modeled age as a normal distribution (μ = predicted age, σ = 5) rather than a single hard value
- Leveraged PA-100K and PETA datasets for knowledge transfer
- Applied augmentation to compensate for limited data
Result
Pseudo-labels made training possible despite missing and imperfect annotations.
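One way to turn the normal-distribution age model into a training target is to integrate the density over each age bin, giving a soft label instead of a one-hot vector. How the project actually combined the distribution with its age bins is not stated here, so the bin edges, the renormalization step, and the function names below are assumptions.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def soft_age_label(predicted_age, bin_edges, sigma=5.0):
    """Probability mass of N(predicted_age, sigma^2) in each age bin.

    bin_edges: ascending edges, so bin i covers [bin_edges[i], bin_edges[i+1]).
    Mass falling outside the outer edges is dropped and the rest renormalized.
    """
    masses = [
        normal_cdf(hi, predicted_age, sigma) - normal_cdf(lo, predicted_age, sigma)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ]
    total = sum(masses)
    return [m / total for m in masses]


# Hypothetical bin edges; the project's actual age ranges are not given.
edges = [0, 18, 30, 45, 60, 100]
label = soft_age_label(32, edges)
```

A pseudo-label of 32 then puts most of its weight on the 30 to 45 bin and a sizeable share on 18 to 30, which reflects the uncertainty of the pretrained model's estimate near a bin boundary.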
Model Training
Gender Classification
- Binary classifier on person-level crops
- ~82% validation accuracy
- Near-perfect training accuracy; the gap to validation accuracy was attributed to domain noise rather than a modeling failure
Age Estimation
- Modeled as age-range classification
- Converted to exact age via weighted averaging over bins
- ~66% validation accuracy, reflecting dataset ambiguity
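The conversion from age-range probabilities to a single age can be sketched as an expected value over the bins. Using bin midpoints as the representative age is an assumption, as are the bin edges and the function name.

```python
def expected_age(bin_probs, bin_edges):
    """Weighted average of bin midpoints, weighted by predicted probability."""
    if len(bin_probs) != len(bin_edges) - 1:
        raise ValueError("need exactly one probability per bin")
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    total = sum(bin_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * m for p, m in zip(bin_probs, midpoints)) / total


# Hypothetical bins and classifier output.
edges = [0, 18, 30, 45, 60, 100]
probs = [0.0, 0.25, 0.5, 0.25, 0.0]
age = expected_age(probs, edges)
```

With this rule, a prediction split between neighbouring bins yields an age between their midpoints instead of snapping to the top-scoring range.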
Real-World Validation
VIRAT Video Demo
- Complete end-to-end pipeline demonstration
- Person detection → tracking → prediction → visualization
- Validated system feasibility on real CCTV footage
Outcome
The system performed robustly under occlusion, motion blur, and viewpoint changes.
Key Engineering Learnings
Simplicity beats complexity under constraints
- Removing components improved reliability more than adding sophistication
Face-centric approaches fail at CCTV resolution
- Person-level features more robust than facial attributes
Tracking enables temporal filtering
- Aggregating predictions across frames reduced noise
Data strategy matters more than model architecture
- Pseudo-labeling and augmentation had outsized impact
Completed as part of a Bosch problem statement focused on real-world computer vision system design.