TL;DR
- Context: Building gesture recognition for human-computer interaction under real-time latency constraints
- Problem: Unclear which temporal architecture could handle dynamic gestures at production speed
- Intervention: Evaluated multiple architectures, chose TCN for latency, integrated MediaPipe for real-world deployment
- Impact: Achieved 90% accuracy with real-time capable inference, validated end-to-end RGB→prediction pipeline
Intro
This project focused on designing a production-viable hand gesture recognition system for touchless interfaces. The core constraint was real-time performance: the system needed to classify dynamic gestures from hand skeleton data with minimal latency while remaining robust to viewpoint, user, and environmental variations. The challenge was determining which modeling approach could meet accuracy and speed requirements simultaneously.
Problem
- Temporal modeling required for dynamic gestures, but many architectures were too slow for real-time use
- Available gesture datasets were small and lacked viewpoint diversity, risking overfitting
- Benchmark results alone didn't guarantee real-world feasibility with RGB camera input
- Deployment target (NVIDIA TAO ecosystem) imposed additional architecture constraints
Intervention
- Evaluated DeepGRU and Temporal Convolutional Networks (TCN) on standardized dynamic gesture dataset
- Selected TCN based on inference latency, training stability, and deployment compatibility
- Designed skeleton-space augmentations (rotation, scaling, temporal jittering) to improve generalization
- Built end-to-end pipeline: RGB video → MediaPipe hand tracking → TCN classification
- Collected custom real-world gesture dataset using MediaPipe to validate production feasibility
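The landmark-to-classifier handoff in a pipeline like this can be sketched as follows. This is a minimal sketch, assuming MediaPipe's 21-joint hand layout with the wrist at index 0; the fixed `target_len` and the wrist-to-middle-fingertip scaling heuristic are illustrative choices, not the project's exact preprocessing:

```python
import numpy as np

def normalize_sequence(landmarks, target_len=64):
    """Convert a variable-length sequence of hand landmarks into a
    fixed-length, wrist-centered tensor suitable for a TCN classifier.

    landmarks: array of shape (T, 21, 3) -- 21 joints per frame, (x, y, z),
    in the layout MediaPipe Hands produces (joint 0 is the wrist).
    Returns an array of shape (target_len, 63).
    """
    seq = np.asarray(landmarks, dtype=np.float32)
    # Center each frame on the wrist so features are translation-invariant.
    seq = seq - seq[:, :1, :]
    # Scale by the mean wrist-to-middle-fingertip distance (joint 12)
    # to reduce sensitivity to hand size and camera distance.
    scale = np.linalg.norm(seq[:, 12, :], axis=-1).mean()
    seq = seq / max(scale, 1e-6)
    # Linearly resample along time to a fixed number of frames.
    src = np.linspace(0, len(seq) - 1, target_len)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(seq) - 1)
    frac = (src - lo)[:, None, None]
    resampled = (1 - frac) * seq[lo] + frac * seq[hi]
    # Flatten joints per frame: 21 joints x 3 coords = 63 features.
    return resampled.reshape(target_len, 63)
```

Centering and scaling before resampling means every clip reaching the classifier lives in the same coordinate frame, regardless of where the hand sat in the image.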
Impact
- Achieved ~90% Top-1 accuracy on validation data with real-time inference capability
- TCN showed more stable training and 30-40% lower inference latency than the RNN-based alternative (DeepGRU)
- End-to-end MediaPipe integration validated system worked beyond benchmark datasets
- Delivered deployment-ready architecture compatible with NVIDIA TAO for edge deployment
Why This Matters
Many ML projects optimize for benchmark accuracy without validating real-world constraints. Evaluating architectures against latency, deployment compatibility, and data availability early prevents expensive rewrites when moving from research to production.
Technical Deep Dive (Optional)
This section expands on architecture selection, data strategy, and real-world validation for readers who want technical depth.
Architecture Evaluation
Models Trained
- DeepGRU (recurrent temporal modeling)
- Temporal Convolutional Network (TCN) (1D convolutions over time)
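The TCN's core operation, a causal dilated 1D convolution over the joint time series, can be sketched in plain NumPy. This is an illustrative reference implementation of the building block, not the project's model code:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """Causal dilated 1D convolution: output at time t depends only on
    inputs at t, t-d, t-2d, ... (no future leakage).

    x: (T, C_in) time series, w: (K, C_in, C_out) kernel.
    Returns (T, C_out).
    """
    T, c_in = x.shape
    k, _, c_out = w.shape
    # Left-pad with zeros so output length matches input and stays causal.
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, c_in), dtype=x.dtype), x], axis=0)
    out = np.zeros((T, c_out), dtype=x.dtype)
    for t in range(T):
        for i in range(k):
            out[t] += xp[t + i * dilation] @ w[i]
    return out
```

Stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, which is how a TCN covers a whole gesture with only a few convolutional layers and no recurrence.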
Dataset
- Dynamic Hand Gesture (DHG) dataset
- Hand joint sequences as time-series inputs
Evaluation Criteria
- Validation accuracy
- Inference latency
- Training stability
- Deployment compatibility
Results
- Both achieved ~90% Top-1 accuracy
- TCN showed:
- More stable convergence
- 30-40% lower inference latency
- Better compatibility with NVIDIA TAO export
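Latency comparisons like the one above can be reproduced with a simple timing harness; the callable below is a hypothetical stand-in for a model, not one of the evaluated networks:

```python
import time
import numpy as np

def mean_latency_ms(fn, x, warmup=10, iters=200):
    """Average single-input inference latency of `fn` in milliseconds."""
    for _ in range(warmup):  # discard cold-start iterations
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters * 1000.0

# Stand-in "model": any callable mapping an input window to class scores.
W = np.random.default_rng(0).random((63, 14))
tcn_like = lambda window: np.tanh(window @ W).mean(axis=0)
x = np.random.default_rng(1).random((64, 63))
print(f"stand-in latency: {mean_latency_ms(tcn_like, x):.3f} ms")
```

Running both candidate models through the same harness on the same hardware is what makes a relative claim like "30-40% lower latency" meaningful.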
Decision: TCN selected as the production architecture due to its speed and deployment readiness.
Data Strategy: Augmentation Over Scale
Challenge: Limited training data with narrow viewpoint coverage risked overfitting.
Solution: Skeleton-Space Augmentations
- Rotation and scaling in 3D space
- Temporal jittering (speed variations)
- Viewpoint perturbations
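A minimal NumPy sketch of these three augmentations follows; the rotation axis, ranges, and jitter factor are illustrative values, not the project's tuned settings:

```python
import numpy as np

def augment(seq, rng):
    """Apply skeleton-space augmentations to a (T, 21, 3) landmark clip:
    viewpoint rotation, uniform scaling, and temporal jittering."""
    seq = np.asarray(seq, dtype=np.float32)
    # 1. Random rotation about the vertical (y) axis -- a simple
    #    viewpoint perturbation; full 3D rotations work the same way.
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]], dtype=np.float32)
    seq = seq @ rot.T
    # 2. Random uniform scaling (varying hand size / camera distance).
    seq = seq * rng.uniform(0.8, 1.2)
    # 3. Temporal jittering: resample the clip at a random speed,
    #    simulating faster or slower gesture execution.
    T = len(seq)
    t_new = np.clip(np.arange(T) * rng.uniform(0.8, 1.2), 0, T - 1)
    lo = np.floor(t_new).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (t_new - lo)[:, None, None]
    return (1 - frac) * seq[lo] + frac * seq[hi]
```

Because every transform operates in skeleton space rather than pixel space, the same code applies unchanged to any dataset that provides 3D joint sequences.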
Design Principles
- Domain-agnostic (no dataset-specific hacks)
- Camera-position invariant
- Robust to hand size and motion speed variations
Impact
- Reduced overfitting on training distribution
- Improved generalization to unseen users and viewpoints
Real-World Validation: MediaPipe Integration
Problem: Benchmark datasets didn't reflect the production input modality (RGB video).
Approach
- Integrated MediaPipe hand tracking for RGB→skeleton conversion
- Collected custom dynamic gesture dataset from real-world usage
- Applied same augmentation and normalization pipeline
- Trained TCN classifier on custom data
Result
- Validated full pipeline: RGB input → real-time classification
- Exposed data distribution differences invisible in benchmarks
- Confirmed system feasibility for deployment scenarios
Static Gesture Baseline
Parallel Work
- Constructed TAO-compatible dataset for static hand gestures
- Trained GestureNet-style models
- Achieved ~80% accuracy
Purpose
- Validated dataset construction and preprocessing pipeline
- Established baseline for temporal model comparison
System Characteristics
| Dimension | Outcome |
|-----------|---------|
| Latency | Real-time capable (TCN inference) |
| Accuracy | ~90% Top-1 (dynamic gestures) |
| Robustness | Viewpoint & user invariant |
| Deployment | NVIDIA TAO-compatible, MediaPipe input |
| Extensibility | New gestures via incremental training |
Key Engineering Learnings
Data strategy > model complexity
- Skeleton normalization and augmentation had outsized impact
- Custom data collection was essential for real-world validation
Architecture selection requires multi-criteria evaluation
- Accuracy alone insufficient; latency and deployment matter equally
- TCNs are strong default for real-time temporal classification
Benchmark datasets are necessary but insufficient
- MediaPipe-based validation exposed distribution gaps
- Production feasibility requires end-to-end testing
Completed as an independent research and engineering project focused on production ML system design.