TL;DR
- Context: AI intern at edge ML company deploying CNNs on resource-constrained IoT devices
- Problem: Manual quantization-aware training was bottlenecking deployment and limiting product iteration
- Intervention: Built automated QAT pipeline with observability layer for distributed training on AWS
- Impact: Reduced per-model deployment time from days to hours, enabled self-service ML workflows
Intro
This work was done during my internship at EdgeNeural, a company focused on deploying deep learning models on edge and IoT hardware. The core challenge was enabling real-time inference on resource-constrained devices without sacrificing model accuracy. Quantization-aware training was critical, but treating it as a manual, per-model process was preventing the team from scaling to multiple customer deployments simultaneously.
Problem
- Manual QAT required deep model-specific expertise, preventing non-ML team members from deploying models
- No standardized training workflow meant duplicated effort across architectures (ResNet, MobileNet, EfficientNet)
- Training ran in opaque Docker containers on EC2 with zero visibility into progress or failures
- Lack of observability meant ops and product teams couldn't coordinate effectively
Intervention
- Designed architecture-agnostic QAT pipeline using PyTorch and MMClassification framework
- Standardized training workflow across all supported CNN architectures (ResNet, EfficientNet, MobileNet, ShuffleNet)
- Implemented custom training hooks to expose metrics, dataset stats, and artifacts via REST APIs
- Built observability layer connecting distributed training to frontend dashboard for real-time visibility
- Integrated artifact storage with S3 for reliable model persistence and deployment pipelines
Impact
- Reduced marginal cost of adding QAT support for new architectures from days to hours
- Enabled product and ops teams to track training progress without ML expertise
- Established repeatable, production-grade workflow instead of ad-hoc experimentation
- Created foundation reusable across multiple customer use cases and datasets
Why This Matters
Early-stage ML companies often treat training as "research work" rather than product infrastructure. Building observable, reproducible training systems early prevents the transition from prototype to production from becoming a complete rewrite.
Technical Deep Dive (Optional)
This section expands on the pipeline architecture, framework choices, and observability implementation for readers who want technical depth.
Architecture Decision: MMClassification as Foundation
Why MMClassification
- Provided standardized config-driven training for multiple architectures
- Reduced duplicated training logic across model families
- Enabled faster experimentation through consistent interfaces
Integration with QAT
- Wrapped PyTorch's quantization APIs into MMClassification training loops
- Ensured quantized models remained exportable to edge runtimes
- Maintained accuracy parity with full-precision baselines
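The wrapping described above can be sketched with PyTorch's eager-mode QAT APIs. This is a minimal, self-contained illustration, not EdgeNeural's actual integration: the `TinyConvNet` model is a toy stand-in for the supported backbones, and in the real pipeline this logic lived inside MMClassification training loops.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Toy stand-in for a supported CNN backbone (MobileNet, ResNet, etc.)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Quant/DeQuant stubs mark the float <-> int8 boundary for eager-mode QAT.
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x)))
        x = self.fc(torch.flatten(x, 1))
        return self.dequant(x)

model = TinyConvNet().train()
# "fbgemm" targets x86; "qnnpack" would target ARM edge devices.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
# Insert fake-quantization observers so training sees quantization error.
torch.quantization.prepare_qat(model, inplace=True)

# Stand-in for the real training loop: one step so observers record ranges.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 3, 32, 32)).sum()
loss.backward()
opt.step()

# Convert fake-quantized modules to true int8 kernels for edge export.
model.eval()
int8_model = torch.quantization.convert(model)
```

Because the fake-quant observers ride along inside the normal training loop, the same standardized workflow applies to every architecture; only the backbone definition changes.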
Observability Layer Design
Problem
Training jobs ran inside Docker containers on remote EC2 instances with no external visibility.
Solution
- Implemented custom PyTorch hooks/callbacks to capture:
  - Training metrics (loss, accuracy, learning rate)
  - Dataset statistics (class distribution, sample counts)
  - Training status (epoch progress, ETA, failures)
  - Artifact metadata (checkpoint locations, model configs)
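A framework-agnostic sketch of such a hook is below. In the actual pipeline this logic would live in a subclass of `mmcv.runner.Hook` registered with MMClassification; the class name, method names, and payload fields here are illustrative.

```python
import time

class TrainingStatusHook:
    """Collects per-iteration metrics, progress, and failure state so that
    non-ML teammates can read training status from a dashboard.
    (Illustrative sketch; not the production hook.)"""

    def __init__(self, total_iters):
        self.total_iters = total_iters
        self.start = time.monotonic()
        self.state = {"status": "running"}

    def after_train_iter(self, iteration, metrics):
        # Capture metrics plus progress/ETA on every iteration.
        done = iteration + 1
        elapsed = time.monotonic() - self.start
        eta = elapsed / done * (self.total_iters - done)
        self.state.update({
            "iter": done,
            "total_iters": self.total_iters,
            "progress": done / self.total_iters,
            "eta_seconds": round(eta, 1),
            **metrics,  # e.g. {"loss": ..., "accuracy": ..., "lr": ...}
        })

    def on_failure(self, exc):
        # Surface crashes instead of letting the container die silently.
        self.state.update({"status": "failed", "error": str(exc)})

    def snapshot(self):
        """Payload the REST layer serves to the dashboard."""
        return dict(self.state)
```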
Integration
- Exposed metrics via REST APIs consumed by frontend dashboard
- Enabled real-time monitoring without SSH access or container inspection
- Provided coordination interface for cross-functional teams
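The serving side can be sketched with nothing but the standard library. This is a minimal assumption-laden stand-in for the real service: the `/metrics` route and the `LATEST_METRICS` shape are illustrative, and the production API was consumed by an existing frontend dashboard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared state the training hooks would update; fields are illustrative.
LATEST_METRICS = {"status": "idle", "epoch": 0, "loss": None}

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the latest training snapshot so the dashboard can poll it
    without SSH access or container inspection."""

    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(LATEST_METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep training logs clean

def serve(host="0.0.0.0", port=8000):
    """Run the metrics endpoint inside the training container."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```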
Standardized Training Workflow
Before
- Each architecture required custom training scripts
- Configuration inconsistencies across experiments
- No version control for training parameters
After
- Config-driven training using MMClassification
- Version-controlled experiment definitions
- Automated checkpoint management and artifact tracking
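A version-controlled experiment definition in this workflow looks roughly like the MMClassification config below. The keys follow mmcls conventions, but the concrete values, dataset size, and the custom hook name/endpoint are assumptions for illustration.

```python
# Illustrative MMClassification-style experiment config (a Python file
# checked into version control; values here are hypothetical).
model = dict(
    type="ImageClassifier",
    backbone=dict(type="MobileNetV2", widen_factor=1.0),
    neck=dict(type="GlobalAveragePooling"),
    head=dict(
        type="LinearClsHead",
        num_classes=5,
        in_channels=1280,
        loss=dict(type="CrossEntropyLoss"),
    ),
)
optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=1e-4)
lr_config = dict(policy="step", step=[30, 60])
runner = dict(type="EpochBasedRunner", max_epochs=90)
# Custom hook streaming metrics to the dashboard (name/endpoint hypothetical).
custom_hooks = [
    dict(type="DashboardReporterHook",
         endpoint="http://dashboard.internal/api/metrics"),
]
```

Because every architecture shares this schema, supporting a new backbone becomes a config change plus the quantization wrapper, rather than a new training script.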
Measured Impact
- ~70% reduction in setup time for new models
- Clear audit trail for all training runs
Deployment Integration
Artifact Pipeline
- Training completes → checkpoint saved locally
- Hook uploads to S3 with metadata
- Deployment system polls S3 for new artifacts
- Automated testing on target edge hardware
Result
Seamless handoff from training to deployment without manual intervention.
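The upload step (checkpoint plus metadata to S3) can be sketched as below. The bucket/key layout, sidecar-metadata fields, and helper names are assumptions, not the production code; the real system additionally triggered testing on target edge hardware.

```python
import hashlib
import json
import os
import time

def build_artifact_record(checkpoint_path, run_id):
    """Compute the S3 key and sidecar metadata for a finished checkpoint.
    Key layout and metadata fields are illustrative."""
    with open(checkpoint_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    key = f"models/{run_id}/{os.path.basename(checkpoint_path)}"
    meta = {
        "run_id": run_id,
        "sha256": digest,           # lets the poller verify integrity
        "size_bytes": os.path.getsize(checkpoint_path),
        "uploaded_at": time.time(),
    }
    return key, meta

def upload_checkpoint(checkpoint_path, bucket, run_id):
    """Push checkpoint + metadata to S3; the deployment system polls the bucket."""
    import boto3  # deferred import; requires AWS credentials at runtime
    key, meta = build_artifact_record(checkpoint_path, run_id)
    s3 = boto3.client("s3")
    s3.upload_file(checkpoint_path, bucket, key)
    s3.put_object(Bucket=bucket, Key=key + ".json",
                  Body=json.dumps(meta).encode())
    return key
```

Writing the metadata as a sidecar object means the polling deployment system can detect a new artifact and validate it without downloading the full checkpoint first.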
System Architecture Summary
Training Container (EC2)
  ↓
PyTorch + MMClassification + QAT
  ↓
Custom Hooks (metrics, artifacts)
  ↓
REST API Layer
  ↓
Frontend Dashboard + S3 Storage
  ↓
Deployment Pipeline
Key Trade-offs
Chose MMClassification over custom training loops
- Pro: Faster iteration, better reproducibility
- Con: Less flexibility for exotic architectures
- Decision: Standardization mattered more than edge cases
Chose REST APIs over message queues for observability
- Pro: Simpler integration with existing frontend
- Con: No built-in retry or buffering
- Decision: Training jobs were long-running enough that transient failures were acceptable
Completed during a 6-month internship at EdgeNeural focused on production ML systems for edge deployments.