TL;DR
- Context: AI intern at edge ML company deploying CNNs on resource-constrained IoT devices
- Problem: Manual quantization-aware training was bottlenecking deployment and limiting product iteration
- Intervention: Built automated QAT pipeline with observability layer for distributed training on AWS
- Impact: Reduced per-model deployment time from days to hours, enabled self-service ML workflows
Intro
This work was done during my internship at EdgeNeural, a company focused on deploying deep learning models on edge and IoT hardware. The core challenge was enabling real-time inference on resource-constrained devices without sacrificing model accuracy. Quantization-aware training was critical, but treating it as a manual, per-model process was preventing the team from scaling to multiple customer deployments simultaneously.
Problem
- Manual QAT required deep model-specific expertise, preventing non-ML team members from deploying models
- No standardized training workflow meant duplicated effort across architectures (ResNet, MobileNet, EfficientNet)
- Training ran in opaque Docker containers on EC2 with zero visibility into progress or failures
- Lack of observability meant ops and product teams couldn't coordinate effectively
Intervention
- Designed architecture-agnostic QAT pipeline using PyTorch and MMClassification framework
- Standardized training workflow across all supported CNN architectures (ResNet, EfficientNet, MobileNet, ShuffleNet)
- Implemented custom training hooks to expose metrics, dataset stats, and artifacts via REST APIs
- Built observability layer connecting distributed training to frontend dashboard for real-time visibility
- Integrated artifact storage with S3 for reliable model persistence and deployment pipelines
Impact
- Reduced marginal cost of adding QAT support for new architectures from days to hours
- Enabled product and ops teams to track training progress without ML expertise
- Established repeatable, production-grade workflow instead of ad-hoc experimentation
- Created foundation reusable across multiple customer use cases and datasets
Why This Matters
Early-stage ML companies often treat training as "research work" rather than product infrastructure. Building observable, reproducible training systems early prevents the transition from prototype to production from becoming a complete rewrite.
Technical Deep Dive (Optional)
This section expands on the pipeline architecture, framework choices, and observability implementation for readers who want technical depth.
Architecture Decision: MMClassification as Foundation
Why MMClassification
- Provided standardized config-driven training for multiple architectures
- Reduced duplicated training logic across model families
- Enabled faster experimentation through consistent interfaces
Integration with QAT
- Wrapped PyTorch's quantization APIs into MMClassification training loops
- Ensured quantized models remained exportable to edge runtimes
- Maintained accuracy parity with full-precision baselines
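The wrapping described above can be sketched with PyTorch's eager-mode QAT APIs. This is a minimal, self-contained illustration, not EdgeNeural's actual integration: the `TinyConvNet` model is a toy stand-in for the supported backbones, and in the real pipeline this logic lived inside MMClassification training loops.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Toy stand-in for a supported CNN backbone (MobileNet, ResNet, etc.)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Quant/DeQuant stubs mark the float <-> int8 boundary for eager-mode QAT.
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x)))
        x = self.fc(torch.flatten(x, 1))
        return self.dequant(x)

model = TinyConvNet().train()
# "fbgemm" targets x86; "qnnpack" would target ARM edge devices.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
# Insert fake-quantization observers so training sees quantization error.
torch.quantization.prepare_qat(model, inplace=True)

# Stand-in for the real training loop: one step so observers record ranges.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 3, 32, 32)).sum()
loss.backward()
opt.step()

# Convert fake-quantized modules to true int8 kernels for edge export.
model.eval()
int8_model = torch.quantization.convert(model)
```

Because the fake-quant observers ride along inside the normal training loop, the same standardized workflow applies to every architecture; only the backbone definition changes.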
Observability Layer Design
Problem
Training jobs ran inside Docker containers on remote EC2 instances with no external visibility.
Solution
- Implemented custom PyTorch hooks/callbacks to capture:
  - Training metrics (loss, accuracy, learning rate)
  - Dataset statistics (class distribution, sample counts)
  - Training status (epoch progress, ETA, failures)
  - Artifact metadata (checkpoint locations, model configs)
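A framework-agnostic sketch of such a hook is below. In the actual pipeline this logic would live in a subclass of `mmcv.runner.Hook` registered with MMClassification; the class name, method names, and payload fields here are illustrative.

```python
import time

class TrainingStatusHook:
    """Collects per-iteration metrics, progress, and failure state so that
    non-ML teammates can read training status from a dashboard.
    (Illustrative sketch; not the production hook.)"""

    def __init__(self, total_iters):
        self.total_iters = total_iters
        self.start = time.monotonic()
        self.state = {"status": "running"}

    def after_train_iter(self, iteration, metrics):
        # Capture metrics plus progress/ETA on every iteration.
        done = iteration + 1
        elapsed = time.monotonic() - self.start
        eta = elapsed / done * (self.total_iters - done)
        self.state.update({
            "iter": done,
            "total_iters": self.total_iters,
            "progress": done / self.total_iters,
            "eta_seconds": round(eta, 1),
            **metrics,  # e.g. {"loss": ..., "accuracy": ..., "lr": ...}
        })

    def on_failure(self, exc):
        # Surface crashes instead of letting the container die silently.
        self.state.update({"status": "failed", "error": str(exc)})

    def snapshot(self):
        """Payload the REST layer serves to the dashboard."""
        return dict(self.state)
```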
Integration
- Exposed metrics via REST APIs consumed by frontend dashboard
- Enabled real-time monitoring without SSH access or container inspection
- Provided coordination interface for cross-functional teams
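The serving side can be sketched with nothing but the standard library. This is a minimal assumption-laden stand-in for the real service: the `/metrics` route and the `LATEST_METRICS` shape are illustrative, and the production API was consumed by an existing frontend dashboard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared state the training hooks would update; fields are illustrative.
LATEST_METRICS = {"status": "idle", "epoch": 0, "loss": None}

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the latest training snapshot so the dashboard can poll it
    without SSH access or container inspection."""

    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(LATEST_METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep training logs clean

def serve(host="0.0.0.0", port=8000):
    """Run the metrics endpoint inside the training container."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```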
Standardized Training Workflow
Before
- Each architecture required custom training scripts
- Configuration inconsistencies across experiments
- No version control for training parameters
After
- Config-driven training using MMClassification
- Version-controlled experiment definitions
- Automated checkpoint management and artifact tracking
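A version-controlled experiment definition in this workflow looks roughly like the MMClassification config below. The keys follow mmcls conventions, but the concrete values, dataset size, and the custom hook name/endpoint are assumptions for illustration.

```python
# Illustrative MMClassification-style experiment config (a Python file
# checked into version control; values here are hypothetical).
model = dict(
    type="ImageClassifier",
    backbone=dict(type="MobileNetV2", widen_factor=1.0),
    neck=dict(type="GlobalAveragePooling"),
    head=dict(
        type="LinearClsHead",
        num_classes=5,
        in_channels=1280,
        loss=dict(type="CrossEntropyLoss"),
    ),
)
optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=1e-4)
lr_config = dict(policy="step", step=[30, 60])
runner = dict(type="EpochBasedRunner", max_epochs=90)
# Custom hook streaming metrics to the dashboard (name/endpoint hypothetical).
custom_hooks = [
    dict(type="DashboardReporterHook",
         endpoint="http://dashboard.internal/api/metrics"),
]
```

Because every architecture shares this schema, supporting a new backbone becomes a config change plus the quantization wrapper, rather than a new training script.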
Measured Impact
- ~70% reduction in setup time for new models
- Clear audit trail for all training runs
Deployment Integration
Artifact Pipeline
- Training completes → checkpoint saved locally
- Hook uploads to S3 with metadata
- Deployment system polls S3 for new artifacts
- Automated testing on target edge hardware
Result
Seamless handoff from training to deployment without manual intervention.
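The upload step (checkpoint plus metadata to S3) can be sketched as below. The bucket/key layout, sidecar-metadata fields, and helper names are assumptions, not the production code; the real system additionally triggered testing on target edge hardware.

```python
import hashlib
import json
import os
import time

def build_artifact_record(checkpoint_path, run_id):
    """Compute the S3 key and sidecar metadata for a finished checkpoint.
    Key layout and metadata fields are illustrative."""
    with open(checkpoint_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    key = f"models/{run_id}/{os.path.basename(checkpoint_path)}"
    meta = {
        "run_id": run_id,
        "sha256": digest,           # lets the poller verify integrity
        "size_bytes": os.path.getsize(checkpoint_path),
        "uploaded_at": time.time(),
    }
    return key, meta

def upload_checkpoint(checkpoint_path, bucket, run_id):
    """Push checkpoint + metadata to S3; the deployment system polls the bucket."""
    import boto3  # deferred import; requires AWS credentials at runtime
    key, meta = build_artifact_record(checkpoint_path, run_id)
    s3 = boto3.client("s3")
    s3.upload_file(checkpoint_path, bucket, key)
    s3.put_object(Bucket=bucket, Key=key + ".json",
                  Body=json.dumps(meta).encode())
    return key
```

Writing the metadata as a sidecar object means the polling deployment system can detect a new artifact and validate it without downloading the full checkpoint first.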
System Architecture Summary
Training Container (EC2)
  ↓
PyTorch + MMClassification + QAT
  ↓
Custom Hooks (metrics, artifacts)
  ↓
REST API Layer
  ↓
Frontend Dashboard + S3 Storage
  ↓
Deployment Pipeline
Key Trade-offs
Chose MMClassification over custom training loops
- Pro: Faster iteration, better reproducibility
- Con: Less flexibility for exotic architectures
- Decision: Standardization mattered more than edge cases
Chose REST APIs over message queues for observability
- Pro: Simpler integration with existing frontend
- Con: No built-in retry or buffering
- Decision: Training jobs were long-running enough that transient failures were acceptable
Completed during a 6-month internship at EdgeNeural focused on production ML systems for edge deployments.