ML Infrastructure
We build and operate ML infrastructure on Kubernetes, from GPU cluster provisioning to model serving and cost optimization.
GPU & Compute
- NVIDIA GPU Operator - Automated GPU driver, container toolkit, and device plugin management on Kubernetes
- NVIDIA DCGM Exporter - GPU metrics collection for Prometheus (utilization, memory, temperature, power)
- Node Feature Discovery (NFD) - Automatic detection and labeling of GPU nodes
- Time-slicing & MIG - GPU sharing via time-slicing and Multi-Instance GPU (MIG) partitioning to improve utilization
- CUDA workload scheduling - Tolerations, node selectors, and resource limits for GPU pods
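To make the last point concrete, here is a minimal sketch of a single-GPU training pod built with the official `kubernetes` Python client. The image, namespace, node label, and taint key are illustrative assumptions matching a typical GPU Operator setup, not a prescription for any specific cluster.

```python
# Sketch of a single-GPU training pod via the `kubernetes` Python client.
# Names, image, labels, and taint key are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Land only on nodes that NFD / the GPU Operator labeled as GPU-capable.
        node_selector={"nvidia.com/gpu.present": "true"},
        # Tolerate the taint commonly applied to dedicated GPU node pools.
        tolerations=[client.V1Toleration(
            key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")],
        containers=[client.V1Container(
            name="trainer",
            image="nvcr.io/nvidia/pytorch:24.05-py3",
            command=["python", "train.py"],
            # Requesting nvidia.com/gpu is what assigns a GPU via the device plugin.
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1", "cpu": "8", "memory": "32Gi"}),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="ml-training", body=pod)
```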
Autoscaling
- Karpenter - Just-in-time node provisioning with GPU-aware instance selection
- Cluster Autoscaler - Node pool scaling based on pending pod requests
- KEDA - Event-driven pod autoscaling based on queue depth, HTTP requests, or custom metrics (see the sketch after this list)
- Horizontal Pod Autoscaler (HPA) - CPU/memory and custom metrics scaling for inference workloads
- Provisioner/NodePool configurations - Spot vs. on-demand strategies, instance type constraints, and taints
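For the KEDA item above, a minimal sketch of a queue-depth-driven ScaledObject applied through the Kubernetes CustomObjectsApi. The RabbitMQ trigger, queue name, namespace, target Deployment, and thresholds are placeholders.

```python
# Sketch of a KEDA ScaledObject scaling an inference Deployment on queue depth.
# Trigger type, names, and thresholds are placeholders.
from kubernetes import client, config

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "inference-queue-scaler", "namespace": "ml-serving"},
    "spec": {
        "scaleTargetRef": {"name": "inference-worker"},  # Deployment to scale
        "minReplicaCount": 0,   # scale to zero when the queue is empty
        "maxReplicaCount": 20,
        "triggers": [{
            "type": "rabbitmq",
            "metadata": {
                "queueName": "inference-requests",
                "mode": "QueueLength",
                "value": "10",               # ~10 pending messages per replica
                "hostFromEnv": "RABBITMQ_URL",
            },
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1", namespace="ml-serving",
    plural="scaledobjects", body=scaled_object)
```

Setting `minReplicaCount` to 0 lets the worker scale to zero between bursts, which feeds directly into the cost practices below.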
Cost Optimization
- Spot/Preemptible instances - GPU workloads on spot with checkpointing and graceful termination (sketched below)
- Kubecost / OpenCost - Per-namespace and per-workload cost attribution
- Right-sizing - Resource request tuning based on actual utilization metrics
- Idle GPU detection - Alerts for underutilized GPU nodes
- Reserved capacity planning - Savings plans and committed use discounts for baseline workloads
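To make the spot-instance item concrete: Kubernetes delivers SIGTERM before evicting a pod, leaving terminationGracePeriodSeconds to checkpoint and exit cleanly. The sketch below uses stand-in `train_one_step` / `save_checkpoint` helpers; the checkpoint path and cadence are assumptions.

```python
# Sketch of graceful termination for a spot-hosted training job: the kubelet
# sends SIGTERM ahead of eviction, so checkpoint and exit within the grace period.
import signal
import sys
import time

stop_requested = False

def handle_sigterm(signum, frame):
    global stop_requested
    stop_requested = True  # finish the current step, then checkpoint and exit

signal.signal(signal.SIGTERM, handle_sigterm)

def train_one_step(step):          # stand-in for the actual training step
    time.sleep(0.1)

def save_checkpoint(path, step):   # stand-in for the actual checkpoint writer
    with open(path, "w") as f:
        f.write(str(step))

for step in range(10_000):
    train_one_step(step)
    # Checkpoint periodically, and immediately when termination is requested,
    # so eviction loses at most one step of work.
    if stop_requested or step % 500 == 0:
        save_checkpoint("/tmp/latest.ckpt", step)
        if stop_requested:
            sys.exit(0)
```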
ML Workflow & Orchestration
- Argo Workflows - DAG-based ML training pipelines with artifact passing
- Kubeflow Pipelines - End-to-end ML workflow orchestration
- Ray - Distributed training and hyperparameter tuning
- MLflow - Experiment tracking, model registry, and versioning
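A small sketch of what MLflow experiment tracking looks like from a training job. The experiment name, parameters, metrics, and artifact are placeholders; the tracking URI is an assumed in-cluster service (omit it and MLflow logs locally to ./mlruns).

```python
# Sketch of MLflow experiment tracking; names and values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.ml-platform.svc:5000")  # assumed endpoint
mlflow.set_experiment("resnet50-finetune")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 256, "epochs": 10})
    for epoch in range(10):
        val_loss = 1.0 / (epoch + 1)                     # placeholder metric
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    with open("model.pt", "wb") as f:                    # placeholder artifact
        f.write(b"weights")
    mlflow.log_artifact("model.pt")
```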
Model Serving
- Triton Inference Server - Multi-framework model serving with dynamic batching
- vLLM - High-throughput LLM inference with PagedAttention (see the example after this list)
- Ollama - Local LLM deployment for on-prem and edge
- KServe - Serverless inference with autoscaling to zero
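For instance, vLLM's offline Python API can run batched generation in a few lines. The model name and sampling settings below are examples, and a CUDA GPU with enough memory for the weights is assumed.

```python
# Minimal vLLM offline-inference sketch; model and sampling values are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(
    ["Summarize the benefits of PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```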
Observability
- Prometheus + Grafana - GPU metrics dashboards and alerting
- DCGM metrics - GPU utilization, memory, SM occupancy, NVLink throughput
- Custom metrics - Inference latency, throughput, queue depth
- Logging - Centralized log aggregation for training jobs and inference services
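A minimal sketch of exposing the custom inference metrics mentioned above with `prometheus_client`. Metric names, the port, and the simulated workload are placeholders.

```python
# Sketch of custom inference metrics (latency, throughput, queue depth)
# exported via prometheus_client; names, port, and workload are placeholders.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests served")
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency")
QUEUE_DEPTH = Gauge("inference_queue_depth", "Requests waiting to be served")

start_http_server(9100)  # Prometheus scrapes /metrics on this port

while True:
    QUEUE_DEPTH.set(random.randint(0, 50))     # stand-in for a real queue probe
    with LATENCY.time():                       # records wall-clock latency
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for the model call
    REQUESTS.inc()
```

A ServiceMonitor/PodMonitor (or scrape annotations) then points Prometheus at port 9100, and throughput falls out as `rate(inference_requests_total[5m])` in Grafana.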
Cloud Providers
We work across AWS (EKS), GCP (GKE), and Azure (AKS) with equivalent tooling for each platform.