Status: Design Proposal (Issue #50)
Related: pulse/budget package, pulse/async worker pool
Running QNTX on shared infrastructure (a beefy server with a GPU) alongside other processes requires it to coordinate resource usage with its neighbors:
Track actual resource utilization across ALL processes, not just QNTX:
Resources to monitor: GPU utilization, GPU memory (VRAM), overall system CPU, and system memory.
Implementation sketch:
```go
type SystemResourceMonitor struct {
    GPUUtilizationPercent float64 // Current GPU load (0-100%)
    GPUMemoryUsedMB       int     // Total VRAM in use by all processes
    SystemCPUPercent      float64 // Overall CPU usage
    SystemMemoryUsedMB    int     // Total RAM in use
    QNTXProcessShare      float64 // QNTX's estimated share (0-1)
}

func (m *SystemResourceMonitor) GetCurrentLoad() (*SystemResourceMonitor, error) {
    // Call nvidia-smi, parse output
    // Read /proc/stat for CPU
    // Read /proc/meminfo for memory
}
```
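As a concrete starting point, here is a minimal sketch of the GPU half of GetCurrentLoad; the /proc/stat and /proc/meminfo reads are analogous. It assumes nvidia-smi is on PATH and samples only the first GPU, and queryGPULoad is a hypothetical helper name, not existing QNTX code.

```go
import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// queryGPULoad is a hypothetical helper: it shells out to nvidia-smi and
// parses one CSV row like "37, 10240" (utilization %, memory.used in MiB).
func queryGPULoad() (utilPercent float64, memUsedMB int, err error) {
    out, err := exec.Command("nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits").Output()
    if err != nil {
        return 0, 0, fmt.Errorf("nvidia-smi: %w", err)
    }
    // Only the first GPU is considered in this sketch.
    line := strings.TrimSpace(strings.SplitN(string(out), "\n", 2)[0])
    fields := strings.Split(line, ",")
    if len(fields) != 2 {
        return 0, 0, fmt.Errorf("unexpected nvidia-smi output: %q", line)
    }
    utilPercent, err = strconv.ParseFloat(strings.TrimSpace(fields[0]), 64)
    if err != nil {
        return 0, 0, err
    }
    memUsedMB, err = strconv.Atoi(strings.TrimSpace(fields[1]))
    return utilPercent, memUsedMB, err
}
```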
Adapt QNTX behavior to current system contention:
```go
func (bt *Tracker) GetAdaptiveQuota() float64 {
    baseQuota := bt.config.DailyGPUMinutes // e.g., 30 GPU-minutes/day

    sysLoad, err := bt.sysMonitor.GetCurrentLoad()
    if err != nil {
        return baseQuota // Monitoring unavailable: fall back to the configured quota
    }

    // Apply backpressure based on system contention
    if sysLoad.GPUUtilizationPercent > 70 {
        return baseQuota * 0.1 // Throttle to 10% when the system is busy
    } else if sysLoad.GPUUtilizationPercent > 30 {
        return baseQuota * 0.5 // Reduce to 50% during medium load
    }
    return baseQuota // Full quota when the system is idle
}
```
Pause or slow down job processing when other processes need resources:
Integration point: pulse/async/worker.go processJobs() loop
```go
func (w *Worker) processJobs(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
        }

        // Check system load before dequeuing
        if systemLoad, err := w.sysMonitor.GetCurrentLoad(); err == nil &&
            systemLoad.GPUUtilizationPercent > 80 {
            w.logger.Info("System under load, deferring job processing",
                "gpu_util", systemLoad.GPUUtilizationPercent)
            time.Sleep(30 * time.Second)
            continue
        }

        // ... proceed with job dequeue
    }
}
```
Use OS-level mechanisms to coordinate:
Respect resource limits set by K8s/Docker:
Read the container's limits from the cgroup filesystem, e.g. /sys/fs/cgroup/memory/memory.limit_in_bytes (cgroup v1).

Example: If K8s sets limits.memory = 8Gi and limits.nvidia.com/gpu = 1, QNTX should never exceed those limits even if its internal config says otherwise.
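A minimal sketch of reading that limit, assuming cgroup v1 is mounted at the path above (with a fallback to cgroup v2's memory.max); readCgroupMemoryLimitMB is a hypothetical helper.

```go
import (
    "os"
    "strconv"
    "strings"
)

// readCgroupMemoryLimitMB is a hypothetical helper: it reads the container's
// memory limit from cgroup v1, falling back to cgroup v2. Returns 0 if no
// limit is found.
func readCgroupMemoryLimitMB() int {
    for _, path := range []string{
        "/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
        "/sys/fs/cgroup/memory.max",                   // cgroup v2
    } {
        raw, err := os.ReadFile(path)
        if err != nil {
            continue
        }
        val := strings.TrimSpace(string(raw))
        if val == "max" { // cgroup v2: no limit configured
            return 0
        }
        limitBytes, err := strconv.ParseInt(val, 10, 64)
        if err != nil {
            continue
        }
        return int(limitBytes / (1024 * 1024))
    }
    return 0
}
```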
Prioritize critical jobs when resources are scarce:
Implementation: add a priority field to the Job struct and check system load plus priority before dequeuing, as sketched below.
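A minimal sketch, assuming the pulse/async Job struct can grow a priority field; the names and thresholds here are illustrative, not the existing API.

```go
// Priority levels and the Job fields below are illustrative.
type Priority int

const (
    PriorityLow Priority = iota
    PriorityNormal
    PriorityCritical
)

type Job struct {
    ID       string
    Priority Priority
    Payload  []byte
}

// shouldDequeue defers low-priority work while the GPU is contended but
// always lets critical jobs through.
func shouldDequeue(job Job, gpuUtilPercent float64) bool {
    if job.Priority == PriorityCritical {
        return true
    }
    return gpuUtilPercent < 80 // illustrative threshold, matching the worker loop above
}
```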
What if another process doesn't play nicely?
a) Per-Process GPU Monitoring:
```
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```
Track per-process utilization over time to identify hogs.
b) Sustained High Utilization Detection:
Flag processes sustaining >80% GPU for >10 minutes:
```go
type ProcessGPUStats struct {
    PID             int
    Name            string
    GPUUtilPercent  float64
    VRAMUsedMB      int
    DurationMinutes float64
}

// DetectGPUHogs returns the processes that have exceeded threshold (GPU %)
// for longer than duration.
func (m *SystemResourceMonitor) DetectGPUHogs(threshold float64, duration time.Duration) []ProcessGPUStats {
    // ...
}
```
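One way to fill in DetectGPUHogs, assuming the monitor keeps a rolling in-memory history of per-process samples (gathered from the per-process query above, or from nvidia-smi pmon / NVML accounting for per-process utilization); the procSample type and the m.samples field are assumptions, not existing QNTX code.

```go
import "time"

// procSample and m.samples are illustrative: a polling loop elsewhere is
// assumed to append one sample per process per tick (e.g. every 30s).
type procSample struct {
    PID         int
    Name        string
    UtilPercent float64
    VRAMMB      int
    At          time.Time
}

func (m *SystemResourceMonitor) DetectGPUHogs(threshold float64, duration time.Duration) []ProcessGPUStats {
    now := time.Now()
    firstAbove := map[int]time.Time{} // PID -> start of its current above-threshold streak
    latest := map[int]procSample{}

    for _, s := range m.samples { // samples assumed ordered oldest to newest
        if s.UtilPercent >= threshold {
            if _, ok := firstAbove[s.PID]; !ok {
                firstAbove[s.PID] = s.At
            }
        } else {
            delete(firstAbove, s.PID) // streak broken
        }
        latest[s.PID] = s
    }

    var hogs []ProcessGPUStats
    for pid, start := range firstAbove {
        if now.Sub(start) >= duration {
            s := latest[pid]
            hogs = append(hogs, ProcessGPUStats{
                PID:             pid,
                Name:            s.Name,
                GPUUtilPercent:  s.UtilPercent,
                VRAMUsedMB:      s.VRAMMB,
                DurationMinutes: now.Sub(start).Minutes(),
            })
        }
    }
    return hogs
}
```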
c) Fair Share Violation Detection:
If QNTX's quota is 30% of the GPU, all other processes combined should use roughly the remaining ~70%. If a single process (e.g. PID 12345) is measured sustaining utilization well above that, it is violating fair share.
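As a tiny illustration (violatesFairShare is a hypothetical helper; the quota is expressed as a 0-1 fraction):

```go
// violatesFairShare flags a process whose sustained GPU share exceeds
// everything left over after QNTX's own quota (e.g. quota 0.3 -> ~70% left).
func violatesFairShare(proc ProcessGPUStats, qntxQuota float64) bool {
    othersShare := 100.0 - qntxQuota*100.0
    return proc.GPUUtilPercent > othersShare
}
```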
1. Self-Protective Throttling:
When GPU hog detected, throttle even more aggressively:
```go
if hogs := sysMonitor.DetectGPUHogs(80.0, 10*time.Minute); len(hogs) > 0 {
    log.Printf("WARNING: GPU hog: %s (PID %d) at %.1f%% for %.1f min",
        hogs[0].Name, hogs[0].PID, hogs[0].GPUUtilPercent, hogs[0].DurationMinutes)
    return baseQuota * 0.1 // Ultra-conservative when a hog is present
}
```
2. Alerting and Logging: log every hog detection with PID, process name, and sustained utilization, and emit an alert when alert_on_hog is enabled (see configuration below).
3. Admin Reporting:
Generate reports for sysadmins:
```
qntx system gpu-report --last 24h
```

Output:

```
GPU Utilization Report (Last 24 hours)

Fair Share Violations:
- python (PID 12345): 18.5 hours at >90% (expected quota: 30%)
- inference-server (PID 67890): 12.2 hours at >70% (expected quota: 40%)

Recommendation: Contact owners or adjust quotas
```
4. Cooperative vs Defensive Mode:
Auto-detect environment behavior:
```go
if historicalHogFrequency > 0.3 { // Hogs detected >30% of the time
    switchToDefensiveMode()
}
```
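One way historicalHogFrequency might be derived, assuming hog checks run on a fixed interval; the mode and selector names are illustrative:

```go
// Illustrative mode selection: track what fraction of recent hog checks found
// at least one hog, and switch to defensive mode past a threshold.
type Mode int

const (
    ModeCooperative Mode = iota
    ModeDefensive
)

type modeSelector struct {
    checks  int
    hogHits int
}

func (s *modeSelector) Record(hogsFound bool) {
    s.checks++
    if hogsFound {
        s.hogHits++
    }
}

func (s *modeSelector) Current() Mode {
    if s.checks > 0 && float64(s.hogHits)/float64(s.checks) > 0.3 { // hogs seen >30% of the time
        return ModeDefensive
    }
    return ModeCooperative
}
```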
5. Advanced: cgroups Enforcement (Linux only):
If running with sufficient privileges, QNTX can place itself (or, with admin cooperation, an offending process) into a cgroup with explicit CPU and memory limits; a sketch follows the note below.
Note: Requires root or CAP_SYS_ADMIN.
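A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled on the parent, and root/CAP_SYS_ADMIN as noted; the cgroup name, helper name, and the choice of which PID to confine are all assumptions.

```go
import (
    "fmt"
    "os"
    "path/filepath"
)

// confinePID moves a PID into a CPU-limited cgroup. Everything here (paths,
// names, the 100ms period) is an illustrative cgroup v2 sketch, not QNTX code.
func confinePID(pid int, cpuPercent int) error {
    dir := "/sys/fs/cgroup/qntx-limit"
    if err := os.MkdirAll(dir, 0o755); err != nil {
        return err
    }
    // cpu.max is "<quota> <period>" in microseconds; 50% of one CPU = "50000 100000".
    quota := fmt.Sprintf("%d 100000", cpuPercent*1000)
    if err := os.WriteFile(filepath.Join(dir, "cpu.max"), []byte(quota), 0o644); err != nil {
        return err
    }
    // Writing the PID to cgroup.procs applies the limit immediately.
    return os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(fmt.Sprintf("%d", pid)), 0o644)
}
```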
6. Fallback: Time-Based Quota:
If the GPU is constantly saturated, utilization-based backpressure never clears; fall back to a purely time-based quota (e.g., only process GPU jobs during configured off-peak windows) instead of utilization-based throttling.
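A minimal illustration of such a window check (the hours and helper name are placeholders, not existing config):

```go
import "time"

// inOffPeakWindow reports whether GPU work is allowed at the given local time.
// Handles windows that wrap past midnight, e.g. 22:00-06:00.
func inOffPeakWindow(now time.Time, startHour, endHour int) bool {
    h := now.Hour()
    if startHour <= endHour {
        return h >= startHour && h < endHour
    }
    return h >= startHour || h < endHour
}
```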
```toml
[pulse.gpu]
hog_detection_enabled      = true
hog_threshold_percent      = 80.0  # Processes using >80% are considered hogs
hog_duration_minutes       = 10.0  # Must sustain for 10+ minutes
defensive_quota_multiplier = 0.1   # Use 10% of quota when a hog is detected
alert_on_hog               = true  # Send alerts when hogs are detected
```
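For reference, a matching Go struct into which that block might be decoded (tags assume a TOML decoder such as github.com/BurntSushi/toml; the struct name is illustrative):

```go
// GPUConfig mirrors the [pulse.gpu] block above; the struct name is illustrative.
type GPUConfig struct {
    HogDetectionEnabled      bool    `toml:"hog_detection_enabled"`
    HogThresholdPercent      float64 `toml:"hog_threshold_percent"`
    HogDurationMinutes       float64 `toml:"hog_duration_minutes"`
    DefensiveQuotaMultiplier float64 `toml:"defensive_quota_multiplier"`
    AlertOnHog               bool    `toml:"alert_on_hog"`
}
```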
QNTX becomes a "good citizen" on shared infrastructure:
You're allocated 30% GPU capacity on a shared server. Another user's training job runs 24/7 at 95% GPU, violating fair share. QNTX detects this and responds:
Result: QNTX protects itself from badly behaved neighbors while maintaining observability.