Optimizing CPU Inference Performance
CPU Resource Management for High Core Count Systems
Overview
When deploying View on servers with high core count processors (80, 128, 192+ cores), giving Ollama access to all available CPU cores does not always result in better performance. In fact, it can lead to degraded inference speeds. This guide explains how to configure CPU resource limits using Docker Compose override files, ensuring your changes persist across system updates.
Why More Cores Isn't Always Better
Machine learning inference workloads, particularly large language models (LLMs), often don't scale linearly with core count. Several factors contribute to diminishing returns on high core count systems:
- Memory Bandwidth Saturation: LLM inference is frequently memory-bound rather than compute-bound. Adding more cores doesn't help when the bottleneck is memory bandwidth.
- Cache Thrashing: When threads are spread across too many cores, L3 cache efficiency decreases as threads compete for cache space and frequently evict each other's data.
- NUMA Penalties: On multi-socket systems, threads accessing memory from a remote NUMA node incur significant latency penalties.
- Thread Synchronization Overhead: More threads means more coordination overhead, which can exceed the benefit of parallel execution.
- Context Switching: The operating system spends more time switching between threads than doing useful work.
For these reasons, limiting Ollama to a subset of available cores often improves both throughput and latency.
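Before choosing limits, it helps to confirm what the host actually looks like. A quick way to see core count, socket count, and NUMA layout on a Linux host (lscpu ships with util-linux, which is nearly universal):

```bash
# Print core count, sockets, and NUMA layout of the host
lscpu | grep -E '^CPU\(s\)|Socket|NUMA'
```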
Using Docker Compose Override Files
The recommended approach for customizing View's resource configuration is to use a Docker Compose override file rather than modifying `compose.yaml` directly. This has a significant advantage: your changes won't be overwritten when you update View to a new version.
Create a file called `docker-compose.override.yaml` in the same directory as your `compose.yaml`:

```yaml
services:
  assistant:
    environment:
      - OLLAMA_MAX_THREADS=48
```

Docker Compose automatically merges this file with `compose.yaml` when you run commands.
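To see the merged result before starting anything, Compose can print the effective configuration (this is standard Compose behavior, not View-specific):

```bash
# Print the effective config after merging compose.yaml and the override
docker compose config
```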
Important: When using an override file, you must run standard `docker compose` commands (e.g., `docker compose up -d`, `docker compose down`, `docker compose restart`) rather than `viewctl` commands. The `viewctl` utility does not process override files.
Calculating Optimal Thread Allocation
Each instance of Ollama can use as many cores as you configure via `OLLAMA_MAX_THREADS`. When planning your configuration, you need to account for:
- Total available CPU cores
- Cores reserved for other system processes (OS, other services, embeddings, etc.)
- Desired number of concurrent Ollama instances
- Threads per instance
Example Calculation
Consider a server with 192 CPU cores where you want to:
- Reserve 48 cores for system overhead and other processes
- Allow each Ollama instance to use 48 threads
```
Total cores:                    192
Reserved for system:          -  48
Available for Ollama:         = 144
Threads per instance:            48
Maximum concurrent instances: 144 ÷ 48 = 3
```
In this scenario, you would configure:
- Set `OLLAMA_MAX_THREADS=48` to limit each instance
- Set your maximum concurrent model instances to 3
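The same arithmetic as a shell sketch, with the example's values filled in; swap in your own numbers, or derive the total from nproc:

```bash
# Capacity planning for the 192-core example above
total=192      # total CPU cores (or: total=$(nproc))
reserved=48    # cores held back for the OS, embeddings, other services
threads=48     # OLLAMA_MAX_THREADS per instance

echo "Available for Ollama:  $(( total - reserved ))"              # 144
echo "Concurrent instances:  $(( (total - reserved) / threads ))"  # 3
```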
General Guidelines
| Total Cores | Suggested `OLLAMA_MAX_THREADS` | Reserved for System |
|---|---|---|
| 32-64 | 24-32 | 8-16 |
| 64-128 | 32-48 | 16-32 |
| 128-192 | 48-64 | 32-48 |
| 192+ | 48-64 | 48-64 |
These are starting points. Benchmark your specific workload to find optimal values.
CPU Pinning with cpuset
For more precise control, you can pin the assistant container to specific CPU cores using the `cpuset` option. This is useful when you need to:
- Isolate Ollama from other workloads
- Ensure consistent performance by avoiding NUMA boundaries
- Reserve specific cores for other applications
Example: Pin to Specific Cores
```yaml
services:
  assistant:
    cpuset: "48-95"
    environment:
      - OLLAMA_MAX_THREADS=48
```

This configuration restricts the assistant container to cores 48 through 95 (48 cores total) and limits Ollama to 48 threads.
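One way to confirm the pinning took effect is to count the processing units visible inside the running container; `nproc` respects the CPU affinity mask, so with the `cpuset` above it should report 48 (this assumes the service is named `assistant`, as in the snippet):

```bash
# Should print 48 when cpuset: "48-95" is active
docker compose exec assistant nproc
```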
Example: Avoid NUMA Boundaries
On a dual-socket system where cores 0-95 are on socket 0 and cores 96-191 are on socket 1:
```yaml
services:
  assistant:
    cpuset: "0-63"
    environment:
      - OLLAMA_MAX_THREADS=64
```

This keeps all Ollama threads on a single NUMA node, avoiding cross-socket memory access penalties.
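Rather than assuming the layout above, you can query your system's actual NUMA topology (numactl may need to be installed separately via your package manager):

```bash
# Show which cores and how much memory belong to each NUMA node
numactl --hardware
```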
Complete Override File Example
Here's a complete example combining thread limits and CPU pinning:
```yaml
services:
  assistant:
    cpuset: "48-143"
    environment:
      - OLLAMA_MAX_THREADS=48
```

This configuration:
- Pins the assistant to cores 48-143 (96 cores)
- Limits each Ollama instance to 48 threads
- Leaves cores 0-47 available for system processes
- Leaves cores 144-191 available for other workloads

Note that on the dual-socket layout described earlier (cores 0-95 on socket 0, 96-191 on socket 1), the range 48-143 spans both sockets. If NUMA locality matters for your workload, prefer a range contained within a single node.
Applying Changes
After creating or modifying your override file:
```bash
# Navigate to your View directory
cd /path/to/view

# Restart services with the new configuration
docker compose down
docker compose up -d

# Verify the configuration is applied
docker compose ps
```

Performance Testing
To find the optimal configuration for your environment:
- Start with the suggested values from the guidelines table
- Run representative inference workloads
- Monitor CPU utilization and inference latency
- Adjust `OLLAMA_MAX_THREADS` in increments of 8-16
- Test again and compare results
Key metrics to monitor:
- Tokens per second (throughput)
- Time to first token (latency)
- CPU utilization across cores
- Memory bandwidth utilization
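As a rough throughput probe, you can time a single generation against the Ollama API. This is a sketch: it assumes the API is reachable on localhost:11434 (Ollama's default port; your View deployment may expose it differently) and uses a placeholder model name. Ollama's non-streaming response includes eval_count (tokens generated) and eval_duration (nanoseconds):

```bash
# One-shot tokens/sec measurement (replace the model name with one you have pulled)
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain NUMA in two sentences.", "stream": false}' \
  | python3 -c 'import json,sys; r=json.load(sys.stdin); print("tokens/s: %.1f" % (r["eval_count"] / r["eval_duration"] * 1e9))'
```

Run it several times at each `OLLAMA_MAX_THREADS` setting and compare, watching per-core utilization in parallel with a tool such as htop or mpstat.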
Troubleshooting
Changes not taking effect
- Ensure you're using `docker compose` commands, not `viewctl`
- Verify the override file is named exactly `docker-compose.override.yaml`
- Check that the file is in the same directory as `compose.yaml`
- Run `docker compose config` to see the merged configuration
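You can also check what the running container actually received, which catches cases where the override merged correctly but the service was never recreated:

```bash
# Inspect the environment of the running container
docker compose exec assistant env | grep OLLAMA_MAX_THREADS
```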
Performance worse after limiting threads
- Try increasing `OLLAMA_MAX_THREADS` slightly
- Check if you've crossed a NUMA boundary with `cpuset`
- Ensure reserved cores are sufficient for system processes
Container fails to start
- Verify the core numbers in `cpuset` exist on your system
- Check syntax in the override file with `docker compose config`