Optimizing CPU Inference Performance
Overview
When running machine learning inference workloads on high core count processors (such as Ampere processors with 80, 128, or more cores), allowing the system to use all available cores can paradoxically lead to degraded performance. This document explains why this happens and how to configure thread limits to achieve optimal inference performance.
The Problem: Thread Contention in High Core Count Systems
Modern server-grade processors offer unprecedented core counts, which is beneficial for many workloads. However, for machine learning inference tasks, particularly with frameworks like Ollama, using all available cores can lead to:
- Thread Contention: Too many threads competing for shared resources (memory bandwidth, cache, etc.)
- Context Switching Overhead: Increased CPU time spent switching between threads rather than doing useful work
- Memory Access Patterns: Memory access becomes less efficient when threads are spread across too many cores
- NUMA Effects: Non-Uniform Memory Access penalties when threads are distributed across multiple CPU sockets
These issues can significantly reduce inference throughput and increase latency, especially for large language models (LLMs) that have complex memory access patterns.
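Before tuning, it helps to know how many logical cores and NUMA nodes your host actually exposes. The sketch below is a standalone diagnostic (not part of the View stack) that uses only the Python standard library and Linux's /sys interface:

```python
import glob
import os

# Logical CPUs visible to the OS (includes SMT/hyper-threaded siblings).
logical_cpus = os.cpu_count()

# NUMA nodes exposed by the Linux kernel; each directory is one node.
numa_nodes = glob.glob("/sys/devices/system/node/node[0-9]*")

print(f"Logical CPUs: {logical_cpus}")
print(f"NUMA nodes:   {len(numa_nodes) or 'unknown (non-Linux host?)'}")

# A thread pool that spans multiple NUMA nodes is more likely to hit the
# remote-memory penalties described above.
if len(numa_nodes) > 1:
    print("Multiple NUMA nodes detected: consider keeping the thread "
          "limit at or below the core count of a single node.")
```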
Solution: Limiting Thread Count with OLLAMA_MAX_THREADS
To address these performance issues, you can limit the number of threads Ollama uses for inference by setting the `OLLAMA_MAX_THREADS` environment variable in your Docker Compose configuration.
How to Configure Thread Limits
Add the `OLLAMA_MAX_THREADS` environment variable to your `compose.yaml` file in the View directory:
```yaml
assistant:
  image: viewio/view-assistant:v1.4.0-test
  networks:
    - private
  environment:
    <<: *service-variables
    EMBEDDINGS_PRELOAD_MODELS: "all-MiniLM-L6-v2" # space separated list
    CORS_ORIGINS: "*" # comma separated list of allowed origins
    OLLAMA_MAX_THREADS: 32 # Limit threads for optimal performance
    HF_HOME: "/app/models/"
    #HF_TOKEN: token # huggingface token only for preloading
  stdin_open: true
  tty: true
  volumes:
    - ./working/assistant/logs/:/app/logs/
  restart: unless-stopped
  depends_on:
    pgvector:
      condition: service_healthy
```
Finding the Optimal Thread Count
The ideal thread count depends on several factors:
- Model Size and Architecture: Different models have different computational characteristics
- Hardware Configuration: CPU type, memory bandwidth, and cache sizes all affect optimal thread count
- Concurrent Workloads: Other applications running on the same system may compete for resources
General Guidelines:
- Starting Point: Begin with a value between 32 and 64 threads
- Never Exceed Physical Cores: Setting a value higher than your physical core count provides no benefit (see the helper sketch after these guidelines)
- Benchmark Different Values: Test performance with different thread counts to find the optimal setting
- Consider Per-Model Settings: Optimal thread counts may vary between different models
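As a concrete starting point for the "Never Exceed Physical Cores" guideline, you can count physical (non-SMT) cores on a Linux host by deduplicating the (physical id, core id) pairs in /proc/cpuinfo. This is a standalone helper sketch, not part of the View tooling; the 64-thread ceiling simply mirrors the suggested starting range above.

```python
import os

def physical_core_count(cpuinfo_path="/proc/cpuinfo"):
    """Count unique (socket, core) pairs; fall back to logical CPUs."""
    cores = set()
    physical_id = None
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("physical id"):
                    physical_id = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    cores.add((physical_id, line.split(":", 1)[1].strip()))
    except OSError:
        pass  # non-Linux host or restricted container
    return len(cores) or (os.cpu_count() or 1)

if __name__ == "__main__":
    physical = physical_core_count()
    # 32-64 is the suggested starting range; never exceed physical cores.
    suggestion = min(physical, 64)
    print(f"Physical cores detected: {physical}")
    print(f"Suggested OLLAMA_MAX_THREADS starting value: {suggestion}")
```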
Performance Testing
To determine the optimal thread count for your specific workload:
- Start with a thread count of 32
- Run representative inference workloads and measure throughput and latency (a benchmark sketch follows these steps)
- Incrementally adjust the thread count (try values like 16, 32, 48, 64)
- Select the configuration that provides the best performance for your specific use case
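The following sketch measures mean latency and token throughput for a fixed prompt. It assumes an Ollama-compatible `/api/generate` endpoint is reachable at `localhost:11434` and that the model name is already available; both the URL and the model name are placeholders to adjust for your deployment.

```python
import json
import statistics
import time
import urllib.request

# Placeholders: adjust the endpoint and model name for your deployment.
URL = "http://localhost:11434/api/generate"
MODEL = "llama3"
PROMPT = "Summarize the benefits of limiting inference thread counts."
RUNS = 5

def time_one_request():
    payload = json.dumps({"model": MODEL, "prompt": PROMPT,
                          "stream": False}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    # Ollama reports eval_count (generated tokens); fall back to 0 if absent.
    return elapsed, body.get("eval_count", 0)

latencies, token_counts = [], []
for _ in range(RUNS):
    elapsed, tokens = time_one_request()
    latencies.append(elapsed)
    token_counts.append(tokens)

print(f"Mean latency: {statistics.mean(latencies):.2f} s")
if any(token_counts):
    print(f"Throughput:   {sum(token_counts) / sum(latencies):.1f} tokens/s")
```

Change `OLLAMA_MAX_THREADS` in `compose.yaml`, restart the stack with `docker compose up -d`, and rerun the script for each candidate value; keep the setting with the best latency/throughput trade-off.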
Technical Background
Limiting threads works because machine learning inference, especially for LLMs, often doesn't scale linearly with core count due to:
- Memory-Bound Operations: Many ML operations are limited by memory bandwidth rather than compute
- Cache Locality: Keeping threads on fewer cores can improve cache hit rates
- Reduced Synchronization: Fewer threads means less synchronization overhead
By finding the right balance of threads, you can maximize the useful work done while minimizing the overhead associated with thread management.