Optimizing CPU Inference Performance

Overview

When running machine learning inference workloads on high core count processors (such as Ampere processors with 80, 128, or more cores), allowing the system to use all available cores can paradoxically lead to degraded performance. This document explains why this happens and how to configure thread limits to achieve optimal inference performance.

The Problem: Thread Contention in High Core Count Systems

Modern server-grade processors offer unprecedented core counts, which is beneficial for many workloads. However, for machine learning inference tasks, particularly with frameworks like Ollama, using all available cores can lead to:

  1. Thread Contention: Too many threads competing for shared resources (memory bandwidth, cache, etc.)
  2. Context Switching Overhead: Increased CPU time spent switching between threads rather than doing useful work
  3. Memory Access Patterns: Memory accesses become less efficient when threads are spread across too many cores
  4. NUMA Effects: Non-Uniform Memory Access penalties when threads are distributed across multiple CPU sockets

These issues can significantly reduce inference throughput and increase latency, especially for large language models (LLMs) that have complex memory access patterns.
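
One common mitigation for the NUMA and cache-locality issues above is to pin the container to the cores of a single socket, in addition to the thread limit described in the next section. The fragment below is a minimal sketch showing only the relevant key (the full assistant service definition appears later in this document); the range 0-31 is illustrative and should be replaced with a range that matches your actual topology, which you can check with lscpu.

assistant:
  # Illustrative: pin the container to one socket's cores so inference threads
  # share local memory and cache; replace "0-31" with a range from your topology.
  cpuset: "0-31"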

Solution: Limiting Thread Count with OLLAMA_MAX_THREADS

To address these performance issues, you can limit the number of threads Ollama uses for inference by setting the OLLAMA_MAX_THREADS environment variable in your Docker Compose configuration.

How to Configure Thread Limits

Add the OLLAMA_MAX_THREADS environment variable to the assistant service in your compose.yaml file in the View directory:

assistant:
  image: viewio/view-assistant:v1.4.0-test
  networks:
    - private
  environment:
    <<: *service-variables
    EMBEDDINGS_PRELOAD_MODELS: "all-MiniLM-L6-v2" # space separated list
    CORS_ORIGINS: "*" # comma separated list of allowed origins
    OLLAMA_MAX_THREADS: 32  # Limit threads for optimal performance
    HF_HOME: "/app/models/"
    #HF_TOKEN: token # huggingface token only for preloading
  stdin_open: true
  tty: true
  volumes:
    - ./working/assistant/logs/:/app/logs/
  restart: unless-stopped
  depends_on:
    pgvector:
      condition: service_healthy
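
After editing compose.yaml, recreate the assistant service (for example with docker compose up -d assistant) so the new environment variable takes effect; a plain restart reuses the existing container and does not pick up configuration changes.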

Finding the Optimal Thread Count

The ideal thread count depends on several factors:

  1. Model Size and Architecture: Different models have different computational characteristics
  2. Hardware Configuration: CPU type, memory bandwidth, and cache sizes all affect optimal thread count
  3. Concurrent Workloads: Other applications running on the same system may compete for resources

General Guidelines:

  • Starting Point: Begin with a value between 32 and 64 threads
  • Never Exceed Physical Cores: Setting a value higher than your physical core count provides no benefit and can reintroduce the oversubscription overhead described above
  • Benchmark Different Values: Test performance with different thread counts to find the optimal setting
  • Consider Per-Model Settings: Optimal thread counts may vary between different models
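
To confirm the physical core count on the host, run lscpu and multiply Core(s) per socket by Socket(s); note that nproc reports logical CPUs, which can exceed the physical core count when SMT is enabled.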

Performance Testing

To determine the optimal thread count for your specific workload:

  1. Start with a thread count of 32
  2. Run representative inference workloads and measure throughput and latency
  3. Incrementally adjust the thread count (try values like 16, 32, 48, 64)
  4. Select the configuration that provides the best performance for your specific use case
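
A low-friction way to run these comparisons is a Compose override file, which Docker Compose merges over compose.yaml automatically, so candidate values can be tried without editing the base file. The compose.override.yaml sketch below assumes the service is named assistant as in the example above and that your base file nests it under a top-level services: key; match whatever nesting your compose.yaml actually uses.

services:
  assistant:
    environment:
      OLLAMA_MAX_THREADS: 48  # candidate value for this benchmark run

Recreate the service between runs and keep the prompt set and request concurrency identical so that throughput and latency numbers are comparable across thread counts.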

Technical Background

Limiting threads works because machine learning inference, especially for LLMs, often doesn't scale linearly with core count due to:

  • Memory-Bound Operations: Many ML operations are limited by memory bandwidth rather than compute
  • Cache Locality: Keeping threads on fewer cores can improve cache hit rates
  • Reduced Synchronization: Fewer threads means less synchronization overhead

By finding the right balance of threads, you can maximize the useful work done while minimizing the overhead associated with thread management.
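
As a rough illustration of the memory-bandwidth ceiling: if generating one token requires streaming roughly 4 GB of weights from memory (for example, a 7B-parameter model quantized to around 4 bits per weight) and the platform sustains about 200 GB/s of memory bandwidth, decode throughput is capped near 200 / 4 = 50 tokens per second regardless of how many cores are running threads. Once enough threads are active to saturate that bandwidth, additional threads contribute only contention and synchronization cost. These figures are illustrative, not measurements from this system.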