Optimizing CPU Inference Performance


CPU Resource Management for High Core Count Systems

Overview

When deploying View on servers with high core count processors (80, 128, 192+ cores), giving Ollama access to all available CPU cores does not always result in better performance. In fact, it can lead to degraded inference speeds. This guide explains how to configure CPU resource limits using Docker Compose override files, ensuring your changes persist across system updates.

Why More Cores Isn't Always Better

Machine learning inference workloads, particularly those of large language models (LLMs), often don't scale linearly with core count. Several factors contribute to diminishing returns on high core count systems:

  • Memory Bandwidth Saturation: LLM inference is frequently memory-bound rather than compute-bound. Adding more cores doesn't help when the bottleneck is memory bandwidth.
  • Cache Thrashing: When threads are spread across too many cores, L3 cache efficiency decreases as threads compete for cache space and frequently evict each other's data.
  • NUMA Penalties: On multi-socket systems, threads accessing memory from a remote NUMA node incur significant latency penalties.
  • Thread Synchronization Overhead: More threads mean more coordination overhead, which can outweigh the benefit of parallel execution.
  • Context Switching: With too many runnable threads, the operating system spends a growing share of its time switching between them rather than doing useful work.

For these reasons, limiting Ollama to a subset of available cores often improves both throughput and latency.

Using Docker Compose Override Files

The recommended approach for customizing View's resource configuration is to use a Docker Compose override file rather than modifying compose.yaml directly. This has a significant advantage: your changes won't be overwritten when you update View to a new version.

Create a file called docker-compose.override.yaml in the same directory as your compose.yaml:

services:
  assistant:
    environment:
      - OLLAMA_MAX_THREADS=48

Docker Compose automatically merges this file with compose.yaml when you run commands.
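If you prefer to script the setup, the override file can also be written from a shell. This sketch simply reproduces the override shown above and confirms the variable landed in the file; it assumes you run it from the directory containing compose.yaml:

```shell
# Sketch: write the override file next to compose.yaml, then confirm
# the variable is present in it.
cat > docker-compose.override.yaml <<'EOF'
services:
  assistant:
    environment:
      - OLLAMA_MAX_THREADS=48
EOF

grep -c OLLAMA_MAX_THREADS docker-compose.override.yaml
```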

Important: When using an override file, you must run standard docker compose commands (e.g., docker compose up -d, docker compose down, docker compose restart) rather than viewctl commands. The viewctl utility does not process override files.

Calculating Optimal Thread Allocation

Each Ollama instance can use up to the number of threads you configure via OLLAMA_MAX_THREADS. When planning your configuration, you need to account for:

  1. Total available CPU cores
  2. Cores reserved for other system processes (OS, other services, embeddings, etc.)
  3. Desired number of concurrent Ollama instances
  4. Threads per instance

Example Calculation

Consider a server with 192 CPU cores where you want to:

  • Reserve 48 cores for system overhead and other processes
  • Allow each Ollama instance to use 48 threads

Total cores:                    192
Reserved for system:           - 48
Available for Ollama:          = 144

Threads per instance:            48
Maximum concurrent instances:  144 ÷ 48 = 3

In this scenario, you would configure:

  • OLLAMA_MAX_THREADS=48 to limit each instance
  • A maximum of 3 concurrent model instances
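The arithmetic above can be scripted as well; a minimal sketch using the example's numbers:

```shell
# Sketch: compute how many concurrent Ollama instances fit, using the
# example figures above (192 cores, 48 reserved, 48 threads each).
TOTAL_CORES=192
RESERVED_CORES=48
THREADS_PER_INSTANCE=48

AVAILABLE=$(( TOTAL_CORES - RESERVED_CORES ))
MAX_INSTANCES=$(( AVAILABLE / THREADS_PER_INSTANCE ))

echo "Available cores: $AVAILABLE"      # 144
echo "Max instances:   $MAX_INSTANCES"  # 3
```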

General Guidelines

Total Cores      Suggested OLLAMA_MAX_THREADS      Reserved for System
32-64            24-32                             8-16
64-128           32-48                             16-32
128-192          48-64                             32-48
192+             48-64                             48-64

These are starting points. Benchmark your specific workload to find optimal values.
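If you want a scriptable starting point, the table can be approximated with a small helper. The cutoffs below are one reading of the table's overlapping ranges, not official values:

```shell
# Sketch: map a total core count to a suggested OLLAMA_MAX_THREADS
# range. The cutoffs are an interpretation of the overlapping rows
# in the guidelines table, not an official formula.
suggest_threads() {
  cores=$1
  if   [ "$cores" -lt 64 ];  then echo "24-32"
  elif [ "$cores" -lt 128 ]; then echo "32-48"
  else                            echo "48-64"
  fi
}

suggest_threads 96   # prints 32-48
```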

CPU Pinning with cpuset

For more precise control, you can pin the assistant container to specific CPU cores using the cpuset option. This is useful when you need to:

  • Isolate Ollama from other workloads
  • Ensure consistent performance by avoiding NUMA boundaries
  • Reserve specific cores for other applications

Example: Pin to Specific Cores

services:
  assistant:
    cpuset: "48-95"
    environment:
      - OLLAMA_MAX_THREADS=48

This configuration restricts the assistant container to cores 48 through 95 (48 cores total) and limits Ollama to 48 threads.
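A quick way to double-check that a cpuset range matches the thread count you intend; a sketch with a hypothetical helper that handles only the simple "start-end" form:

```shell
# Sketch: count the cores in a simple cpuset range like "48-95".
# Handles only the "start-end" form, not comma-separated lists.
cpuset_size() {
  start=${1%-*}
  end=${1#*-}
  echo $(( end - start + 1 ))
}

cpuset_size "48-95"   # prints 48
```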

Example: Avoid NUMA Boundaries

On a dual-socket system where cores 0-95 are on socket 0 and cores 96-191 are on socket 1:

services:
  assistant:
    cpuset: "0-63"
    environment:
      - OLLAMA_MAX_THREADS=64

This keeps all Ollama threads on a single NUMA node, avoiding cross-socket memory access penalties.
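You can see your actual socket layout in the "NUMA node0 CPU(s)" lines of `lscpu` output. As a sanity check against the example topology above (cores 0-95 on socket 0), a sketch with a hypothetical helper:

```shell
# Sketch: flag cpuset ranges that cross the example's socket boundary
# (cores 0-95 on node 0, 96-191 on node 1). Adjust the boundary for
# your own topology as reported by `lscpu`.
crosses_numa() {
  start=${1%-*}
  end=${1#*-}
  if [ "$start" -le 95 ] && [ "$end" -ge 96 ]; then echo yes; else echo no; fi
}

crosses_numa "0-63"     # no  (stays on node 0)
crosses_numa "48-143"   # yes (spans both sockets)
```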

Complete Override File Example

Here's a complete example combining thread limits and CPU pinning:

services:
  assistant:
    cpuset: "48-143"
    environment:
      - OLLAMA_MAX_THREADS=48

This configuration:

  • Pins the assistant to cores 48-143 (96 cores)
  • Limits each Ollama instance to 48 threads
  • Leaves cores 0-47 available for system processes
  • Leaves cores 144-191 available for other workloads

Applying Changes

After creating or modifying your override file:

# Navigate to your View directory
cd /path/to/view

# Restart services with the new configuration
docker compose down
docker compose up -d

# Verify the configuration is applied
docker compose ps

Performance Testing

To find the optimal configuration for your environment:

  1. Start with the suggested values from the guidelines table
  2. Run representative inference workloads
  3. Monitor CPU utilization and inference latency
  4. Adjust OLLAMA_MAX_THREADS in increments of 8-16
  5. Test again and compare results
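The adjustment loop can be partially scripted; a sketch of sweeping candidate values, where the restart line is commented out because it assumes your compose.yaml interpolates the variable (an assumption, not documented View behavior):

```shell
# Sketch: sweep candidate thread counts in 16-thread increments.
# The restart line is illustrative only; it assumes compose.yaml
# interpolates ${OLLAMA_MAX_THREADS}, which may not be the case.
TRIED=""
for t in 32 48 64; do
  echo "benchmarking with OLLAMA_MAX_THREADS=$t"
  # OLLAMA_MAX_THREADS=$t docker compose up -d
  # ...run your representative workload and record tokens/sec here...
  TRIED="$TRIED $t"
done
```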

Key metrics to monitor:

  • Tokens per second (throughput)
  • Time to first token (latency)
  • CPU utilization across cores
  • Memory bandwidth utilization
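Throughput is simply generated tokens divided by wall-clock time; with hypothetical benchmark figures:

```shell
# Sketch: tokens-per-second from hypothetical benchmark figures.
TOKENS_GENERATED=512
ELAPSED_SECONDS=16
TPS=$(awk -v t="$TOKENS_GENERATED" -v s="$ELAPSED_SECONDS" \
  'BEGIN { printf "%.1f", t/s }')
echo "$TPS tokens/sec"   # 32.0 tokens/sec
```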

Troubleshooting

Changes not taking effect

  • Ensure you're using docker compose commands, not viewctl
  • Verify the override file is named exactly docker-compose.override.yaml
  • Check that the file is in the same directory as compose.yaml
  • Run docker compose config to see the merged configuration

Performance worse after limiting threads

  • Try increasing OLLAMA_MAX_THREADS slightly
  • Check if you've crossed a NUMA boundary with cpuset
  • Ensure reserved cores are sufficient for system processes

Container fails to start

  • Verify the core numbers in cpuset exist on your system
  • Check syntax in the override file with docker compose config
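For the cpuset check, note that core indices are zero-based: a host with N cores (as printed by `nproc`) has cores 0 through N-1. A sketch of the check with a hypothetical helper:

```shell
# Sketch: verify a cpuset's highest core index exists on a host with
# the given core count (cores are numbered 0..count-1).
check_max_core() {
  max_core=$1
  host_cores=$2   # e.g. from `nproc`
  if [ "$max_core" -lt "$host_cores" ]; then
    echo ok
  else
    echo "core $max_core not present"
  fi
}

check_max_core 143 192   # ok
check_max_core 143 128   # core 143 not present
```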