GPU and CPU Utilization While Running Open-Source LLMs Locally using Ollama
Author(s): Muaaz

Originally published on Towards AI.

Large Language Models (LLMs) are powerful, but running them locally requires significant hardware resources. Many users rely on open-source models because of their accessibility, since closed-source models often come with restrictive licensing and high costs. In this blog, I will explain how open-source LLMs behave on local hardware, using DeepSeek as an example.

Installing Ollama and Running LLMs Locally

To get started, install Ollama, which provides an easy way to run and manage LLMs locally. Follow these steps:

- Download and install Ollama from the official website: https://ollama.com
- Or install via the command line:

    curl -fsSL https://ollama.com/install.sh | sh

Download and Run a Model Locally

Once Ollama is installed, you can download and run LLMs from the command line:

- Download and run DeepSeek-R1 7B:

    ollama run deepseek-r1:7b

- Download and run DeepSeek-R1 32B:

    ollama run deepseek-r1:32b

When you run either of the above commands, it downloads the model (if it is not already present) and starts an interactive inference session, like this:

[Figure: downloading DeepSeek-R1 7B and running inference with the LLM]

Experiment Setup

I used Ollama to run two different DeepSeek models:

- DeepSeek-R1 7B (small model)
- DeepSeek-R1 32B (large model)

Hardware used:

- GPU: NVIDIA RTX A4000 (16GB VRAM)
- CPU: Intel Core i7-13700
- RAM: 32GB
- Video RAM (VRAM): 32GB

Model Storage and Execution Insights

DeepSeek-R1 7B requires 4GB of disk storage. When I start inference with this model, it runs entirely on the GPU, as it comfortably fits within the 16GB of VRAM. During inference, the model's memory footprint grows because of internal computations (which I will discuss further below). However, this growth stays within the VRAM limit, so the model runs completely on the GPU without falling back to the CPU.

[Figure: GPU utilization while the model is running]

DeepSeek-R1 32B requires 20GB of disk storage.
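These disk footprints line up with simple arithmetic on quantized weights. The sketch below is a rough approximation, assuming a Q4-class quantization of about 4.8 bits per weight; the exact format and overhead Ollama uses may differ:

```python
# Rough disk/memory footprint of quantized LLM weights.
# bits_per_weight=4.8 approximates a Q4-class quantization
# (an assumption; the exact quantization Ollama ships may differ).
def model_size_gb(n_params: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size in gigabytes of a model's stored weights."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"7B model:  ~{model_size_gb(7e9):.1f} GB")   # ~4.2 GB, close to the 4GB observed
print(f"32B model: ~{model_size_gb(32e9):.1f} GB")  # ~19.2 GB, close to the 20GB observed
```

The same formula explains why an unquantized fp16 checkpoint (16 bits per weight) of the same models would be roughly three times larger.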
During inference, however, the 32B model exceeds the GPU memory limit, reaching 48GB of VRAM usage due to internal computations. As a result, the system automatically offloads part of the model to the CPU, running in a hybrid mode (CPU + GPU) to balance the workload and keep execution smooth.

[Figure: CPU and GPU utilization while the model is running]

Why Does the VRAM Usage Increase?

While the base model occupies 20GB on disk, VRAM usage expands significantly during inference. When we download a model, we only store its weights (parameters) on disk. During inference, however, computations over these weights consume additional memory. Since LLMs are transformer-based models, they generate key-value matrices across multiple attention heads, which requires substantial memory. The primary contributors to VRAM growth are activations, the intermediate values produced at each layer, and the key-value cache, which is built up dynamically as the model processes the prompt and generates output.

Performance Monitoring

I monitored execution using Task Manager to observe real-time GPU and CPU utilization. My key takeaways:

- Smaller models run fully on the GPU, providing fast inference.
- Larger models automatically switch to hybrid CPU-GPU execution when VRAM is exceeded.
- Monitoring resource utilization helps you select a model that fits your available hardware.

Conclusion

Running open-source LLMs locally is a feasible alternative to expensive cloud-based solutions. DeepSeek models with Ollama provide a seamless experience, dynamically managing hardware limitations. Understanding the GPU-CPU balance is crucial for efficient deployment. Stay tuned for more insights!
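As a closing aside, the key-value-cache growth described above can be put in numbers. The sketch below uses illustrative architecture figures for a generic 7B-class transformer (layer count, head count, and head dimension are assumptions, not DeepSeek-R1's published configuration):

```python
# Back-of-the-envelope key-value-cache size for a transformer during inference.
# Architecture numbers below are illustrative assumptions for a generic
# 7B-class model, not DeepSeek-R1's actual configuration.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers both keys and values; fp16 elements (2 bytes) by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache at 4096 tokens: {size / 2**30:.1f} GiB")  # → 2.0 GiB with these assumptions
```

Because the cache grows linearly with context length, a long prompt or a long generation can push a model well past the VRAM it needed at load time, which is exactly the expansion observed above.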