Trying local LLM

These are my first experiments running an LLM on my computer. I plan to use it either for agent development or for asking sensitive questions.

Installing Ollama

First, downloaded the Linux binary from https://ollama.com/download/ollama-linux-amd64.tar.zst

Unpacked it to my programs directory

tar --zstd -xf ollama-linux-amd64.tar.zst -C /media/disk/_programs

Created the directory for models

mkdir /media/disk/_programs/ollama/models

Told Ollama about this directory through the environment variable

export OLLAMA_MODELS="/media/disk/_programs/ollama/models"

Ran Ollama

ollama serve

The API is now available at http://localhost:11434/v1.
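This endpoint speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client can talk to it. A minimal sketch using only Python's standard library (the model name is the one pulled in the next step, and the server must already be running):

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def build_chat_request(model, prompt):
    # Payload shape follows the OpenAI chat-completions API.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model, prompt):
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The reply text lives in the first choice, as in the OpenAI API.
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
# print(chat("hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M", "Hi"))
```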

I pulled the model Qwen3.5-9B

ollama pull hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M

To list the installed models

ollama ls

To see the maximum context length a model can support:

ollama show <model_name>

Installing OpenCode

Next, I tried OpenCode.

Downloaded opencode-linux-x64.tar.gz from their GitHub.

Unpacked to /media/disk/_programs.

Created config file ~/.config/opencode/config.json. I configured it to use the local Ollama instance.

{
  "$schema": "https://opencode.ai/config.json",
  "model": "hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "ollama(local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M": {
          "name": "qwen3.5",
          "limit": {
            "context": 256000,
            "output": 16536
          }
        }
      }
    }
  }
}

The model key hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M is the model name from Ollama. It is important because Ollama can serve multiple models. I can retrieve the available model names directly from Ollama using

ollama list
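The same list is also available over HTTP: Ollama serves it at GET /api/tags, which is handy when wiring up an agent programmatically. A small sketch, assuming the server is running locally:

```python
import json
import urllib.request

def model_names(tags_response):
    # /api/tags returns {"models": [{"name": "..."}, ...]}; keep just the names.
    return [m["name"] for m in tags_response["models"]]

def list_models(base_url="http://localhost:11434"):
    # Query the running Ollama server for its locally installed models.
    with urllib.request.urlopen(base_url + "/api/tags") as resp:
        return model_names(json.load(resp))
```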

To run the agent

opencode web

Sessions are persisted in ~/.local/share/opencode.

At this point I hit the first problem. When I tested the model via the OpenCode CLI, something went wrong. Although the model worked perfectly in the Ollama CLI, generating responses in seconds, the OpenCode agent caused Ollama to burn 100% CPU without producing any response. The behaviour was consistent across other agents: I tried KiloCode - same result. Continue - same result.

I investigated further and found a related bug report, GitHub issue 4428.

I attempted to debug the OpenCode request by enabling logging, but the --log-level DEBUG flag yielded no results: only INFO and ERROR records appeared in the log. Furthermore, Ollama itself provided no verbose output. The documentation for Ollama is incomplete, and there are only a few options for configuring it. I found on the internet:

OLLAMA_DEBUG=1 ollama serve

This shows some output from Ollama and libggml. There is a record in stdout:

msg="truncating input prompt" limit=4096

I can fix this by setting a new default context size

OLLAMA_CONTEXT_LENGTH=32768 ollama serve

However, there is still nothing about request processing.

I created a new model from the GGUF file. OpenCode said the model does not support tools. How does it know that? It is very strange; I guess Ollama reported something. I checked the manifests: the model from the Ollama registry contains a template with tool tags. I rebuilt the model with a TEMPLATE directive in the Modelfile and got a new error:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35'

Ollama does not support this model architecture yet. See issue 14503.
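For context, "rebuilding" a model means writing a Modelfile and feeding it to ollama create. A minimal sketch with a placeholder GGUF path and a deliberately shortened TEMPLATE (the real body would be the Go template with tool tags copied from the registry manifest):

```
# Modelfile - placeholder path; TEMPLATE shortened for illustration
FROM ./Qwen3.5-9B-Q4_K_M.gguf
TEMPLATE """{{ .Prompt }}"""
```

Then ollama create qwen3.5-local -f Modelfile registers the model; in my case this is where the unknown-architecture error appeared.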

Sure, I should try another server anyway. I don't like all these sha256 manifests, blobs, and Modelfiles.

Switching to llama.cpp

There are other servers as well, such as SGLang, vLLM, and llama.cpp. Most of them use Python, and their documentation looks as if it were written by Python users for Python users. I don't like Python, so I chose llama.cpp.

I downloaded llama-b8149-bin-ubuntu-vulkan-x64.tar.gz from the llama.cpp GitHub releases. I chose the Vulkan version because my integrated GPU supports Vulkan, and the GPU is more efficient than the CPU.

Downloaded model

export LLAMA_CACHE=/media/disk/_programs/llama_models
./llama-server -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M

Ran server

./llama-server --model ../llama_models/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q4_K_M.gguf \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--port 8080

I sent Hi from the llama.cpp UI. The prompt size was 13 tokens. However, when I sent a request via OpenCode, the system hung and llama.cpp threw an exception. I updated llama.cpp, which made things worse: the Vulkan backend crashed my GPU setup, resetting the GPU and freezing all screens. Only the mouse remained movable; a hard reset was the only solution. The Vulkan version is faster on integrated graphics, but it is unstable and can crash on large prompts. I therefore decided to switch from GPU to CPU.

I wanted to log the incoming prompt using llama.cpp. There is a flag --verbose-prompt, but it only works with llama-cli; with llama-server, it prints nothing. There is an open issue 19653.

I eventually used the --verbose flag. The server prints the prompt size for each request. The OpenCode prompt contains more than 10,000 tokens. The server parses the prompt in chunks at about 10 tokens per second, and only once parsing is complete does it print the request text to the log. Then it generates the response at about 2 tokens per second.

So, what did I see? The OpenCode request is huge: about 2000 tokens for the system message and 8000 tokens for the tool descriptions. I can change the system message via the config, and I can also change or disable the tools. However, OpenCode is not very interesting without its built-in tools. This means that OpenCode and other agents require a model good enough to understand tools, as well as a large context window for the tool descriptions.
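For a rough sense of these sizes without running the server, a common heuristic is about 4 characters per token for English text. A tiny estimator (the ratio is a heuristic, not the model's real tokenizer):

```python
def estimate_tokens(text, chars_per_token=4):
    # Rough heuristic: English text averages ~4 characters per token.
    # Real counts depend entirely on the model's tokenizer.
    return max(1, len(text) // chars_per_token)

# An 8000-token tool description corresponds to roughly 32 KB of JSON.
print(estimate_tokens("x" * 32000))  # ~8000
```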

Despite the verbosity, llama.cpp at least lets me watch the process while waiting for responses. With Ollama, I was left staring at 100% CPU usage with no insight into what was happening.