These are my first experiments running an LLM on my computer. I plan to use it either for agent development or for asking sensitive questions.
Installing Ollama
First, I downloaded the Linux binary from https://ollama.com/download/ollama-linux-amd64.tar.zst
Unpacked it to my programs directory
tar --zstd -xf ollama-linux-amd64.tar.zst -C /media/disk/_programs
Created the directory for models
mkdir /media/disk/_programs/ollama/models
Told Ollama about this directory through the environment variable
export OLLAMA_MODELS="/media/disk/_programs/ollama/models"
Ran Ollama
ollama serve
The API is then available at http://localhost:11434/v1.
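Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with curl (the model name must match one installed via ollama pull; this is an illustration, not part of the setup):

```shell
# Minimal chat request against Ollama's OpenAI-compatible endpoint.
# The model name must match one installed via `ollama pull`; if the
# server is not running, the fallback message is printed instead.
payload='{
  "model": "hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Say hi in one word."}]
}'
curl -s --max-time 5 http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "ollama serve is not running on :11434"
```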
I installed the model qwen3.5-9B
ollama pull hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M
To list the installed models
ollama ls
To see the maximum context length a model can support:
ollama show <model_name>
Installing OpenCode
Next, I tried OpenCode. I downloaded opencode-linux-x64.tar.gz from their GitHub and unpacked it to /media/disk/_programs. Then I created the config file ~/.config/opencode/config.json and configured it to use the local Ollama instance.
{
  "$schema": "https://opencode.ai/config.json",
  "model": "hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "ollama(local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M": {
          "name": "qwen3.5",
          "limit": {
            "context": 256000,
            "output": 16536
          }
        }
      }
    }
  }
}
The model key hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M is the model name from Ollama. It is important because Ollama can serve multiple models. The available model names can be retrieved directly from Ollama:
ollama list
To run the agent
opencode web
Sessions are persisted in ~/.local/share/opencode.
At this point I hit the first problem. When I tested the model via the OpenCode CLI, something went wrong. Although the model worked perfectly in the Ollama CLI, generating responses in seconds, the OpenCode agent caused Ollama to burn 100% CPU without producing any response. The behaviour was consistent across other agents: KiloCode, same result; Continue, same result.
I investigated further and found a related bug report on GitHub, issue 4428.
I attempted to debug the OpenCode request by enabling logging, but the --log-level DEBUG flag yielded no results: only INFO and ERROR records appeared in the log. Ollama itself provided no verbose output either. Its documentation is incomplete, and there are only a few configuration options. Eventually I found on the internet:
OLLAMA_DEBUG=1 ollama serve
That helped: it shows output from Ollama and libggml. There is a record in stdout:
msg="truncating input prompt" limit=4096
I can fix this by raising the default context size:
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
However, there is still nothing about the request processing.
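To avoid retyping the environment variables on every start, the knobs used so far can be collected into one small launch script. A sketch, using the paths from this post and assuming the ollama binary is on PATH (/tmp is only for illustration; put it somewhere permanent):

```shell
# Write a launch script bundling the environment settings used above.
cat > /tmp/ollama-serve.sh <<'EOF'
#!/bin/sh
export OLLAMA_MODELS="/media/disk/_programs/ollama/models"
export OLLAMA_DEBUG=1
export OLLAMA_CONTEXT_LENGTH=32768
exec ollama serve
EOF
chmod +x /tmp/ollama-serve.sh
# Sanity-check the script's syntax without running it.
sh -n /tmp/ollama-serve.sh && echo "script OK"
```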
I then created a new model from the GGUF file. OpenCode said the model does not support tools. How did it know? Very strange; I assume Ollama reported it. I checked the manifests: the model from the Ollama registry contains a template with tools tags. I rebuilt the model with a TEMPLATE in the Modelfile and got a new error:
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35'
Ollama does not support this new model architecture yet. See issue 14503.
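For reference, the rebuild attempt looked roughly like this. The FROM path and the TEMPLATE body here are placeholders, not working values; the real template has to be copied from the manifest of a registry model that declares tool support:

```shell
# Sketch of rebuilding an Ollama model from a GGUF file with a chat
# template. FROM path and TEMPLATE body are placeholders only.
cat > /tmp/Modelfile <<'EOF'
FROM /media/disk/_programs/ollama/models/Qwen3.5-9B-Q4_K_M.gguf
TEMPLATE """{{ .Prompt }}"""
EOF
# ollama create qwen3.5-local -f /tmp/Modelfile
```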
Sure, I should try another server. I don't like all these sha256 manifests, blobs, and Modelfiles anyway.
Switching to llama.cpp
There are other servers as well, such as SGLang, vLLM, and llama.cpp. Most of them use Python, and their documentation looks as if it were written by Python users for Python users. I don't like Python, so I chose llama.cpp.
I downloaded llama-b8149-bin-ubuntu-vulkan-x64.tar.gz from the llama.cpp GitHub. I chose the Vulkan version because my integrated GPU supports Vulkan, which is faster than running on the CPU.
Downloaded model
export LLAMA_CACHE=/media/disk/_programs/llama_models
./llama-server -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M
Ran server
./llama-server --model ../llama_models/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q4_K_M.gguf \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--port 8080
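Before wiring an agent to the server, a quick smoke test is useful. llama-server exposes a /health endpoint alongside the OpenAI-compatible ones:

```shell
# Probe llama-server; prints a notice when nothing listens on :8080.
curl -s --max-time 2 http://localhost:8080/health \
  || echo "llama-server is not running on :8080"
```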
I sent "Hi" from the llama.cpp UI; the prompt size was 13 tokens. However, when I sent a request via OpenCode, the system hung and llama.cpp threw an exception. I updated llama.cpp, which made things worse: the Vulkan backend crashed my GPU setup, resetting the GPU and freezing all screens. Only the mouse remained movable; a hard reset was the only solution. The Vulkan version is faster on integrated graphics, but it is unstable and can crash when a prompt is large. I therefore decided to switch from GPU to CPU.
I wanted to log the incoming prompt using llama.cpp. There is a --verbose-prompt flag, but it only works with llama-cli; with llama-server, it prints nothing. There is an open issue 19653.
I eventually used the --verbose flag. The server prints the prompt size for each request. The OpenCode prompt contains more than 10000 tokens. The server parses the prompt in chunks at a rate of 10 tokens per second, and only once parsing is complete does it print the request text to the log. Then it generates the response at a rate of 2 tokens per second.
So, what did I see? The OpenCode request is huge: it contains about 2000 tokens for the system message and 8000 tokens for the tool descriptions. I can change the system message through the config, and I can change or disable the tools. However, OpenCode is not very interesting without its built-in tools. This means that OpenCode and similar agents require a model good enough to understand tools, as well as a large context for the tool descriptions.
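These numbers explain the wait times. A back-of-envelope check with the rates observed above (the 500-token response length is my assumption):

```shell
# Prompt processing vs. generation time at the observed rates.
prompt_tokens=10000; prompt_rate=10   # tokens, tokens/second
gen_tokens=500;      gen_rate=2       # response length is assumed
prompt_s=$((prompt_tokens / prompt_rate))
gen_s=$((gen_tokens / gen_rate))
total_min=$(( (prompt_s + gen_s) / 60 ))
echo "prompt ${prompt_s}s + generation ${gen_s}s = ~${total_min} min"
```

About 17 minutes of that goes to prompt parsing alone, before a single response token appears.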
Despite the verbosity, llama.cpp lets me wait for responses and monitor the process. With Ollama, I was left staring at 100% CPU usage with no insight into what was happening.