RAG Assessment

What started as a simple comparison of LLM hosting options turned into a deep dive of tool calling and llm frameworks.

Goal: Create a hybrid-RAG pipeline with reasoning to map discrete KSATs from competency frameworks to private training content.

Tools:

Local nomic-embed-text model (ollama)
Local Llama3.3-70b model run with ollama and direct with tensorrt

Originally used grok to construct the python skeleton of this application and then used it as a learning project to delve in and actually build out and fix it into a working program. At a high level the program analyzes structured pieces of training content and uses a RAG model to map relevant KSATs before taking an adversarial approach to validate which are fully covered by the content. The core of the model is the Llama3.3 model which I ran two different ways and wanted to compare output. Ollama ran about 20% faster which surprised me because the tensorrt version was quantized with NVFP4 and should be optimized for the Nvidia blackwell hardware it is running on.

One of the labs has 10 task groups.

Initially I found these results:

ollama

50 total mappings
10 with high confidence / high retrieval rank

tensorrt

62 total mappings
0 with high confidence / high retrieval rank

Digging more into it, I discovered that the tensorrt host that Nvidia provides for the DGX spark isn’t very good at tool calling. This let down a journey of looking at other model providers since I want to keep the NVFP4 quantization for efficiency. This led to the quickstart recipe for vllm.ai

https://docs.vllm.ai/projects/recipes/en/latest/Llama/Llama3.3-70B.html

But then of course the public vllm image doesn’t have support for the GB10 (though it does have aarch64 at least).

Finally, found the vLLM release by nvidia themselves but this required a slightly different method to invoke. And then furthermore it also needs flags to enable tool calling.

With this all in place, I am finally able to use robust tool calling with langchain against the local model. This will enable the mapping function to run and a whole host of other applications coming soon.

For reference, here is the docker command that ultimately worked giving me GB10-optimized inference with openai tool calling support.

docker run \
  -e HF_TOKEN=$HF_TOKEN \
  --rm --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  -v "$MODEL_PATH:/models" \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
  --model /models/Llama-3.3-70B-Instruct-NVFP4 \
  --served-model-name Llama-3.3-70B-Instruct-NVFP4 \
  --dtype auto \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template /opt/vllm/vllm-src/examples/tool_chat_template_llama3.1_json.jinja \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --port 8001 \
  --host 0.0.0.0 \
  --enforce-eager \
  --gpu-memory-utilization 0.80 \
  --async-scheduling \
  --no-enable-prefix-caching \
  --compilation-config '{"pass_config":{"fuse_allreduce_rms":true,"fuse_attn_quant":true,"eliminate_noops":true}}'

Comments

Leave a Reply Cancel reply

More posts

A Primer for Classical AI

Big Problems Need Big Solutions

State of Local AI in Q1 2026

Problem Solving Prompt