{"id":59,"date":"2026-01-03T03:43:42","date_gmt":"2026-01-03T03:43:42","guid":{"rendered":"https:\/\/www.optimizer.llc\/?p=59"},"modified":"2026-01-03T03:43:42","modified_gmt":"2026-01-03T03:43:42","slug":"rag-assessment","status":"publish","type":"post","link":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/?p=59","title":{"rendered":"RAG Assessment"},"content":{"rendered":"\n<p>What started as a simple comparison of LLM hosting options turned into a deep dive into tool calling and LLM frameworks.<\/p>\n\n\n\n<p>Goal: Create a hybrid-RAG pipeline with reasoning to map discrete KSATs from competency frameworks to private training content.<\/p>\n\n\n\n<p>Tools:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Local nomic-embed-text model (ollama)<\/li>\n\n\n\n<li>Local Llama3.3-70b model, run both with ollama and directly with tensorrt<\/li>\n<\/ul>\n\n\n\n<p>I originally used grok to construct the python skeleton of this application, then used it as a learning project to dig in, build it out, and fix it into a working program. At a high level the program analyzes structured pieces of training content and uses a RAG model to map relevant KSATs before taking an adversarial approach to validate which are fully covered by the content. The core of the pipeline is the Llama3.3 model, which I ran two different ways so I could compare the output. 
Ollama ran about 20% faster, which surprised me because the tensorrt version was quantized with NVFP4 and should be optimized for the Nvidia Blackwell hardware it is running on.<\/p>\n\n\n\n<p>One of the labs has 10 task groups.<\/p>\n\n\n\n<p>Initially I found these results:<\/p>\n\n\n\n<p>ollama<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>50 total mappings<\/li>\n\n\n\n<li>10 with high confidence \/ high retrieval rank<\/li>\n<\/ul>\n\n\n\n<p>tensorrt<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>62 total mappings<\/li>\n\n\n\n<li>0 with high confidence \/ high retrieval rank<\/li>\n<\/ul>\n\n\n\n<p>Digging more into it, I discovered that the tensorrt host that Nvidia provides for the DGX Spark isn&#8217;t very good at tool calling. This led me down a path of looking at other model providers, since I want to keep the NVFP4 quantization for efficiency. That search led to the quickstart recipe from vllm.ai:<\/p>\n\n\n\n<p><a href=\"https:\/\/docs.vllm.ai\/projects\/recipes\/en\/latest\/Llama\/Llama3.3-70B.html?\">https:\/\/docs.vllm.ai\/projects\/recipes\/en\/latest\/Llama\/Llama3.3-70B.html<\/a><\/p>\n\n\n\n<p>But then of course the public vllm image doesn&#8217;t have support for the GB10 (though it does have aarch64 at least).<\/p>\n\n\n\n<p>Finally, I found the vLLM release from Nvidia themselves, but it requires a slightly different method to invoke, and it also needs flags to enable tool calling.<\/p>\n\n\n\n<p>With this all in place, I am finally able to use robust tool calling with langchain against the local model. 
This will enable the mapping function to run, along with a whole host of other applications coming soon.<\/p>\n\n\n\n<p>For reference, here is the docker command that ultimately worked, giving me GB10-optimized inference with OpenAI-style tool calling support.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker run \\\n  -e HF_TOKEN=$HF_TOKEN \\\n  --rm --ulimit memlock=-1 --ulimit stack=67108864 \\\n  --gpus=all --ipc=host --network host \\\n  -v \"$MODEL_PATH:\/models\" \\\n  nvcr.io\/nvidia\/vllm:25.12.post1-py3 \\\n  python3 -m vllm.entrypoints.openai.api_server \\\n  --model \/models\/Llama-3.3-70B-Instruct-NVFP4 \\\n  --served-model-name Llama-3.3-70B-Instruct-NVFP4 \\\n  --dtype auto \\\n  --enable-auto-tool-choice \\\n  --tool-call-parser llama3_json \\\n  --chat-template \/opt\/vllm\/vllm-src\/examples\/tool_chat_template_llama3.1_json.jinja \\\n  --kv-cache-dtype fp8 \\\n  --max-model-len 131072 \\\n  --max-num-batched-tokens 8192 \\\n  --max-num-seqs 4 \\\n  --port 8001 \\\n  --host 0.0.0.0 \\\n  --enforce-eager \\\n  --gpu-memory-utilization 0.80 \\\n  --async-scheduling \\\n  --no-enable-prefix-caching \\\n  --compilation-config '{\"pass_config\":{\"fuse_allreduce_rms\":true,\"fuse_attn_quant\":true,\"eliminate_noops\":true}}'<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>What started as a simple comparison of LLM hosting options turned into a deep dive into tool calling and LLM frameworks. Goal: Create a hybrid-RAG pipeline with reasoning to map discrete KSATs from competency frameworks to private training content. 
Tools: Originally used grok to construct the python skeleton of this application and then used it [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-59","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=\/wp\/v2\/posts\/59","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=59"}],"version-history":[{"count":0,"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=\/wp\/v2\/posts\/59\/revisions"}],"wp:attachment":[{"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=59"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=59"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bdbhuelq7tb31dofbuhmjpdiss.ingress.boogle.cloud\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=59"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}