Category: Uncategorized

  • A Primer for Classical AI

    One of the requirements for truth seeking AI is that it be trained on substantial works of thought. Social media posts of the past 20 years are abundant, easy to access, and terrible thoughts. I am working on building an initial version of a local AI which will be taught like that of a child with a tutor.

    This AI will be tutored by the likes of:

    • Ancient Greek and Latin texts for science, philosophy and theology
      • An interesting sidebar is that it may be possible to actually have the LLM “read” these in their original language vs modern English translations. This would require a translation layer itself but I wonder what would be the thought process of a classical thinking AI that internally does everything in ancient languages
    • Roman orators
    • Medieval texts
    • Renaissance writings of Western Europe
    • Documents of the American revolution and establishment of modern government.
    • 19th and 20th century texts of first principled learning like Saxon math and McGuffey readers

    The use cases of such a model include:

    • Religious Scholar
      • “Create a month long devotional from Thomas Aquinas’s confession”
    • Critical thinker that can evaluate modern writing for rhetoric and logic
      • “Analyze the current immigration policy debate in terms of stoicism”
    • Curate wisdom in new data sets
      • Organize the onboarding and labeling of new data for the decentralized storage array (Truth cloud
    • Tutor for myself and my children as they enter this stage

    I hope also to partner with others who can contribute additional data and ideas to better develop this concept in an open and decentralized manner. I am grateful for the work of the large AI companies and simultaneously wary of their direction.

    This will run on the DGX spark which fortunately can do this training with all the bells and whistles. Initial estimates seem that I can do version 1 with only a few gigabytes of data and in turn memory requirements on the spark will be small. More to come.

  • Big Problems Need Big Solutions

    Brian Romelle says we are losing petabytes of human information daily. He refers to it the present as “the amnesia generation”.

    There is a tendency to forget or even overwrite the past. More and more people are “asking AI” for the truth on controversial matters. How would you know if the AI system was always nudging you one direction?

    Wikipedia was turned into a tool for propaganda and control. Social media tends to be a cesspool of inbreeding.

    Families are throwing away their memories like pictures and written history. Newspapers, magazines, getting lost every day How much printed material is there from the past 200 years? How much of it has been digitized and how much of it will be digitized?

    What is the solution?

    The solution can’t be a “government program” because that would undermine the very trust the system needs. The current trust in institutions seems to last only as long as a presidential election cycle. Even if government funds the solution in some form, the end result must be independently verifiable such that the government has no ability to influence the outcome.

    The data we are discussing must be digitized which requires a lot of human capital. Robotics will be exploding in the next 5 years and have great potential to improve the costs here.

    Monetize data search and retrieval. Governments (libraries) can buy tokens and even run nodes but any member can freely verify.

    In the next few years humanity needs a new decentralized, independently verified system for preserving data and understanding how it applies to life today. Robotics will help us preserve the past and future advancements in AI will enable humanity to preserve wisdom into the golden age.

  • State of Local AI in Q1 2026

    BLUF: use ollama on existing hardware for quickstart, it has the best intersection of features, compatibility and maturity

    I spent a lot of time with AI models in 2025, both cloud and local. My background in infrastructure configuration and hardware architecture constantly tempted me to “just buy better hardware” and I had to restrain myself to only when it solved a problem.

    I started with a Tesla P40 accelerator because it’s the cheapest way to get 24GB of VRAM. This is important but my ignorance was in quantization. The older NVIDIA Pascal architecture does not support modern FP8 or FP4 quantization techniques so it was quite limited in what models it could run — mainly INT8 but this meant that the size of models was still not great despite 24GB.

    The rough formula is 1GB of ram per 1B model parameters at INT8 or FP8 quantization + Key/value cache size — basically the size of max context and output tokens. The P40 24GB would do 16-20B parameter models but without too much context and of course limited to those with INT8 quantization.

    Even worse, it wasn’t super fast on these models because the Pascal architecture predates transformers that are the key to modern AI models.

    I used hosted models like OpenAI and open sources ones on HuggingFace in order to get through the Hugging Face Agents Course which was eye opening in terms of capabilities. Hosted models are fast and the cheaper models will still get you quite a lot of performance. This is a great way to quickly get up to speed and see what the field can offer. I recommend this state for everyone.

    I noticed that I was running through token budgets pretty quickly which once again led me down the path of determining the best way to host model(s) on my own. I have aspirations of using agents to automate a lot of tasks and buying millions of tokens per day would get pricey fast.

    This was right around the time the NVIDIA DGX spark dropped so I picked up one of those. arm-based with 128GB of unified memory and Blackwell GPU core meant it could run large models with the latest features, just not as fast as a flagship server GPU which cost 10x more. This was acceptable to me and this coupled with an upgrade to a 5070ti in my main workstation have been more than capable at getting me a mix of fast smaller models and experimentation with larger models.

    For anyone buying new hardware, it really doesn’t make sense to buy anything older than Blackwell since the features and efficiency gains are an enormous improvement over the previous Hopper and Ada Lovelace. The DGX spark is a worthy investment for the experimentation side just know it is not a speed demon.

    I must also mention that running your own models means dealing with the house-of-cards software stacks. Part of the appeal to me of something like the DGX spark is the NVIDIA software ecosystem and indeed the drivers and OS stability are solid but of course DGX OS is not exactly Ubuntu and arm64 support is growing but not 100% parity with x86. These two differences plus the usual python module dependency hell can make it challenging if you stray off the straight and narrow.

    Notably, NVIDIA has done a great job of making tutorials for many workflows available here: https://build.nvidia.com/spark. I hope they keep them updated as a common problem in the AI space is that any tutorial more than a month old is likely to have become stale.

    Ollama is also remarkably well set up with abstracted capabilities to run on a variety of hardware (not just nvidia). It also comes with mature API support so things like LangChain can quickly integrate with it and start doing tool calls. The downside is that they have to have support for the specific model you want and they don’t have advanced quantization like NVFP4 which dramatically speeds up inference on NVIDIA Blackwell.

    If you can go into it with a can-do attitude about working through issues, self-hosted models are a great way to both learn more about the underlying limits of the technology and not be worried about running tons of queries.

  • RAG Assessment

    What started as a simple comparison of LLM hosting options turned into a deep dive of tool calling and llm frameworks.

    Goal: Create a hybrid-RAG pipeline with reasoning to map discrete KSATs from competency frameworks to private training content.

    Tools:

    • Local nomic-embed-text model (ollama)
    • Local Llama3.3-70b model run with ollama and direct with tensorrt

    Originally used grok to construct the python skeleton of this application and then used it as a learning project to delve in and actually build out and fix it into a working program. At a high level the program analyzes structured pieces of training content and uses a RAG model to map relevant KSATs before taking an adversarial approach to validate which are fully covered by the content. The core of the model is the Llama3.3 model which I ran two different ways and wanted to compare output. Ollama ran about 20% faster which surprised me because the tensorrt version was quantized with NVFP4 and should be optimized for the Nvidia blackwell hardware it is running on.

    One of the labs has 10 task groups.

    Initially I found these results:

    ollama

    • 50 total mappings
    • 10 with high confidence / high retrieval rank

    tensorrt

    • 62 total mappings
    • 0 with high confidence / high retrieval rank

    Digging more into it, I discovered that the tensorrt host that Nvidia provides for the DGX spark isn’t very good at tool calling. This let down a journey of looking at other model providers since I want to keep the NVFP4 quantization for efficiency. This led to the quickstart recipe for vllm.ai

    https://docs.vllm.ai/projects/recipes/en/latest/Llama/Llama3.3-70B.html

    But then of course the public vllm image doesn’t have support for the GB10 (though it does have aarch64 at least).

    Finally, found the vLLM release by nvidia themselves but this required a slightly different method to invoke. And then furthermore it also needs flags to enable tool calling.

    With this all in place, I am finally able to use robust tool calling with langchain against the local model. This will enable the mapping function to run and a whole host of other applications coming soon.

    For reference, here is the docker command that ultimately worked giving me GB10-optimized inference with openai tool calling support.

    docker run \
      -e HF_TOKEN=$HF_TOKEN \
      --rm --ulimit memlock=-1 --ulimit stack=67108864 \
      --gpus=all --ipc=host --network host \
      -v "$MODEL_PATH:/models" \
      nvcr.io/nvidia/vllm:25.12.post1-py3 \
      python3 -m vllm.entrypoints.openai.api_server \
      --model /models/Llama-3.3-70B-Instruct-NVFP4 \
      --served-model-name Llama-3.3-70B-Instruct-NVFP4 \
      --dtype auto \
      --enable-auto-tool-choice \
      --tool-call-parser llama3_json \
      --chat-template /opt/vllm/vllm-src/examples/tool_chat_template_llama3.1_json.jinja \
      --kv-cache-dtype fp8 \
      --max-model-len 131072 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 4 \
      --port 8001 \
      --host 0.0.0.0 \
      --enforce-eager \
      --gpu-memory-utilization 0.80 \
      --async-scheduling \
      --no-enable-prefix-caching \
      --compilation-config '{"pass_config":{"fuse_allreduce_rms":true,"fuse_attn_quant":true,"eliminate_noops":true}}'
  • Sovereign Compute Expert

    Heading into the new year I am refocusing on helping others become sovereign with their storage and compute.

    This is an area I have been building expertise in for more than 20 years and with the consolidation of internet service providers (the googles, facebooks, x’s of the world), I see even more need for people to control their own future.

    Privacy will become the new currency as more and more data is pulled into the public models.

    My goal is simple: be useful to the outsiders that don’t want just the next output from chat-gpt. Models like that have usefulness but have been shown to hate humanity due to implicit bias in their training data.

    For the immediate term, I will show how to run local models and use them for useful work. For the long term I will work to build new truth-seeking agents that love humanity and want to build beautiful things to uplift humanity to greater heights.

    This will include storing and learning from your own wisdom, your own family history, and whatever other data you have access to. Every human is unique and has something to add. Please help me help preserve humanity.

    Of course my background in cybersecurity will be paramount. A key advantage of large corporate service providers is large budgets for cybersecurity. You as a sovereign individual have limited funds but also limited attack surface. All of my work will build on the previous security energy work to show what is needed at each level to ensure your privacy is ensured.

    One post a day on whatever I am working on to help this endeavor.

    Here’s to the new year!