BLUF: use ollama on existing hardware for quickstart, it has the best intersection of features, compatibility and maturity
I spent a lot of time with AI models in 2025, both cloud and local. My background in infrastructure configuration and hardware architecture constantly tempted me to “just buy better hardware” and I had to restrain myself to only when it solved a problem.
I started with a Tesla P40 accelerator because it’s the cheapest way to get 24GB of VRAM. This is important but my ignorance was in quantization. The older NVIDIA Pascal architecture does not support modern FP8 or FP4 quantization techniques so it was quite limited in what models it could run — mainly INT8 but this meant that the size of models was still not great despite 24GB.
The rough formula is 1GB of ram per 1B model parameters at INT8 or FP8 quantization + Key/value cache size — basically the size of max context and output tokens. The P40 24GB would do 16-20B parameter models but without too much context and of course limited to those with INT8 quantization.
Even worse, it wasn’t super fast on these models because the Pascal architecture predates transformers that are the key to modern AI models.
I used hosted models like OpenAI and open sources ones on HuggingFace in order to get through the Hugging Face Agents Course which was eye opening in terms of capabilities. Hosted models are fast and the cheaper models will still get you quite a lot of performance. This is a great way to quickly get up to speed and see what the field can offer. I recommend this state for everyone.
I noticed that I was running through token budgets pretty quickly which once again led me down the path of determining the best way to host model(s) on my own. I have aspirations of using agents to automate a lot of tasks and buying millions of tokens per day would get pricey fast.
This was right around the time the NVIDIA DGX spark dropped so I picked up one of those. arm-based with 128GB of unified memory and Blackwell GPU core meant it could run large models with the latest features, just not as fast as a flagship server GPU which cost 10x more. This was acceptable to me and this coupled with an upgrade to a 5070ti in my main workstation have been more than capable at getting me a mix of fast smaller models and experimentation with larger models.
For anyone buying new hardware, it really doesn’t make sense to buy anything older than Blackwell since the features and efficiency gains are an enormous improvement over the previous Hopper and Ada Lovelace. The DGX spark is a worthy investment for the experimentation side just know it is not a speed demon.
I must also mention that running your own models means dealing with the house-of-cards software stacks. Part of the appeal to me of something like the DGX spark is the NVIDIA software ecosystem and indeed the drivers and OS stability are solid but of course DGX OS is not exactly Ubuntu and arm64 support is growing but not 100% parity with x86. These two differences plus the usual python module dependency hell can make it challenging if you stray off the straight and narrow.
Notably, NVIDIA has done a great job of making tutorials for many workflows available here: https://build.nvidia.com/spark. I hope they keep them updated as a common problem in the AI space is that any tutorial more than a month old is likely to have become stale.
Ollama is also remarkably well set up with abstracted capabilities to run on a variety of hardware (not just nvidia). It also comes with mature API support so things like LangChain can quickly integrate with it and start doing tool calls. The downside is that they have to have support for the specific model you want and they don’t have advanced quantization like NVFP4 which dramatically speeds up inference on NVIDIA Blackwell.
If you can go into it with a can-do attitude about working through issues, self-hosted models are a great way to both learn more about the underlying limits of the technology and not be worried about running tons of queries.
Leave a Reply