The pace of LLM development is astounding but I predict we are heading towards a point of diminishing returns for large model advancement. The next frontier is models that can be run locally on your own hardware. Fortunately the pace of hardware development is still robust and with proper quantization even some decently sized models can be run on consumer hardware in a few gigabytes of ram.
A quick note on performance, LLMs are most sensitive to memory bandwidth. Overall CPU performance doesn’t matter as much as having fast RAM which means at least DDR4 in 2025 and better yet DDR5. This will enable your local models to have reasonable performance. Maybe not as lightning fast as the big cloud models but enough so that you aren’t held back by the model. Even better, there are numerous advances in inference-specific hardware that should make this even more accessible in the coming years.
Why would you want to run your own model? Data sovereignty. There is an old saying that if you aren’t paying for the internet service, YOU are the product. This is exactly how giants like Google and Facebook have built empires by data mining and figuring out how best to monetize your data. Public AI providers like Anthropic or OpenAI offer services for free and I would bet all the money that they are storing/data mining peoples requests for a variety of purposes. If you want to use expert AI for hard personal applications involving sensitive data, you will need a local AI.
To this end, I wanted to figure out what can and can’t be done with local AI in October 2025. If you are reading this in the future, hopefully some pieces are still useful! I use VSCode for most of my day-to-day development and I have found AI features from copilot to be useful more often than not.
The trick Is that GitHub copilot cannot be used with local models. It is designed with tight integration to supported cloud providers and has no option for you to define custom providers or endpoints.
Next, I chased down some options for Azure AI Foundry, in particular Foundry Local. The main limitation here was that it has no support for Linux at this time (just Mac and Windows) while I need support in WSL Ubuntu for the majority of my projects.
The final solution was found in my huggingface MCP course which referenced using the Continue extension. This was the trick. Continue replicates most of the copilot features and best of all, it has an easy way to define custom OpenAI compatible endpoints.

You specify roles and then it will automatically start using the model for those. Since it runs off of a general openAI endpoint, you can use any common tool like ollama or vllm to run the model of your choosing. In the above example, I am using vllm docker container to host a Qwen coder model on a GPU and share it out over the local network. The same concept can be extended to public cloud providers like AWS or decentralized on the Akash supercloud.
There was a bit of a learning curve with vscode and continue specifically with WSL. The continue extension is installed in the local workspace for vscode even if you have a WSL workspace active. This means that any configuration changes must be in C:\Users\<username\.continue\config.yaml as this is the root workspace for your vscode instance. If you put these in a folder in your wsl workspace, it will not work.

Note that any MCP servers you defined must also be able to run in native Windows. For example, I have npm installed in WSL but still had to separately install it in Windows native to be able to use the playwright mcp server which requires it.
With continue set up and the models running, it is now time to code with private assistance.
Leave a Reply