If you’re a user of GitHub Copilot, then you’re likely already aware that GitHub recently switched all Copilot plans over to usage-based billing. Every interation with Copilot now burns credits by the token, and with so many of us working agentically these days, this can add up fast.

With this in mind, here’s how you can set up an open-weight model on your own hardware with open-source tooling in your favourite IDE, without any subscription overheads and no per-token bill.

A local model won’t out-think Claude or GPT, but inside a good agent harness with tests and a linter to lean on, it handles a surprising amount of everyday dev work.

1. Find a tool to download models

While you’re probavbly eager to install a model and get building right away, you first need to find a tool to download the models from, one that will then serve the model behind a local API that your editor can call.

We recommend starting with Ollama, as it’s open source and CLI-first, with a lightweight GUI now, too.
Alternatively, Jan is an open source desktop app for downloading and running models, or there’s LM Studio, though while this is free-to-use, it’s not open source.

Whatever you pick, they all use the OpenAI API format so you can always switch to one of the others later on.

2. Pick a model you can load

It should be said you’ll be using open-weight models, which isn’t quite open source as you get the models to run and fine-tune but not necessarily full access to its codebase in the way you might expect. Even so, they’re open enough to download, inspect and self-host for free, so we shouldn’t complain too much.

The hard limit for choosing a model is memory, the VRAM if you’ve got a dedicated GPU, or the unified memory on an Mac. For coding you want three things from a model, access to tools, reasoning, and (a nice to have is) vision, for when you paste in a screenshot, and it’s worth checking the Hugging Face leaderboards for which models are on top this week.

We’d suggest avoiding anything under 4B parameters for real work, though most runners flag whether a model fits before you download it, so if it’s too big for your memory, you’ll probably know.

3. Start the server and fix the context window

Next, fire up the local server in the runner you installed in Step 1 and before we hook it up to our editor, let’s edit a few settings.

Runners default the context window to something tiny — Ollama starts around 4,000 tokens, and when an agent’s system prompt and tool definitions can eat 20,000–40,000 tokens before you’ve typed a single word, we need to make it much bigger if we’re going to work agentically.

So, increase it to 100k or more (if your memory can take it) by either adjusting the slider on the GUI or with /set parameter num_ctx 100000 with the Ollama CLI, to set the context for that session.

Next, begin a chat and watch the tokens-per-second. Under about 10 TPS and you’ll likely want a leaner model or coding will feel slow and sluggish.

4. Wire it into your agent

Now our server is running, let’s point our IDE (in this case VS Code) to it. In VS Code, we open the Copilot model picker, hit Add Models > Custom Endpoint, and give it your localhost URL (Ollama serves it at http://localhost:11434 automatically)

There’s a cold start while the model loads into memory, and then the first prompt is slow, sometimes taking a couple of minutes to chew through a giant system prompt before it answers. Once this is complete, you should notice your model speeds up massively.

5. Borrow GPUs for free

If you’re struggling to run a decent model, OpenRouter exposes hundreds of models, free open-weight ones included, through a single endpoint your editor treats just like a local server.

Make sure you set your spending limits before you connect anything allow-list only the free models, and set new API keys to a zero credit ceiling.

Of course the trade-offs here are that your prompts may be used for training, you need to be online for your agent to run, and free tiers can vanish whenever a provider feels like it. Even so, there are plenty of options to get going if GPUs is your issue.

Happy coding (agentically)!

There we go, you’ve got your first agent up and running on open tooling, with no lock-in and usage-based billing.

From here, work on your harness, with a solid AGENTS.md, a test suite the agent can run, a linter and more.

If you want a deep dive, Alex Ewerlöf’s walkthrough on local LLMs for agentic coding is an amazing guide with all the details covered along the way.