Teaching AI to Actually Do Things: Building a Local AI Orchestrator

I wanted to understand how LLMs really work beyond chatting. So I built a system that lets them interact with the real world—and learned far more than I expected along the way.

It Started With Curiosity

Like many people, I’ve been fascinated by the explosion of AI capabilities over the past couple years. I’d been playing with ChatGPT, running local models through Ollama, experimenting with different prompts—but something felt missing. I was using these powerful language models almost like fancy search engines. They could explain concepts brilliantly, write code, help me think through problems, but they couldn’t actually *do* anything.

Then I saw what other AI agents were doing—systems where language models could use tools, take actions, and interact with real-world systems. The concept was intriguing. I wanted to understand how this actually worked under the hood, and I wanted to do it locally where I could see every piece, tinker with everything, and really learn what was happening.

So I decided to build my own AI orchestrator. Not because I needed a production-ready solution, but because building it seemed like the best way to truly understand the technology.

The Learning Journey

The first thing I discovered is that connecting language models to real-world actions is conceptually simpler than it sounds. At its core, you’re teaching a model that it can request actions by outputting a specific format (I chose JSON); you then parse that output, execute the requested action, and feed the results back to the model. Simple in theory, fascinating in practice.
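That parse-execute-feed-back loop fits in a few lines. Here’s a minimal sketch—`run_tool` and the JSON shape are my own placeholders for illustration, not a real API:

```python
import json

def run_tool(name, args):
    # Hypothetical dispatcher: map tool names to real functions here.
    tools = {"echo": lambda a: a.get("text", "")}
    return tools[name](args)

def handle_model_output(output):
    """Parse the model's reply; execute a tool request if one is present."""
    try:
        request = json.loads(output)
    except json.JSONDecodeError:
        return output  # plain prose, no tool call
    if "tool" in request:
        # In the real loop, this result is appended as the next message
        # so the model can keep reasoning with real data.
        return run_tool(request["tool"], request.get("args", {}))
    return output

print(handle_model_output('{"tool": "echo", "args": {"text": "pong"}}'))  # pong
```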

I started with the basics—could I get a local Llama model to understand it could ping a network device? The first attempts were hilarious failures. The model would cheerfully tell me it was pinging something, generate what looked like reasonable ping output… and of course, none of it was real. It was completely hallucinating the entire interaction.

This led me down a rabbit hole of learning about prompt engineering and “anti-hallucination” techniques. You have to be incredibly explicit with language models. It’s not enough to say “you can use tools.” You need to explain exactly what happens when they request a tool, show them precise examples of the format, and explicitly tell them NOT to make up results. It felt like teaching someone who’s extremely smart but has a tendency to be a little too helpful by filling in gaps with plausible-sounding information.
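To make that concrete, here is the flavor of system prompt that finally worked for me—the wording is illustrative, not a recipe:

```python
# An anti-hallucination system prompt, roughly in the style that worked.
SYSTEM_PROMPT = """\
You can request a tool by replying with ONLY a JSON object, for example:
{"tool": "ping", "args": {"host": "192.168.1.1"}}

Rules:
- NEVER invent or guess tool output. After requesting a tool, STOP;
  the real result will arrive in the next message.
- If no tool is needed, answer in plain prose with no JSON.
"""

print(SYSTEM_PROMPT)
```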

Keeping It Local (Mostly)

One of my core principles for this project was to keep everything running on my own hardware. I wanted to understand the full stack without depending on external APIs or cloud services. This meant running Ollama locally with models like Qwen and Llama—they’re not as powerful as GPT-4, but they’re surprisingly capable for many tasks, and having everything local meant I could experiment freely without worrying about API costs or rate limits.

But here’s where it got interesting: I kept the *tools* local while leaving the door open for cloud-based language models. The realization was that there’s a big difference between where your AI’s “brain” runs and where its “hands” operate. Your language model might run in the cloud for speed and power, but your tools—the things that ping your network, check your devices, run system commands—those should definitely stay on your infrastructure.

This hybrid approach turned out to be surprisingly powerful. I could use fast local models for simple queries and routine tasks, then switch to more powerful cloud models when I needed deeper reasoning or complex multi-step operations. The tools never left my network, but I wasn’t locked into only using local models if I didn’t want to be.
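The routing itself can be as simple as a heuristic. This is a hypothetical sketch—the model names and thresholds are made up for illustration:

```python
def pick_model(prompt: str, needs_tools: bool) -> str:
    """Hypothetical router: cheap local model for short or tool-driven
    requests, cloud model for long open-ended reasoning."""
    LOCAL, CLOUD = "qwen2.5:7b", "gpt-4o"  # names are illustrative
    if needs_tools or len(prompt) < 200:
        return LOCAL
    return CLOUD

print(pick_model("ping 10.0.0.1", needs_tools=True))  # qwen2.5:7b
```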

Discovering the Model Context Protocol

Partway through this project, I learned about the Model Context Protocol (MCP). It’s a standardization effort for exactly this kind of thing—defining how AI models should interact with external tools and data sources. Initially, I wondered if I should rebuild everything to follow MCP standards.

The exploration was fascinating. MCP uses JSON-RPC, has well-defined schemas for tool discovery and invocation, and there’s a growing ecosystem of MCP servers for different purposes. It’s clearly the direction the industry is heading for standardization.
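On the wire, an MCP tool invocation is a JSON-RPC 2.0 request. The method and parameter names below follow the MCP spec; the tool itself is made up:

```python
import json

# A JSON-RPC 2.0 request in the shape MCP uses for tool invocation.
call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "ping_device",               # hypothetical tool name
        "arguments": {"host": "192.168.1.1"},
    },
}
print(json.dumps(call, indent=2))
```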

But for a learning project, I realized that building my own simpler approach first was actually more valuable. By creating a direct, custom implementation, I understood why all the pieces of MCP exist—the tool discovery mechanisms, the schema validation, the structured communication protocol. When you build it from scratch, you encounter all the edge cases that standards are designed to handle.

That said, one of my next experiments will probably be adding MCP client support to the orchestrator. Not because the current approach doesn’t work, but because integrating with MCP would let me tap into that whole ecosystem of existing tools while still keeping my custom tools running. It’s another learning opportunity.

The Fun Part: Building Tools

Once the basic framework was working, the real fun began—adding tools and seeing what became possible. Each tool was its own mini-project and learning experience.

The device tool was a deep dive into interacting with devices through their web interfaces. Learning to probe a device, detect what it actually supports, and adapt accordingly was like solving a puzzle with incomplete instructions.

The web scraping tool led me to discover libraries like Trafilatura, which uses machine learning to extract just the meaningful content from web pages. Suddenly, instead of feeding my AI model 100KB of HTML garbage with ads and navigation menus, I could give it 5KB of clean, relevant text. The quality of responses improved dramatically.

Each tool taught me something new—about the technology being integrated, about how models try to work with tools, and about where they struggle.

What I Learned About LLMs

Building this system gave me insights into how large language models work that I never would have gained just using them for text generation.

First, they’re both smarter and dumber than you might think. A model can understand remarkably complex instructions and make sophisticated inferences, but it can also confidently make up plausible-sounding nonsense if you’re not careful. The key is structure—give them clear formats, explicit examples, and well-defined boundaries.

Second, the iteration problem is real. When you let a model use tools, it naturally wants to take multiple steps: use a tool, examine results, use another tool, refine the approach. But you need to cap iterations or it can spiral into loops. Finding the right balance between giving it enough agency to solve problems and preventing runaway execution was an interesting challenge.
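The cap is just a bounded loop. A hypothetical skeleton—`ask_model` and `run_tool` stand in for the real calls:

```python
MAX_ITERATIONS = 5

def agent_loop(ask_model, run_tool, question, max_iters=MAX_ITERATIONS):
    """Run the think -> tool -> observe cycle at most max_iters times."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        reply = ask_model(messages)           # model's next move
        if reply.get("tool") is None:
            return reply["content"]           # final answer, stop here
        result = run_tool(reply["tool"], reply.get("args", {}))
        messages.append({"role": "tool", "content": str(result)})
    return "Stopped: iteration limit reached."  # runaway-loop guard
```

The interesting tuning knob is `max_iters`: too low and multi-step tasks fail; too high and a confused model burns cycles in circles.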

Third, streaming responses change everything. When you can see the model’s thinking process in real-time, see when it decides to use a tool, and watch results come in as they happen, the interaction feels completely different from batch-style Q&A. It’s more like working with a colleague than consulting an oracle.
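Handling a stream is mostly just reading lines. The chunk shape below matches Ollama’s newline-delimited `/api/chat` stream, but the chunks themselves are hand-written for the sketch:

```python
import json

# Hand-written chunks in the shape Ollama's /api/chat stream uses.
raw_stream = [
    '{"message": {"content": "Ping"}, "done": false}',
    '{"message": {"content": "ing 192.168.1.1..."}, "done": false}',
    '{"message": {"content": ""}, "done": true}',
]

def consume(lines):
    """Yield content fragments as they arrive, stopping at done=true."""
    for line in lines:
        chunk = json.loads(line)
        yield chunk["message"]["content"]
        if chunk["done"]:
            break

print("".join(consume(raw_stream)))  # Pinging 192.168.1.1...
```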

Fourth, even smaller local models can be remarkably effective when they have the right tools. A 7B parameter model that can actually check your network is often more useful than a massive cloud model that can only theorize about networking.

The Unexpected Benefits

What started as a learning project has become something I actually use regularly. Not because it’s polished or feature-complete, but because there’s something deeply satisfying about asking a question in natural language and having my own system—running on my own hardware, using my own tools—go find the answer.

Need to know what devices are on my network? Ask.

Wondering if my IoT devices are working properly? Ask.

Want to understand a complex topic and need it to read several articles? Ask.

Each interaction is transparent. I can see in the logs exactly what tools were called, what parameters were used, what results came back. There’s no mystery about how it got an answer. That transparency was a happy accident of the learning-focused approach—I added extensive logging to understand what was happening, and it turned out that visibility was incredibly valuable.

What I’m Still Exploring

This project has opened up more questions than it answered, which is exactly what I hoped for.

I’m curious about multi-agent systems, Langchain, Langgraph—what happens when you have specialized agents that can work together? A research agent that finds information, a technical agent that validates it, an execution agent that takes actions. How do they coordinate? How do you prevent conflicts?

I want to experiment more with the boundary between local and cloud execution. Could I run lightweight local models for initial understanding and tool selection, then only call cloud models for complex reasoning steps? What’s the optimal hybrid architecture? Can you determine which requests can run locally, and which should be sent to the cloud?

The Model Context Protocol is still calling to me. I understand it conceptually now, but I want the hands-on experience of implementing it, discovering its limitations, and understanding its design choices through use rather than just reading documentation.

I’m also fascinated by the challenge of configuration vs. code. Right now, adding a new agent means editing a YAML file—simple enough. But adding a new tool means writing Python. Could there be a middle ground? A way to define simple tools through configuration while keeping complex ones in code?
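Something like this is what I have in mind—a purely hypothetical schema, not something the orchestrator supports today:

```yaml
# Hypothetical: a simple tool declared in config instead of Python.
tools:
  - name: ping
    description: Check whether a host responds on the network
    command: "ping -c 3 {host}"   # shell template, {host} filled in at call time
    parameters:
      host:
        type: string
        required: true
```

Simple command-wrapping tools could live in YAML like this, while anything needing real logic stays in Python.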

Why You Might Want to Try This

If you’re curious about how AI actually works beyond the chat interface, building something like this is incredibly educational. You don’t need to build exactly what I built—that’s just one approach. But the act of connecting language models to real actions will teach you more about their capabilities and limitations than months of casual use.

The Ollama server, the models, and Open WebUI are all open source and designed to be explored. You can run everything locally with lightweight models on modest hardware, or point the orchestrator at powerful cloud models if you prefer.

More importantly, it’s fun. There’s genuine joy in the moment when your agent successfully uses a tool for the first time, when a complex multi-step operation completes successfully, when you realize you’ve built something that’s both educational and genuinely useful.

The Real Takeaway

We’re in this fascinating moment where AI technology is accessible enough for individuals to experiment with, powerful enough to do real work, but still raw enough that there’s plenty of room for exploration and learning. You don’t need a research lab or a large team. You can run meaningful AI systems on your own hardware, understand how they work, and extend them in whatever directions interest you.

Building the AI Orchestrator taught me far more about language models, prompt engineering, API design, and system integration than any tutorial or course could have. Not because it’s a perfect system—it’s full of rough edges and learned-as-I-went decisions—but because building something real forces you to confront actual problems and find actual solutions.

If you’re curious about this, my advice is simple: build something. It doesn’t have to be ambitious. Start with getting a local model to successfully call a single tool. Then expand from there. Follow your curiosity. You’ll learn more than you expect and probably have more fun than you anticipated.

Where to Start

Whether you want to understand how tool-using AI systems work, build your own custom tools, experiment with different language models, or just have fun making AI do interesting things, this is a decent starting point.

The future of AI isn’t just in the big labs and enterprise platforms. It’s also in individuals exploring, experimenting, and building things to understand how this technology actually works. That’s where the real learning happens.

