How to Write an AI Agent: Lessons Learnt

TL;DR

Don't host the agent, use it as a service
Tokens matter, don't rely only on vision
Stick to mature technologies
A stubborn LLM is the one!

At Browserless, we've spent the last few months building our own agentic browsing experience. We've all started being more LLM-friendly, letting LLMs discover our routes and consume our REST APIs easily. So it seemed like a no-brainer to write our own MCP server that LLMs could consume.

Then we realized we could build an autonomous agent with what we already had.

The road's been hard, there's no denying that. Every other day, Hacker News has a meltdown over the new CLI tool that, this time "for real", will make browsing fully autonomous, or the new Python wrapper that makes round trips to an Anthropic agent and will revolutionize autonomous browsing.

We get it: the situation is very nebulous, and everyone has their own solution. But some of the things we learnt while building our agent look almost like cardinal rules. If they are not already in place, they will be highly valued a year from now, when the LLM space looks very different (just as it did a year ago).

1. Hostability is key

If you take one thing away from this blog post, please let it be this: hostability is key. AI, and agentic browsing in particular, seems to be forgetting how the last two decades of DevEx have evolved into a Thing-as-a-Service philosophy.

Let me put it this way: unless they have an extremely specific reason, nobody builds their own bare-metal servers from scratch, and Amazon, Google, DigitalOcean, Microsoft, and many other cloud providers are a fundamental part of the whole internet infrastructure.

Unless you have an extremely good reason not to, you do CI by paying GitHub, Bitbucket, or a cloud provider to host Gitea for you. Team chat and sync? You pay for Slack or Teams. Internal documentation and ticket tracking? Same story.

This is not bad on its own. It actually allows teams to be more productive and deliver products of much higher quality. AI, and autonomous agents in particular, will eventually be offered the same way, as a "plug and play" feature. You can already see Anthropic/OpenAI/Google providing managed agents on their own infrastructure, which you consume and pay for as a service. Agentic browsing will be no different: in the same way CodeRabbit or Devin is "plugged" into your CI config, agentic browsing through MCP will be offered as a plug-and-play feature.

At Browserless, we kind of made a bet on this approach.

Currently, these agentic browser tools (your Browser Harness, your Vercel agent-browser CLI) are programs you host and execute yourself. You need at least four pieces to run one:

The browser harness repo
A runtime environment
A file system to read and write skills
An LLM

Of course, when you download these repos to your local machine, you can use browsers in the cloud, but these tools forget that remote browser ≠ remote infrastructure. That's why Browserless MCP is plug-and-play, and can connect to any MCP-capable LLM, without having to host anything on your end.

2. You will be chasing the token

Another unavoidable question: is this a bubble?

When subsidies end and investors want their money back, will it implode? Will token usage be sustainable? Regardless of the answer, paying for all those sweet tokens hurts a bit more every time a model gets better and requires more computing power. Which brings me to my second point: every token costs.

It's almost a tautology to say that tokens have a price. But the agentic browsing ecosystem doesn't seem to be concerned with it. Every other week, a new agentic browsing repo comes along claiming to be the most accurate, or the most reliable, or the most stealthy. And strictly speaking, all of these are good things. The problem is that they come at the expense of token efficiency.

Engineers quickly realized that sending pure HTML to the LLM was not as accurate as having the LLM take a screenshot of the page, which, in turn, was not as accurate as taking a screenshot and having the LLM describe it, divide it into visual chunks, process each chunk, and send raw click commands to coordinates. Again, this undeniably makes the model more accurate. But it also makes it bloody expensive and slow.

At Browserless, we bet on a hybrid approach: we give the LLM a text-based representation of the page (including tab and network information), which it can use to infer the state of the page and think through its next action.
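To make the idea concrete, here is a minimal sketch of a text-first observation step, with an optional screenshot for the cases where text alone falls short. Everything in it is hypothetical: `Observation`, `observe`, the `stuck` flag (a stand-in for the model asking for a screenshot), and the fake browser functions are names made up for illustration, not the Browserless implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    text: str                           # page, tab and network state as text
    screenshot: Optional[bytes] = None  # only filled in on the fallback path


def observe(
    get_text: Callable[[], str],
    get_screenshot: Callable[[], bytes],
    stuck: bool,
) -> Observation:
    """Build what the LLM sees this turn: text always, pixels only when stuck."""
    observation = Observation(text=get_text())
    if stuck:
        # Fallback: the text alone was not enough on the previous turn.
        observation.screenshot = get_screenshot()
    return observation


# Stand-ins for real browser calls, so the sketch runs on its own.
screenshots_taken = []


def fake_text() -> str:
    return "tab 1: https://example.com | button 'Accept' | 3 requests pending"


def fake_screenshot() -> bytes:
    screenshots_taken.append(1)
    return b"\x89PNG..."


first = observe(fake_text, fake_screenshot, stuck=False)   # text only
second = observe(fake_text, fake_screenshot, stuck=True)   # text + screenshot
```

The design choice worth noticing is that the expensive call is lazy: `get_screenshot` is passed as a function and only invoked on the fallback path, so the cheap turns never pay for pixels.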

If, and only if, this is not enough, we encourage the LLM to take a screenshot and see what's blocking it. This has undeniably made our MCP agent much more efficient in both speed and token spending. Take, for instance, the following benchmark done against Browser Harness:

Prompt	Browser Harness	Browserless
"Go to ESPN and check Los Angeles Lakers Stats 2023-24, calculate Anthony Davis' games played (GP) percentage, tell me if there are other players with the same games played percentage as Anthony Davis."	54s, 3.8k tok	38s, 1.5k tok
"Go to the Dell store and find a good laptop with an RTX 50xx and at least 32GB of RAM"	6m50s, 17.9k tok	5m26s, 10.6k tok
"Search Amazon for best deal on Skull Candy headphones with ANC, use a proxy on Doral and tell me the fastest delivery date"	8m43s, 17.2k tok	1m54s, 7.4k tok

If you give the LLM something good enough that it can work with, it doesn't have to burn your allowance to do it.
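For a sense of scale, the savings in the table can be worked out with a few lines of arithmetic. The figures below are copied from the benchmark rows (times converted to seconds); the short labels and the helper names `reduction` and `summarize` are just for this sketch, and since the token counts in the table are rounded, the percentages are approximate.

```python
# (seconds, tokens) per run, copied from the benchmark table.
RUNS = {
    "ESPN Lakers stats": {"harness": (54, 3_800), "browserless": (38, 1_500)},
    "Dell laptop search": {"harness": (6 * 60 + 50, 17_900), "browserless": (5 * 60 + 26, 10_600)},
    "Amazon headphones": {"harness": (8 * 60 + 43, 17_200), "browserless": (1 * 60 + 54, 7_400)},
}


def reduction(before: float, after: float) -> float:
    """Percentage saved when going from `before` to `after`, to one decimal."""
    return round(100 * (before - after) / before, 1)


def summarize(runs: dict) -> dict:
    """Per-prompt time and token savings of Browserless relative to Browser Harness."""
    rows = {}
    for name, run in runs.items():
        (t_before, k_before), (t_after, k_after) = run["harness"], run["browserless"]
        rows[name] = {
            "time_saved_pct": reduction(t_before, t_after),
            "tokens_saved_pct": reduction(k_before, k_after),
        }
    return rows


for name, row in summarize(RUNS).items():
    print(f"{name}: {row['time_saved_pct']}% less time, {row['tokens_saved_pct']}% fewer tokens")
```

On these three prompts, that works out to roughly 20% to 78% less time and roughly 41% to 61% fewer tokens per run.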

3. A well-defined problem is not a problem at all (AKA skills are your friend)

An agent that can't select a flight schedule because it can't close the cookie banner. An agent that can't log in to a site because of a CAPTCHA. An agent that realizes mid-session that it needs a proxy… I bet these sound familiar. And all of them happen because the agent doesn't know how to solve that specific problem. If you were skeptical of the last item in the Browser Harness benchmark, where Browser Harness took almost 9 minutes to do what took Browserless under 2, don't be. It happened because the Browser Harness agent didn't realize that it was being bot-blocked until mid-session, and then fell back to searching for the results on Google. Which worked… in the end.

The point I'm trying to make is that skills are an agent's friend. General-purpose LLM products have already realized this and bundle skills for design, programming, mathematics, reading, and writing.

The MCP Browserless Agent is also making a bet on skills: the LLM is told explicitly how it should deal with CAPTCHAs, MFA, shadow DOMs, tab management, etc. In a way, we spend more of our time creating and tweaking skills, so that LLMs can use the agent in the most general way possible.

Of course, you can always describe every single step of the interaction, but that means spending all your time and tokens refining the prompt.
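One way to picture a skill is as a small, named chunk of instructions that only enters the prompt when the page calls for it. The sketch below is an assumption made for illustration: the `Skill` type, the trigger phrases, the instruction texts, and `build_prompt` are all hypothetical, and say nothing about how the Browserless skills are actually stored or selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]  # lowercase phrases hinting that the skill applies
    instructions: str          # what the LLM is told to do about it


# Hypothetical skills; real ones would be longer and more specific.
SKILLS = (
    Skill(
        "cookie-banner",
        ("accept all cookies", "cookie preferences"),
        "Dismiss the consent banner before interacting with the page.",
    ),
    Skill(
        "captcha",
        ("captcha", "verify you are human"),
        "Stop clicking, solve the CAPTCHA first, then retry the last step.",
    ),
)


def select_skills(page_text: str, skills: tuple[Skill, ...] = SKILLS) -> list[Skill]:
    """Return the skills whose trigger phrases appear in the page text."""
    lowered = page_text.lower()
    return [s for s in skills if any(t in lowered for t in s.triggers)]


def build_prompt(task: str, page_text: str) -> str:
    """Task first, then only the relevant skills, then the page itself."""
    parts = [f"Task: {task}"]
    for skill in select_skills(page_text):
        parts.append(f"[skill:{skill.name}] {skill.instructions}")
    parts.append(f"Page:\n{page_text}")
    return "\n".join(parts)


print(build_prompt("book a flight", "Accept all cookies to continue"))
```

The user's prompt stays a one-liner ("book a flight"), while the know-how about obstacles lives in the skills and is only paid for, in tokens, when it is relevant.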

4. You will be reinventing the wheel

The current agentic landscape feels like the DotCom era: the lack of standardization, the hype, the confusion about topics... We've seen this movie so many times before that we already know the ending.

Because there isn't a unified way to handle things like state persistence, session management, or even how an agent should report a failure, everyone ends up building their own custom logic from the ground up.

Writing and hosting your own personal browsing agent means you'll find yourself writing yet another wrapper for element selection or a bespoke retry mechanism for flaky network conditions. But even in the chaos of the DotCom era, there were some pieces of tech that folks just knew would prevail: the browser, HTTP, streaming, even software as a service (Warcraft should really ring a bell).
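For a taste of that bespoke plumbing, this is roughly what a hand-rolled retry helper with exponential backoff tends to look like. It is a generic sketch, not code from any of the tools mentioned here; `retry`, its parameters, and the `load_page` stand-in are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


# Stand-in for a flaky page load: fails twice, then succeeds.
state = {"calls": 0}


def load_page() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("flaky network")
    return "<html>ok</html>"


# A tiny base_delay keeps the demo fast.
html = retry(load_page, attempts=3, base_delay=0.01)
```

It is only twenty-odd lines, but every self-hosted agent ends up with its own slightly different copy, plus similar helpers for selectors, sessions, and proxies.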

Here's the other bet we made at Browserless: skills and MCP are probably the standards that will prevail. We went with the mature architecture that the industry really should be adopting. By using MCP, we've eliminated the need for developers to write their own wrappers for retrying flaky operations, for specific agent skills, or for managing sessions and proxies. What's left is a plug-and-play, streamlined, production-ready experience that allows you to focus on your agent's objective rather than the underlying infrastructure.

5. You want a stubborn model

The technology behind LLMs is here to stay. It is creative and flexible; it's a supportive, jolly good fella. For browsing agents, however, that's actually a liability. You want a model that is stubborn, one that keeps its eyes on the prize and doesn't stray from the original objective. One that doesn't care if this is round trip number 73. The web is a noisy place, filled with pop-ups, irrelevant sidebars, and confusing UI patterns that are basically catnip for a model that is friendly with everything it sees.

You want a model that doesn't get distracted by a "Sign up for our newsletter" modal or a "You are the 1,000,000th visitor! You won an iPad" ad. It sees the hurdle, tries the instructed workaround, and moves on. If that fails, it doesn't politely try to "help" by navigating to a completely different part of the internet because it thought that might be nice; it stays focused on the goal and does not budge. Precision beats politeness every single time when you're paying for the tokens.

Curiously, we've found that Anthropic and xAI models are the most "stubborn" in this regard, while OpenAI models give up really easily. We recommend plugging the Browserless MCP into Opus or Grok!

Ok, I'm sold

We already introduced the MCP agent in another blog post; check it out for a more general idea of what sets us apart from competitors. You can read about the more technical aspects and learn how to connect to our MCP in our docs. It is open source, so you can run it locally. And you can even use it without an Anthropic/LLM account, through the Browserless CLI.
