As AI booms, the open Internet may soon be a relic of the past

The era of democratised access to information might be drawing to a close.

Photo Credit: Unsplash/Fredy Jacob

In May, I wrote about how technology giants and AI startups are gunning for our data, methodically strip-mining all the available data on the Internet to feed the insatiable appetite for new AI models.

Today, I want to highlight a new development: how content providers are fighting back, thwarting rampant web-scraping and setting up walled gardens to protect their data from unscrupulous AI firms.

Unfortunately, this battle between giants could have unintended consequences for all of us, even those who care nothing about AI or copyright. When the dust settles, the Internet that remains could well be radically different from the one most of us grew up with.

Photo Credit: Unsplash/Caleb Woods

A nod and a wink

Despite valiant – and sometimes even comical – efforts by AI firms to avoid talking about the provenance of their data, where that data comes from is an open secret to everyone involved in AI training. AI models need copious amounts of data, and most firms get it primarily by taking it without asking.

Indeed, a 404 Media investigation published this week found that AI video generator Runway used hundreds of YouTube videos to train its latest Gen-3 model. An internal document leaked by a former employee listed thousands of YouTube videos – and even pirated movies – used as training data.

Ethical AI training is possible. AI Singapore created its Sea-Lion AI model without relying on data brokers offering content of dubious or unknown provenance, opting instead to forge partnerships with various organisations and institutions across the region to create Southeast Asia's first large language model.

Photo Credit: Unsplash/Fikri Rasyid

Raising the drawbridges

In the meantime, drawbridges are being raised across the Internet. Reddit has started blocking major search engines and AI bots unless they pay – it currently has a partnership with OpenAI for an undisclosed sum and a US$60 million deal with Google.

Of course, content was already being walled off before ChatGPT crashed the party, as publications pushed subscription services to stay financially viable. There is no question, however, that AI has greatly accelerated this trend, forcing media firms and online services to rethink their content monetisation and data strategies even more aggressively.

But surely individual websites are in no position to stop AI bots from accessing their data? Not so, according to content delivery network giant Cloudflare. After an initial trial, it concluded that website owners don't like the idea of AI bots scraping their data. With that in mind, Cloudflare earlier this month made its automated blocking of AI scraper bots available for free.
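For the curious, the simplest layer of this defence is the humble robots.txt file. What follows is an illustrative example of my own – the crawler names shown are ones the respective firms have publicised, but this is a hypothetical file, not Cloudflare's mechanism. Indeed, robots.txt is merely a polite request that bots are free to ignore, which is precisely why network-level blocking of the kind Cloudflare offers exists at all:

```text
# Hypothetical robots.txt asking AI crawlers to stay away.
# Compliant bots honour it; others simply ignore it.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Ordinary search indexing remains allowed
User-agent: *
Allow: /
```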

Photo Credit: Unsplash/Valery Fedotov

Growing cracks and fissures

I remember how the bulletin board systems (BBS) and online forums of my youth fascinated me. I could spend hours exploring content spanning multiple topics and engaging in discussions with people from all walks of life.

The Internet eventually superseded them as it evolved into the enormous planet-spanning network we know today, offering an incredible array of democratised information and diverse insights from people in every corner of the globe.

Unfortunately, it might already have peaked as a digital knowledge repository. Today, cracks are spreading across the edifice as it fragments into disparate information silos, a rising tide of paywalls, and members-only communities. Where will it end?

Photo Credit: Unsplash/Sharosh Rajasekher

The post-AI Internet

In the meantime, what will happen with generative AI? Make no mistake. The next generation of AI models will require an unimaginable amount of fresh data. And nascent efforts at creating "synthetic" data to meet the voracious appetite of AI models might be doomed to failure.

A new report published this week in Nature found that AI trained on AI outputs churns out gibberish. This finding isn't new, but it is the latest study showing how using the outputs of AI models for training quickly devolves into a phenomenon known as model collapse.
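The mechanism is easy to see in miniature. Here is a toy sketch of my own devising – not the Nature study's methodology – that repeatedly fits a simple statistical model to samples drawn from the previous generation's model. Each refit adds estimation noise, and over many generations the diversity of the "data" drains away:

```python
import random
import statistics

def train_generations(data, generations=500, sample_size=20, seed=42):
    """Fit a Gaussian to data, sample a synthetic 'dataset' from the fit,
    refit on that, and repeat -- a toy stand-in for training AI on AI output."""
    rng = random.Random(seed)
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    for _ in range(generations):
        # Each generation sees only the previous generation's output
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(synthetic)
        sigma = statistics.stdev(synthetic)  # estimation noise compounds
    return mu, sigma

# 'Human' data: 1,000 samples from a standard normal distribution
human_rng = random.Random(0)
human_data = [human_rng.gauss(0, 1) for _ in range(1000)]

mu, sigma = train_generations(human_data)
print(f"spread after 500 generations: {sigma:.6f} (started near 1.0)")
```

With each generation fitted only to the previous one's output, the estimated spread performs a downward-drifting random walk, so the model's notion of the world narrows until little variety remains – a crude analogue of model collapse.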

It is somewhat ironic that the rise of AI might well lead to its decline, as data sources shrivel or get locked away. After all, one of AI's core strengths is its ability to aggregate existing pools of information and generate new content built on human creativity and ingenuity. With scant new sources of input, it will surely struggle to maintain its relevance and accuracy over time.

Or maybe not. Tech giants are using the tens of billions in their war chests to sign deals with publishers and content providers at a frenetic pace to gain access to new content created by humans. Because there can be no AI without it.

Yet no matter how it plays out, the Internet of tomorrow will probably be a radically different place for the rest of us.

Enjoyed reading this? Sign up here to get a digest of my stories in your inbox every week.