Everything You Published Became Google’s Brain
You trained the machine and got nothing for it. Now what?
For over a decade, Google has been quietly using the open web as raw material for its AI. Now, as lawsuits mount, the real stakes are becoming clear: if creators win, the machine stops learning. If Google wins, the web is no longer yours.
We can’t let that happen.
The quiet remapping of human meaning
In 2013, Google released a research paper about a system called word2vec. It wasn’t flashy. No talking robots or image generators. Just a clever new way to turn words into numbers so that machines could understand how they relate to each other.
Here’s how it worked in plainish terms:
The system scanned massive amounts of text: websites, books, news articles, Wikipedia entries. Yep, even your blog if you were around back then. (I was.)
It didn’t try to understand what the text meant like a person would. Instead, it looked at which words appeared near each other.
It represented each word as a vector, or list of numbers. These weren’t random: they were positioned so that distance and direction in the space capture the semantic relationships between words.
So over time, it figured out patterns. For example, “Paris” often appears near “France,” so those words end up close together in the language space, which is like a galaxy of words. The closer two words orbit each other, the more alike their meanings and usage.
But it also captured deeper relationships, like how “king” and “queen” are used in similar ways, but differ along the axis of gender. In that space, the difference between “king” and “queen” is similar to the difference between “man” and “woman.”
For example:
King − man + woman ≈ Queen
Queen − woman + man ≈ King
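You can poke at this yourself. Here’s a minimal sketch using the open-source gensim library and its downloadable Google News word2vec vectors — a public descendant of the original research, not anything from inside Google — where exact neighbors vary by model, so treat the output as illustrative:

```python
# Toy demo of word2vec arithmetic using gensim's pretrained vectors.
# Note: api.load() downloads a large (~1.6 GB) model on first run.
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")

# Words used in similar contexts end up close together in vector space.
print(kv.similarity("Paris", "France"))  # high cosine similarity

# The classic analogy: king - man + woman lands near "queen".
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

The top result is typically “queen”: arithmetic on the vectors lines up with relationships between meanings.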
This wasn’t just a cool trick. It was the beginning of machines building models of meaning from human speech.
That same year, Google began integrating these models into its products. Especially Search.
Google’s march to monopolize the web
Before word2vec, Google mostly functioned like a library index. You searched for something, and it gave you a list of links to relevant pages. After word2vec, the goal shifted: instead of showing you where to find answers, Google wanted to be the answer.
That meant a new kind of data gathering, not just to help people find information, but to help machines generate it.
Embedding words in language space
Word2vec was a shallow neural network that used a fixed embedding for each word. So “bank” occupied the same vector whether it referred to a river bank, money in the bank, or a bank shot in pool.
To humans, the word has different meanings depending on its use. To teach its algorithms those different meanings, Google researchers developed Bidirectional Encoder Representations from Transformers (BERT, 2018), an opaque way of saying that a word’s embedding was no longer fixed.
Understanding nuance by context
In other words, BERT creates dynamic representations based on full sentence context. So “bank” in “river bank” got a different vector from “bank” in “bank account” or “bank shot.” The system was beginning to understand nuance.
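A rough sketch of that difference, using the public bert-base-uncased checkpoint via the open-source Hugging Face transformers library (a stand-in for Google’s internal systems, not the production model):

```python
# Contextual embeddings: the same word gets different vectors in different
# sentences. Uses the public bert-base-uncased checkpoint via Hugging Face.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the hidden-state vector for the 'bank' token in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden[0, idx]

v_river = bank_vector("we sat on the river bank")
v_money = bank_vector("i deposited money at the bank")

# Word2vec would give these one identical vector; BERT does not.
sim = torch.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity: {sim:.2f}")  # noticeably below 1.0
```

Under word2vec the similarity would be exactly 1.0, because “bank” had one vector everywhere. Here the vectors diverge because the surrounding sentences differ.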
That kind of knowledge doesn’t just happen. The models needed ever-larger datasets to stay accurate and competitive, and the biggest, richest dataset available was everything we published online.
The machine didn’t train itself
Every forum post, blog, article, review, and edit? It was all scraped, processed, and fed into the system.
This shift is what made possible the most recent wave of AI tools, including Google’s knowledge panels, featured snippets, “People Also Ask,” and now “AI Overviews,” which summarize answers at the top of search results — often without giving credit to the human sources they’re based on.
Why Google is fighting so hard in court
You may have seen headlines about the New York Times suing OpenAI and Microsoft, or authors suing over AI training on their copyrighted works. But the more existential legal battle is brewing around Google itself.
Because here’s the uncomfortable truth: Google’s AI business is built on a decade of training data it never paid for. The information that word2vec trained on? Much of it was subject to copyright. Same with the datasets that BERT consumed. And BERT’s training was part of the basis for Gemini.
This process is known as transfer learning. It happened at the architectural and methodological level when BERT evolved from the concepts embodied in word2vec. And it’s happening now as Gemini — that intrusive AI Google now shoves in every possible product — builds on pretrained components and techniques inherited from earlier models like BERT, T5, and PaLM.
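In code, transfer learning is almost embarrassingly simple. A minimal sketch with the Hugging Face transformers library (illustrative, not how Google builds Gemini internally):

```python
# Transfer learning in one step: reuse pretrained BERT weights for a new task.
# Only the new classification head starts untrained; everything else carries over.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # weights learned from a huge scraped text corpus
    num_labels=2,         # new task: e.g., binary sentiment classification
)
```

One load call, and every pattern the base model absorbed from its scraped training corpus carries over to the new task — no original data required.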
Google’s AI business is built on a decade of training data it never paid for.
In other words, if courts decide that this use of publicly posted content was unauthorized, Google’s entire AI stack — from BERT to Gemini, from its algorithm ranking websites in search to its AI-operated ad tech — becomes legally questionable.
This is almost certainly why Google is racing to integrate AI into every product it offers: not just to innovate, but to shore up a defense. As in, “We can’t unwind this because it’s foundational to everything we do now.”
Undoing that training, or compensating the people whose work powered it, wouldn’t just be inconvenient. It could be catastrophic to Google’s business model.
And that tells you everything you need to know about the stakes.
Everything you wrote became someone else’s profit
This didn’t just happen to journalists or big-name authors.
It happened to small bloggers explaining niche topics.
To food bloggers writing about Grandma’s meatloaf.
To product reviewers, Amazon affiliates, tech support forums, Quora contributors, Reddit commenters, and Wikipedia editors.
To people commenting on their friends’ vacation photos publicly shared to Instagram or Facebook — and to the people who posted those photos, too.
Your life and work became the training data for commercial AI. In many cases, it’s now used to answer questions directly instead of sending searchers to your site.
Google doesn’t just crawl the web anymore. It digests it, extracts the patterns, and spits out synthetic knowledge.
All while doing a worse job of crediting its sources than a C-minus term paper.
They took your work. Here’s how you push back.
If you’re wondering what can still be done, how creators might fight back, organize, or get compensated, here are the three most realistic remedies on the table right now.
But first, some plain truth.
It’s too late to untrain the models. What you published online has already been absorbed and broken down into patterns, embedded in algorithms, and distributed across Google’s ecosystem. And not just theirs.
Through outputs, APIs, and model inheritance, your work may have quietly shaped other systems too — systems you’ll never see, but that now depend on the unpaid labor of millions of creators.
And because of transfer learning, which I covered above, those systems are built on top of each other. In other words, once your data is in, it stays in.
You can’t claw that data back. But you can still push forward.
We have a narrow window to demand transparency, accountability, and compensation before this becomes the permanent cost of being online.
The cavalry is not coming. We have to save ourselves.
Collective bargaining sounds like the obvious solution. We heard a lot about that when OpenAI burst onto the scene. Creators would hold hands and join forces to negotiate licensing deals and hold those naughty Big AI companies accountable.
Except… no one did it.
Raptive, an ad agency representing many of the highest-traffic bloggers on the internet, hinted at it. Then they went silent.
A few grifters asked creators to cough up $1,000+ to “start a movement,” but somehow the movement always looked more like a business plan to enrich its founders than an actual union.
So let’s be clear: no one’s coming to save us. There’s no organization. No rights group. No licensing collective.
But that doesn’t mean you’re powerless. If you have a platform — a blog, a newsletter, a feed, a following — you have leverage. Here are three ways to use it.
1. Shine a light on the theft and publicly shame it.
If your content shows up in ChatGPT, Perplexity, or Google’s AI answers, with or without credit or a working link, screenshot it. Share it. Not as a traffic gripe, but as proof they’re using your labor to pad their pockets.
Then demand transparency from the offending company. Ask:
What data did you train on?
Was it licensed?
Who got paid?
They won’t answer, at least not at first. But the more creators ask, the harder it gets to ignore.
2. Use content-origin tags, watch who ignores them, and build your record online.
Add robots.txt and meta tags that tell AI bots not to train on your content. Tools like Yoast have made this easier.
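For reference, here’s what the robots.txt version looks like. The user-agent tokens below are the ones these companies document today (GPTBot for OpenAI, Google-Extended for Google’s AI training, CCBot for Common Crawl, and so on), but names change and new bots appear, so check current documentation before relying on this list:

```
# robots.txt: opt out of known AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Meta-tag equivalents exist too (like the nonstandard “noai” directive), but robots.txt is the signal these crawlers at least claim to honor.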
Will they respect it? Some will. Some won’t. That’s the point.
Keep your traffic logs.
Study them.
Name the bots that violate the signal.
Call them out online.
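A minimal sketch of that audit in Python, assuming a standard web-server access log (the file path and bot list are placeholders; adjust both for your own setup):

```python
# Count hits from known AI crawlers in a standard access log.
# AI_BOTS and the log path are illustrative; adjust for your own server.
from collections import Counter

AI_BOTS = ["GPTBot", "Google-Extended", "CCBot", "PerplexityBot", "ClaudeBot"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        for bot in AI_BOTS:
            if bot in line:  # combined log format puts the user agent last
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```

If a bot you disallowed in robots.txt keeps showing up here, that’s your pattern of disregard, documented.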
By doing this, you’re establishing a pattern of disregard, and that pattern matters. Because if this ever goes to court or Congress, it won’t be about one blog. It’ll be about companies willfully ignoring explicit instructions from hundreds of thousands of creators.
(Plus, you’ll have started gathering the evidence you might need to lay claim to any compensation fund should there ever be one.)
If you have a platform — a blog, a newsletter, a feed, a following — you have leverage.
3. Pester the hell out of politicians for specific measures.
This is where individual action meets systemic change. Write your political representatives. Interact with them on social media. Drop your proof of harms from the above tactics in their comment sections.
What to push for:
Regulatory oversight, routine audits of AI training data, and public disclosure of training sources;
Model-level enforcement of content-origin tags with fines for breach;
Compensation funds for creators whose work was used without consent;
The equitable and transparent use of AI online and off.
This isn’t just a copyright debate. It’s a policy failure, the kind that exposes millions to the theft of their life’s work and their ability to earn a living.
This isn’t about search anymore. It’s about who gets to own knowledge. What you shared freely — in the spirit of creativity and satisfying curiosity — was taken to build the statistical skeleton of the machine.
And now, that machine is here to replace you.
We don’t get a say in what’s already been done. But we can fight for a future where creators have power, where transparency is non-negotiable, and where your voice isn’t stolen.
Speak up.
Because if you don’t, the machine will answer instead of you.
⚙️ Join the System Breakers
If you found this post useful, share it. Post it. Screenshot it. The more people understand what’s been taken from them, the harder it is for companies to keep doing it in the dark.
And if you want to support this work — not just what’s here, but what’s coming — become a paid subscriber.
System Breakers keep this newsletter independent, human, and growing.
Let’s build a smarter resistance →