That’s not going to go well. There’s a reason genAI engines often spew garbage; it’s what they were trained on. For instance, 80% of OpenAI GPT-3 tokens come from Common Crawl. Like the name says, these petabytes of data are scraped from everywhere and anywhere on the web. As a Mozilla Foundation study found, the result is not trustworthy AI.
Worse still, genAI tools will eventually start consuming their own garbage. This is a known problem, and it leads to model collapse. Or, as neuroscientist Erik Hoel pithily describes the end result: “synthetic garbage.” He’s not alone; many AI engineers think even a little bit of AI-generated data can poison their LLMs.
At the same time, genAI companies aren’t doing us, or themselves in the long run, any favors. For example, Google’s AI-powered “Overviews” places concise AI-generated summaries at the top of search results. The move promises quicker access to information.
Liz Reid, who oversees Google’s search operations, maintains that AI Overviews will also drive more searches and more clicks to websites by piquing users’ interest, as they seek to “dig deeper” after getting the initial synthesized summary.
Publishers know better. Who will bother to go to the real story, which might require a subscription or, horrors, seeing an ad?
Danielle Coffey, CEO of the News Media Alliance, which represents more than 2,200 publishers, warns that the change could be “catastrophic” for an industry already struggling with declining ad revenue. “It’s offensive and potentially unlawful for a dominant monopoly like Google to dictate the rules in a way that sacrifices the interests of publishers and creators,” she said.