thomblog

Tom 'voxel' Purnell's notes

bloggpt, a rant

It is not news to anyone that the web is filled to the eyeballs with shit. This may always have been the case, but never before was the effluent so endless in quantity and so miserably poor in quality. Image diffusers and Large Language Models - so-called ‘AI’ - have created a vast new source of meaningless fecal content to pollute the information streams. Every social media site is filled with images and text generated in exchange for not inconsiderable amounts of electricity - all to try and farm enough traffic to generate advertiser revenue. So pervasive is the rot that even my closest friends use these AI ‘tools’, and at least one family member goes to great lengths to try and discover ways of monetising them. In my immense privilege, I use a moderated, private mastodon server for most of my social media interaction. I see little AI content unless I go looking for it - besides a few petty fraudsters trying to pass off generated art as their own, using it to solicit donations from the unwary. But elsewhere - tumblr, instagram, search results - I see more AI content than human.

A small grace is that as the web becomes increasingly polluted with machine-generated feculence, the source from which the AIs are trained becomes increasingly corrupt. In the 1990s, beef farmers in the UK recycled some of the less desirable cow parts back into the food supply for their herds. This cannibalism resulted in ‘mad cow disease’ infecting the food supply and potentially the people who ate it. Likewise with AI: as the corpus on which the models feed fills with AI content, their ability to generate coherent content rapidly degrades. Image diffusers either generate garbage or converge on specific styles regardless of user instructions. LLMs generate poorer-quality text filled with ‘tells’ - words increasingly common in AI-generated content, like ‘delve’ and ‘hitch’. ‘Pre-AI’ datasets, consisting mostly of human-generated content free of generative taint, are valuable commodities.

Why am I talking about this here? Blogs are one of the continuing sources of longer-form human generated content. Clearly there are blogs written by AI, or perhaps co-written, but there are still a good number of people continuing to blog their thoughts, notes, gamedev environment configuration instructions and other important human generated text. My own humble blog is free of AI content, and while it may not be the most important writing humanity has produced, it’s infinitely more valuable than anything ever output from chatgpt.

A cursory glance at the website access logs for any public blog will show that automated scrapers account for much (if not most) of the traffic any given blog receives. ‘Bots’ have always crawled sites - you wouldn’t have much in the way of search engines without them. But the search engine ‘web spiders’ were primarily indexing the content of websites to see whether it was relevant to web search results. The current generation of crawlers want only to copy every image and sentence on a website into a training set. Whether that is to sell to model trainers or to build their own AI, the result is the same: your words are taken and put into a giant statistics table that someone hopes to sell access to, burning kilowatts of electricity in the process. These crawlers also seem to ignore requests not to index your site via the venerable ‘robots.txt’ standard, which was at least somewhat respected by the web spiders of yore.
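For what little good it does, a robots.txt can at least put the refusal on record. A minimal sketch - the user-agent strings here (GPTBot, CCBot, Google-Extended) are ones commonly associated with AI training crawlers, and as noted above, any crawler is free to ignore the file entirely:

```
# robots.txt - a polite request that AI crawlers may well ignore
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

The file lives at the web root (e.g. /robots.txt); well-behaved crawlers fetch it before crawling anything else.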

Over on mastodon, there’s a culture of providing comprehensive alt-text: written descriptions of any images or non-text media uploaded to the platform. This helps improve accessibility by providing a non-graphical alternative for blind users, or potentially a low-bandwidth option for users on connections too slow or unstable for transferring photos of your cat. Image diffuser training relies on exactly these kinds of descriptions. The more accurate, detailed and explicit the text description of an associated image, the more valuable it is for the purposes of training an AI to generate fake versions of those visual subjects.

To use the web over the last 20 years was to feed the advertising algorithms. Every link clicked, every word typed, every momentary hesitation while scrolling past a photo - every recordable interaction was fed into a program designed to sell you to advertisers. Now every contribution you make online is potentially fed into a vast hellish data lasagne to be fed to a coal-burning demon AI. This is the new cost of interacting with online communities, and you can’t avoid it. I won’t stop blogging because I’m afraid an AI will give inaccurate instructions on configuring a raspberry pi as a dreamcast development machine. I won’t make my mastodon photos inaccessible to a blind person just to spite the seller of a dataset.

Inconveniencing real people for a small chance at causing a rounding-error of difference to an AI isn’t worth the cost to me.