On chatbots…

In another lifetime I used to write a lot about education technology, the field I used to work in (for certain values of the word “work”). If you have this RSS feed in your reader (hell, if you know what RSS is and have an RSS reader) you might remember me doing a lot of stuff like this that falls in the gap between journalism and thinking about stuff. This is me thinking about stuff.

You’ve probably got one in your pocket.

Open your phone, type these words into a text message and press the middle option. Predictive text just guesses the next word rather than the next phrase, but the principles are similar to the way that the current wave of “generative AI” works.
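If you want to see how little machinery next-word guessing needs, here is a minimal sketch in Python. It is not how any real phone keyboard works (those are far more elaborate); it simply counts which word has followed which in a tiny made-up corpus. The names `train_bigrams`, `suggest` and `keep_pressing`, and the corpus itself, are my own inventions for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words have followed it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def suggest(following, word, k=3):
    """Return up to k of the most frequent next words seen after `word`."""
    counts = following.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]


def keep_pressing(following, start, presses=8):
    """Repeatedly accept the top suggestion, like mashing one button.

    Assumption: we always take the single most frequent next word, which
    is a simplification of whatever a real keyboard offers.
    """
    sentence = [start]
    for _ in range(presses):
        options = suggest(following, sentence[-1], k=1)
        if not options:
            break  # nothing has ever followed this word in the corpus
        sentence.append(options[0])
    return " ".join(sentence)


# A toy corpus, invented for this example.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)

print(suggest(model, "the"))
print(keep_pressing(model, "the"))
```

Each suggestion is locally plausible, but nothing in the code has any view on whether the whole sentence makes sense, which is why repeated presses tend to go round in circles.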

There’s a thread that goes back to the dawn of computing – Colossus (at Bletchley Park) was pretty great at guessing the next value in a series of encrypted messages despite having less processing power than your oven (albeit the two probably have similar heating power).

What’s next?

Pattern matching is something that computers do really well. Given a large amount of data, it is trivial to come up with a statistically likely response to any given stimulus. It’s also not that hard to frame this response as a part of a conversation – Joseph Weizenbaum’s Eliza used a simple set of rules to impersonate a psychotherapist back in the 1960s. And if you are impressed with the ability of a computer to parse natural language, may I direct you to Ask Jeeves?
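To give a flavour of how far a “simple set of rules” gets you, here is a rough Eliza-style sketch in Python. To be clear, these are not Weizenbaum’s actual rules: the patterns, the canned replies and the names `RULES`, `FALLBACK` and `respond` are all assumptions of mine, chosen only to show the trick of reflecting the user’s own words back at them.

```python
import re

# Hypothetical rules: (pattern to look for, reply template).
# The captured text is dropped into the reply unchanged.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]

# Used when no rule matches, so the conversation never stalls.
FALLBACK = "Please go on."


def respond(message):
    """Return the reply for the first matching rule, else the fallback."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK


print(respond("I feel tired of marking."))
print(respond("Hello"))
```

There is no understanding anywhere in this: the program never knows what “tired” means, it just knows where to paste it.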

Again, the more data and the more processing power you can throw at stuff like this, the better the response is going to get. Having, as a planetary culture, spent the last 25 years creating a machine-readable repository of text and other creative outputs – and having followed Moore’s law to the point of physical constraint – the conditions for tools that can use data to return a statistically likely human-like response are better than they have ever been.

As with much in life, you have to keep thinking about the data rather than the output. Emily Bender’s famous characterisation of the most recent wave of tools as “stochastic parrots”:

haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning

reminds the reader that any semblance of insight and intelligence you see in a ChatGPT response comes from you. A chatbot doesn’t care about being right – it is designed to give you an output that looks plausible.

Norwegian blue

Our predictive text example is of a system that is designed to come up with a plausible next word. If you kept pressing the middle button your sentence would have drifted away into nonsense – the system is not designed to come up with a plausible next phrase or sentence. Chatbots are designed to come up with a plausible response overall – for example, when you ask one for “250 words” it will give you around 250 words because most successful responses to prompts that include the phrase “250 words” do that. If you ask it for a working computer programme it will do that, because that is the most likely response to a request for a working computer programme.

What if you asked it for a first class university assignment? Or gave it the rubric and asked for something that meets every criterion?

It’s here that we have to take pause and think about data again. What dataset could it plausibly have to help it do this? I don’t know for sure, but my guess is there are a lot of undergraduate essays and undergraduate style essays online (and even more in certain commercial databases that the sector has been quietly feeding for more than a decade), and some of them may have grades attached. So we can assume it will give you an assignment that is statistically similar to ones deemed to be “first class”.

Exam board

“Plausible” and “similar” are impressive, but they are a long way from “good”. We don’t really have enough data to know whether systems like this are able to reliably write good essays – we know that they have written some good ones and some terrible ones, and we know that academics (and commentators) have made their own assessments of these.

But the point is that it is fairly easy for a computer to test if it has written 250 words – it can count them (computers are great at counting). It is also fairly easy for it to test whether it has written a working computer programme (computers excel at running computer programmes).
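Both checks really are only a few lines each. The sketch below is mine rather than anything a chatbot vendor has published: the function names, the ten per cent tolerance and the five second timeout are all assumptions, and “working” here means nothing more ambitious than “runs and exits without an error”.

```python
import subprocess
import sys


def within_word_target(text, target=250, tolerance=0.1):
    """True if the word count is within `tolerance` of `target`.

    Assumption: a "word" is anything separated by whitespace.
    """
    count = len(text.split())
    return abs(count - target) <= target * tolerance


def program_runs(source, timeout=5):
    """Run Python source in a child interpreter; True if it exits cleanly.

    This only checks that the code runs, not that it does anything useful.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


print(within_word_target("word " * 250))
print(program_runs("print(1 + 1)"))
```

Running code you did not write is risky outside a sandbox, so treat `program_runs` as an illustration of how mechanical the test is rather than something to point at untrusted output.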

Despite arguments to the contrary, it is not easy for a computer to grade an essay in anything more than a very functional rubric-informed way. If your rubric says a first class answer must quote critically from sources your chatbot will do that – but as we have found it doesn’t really care if those quotations or sources are accurate or meaningful. It can judge if a response is under the word limit, or if it deals with the issues it is supposed to deal with. It can’t judge if the argument is well made or if the writing is convincing and original because it doesn’t know what those things are.

We could of course give it rules covering how to do that. These are questions that linguists and philosophers have grappled with for centuries – but should they ever finish this work (spoiler: they won’t) that would be an option.

Group test

So let’s look at a competing technology – academics. Academics may not be great at counting, but they have computers to do that for them. They’re not always fantastic at storing and retrieving information, but they do have access to libraries and search engines. But they are really, really good at this intangible and difficult to describe work that computers struggle with. Academics are spectacularly good, for example, at identifying bullshit.

And the current generation of large language model chatbots is designed to produce bullshit (and does an excellent job, kind of like a digital Boris Johnson). If you’ve read James Ball and Andrew Greenway’s Bluffocracy you are probably already thinking about Oxford’s PPE course – but, contrary to that fairly cynical reading, you need a lot more than just plausible nonsense to get through a PPE tutorial.

For some this is a dark pattern in academic life – the Potter Stewart formulation (“I know it when I see it”) is, for them, a castigation of everything that is arbitrary about the judgement of the professoriate. You can’t take an “academic judgement” case to the Office of the Independent Adjudicator precisely because there is no way of testing it. There’s no real way of distinguishing between an essay with a mark of 57 per cent and one of 56 per cent – this is why we use multiple markers and moderation boards.

If you think, in other words, that ChatGPT poses a clear and present danger for the validity and sanctity of the higher education process there is both good and bad news for you. The way that higher education works means that the risk is not as great as you think, but the sausage machine is a very messy (but effective) way of making sausages.

A matter of semantics

I want to close my argument by thinking about the other major strand of artificial intelligence – an associative model that starts (in the modern era) with Vannevar Bush and ends with, well, Google search. The idea of a self-generating set of semantic links – enabling a machine to understand how concepts interrelate – is probably closer to the popular idea of artificial intelligence than toys like ChatGPT.

Research in this field continues, but at a much lower ebb than the historic boom of the 1960s and 1970s. In UK higher education the 1973 Lighthill Report led to the dismantling of work like this (and the linked, but even more far-fetched, approach of directly modelling the way human brains work) on the grounds of a vanishingly low likelihood of success. It led (along with similar decisions in the US and elsewhere) to what has been described as an AI Winter – where much of the field, and the fanciful projections that drove interest in it, were abandoned.

Clearly, the work continued (hence, frankly, Google – and who remembers the 00s moral panic about higher education in a world of instant information retrieval?) but it was focused on solving problems rather than the general goal of a recognisable artificial intelligence. For me (and I’m a serial technology pessimist) the longer term benefits of chatbots and related technologies will be focused on optimal discrete use cases in a similar way.