In this article I will attempt to simplify a number of complicated concepts related to Artificial Intelligence (AI) language models, with a specific focus on GPT-2.
Those of you who’ve been following recent advances in the field of natural language processing might be thinking “but we have GPT-3 now”. This is true, and while I admit I quite like large models, GPT-3 is just too big for regular people to work with.
If you’ve been tracking the evolution of AI, you’ll no doubt have heard about the controversy, hype and even danger surrounding OpenAI’s GPT-2 model. However, if you have been hiding under a rock, here’s a quick overview:
According to technology website The Register, OpenAI’s massive text-generating language model, which was whispered to be too dangerous to release, has finally been published in full after the research lab concluded it has “seen no strong evidence of misuse so far.”
I love that last bit… it has “seen no strong evidence of misuse so far”.
You see the issue here? To quote Carl Sagan, “absence of evidence is not evidence of absence”.
It’s more likely a testament to how well this model performs, as well as a reliable indicator that GPT-2 is being used more than an IT geek’s keyboard in many of the super-competitive search markets including, but not limited to, online gambling, pharma and adult entertainment, not to mention GPT-2’s notable adoption in computational propaganda. (Note to Buzzfeed’s data scientist: @minimaxir has a great GitHub repository for anyone who wants to play along at home.)
While GPT-2 models are large, they are still manageable and provide a practical method to produce programmatically generated casino reviews. However, some of the larger GPT-2 models proved impractical given my available computer resources.
Stay with me
Before your eyes glaze over, I’m not even going to attempt to explain how GPT-2 works, only that it does work – very well. If you’re considering using GPT-2 to write your casino reviews, here’s what I learned along the way.
My goal was to automatically produce coherent text for 883 casinos, capable of ranking in Google without being identified as duplicate content.
There were three distinct steps in achieving this goal: First, collecting training data (scraping). Second, training/tuning a language model. Third, producing the text (decoding). There’s also a fourth step which I’ll be covering in more detail in the next issue of iGB Affiliate.
Before diving into this let’s briefly familiarise ourselves with some jargon.
● NATURAL LANGUAGE PROCESSING (NLP) TASKS: These are tasks that have something to do with human languages, for example language translation, text classification (e.g. sentiment extraction), reading comprehension and named-entity recognition (e.g. recognising person, location and company names in text).
● LANGUAGE MODELS: These are models that can predict the most likely next words (and their probabilities) given a set of words – think Google auto-complete. It turns out that these types of models are useful for a host of other tasks, even though they are trained only on mundane next-word prediction.
● TRANSFORMER MODELS: These belong to the deep learning family of NLP models and form the basic building block of most of the state-of-the-art NLP architectures. They are replacing recurrent neural networks (RNN) and long short-term memory (LSTM) models due to their performance and speed of training.
● TOKENISATION: This is a common task in NLP. Tokens are the unit items or pieces which make up natural language. Tokenisation is a way of breaking down a sentence, paragraph or document into smaller units called tokens. Tokens can be words, characters or subwords.
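To make the three kinds of token concrete, here is a minimal sketch in Python. The function names and the tiny subword vocabulary are made up for illustration, and the greedy longest-match split is only a stand-in for the way real subword tokenisers work.

```python
import re

def word_tokens(text):
    # Split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every character, including spaces, is its own token.
    return list(text)

def subword_tokens(word, vocab):
    # Greedy longest-match split of one word against a toy subword vocabulary.
    # Falls back to single characters when no longer piece is known.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

sentence = "Casino bonuses vary."
print(word_tokens(sentence))                         # word-level tokens
print(char_tokens("bonus"))                          # character-level tokens
print(subword_tokens("bonuses", {"bonus", "es"}))    # subword tokens
```

The same text produces a handful of word tokens, many more character tokens, and something in between for subwords.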
After starting out by experimenting with recurrent neural networks to solve the problem I immediately ran into trouble. The problem was in the tokenisation methods.
The RNN models I found for the project came in two flavours, word-level and character-level.
Word-level models predict the next word in a sequence of words. Character-level models predict the next character in a sequence of characters. Each of these approaches comes with some important trade-offs which led me to a dead end.
Keep in mind that computers have no notion of the meaning of a word; the word is represented by numbers known as a word vector or word embedding.
The word-level approach selects the next word from a dictionary, an approach that typically generates more coherent text but at the cost of frequently stumbling into ‘out-of-vocabulary’ words, which appear in the generated text as <unk> tokens (an abbreviation of “unknown”).
Other word-level showstoppers included grammar, especially capitalisation, since the model has no concept of capitalising the first word in a sentence or proper nouns.
Character-level models solve many of the word-level problems, such as out-of-vocabulary words and correct use of capitalisation, simply by treating each character as a unique token, with the vocabulary comprising all possible alphanumeric characters.
The downside of character-level models is that the generated text is much less coherent and can often get stuck in repetitious loops.
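The trade-off can be seen in a few lines of Python. This is only a sketch: the vocabulary and sentence are invented, and the placeholder is written here as <unk>, one common convention for unknown words.

```python
def encode_word_level(text, vocab):
    # Words missing from the vocabulary collapse into a single <unk> token.
    return [w if w in vocab else "<unk>" for w in text.split()]

def encode_char_level(text):
    # No word can be out of vocabulary, but sequences become much longer.
    return list(text)

vocab = {"the", "casino", "offers", "a", "bonus"}
text = "the casino offers a cashback bonus"

word_ids = encode_word_level(text, vocab)
char_ids = encode_char_level(text)
print(word_ids)                       # "cashback" is replaced by <unk>
print(len(word_ids), len(char_ids))   # far fewer word tokens than character tokens
```

The word-level encoding is short but loses "cashback" entirely, while the character-level encoding keeps everything at the price of a far longer sequence for the model to keep coherent.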
Among other innovations, GPT-2 uses a clever trick to solve the out-of-vocabulary and capitalisation problems which make word-level models unusable. It does this by adopting a middle-ground approach called byte pair encoding (BPE).
This approach starts with a dictionary of single characters (bytes, in GPT-2’s case) and builds it up by repeatedly merging the most frequent pair of adjacent tokens into a new, longer token. The resulting subword tokens are “predicted” by the decoder based on the preceding sequence of tokens.
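A common way to build a BPE vocabulary is to start from single characters and repeatedly merge the most frequent adjacent pair into a new token. The sketch below shows that idea on a toy word list. It works on characters rather than bytes and is not GPT-2's actual tokeniser; the function name and example words are invented for illustration.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Start with each word as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], 2)
print(merges)        # the pairs that were merged, in order
print(dict(corpus))  # words re-expressed with the learned subword tokens
```

After two merges the frequent fragment "low" has become a single token, while the rarer endings are still spelled out character by character, so no word is ever out of vocabulary.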
What is a language model?
Now we know what a token is, we have a better understanding of the notion that a language model predicts the next token in a sequence of tokens and iterates over itself to produce fully formed sentences and even paragraphs.
Okay, this is an oversimplification, but you get the idea. The GPT family of models takes an input (a word, sentence or partial sentence) and a number to indicate how many tokens to return.
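The predict-then-repeat loop can be sketched with a far simpler model than GPT-2. The example below uses a toy bigram model that only looks at the previous word; the corpus and function names are invented, and a real transformer conditions on the whole preceding sequence rather than a single token.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # For each token, count which tokens follow it in the training text.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, num_tokens):
    # Takes a non-empty prompt and the number of tokens to return,
    # appending the most likely next token one step at a time.
    out = list(prompt)
    for _ in range(num_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in training
        out.append(followers.most_common(1)[0][0])
    return out

corpus = "the casino pays out fast and the casino pays well".split()
model = train_bigram(corpus)
print(" ".join(generate(model, ["the"], 2)))  # the casino pays
```

Each generated token is fed back in as context for the next prediction, which is the iteration described above.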
Transformer models are large but keep in mind “the law of accelerating returns”. Here, American futurist Ray Kurzweil famously notes that the rate of change in a wide variety of evolutionary systems, including, but not limited to, the growth of technologies, tends to increase exponentially.
The largest GPT-3 model is more than 100 times larger than the largest GPT-2 model, and while it currently doesn’t fit on a single computer, it can be decoded on clusters. Text produced by the largest available GPT-3 model is mostly indistinguishable from human-written text.
In a recent blind study of GPT-3, readers correctly identified only 52% of example texts as AI-generated, marginally better than a coin flip.
I predict we’re only three years away from regular business users being able to generate content using AI which is entirely indistinguishable from human-generated content.
How language models will change your life as an SEO
As we’ve seen, a language model is probabilistic, with the next token in a sequence of tokens selected based on probability.
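A minimal sketch of probabilistic selection is shown below. The tokens and scores are invented, and the temperature parameter is an assumption on my part: it is a commonly used knob for making the choice more or less adventurous, not something specific to the setup described in this article.

```python
import math
import random

def softmax(scores, temperature=1.0):
    # Turn raw scores into probabilities that sum to 1.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(tokens, scores, temperature=1.0, rng=random):
    # Pick one token at random, weighted by its probability.
    probs = softmax(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["bonus", "jackpot", "withdrawal"]
scores = [2.0, 1.0, 0.1]  # hypothetical model scores for the next token
print(softmax(scores))
print([sample_next(tokens, scores) for _ in range(5)])
```

Because the next token is drawn at random rather than always being the single most likely one, two runs with the same prompt can produce different text.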
The model is also capable of generating fully formed HTML or Markdown. What’s more, by training/tuning your model using scraped content from the dominant casino affiliate in the space, it’s possible to use some simple pre-processing to learn casino reviews including the internal and external link structures.
Yes, you read that right… no more guessing what the optimal cross-linking strategy looks like, simply train the GPT-2 model to learn where to put the links.
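One way such pre-processing might look is sketched below. It is an assumption rather than the exact pipeline used here: links in the scraped HTML are rewritten as Markdown before the remaining tags are stripped, so that link text and targets survive into the training data. The function names are hypothetical, and the regular expression only handles simple, well-formed anchor tags.

```python
import re

# Matches simple anchors such as <a href="/bonuses">bonus guide</a>.
ANCHOR = re.compile(r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)

def links_to_markdown(html):
    # Rewrite each anchor as [link text](url) so the link survives as plain text.
    return ANCHOR.sub(lambda m: f"[{m.group(2).strip()}]({m.group(1)})", html)

def strip_tags(text):
    # Drop any remaining HTML tags, keeping the text between them.
    return re.sub(r"<[^>]+>", "", text)

scraped = '<p>Read our <a href="/bonuses">bonus guide</a> first.</p>'
print(strip_tags(links_to_markdown(scraped)))
# Read our [bonus guide](/bonuses) first.
```

A model tuned on text in this form sees links as ordinary tokens and can learn where they tend to appear in a review.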
Practical tips for outputting articles
The decoder algorithm has what computer scientists refer to as quadratic complexity, or O(n^2), which means that doubling the length of the output quadruples the time and processing required. Quadrupling the length makes it take 16 times as long to output.
In other words, don’t produce a single multi-paragraph article. Do produce multiple paragraphs and link them into a single article. This was something I started to notice when I first began testing the next larger model.
Producing reviews took forever and the text produced would often be truncated, with the article finishing mid-sentence. It’s also important to know that the time it took to produce a full casino review, even on a 32-core Xeon server, was not practical for my purposes.
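The arithmetic behind the advice to generate several short pieces can be sketched as follows. This is a deliberately simplified cost model that assumes cost grows exactly with the square of the output length and ignores fixed overheads; the function names and token counts are invented for illustration.

```python
def relative_cost(num_tokens):
    # Quadratic model: cost grows with the square of the output length.
    return num_tokens ** 2

def chunked_cost(total_tokens, num_chunks):
    # Cost of producing the same total length as several independent pieces.
    chunk = total_tokens // num_chunks
    return num_chunks * relative_cost(chunk)

print(relative_cost(200) / relative_cost(100))  # 4.0  (double the length)
print(relative_cost(400) / relative_cost(100))  # 16.0 (quadruple the length)

# One 800-token article versus four 200-token paragraphs:
print(relative_cost(800) / chunked_cost(800, 4))  # 4.0
```

Under this model, splitting an article into four paragraphs cuts the work to a quarter, and shorter pieces are also less likely to hit a length limit mid-sentence.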
I will be covering the fourth practical step in using GPT-2 to write casino reviews – data processing – in the next issue of iGB Affiliate.
Paul Reilly is a technology enthusiast, speaker and AI engineer. Following an SEO career which spanned two decades, Paul turned his attention to the practical uses of artificial intelligence, leading him to regularly drop in on the University AI research team while exploring new ways to make a splash as a casino affiliate. Paul is the founder of flashbitch.com, a largely AI-generated casino reviews website.