Why em dashes are common in LLM outputs
Research by Maria Sukhareva
In *Let's talk about em dashes*, Maria Sukhareva describes the two reasons we commonly see em dashes in LLM writing:
- The training data:

  "...they are common in polished fiction, legal writing, and older scientific articles. It is therefore plausible that GPT‑4 models simply saw more em-dash‑heavy book data during pretraining than GPT‑3.5 did."

- Reinforcement learning from human feedback (RLHF):

  "In the tokenizer used by GPT‑4 the sequence “ —” (leading space + em dash) is one token, whereas a comma plus “and” or a semicolon usually costs two or three tokens. Fewer tokens means cheaper inference and lower training loss per token, therefore, higher reward during RLHF."

  (A quick way to check that token-count claim is sketched below.)
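The token-count part of the RLHF argument can be checked informally with OpenAI's tiktoken library. The sketch below is illustrative and not from Sukhareva's article: it assumes tiktoken is installed and uses the cl100k_base encoding (the one associated with GPT‑4-era models) to count how many tokens a leading space plus an em dash costs compared with a comma followed by "and" and a semicolon.

```python
# Informal check of the tokenization claim, assuming `pip install tiktoken`.
# cl100k_base is the encoding associated with GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

candidates = {
    "space + em dash": " —",
    "comma + ' and'": ", and",
    "semicolon + space": "; ",
}

for label, text in candidates.items():
    token_ids = enc.encode(text)
    print(f"{label:20} -> {len(token_ids)} token(s): {token_ids}")
```

The exact counts depend on which encoding you load, but running this makes it easy to see whether the em-dash sequence really is cheaper under a given tokenizer.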
As Maria states, all LLMs use em dashes, not just GPT‑4. That could be because these models are trained on much the same corpora, but it could also come from training on newer data.
Does an LLM eat itself, in a way? The internet is flooded with AI-generated text, and all models will have to be retrained at some point to keep up with changes in language.