Catchwords, Intentions and Embeddings

[Image: Abstract patterns on a tree stump]

by Ayan-Yue Gupta

In the previous post (Frequency Matters), Suman raised an important question – what kind of method can be used to understand why catchwords spread? Tracking frequencies over time can identify which terms are catchwords, but not why those terms become catchwords.

A not very informative answer would be: a term becomes a catchword because lots of people decide to start using it in a short period of time. Uninformative as this is, it draws attention to the fact that understanding why terms become catchwords is a matter of understanding the intentions behind a decision to use a term. So, what kind of method can be used to extract the intentions underlying a decision to use a word?

We make inferences about the intentions underlying particular language uses all the time – in fact, expressions are only meaningful because they enable inferences to be made about intentions. Thus, there must be enough information in the speech act within which some term is used to infer the intentions underlying the decision to use the term. Recognizing that A uttering ‘Snake!’ to B while pointing to a venomous snake nearby is a warning is enough for us to figure out that the intention underlying the utterance is to draw attention to danger.

Let’s suppose that based on a frequency analysis we found out that ‘snake’ is a catchword because its rate of use doubled amongst a population within 10 years. If we were able to extract from the population every single speech act containing each use of ‘snake’ within those 10 years and all contextual information needed to correctly interpret those speech acts, we would have before us all the information needed to understand why ‘snake’ became a catchword. But now we have an information overload problem. How can one work through and make generalisations about the millions of speech acts of those 10 years in a reasonable amount of time?

Natural Language Processing (NLP) methods, in particular those based on embedding sequences of text, offer a solution. Embedding methods represent text sequences as vectors, where a vector is a list of numbers that encodes information about whatever the vector represents. Think of coordinates – a coordinate is a list of two numbers, e.g. (3, 2), that encodes information about how to arrive at a location (move 3 steps to your right and 2 steps in front of you to reach the location represented by (3, 2)).

It turns out that one can use artificial neural nets to represent text sequences as vectors. These vectors are called ‘embeddings’. These neural nets are designed in such a way that the embeddings they produce encode the linguistic features distinctive of the text sequences they represent, analogous to how (3, 2) encodes the direction and magnitude distinctive of the path to the location it represents. How neural nets do this is unimportant for this post’s purpose. All that matters is that an embedding is a vector representation of a text sequence that encodes the linguistic features distinctive of the sequence.
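As a concrete illustration, here is a minimal sketch of how one might produce such embeddings in Python. It assumes the sentence-transformers library and its pretrained all-MiniLM-L6-v2 model are available; the sample texts are hypothetical:

```python
# Minimal sketch: turning text sequences into embeddings.
# Assumes the sentence-transformers library (https://www.sbert.net)
# and the pretrained all-MiniLM-L6-v2 model.
from sentence_transformers import SentenceTransformer

# Hypothetical sample of 'snake' speech acts in written form.
speech_acts = [
    "Snake! Watch your step on the trail.",
    "Researchers discover a new species of venomous snake.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(speech_acts)  # one vector per text sequence

print(embeddings.shape)  # e.g. (2, 384): two texts, 384 numbers each
```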

Returning to the issue of information overload: we have retrieved all speech acts in which ‘snake’ was used within the 10 years in which it became a catchword. We can then represent each instance of ‘snake’ as an embedding by inputting the text of all those speech acts into some neural net. Since these embeddings encode linguistic information, the mathematical distances between the embeddings (measured by, e.g., Euclidean distance or cosine similarity) will capture the linguistic similarities and differences between the instances of ‘snake’.
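Both measures are simple to compute. A minimal sketch using NumPy, with the coordinate analogy from above standing in for real embeddings:

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two vectors: lower means more similar.
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1 means more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Using the coordinate analogy from above in place of real embeddings:
a, b = np.array([3.0, 2.0]), np.array([3.0, 1.0])
print(euclidean_distance(a, b))  # 1.0
print(cosine_similarity(a, b))   # ~0.965
```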

So, one can cluster these embeddings into groups, where embeddings are sorted into the same cluster if, by Euclidean distance or cosine similarity, they are more similar to each other than they are to other embeddings. For example, among the coordinates (3, 2), (3, 1) and (5, 4), the coordinates (3, 2) and (3, 1) have the lowest Euclidean distance from each other, so they would be sorted into one cluster rather than forming a cluster of (3, 2) and (5, 4) or a cluster of (3, 1) and (5, 4). What we end up with after this process is a series of clusters of ‘snake’ speech acts, where each cluster captures a distinctive way of using ‘snake’. One cluster might capture warning uses; another might capture uses discussing a newly discovered snake. Instead of having to read through millions of ‘snake’ speech acts to make generalisations about the main intentions underlying the millions of decisions to use ‘snake’ in those 10 years, we need only identify the distinctive uses captured by each cluster.
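A minimal sketch of the clustering step, using k-means from scikit-learn on the three coordinates from the example. The choice of k-means (rather than some other clustering algorithm) is an assumption for illustration; in practice one would fit the same model on the embeddings themselves:

```python
import numpy as np
from sklearn.cluster import KMeans

# The three coordinates from the example above.
points = np.array([[3, 2], [3, 1], [5, 4]])

# Ask for two clusters; by Euclidean distance, (3, 2) and (3, 1)
# end up together and (5, 4) ends up on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 1]: the first two points share a cluster
```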

How does one extract speech acts? Unfortunately, since everything has to be written down to be fed into a neural net, the closest one can come to extracting a whole speech act is extracting the text sequences used within a speech act – paragraphs, sentences, etc. One could perhaps create a standardised method of translating the unspoken gestures involved in speech acts into written form (as is done in conversation analysis), but this is too labour intensive for very large quantities of speech acts. In many cases this isn’t an issue, since the original form of many speech acts is written (think of the speech acts of novels, newspapers, tweets, etc.). One usually wouldn’t bother with non-written speech acts anyway, since recording and automatically transcribing speech at scale is too much work. But it is a limitation to bear in mind.

Can one really extract the written speech acts of a whole population? No. In practice one can only extract samples of the speech acts of very specific subsections of a population, e.g., the user base of a social media network, the content of newspapers, or government publications. Data availability is a key limitation on what kinds of intentions can be understood through embedding-based methods.

Identifying the main uses captured by clustered embeddings is itself a complicated process, since each cluster will contain a large quantity of catchword uses. Some manual inspection of clusters will be necessary, but this can be aided by automatic methods of inspection. For example, one can automatically extract the most distinctive terms of a cluster using measures of distinctiveness, such as term frequency–inverse document frequency (TF-IDF). These distinctive terms then give a representative impression of the main uses captured by a cluster in a compact, concise manner. So, the most distinctive terms of one cluster of ‘snake’ speech acts might be ‘new species’, ‘discover’, ‘venomous’ and ‘novel’, and this will tell you that the cluster is concerned with the discovery of a new species of venomous snake.
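A minimal sketch of this inspection step, using scikit-learn’s TfidfVectorizer. The two cluster documents here are hypothetical stand-ins, each formed by joining one cluster’s texts together:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cluster documents: one per cluster, formed by
# joining that cluster's 'snake' speech acts into a single text.
cluster_docs = [
    "snake watch out danger behind you careful snake",
    "new species of venomous snake discovered researchers describe novel snake",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(cluster_docs)
terms = np.array(vectorizer.get_feature_names_out())

# Print the highest-scoring (most distinctive) terms per cluster.
for i in range(tfidf.shape[0]):
    scores = tfidf[i].toarray().ravel()
    top = terms[scores.argsort()[::-1][:4]]
    print(f"cluster {i}: {', '.join(top)}")
```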


Ayan-Yue Gupta is a researcher in the area of computational sociology, a graduate teacher at the University of Bristol, and an artist (MA Fine Art, Slade, UCL). His recently completed Ph.D. is entitled BERT for Discourse Analysis: A Pragmatist Approach to Governmentality (University of Bristol, 2023).

Maksim Sokolov (Maxergon), CC BY-SA 4.0, via Wikimedia Commons