Topic Modelling: Discovery or Generation?
Introduction
Topic models such as Latent Dirichlet Allocation are built on a simple but powerful idea:
documents are generated from a mixture of latent topics
- Each document has a mixture of topics
- Each topic has a distribution over words
- Words in the document are generated from these topics
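This generative story can be made concrete with a toy sketch. The topics, vocabulary, and mixture below are invented for illustration; real LDA draws the topic and word distributions from Dirichlet priors rather than fixing them by hand.

```python
import random

# Hypothetical topics: each is a distribution over a small vocabulary.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"market": 0.4, "stock": 0.4, "price": 0.2},
}

def generate_document(topic_mixture, n_words, seed=0):
    """Generate words by first sampling a topic, then a word from it."""
    rng = random.Random(seed)
    words = []
    topic_names = list(topic_mixture)
    topic_probs = [topic_mixture[t] for t in topic_names]
    for _ in range(n_words):
        # Step 1: sample a topic from the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_probs)[0]
        # Step 2: sample a word from that topic's word distribution.
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# A document that is 70% "sports" and 30% "finance".
doc = generate_document({"sports": 0.7, "finance": 0.3}, n_words=8)
print(doc)
```

Inference in LDA runs this story in reverse: given only the words, it estimates which mixtures and topic-word distributions most plausibly generated them.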
Of course, real-world text is not literally produced by sampling topics and then words. The generative story is best understood as a modelling abstraction rather than a literal description of how humans write.
But this raises an interesting question:
How realistic is this assumption for different types of text?
Modern approaches such as BERTopic do not rely on a generative assumption. Instead, they treat topic discovery as a clustering problem in semantic embedding space.
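The clustering view can also be sketched in a few lines. Pipelines like BERTopic combine transformer embeddings with dimensionality reduction (UMAP) and density-based clustering (HDBSCAN); the toy 2-D vectors and plain k-means below stand in for those components purely to show the shape of the approach, with no generative assumption anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document embeddings: two loose groups in 2-D space
# (stand-ins for high-dimensional transformer embeddings).
embeddings = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(5, 2)),  # e.g. sports docs
    rng.normal(loc=[1.0, 1.0], scale=0.1, size=(5, 2)),  # e.g. finance docs
])

def kmeans(points, k, n_iter=20):
    """Plain k-means: topics emerge as geometric groupings, not as causes."""
    # Seed centroids with one point from each end of the dataset.
    centroids = points[[0, -1]].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans(embeddings, k=2)
print(labels)
```

A "topic" here is just a cluster label assigned after the fact; interpreting it (e.g. by listing the most characteristic words of each cluster) is a separate, post-hoc step.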
Empirical comparisons of generative and clustering-based topic models across datasets and domains yield mixed results: sometimes generative models perform better, sometimes clustering-based methods do.
One possible explanation is that the effectiveness of a method depends heavily on its underlying modelling assumptions, and whether those assumptions align with how a particular type of text is actually produced.
Generative assumptions and discourses
In structured or semi-structured discourse, such as academic papers, policy reports, or conference presentations, authors typically have a clear topic in mind and organise their content around themes.
For example, academic papers often begin with a title, which signals the intended scope of the paper. In such contexts, it is quite plausible that topics play a causal role in shaping the text. The generative assumption (documents arise from a mixture of topics) may therefore be a reasonable approximation.
However, in unprompted and unstructured discourse, such as consumer reviews, interviews, or most social media posts, text often emerges from a mix of experiences, emotions, and fragmented thoughts.
In these cases, language production may look more like:
experience → emotion → expression
rather than:
topic → words → document
Here, topics may not be the underlying causes of text generation. Instead, they may function as post-hoc abstractions that summarise recurring semantic patterns across documents.
This suggests a possible hypothesis:
Generative topic models may work particularly well for structured discourse, while clustering-based approaches may be better suited for highly unstructured or experiential text.
This hypothesis remains an open empirical question and likely depends on the dataset and analytical goals.
Reflections
Perhaps the key question is not whether the generative assumption is correct, but when it is useful.
Topic modelling may be better understood not as a single technique, but as a family of methods for discovering structure in text, each built on different assumptions about how language is produced.
