Topic Modelling: Discovery or Generation?
Introduction
Topic models such as Latent Dirichlet Allocation are built on a simple but powerful idea:
documents are generated from a mixture of latent topics
- Each document has a mixture of topics
- Each topic has a distribution over words
- Words in the document are generated from these topics
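This generative story can be made concrete with a toy sketch. The topics, vocabulary, and mixture below are invented for illustration; real LDA draws the topic and word distributions from Dirichlet priors rather than fixing them by hand.

```python
import random

# Hypothetical topics: each is a distribution over a small vocabulary.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"market": 0.4, "stock": 0.4, "price": 0.2},
}

def generate_document(topic_mixture, n_words, seed=0):
    """Generate words by first sampling a topic, then a word from it."""
    rng = random.Random(seed)
    words = []
    topic_names = list(topic_mixture)
    topic_probs = [topic_mixture[t] for t in topic_names]
    for _ in range(n_words):
        # Step 1: sample a topic from the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_probs)[0]
        # Step 2: sample a word from that topic's word distribution.
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# A document that is 70% "sports" and 30% "finance".
doc = generate_document({"sports": 0.7, "finance": 0.3}, n_words=8)
print(doc)
```

Inference in LDA runs this story in reverse: given only the words, it estimates which mixtures and topic-word distributions most plausibly generated them.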
Of course, real-world text is not literally produced by sampling topics and then words. The generative story is best understood as a modelling abstraction rather than a literal description of how humans write.
But this raises an interesting question:
How realistic is this assumption for different types of text?
Modern approaches such as BERTopic do not rely on a generative assumption. Instead, they treat topic discovery as a clustering problem in semantic embedding space.
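The clustering view can also be sketched in a few lines. Pipelines like BERTopic combine transformer embeddings with dimensionality reduction (UMAP) and density-based clustering (HDBSCAN); the toy 2-D vectors and plain k-means below stand in for those components purely to show the shape of the approach, with no generative assumption anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document embeddings: two loose groups in 2-D space
# (stand-ins for high-dimensional transformer embeddings).
embeddings = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(5, 2)),  # e.g. sports docs
    rng.normal(loc=[1.0, 1.0], scale=0.1, size=(5, 2)),  # e.g. finance docs
])

def kmeans(points, k, n_iter=20):
    """Plain k-means: topics emerge as geometric groupings, not as causes."""
    # Seed centroids with one point from each end of the dataset.
    centroids = points[[0, -1]].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans(embeddings, k=2)
print(labels)
```

A "topic" here is just a cluster label assigned after the fact; interpreting it (e.g. by listing the most characteristic words of each cluster) is a separate, post-hoc step.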
Empirical comparisons of generative and clustering-based topic models across datasets and domains yield mixed results: sometimes generative models perform better, sometimes clustering-based methods do.
One possible explanation is that the effectiveness of a method depends heavily on its underlying modelling assumptions, and whether those assumptions align with how a particular type of text is actually produced.
Generative assumptions and discourses
In structured or semi-structured discourse, such as academic papers, policy reports, or conference presentations, authors typically have a clear topic in mind and organise their content around themes.
For example, academic papers often begin with a title, which signals the intended scope of the paper. In such contexts, it is quite plausible that topics play a causal role in shaping the text. The generative assumption (documents arise from a mixture of topics) may therefore be a reasonable approximation.
However, in unprompted and unstructured discourse, such as consumer reviews, interviews, or most social media posts, text often emerges from a mix of experiences, emotions, and fragmented thoughts.
In these cases, language production may look more like:
experience → emotion → expression
rather than:
topic → words → document
Here, topics may not be the underlying causes of text generation. Instead, they may function as post-hoc abstractions that summarise recurring semantic patterns across documents.
This suggests a possible hypothesis:
Generative topic models may work particularly well for structured discourse, while clustering-based approaches may be better suited for highly unstructured or experiential text.
This hypothesis remains an open empirical question and likely depends on the dataset and analytical goals.
Reflections
Perhaps the key question is not whether the generative assumption is correct, but when it is useful.
Topic modelling may be better understood not as a single technique, but as a family of methods for discovering structure in text, each built on different assumptions about how language is produced.
