Coherence in Machine-Assisted Qualitative Research
This is the first in a series of posts exploring the notion of coherence when using topic modeling methods like Latent Dirichlet Allocation (LDA) for qualitative analysis of texts. I'm trying to think about how mechanically measured coherence matters when you are trying to interpret something qualitatively from a large corpus of texts (with the assistance of machine-learning tools).
Having estimated many, many LDA models of similar corpora of texts over the last few months for one reason or the other, I have started to observe some methodologically salient patterns in coherence. So this post will amount to me trying to reflect on the tacit knowledge gained from "doing" topic modeling repeatedly. I hope there is value in this for others attempting to use methods like LDA to support qualitative analysis of texts. To illustrate my reflections, I will refer to a "vignette," comprised of R code, results, and visualizations of LDA analysis, written for pedagogical purposes, that I'm sort of working on currently, and will "really" work on this summer: https://rpubs.com/anandb/notes
Some background: LDA operates under the assumption that any corpus of documents can be described as a mixture of some number of topics or themes. So each document can be described using a vector of topic probabilities, or Pr (Topic | Document). And that is what the LDA algorithm estimates. However, the length of that vector, representing some "correct" number of topics k, is not given; it has to be chosen by the analyst.
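To make that concrete, here is a toy sketch (with made-up numbers, not real estimation) of the document-topic matrix that LDA estimates; each row is one document's Pr (Topic | Document) vector:

```python
import numpy as np

# Toy illustration: LDA describes each document as a probability
# vector over k latent topics, Pr(Topic | Document).
# Here, 3 hypothetical documents described by k = 4 topics.
theta = np.array([
    [0.70, 0.10, 0.10, 0.10],  # doc 1 is mostly "about" topic 1
    [0.05, 0.85, 0.05, 0.05],  # doc 2 is mostly "about" topic 2
    [0.25, 0.25, 0.25, 0.25],  # doc 3 mixes all four topics evenly
])

# Each row is a probability distribution, so rows sum to 1.
assert np.allclose(theta.sum(axis=1), 1.0)
```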
There are many ways of coming up with some "best" number of topics. I typically choose to optimize for semantic coherence, a measure of how distinct the meaning of one topic or theme is from the other topics present in the field. As a measure, this makes sense. You want to be able to interpret what each topic is "about" clearly, and qualitatively be able to associate the topic with some social grouping within the texts, be it author, or author community or cohort, or "school of thought," or occupational group in cross-occupation co-authored texts, etc. So coherence makes sense as something to optimize for in topics. Now of course there's the notion of semantic coherence (a coherent thought or a coherent idea, linguistically), and there's the measure "semantic coherence." I'm not an expert in how this measure of coherence behaves. The developer of the textmineR package has spent a lot of time thinking about coherence, and is very thoughtful in the implementation, so I would go there for detail. I do, however, provide a working definition below. Suffice to say, I buy that this measure of semantic coherence measures how coherent a topic is in the text.
In general, we are looking for the number of topics that maximizes the coherence of topics, so we need to search through some range of parameter space for k, and find the k that yields the "best" coherence. Most people use average semantic coherence: if your model has ten topics, you optimize the average coherence of those ten. I believe, depending on context, something like maximum coherence may be a better optimization target. But that's the subject of an entirely different post about coherence.
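As a sketch of how the two optimization targets can disagree, here is a toy k-selection step over hypothetical per-topic coherence scores (all numbers below are invented for illustration, not from any real model):

```python
import numpy as np

# Hypothetical output of a k-search: for each candidate k, one
# coherence score per topic in that model (made-up numbers).
coherence_by_k = {
    5: [0.12, 0.10, 0.11, 0.09, 0.13],
    6: [0.18, 0.05, 0.30, 0.08, 0.07, 0.06],
    7: [0.14, 0.15, 0.13, 0.16, 0.14, 0.15, 0.13],
}

# The usual target: pick the k with the best *average* topic coherence.
avg = {k: np.mean(v) for k, v in coherence_by_k.items()}
best_k_avg = max(avg, key=avg.get)

# An alternative target: pick the k with the single most coherent topic.
mx = {k: np.max(v) for k, v in coherence_by_k.items()}
best_k_max = max(mx, key=mx.get)

print(best_k_avg)  # 7 under these made-up numbers
print(best_k_max)  # 6: one very coherent topic dominates the model
```

Under these invented scores the two criteria choose different models, which is why the choice of target matters.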
So where do we start? One way to do it, when you’re new to a type of corpus, is to start low, around 5 topics, and cast a wide net. Over time, you get a feel for the range of topics in a particular type of corpus, and then you can start with a narrower range. This is important, because the search process can be very time consuming for large corpora. You’re basically estimating an LDA topic model of your text with 5 topics, then 6 topics, then 7, and so on, and for large corpora (size and number of texts), each estimation may take a lot of time.
But there's more to it than just the "highest coherence," I have found. A useful analogy is the use of adjustable zoom and focus in photography. You can choose to focus on a particular plane by adjusting the configuration of lenses. Similarly, you see coherence spikes when the number of topics k is such that there's roughly one topic per socially determined group of documents, one topic per document, one topic per section of document, per subsection, per sentence, per phrase, etc. The more you know and have read your texts, the easier it is for you to guesstimate things like the number of subsections in your text or the number of social groups represented in the texts. So this is tacit meta-textual knowledge you need to access, and you'll only be able to do it if you've spent a lot of time wallowing in said texts. Have I mentioned how useful it is to have actually read what you are modeling?
In the example below, I have a 22 document corpus of reading notes from a PhD seminar, where we students have split up readings among ourselves. The note-takers are 4 organizational behavior and 3 strategy/organizational theory PhD students. The notes are on average 1.5 pages of 12 pt single spaced text. They are often grouped into subsections like 'overview,' 'findings,' or 'methods.' There is variability in the form of the text (prose, bullets, tables, etc.) across authors and time. The assigned readings for which these notes are prepared are methods papers, book chapters, and empirical papers in which a particular method is demonstrated.
I estimated one LDA model each, for between 5 and 50 topics, on a formatting-free, stopword-free version of the corpus, and plotted the average coherence of topics vs. the number of topics in the model. I have allowed both words and bigrams (two-word sequences) to be treated as terms in the model.
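For illustration, here is a minimal sketch of how words and bigrams can both be emitted as terms during tokenization. The stopword list is a toy stand-in, not the one used in the vignette:

```python
import re

# Toy stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "in", "is"}

def terms(text):
    """Return unigram and bigram terms from a text, stopwords removed."""
    toks = [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]
    # Bigrams are built after stopword removal, joined with "_",
    # which is why topic labels like perceptions_quality can appear.
    bigrams = [f"{x}_{y}" for x, y in zip(toks, toks[1:])]
    return toks + bigrams

print(terms("The perceptions of quality"))
# ['perceptions', 'quality', 'perceptions_quality']
```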
I have a relatively small number of documents, so we're looking for a document-level spike, or a spike in coherence around number of topics = number of notes. Looks like we have one at 18 topics. So let's look at this corpus modeled as 18 latent topics.
Here is one of the most important technical artifacts that we can produce for machine-assisted qualitative sensemaking: a machine-generated thematic summary of a corpus.
In this summary, phi is Pr (Term | Topic), which gives central terms, and gamma is Pr (Topic | Term), which gives specific terms. Prevalence is Pr (Topic | Corpus)*100, and the column sums to 100; this is the topic distribution of this corpus. Coherence here is a function of the probability that the top phi terms for a topic are not also top phi terms for any other topic. As coherence increases, the most topic-specific terms increasingly overlap with the most topic-central terms. Topics are machine-labeled with the bigram that has the highest phi. I like to use labels for topics because it is easier for me to think about and refer to a topic using a name rather than a number. Here, think of it as the algorithm open coding the text for you. I recommend re-coding/re-labeling with something that makes more sense, when needed.
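For readers who want the working definition in code: below is my paraphrase of a probabilistic coherence in the spirit of textmineR's CalcProbCoherence, not the package's exact implementation. For a topic's top terms, it averages Pr(term_j | term_i) − Pr(term_j) over ordered pairs, using binary document co-occurrence; positive values mean the top terms co-occur more than chance:

```python
import numpy as np

def prob_coherence(dtm_binary, top_term_idx):
    """Average lift of later top terms given earlier ones (a sketch,
    paraphrasing textmineR's probabilistic coherence)."""
    n_docs = dtm_binary.shape[0]
    p = dtm_binary.mean(axis=0)  # Pr(term appears in a document)
    scores = []
    for a, i in enumerate(top_term_idx[:-1]):
        for j in top_term_idx[a + 1:]:
            # Pr(i and j co-occur in a document)
            co = (dtm_binary[:, i] * dtm_binary[:, j]).sum() / n_docs
            if p[i] > 0:
                scores.append(co / p[i] - p[j])  # Pr(j | i) - Pr(j)
    return float(np.mean(scores))

# Toy binary document-term matrix: 4 docs x 3 terms.
dtm = np.array([[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1],
                [0, 0, 1]])
print(prob_coherence(dtm, [0, 1]))  # 0.5: terms 0 and 1 always co-occur
```

For the real thing, use textmineR; this is only meant to make the intuition behind "coherence" tangible.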
Another important sense-making artifact when analyzing multi-author corpora is the author-topic prevalence heat map (for longitudinal texts, a time-topic heat map is also useful). In these heat maps, rows/topics and columns/authors are clustered on similarity, so that's a useful way of assessing for structure. Now recall that we looked for a spike in coherence at roughly the same number of topics as there are documents in the corpus. At 18 topics, you can see in the heat map below (darker green = more prevalent) that several topics are highly prevalent in the notes of one and only one note-taker. This is what a particular note by a particular note-taker was "about." Look at t_7 perceptions_quality, for example. It is highly prevalent in my (AB's) notes. That's because one of the notes I wrote for the class summarized the reading: Azoulay, P., T. Stuart, and Y. Wang (2014). Matthew: Effect or fable? Management Science 60(1). This is very much a paper about perceptions of quality. Similarly, t_3 gram_panchayats is only highly present in AMB's notes. That was a note summarizing Chattopadhyay, R. and E. Duflo (2004). Women as policy makers: Evidence from a randomized policy experiment in India. Econometrica 72(5). Gram panchayats are grassroots administrative units through which women participate in policy in rural India. So I'd say those darkest rectangles in the heat map each correspond to what each note, or the paper it was referencing, was "about." Many of these are also highly coherent topics, which tends to happen with rare topics that have exclusive vocabularies: they're quite coherent.
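A quick sketch of the data behind such a heat map: averaging each author's rows of the document-topic matrix theta gives an author-topic prevalence matrix. The authors and numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical Pr(Topic | Document) matrix: 4 docs x 3 topics.
theta = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])
# Which (hypothetical) author wrote each document.
authors = np.array(["AB", "AB", "AMB", "AMB"])

# One row per author: that author's average topic mixture.
author_prevalence = np.vstack([
    theta[authors == a].mean(axis=0) for a in ["AB", "AMB"]
])
# A dark cell in the heat map corresponds to a large entry here;
# clustering rows and columns of this matrix gives the plotted order.
print(author_prevalence)
```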
Popular topics or concepts that are interpreted in many ways by many authors tend to be less coherent. The heat map below also has a few topics that are prevalent across multiple authors. Let's take t_13 provided_enhanced, for example. First, this is an example of a not-great machine label that should probably be relabeled. More saliently, this is an example of a "corpus-level concern" topic, which typically has among the highest average prevalence but low coherence. These corpus-level concerns are what the class is about. Look at the high prevalence topics: t_1 treatment_assignment, t_2 selection_bias. Clearly this is a methods class.
But this seems like a bit of a blunt instrument. One topic per document is probably fine if you're analyzing tweets, but for larger texts, especially ones like these, whose authors have expertise and formalism baked in, there's often the chance that you can extract several coherent topics out of each document. Expertise is reflected in more coherent separation among topics in texts? Seems like a bold claim. In any case, you can do more with coherence. Remember what I said about zoom and focus? Would it then be possible to get crisp focus on this corpus, but more zoomed in? I know that as note-takers we try to write about the paper and also try to relate it to the class through lenses that reflect our own interests and history. Is there a coherence-optimized number of topics that will allow me to start unpacking some of those semantic reflections of social process (in this case, the pedagogical experience)? Let me try the global maximum from the coherence plot above, at 44 topics. What does this prevalence heat map look like? See below:
The big change is now you have little clusters of similarly labeled topics with high prevalence. There's an entire cluster of class-level concerns (t_19 directed_graphs and above in the second heat map). You'll also notice that MK and JS are highly prevalent close to this class-level concern cluster. This is cool, because I happen to know that MK and JS have frequently summarized book chapters these past few weeks, and notes about book chapters look a lot more diffuse than a focused summary of an empirical paper. The chapters also happen to be thematically central to our course.
You'll also see clusters of similarly labeled topics. Let's take t_2 and t_6, surrogate_outcome, for example. In the 18-topic model, this was a single topic, corresponding to my note about a specific reading: Ch. 3 "Surrogate outcomes" in Prasad, V. K. and A. S. Cifu (2015). Ending medical reversal: improving outcomes, saving lives. Here, in the 44-topic model, it has decomposed into two topics:
Looks like this is a nice split into broader methodological themes around proxy variables and the empirical specifics of this reading around medical reversals. That's cool. It's not always this neat.
You'll also note that note-takers cluster differently in the two models. For example, in the 18-topic model, I cluster closest to JN. In the 44-topic model, we're nowhere near each other. I have a theory for why. JN and I happen to share methodological interests, which is reflected in what we select to summarize in this methods class. Our research interests, training, and history, however, are widely divergent, which is reflected in how we write and summarize. I need more data to speculate any further.
You also have to be careful not to push the number of topics too high in pursuit of coherence. A metric like coherence can only go so far. An extremely high number of topics may optimize coherence "the metric," but may not be very coherent to you, the human observer. For purely quantitative uses of texts, such as in regression or network frameworks, that may be OK. Here, however, we are prioritizing human interpretability in how we use LDA. From experience, anything much above 50 topics becomes hard to process at the level of the corpus. So optimizing for coherence at one topic per document in a corpus of 20,000 documents is probably not a good idea if you're trying to understand more generally what the 20,000 document corpus was "about."
I hope these examples are illustrative of the nuances of coherence in machine-assisted qualitative research. I'm deriving a lot of joy out of these explorations. I hope this post is useful to you, hypothetical reader. I'd love to hear about your own experiences with using topic modeling as a sense-making tool (as opposed to something that goes in the paper as a "result"). More to follow, I'm sure...