Customer insights teams have been let down by traditional text analytics tools. There was so much promise early on.

The pitch was simple: let AI do the heavy lifting of understanding the voice of the customer at scale and its effect on your business, giving CX teams more time to figure out how to improve the customer experience in a data-driven way. Unfortunately, the devil was in the details. What we all expected was the ability to quickly understand what customers were saying and the impact it had on the metrics the business cared about.

Instead, what we got were rigid black-box systems with long setup times that required large amounts of ongoing manual maintenance. Worse still, these solutions had no way to handle the unknown unknowns of customer feedback. They needed us to tell them what to look for, meaning they could only capture known threads of discourse. So how exactly did we get here?

What is the traditional approach to text analytics?

To understand the traditional approach to text analytics, we first need to look at how humans solve the problem without technology. The most popular way of doing this is called thematic analysis. It’s a fairly complicated process, but it can be summarised as follows:

  1. Initial Exploration: The first step is to get a general feel for the data. Read over some pieces of feedback, let's say a few hundred randomly sampled pieces, paying particular attention to any repeating patterns you can see in the language. At the conclusion of this step, you should have some idea of the high-level themes or topics people are talking about in the data.
  2. Code Generation: This can also be referred to as label generation. In this step, we formalise what we found in step 1 by coming up with a fixed list of codes for the text. Unfortunately, this requires some more reading, perhaps another 500 randomly sampled pieces of feedback. By the end of this step, we should have a list of codes/labels. They should be descriptive enough that the consumer can understand the “idea” behind each code and the meaning it is trying to capture.
  3. Review of Codes: While this step is optional, it is good practice for ensuring high-quality outputs. Take another random sample of a few hundred data points and begin applying the codes to the text. Pay particular attention to any text that doesn’t get any codes/labels assigned to it, and to any codes that may have been missed. If required, update the list of codes/labels.
  4. Manually code/label the data: Now the fun really begins. Go back to the start of the data and read through it all, piece by piece, assigning the appropriate codes/labels as you go. If you want to maintain high standards, you should have more than one person do this, and you should put quality controls in place, such as checks on inter-coder reliability (do different coders agree with each other?) and intra-coder reliability (does the same coder stay consistent over time?).
  5. Rinse and repeat: As you can well imagine, human language isn’t static. For example, prior to 2020, very few people in the western world knew what a coronavirus was, and it certainly wasn’t appearing in customer feedback. For ongoing data sources (surveys, support centres, social listening, etc.), which covers most data sources in a CX use case, it’s not enough to set and forget the codes. You need to review them periodically: at least once a quarter as a bare minimum, and at least monthly for high-quality results.

(This is a simplified version of a true Thematic Analysis. I’d encourage you to read the Wikipedia article if you want a more exhaustive definition.)
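
To make the quality control in step 4 a little more concrete, here is a minimal sketch of one common way to measure inter-coder reliability: Cohen's kappa, which compares how often two coders agree against how often they would agree by chance. This is an illustration rather than part of the process described above. It assumes each piece of feedback receives exactly one code from each coder, and the function name and sample codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same pieces of feedback.

    Assumes one code per piece per coder (a simplification; real thematic
    coding often allows several codes per piece).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of pieces")

    n = len(labels_a)

    # Observed agreement: share of pieces where both coders chose the same code.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both coders pick the same code at random,
    # given how often each coder used each code.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_codes = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in all_codes) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same six pieces of feedback.
coder_a = ["pricing", "pricing", "support", "delivery", "support", "pricing"]
coder_b = ["pricing", "support", "support", "delivery", "support", "pricing"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. The same calculation can be used for intra-coder reliability by comparing one person's codes on the same sample at two different points in time.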
