A friend recently asked me a really interesting question about the statistical significance of blind tasting. He asked: “How many times should I successfully identify an intruder coffee by blind tasting among a set of three cups, in order for my experiment to be statistically significant?”
As you might imagine, I really like this scientific-minded approach to blind tasting. Unfortunately, the answer to this question is not that simple, and we must plunge into combinatorial statistics if we want to answer it. I won’t do this here, but I will provide you with a way to get an answer without caring about combinatorial statistics. I’m sure most of you do not care about the long, detailed equations.
Even if you don’t care about maths, I would like you to read a few paragraphs below that I think are super important to understand, so please bear with me for a bit longer. I promise you won’t encounter any more equations.
A common theme in all problems of mathematics and physics is that a question must be posed very precisely before we can answer it. This is often the hardest part of a problem: formulating it precisely and correctly. The way we posed the question earlier is not precise enough to start doing maths with it, because we need to specify what we mean exactly by “statistically significant”. To do this, we also need some reference point. We want our experiment to be better than something, but better than what?
One neat way of setting up the problem is by adopting the frame of mind of classifiers. A person trying to identify the intruder coffee among three cups can be called a classifier: they can be a very efficient classifier, succeeding at every blind tasting, or a very bad classifier, randomly selecting a cup because they are unable to taste anything different in the three cups. There’s also a third possibility, which is more rarely interesting: someone could be a misguided classifier, always identifying one of the two wrong cups as the intruder. If you think about it, this is even worse than a random classifier, because with three cups the random classifier will be right about one time in three on average.
Now that we have talked about classifiers, it becomes easier to ask the question more precisely. As a first step toward this, we can ask instead something like “Am I better than a random classifier at this?” This is a step in the right direction, but it is slightly incomplete. Let’s take a simple example: you did the blind tasting test three times, and successfully identified the intruder coffee twice. The third time, you chose the wrong cup, and therefore you failed. Is the random classifier better than you? Well, it will be sometimes. If you ask a random classifier to repeat this experiment of three tastings a dozen times, it may match or beat you a few times by identifying the correct cup at least twice, and it will do worse the rest of the time.
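If you would rather see this than take my word for it, the example above can be simulated in a few lines of Python. This is only a minimal sketch: it assumes the random classifier picks one of the three cups uniformly at random each time, and the function names are made up for illustration.

```python
import random

def random_taster_hits(rng, cups=3, tastings=3):
    """Number of correct picks by a taster who chooses a cup at random."""
    # Cup 0 stands for the intruder; a uniform pick hits it 1 time in `cups`.
    return sum(rng.randrange(cups) == 0 for _ in range(tastings))

def fraction_matching_or_beating(my_hits, cups=3, tastings=3,
                                 sessions=100_000, seed=0):
    """Fraction of simulated sessions where random guessing does at least as well as me."""
    rng = random.Random(seed)
    as_good = sum(random_taster_hits(rng, cups, tastings) >= my_hits
                  for _ in range(sessions))
    return as_good / sessions

# Two correct picks out of three tastings, as in the example.
print(fraction_matching_or_beating(my_hits=2))
```

With three cups and three tastings, the printed fraction should land close to 7/27, or about 26%: a random classifier matches or beats two correct picks out of three roughly a quarter of the time.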
An even better way to pose the question is thus: “What fraction of the time will I beat a random classifier?” This is now a question posed precisely enough that statistics can answer it. Obviously, you will want this fraction to be high! For example, if statistics tell you that you are better than a random classifier 99.9% of the time, you should be happy about it. If you are better than it only 50% of the time, this is not great news. You might now realize that there is a subjective aspect to the way we interpret this score. There is no universal law of nature that tells you: “You must be better than 99.9% of random classifiers in order to be a good taster”. What does “good” mean?
This is a problem we must embrace, because we are stuck with it. Physics, chemistry and all other fields of science are also stuck with it. How confident do you need to be before you think something is probably true? This is a fundamental question, and different fields of science have adopted different confidence thresholds. As an example, the field of astrophysics decided that a confidence of 99.7% is cool. The field of particle physics decided to be more conservative: they want to be at least 99.99994% confident before they change their minds. There is probably some sociology playing a role in this decision, but it is also certainly in part related to how precisely we are able to measure stuff. Particle physicists have big labs and can design experiments in them – astrophysicists are stuck light-years away from their experiment, and all they can do is watch.
Talking in terms of probabilities like 99.7% or 99.99994% is a bit impractical, unless you really enjoy counting decimals. Fortunately, there is another way to describe this, with a very simple number that you can view as a score. In technical terms, this is called an “N-sigma significance”, but you can now safely forget I ever said that. Just think of it as a score, and you want it to be as high as possible. Let’s visualize a few different scores in a table, and translate them to % of confidence:
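For anyone curious where these percentages come from, here is a minimal Python sketch of the conversion. It assumes the score is the usual two-sided Gaussian N-sigma, which is consistent with the 99.7% (score of 3) and 99.99994% (score of 5) quoted above; the function name `confidence_from_score` is my own invention.

```python
import math

def confidence_from_score(score):
    """Two-sided confidence for an N-sigma score, assuming a normal distribution."""
    return math.erf(score / math.sqrt(2))

# Print one line per score, from 1 to 5.
for score in range(1, 6):
    print(f"score {score}: {100 * confidence_from_score(score):.5f}% confidence")
```

Under this assumption a score of 2 corresponds to roughly 95.4% confidence, which is the flip side of being wrong about 4.6% of the time.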
Here’s what I suggest: let’s try to reach a score of at least 2 when we do blind cupping experiments. This means we will draw wrong conclusions only 4.6% of the time, and it will not take a crazy amount of tasting ability or repetition to reach this. Obtaining a Q-grader license requires correctly identifying an intruder cup amongst three in at least five out of six trials, and this corresponds to a score of 2.4. I won’t suggest that everyone should aim at Q-grader level scores all the time. 🙂
Now, let’s talk about designing a blind cupping experiment. Choose a number of identical cups; maybe you would like to use three cups like my friend. Now fill all but one of the cups with the same coffee, and fill the last cup with a coffee that is different in some way. Maybe you want to see if you are able to recognize a different origin, something new you tried with roasting, or a different type of brew water. The more cups you use, the harder the challenge will be, and you will thus get to higher scores faster when you succeed. Mark the bottom of each cup with what it contains, ask someone else to swap them around, and then try to identify the intruder cup without looking at the tags. Once you think you found it, look at the tag underneath, and mark on a sheet whether you succeeded or failed. Do this a dozen times and log your results; I’ll help you decide what your score is.
One thing you absolutely cannot do when you design this experiment is to decide after 5 tastings: “I am failing too many times, let’s just start from scratch.” This is how the fields of psychology and biology got themselves into a crisis where a large number of their published results turned out to be false. Erasing your failed tastings is cheating. If your score is still not getting high by the time you are exhausted, here’s what happened: the experiment informed you that you were unable to distinguish the intruder cup from the other ones. You can start again if you change something about the variable you are testing. For example, if you failed to differentiate two types of water, you can restart the experiment with an entirely different type of water in your intruder cup. This would not be cheating, because it is now a different experiment.
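To see why restarting is cheating, here is a small Python simulation of a taster who cannot taste any difference and simply guesses. Everything about the setup is an assumption chosen for illustration: three cups, sessions of 12 tastings, a peek after 5 tastings, a restart whenever fewer than 3 of those 5 were correct, and 8 correct picks out of 12 as the bar for a score above 2. The function names are hypothetical too.

```python
import random

def session_hits(rng, restart_if_bad_start, cups=3, tastings=12,
                 peek_after=5, min_hits=3):
    """Correct picks in one session by a taster who guesses at random."""
    while True:
        hits = sum(rng.randrange(cups) == 0 for _ in range(peek_after))
        # The cheater throws away a disappointing start and begins again.
        if not restart_if_bad_start or hits >= min_hits:
            break
    # Finish the remaining tastings of the session that was kept.
    hits += sum(rng.randrange(cups) == 0 for _ in range(tastings - peek_after))
    return hits

def lucky_fraction(restart_if_bad_start, needed=8, sessions=20_000, seed=0):
    """Fraction of random-guess sessions that still reach `needed` correct picks."""
    rng = random.Random(seed)
    lucky = sum(session_hits(rng, restart_if_bad_start) >= needed
                for _ in range(sessions))
    return lucky / sessions

print("honest:  ", lucky_fraction(restart_if_bad_start=False))
print("restarts:", lucky_fraction(restart_if_bad_start=True))
```

By my back-of-the-envelope calculation, the honest guesser should clear the bar in about 2% of sessions, while the one who restarts after a bad start should clear it around 8% of the time. The tasting ability is identical in both cases; only the bookkeeping changed.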
For those of you who do not want to think further about maths or science, I built a Wolfram Alpha widget for you. Just enter how many cups you are using in each tasting (“Number of Cups”), how many times you tasted (“Number of Trials”), and how many times you failed to identify the intruder cup (“Number of Failures”). Then press “Submit”. You will then see some ugly equation stuff that I was unable to remove from Wolfram’s output, but just focus on the number at the end. This is your score.
For the default values (3 trials, 3 cups, 1 failure), your score would be 1.1. This is a really bad score – you will certainly need to do more than 3 tastings if you want to be confident about your experiment. If you reach 5 tastings with only one failure (with 3 cups each time), then you will reach a score of 2. If, however, you fail twice, you will need even more successful tastings to reach a score of 2.
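If you prefer code to widgets, here is a minimal Python sketch of what I assume the calculation boils down to: the probability that a random classifier does at least as well as you did, converted to a two-sided N-sigma score. This is a reconstruction under that assumption rather than the widget’s actual code, and the function names are hypothetical. It does reproduce the 1.1 of the default values and the 2.4 of the Q-grader requirement.

```python
from math import comb
from statistics import NormalDist

def chance_probability(cups, trials, failures):
    """Probability that random guessing does at least as well as you did."""
    p = 1 / cups                      # chance of picking the intruder blindly
    successes = trials - failures
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def score(cups, trials, failures):
    """Two-sided N-sigma equivalent of the chance probability (assumed convention)."""
    tail = chance_probability(cups, trials, failures)
    return -NormalDist().inv_cdf(tail / 2)

print(round(score(cups=3, trials=3, failures=1), 1))  # default values: 1.1
print(round(score(cups=3, trials=6, failures=1), 1))  # five out of six: 2.4
```

The sum simply counts every outcome in which the random classifier gets your number of successes or more, which is the “fraction of the time” from earlier in this post.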
If you have never heard about Wolfram Alpha, it’s a wonderful website. It’s like a robot version of Wikipedia that can do maths. You can ask it really silly stuff, like “What is the average life span of donkeys?” and it knows the answer surprisingly often. The kind of questions you probably ask yourself every day.
I’d like to thank Victor Malherbe of the Montreal Coffee Academy for asking me this question on statistics, and for the information on Q-grader requirements.