User Testing 101

There are dozens of different types of testing and hundreds of different methodologies out there. I’m going to focus on the big three that we interact with in UX on a regular basis:

User Interviews
Usability Testing
A/B Testing

User Interviews

User interviews vary greatly depending on whether they are moderated or unmoderated, who wrote the script, and who is conducting the interview. This is a very soft science and a lot of care needs to go into how they are constructed. In general, you will have better results with moderated interviews because you can observe the participants’ body language, correct trains of thought before they derail completely, and ask probing questions when the participants hit on something interesting. So why would you do an unmoderated study, if you’re going to lose all of that insight and control? Unmoderated studies are easier to recruit for and they’re fast; for a lot of companies, it’s that simple. If you’re going to go the unmoderated route, UserTesting.com has a decent platform; but if you don’t have a lot of experience with this type of work, you may find it difficult to get usable feedback. Regardless of the path you take, here are a few things to consider:

Before you write your test, write out what your objectives are. If you don’t have clear objectives, then you won’t be able to focus your participants and you will be less likely to get good responses.
Identify your assumptions and biases. With any qualitative study, it’s easy to let your own opinions leak through. Only by recognizing your biases and understanding where they come from will you be able to actively filter them out of your test so you don’t pollute the results.
Rehearse your user interview with someone–preferably someone who has the same amount of context a user would have–as a dry run before you conduct the real interview. This will undoubtedly reveal a typo, unclear question, leading statement, or confusing flow in your script.
Sorts, rankings, and scales are helpful when collating the results later. These types of questions will allow you to quantify some of the responses, rather than just having pages of open-ended feedback.
What participants say they would do in the fictional scenario you are giving them and how they would actually behave in the wild will often be very different. For example, when calling attention to a hero or banner space on a website, participants may say they find them useful or would click on them, but production data shows a much lower engagement with those spaces. So filter any feedback through what you know is actually happening in your product.
Asking a participant if they want a feature or if they want to see more content or if they want more options will almost always result in a “yes”. These types of questions carry with them a positivity bias that doesn’t always translate to actual behavior.
In general, I recommend you pretty heavily discount the positive feedback you get in a qualitative study. Using a respondent’s affirmation as evidence of success is misleading to your stakeholders since that person may not be indicative of the population who will ultimately be using the product.
Rather, pay attention to the negative feedback or to the participants who get confused. You may think they are an edge case, but if they are having a problem now, you will most certainly encounter many more users with that problem once you scale into production.
Don’t overdo it. Interviewing 20 people will rarely uncover anything the first 6 participants didn’t already touch on and it still won’t make for a statistically significant sample. Keep your study small and take everything with a grain of salt or filter it through a separate quantitative study.
Fidelity matters. Participants tend to get hung up on visuals or copy (if you’re not using production-ready content).
In a moderated study, there can be a temptation to explain things or defend the design. This is not a product demo. You should be issuing tasks or asking questions, that’s it.
Always record your interviews so that you can go back to them later to pull out critical pieces.
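
To make the point about sorts, rankings, and scales a little more concrete, here is a minimal sketch of how answers to a single 1-to-5 scale question could be collated. The function name and the 1-to-5 range are my own assumptions for illustration, and with a handful of participants the output is a summary, not a statistic you should generalize from.

```python
from collections import Counter
from statistics import median


def summarize_scale(responses, low=1, high=5):
    """Summarize answers to one scale question (assumed here to run 1 to 5)."""
    counts = Counter(responses)
    return {
        "n": len(responses),
        # The median is less sensitive to one extreme answer than the mean.
        "median": median(responses),
        # Include scores nobody chose so gaps in the scale stay visible.
        "distribution": {score: counts.get(score, 0) for score in range(low, high + 1)},
    }


# Hypothetical answers from six participants to "How easy was this task?"
summary = summarize_scale([2, 4, 4, 5, 1, 3])
print(summary)
```

Looking at the whole distribution, rather than just one number, keeps the confused participants visible instead of averaging them away.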

Usability Testing

The purpose of usability testing is to see how real people actually behave when using your product. The premise is that you give users a task and observe them as they carry out that task from beginning to end. You may be wondering how this differs from what we would do in a user interview. The task we are giving someone in a usability test is much bigger and broader: “schedule an appointment”, “find a widget you like and checkout”, “create an account and get started with X”. Once we’ve established the single, overarching task for the participant, we then go into observation mode while they attempt to complete it. User interviews tend to be composed of many smaller tasks with lots of pre-determined questions sprinkled in between. That makes usability testing much more open-ended by nature, so in most cases it requires a moderator to be successful. Here are some things to consider when conducting a usability test:

Create the most realistic scenario possible. For an eCommerce website, for example, that may mean giving people money to shop for things that they want to buy or are already in the process of researching.
Screening is hyper critical. You need to be observing people who are your actual users or who accurately represent your user in the state of mind that your users would be in for that task in the wild.
Always record the participant’s screen and the audio of the conversation. It’s also incredibly helpful if you have a second person to take notes for the moderator so that they can focus on the participant.
It’s important that you make your participants feel comfortable because you want them to talk transparently with you about everything that goes through their mind as they are using your product.
Do not tell the participant where to go. If you ask them to find X and it takes them 4 minutes to get there using your main navigation, then 4 minutes is how long it takes. Once they have successfully arrived at X, then you can ask probing questions about why they took that particular route and what they were thinking as they did it.
If your participant gets truly stuck, you should first ask them to talk out loud about what’s going through their mind. If verbalizing their thought process doesn’t help, then you can slowly start to ask ever so slightly leading questions to get them headed in the right direction. Leading a participant in this way is permitted for two reasons: (1) to prevent decision fatigue too early in the test or (2) to work around a confusing command or poorly designed prototype. Otherwise, we want to avoid leading the participant in any way.
Try to limit questions to only the immediate actions the participant is taking in that moment; don’t pull them out of the headspace they are in as part of their natural behavior in your product. Once they have completed the overall task, then you can go back and deep dive on individual features or aspects on which you want more information.

A/B Testing

This is, most simply, a test in which you would be comparing two versions (A and B) of a feature or product against each other to see which performs the best. However, A/B testing is widely misused and the nuances of statistical analysis are not well taught within organizations. This means bad tests with misleading results in many cases.

The largest problem is the isolation of variables: differences between the test versions, different participants, bias in the design, when the testing occurs, etc. Ideally, both versions would be tested in production with real users at a high enough volume to attain statistical significance.

Statistical Significance
“Statistical significance” is the mathematical term for having a sufficiently large sample size, a sufficiently great distance between the results, or both, such that you can confidently rule out the difference being caused by chance variation in your sample sets. In other words, you are confident that your samples are acceptably indicative of your actual population, so you can trust that the results for the entire population would mirror the results from your test. However, having “significance” does not mean that your data is important, positive, or even useful for decision-making; it merely describes your confidence in the accuracy of your data (“confidence” is also mathematically defined and if you want to get into that, check out the Khan Academy video).

If you don’t have statistical significance, then any differences you are seeing in the test versions should be attributed to the people you had in your test groups and not to material differences in the test versions. We call this being “flat”. For more information on statistical significance, check out this article in the Harvard Business Review.
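
If you want to see what a significance check can look like in practice, here is a minimal sketch using a two-proportion z-test on conversion counts. This is one common approach and not necessarily what your testing tool does under the hood; the function names, the example counts, and the 0.05 threshold are all assumptions I picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def is_flat(p_value, alpha=0.05):
    """Treat the test as flat when the p-value does not clear the threshold."""
    return p_value >= alpha


# Hypothetical counts: 100 of 1,000 users converted on A, 150 of 1,000 on B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z={z:.2f}, p={p:.4f}, flat={is_flat(p)}")
```

A small p-value only says the gap is unlikely to be an accident of who landed in each group. It says nothing about whether the gap is big enough to matter to the business.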

Sample Sets
Unlike user interviews or usability tests, you can’t ask your users about their thoughts and feelings during an A/B test. You also can’t put the same user through both versions of your experience, because one experience will impact their behavior in the other. This means you have to show Version A to one group of people and Version B to another. In larger organizations, you may be testing more than two versions of a new product or feature against what is currently in production. That means you are adding a Control group (the baseline or current experience) that you compare to all of the test versions. This can be a lot to manage if you aren’t using a tool like Adobe Target, but it is possible. It isn’t ideal, but if you are at a smaller company or working with fewer resources, you can run the tests consecutively rather than simultaneously. Just be aware that it may be more difficult for you to be confident about what is statistically significant.
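
If you are not using a dedicated tool, one way to keep each user in a single group is to derive the group from a hash of their ID. The sketch below is a simplified illustration, not how Adobe Target or any other product works; the function name, the experiment label, and the group names are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment, groups=("control", "A", "B")):
    """Deterministically map a user to one group for a given experiment."""
    # Including the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and pick a bucket from it.
    return groups[int(digest, 16) % len(groups)]


# The same user always lands in the same group for the same experiment.
print(assign_group("user-42", "checkout-redesign"))
print(assign_group("user-42", "checkout-redesign"))
```

Because the assignment depends only on the ID and the experiment name, a returning user never sees a different version partway through the test, and the groups end up roughly equal in size.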

Now that we’ve established some context around significance and samples, here are some guidelines to help you be more successful when you A/B test:

Create a hypothesis for every test you run. If your overall test doesn’t have a hypothesis (e.g. “Implementing X will drive Y up by Z”), then it’s going to be hard for you to establish good success metrics, and measurement will be much more difficult.
Create a separate hypothesis for every version in your test. If you find that more than one version that you want to test has the same fundamental hypothesis, then merge those versions or drop one from the test, because they are not materially different and won’t provide you with usable data.
Running two controls and waiting for them to normalize is an easy method of determining when you have eliminated variations due to sample size. You may still not reach statistical significance if the variation in your versions is too small, but it’s a start. If your tests are running consecutively, playing around with normalizing your controls will give you an idea of how long your tests need to run, in general, to achieve significant sample sizes.
Do not report out flat results. As I mentioned earlier, “flat” means there is no reliable or functional difference between the performance of one test version over another and reporting out these numbers will only mislead people who are unfamiliar with statistics and A/B testing methodologies.
Normalize your results. Companies I’ve worked at before have measured their results in basis points (bps) over/under a certain success metric in the control. But a 50bps lift for a small section of the product might yield less benefit to the bottom line than a 5bps lift to a highly engaged portion of the product. The only way to clarify the impact your test versions have is to annualize the revenue impact or normalize the results against some other metric.
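
To show why annualizing changes the picture, here is a rough sketch of the arithmetic. It assumes the lift applies directly to the revenue flowing through that part of the product, which is a simplification, and the function name and revenue figures are made up for illustration.

```python
def annualized_impact(lift_bps, annual_baseline_revenue):
    """Convert a relative lift in basis points into annual revenue impact."""
    # 1 basis point = 0.01% = 1/10,000
    return annual_baseline_revenue * lift_bps / 10_000


# Hypothetical: a small section earning $200,000 a year gets a 50bps lift,
# while a highly engaged section earning $40,000,000 a year gets a 5bps lift.
small_section = annualized_impact(50, 200_000)
busy_section = annualized_impact(5, 40_000_000)
print(small_section, busy_section)  # 1000.0 20000.0
```

In this made-up case the 5bps lift is worth twenty times more per year than the 50bps lift, which is exactly the kind of difference raw basis points hide.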