
Monday, September 10, 2012

Item Response Theory: Developing Your Intuition

Suppose that you accepted my argument from the last two posts on halo effects and bifactor models.  As you might recall, I argued that when respondents complete rating scales, they predominantly rely on their generalized impression, with a more minor role played by the specific features that the ratings were written to measure.  Consequently, we see a sizable general factor accounting for a substantial amount of the variation between respondents.  This general factor is important because much of human judgment and decision making is based on one’s overall emotional response (approach vs. avoidance).  But what if you needed to drill down into the specific features?

A good illustration is the work being done by the National Institutes of Health (NIH) over the last decade on the Patient Reported Outcome Measurement Information System (PROMIS).  For example, when measuring the upper extremity physical function of children, they get very specific and ask about the occurrence of everyday activities:  undoing Velcro, using a mouse with the computer, buttoning shirts, pouring liquids from a pitcher, and cutting paper with scissors.  In order to construct these scales and analyze such data, PROMIS has turned to Item Response Theory (IRT).  Neither dichotomous (checklist) nor ordinal (frequency of occurrence) responses can be analyzed as if they were continuous by statistical procedures, like factor analysis, that assume continuous measurement.

OK, so you get the first part of the title, but what about “Developing your intuition?” This sounds like it belongs in a self-help book. But I have found that it is very difficult to learn item response theory unless you understand the motivation behind it. Perhaps it is because IRT is not a single statistical model, but a family of increasingly complex models and estimation techniques. At times it seems like one is reading an old encyclopedia entry with heading after heading dealing with one more complex topic after another. There are one-parameter, two-parameter, and three-parameter models. Then, there are polytomous items, and nonparametric models, and linear logistic test models, and all the different estimation techniques (including Bayesian IRT). And just when you think you have it, someone tells you about multidimensional IRT. But placing measurement on a firm foundation is too important for us to wait. So let’s see if we can outline a framework that makes IRT seem reasonable and within which we can begin to place all the different models and approaches.

Measurement Scales Derived from Relative Comparisons

Harder substances scratch softer substances, as every elementary student is taught. Given an unknown substance, I can evaluate its relative hardness by attempting to scratch it with an ordered set of standards of increasing hardness (e.g., gypsum < quartz < diamond). This is the rationale for the Mohs scale. It yields an ordinal scale because all I know is relative hardness.

How about physical fitness? Could we not identify a set of standardized physical tasks of increasing difficulty and measure a person’s physical fitness as the most difficult task they were able to pass? This is the idea underlying the Guttman scale.  If I can order the tasks, then a person will pass every task until they fail and then not pass any subsequent tasks.  The only problem is that human behavior is variable, so we may see random variation and inconsistent performance. By replacing the deterministic Guttman scale with a probabilistic response, we can deal with random variation and focus on the likelihood of passing.  This is the approach taken by item response theory.
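To make the contrast concrete, here is a short Python sketch with hypothetical task difficulties (the analysis later in this post uses R): a deterministic Guttman respondent passes every task up to their ability and nothing beyond it, while a probabilistic respondent can show occasional reversals.

```python
import math
import random

def guttman_response(ability, difficulties):
    # Deterministic Guttman scale: pass every task at or below your
    # ability level, fail every task above it.
    return [1 if ability >= d else 0 for d in difficulties]

def irt_response(ability, difficulties, rng):
    # Probabilistic version: the chance of passing falls off smoothly
    # as task difficulty exceeds ability (a logistic curve).
    return [1 if rng.random() < 1 / (1 + math.exp(-(ability - d))) else 0
            for d in difficulties]

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical, ordered easy to hard
print(guttman_response(0.25, difficulties))            # [1, 1, 1, 0, 0]
print(irt_response(0.25, difficulties, random.Random(7)))  # may show reversals
```

The Guttman pattern is always a run of 1's followed by a run of 0's; the probabilistic pattern usually looks similar but is not guaranteed to, which is exactly the behavior IRT models.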

Ultimately, the goal is to get both criterion-referenced and norm-referenced measurements. If we include physical tasks that have real world implications (e.g., walking up stairs, lifting heavy luggage into an overhead bin, running for a cab), we will know something about what the person can and cannot do (criterion-referenced). In addition, we have learned where the person places relative to others in the sample (norm-referenced).

But since I am a marketing researcher, perhaps we could substitute brand strength for physical fitness and talk about brand equity or brand health. Specifically, a strong brand or a healthy brand should be able to pass several tests. It should have a favorable image, be in the consideration set during purchase deliberations, be bought, have satisfied customers, and get recommended by its users.

It’s All in the Response Pattern Matrix

I have generated some data consistent with the rank ordering of these five brand tests in order to show you the R code and the output from an item response model.  All the R code needed to generate the data and run the analysis appears in an appendix at the end of the post.  In addition, I recommend the Journal of Statistical Software article on the ltm package by Dimitris Rizopoulos.  He works through all the same code when analyzing five items from the LSAT (Section 3.1).  I deliberately created a “matching” example so that you could see two worked examples from different perspectives.  Rizopoulos more fully discusses the code and the output, while I try to "develop your intuition."

First, let us look at the frequency table for 200 respondents who gave Yes/No answers to the five brand strength tests. There are five binary variables, so there are 32 possible response patterns. We only see 21 patterns because the brand tests are not independent. That is, if we had seen all 32 possible combinations with equal or balanced cell frequencies, we would have concluded that the 5 tests were independent and would have stopped our analysis. If it helps, you can think of this response pattern matrix as a 2x2x2x2x2 factorial design or as a contingency table of the same dimensions.

     Favorable  Consider  Purchase  Satisfied  Recommend  Percent  Total Score  Latent Score
 1       0         0         0         0          0        18.0%        0          -1.07
 2       0         0         1         0          0         4.5%        1          -0.50
 3       0         1         0         0          0         5.0%        1          -0.50
 4       1         0         0         0          0        13.5%        1          -0.50
 5       0         1         0         1          0         0.5%        2          -0.08
 6       0         1         1         0          0         4.0%        2          -0.08
 7       1         0         0         0          1         1.5%        2          -0.08
 8       1         0         0         1          0         1.0%        2          -0.08
 9       1         0         1         0          0         2.5%        2          -0.08
10       1         1         0         0          0         5.0%        2          -0.08
11       0         1         1         1          0         0.5%        3           0.31
12       1         0         0         1          1         0.5%        3           0.31
13       1         0         1         1          0         3.5%        3           0.31
14       1         1         0         0          1         2.0%        3           0.31
15       1         1         0         1          0         4.0%        3           0.31
16       1         1         1         0          0         6.0%        3           0.31
17       1         0         1         1          1         0.5%        4           0.72
18       1         1         0         1          1         1.5%        4           0.72
19       1         1         1         0          1         3.5%        4           0.72
20       1         1         1         1          0         7.5%        4           0.72
21       1         1         1         1          1        15.0%        5           1.26

Note that the two profiles with the largest percentages of respondents are all No's (the first row with 18.0% or 36 of the 200 respondents) and all Yes's (the last row with 15.0% or 30 respondents). This means that 18% received the lowest possible score and cannot be differentiated further without adding another brand strength test (something easier than favorable image, like awareness). Similarly, at the other end of the scale we find 15% with the highest possible score. If we wanted to separate this 15%, we would need to add a more severe test than recommendation (e.g., continue to purchase after major price increase).
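For readers who want to see how such a response pattern matrix arises, here is a Python sketch with hypothetical item locations and slope, loosely mirroring the appendix simulation: each respondent's pattern is generated from a single latent score, which induces the dependence that keeps most of the 32 patterns from appearing.

```python
from collections import Counter
import math
import random

rng = random.Random(42)

# Hypothetical item locations, ordered easy to hard, standing in for the
# five brand tests; the common slope of 2 ties all items to the same trait.
locations = [-0.6, -0.2, 0.1, 0.5, 0.8]

def pattern(theta):
    # One respondent's Yes/No answers to all five items, given their theta
    return tuple(1 if rng.random() < 1 / (1 + math.exp(-2 * (theta - b))) else 0
                 for b in locations)

# 200 respondents, each with a latent brand-strength score theta ~ N(0, 1)
patterns = Counter(pattern(rng.gauss(0, 1)) for _ in range(200))
print(len(patterns), "of", 2 ** 5, "possible patterns observed")
```

With only 200 respondents and correlated items, the count of distinct patterns typically lands well below 32, just as in the table above.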

Items vary in their difficulty and, as a result, are best at measuring individual differences near their location on the scale. On the one hand, easy items with lots of respondents saying “Yes” separate individuals at the lower end of the scale. On the other hand, difficult items with lots of respondents saying “No” separate individuals at the higher end of the scale.

Before we proceed, we need to assure ourselves that the five tests are all tapping one underlying dimension. Obviously, the tests were selected because we believed that they were all measures of brand strength. Strong brands deliver consistent value that customers are willing to pay for. When selecting the tests, I was thinking of the purchase funnel, a theory about the steps in the purchase process. Brands with favorable images tend to make it into the consideration set, but they are not always purchased. Not everyone who buys is satisfied with their purchase. Even satisfied customers don’t always recommend. It is this “funneling process” that makes each of the steps increasingly difficult for the brand to pass.

So, I have a solid theoretical basis for believing that these tests tap the same underlying individual difference dimension.  Why the stress on the “individual difference” modifier?  The brand strength that we are concerned with varies over customers.  This can be confusing because at times brand strength is measured over several different brands and used as a brand characteristic.  Perhaps you need to recall your experimental design class where you studied two types of designs, between-subject and within-subject, and their combination as mixed designs.  There are many research areas where we gather lots of data from subjects with the intent to learn something about how the individual operates.  In IRT we have the extended Rasch model (eRm package), where the items are systematically varied according to a design and the effects of item features estimated.  But that is not what we are doing here.  We are looking at perceptions of a single brand.  Some respondents have a favorable opinion of the brand, and others do not.  Brand strength is the dimension that differentiates the favorable respondents from the not favorable respondents.

Now, what about empirical evidence for the unidimensionality of these five brand strength tests?  Principal component analysis should help.  The first principal component accounts for over 50% of the total variation and is 3.5 times the size of the second principal component. There are more tests that the ltm package can run (e.g., modified parallel analysis), but the magnitude of the first principal component is probably good enough for this example.
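As a hedged illustration of that eigenvalue check, here is the calculation in Python on the population correlation matrix implied by the appendix (equal loadings of 0.6, hence off-diagonal correlations of 0.6 × 0.6 = 0.36); sample data will bounce around these population values.

```python
import numpy as np

# Population correlation matrix implied by the appendix simulation:
# equal loadings of 0.6 give off-diagonal correlations of 0.36
R = np.full((5, 5), 0.36)
np.fill_diagonal(R, 1.0)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
print(round(eigvals[0] / eigvals.sum(), 2))  # share of total variation: 0.49
print(round(eigvals[0] / eigvals[1], 2))     # ratio to 2nd component: 3.81
```

The population values (49% of the variation, a 3.8:1 ratio) sit right next to the sample figures reported above, which is what we would hope for with 200 respondents.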

We can also get a sense of the data by looking at the above table. It is ordered by item difficulty and by the latent trait score associated with each response pattern. The last column is the latent trait score, an index of brand strength associated with each response profile. The adjacent column is the total score, calculated as the number of Yes’s to the five brand tests.  As one moves from left to right across the table, the number of Yes’s (=1) decreases. The percentage of Yes’s falls from 67.5% for favorable to 54.5% to 47.5% to 34.5% to 24.5% for recommend. As one moves down the table, brand strength increases, as does the number of Yes’s in each row.

Allow me to deal with one side issue. If this were a “true” funnel, I would fix the order of the questions from favorable image to recommendation and screen respondents so that no one who did not purchase would be asked about satisfaction. But I wanted to show the type of response patterns that you tend to see in survey data. As you can see in the R code at the end of this post, I generated random data as if this were a checklist with order randomized separately for each respondent and no branching or screening between items. For example, look at row 15 with 4% or 8 respondents. These 8 respondents indicate that they would be satisfied with the brand, but would not purchase it. Is this an inconsistent response pattern? Is it random variation? Or perhaps the brand is too expensive for them? I would be satisfied driving a luxury car (check “Yes” under Satisfied), if I could afford it (check “No” under Purchase).  If it is the case that Purchase measures affordability, in addition to brand strength, then we might wish to remove that item from our scale.

The key to achieving an intuitive sense of what an IRT model is trying to accomplish is the ability to “see” the relationship between the response pattern matrix and the underlying latent trait. Once you get that, it becomes much easier. Items tap different locations along the latent trait. How do we know that? Fewer respondents give positive responses to more severe tests of the latent trait. You need to “picture” the response pattern matrix and see the number of 0’s increasing and the number of 1’s decreasing. Look at the above table again and notice how the 1’s pool toward the bottom rows and the left-hand columns, as if the “Yes” responses flow downhill toward the easiest items and the strongest brand perceptions.

But don’t forget that there are two parts: items and persons. Latent traits are dimensions of individual differences. How are respondents separated by their response patterns? Can you see it in the response pattern matrix? What happens as you move down the table? More and more respondents begin to say “Yes” to the more stringent tests of brand strength. What does it mean for a respondent to perceive a brand as being a strong brand? They give more Yes’s, true, but lots of respondents say “Yes” to the easy tests. Only those respondents with the most positive brand perceptions say “Yes” to the hardest tests for a brand to pass.

Output from an Item Response Theory Analysis

If you recall, my intent was to develop your intuition and not review all of IRT. Keeping with that goal, I will only run one type of IRT model and then try to relate the output to the response pattern matrix.

We will use one of the many IRT packages in R. I selected the latent trait model package, ltm, because it is both comprehensive and relatively easy to use. More importantly, Dimitris Rizopoulos has gone out of his way to provide extensive support, both in articles and in presentations. He has written a lot, and it is all worth reading.

We will run one of the more common IRT models, the two-parameter logistic model. To understand what the two parameters are, we look at the item characteristic curves. These are logistic functions, one for each of the 5 brand tests. Each one shows the likelihood of checking "Yes" as a function of the respondent's score on the underlying latent variable, called ability for historical reasons (the first applications were in educational testing).
The curves follow the same ordering that we saw in the response pattern matrix with favorable (1) < consider (2) < purchase (3) < satisfied (4) < recommend (5). They are defined by their location and slope.

Item                 Location   Slope
Item 1  Favorable      -0.57    2.36
Item 2  Consider       -0.15    1.98
Item 3  Purchase        0.09    1.65
Item 4  Satisfied       0.46    3.30
Item 5  Recommend       0.83    2.65

The location estimates nicely separate the five brand tests. They indicate the score on the latent variable that would yield a 50-50 chance of saying "Yes" to each item. As we noted when we looked at the response pattern matrix, these five brand tests do not do a good job of differentiating respondents at the very top or bottom of the scale. You should remember that 18% gave the lowest possible responses and 15% gave the highest possible responses. We see this in the location parameters, which range from only -0.57 to 0.83.  We would need to add easier items (location below -0.57) and harder items (location above 0.83) in order to differentiate within these two large groups.  I should note that I am interpreting these location parameters as if they were z-scores with a mean of 0 and a standard deviation of 1.  Although one always needs to check the distribution of latent scores in their particular study, this is not a bad “rule of thumb” when you have many items and a normally distributed latent variable.
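To see what the location parameter means, here is a small Python sketch (an illustration; the post's models are fit in R with ltm) of the two-parameter logistic function P(Yes | theta) = 1 / (1 + exp(-slope * (theta - location))), using the fitted values from the table above. At a theta equal to an item's location the probability is exactly 0.5, and at any fixed theta the easier items are more likely to draw a Yes.

```python
import math

def icc(theta, location, slope):
    # Two-parameter logistic item characteristic curve:
    # the probability of "Yes" given the latent trait theta
    return 1 / (1 + math.exp(-slope * (theta - location)))

# Fitted (location, slope) pairs from the table above
items = [("Favorable", -0.57, 2.36), ("Consider", -0.15, 1.98),
         ("Purchase", 0.09, 1.65), ("Satisfied", 0.46, 3.30),
         ("Recommend", 0.83, 2.65)]

# At an item's own location, the chance of "Yes" is exactly 50-50
print(icc(-0.57, -0.57, 2.36))  # 0.5

# For an average respondent (theta = 0), easier items are more likely "Yes"
for name, b, a in items:
    print(f"{name:10s} {icc(0, b, a):.2f}")
```

Running this reproduces the ordering in the response pattern matrix: an average respondent is quite likely to find the brand favorable but unlikely to recommend it.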

The slope parameters indicate how quickly the probability of saying "Yes" changes as a function of changes in the latent trait. Item 4 measuring Satisfaction has the steepest slope. In fact, its steepness causes the item curves to overlap, meaning that there is a range of the latent trait where the likelihood of saying Satisfied (4) is greater than the likelihood of saying Purchase (3).  This is not a happy result.  We would like for the five logistic curves to be parallel (same slope) or at least for them not to overlap.  That is, we would like for Satisfaction to be a more severe test of brand strength than Purchase for everyone regardless of their level of brand strength.

Perhaps these differences in slope are due to random variation?  What if we constrained the five slopes to equal the same value, would we be able to reproduce our observed response pattern matrix as well as when we allow the slopes to be different?  Sounds like a likelihood ratio test using the anova() function, and with four degrees of freedom (i.e., five different slope estimates reduced to one common slope), we find a p value of 0.232.  The location estimates change little when the slopes are constrained to be equal.
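The mechanics of that likelihood ratio test are easy to sketch in Python: twice the difference in log-likelihoods is referred to a chi-square distribution with four degrees of freedom. The log-likelihood values below are hypothetical placeholders, not the fitted values, which come from the rasch() and ltm() objects in the appendix.

```python
import math

def chi2_sf_4df(x):
    # Survival function of the chi-square distribution with 4 degrees
    # of freedom (a closed form exists for even df)
    return math.exp(-x / 2) * (1 + x / 2)

# Hypothetical log-likelihoods; the real ones come from the fitted models
loglik_equal_slopes = -520.7   # constrained model: one common slope
loglik_free_slopes = -517.9    # unconstrained model: five slopes

lrt = 2 * (loglik_free_slopes - loglik_equal_slopes)  # test statistic
df = 5 - 1                                            # slopes reduced 5 -> 1
print(round(lrt, 1), round(chi2_sf_4df(lrt), 3))
```

A p-value above the usual 0.05 cutoff, as in the analysis above, means the freely estimated slopes do not reproduce the response pattern matrix meaningfully better than a single common slope.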

Item                 Location   Slope
Item 1  Favorable      -0.58    2.25
Item 2  Consider       -0.16    2.25
Item 3  Purchase        0.07    2.25
Item 4  Satisfied       0.49    2.25
Item 5  Recommend       0.87    2.25

The common slope seems a reasonable compromise, and the logistic curves are now parallel.

Respondents and Items on the Same Scale

Our goal from the beginning was to find some way to combine the five separate brand strength tests into a single index.  Looking at the response pattern matrix, we knew that we had only 21 of the 32 possible combinations of the five dichotomous Yes/No tests.  Each of the 21 response profiles could have yielded a different latent variable score.  Had we rejected the test for the equality of the five slope parameters, we would have had 21 different latent trait scores.  This is an important point.  When you look back to the response pattern matrix, you see all 21 observed response profiles along with the total score (number of Yes’s) and the latent trait score from the IRT model with all the slopes constrained to be equal.

There are duplicate latent scores for different response patterns because all five tests have equal slopes.  In fact, there is a unique latent score for each total score. In this particular case, the relationship between the total score and the latent score appears to be linear. However, this will not generally occur, and the relationship between the observed total score and the unobserved latent score is often not linear. 
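Why do patterns with the same total score share a latent score? With equal slopes, the part of the likelihood that depends on theta involves the responses only through the number of Yes's. Here is a Python sketch using a grid-search maximum likelihood (note that factor.scores() in ltm uses a different scoring method, so these numbers need not match the table; only the grouping by total score does):

```python
import math

# Equal-slope fit from the table: common slope and per-item locations
A = 2.25
locations = [-0.58, -0.16, 0.07, 0.49, 0.87]

def loglik(pattern, theta):
    # Log-likelihood of one Yes/No response pattern at a given theta
    ll = 0.0
    for y, b in zip(pattern, locations):
        p = 1 / (1 + math.exp(-A * (theta - b)))
        ll += math.log(p) if y else math.log(1 - p)
    return ll

def theta_mle(pattern):
    # Crude grid search for the theta that maximizes the likelihood
    grid = [t / 100 for t in range(-300, 301)]
    return max(grid, key=lambda t: loglik(pattern, t))

# Two different patterns with the same total score (three Yes's each)
print(theta_mle((1, 1, 1, 0, 0)), theta_mle((0, 0, 1, 1, 1)))
```

Both patterns maximize their likelihoods at the same theta, even though one of them looks far more "Guttman-consistent" than the other; with equal slopes, the total score is a sufficient statistic for the latent trait.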

The slope indicates how well the item is able to discriminate among the respondents.  A flat slope tells us that the probability of saying "Yes" changes slowly with increases or decreases in the latent variable.  A steep slope shows that the likelihood of "Yes" changes quickly over a small interval of the latent trait.  The contribution that each item makes to the latent score depends on its slope (see the section on scoring in the Wikipedia entry on item response theory).
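A small numerical check in Python of what the slope means: the item characteristic curve is steepest at the item's location, where its derivative equals slope/4, so a steeper curve packs more discrimination into a narrower band of the trait.

```python
import math

def icc(theta, location, slope):
    # Two-parameter logistic item characteristic curve
    return 1 / (1 + math.exp(-slope * (theta - location)))

def slope_at_location(location, slope, h=1e-6):
    # Numerical derivative of the curve at theta = location (its midpoint)
    return (icc(location + h, location, slope)
            - icc(location - h, location, slope)) / (2 * h)

a = 2.25  # the common slope from the constrained model
print(round(slope_at_location(0.0, a), 4), "=", a / 4)
```

The common slope of 2.25 translates into a maximum rate of change of about 0.56 in the probability of "Yes" per unit of the latent trait.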

Respondents get latent scores, and items get latent scores too. Both items and respondents can be placed on the same scale. Had we included more items, and had those additional items filled in the gaps at both the lower and upper ends of the scale, our latent scores would have looked more like z-scores and mirrored the distribution of the underlying latent variable.

The table below summarizes these results and shows where the five tests fall in relation to the respondents. 

Total Score   Latent Score   Percent      Item (Location)
     0           -1.07        18.0%
                                          Favorable (-0.58)
     1           -0.50        23.0%
                                          Consider (-0.16)
     2           -0.08        14.5%
                                          Purchase (0.07)
     3            0.31        16.5%
                                          Satisfied (0.49)
     4            0.72        13.0%
                                          Recommend (0.87)
     5            1.26        15.0%

Conclusions: Going beyond the Mathematics

Individual Difference Dimensions. The dimensions that IRT uncovers are between-person.  Different respondents see the same brand differently.  We can aggregate respondents and calculate the "brand strength" of many different brands.  But now the analysis ignores individual differences among the respondents and focuses on the brands.  It is important to maintain the distinction between brand-level and individual-level analyses.

Moreover, it is so easy to forget that what differentiates people at the lower end of the scale may not be what differentiates people at the upper end of the scale. Mathematical proficiency is assessed using arithmetic problems in the lower grades and algebra in the higher grades. As we have seen, different brand tests are necessary to measure low and high levels of brand strength.

Response Pattern Matrix. Although we can look item by item, we learn much more about individuals when we examine their pattern of responses across an array of items tapping different portions of the underlying trait. If it helps, think of it as a form of triangulation.  We also learn about the items.  Sometimes what we learn is that one or more items simply do not belong in the dimension and need to be removed.  Often we learn that we have not adequately measured the entire range of our latent construct and need additional items, especially at the upper and lower ends of the scale.

Of course, the response pattern matrix becomes unwieldy as the number of items increases.  For example, with 10 items the number of possible combinations is over a thousand and simply too large to be of any help.  However, regardless of the IRT model or the number of items, you will be able to interpret the results because you have an intuitive understanding of the connection between the response patterns and the IRT parameters.

Criterion-referenced interpretation. I can do more with your response profile than just locate your performance in comparison to others, although such norm-referenced information is important. If I am careful, I can select brand tests corresponding to milestones that I wish to achieve. Knowing how favorable my brand perceptions are is valuable on its own because it provides diagnostic information in that it might explain why the brand is not considered during purchase. What moves a customer at the low end of the brand strength scale is likely to be different than what moves a customer at the upper end of the same scale.


Appendix:  R code to generate data and run ltm

#use orddata package to generate random data
library(orddata)

#probabilities of No/Yes for each brand test (sets each item's location)
prob <- list(
  c(35, 65)/100,
  c(45, 55)/100,
  c(55, 45)/100,
  c(65, 35)/100,
  c(75, 25)/100
  )

#equal factor loadings for the five items (sets a common slope)
loadings <- matrix(c(
  .6,
  .6,
  .6,
  .6,
  .6),
5, 1, byrow=TRUE)

#creates the correlation matrix used as input
cor_matrix <- loadings %*% t(loadings)
diag(cor_matrix) <- 1

#generates 200 random ordinal observations (coded 1/2)
ord <- rmvord(n = 200, probs = prob, Cor = cor_matrix)

#calculates eigenvalues to check the size of the first principal component
library(psych)
principal(ord, nfactors=1)$values

library(ltm)
#recodes responses from 1/2 to 0/1 for ltm
ord <- ord - 1
descript(ord)

#likelihood ratio test: equal slopes (rasch) vs. free slopes (ltm)
anova(rasch(ord), ltm(ord ~ z1))

#two-parameter logistic model
fit <- ltm(ord ~ z1)
summary(fit)

#item characteristic curves
plot(fit)

#calculates latent trait scores for each response pattern
pattern <- factor.scores(fit)

#constrains slopes to be equal
fit2 <- rasch(ord)
plot(fit2)
summary(fit2)
scores2 <- factor.scores(fit2)