Sunday, June 5, 2016

Building the Data Matrix for the Task at Hand and Analyzing Jointly the Resulting Rows and Columns

Someone decided what data ought to go into the matrix. They placed the objects of interest in the rows and the features that differentiate among those objects into the columns. Decisions were made either to collect information or to store what was gathered for other purposes (e.g., data mining).

A set of mutually constraining choices determines what counts as an object and a feature. For example, in the above figure, the beer display case in the convenience store contains a broad assortment of brands in both bottles and cans grouped together by volume and case sizes. The features varying among these beers are the factors that consumers consider when making purchases for specific consumption situations: beers to bring on an outing, beers to have available at home for self and guests, and beers to serve at a casual or formal party.

These beers and their associated attributes would probably not have made it into a taste test at a local brewery. Craft beers come with their own set of distinguishing features: spicy, herbal, sweet and burnt (see the data set called beer.tasting.notes in the ExPosition R package). Yet, preference still depends on consumption occasion for a craft beer needs to be paired with food and desires vary by time of day and ongoing activities and who is drinking with you. What gets measured and how it is measured depends on the situation and the accompanying reasons for data collection (the task at hand).

It should be noted that data matrix construction may evolve over time and may change with context. Rows and columns are added and deleted in order to maintain a type of synchronization with one beer suggesting the inclusion of other beers in the consideration set and at the same time inserting new columns into the data matrix in order to differentiate among the additional beers.

Given such mutual dependencies, why would we want to perform separate analyses of the rows (e.g., cluster analysis) and the columns (e.g., factor analysis), as if row (column) relationships were invariant regardless of the columns (rows)? That is, the similarity of the rows in a cluster analysis depends on the variables populating the columns, and the correlations among the variables are calculated as shared covariation over the rows. The correlation between two variables changes with the individuals sampled (e.g., restriction of range), and the similarity between two observations fluctuates with the basis of comparison (e.g., features included in the object representation). Given the shared determination of what gets included as the rows and the columns of our data matrix, why wouldn't we turn to a joint scaling procedure such as correspondence analysis (when the cells are counts) or biplots (when the cells are ratings or other continuous measures)?

We can look at such a joint scaling using that beer.tasting.notes data set mentioned earlier in this post. Some 38 craft beers were rated on 16 different flavor profiles. Although the ExPosition R package contains functions that will produce a biplot, I will run the analysis with FactoMineR because it may be more familiar and I have written about this R package previously. The biplot below could have been shown in a single graph, but it might have been somewhat crowded with 38 beers and 16 ratings on the same page. Instead, by default, FactoMineR plots the rows on the Individuals factor map and the columns on the Variables factor map. Both plots come from a single principal component analysis with rows as points and columns as arrows. The result is a vector map with directional interpretation, so that points (beers) have perpendicular projections onto arrows (flavors) that reproduce as closely as possible the beer's rating on the flavor variable.

Beer tasting provides a valuable illustration of the process by which we build the data matrix for the task at hand. Beer drinkers are not as likely as those drinking wine to discuss the nuances of taste so that you might not know the referents for some of these flavor attributes. As a child, you learned which citrus fruits are bitter and sweet: lemons and oranges. How would you become a beer expert? You would need to participate in a supervised learning experience with beer tastings and flavor labels. For instance, you must taste a sweet and a bitter beer and be told which is which in order to learn the first dimension of sweet vs. bitter as shown in the above biplot.

Obviously, I cannot serve you beer over the Internet, but here is a description of Wild Devil, a beer positioned on the Individuals Factor Map in the direction pointed to by the labels Bitter, Spicy and Astringent on the Variables Factor Map.
It’s arguable that our menacingly delicious HopDevil has always been wild. With bold German malts and whole flower American hops, this India Pale Ale is anything but prim. But add a touch of brettanomyces, the unruly beast responsible for the sharp tang and deep funk found in many Belgian ales, and our WildDevil emerges completely untamed. Floral, aromatic hops still leap from this amber ale, as a host of new fermentation flavor kicks up notes of citrus and pine.
Hopefully, you can see how the descriptors in our columns acquire their meaning through an understanding of how beers are brewed. It is an acquired taste with the acquisition made easier by knowledge of the manufacturing process. What beers should be included as rows? We will need to sample all those variations in the production process that create flavor differences. The result is a relatively long feature list. Constraints are needed and supplied by the task at hand. Thankfully, the local brewery has a limited selection, as shown in the very first figure, so that all we need are ratings that will separate the five beers in tasting tray.

In this case the data matrix contains ratings, which seem to be the averages of two researchers. Had there been many respondents using a checklist, we could have kept counts and created a similar joint map with correspondence analysis. I showed how this might be done in an earlier post mapping the European car market. That analysis demonstrated how the Prius has altered market perceptions. Specifically, Prius has come to be so closely associated with the attributes Green and Environmentally Friendly that our perceptual map required a third dimension anchored by Prius pulling together Economical with Green and Environmental. In retrospect, had the Prius not been included in the set of objects being rated, would we have included Green and Environmentally Friendly in the attribute list?

Now, we can return to our convenience store with a new appreciation of the need to analyze jointly the rows and the columns of our data matrix. After all, this is what consumers do when they form consideration sets and attend only to the differentiating features for this smaller number of purchase alternatives. Human attention enables us to find the cooler in the convenience store and then ignore most of the stuff in that cooler. Preference learning involves knowing how to build the data matrix for the task at hand and thus simplifying the process by focusing only on the most relevant combination of objects and features.

Finally, I have one last caution. There is no reason to insist on homogeneity over all the cells of our data matrix. In the post referenced in the last paragraph, Attention is Preference, I used nonnegative matrix factorization (NMF) to jointly partition the rows and columns of the cosmetics market into subspaces of differing familiarity with brands (e.g., brands that might be sold in upscale boutiques or department stores versus more downscale drug stores). You should not be confused because the rows are consumers and not objects such as car models and beers. The same principles apply whenever goes in the rows. The question is whether all the rows and columns can be represented in a single common space or whether the data matrix is better described as a number of subspaces where blocks of rows and columns have higher density with sparsity across the rest of the cells. Here are these attributes that separate these beers, and there are those attributes that separate those beers (e.g., the twist-off versus pry-off cap tradeoff applies only to beer in bottles).

R Code to Run the Analysis:
# ExPosition must be installed
# load data set and examine its structure
# how ExPosition runs the biplot <- epPCA(beer.tasting.notes$data)
# FactoMiner must be installed
# how FactoMineR runs the biplot<-PCA(beer.tasting.notes$data)
Created by Pretty R at

Wednesday, May 25, 2016

Using Support Vector Machines as Flower Finders: Name that Iris!

Nature field guides are filled with pictures of plants and animals that teach us what to look for and how to name what we see. For example, a flower finder might display pictures of different iris species, such as the illustrations in the plot below. You have in hand your own specimen from your garden, and you carefully compare it to each of the pictures until you find a good-enough match. The pictures come from Wikipedia, but the data used to create the plot are from the R dataset iris: sepal and petal length and width measured on 150 flowers equally divided across three species.

I have lifted the code directly from the svm function in the R package e1071.
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y) 
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])
Created by Pretty R at

We will focus on the last block of R code that generates the metric multidimensional scaling (MDS) of the distances separating the 150 flowers calculated from sepal and petal length and width (i.e., the dist function applied to the first four columns of the iris data). Species plays no role in the MDS with the flowers positioned in a two-dimensional space in order to reproduce the pairwise Euclidean distances. However, species is projected onto the plot using color, and the observations acting as support vectors are indicated with plus signs (+).

The setosa flowers are represented by black circles and black plus signs. These points are separated along the first dimension from the versicolor species in red and virginica in green. The second dimension, on the other hand, seems to reflect some within-species sources of differences that do not differentiate among the three iris types.

We should recall that the dist function calculates pairwise distances in the original space without any kernel transformation. The support vectors, on the other hand, were identified from the svm function using a radial kernel and then projected back onto the original observation space. Of course, we can change the kernel, which defaults to "radial" as in this example from the R package. A linear kernel may do just as well with the iris data, as you can see by adding kernel="linear" to the svm function in the above code.

It appears that we do not need all 150 flowers in order to identify the iris species. We know this because the svm function correctly classifies over 97% of the flowers with 51 support vectors (also called "landmarks" as noted in my last post Seeing Similarity in More Intricate Dimensions). The majority of the +'s are located between the two species with the greatest overlap. I have included the pictures so that the similarity of the red and green categories is obvious. This is where there will be confusion, and this is where the only misclassifications occur. If your iris is a setosa, your identification task is relatively easy and over quickly. But suppose that your iris resembles those in the cluster of red and green pluses between versicolor and virginica. This is where the finer distinctions are being made.

By design, this analysis was kept brief to draw an analogy between support vector machines and finder guides that we have all used to identify unknown plants and animals in the wild. Hopefully, it was a useful comparison that will help you understand how we classify new observations by measuring their distances in a kernel metric from a more limited set of support vectors (a type of global positioning with a minimal number of landmarks or exemplars as satellites).

When you are ready with your own data, you can view the videos from Chapter 9 of An Introduction to Statistical Learning with Applications in R to get a more complete outline of all the steps involved. My intent was simply to disrupt the feature mindset that relies on the cumulative contributions of separate attributes (e.g., the relative impact of each independent variable in a prediction equation). As objects become more complex, we stop seeing individual aspects and begin to bundle features into types or categories. We immediately recognize the object by its feature configuration, and these exemplars or landmarks become the new basis for our support vector representation.

Monday, May 23, 2016

The Kernel Trick in Support Vector Machines: Seeing Similarity in More Intricate Dimensions

The "kernel" is the seed or the essence at the heart or the core, and the kernel function measures distance from that center. In the following example from Wikipedia, the kernel is at the origin and the different curves illustrate alternative depictions of what happens as we move away from zero.

At what temperature do you prefer your first cup of coffee? If we center the scale at that temperature, how do we measure the effects of deviations from the ideal level. The uniform kernel function tells us that closest to the optimal makes little difference as long as it is within a certain range. You might feel differently, perhaps it is a constant rate of disappointment as you move away from the best temperature in either direction (a triangular kernel function). However, for most us, satisfaction takes the form of exponential decay with a Gaussian kernel describing our preferences as we deviate from the very best.

Everitt and Hothorn show how it is done in R for density estimation. Of course, the technique works with any variable, not just preference or distance from the ideal. Moreover, the logic is the same: give greater weight to closer data. And how does one measure closeness? You have many alternatives, as shown above, varying from tolerant to strict. What counts as the same depends on your definition of sameness. With human vision the person retains their identity and our attention as they walk from the shade into the sunlight; my old camera has a different kernel function and fails to keep track or focus correctly. In addition, when the density being estimated is multivariate, you have the option of differential weighting of each variable so that some aspects will count a great deal and others can be ignored.

Now, with the preliminaries over, we can generalize the kernel concept to support vector machines (SVMs). First, we will expand our feature space because the optimal cup of coffee depends on more than its temperature (e.g., preparation method, coffee bean storage, variation on finest and method of grind, ratio of coffee to water, and don't forget the the type of bean and its processing). You tell me the profile of two coffees using all those features that we just enumerated, and I will calculate their pairwise similarity. If their profiles are identical, the two coffees are the same and centered at zero. But if they are not identical, how important are the differences? Finally, we ought to remember that differences are measured with respect to satisfaction, that is, two equally pleasing coffees may have different profiles but the differences are not relevant.

As the Mad Hatter explained in the last post, SVMs live in the observation space, in this case, among all the different cups of coffees. We will need a data matrix with a bunch of coffees for taste testing in the rows and all those features as columns, plus an additional column with a satisfaction rating or at least a thumbs-up or thumbs-down. Keeping it simple, we will stay with a classification problem distinguishing good from bad coffees. Can I predict your coffee preference from those features? Unfortunately, individual tastes are complex and that strong coffee may be great for some but only when hot. What of those who don't like strong coffee? It is, as if, we had multiple configurations of interacting nonlinear features with many more dimensions than can be represented in the original feature space.

Our training data from the taste tests might contain actual coffees near each of these configurations differentiating the good and the bad. These are the support vectors of SVMs, what Andrew Ng calls "landmarks" in his Coursera course and his more advanced class at Stanford. In this case, the support vectors are actual cups of coffee that you can taste and judge as good or bad. Chapter 9 of An Introduction to Statistical Learning will walk you through the steps, including how to run the R code, but you might leave without a good intuitive grasp of the process.

It would help to remember that a logistic regression equation and the coefficients from a discriminant analysis yield a single classification dimension when you have two groupings. What happens when there are multiple ways to succeed or fail? I can name several ways to prepare different types of coffee, and I am fond of them all. Similarly, I can recall many ways to ruin a cup of coffee. Think of each as a support vector from the training set and the classification function as a weighted similarity to instances from this set. If a new test coffee is similar to one called "good" from the training data, we might want to predict "good" for this one too. The same applies to coffees associated with the "bad" label.

The key is the understanding that the features from our data matrix are no longer the dimensions underlying this classification space. We have redefined the basis in terms of landmarks or support vectors. New coffees are placed along dimensions defined by previous training instances. As Pedro Domingos notes (at 33 minutes into the talk), the algorithm relies on analogy, not unlike case-based reasoning. Our new dimensions are more intricate compressed representations of the original features. If this reminds you of archetypal analysis, then you may be on the right track or at least not entirely lost.

Monday, May 9, 2016

The Mad Hatter Explains Support Vector Machines

"Hatter?" asked Alice, "Why are support vector machines so hard to understand?" Suddenly, before you can ask yourself why Alice is studying machine learning in the middle of the 19th century, the Hatter disappeared. "Where did he go?" thought Alice as she looked down to see a compass painted on the floor below her. Arrows pointed in every direction with each one associated with a word or phrase. One arrow pointed toward the label "Tea Party." Naturally, Alice associated Tea Party with the Hatter, so she walked in that direction and ultimately found him.

"And now," the Hatter said while taking Alice's hand and walking through the looking glass. Once again, the Hatter was gone. This time there was no compass on the floor. However, the room was filled with characters, some that looked more like Alice and some that seemed a little closer in appearance to the Hatter. With so many blocking her view, Alice could see clearly only those nearest to her. She identified the closest resemblance to the Hatter and moved in that direction. Soon she saw another that might have been his relative. Repeating this process over and over again, she finally found the Mad Hatter.

Alice did not fully comprehend what the Hatter told her next. "The compass works only when the input data separates Hatters from everyone else. When it fails, you go through the looking glass into the observation space where all we have is resemblance or similarity. Those who know me will recognize me and all that resemble me. Try relying on a feature like red hair and you might lose your head to the Red Queen. We should have some tea with Wittgenstein and discuss family resemblance. It's derived from features constructed out of input that gets stretched, accentuated, masked and recombined in the most unusual ways."

The Hatter could tell that Alice was confused. Reassuringly, he added, "It's a acquired taste that takes some time. We know we have two classes that are not the same. We just can't separate them from the data as given. You have to look it in just the right way. I'll teach you the Kernel Trick." The Mad Hatter could not help but laugh at his last remark - looking at in in just the right way could be the best definition of support vector machines.

Note: Joseph Rickert's post in R Bloggers shows you the R code to run support vector machines (SVMs) along with a number of good references for learning more. My little fantasy was meant to draw some parallels between the linear algebra and human thinking (see Similarity, Kernels, and the Fundamental Constraints on Cognition for more). Besides, Tim Burton will be releasing soon his movie Alice Through the Looking Glass, and the British Library is celebrating 150 years since the publication of Alice in Wonderland. Both Alice and SVMs invite you to go beyond the data as inputted and derive "impossible" features that enable differentiation and action in a world at first unseen.

Sunday, April 3, 2016

When Choice Modeling Paradigms Collide: Features Presented versus Features Perceived

What is the value of a product feature? Within a market-based paradigm, the answer is the difference between revenues with and without the feature. A product can be decomposed into its features, each feature can be assigned a monetary value by including price in the feature list, and the final worth of the product is a function of its feature bundle. The entire procedure is illustrated in an article using the function rhierMnlMixture from the R package bayesm (Economic Valuation of Product Features). Although much of the discussion concentrates on a somewhat technical distinction between willingness-to-pay (WTP) and willingness-to-buy (WTB), I wish to focus instead on the digital camera case study in Section 6 beginning on page 30. If you have question concerning how you might run such an analysis in R, I have two posts that might help: Let's Do Some Hierarchical Bayes Choice Modeling in R and Let's Do Some More Hierarchical Bayes Choice Modeling in R.

As you can see, the study varies seven factors, including price, but the goal is to estimate the economic return from including a swivel screen on the back of the digital camera. Following much the same procedure as that outlined in those two choice modeling posts mentioned in the last paragraph, each respondent saw 16 hypothetical choice sets created using a fractional factorial experimental design. There was a profile associated with each of the four brands, and respondents were asked to first select the one they most preferred and then if they would buy their most preferred brand at a given price.

The term "dual response" has become associated with this approach, and several choice modelers have adopted the technique. If the value of the swivel screen is well-defined, it ought not matter how you ask these questions, and that seems to be confirmed by some in the choice modeling community. However, outside the laboratory and in the field, commitment or stated intention is the first step toward behavior change. Furthermore, the mere-measurement effect in survey research demonstrates that questioning by itself can alter preferences. Within the purchase context, consumers do not waste effort deciding which of the rejected alternatives is the least objectionable by attending to secondary features after failing to achieve consideration on one or more deal breakers (i.e., the best product they would not buy). Actually, dual response originates as a sales technique because encouraging commitment to one of the offerings increases the ultimate purchase likelihood.

We have our first collision. Order effects are everywhere. It is one of the most robust findings in measurement. The political pollster wants to know how big a sales tax increase could be passed in the next election. You get a different answer when you ask about a one-quarter percent increase followed by one-half percent than when you reverse the order. Perceptual contrast is unavoidable so that one-half seems bigger after the one-quarter probe. I do not need to provide a reference because everyone is aware of order as one of the many context effects. The feature presented is not the feature perceived.

Our second collision occurs from the introduction of price as just another feature, as if in the marketplace no one ever asks why one brand is more expensive than another. We ask because price is both a sacrifice with a negative impact and a signal of quality with a positive weight. In fact, as one can see from the pricing literature, there is nothing simple or direct about price perception. Careful framing may be needed (e.g., maintaining package size but reducing the amount without changing price). Otherwise, the reactions can be quite dramatic for price increases can trigger attributions concerning the underlying motivation and can generate a strong emotional response (e.g., price fairness).

At times, the relationship between the feature presented and the feature perceived can be more nuanced. It would be reasonable to vary gasoline prices in terms of cost per unit of measurement (e.g., dollars per gallon or euros per liter). Yet, the SUV driver seems to react in an all-or-none fashion only when some threshold on the cost to fill up their tank has been exceeded. What is determinant is not the posted price but the total cost of the transaction. Thus, price sensitivity is a complex nonlinear function of cost per unit depending on how often one fills up with gasoline and the size of that tank. In addition, the pain at the pump depends on other factors that fail to make it into a choice set. How long will the increases last? Are higher prices seen as fair? What other alternatives are available? Sometimes we have no option but to live with added costs, reducing our dissonance by altering our preferences.

We see none of this reasoning in choice modeling where the alternatives are described as feature bundles outside of any real context. The consumer "plays" the game as presented by the modeler. Repeating the choice exercise with multiple choice sets only serves to induce a "feature-as-presented" bias. Of course, there are occasions when actual purchases look like choice models. We can mimic repetitive purchases from the retail shelf with a choice exercise, and the same applies to online comparison shopping among alternatives described by short feature lists as long as we are careful about specifying the occasion and buyers do not search for user comments.

User comments bring us back to the usage occasion, which tends to be ignored in choice modeling. Reading the comments, we note that one customer reports the breakage of the hinge on the swivel screen after only a few months. Is the swivel screen still an advantage or a potential problem waiting to occur? We are not buying the feature, but the benefit that the feature promises. This is the scene of another paradigm collision. The choice modeler assumes that features have value that can be elicited by merely naming the feature. They simplify the purchase task by stripping out all contextual information. Consequently, the resulting estimates work within the confines of their preference elicitation procedures, but do not generalize to the marketplace.

We have other options in R, as I have suggested in my last two posts. Although the independent variables in a choice model are set by the researcher, we are free to transform them, for instance, compute price as a logarithm or fit low-order polynomials of the original features. We are free to go farther. Perceived features can be much more complex and constructed as nonlinear latent variables from the original data. For example, neural networks enable us to handle a feature-rich description of the alternatives and fit adaptive basis functions with hidden layers.

On the other hand, I have had some success exploiting the natural variation within product categories with many offerings (e.g., recommender systems for music, movies, and online shopping like Amazon). By embedding measurement within the actual purchase occasion, we can learn the when, why, what, how and where of consumption. We might discover the limits of a swivel screen in bright sunlight or when only one hand is free. The feature that appeared so valuable when introduced in the choice model may become a liability after reading users' comments.

Features described in choice sets are not the same features that consumers consider when purchasing and imagining future usage. This more realistic product representation requires that we move from those R packages that restrict the input space (choice modeling) to those R packages that enable the analysis of high-dimensional sparse matrices with adaptive basis functions (neural networks and matrix factorization).

Bottom Line: The data collection process employed to construct and display options when repeated choice sets are presented one after another tends to simplify the purchase task and induce a decision strategy consistent with regression models we find in several R packages (e.g., bayesm, mlogit, and RChoice). However, when the purchase process involves extensive search over many offerings (e.g., music, movies, wines, cheeses, vacations, restaurants, cosmetics, and many more) or multiple usage occasions (e.g., work, home, daily, special events, by oneself, with others, involving children, time of day, and other contextual factors), we need to look elsewhere within R for statistical models that allow for the construction of complex and nonlinear latent variables or hidden layers that serve as the derived input for decision making (e.g., R packages for deep learning or matrix factorization).

Friday, March 25, 2016

Choice Modeling with Features Defined by Consumers and Not Researchers

Choice modeling begins with a researcher "deciding on what attributes or levels fully describe the good or service." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.

As I have argued in the previous post, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?

With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (see slide #2 from Yann LeCun's 2013 tutorial with video here). From this perspective, choice modeling is a form of  more shallow information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (the R package support.CEs aids in this process and you can find R code for hierarchical Bayes using bayesm in this link).

Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.

Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes.

R does offer an interface to deep learning algorithms. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).

Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variation on flakes, crunch or hot cereals). The joint blocks define the hidden layers simultaneously clustering respondents and typing products.

Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in a post on product category representation. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a data matrix reconstructed from parts that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.

We might expect to find something similar to what Jonathan Gutman reported from a qualitative study using a means-end analysis. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the cereals with crunch loading on that hidden feature with all the respondents wanting crunch with higher weights on the same hidden feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.

Monday, March 21, 2016

Understanding Statistical Models Through the Datasets They Seek to Explain: Choice Modeling vs. Neural Networks

R may be the lingua franca, yet many of the packages within the R library seem to be written in different languages. We can follow the R code because we know how to program but still feel that we have missed something in the translation.

R provides an open environment for code from different communities, each with their own set of exemplars, where the term "exemplar" has been borrowed from Thomas Kuhn's work on normal science. You need only to examine the datasets that each R package includes to illustrate its capabilities in order to understand the diversity of paradigms spanned. As an example, the datasets from the Clustering and Finite Mixture Task View demonstrate the dependence of the statistical models on the data to be analyzed. Those seeking to identifying communities in social networks might be using similar terms as those trying to recognize objects in visual images, yet the different referents (exemplars) change the meanings of those terms.

Thinking in Terms of Causes and Effects

Of course, there are exceptions, for instance, regression models can be easily understood across applications as the "pulling of levers" especially for those of us seeking to intervene and change behavior (e.g., marketing research). Increased spending on advertising yields greater awareness and generates more sales, that is, pulling the ad spending lever raises revenue (see the R package CausalImpact). The same reasoning underlies choice modeling with features as levers and purchase as the effect (see the R package bayesm).

The above picture captures this mechanistic "pulling the lever" that dominates much of our thinking about the marketing mix. The exemplar "explains" through analogy. You might prefer "adjusting the dials" as an updated version, but the paradigm remains cause-and-effect with each cause separable and under the control of the marketer. Is this not what we mean by the relative contribution of predictors? Each independent variable in a regression equation has its own unique effect on the outcome. We pull each lever a distance of one standard deviation (the beta weight), sum the changes on the outcome (sometimes theses betas are squared before adding), and then divide by the total.

The Challenge from Neural Networks

So, how do we make sense of neural networks and deep learning? Is the R package neuralnet simply another method for curve fitting or estimating the impact of features? Geoffrey Hinton might think differently. The Intro Video for Coursera's Neural Networks for Machine Learning offers a different exemplar - handwritten digit recognition. If he is curve fitting, the features are not given but extracted so that learning is possible (i.e., the features are not obvious but constructed from the input to solve the task at hand). The first chapter of Michael Nielsen's online book, Using Neural Nets to Recognize Handwritten Digits, provides the details. Isabelle Guyon's pattern recognition course adds an animated gif displaying visual perception as an active process.

On the other hand, a choice model begins with the researcher deciding what features should be varied. The product space is partitioned and presented as structured feature lists. What alternative does the consumer have, except to respond to variations in the feature levels? I attend to price because you keep changing the price. Wider ranges and greater variation only focus my attention. However, in real setting the shelves and the computer screens are filled with competing products waiting for consumers to define their own differentiating features. Smart Watches from Google Shopping provides a clear illustration of the divergence of purchase processes in the real world and in the laboratory.

To be clear, when the choice model and the neural network speak of input, they are referring to two very different things. The exemplars from choice modeling are deciding how best to commute and comparing a few offers for same product or service. This works when you are choosing between two cans of chicken soup by reading the ingredients on their labels. It does not describe how one selects a cheese from the huge assortment found in many stores.

Neural networks take a different view of the task. In less than five minutes Hinton's video provides the exemplar for representation learning. Input enters as it does in real settings. Features that successfully differentiate among the digits are learned over time. We see that learning in the video when the neural net generates its own handwritten digits for the numbers 2 and 8. It is not uncommon to write down a number that later we or others have difficulty reading. Legibility is valued so that we can say that an easier to read "2" is preferred over a "2" that is harder to identify. But what makes one "2" a better two than another "2" takes some training, as machine learning teaches us.

We are all accomplished at number recognition and forget how much time and effort it took to reach this level of understanding (unless we know young children in the middle of the learning process). What year is MCMXCIX? The letters are important, but so are their relative positions (e.g. X=10 and IX=9 in the year 1999). We are not pulling levers any more, at least not until the features have been constructed. What are those features in typical choice situations? What you want to eat for breakfast, lunch or dinner (unless you snack instead) often depends on your location, available time and money, future dining plans, motivation for eating, and who else is present (context-aware recommender systems).

Adopting a different perspective, our choice modeler sees the world as well-defined and decomposable into separate factors that can be varied systematically according to some experimental design. Under such constraints the consumer behaves as the model predicts (a self-fulling prophecy?). Meanwhile, in the real world, consumers struggle to learn a product representation that makes choice possible.

Thinking Outside the Choice Modeling Box

The features we learn may be relative to the competitive set, which is why adding a more expensive alternative makes what is now the mid-priced option appear less expensive. Situation plays an important role for the movie I view when alone is not the movie I watch with my kids. Framing has an impact, which is why advertising tries to convince you that an expensive purchase is a gift that you give to yourself. Moreover, we cannot forget intended usage for that Smartphone is a camera, a GPS, and I believe you get the point. We may have many more potential features than included in our choice design.

It may be the case that the final step before purchase can be described as a tradeoff among a small set of features varying over only a few alternatives in our consideration set. If we can mimic that terminal stage with a choice model, we might have a good chance to learn something about the marketplace. How did the consumer get to that last choice point? Why these features and those alternative products or services? In order to answer such questions, we will need to look outside the choice modeling box.

Friday, January 8, 2016

A Data Science Solution to the Question "What is Data Science?"

As this flowchart from Wikipedia illustrates, data science is about collecting, cleaning, analyzing and reporting data. But is it data science or just or a "sexed up term" for Statistics (see embedded quote by Nate Silver)? It's difficult to separate the two at this level of generality, so perhaps we need to define our terms.

We begin by making a list of all the stuff that a data scientist might do or know. We are playing a game where the answer is "data scientist" and the questions are "Do they do this?" and "Do they know that?". However, the "this" and the "that" are very specific. For example, "Data is Processed" can range from simple downloading to the complex representation of visual or speech input. What precisely does a data scientist do when they process data that a programmer or a statistician does not do?

To be clear, I am constructing a very long questionnaire that I intend to distribute to individuals calling themselves data scientists along with everyone else claiming that they too do data science, although by another name. A checklist will work in our game of Twenty Questions as long as the list is detailed and exhaustive. You are welcome to add suggestions as comments to this post, but we can start by expanding on each of the boxes in the above data science flowchart.

Since I am a marketing researcher, I am inclined to analyze the resulting data matrix as if it were a shopping cart filled with items purchased from a grocery store or an inventory of downloads from a video or music provider. The rows are respondents, and the columns are all the questions that might be asked to distinguish among all the various players. Let's not include sexy as a column.

You may have guessed that I am headed toward some type of matrix factorization. Can we recognize patterns in the columns that reflect different configurations of study and behavior? Are there communities composed of rows clustered together with similar practices and experiences? R provides most of us who have some experience running factor and cluster analyses with a "doable" introduction to non-negative matrix factorization (NMF). You can think of it as simultaneous clustering of the rows and columns in a data matrix. My blog is filled with examples, none of which are easy, but none of which are incomprehensible or beyond your ability to adapt to your own datasets.

What are we likely to find? Will we discover something like anchor words from topic modeling? For instance, it is necessary to work with multiple datasets from different disciplines to be a data scientist? Would I stop calling myself a marketing scientist if I started working with political polling data? Some argue that one becomes a statistician when they begin consulting with others from divergent fields of study.

What about teaching to students with varied backgrounds in universities or industry? Do we call it data science if one writes and distributes software that others can apply with data across diverse domains? Does proving theorems make one a statistician? How many languages must one know before they are a programmer? What role does computation play when making such discriminations?

What will we learn from dissecting the "corpus" (the detailed body of what we do and know summarized by the boxes in the above data science process)? Extending this analogy, I am recommending that the "physician, heal thyself" by applying data science methodology to provide a response to the "What is Data Science?" question. 

Hopefully, we can avoid the hype and the caricature from the popular press (sexiest job of 21st century). Moreover, I suggest that we resist the tendency to think metaphorically in terms of contrasting ideals. The simple act of comparing statisticians and data scientists shapes our perceptions and leads us to see the two as more dissimilar than suggested by their training and behavior. The distinction may be more nuance than substance, reflecting what excites and motivates rather than what is known or done. The basis for separation may reside in how much personal satisfaction is derived from the subject matter or the programming rather than the computational algorithm or the generative model.