Choice modeling begins with a researcher "deciding on what attributes or levels fully describe the good or service." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.
As I have argued in the previous post, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?
With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (see slide #2 from Yann LeCun's 2013 tutorial with video here). From this perspective, choice modeling is a form of more shallow information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (the R package support.CEs aids in this process and you can find R code for hierarchical Bayes using bayesm in this link).
Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.
Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes.
R does offer an interface to deep learning algorithms. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).
Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variation on flakes, crunch or hot cereals). The joint blocks define the hidden layers simultaneously clustering respondents and typing products.
Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in a post on product category representation. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a data matrix reconstructed from parts that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.
We might expect to find something similar to what Jonathan Gutman reported from a qualitative study using a means-end analysis. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the cereals with crunch loading on that hidden feature with all the respondents wanting crunch with higher weights on the same hidden feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.