I recently bought a berry crumble from Wal-Mart that didn't live up to my expectations: it had way too much sugar and no oatmeal or cake batter, so the entire crumble went into the compost after a few bites. Had I confused crumble with cobbler? To avoid wasting another 5 dollars in the future, I did what any data scientist would do: ran Principal Component Analysis (PCA) on text-mined recipes from the internet. I formatted the data into a table, where the first column records whether "crumble" or "cobbler" appears in the title, and each remaining column is 1 or 0 depending on whether the word in its header appears in the recipe's text. I uploaded the table and ran PCA as described in a previous post. The scatterplot of component 1 vs component 2 looks like:
From the scatterplot, there's no clear distinction between the two, except for the cluster of cobblers on the lower right-hand side. The first component is able to separate a few cobblers from the rest of the recipes using the equation:
pca1 = drink*0.120 + alcoholic*0.103 + quail*0.081 + christmas*0.072 + liqueur*0.063
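To make the equation concrete, here is a minimal sketch of how one recipe's text might be turned into 0/1 word indicators and scored with these weights. The two recipe snippets and the helper names (`word_indicators`, `component_score`) are made up for illustration, and the sketch applies the weights directly to the raw 0/1 values; a PCA tool may center the columns first, which would shift every score by the same constant.

```python
import re

# Weights copied from the first-component equation above.
PCA1_WEIGHTS = {
    "drink": 0.120,
    "alcoholic": 0.103,
    "quail": 0.081,
    "christmas": 0.072,
    "liqueur": 0.063,
}


def word_indicators(text, vocabulary):
    """Return {word: 1 or 0} for whether each vocabulary word appears in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {word: int(word in tokens) for word in vocabulary}


def component_score(indicators, weights):
    """Weighted sum of the 0/1 indicators, as in the equation above."""
    return sum(weight * indicators.get(word, 0) for word, weight in weights.items())


# Hypothetical recipe snippets, invented for this example.
recipe_a = "An alcoholic drink of sherry, sugar, and orange liqueur served over ice."
recipe_b = "Bake the peaches under a biscuit topping until golden."

score_a = component_score(word_indicators(recipe_a, PCA1_WEIGHTS), PCA1_WEIGHTS)
score_b = component_score(word_indicators(recipe_b, PCA1_WEIGHTS), PCA1_WEIGHTS)
print(round(score_a, 3), round(score_b, 3))
```

Recipes that mention several of the weighted words get a high score and land far from the rest along component 1, while recipes that mention none of them score zero.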
Turns out there is a type of cocktail called a "cobbler", and that is what this first component is successfully separating out of the recipe set. The next component is:
pca2 = bake*0.353 + fruit*0.333 + gourmet*0.307 + dessert*0.217
This component picks up that cobblers are slightly more likely than crumbles to be baked and to contain fruit (as opposed to vegetables). However, unlike with the alcoholic cobblers, there's no clear delineation between the two. But now that I've written the script for grabbing recipes and creating these tables, why not ask another question I've wondered about: what is the difference between a lunch food and a breakfast food? Following the same process, the scatterplot looks like:
It's possible to separate a lot of the lunch recipes from the breakfasts, but almost all the breakfasts overlap with the lunch space and so could be considered lunches. The first component is:
pca1 = oyster*0.390 + shrimp*0.365 + low cholesterol*0.240 + prune*0.208 + celery*0.202 + dried fruit*0.202 + parsley*0.149 + seafood*0.113
So seafood and low-cholesterol are indicators of a recipe being a lunch food and not a breakfast food. This seems right; eggs are not a low-cholesterol food, and there are few breakfast foods that involve seafood. Unlike in the crumble vs cobbler example, this first component explains far more of the variance than the next components.
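How much of the variance a component explains can be checked directly: it is the largest eigenvalue of the table's covariance matrix divided by the sum of the per-column variances. The sketch below is a rough pure-Python illustration on a tiny, made-up 0/1 table (not the recipe data from this post), and it uses power iteration as a stand-in for whatever a real PCA tool does internally.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of equal-length lists)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    d = len(means)
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric covariance matrix, found by power iteration."""
    d = len(matrix)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(vi * mvi for vi, mvi in zip(v, mv))  # Rayleigh quotient (v has unit length)


# Hypothetical 0/1 table: rows are recipes, columns are (seafood, egg, toast).
table = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]

cov = covariance_matrix(table)
total_variance = sum(cov[i][i] for i in range(len(cov)))  # trace = sum of all eigenvalues
first_ratio = top_eigenvalue(cov) / total_variance
print(f"component 1 explains {first_ratio:.0%} of the variance")
```

A ratio close to 1 means the first component alone captures most of the spread in the table, which is the situation described for the lunch vs breakfast data.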
PCA can help answer similar questions across many domains. Instead of identifying the groups of ingredients that define different types of food, you might be identifying the common words in human-written feedback and descriptions of failures. Or you might be looking at the space of all descriptions of other products to find new potential products, or products that may be most familiar. For example, we might use our knowledge that breakfasts rarely include seafood to open a seafood restaurant that serves in the morning, taking advantage of an untapped market.