Practical Data Science Cookbook (Second Edition)

How it works...

In step 2, there is definitely some interesting magic at work, with a lot being done in only a few lines of code. This is both a beautiful and a problematic aspect of R. It is beautiful because it allows the concise expression of programmatically complex ideas, but it is problematic because R code can be quite inscrutable if you are not familiar with the particular library.

In the first line, we use dlply (not ddply) to take the gasCars4 data frame, split it by year, and then apply the unique function to the make variable. For each year, the unique automobile makes available are computed, and dlply returns a list of these results (one element per year). We use dlply rather than ddply because dlply takes a data frame (d) as input and returns a list (l) as output, whereas ddply takes a data frame (d) as input and returns a data frame (d):

uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make)) 
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
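To make the shape of the dlply result more concrete, here is a minimal base R sketch on a made-up data frame. The toyCars and toyUniqMakes names and all of the values are hypothetical and are not taken from the vehicles data. The sketch uses split and lapply as a rough stand-in for the dlply call; it is not how plyr is implemented, only an illustration of the split-then-apply idea.

```r
# Hypothetical toy data standing in for gasCars4 (values are invented)
toyCars <- data.frame(
  year = c(1984, 1984, 1984, 1985, 1985, 1986, 1986),
  make = c("Ford", "Honda", "Ford", "Ford", "Honda", "Ford", "Toyota"),
  stringsAsFactors = FALSE
)

# Split the make column by year, then keep the unique makes within each year.
# The result is a named list with one element per year.
toyUniqMakes <- lapply(split(toyCars$make, toyCars$year), unique)

# Inspect the structure of the resulting list
str(toyUniqMakes)
```

Each element of toyUniqMakes is named after a year and holds that year's distinct makes, which is the same kind of structure that uniqMakes has in the recipe.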

The second line is even more interesting. It uses the Reduce higher-order function, which embodies the same idea as the reduce step in the MapReduce programming paradigm popularized by Google that underlies Hadoop. R is, in some ways, a functional programming language and offers several higher-order functions as part of its core. A higher-order function accepts another function as input. In this line, we pass the intersect function to Reduce, which applies it cumulatively across the list of unique makes per year created previously: it intersects the first two elements, then intersects that result with the third element, and so on. Ultimately, this yields a single set of the automobile makes that are present in every year.
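The folding behavior of Reduce can be seen on a small hand-made list. The following is a sketch only: toyMakes, toyCommon, and toySteps are hypothetical names, and the makes listed are invented for illustration. The accumulate argument is part of base R's Reduce and keeps each intermediate result.

```r
# Hypothetical per-year vectors of makes (not taken from the real data)
toyMakes <- list(
  "1984" = c("Ford", "Honda", "Dodge"),
  "1985" = c("Ford", "Honda", "Toyota"),
  "1986" = c("Honda", "Ford")
)

# Reduce folds intersect over the list from left to right, equivalent to:
# intersect(intersect(toyMakes[[1]], toyMakes[[2]]), toyMakes[[3]])
toyCommon <- Reduce(intersect, toyMakes)

# accumulate = TRUE returns every intermediate result of the fold as a list
toySteps <- Reduce(intersect, toyMakes, accumulate = TRUE)

print(toyCommon)
print(toySteps)
```

In this toy example, only Ford and Honda appear in all three years, so they are the only makes left in toyCommon. The toySteps list shows how the running intersection shrinks as each year is folded in.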

The two lines of code express a very simple concept (determining all automobile makes present every year) that took two paragraphs to describe.