Learn to Code in R: for Loops and tapply, lapply, and sapply.
Continuing the discussion of for loops and apply functions brings us to another set of apply functions used to, well, apply a function to data in different ways.
In this post, I will be:
- Discussing the arrays or data arrangements for which the different apply functions are designed. That is, when to use each one.
- Comparing for loops to tapply, lapply, and sapply.
- Writing for loops for each so you can better familiarize yourself with for loops and with the situations where you can use the apply functions instead.
The data I will be using for this is the same data set that I used for the apply function post. This is some code I used to prepare the data and get it to its current state, some of which I will be discussing later. I mostly provide this for the sake of disclosure and clarity.
lapply and sapply: Apply a Function over a Vector or List
This is the most obvious replacement for a for loop. You give lapply the set of information that you wish to iterate through, and it will apply whatever function you give it, be it canned or custom built, to each element. It is called "lapply" because the output will be a list. The other, very similar function, sapply, attempts to output the information in a simplified form, often a vector or matrix as opposed to a list. However, if each iteration returns more than a single value and the results differ in length, it will still produce a list. Please consult R's help documentation for this function for further details.
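To make the difference concrete, here is a minimal sketch. The vector and the anonymous functions are made up for illustration and are not from the data set used in this post.

```r
# Toy input, made up for illustration
x <- c(1, 2, 3)

# lapply always returns a list, one element per input element
squares_list <- lapply(x, function(v) v^2)

# sapply simplifies to a numeric vector when every result has length 1
squares_vec <- sapply(x, function(v) v^2)

# When the results differ in length, sapply cannot simplify
# and falls back to returning a list
ragged <- sapply(1:3, function(n) seq_len(n))

print(squares_list)
print(squares_vec)
print(ragged)
```

If you need a guaranteed output type, vapply is a stricter relative of sapply that is worth reading about in the help documentation.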
As a reminder, if you would like to access R's help documentation, type ? followed immediately, with no space, by the name of the function whose documentation you wish to see. For example, for lapply you would enter "?lapply" into the R console.
One great use for lapply or sapply is to check the data types of all the columns in a data frame. This is very useful if you are reading in an external data file and want to know whether numeric columns were properly read in as numeric, etc.
Take note of the difference in the output. Here sapply simplified the output to a character vector, and lapply produced a list, as it always does.
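As a sketch of this column check, the code below builds a small stand-in data frame. The name census and all of the values are assumptions made up for illustration; only the column name TotalHouseholds follows this post.

```r
# Hypothetical stand-in for the census data frame (values are made up)
census <- data.frame(
  city = c("Springfield, Illinois", "Austin, Texas", "Dallas, Texas"),
  TotalHouseholds = c(50000, 400000, 500000),
  stringsAsFactors = FALSE
)

# lapply returns a list with one entry per column
col_classes_list <- lapply(census, class)

# sapply simplifies the same result to a named character vector
col_classes_vec <- sapply(census, class)

print(col_classes_list)
print(col_classes_vec)
```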
You may have noticed that lapply appears in my data prep code, where it is used to create the US state column in the data frame. Let me take you through this as another example. The strsplit function separates single strings into segments and returns a list with one element per input string. I told it to split on the "," in the city description so it would separate the city name from the state.
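Here is a small sketch of what strsplit returns. The city strings are made up, and I am assuming the descriptions follow a "City, State" pattern.

```r
# Made-up city descriptions in an assumed "City, State" format
cities <- c("Springfield, Illinois", "Austin, Texas")

# Split each string on the comma; the result is a list of character vectors
parts <- strsplit(cities, ",")

# Each list element holds the city name, then the state
# (the state keeps its leading space because we split on "," only)
print(parts)
```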
I then use lapply (or sapply) to move through each element of that list and keep only the second string in each list element. The second string is the state name in every case, so I keep this information and use it to create the US state variable. Again, take notice of how lapply and then sapply provide their respective output.
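The extraction step might look something like the sketch below. The input strings are made up, and trimws is my own addition to drop the space left after the comma; it may not match the original prep code.

```r
# Made-up city descriptions in an assumed "City, State" format
cities <- c("Springfield, Illinois", "Austin, Texas", "Dallas, Texas")
parts <- strsplit(cities, ",")

# Keep the second piece of each split string and trim the leading space
state_list <- lapply(parts, function(p) trimws(p[2]))  # a list
state_vec  <- sapply(parts, function(p) trimws(p[2]))  # a character vector

print(state_list)
print(state_vec)
```

The sapply result is the more convenient of the two here, since a character vector can be assigned directly as a new data frame column.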
The same thing can be accomplished with a for loop, but it takes much longer than sapply: sapply took 0.083 seconds while the for loop took 2.8 seconds. This is a simple example where either method runs quickly. Keep in mind that the more complex the process, the greater the difference in efficiency becomes.
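A rough sketch of such a comparison is below, using made-up strings rather than the real data. I have not reproduced the timings quoted above; they depend on the machine and on how the loop is written (this version preallocates its result, which narrows the gap compared with growing a vector inside the loop).

```r
# Made-up input, repeated to give the timer something to measure
cities <- rep(c("Springfield, Illinois", "Austin, Texas"), times = 5000)
parts <- strsplit(cities, ",")

# for loop version: preallocate the result, then fill it one element at a time
loop_time <- system.time({
  state_loop <- character(length(parts))
  for (i in seq_along(parts)) {
    state_loop[i] <- trimws(parts[[i]][2])
  }
})

# sapply version of the same computation
sapply_time <- system.time(
  state_sapply <- sapply(parts, function(p) trimws(p[2]))
)

print(loop_time)
print(sapply_time)

# Both approaches should produce the same vector
print(identical(state_loop, state_sapply))
```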
tapply: Apply a Function over a Ragged Array
This sounds more technical than it really is. When the R help documentation says "ragged array," it is referring to data, such as a data frame, where one column contains the numeric information and another column provides the labels or groupings associated with that information. It is "ragged" because the groups do not need to contain the same number of rows.
Let's look again at the data frame that we will be using and I'll point out what I'm talking about.
The "TotalHouseholds" column holds numeric information, and the "state" column provides the grouping information that tells us to which state each of the rows belongs. If I wanted to see the sum total of households by US state, I could use tapply to accomplish this.
The way that we call this function and provide it the information it needs is as follows:
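As a generic sketch, tapply takes the values, then the groupings, then the function to apply. X, INDEX, and FUN are the argument names in R's documentation; the toy vectors are made up.

```r
# Toy values and their group labels (made up for illustration)
values <- c(10, 20, 30, 40, 80)
groups <- c("a", "b", "a", "b", "a")

# tapply(X = the information, INDEX = the groupings, FUN = the function)
group_means <- tapply(X = values, INDEX = groups, FUN = mean)

print(group_means)
```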
For this data, the function call looks like:
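Since the real data frame is not reproduced here, the sketch below uses a small stand-in. The name census and the values are made up; the column names TotalHouseholds and state follow this post.

```r
# Hypothetical stand-in for the census data frame (values are made up)
census <- data.frame(
  state = c("Illinois", "Texas", "Texas", "Illinois", "District of Columbia"),
  TotalHouseholds = c(100, 200, 300, 50, 75),
  stringsAsFactors = FALSE
)

# Sum the household counts within each state
households_by_state <- tapply(census$TotalHouseholds, census$state, sum)

print(households_by_state)
```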
The information is the total households count for each city. The groupings information is the US state, and we are applying the sum function. Using a for loop to produce the same output is as follows.
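A for loop along these lines might look like the sketch below. The census data frame is a made-up stand-in, and the helper names states and tapply_sums are my own; sumvec is the object name used in this post.

```r
# Hypothetical stand-in for the census data frame (values are made up)
census <- data.frame(
  state = c("Illinois", "Texas", "Texas", "Illinois", "District of Columbia"),
  TotalHouseholds = c(100, 200, 300, 50, 75),
  stringsAsFactors = FALSE
)

# Iterate over the unique states found in the data itself
states <- unique(census$state)

# Preallocate a named vector to hold one sum per state
sumvec <- numeric(length(states))
names(sumvec) <- states

for (s in states) {
  sumvec[s] <- sum(census$TotalHouseholds[census$state == s])
}

print(sumvec)

# Compare against tapply, matching by name because the order can differ
tapply_sums <- tapply(census$TotalHouseholds, census$state, sum)
print(all(sumvec[names(tapply_sums)] == as.vector(tapply_sums)))
```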
You'll see the numbers match. I initialized a vector of the unique elements of the state column and iterated through that, placing the sums in the sumvec object. I did this instead of iterating through state.name because the census contains information from other territories, like DC, that would not be included otherwise.
One additional feature of tapply is that it provides names for the output, so the state name is provided along with the sum in this case. Another feature that should be pointed out is that you can supply more than one grouping variable, which you pass to tapply as a list. A generic example would be:
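As a sketch of passing two grouping variables, the code below adds a second grouping column called size. That column, the data frame name, and all of the values are made up for illustration.

```r
# Hypothetical data with two grouping columns (values are made up)
census <- data.frame(
  state = c("Illinois", "Texas", "Texas", "Illinois"),
  size  = c("large", "large", "small", "small"),
  TotalHouseholds = c(100, 200, 300, 50),
  stringsAsFactors = FALSE
)

# Pass the grouping variables together as a list
by_state_size <- tapply(
  census$TotalHouseholds,
  list(census$state, census$size),
  sum
)

# Rows are states, columns are size categories
print(by_state_size)
```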
If you provide 1 grouping variable, the output will be a named vector (technically a one-dimensional array). If 2, then it will be a matrix. If 3, then it will be a 3-D array. And so forth. This comes in handy when dealing with more complicated data, because you can then use apply or sweep to work with that output. This will get things done MUCH more efficiently, both in system time and in the amount of code you have to write. It just takes a little more time and thought to wrap your brain around the computations and what you are doing.
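To illustrate working with that output, here is a sketch that takes a two-way tapply result and feeds it to apply and sweep. The data frame, the size column, and the values are made up for illustration.

```r
# Hypothetical data with two grouping columns (values are made up)
census <- data.frame(
  state = c("Illinois", "Texas", "Texas", "Illinois"),
  size  = c("large", "large", "small", "small"),
  TotalHouseholds = c(100, 200, 300, 50),
  stringsAsFactors = FALSE
)

# Two grouping variables give a matrix: states in rows, sizes in columns
by_state_size <- tapply(
  census$TotalHouseholds,
  list(census$state, census$size),
  sum
)

# apply over rows (margin 1) to get a total per state
state_totals <- apply(by_state_size, 1, sum)

# sweep divides each row by its state total, giving within-state shares
shares <- sweep(by_state_size, 1, state_totals, "/")

print(state_totals)
print(shares)
```

Note that any combination of groups that never occurs in the data shows up as NA in the tapply output, so real data may need na.rm = TRUE or similar handling before summing.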
While there is always more to discuss and additional detail I could give, I think that I will end this post right there, having provided a basic discussion of the why and how of these functions. As always, feel free to ask me any specific questions at this site. Be aware that there is a nominal $1.50 fee to submit questions. That is because it takes time and effort to respond to your questions. Also, I'm open to suggestions for future posts, so please let me know in the comments if there is a topic you would like me to discuss.