Learn to Code in R: For Loops and the Apply Function
When analyzing data you often have to iterate through a set of values or apply the same function to each value in that set. R does not have a great reputation for iterative processes, but the apply functions offer a way around writing a slow for loop. Mastering the apply functions will make your code more efficient and versatile.
In this post, I will discuss the following:
- for loops. When and how to use them. I'll also briefly mention while loops.
- How to use apply. I will address tapply, lapply, and sapply in a subsequent post.
To help demonstrate how the apply function can be used instead of a for loop, I will carry out the same task using both methods. Before I do that, I am going to go over some looping basics for those who may be unfamiliar or may need a review.
For Loop Basics
A for loop iterates through the elements of a vector (a set of values). At each iteration, the current element is assigned to the variable named in the for statement. One convention is to call it "i", but you can use whatever name you want. It is the same as saving each value to "i" in turn, so that i becomes the next element at each iteration. If this is confusing, then maybe an example will clear things up.
In the graphic below I try to demonstrate what this looks like when you iterate through a character or factor class vector.
The character vector state.name ships with R (it lives in the datasets package) and contains the names of the 50 states that make up the USA. Another common method is to provide a vector of numeric values that act as indices into the elements of your data.
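As a minimal sketch of both approaches (the variable names first_states, state, and i are my own choices for illustration, and only the first three states are used to keep the output short):

```r
# Iterate directly over the values of a character vector.
# state.name ships with R; take the first three names only.
first_states <- state.name[1:3]
for (state in first_states) {
  print(state)  # prints "Alabama", then "Alaska", then "Arizona"
}

# Iterate over positions instead, and use the index to look up each value.
for (i in seq_along(first_states)) {
  cat(i, first_states[i], "\n")
}
```

seq_along() is used rather than 1:length(x) because it behaves sensibly even when the vector is empty.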
If I have a more complex iterative process, I tend to use this method because I'm inserting information into a data frame or matrix row by row at each iteration. This requires me to reference rows which I couldn't do iterating through character values.
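Here is a small sketch of what filling a data frame row by row can look like. The task (recording the length of each state name) and the names states and name_info are hypothetical, chosen only to illustrate indexing by row:

```r
# Hypothetical task: record each state's name and its number of characters.
states <- state.name[1:5]

# Pre-allocate the result so each iteration only fills in an existing row.
name_info <- data.frame(
  state = character(length(states)),
  n_chars = integer(length(states)),
  stringsAsFactors = FALSE
)

# The index i selects both the input element and the output row.
for (i in seq_along(states)) {
  name_info$state[i] <- states[i]
  name_info$n_chars[i] <- nchar(states[i])
}

name_info
```

Iterating over the names themselves would not tell you which row to write to, which is why the numeric index is handy here.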
Now, there is nothing inherently wrong with running a for loop in R. The issue is that R can run for loops relatively slowly because it is an interpreted language. The interpreter, which is itself written in C (a language that handles iteration efficiently), has to evaluate the body of the loop again at every iteration, and that repeated overhead makes things take longer. When you have a simple process with few iterations, R will still run plenty fast. Once the process becomes more complex with more iterations, you'll notice a greater difference between a for loop and apply.
Regular Apply - Apply a Function to Margins of an Array or Matrix
Apply gives you a way to apply a function across a dimension (or dimensions) of a data frame or matrix, for example summing each row of a matrix. Hence the name "apply", which gets at applying functions to data in various ways.
For these examples, I used US census data that has various tidbits of demographic information. This data set holds household size and population information on the city level, which the census calls "Place" level data.
The apply function applies a function to each row, each column, or each slice along an additional dimension (in the case of an array). Getting the sum of each column is a typical example. It's pretty straightforward and can be a very powerful tool for efficient computations.
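To make rows versus columns concrete, here is a minimal sketch on a tiny made-up matrix (m, row_sums, and col_sums are names I chose for illustration, not part of the census example):

```r
# A small made-up matrix: 2 rows, 3 columns, filled column by column.
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # margin 1: one result per row    -> 9, 12
col_sums <- apply(m, 2, sum)  # margin 2: one result per column -> 3, 7, 11
```

The second argument is the only thing that changes between the two calls.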
In this example I am going to sum the rows for the household size columns. Those are the columns with "1_Person" to "7+_Person" and are the 5th through 11th columns. I point this out because I will be subsetting this data frame to do this. You can get household totals for each city by summing each row. I will carry this out as a for loop and then using the apply function.
Here I initialized the vector "tot" with NAs and then inserted the sum of the ith row into the ith element of "tot". I provided the first few values so you can compare them to the total households column and see that they are the same. This loop took a total of 5.349 seconds to run. That's not long to wait, especially considering that it iterated through all 29,247 cities in this data set.
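The census data is not included here, so the following is only a sketch of that kind of loop on a stand-in data frame. The data frame households, its column names, and its values are invented for illustration; only the pattern (initialize "tot" with NAs, fill the ith element with the ith row's sum) follows the description above.

```r
# Stand-in for the census data: three made-up "cities" with counts of
# 1-, 2-, and 3-person households. All values are invented.
households <- data.frame(
  city = c("A", "B", "C"),
  one_person = c(10, 20, 30),
  two_person = c(5, 15, 25),
  three_person = c(1, 2, 3)
)

# Initialize the result with NAs, then fill element i with the sum of row i.
# Columns 2 through 4 hold the household counts in this stand-in.
tot <- rep(NA, nrow(households))
for (i in seq_len(nrow(households))) {
  tot[i] <- sum(households[i, 2:4])
}

tot  # 16 37 58
```

To time a loop like this on your own data, you could wrap it in system.time(); the result will depend on your machine and the size of the data.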
Let's contrast this to using the apply function.
You will see that the output provided matches the total households column, and it only took 0.071 seconds to run! Much faster.
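For comparison, here is a sketch of the apply version on an invented stand-in data frame (households and its columns are hypothetical, not the census data). The last two calls also show an extra argument, na.rm = TRUE, being handed through to sum.

```r
# Stand-in for the census data; all values are invented.
households <- data.frame(
  city = c("A", "B", "C"),
  one_person = c(10, 20, 30),
  two_person = c(5, 15, 25),
  three_person = c(1, 2, 3)
)

# Sum across the household-size columns (2 through 4) for every row at once.
tot_apply <- apply(households[, 2:4], 1, sum)  # row sums: 16, 37, 58

# Introduce a missing value to see how extra arguments reach sum().
households$three_person[2] <- NA
with_na <- apply(households[, 2:4], 1, sum)                # row 2 is NA
without_na <- apply(households[, 2:4], 1, sum, na.rm = TRUE)  # row 2 is 20 + 15 = 35
```

No result vector has to be initialized and no index has to be managed, which is much of the appeal.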
Let me explain the code for apply real quick. The first argument is censdat[,5:11], which is the data frame, subset to include only the 5th through 11th columns. The next argument is the margin over which to apply the function. A value of 1 indicates the rows, and 2 indicates the columns. You can give it higher values if you pass it an array, but that's beyond the scope of this discussion. The third argument is the function you want applied. In this case it was the sum function. Any additional arguments you pass to apply are handed on to the applied function as its options. For example, the sum function has an na.rm option which tells it to ignore NAs. To do this I would use:
apply(censdat[, 5:11], 1, sum, na.rm = TRUE)
This concludes this post about for loops and the apply function. I continue this line of thought in a follow-up post discussing tapply, lapply, and sapply, and compare those methods to for loops. I'm trying to keep these posts from being too long and onerous. I figure if you are learning, it's better to take a bite at a time than to drink from a fire hose of information.
I'm always open to suggestions for future posts, so please let me know if there is a topic you would like me to discuss. Feel free to ask me any specific questions at this site. Be aware that there is a nominal $1.50 fee to submit questions. That is because it takes time and effort to respond to your questions.