Performance issues

Over 40% of R's source code is written in C and a little over 20% is still in Fortran (the rest is in C++, Java, and R), and some of this older code makes common computational tasks very costly. Microsoft (and, before it, Revolution Analytics) rewrote some of the most frequently used functions from old Fortran to C/C++ to address these performance issues.

Many package authors did something similar. For example, Matt Dowle, the main author of the data.table R package, made several performance improvements to speed up the most common data wrangling steps.

When similar operations are run on the same dataset with different packages, such as dplyr, plyr, data.table, and sqldf, the results are the same but the execution times differ noticeably.

The following R sample creates an object of roughly 80 MiB and applies a simple grouping function to show how much the computation time differs. The dplyr and data.table packages stand out, with performance more than 25 times better than plyr and sqldf. data.table in particular is extremely efficient, mainly because of Matt's sustained effort to optimize the package's code:

set.seed(6546) 
nobs <- 1e+07 
df <- data.frame("group" = as.factor(sample(1:1e+05, nobs, replace = TRUE)), "variable" = rpois(nobs, 100)) 
 
# Calculate mean of variable within each group using plyr - ddply  
library(plyr) 
system.time(grpmean <- ddply( 
  df,  
  .(group),  
  summarize,  
  grpmean = mean(variable))) 
 
 
# Calculate mean of variable within each group using dplyr
detach("package:plyr", unload=TRUE) 
library(dplyr) 
 
system.time( 
  grpmean2 <- df %>%  
              group_by(group) %>% 
              summarise(group_mean = mean(variable))) 
 
# Calculate mean of variable within each group using data.table
library(data.table)
system.time(
  grpmean3 <- data.table(df)[
    ,                              # i: no row filter
    .(grpmean = mean(variable)),   # j: named summary column
    by = group])                   # by: grouping column
 
# Calculate mean of variable within each group using sqldf
library(sqldf) 
system.time(grpmean4 <- sqldf("SELECT avg(variable), [group] from df GROUP BY [group]")) 
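
The claim that different approaches return the same result while differing in run time can be checked on a much smaller scale. The following is a minimal sketch, not part of the benchmark above: it assumes base R only, uses tapply and aggregate as stand-ins for the package functions, and the names df_small, m_tapply, and m_aggregate are chosen purely for illustration:

```r
# Scaled-down sketch: the same grouped mean computed two ways in base R,
# followed by a check that both approaches return the same values.
set.seed(6546)
nobs <- 1e5   # far fewer rows than the 1e+07 used in the benchmark
df_small <- data.frame(
  group    = as.factor(sample(1:100, nobs, replace = TRUE)),
  variable = rpois(nobs, 100)
)

# Approach 1: tapply returns a named array, one element per group level
t_tapply <- system.time(
  m_tapply <- tapply(df_small$variable, df_small$group, mean)
)

# Approach 2: aggregate returns a data frame with one row per group
t_aggregate <- system.time(
  m_aggregate <- aggregate(variable ~ group, data = df_small, FUN = mean)
)

# Align the tapply output on the group labels before comparing values
same_result <- isTRUE(all.equal(
  unname(as.numeric(m_tapply[as.character(m_aggregate$group)])),
  m_aggregate$variable
))
print(same_result)

# Elapsed seconds for each approach (values vary by machine)
print(c(tapply    = t_tapply[["elapsed"]],
        aggregate = t_aggregate[["elapsed"]]))
```

The same pattern, aligning on the group key and then comparing with all.equal, could be applied to grpmean, grpmean2, grpmean3, and grpmean4 once their column names and row order are harmonized.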

The Microsoft RevoScaleR package is optimized as well and can outperform all of these packages, both in speed and in its ability to handle large datasets. This shows how Microsoft has made R ready for large datasets and addressed the performance issues.