R is one of the most powerful business analytics tool that has strong and seamless integration with SQL Server 2016, Tableau and many more tools alike. It has a very good integration with Big Data ecosystems as well and hence the need or R for analyzing data to deduce the details needed out of the available data set.
This article now introduces the data.table package which answers a lot of analytical skills that R has brought to the table, as it was considered that R couldn’t handle larger data sets ranging over 10GB in size. Let us take a closer look into how such a myth has been broken with this new package that R has to provide with.
One of the best ways to manage the data manipulations is to use the data.table packages made available by R itself. We have tried to provide valid examples to show case the usage of the various commands, within the boundaries of the article we will try to cover almost all possible syntaxes but it can never be considered a thorough documentation of all the features provided by R.
This package is an answer to the myth that R can’t handle data sets greater than 10GB size as it is not Memory efficient (the primary reason for this being, all the data gets loaded in to the RAM). The data.table package comes into play breaking the myth that it is no longer a limitation on the processing power of R. There has been successful benchmark tests done between data.table and dplyr packages and every such test is won by the data.table package. The efficiency of it is also compared against Python’s package (panda) with the same remarkable result.
Let us take a look at the syntax of data.table usage:
DT[ i , j , by]
i refers to the number of rows, also implies sub-setting rows. This is equivalent to WHERE clause on a traditional SQL statement
j refers to number of columns, also implies sub-setting columns (dropping / keeping). This is equivalent to SELECT clause on a traditional SQL statement.
by refers to adding a group so as to keep all the calculations within a group. This is equivalent to GROUP BY clause on a traditional SQL statement.
The syntax is not just restricted to these 3 parameters, there are more arguments that could be used in conjunction with data.table package which are listed as below:
on, mult, nomatch
The fread() function in data.table package, is available to read or get data from your computer or from a web page. It is similar to the read.csv() function of base R.
mydata = fread("E:flights_2014.csv")
From the CSV file that we read here, it contains a large amounts of data about the flight arrival and departure timings for year 2014. The nrow() and ncol() functions let you know that there are 253K rows and 17 columns of data available in the Input file that we have used.
We can use the %like% operator to find any pattern in the input file that we have mentioned. Alongside to the pattern, it is also similar to base R’s grepl() function or similar to the traditional SQL’s LIKE operator and similar to SAS’s CONTAINS function.
odata = mydata[,names(mydata) %like% "arr", with=FALSE]
We can rename the variables with the function setnames(). We shall rename a variable ‘arr’ to ‘arrival’.
setnames(mydata, c("arr"), c("Arrival"))
and if you want to rename multiple variables, you can simply add variables on both the sides.
setnames(mydata, c("arr","origin"), c("Arrival", "origin.of.flight"))
Let us summarize by group of origin, the command would be something as like shown below:
mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = origin]
The process of merging in data.table is much similar to what the function merge() does in base R. The most common difference between these two is by default takes common key variables as the primary key to merge the datasets.
Let us take a small set of data for sample to show how the merge function works:
Inner Join, returns all the matching records that are available in both the datasets.
merge(dt1, dt2, by="A")
Left Join, returns all the matched records from the left dataset of the relation which are also present in the right dataset.
merge(dt1, dt2, by="A", all.x = TRUE)
Right Join, returns all the matched records from the right dataset of the relation as per the example provided earlier.
merge(dt1, dt2, by="A", all.y = TRUE)
A Full Join returns all the rows when there is a match available in one of the datasets.
merge(dt1, dt2, all=TRUE)
Based on the dataset that is made available, you make use of the following practice questions to deduce the R scripts to gain the details that is wanted out of the problem statement.
1. Calculate total number of rows by month and then sort on descending order
mydata[, .N, by = month] [order(-N)]
2. Find top 3 months with high mean arrival delay
mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = month][order(-mean_arr_delay)][1:3]
3. Find origin of flights having average total delay is greater than 20 minutes
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay", "dep_delay"), by = origin][(arr_delay + dep_delay) > 20]
4. Extract average of arrival and departure delays for carrier == 'DL' by 'origin' and 'dest' variables
mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]
5. Pull first value of 'air_time' by 'origin' and then sum the returned values when it is greater than 300
mydata[, .SD, .SDcols="air_time", by=origin][air_time > 300, sum(air_time)]
Related Page: The 4 Steps To Set-Up Your MySQL Reporting
In this article, we have seen what R brings to the table with the most powerful data.table package. This package in specific provides two major advantages – lesser coding and lesser computing time / resources. Though not a very simple solution over the dplyr package provided by R as it works with decent speeds but is definitely slower than data.table packages.
Get Updates on Tech posts, Interview & Certification questions and training schedules