Handling tables in R (2024)

Language Technology and Data Analysis Laboratory

HOME
ABOUT LADAL
- People | Collabs
- News
EVENTS
- Workshops
- Computational Thinking in HASS
- LADAL Webinar Series 2022
- LADAL Webinar Series 2021
- ATAP Events
TUTORIALS
RESOURCES
- Links
- Tutorial stylesheet
CONTACT

Martin Schweinberger

This tutorial shows how to work with tables and how to tabulate datain R. A more advanced and detailed but also truly excellent and highlyrecommendable resource on processing data in R is Wickham and Grolemund (2016; see also Wickham et al.2019). Alternative but also very useful resources are Stander and Dalla Valle (2017)and Estrellado et al. (2020).

This tutorial is aimed at beginners and intermediate users of R withthe aim of showcasing how to work with and process tabulated data(tables) using R. The aim is not to provide a fully-fledged analysis butrather to show and exemplify selected useful functions and methodsassociated with tabular data.

To be able to follow this tutorial, we suggest you check out andfamiliarize yourself with the content of the following RBasics tutorials:

Getting started with R
Loading, saving, and generatingdata in R
String Processing in R
Regular Expressions in R

Click here1 todownload the entire R Notebook for thistutorial.

Clickhereto open an interactive Jupyter notebook that allows you to execute,change, and edit the code as well as to upload your own data.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are newto it, you will find an introduction to and more information how to useR here. For thistutorials, we need to install certain packages from an Rlibrary so that the scripts shown below are executed withouterrors. Before turning to the code below, please install the packages byrunning the code below this paragraph. If you have already installed thepackages mentioned below, then you can skip ahead ignore this section.To install the necessary packages, simply run the following code - itmay take some time (between 1 and 5 minutes to install all of thepackages so you do not need to worry if it takes some time).

# install packagesinstall.packages("xlsx")install.packages("tidyverse")install.packages("flextable")install.packages("openxlsx")install.packages("here")# install klippy for copy-to-clipboard button in code chunksinstall.packages("remotes")remotes::install_github("rlesur/klippy")

Now that we have installed the packages, we can activate them asshown below.

# load packageslibrary(dplyr)library(tidyr)library(flextable)library(xlsx)library(openxlsx)library(here)# activate klippy for copy-to-clipboard buttonklippy::klippy()

Once you have installed RStudio and initiated the session byexecuting the code shown above, you are good to go.

Tables are one of the most common and important ways to process,handle, and describe data. This tutorial introduces different types oftables, how to load and save different types of tables, as well as howto modify and process tables and tabulated data.

When working with R, there are different kinds or types of tablesthat have different characteristics. The most common types of tables inR are:

matrices
data frames
tibbles

Matrices can only contain one type of data and all data points willbe converted to the type of scale with the lowest information value. Forinstance, if at least one variables in a table represents characters (orstrings), then all other variables are also converted into characters(although they may be numbers).

Data frames can contain different types of scales and variables arenot converted automatically. In this sense, data frames are moreflexible and are preferable over matrices.

Tibbles are the tidyverse equivalent of data frames which offer newfunctions and possibilities of handling and inspecting the data. .

First 10 rows of mytable.
status	attraction	money
Relationship	NotInterested	86.33
Relationship	NotInterested	45.58
Relationship	NotInterested	68.43
Relationship	NotInterested	52.93
Relationship	NotInterested	61.86
Relationship	NotInterested	48.47
Relationship	NotInterested	32.79
Relationship	NotInterested	35.91
Relationship	NotInterested	30.98
Relationship	NotInterested	44.82

First 10 rows of the exceldata.
Variable1	Variable2
6	67
65	16
12	56
56	34
45	54
84	42
38	36
46	47
64	54
24	29

First 10 rows of the excelcomp data.
Variable1	Variable2
6	67
65	16
12	56
56	34
45	54
84	42
38	36
46	47
64	54
24	29

First 10 rows of the excelcomp data.
status	attraction	money
Relationship	NotInterested	86.33
Relationship	NotInterested	45.58
Relationship	NotInterested	68.43
Relationship	NotInterested	52.93
Relationship	NotInterested	61.86
Relationship	NotInterested	48.47
Relationship	NotInterested	32.79
Relationship	NotInterested	35.91
Relationship	NotInterested	30.98
Relationship	NotInterested	44.82

Piping

Piping, done with the sequence %>%, is a very easy,intuitive, quick, and handy way to process data. Essentially pipingmeans that we take an element that is to the left of the piping symboland then do something to it; that way, the piping symbol can betranslated as and then.

We could, for example, load data and then capitalize the column namesand then group the data by status and attraction and then get the meanof money spend on deleting all observations except for the first one. Amore formal way to write this would be:

load %>% capitalize %>% group %>% summarize.

In R this command would look like this:

# example of a data processing pipelinepipeddata <- base::readRDS(url("https://slcladal.github.io/data/mld.rda", "rb")) %>% dplyr::rename(Status = status, Attraction = attraction, Money = money) %>% dplyr::group_by(Status, Attraction) %>% dplyr::summarise(Mean = mean(Money))# inspect summarized datapipeddata

## # A tibble: 4 × 3## # Groups: Status [2]## Status Attraction Mean## <chr> <chr> <dbl>## 1 Relationship Interested 99.2## 2 Relationship NotInterested 51.5## 3 Single Interested 157. ## 4 Single NotInterested 46.0

The pipe has worked and we get the resulting summary which shows themean of the money spend based on Attraction and Status.

Selecting and filtering

Among the most frequent procedures in data processing is selectingcertain columns or extracting rows based on variable levels. In thetidyverse, this is done by using the select andfilter functions. While select allows toextract columns, filter is used to extract rows, e.g.toget only observations that have a certain feature. Have a look at theexample below.

# select and filterreduceddata <- newdata %>% # select the columns attraction and money dplyr::select(attraction, money) %>% # extract rows which represent cases where the person was interested in someone dplyr::filter(attraction == "Interested")# inspect new tablenrow(reduceddata); table(reduceddata$attraction)

## [1] 50

## ## Interested ## 50

We have now reduced the data by excluding status (we have onlyselected attraction and money) and we have removed those 50 data rows ofpeople who were not interested. The select function (likemost other tidyverse functions) can also be used together with a minussign which causes a column to be removed, thusdplyr::select(-money) would remove the money column (seebelow).

# select and filterdatawithoutmoney <- newdata %>% # remove money dplyr::select(-money) # inspect datahead(datawithoutmoney)

## status attraction## 1 Relationship NotInterested## 2 Relationship NotInterested## 3 Relationship NotInterested## 4 Relationship NotInterested## 5 Relationship NotInterested## 6 Relationship NotInterested

Selecting and filtering are extremely powerful functions that canalso be combined with other functions. But before we discuss morecomplex issues, we will have a look at how we can change columns.

Changing data and adding columns

Changing and adding data is done with the mutatefunction. The mutate functions requires that we specify acolumn name - if we use the same name as the column we are changing,then we change the column but if we specify another column name, then anew column is created.

We will now create a new column (Spendalot) in which we encode if theperson has spend a lot of money (100 AUD or more) on the present or not(less than 100 AUD).

# creating a new columnnewdata <- newdata %>% dplyr::mutate(Spendalot = ifelse(money >= 100, "Alot", "Alittle")) # inspect datahead(newdata)

## status attraction money Spendalot## 1 Relationship NotInterested 86.33 Alittle## 2 Relationship NotInterested 45.58 Alittle## 3 Relationship NotInterested 68.43 Alittle## 4 Relationship NotInterested 52.93 Alittle## 5 Relationship NotInterested 61.86 Alittle## 6 Relationship NotInterested 48.47 Alittle

The table now has a new column (Spendalot) because we have specifieda column name that did not exist yet - had we writtendplyr::mutate(money = ifelse(money >= 100, "Alot", "Alittle"))then we would have changed the money column and replaced the moneyvalues with the labels Alot and Alittle.

Renaming columns

Oftentimes, column names are not really meaningful or incoherentwhich makes it easier to wrap your head around what the values in acolumn refer to. The easiest way around this is rename columns which is,fortunately very simple in the tidyverse. While the column names of ourexample table are meaningful, I want to capitalize the first letter ofeach column name. This can be done as follows.

# renaming columnsnewdata <- newdata %>% dplyr::rename(Status = status, Attraction = attraction, Money = money)# inspect datahead(newdata)

## Status Attraction Money Spendalot## 1 Relationship NotInterested 86.33 Alittle## 2 Relationship NotInterested 45.58 Alittle## 3 Relationship NotInterested 68.43 Alittle## 4 Relationship NotInterested 52.93 Alittle## 5 Relationship NotInterested 61.86 Alittle## 6 Relationship NotInterested 48.47 Alittle

The renaming was successful as all column names now begin with acapital letter.

Grouping and summarising

In contrast to mutate, which retains the number of rows, summarizingcreates new columns but collapses rows and only provides the summaryvalue (or values if more than one summary is specified). Also, columnsthat are not grouping variables are removed.

Summarizing is particularly useful when we want to get summaries ofgroups. We will modify the example from above and extract the mean andthe standard deviation of the money spend on presents by relationshipstatus and whether the giver was attracted to the giv-ee.

#grouping and summarizing data datasummary <- newdata %>% dplyr::group_by(Status, Attraction) %>% dplyr::summarise(Mean = round(mean(Money), 2), SD = round(sd(Money), 1))# inspect summarized datadatasummary

## # A tibble: 4 × 4## # Groups: Status [2]## Status Attraction Mean SD## <chr> <chr> <dbl> <dbl>## 1 Relationship Interested 99.2 14.7## 2 Relationship NotInterested 51.5 17 ## 3 Single Interested 157. 23.2## 4 Single NotInterested 46.0 19.9

Gathering and Spreading

One very common problem is that data - or at least parts of it - haveto be transformed from long to wide format or vice versa. In thetidyverse, this is done using the gather andspread function. We will convert the summary table shownabove into a wide format (we also remove the SD column as it is nolonger needed)

# converting data to wide format widedata <- datasummary %>% # remove SD column dplyr::select(-SD) %>% # convert into wide format tidyr::spread(Attraction, Mean)# inspect wide datawidedata

## # A tibble: 2 × 3## # Groups: Status [2]## Status Interested NotInterested## <chr> <dbl> <dbl>## 1 Relationship 99.2 51.5## 2 Single 157. 46.0

We can re-convert the wide into a long format using thegather function.

# converting data to long format longdata <- widedata %>% # convert into long format tidyr::gather(Attraction, Money, Interested:NotInterested)# inspect wide datalongdata

## # A tibble: 4 × 3## # Groups: Status [2]## Status Attraction Money## <chr> <chr> <dbl>## 1 Relationship Interested 99.2## 2 Single Interested 157. ## 3 Relationship NotInterested 51.5## 4 Single NotInterested 46.0

There are many more useful functions for processing, handling, andsummarizing tables but this should suffice to get you started.

Schweinberger, Martin. 2022. Handling tables in R. Brisbane:The University of Queensland. url: https://ladal.edu.au/table.html (Version2022.11.17).

@manual{schweinberger2022tabr, author = {Schweinberger, Martin}, title = {Handling tables in R}, note = {https://ladal.edu.au/table.html}, year = {2022}, organization = "The University of Queensland, School of Languages and Cultures}, address = {Brisbane}, edition = {2022.11.17}}

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 22621)## ## Matrix products: default## ## locale:## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8 ## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C ## [5] LC_TIME=English_Australia.utf8 ## ## attached base packages:## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages:## [1] here_1.0.1 openxlsx_4.2.5.2 xlsx_0.6.5 flextable_0.9.1 ## [5] tidyr_1.3.0 dplyr_1.1.2 ## ## loaded via a namespace (and not attached):## [1] Rcpp_1.0.10 xlsxjars_0.6.1 assertthat_0.2.1 ## [4] rprojroot_2.0.3 digest_0.6.31 utf8_1.2.3 ## [7] mime_0.12 cellranger_1.1.0 R6_2.5.1 ## [10] evaluate_0.21 highr_0.10 pillar_1.9.0 ## [13] gdtools_0.3.3 rlang_1.1.1 readxl_1.4.2 ## [16] curl_5.0.0 uuid_1.1-0 rstudioapi_0.14 ## [19] data.table_1.14.8 jquerylib_0.1.4 klippy_0.0.0.9500 ## [22] rmarkdown_2.21 textshaping_0.3.6 shiny_1.7.4 ## [25] compiler_4.2.2 httpuv_1.6.11 xfun_0.39 ## [28] pkgconfig_2.0.3 askpass_1.1 systemfonts_1.0.4 ## [31] gfonts_0.2.0 htmltools_0.5.5 openssl_2.0.6 ## [34] tidyselect_1.2.0 tibble_3.2.1 fontBitstreamVera_0.1.1## [37] httpcode_0.3.0 fansi_1.0.4 withr_2.5.0 ## [40] crayon_1.5.2 later_1.3.1 crul_1.4.0 ## [43] grid_4.2.2 jsonlite_1.8.4 xtable_1.8-4 ## [46] lifecycle_1.0.3 magrittr_2.0.3 zip_2.3.0 ## [49] cli_3.6.1 stringi_1.7.12 cachem_1.0.8 ## [52] promises_1.2.0.1 xml2_1.3.4 bslib_0.4.2 ## [55] ellipsis_0.3.2 ragg_1.2.5 generics_0.1.3 ## [58] vctrs_0.6.2 tools_4.2.2 glue_1.6.2 ## [61] officer_0.6.2 fontquiver_0.2.1 purrr_1.0.1 ## [64] fastmap_1.1.1 yaml_2.3.7 fontLiberation_0.1.0 ## [67] rJava_1.0-6 knitr_1.43 sass_0.4.6

Back to LADAL

Estrellado, Ryan A, Emily A Freer, Jesse Mostipak, Joshua M Rosenberg,and Isabella C Velásquez. 2020. Data Science in Education Usingr. Routledge.

Stander, Julian, and Luciana Dalla Valle. 2017. “On EnthusingStudents about Big Data and Social Media Visualization and AnalysisUsing r, RStudio, and RMarkdown.” Journal of StatisticsEducation 25 (2): 60–67.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, LucyD’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019.“Welcome to the Tidyverse.” Journal of Open SourceSoftware 4 (43): 1686.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science:Import, Tidy, Transform, Visualize, and Model Data. " O’ReillyMedia, Inc.".

If you want to render the R Notebook on your machine,i.e.knitting the document to html or a pdf, you need to make sure thatyou have R and RStudio installed and you also need to download the bibliographyfile and store it in the same folder where you store theRmd file.↩︎