Home
  • Dynamic Examples

On this page

  • Why is this relevant?
  • Read input file
  • Check if the table was read correctly
    • Inspect properties and structure
    • Inspect the table
  • Troubleshooting
    • Find the errors
    • Identify the source of the problem
      • Is the source table correct?
      • One or multiple errors ?
      • Inspect columns with the wrong formats
    • Identify solutions
      • Code fix: specify column types when reading
      • Source fix: correct the source table in excel
    • Choose the most practical and safe solution
  • Preventing future issues
  • Appendix: date format adjustment

Read tables from file

Reading a table from file and troubleshooting read errors

Author

G.Fraga Gonzalez

Published

October 31, 2023

We describe a frequent first step of reading tabular data, with emphasis in the necessary troubleshooting that should help us maintaining source tables machine-readable and ready for further processing, analysis, visualizations and sharing. In this example we read a table from a real-project that was filled in by the researchers during the data collection. The information in this table is required to interpret the data collected for each subject.

Why is this relevant?

Research almost always involves manipulating tabular data, merging tables from different sources, transposing or reshaping table formats and passing data to analysis software.

Reproducible research requires these operations to be performed programmatically and not manually, to reduce errors and to allow others to reproduce them.

Interoperability of data means that we should maximize the possibility of using the data with different software and operating systems without losing information.

Read input file

First we need to load some libraries. The packages containing these libraries should have been installed before using install.packages().

library(readxl)     # for reading excel (another package is openxlsx) 
library(dplyr)      # data manipulation grammar like the pipe %>%, join operations, 
library(kableExtra) # to render simple and pretty static tables 

Then we specify the fullpath of the excel file. For this we use the path relative to this R Quarto project. The location of the R quarto project file is defined by getwd(). By using file.path() to specify the path we avoid hardcoding the folder separators increasing interoperability across operating systems.

fileinput <- file.path(getwd(),'example_data','Example_subject_info.xlsx')

Then we read the content of a specific sheet in the file using read_excel() of the package readxl. There are several R packages to read excel to do this, we chose this package as it seems to be good at interacting with series of packages (interoperability).

tbl_subj <- readxl::read_excel(fileinput, sheet='Control_room_sheet')

Check if the table was read correctly

When reading a table with code you should always check for unexpected behavior like omission of rows or column, or incorrect formatting. There are several ways of doing this.

Warning

It is not enough to just check that the dimensions of your table in R are the same as in excel!

See below

Inspect properties and structure

Check dimensions, object type, variable names and data types. You can use str() to display the internal structure of the object created when reading the file. The first line of the output specifies [n rows x n columns], type of object (e.g., data frame). The list also show data type (e.g., num = numeric, chr = characters) , dimensions and a few cases for each variable. If you have a large data set you can limit the number of output characters.

str(tbl_subj, nchar.max = 35)
tibble [34 × 15] (S3: tbl_df/tbl/data.frame)
 $ Name                        : chr [1:34] "JP01" "JP02" "JP03" "JP04" ...
 $ Sex                         : chr [1:34] "Male" "Male" "Male" "Female" ...
 $ Body weight [g]             : num [1:34] 27.5 26.6 27.5 20.2 | __truncated__ ...
 $ Cannula length (mm)         : chr [1:34] "2.2999999999999998" "2.2999999999999998" "2.2999999999999998" "2.2999999999999998" ...
 $ Dilution                    : chr [1:34] "1.5x" "1.5x" "1.5x" "1.5x" ...
 $ Concentration               : chr [1:34] "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" ...
 $ Withdrawn CA in syringe (uL): num [1:34] 5 5 5 5 5 5 5 5 17 17 ...
 $ Infused CA in mannitol (uL) : num [1:34] NA NA NA NA NA NA NA NA 0.2 0.2 ...
 $ Injection 1 rate (uL/min)   : num [1:34] 0.1 0.1 0.1 0.1 0.1 | __truncated__ ...
 $ Injection 1 volume (uL)     : num [1:34] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 5 5 ...
 $ Injection 1 JP time (HH:MM) : chr [1:34] "0.62847222222222221" "0.76111111111111107" "0.87152777777777779" "0.46875" ...
 $ Injection 2 rate (uL/min)   : num [1:34] NA NA NA NA NA NA NA NA NA NA ...
 $ Injection 2 volume (uL)     : chr [1:34] NA NA NA NA ...
 $ Injection 2 JP time (HH:MM) : chr [1:34] NA NA NA NA ...
 $ Bubbles?                    : logi [1:34] NA NA NA NA NA NA ...

Inspect the table

To inspect the table, we usually work with an Integrated Development Environment like (e.g., Rstudio) where we can click on a variable and a window will allow to navigate the table. But to render the table in documents like this, recommended good for sharing, you can use kable() to render the table. In this example we also add a scroll box to make it compact.

knitr::kable(
  tbl_subj
) %>%
  kable_paper(bootstrap_options = c("basic","hover")) %>%
  scroll_box(height= "400px")
Name Sex Body weight [g] Cannula length (mm) Dilution Concentration Withdrawn CA in syringe (uL) Infused CA in mannitol (uL) Injection 1 rate (uL/min) Injection 1 volume (uL) Injection 1 JP time (HH:MM) Injection 2 rate (uL/min) Injection 2 volume (uL) Injection 2 JP time (HH:MM) Bubbles?
JP01 Male 27.50 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.62847222222222221 NA NA NA NA
JP02 Male 26.60 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.76111111111111107 NA NA NA NA
JP03 Male 27.50 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.87152777777777779 NA NA NA NA
JP04 Female 20.20 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.46875 NA NA NA NA
JP05 Female 20.20 2.2000000000000002 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.62569444444444444 NA NA NA NA
JP06 Female 20.30 2.2000000000000002 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 NA NA NA NA NA
JP07 Male 26.40 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.70833333333333337 NA NA NA NA
JP08 Male 24.60 2.2999999999999998 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 0.53125 NA NA NA NA
JP09 Male 26.80 2.2999999999999998 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 0.5805555555555556 NA NA NA NA
JP10 Male 25.70 2.2999999999999998 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 0.68958333333333333 NA NA NA NA
JP11 Male 24.40 2.2999999999999998 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 0.78125 NA NA NA NA
JP12 Male 25.30 2.2999999999999998 1x 320 mg Ba/ml 17 0.2 0.50 5.0 0.86388888888888893 NA NA NA NA
JP13 Male 24.70 2.2000000000000002 1x 320 mg Ba/ml 17 0.5 0.50 5.0 0.4680555555555555 0.50 5 0.51041666666666663 NA
JP14 Male 27.00 2.2000000000000002 1x 320 mg Ba/ml 17 0.5 0.50 2.5 13.38 0.50 2.5 0.61736111111111114 NA
JP15 Male 24.00 2.2000000000000002 1x 320 mg Ba/ml 17 0.5 0.50 2.5 0.68333333333333324 0.50 2.5 0.72638888888888886 NA
JP16 Male 25.80 2.2000000000000002 1x 320 mg Ba/ml 17 0.5 0.50 2.5 0.84027777777777779 0.50 2.5 0.8847222222222223 NA
JP17 Male 26.40 2.2000000000000002 1x 320 mg Ba/ml 17 0.5 0.50 2.5 0.95694444444444438 0.50 2.5 0.98611111111111116 NA
JP18 Male 26.60 2.2000000000000002 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 0.41736111111111113 NA NA NA NA
JP19 Male 26.90 2.2000000000000002 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 0.53749999999999998 NA NA NA NA
JP20 Male 26.20 2.2000000000000002 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 0.65208333333333335 0.50 2.5 0.69097222222222221 NA
JP21 Female 21.00 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.74513888888888891 NA NA NA NA
JP22 Female 19.50 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.8666666666666667 0.25 2.5 0.93819444444444444 NA
JP23 Female 19.60 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.97777777777777775 NA NA NA NA
JP24 Female 19.56 2.2999999999999998 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.47222222222222227 NA NA NA NA
JP25 Female 19.63 2.2999999999999998 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.54861111111111105 NA NA NA NA
JP26 Female 19.71 2.2999999999999998 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.70694444444444438 NA NA NA NA
JP27 Female 19.20 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.83611111111111114 NA NA NA NA
JP28 Female 20.20 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.92986111111111114 0.25 5 4.1666666666666666E-3 NA
JP29 Female 18.60 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.41180555555555554 NA NA NA NA
JP30 Female 19.03 2.2000000000000002 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 0.50763888888888886 0.25 ca. 2.5 0.57430555555555551 NA
JP31 Female 20.50 2.2000000000000002 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 0.61249999999999993 NA NA NA NA
JP32 Female 20.20 2.2000000000000002 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 0.7006944444444444 0.25 ca. 2.5 0.76944444444444438 NA
JP33 Female 21.20 2.2000000000000002 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 0.7993055555555556 0.25 ca. 2.5 0.86597222222222225 NA
JP34 Female 19.21 N/A NA NA NA NA NA NA N/A NA NA N/A NA

Troubleshooting

Find the errors

Here we can already observe some unexpected errors. The errors in this example are not uncommon when we are given a table:

  • The variable ‘Cannula length (mm)’ would be expected to be numeric but it is read as character (chr) and the number format seems odd , with too many decimals
  • The variables ‘Injection 1 JP time (HH:MM)’ and ‘Injection 2 JP time (HH:MM)’ do not fullfill the format indicated in the variable name, of two groups of 2 digits separated by colon.
  • We note that there are missing values (NA), those are the blank cells in the excel file.
  • Note: str() make give output like __truncated__ to indicate that not all the cases are displayed on the screen. This is just for display and it doesn’t mean that text is in the table or that the full column was not read.

Identify the source of the problem

Is the source table correct?

If we open this table in excel we see the correct values (e.g.,the column ‘cannula length (mm)’ contains values like 2.2 or 2.3, and ‘Injection JP time (HH:MM)’ values like 18:45). So we know this is a problem when reading the table in R (likely to happen in other software as well).

One or multiple errors ?

Several columns were not correctly read but it seems the problem is the same. They are all read as characters when should have been read in a different format. Do they have something in common?

Inspect columns with the wrong formats

Let us print the unique values of these columns. The function unique() outputs the values that do not repeat in the column:

unique(tbl_subj$`Cannula length (mm)`)
[1] "2.2999999999999998" "2.2000000000000002" "N/A"               
unique(tbl_subj$`Injection 1 JP time (HH:MM)`)
 [1] "0.62847222222222221" "0.76111111111111107" "0.87152777777777779"
 [4] "0.46875"             "0.62569444444444444" NA                   
 [7] "0.70833333333333337" "0.53125"             "0.5805555555555556" 
[10] "0.68958333333333333" "0.78125"             "0.86388888888888893"
[13] "0.4680555555555555"  "13.38"               "0.68333333333333324"
[16] "0.84027777777777779" "0.95694444444444438" "0.41736111111111113"
[19] "0.53749999999999998" "0.65208333333333335" "0.74513888888888891"
[22] "0.8666666666666667"  "0.97777777777777775" "0.47222222222222227"
[25] "0.54861111111111105" "0.70694444444444438" "0.83611111111111114"
[28] "0.92986111111111114" "0.41180555555555554" "0.50763888888888886"
[31] "0.61249999999999993" "0.7006944444444444"  "0.7993055555555556" 
[34] "N/A"                
unique(tbl_subj$`Injection 2 JP time (HH:MM)`)
 [1] NA                      "0.51041666666666663"   "0.61736111111111114"  
 [4] "0.72638888888888886"   "0.8847222222222223"    "0.98611111111111116"  
 [7] "0.69097222222222221"   "0.93819444444444444"   "4.1666666666666666E-3"
[10] "0.57430555555555551"   "0.76944444444444438"   "0.86597222222222225"  
[13] "N/A"                  

The three columns contained some cells with missing values that the researchers coded with the string “N/A” . Note that some contain also the value NA, which is the symbol that R uses to represent missing values (means not available), they are not surrounded by double quotes because they are not a character type cells, unlike “N/A”. Cells with NA are blank in excel.

Here, because researchers wrote a character string “N/A” in some cells, the entire column was read as having character type data, which prevented from reading numbers and date/time cells correctly.

Identify solutions

So we have a problem when reading a table in R that we don’t have if we use excel. What can we do?

Code fix: specify column types when reading

Most packages allow to specify your column types. Below we specify that the read_excel function should guess the format for all columns except the columns we had troubles with (the package allows to specify text, numeric, guess or date)

  • PRO: if we specify all formats then it is clear from reading the code what are the formats of each variable
  • CON: this may not always be a convenient solution for large tables with many columns.

SPECIFY NA STRINGS

tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',col_types = c("guess","guess","guess","numeric","guess","guess","guess","guess","guess","guess","date","guess","guess","date","guess"))
tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',na = c("","N/A"))


knitr::kable(
  tbl
) %>%
  kable_paper(bootstrap_options = c("basic","hover")) %>%
  scroll_box(height= "400px")
Name Sex Body weight [g] Cannula length (mm) Dilution Concentration Withdrawn CA in syringe (uL) Infused CA in mannitol (uL) Injection 1 rate (uL/min) Injection 1 volume (uL) Injection 1 JP time (HH:MM) Injection 2 rate (uL/min) Injection 2 volume (uL) Injection 2 JP time (HH:MM) Bubbles?
JP01 Male 27.50 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 15:05:00 NA NA NA NA
JP02 Male 26.60 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 18:16:00 NA NA NA NA
JP03 Male 27.50 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 20:55:00 NA NA NA NA
JP04 Female 20.20 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 11:15:00 NA NA NA NA
JP05 Female 20.20 2.2 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 15:01:00 NA NA NA NA
JP06 Female 20.30 2.2 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 NA NA NA NA NA
JP07 Male 26.40 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 17:00:00 NA NA NA NA
JP08 Male 24.60 2.3 1.5x 480 mg Ba/ml 5 NA 0.10 2.5 1899-12-31 12:45:00 NA NA NA NA
JP09 Male 26.80 2.3 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 1899-12-31 13:56:00 NA NA NA NA
JP10 Male 25.70 2.3 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 1899-12-31 16:33:00 NA NA NA NA
JP11 Male 24.40 2.3 1.5x 480 mg Ba/ml 17 0.2 0.50 5.0 1899-12-31 18:45:00 NA NA NA NA
JP12 Male 25.30 2.3 1x 320 mg Ba/ml 17 0.2 0.50 5.0 1899-12-31 20:44:00 NA NA NA NA
JP13 Male 24.70 2.2 1x 320 mg Ba/ml 17 0.5 0.50 5.0 1899-12-31 11:14:00 0.50 5 1899-12-31 12:15:00 NA
JP14 Male 27.00 2.2 1x 320 mg Ba/ml 17 0.5 0.50 2.5 1900-01-13 09:07:12 0.50 2.5 1899-12-31 14:49:00 NA
JP15 Male 24.00 2.2 1x 320 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 16:24:00 0.50 2.5 1899-12-31 17:26:00 NA
JP16 Male 25.80 2.2 1x 320 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 20:10:00 0.50 2.5 1899-12-31 21:14:00 NA
JP17 Male 26.40 2.2 1x 320 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 22:58:00 0.50 2.5 1899-12-31 23:40:00 NA
JP18 Male 26.60 2.2 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 10:01:00 NA NA NA NA
JP19 Male 26.90 2.2 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 12:54:00 NA NA NA NA
JP20 Male 26.20 2.2 0.75x 240 mg Ba/ml 17 0.5 0.50 2.5 1899-12-31 15:39:00 0.50 2.5 1899-12-31 16:35:00 NA
JP21 Female 21.00 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 17:53:00 NA NA NA NA
JP22 Female 19.50 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 20:48:00 0.25 2.5 1899-12-31 22:31:00 NA
JP23 Female 19.60 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 23:28:00 NA NA NA NA
JP24 Female 19.56 2.3 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 11:20:00 NA NA NA NA
JP25 Female 19.63 2.3 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 13:10:00 NA NA NA NA
JP26 Female 19.71 2.3 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 16:58:00 NA NA NA NA
JP27 Female 19.20 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 20:04:00 NA NA NA NA
JP28 Female 20.20 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 22:19:00 0.25 5 1899-12-31 00:06:00 NA
JP29 Female 18.60 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 09:53:00 NA NA NA NA
JP30 Female 19.03 2.2 0.75x 240 mg Ba/ml 10 0.5 0.25 5.0 1899-12-31 12:11:00 0.25 ca. 2.5 1899-12-31 13:47:00 NA
JP31 Female 20.50 2.2 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 1899-12-31 14:42:00 NA NA NA NA
JP32 Female 20.20 2.2 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 1899-12-31 16:49:00 0.25 ca. 2.5 1899-12-31 18:28:00 NA
JP33 Female 21.20 2.2 0.75x 240 mg Ba/ml 10 1.0 0.25 5.0 1899-12-31 19:11:00 0.25 ca. 2.5 1899-12-31 20:47:00 NA
JP34 Female 19.21 NA NA NA NA NA NA NA NA NA NA NA NA
str(tbl)
tibble [34 × 15] (S3: tbl_df/tbl/data.frame)
 $ Name                        : chr [1:34] "JP01" "JP02" "JP03" "JP04" ...
 $ Sex                         : chr [1:34] "Male" "Male" "Male" "Female" ...
 $ Body weight [g]             : num [1:34] 27.5 26.6 27.5 20.2 20.2 20.3 26.4 24.6 26.8 25.7 ...
 $ Cannula length (mm)         : num [1:34] 2.3 2.3 2.3 2.3 2.2 2.2 2.3 2.3 2.3 2.3 ...
 $ Dilution                    : chr [1:34] "1.5x" "1.5x" "1.5x" "1.5x" ...
 $ Concentration               : chr [1:34] "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" ...
 $ Withdrawn CA in syringe (uL): num [1:34] 5 5 5 5 5 5 5 5 17 17 ...
 $ Infused CA in mannitol (uL) : num [1:34] NA NA NA NA NA NA NA NA 0.2 0.2 ...
 $ Injection 1 rate (uL/min)   : num [1:34] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.5 0.5 ...
 $ Injection 1 volume (uL)     : num [1:34] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 5 5 ...
 $ Injection 1 JP time (HH:MM) : POSIXct[1:34], format: "1899-12-31 15:05:00" "1899-12-31 18:16:00" ...
 $ Injection 2 rate (uL/min)   : num [1:34] NA NA NA NA NA NA NA NA NA NA ...
 $ Injection 2 volume (uL)     : chr [1:34] NA NA NA NA ...
 $ Injection 2 JP time (HH:MM) : POSIXct[1:34], format: NA NA ...
 $ Bubbles?                    : logi [1:34] NA NA NA NA NA NA ...

Source fix: correct the source table in excel

In this example it would just be easier to go to the source table and delete those “N/A” entries, the cells will be then blank and read as NA by R without affecting the column format. This modification can be done with a script if the table is large

  • PRO: the source will be read correctly in any new script.
  • CON: if the table is too complex it will add a new step ( a ‘correcting’ script) from original data, if the it is simple and it is done manually there is a small chance for human error

Choose the most practical and safe solution

In this example, the source table required manual entries and inspections by researchers because they had to fill some on-site information from the lab. Thus, the most practical solution seems to delete ‘N/A’ entries and instruct researchers to leave empty those cells instead. This does not compromise safety as the table would have required manual inspection and entries anyway. Note that this would have been needed for consistency in the table, even if this format-read issue (some cells had been left empty while some others had ‘N/A’, this is inconsistent formatting unless it is required to convey different information).

Preventing future issues

Consider the following points to favor interoperability and minimize reading errors:

  • Keep formats consistent. If tables are filled in by several users, have explicit instructions (e.g., how to represent missing values). Whenever possible avoid mix formats within a variable

  • Avoid encoding information in formatting, for example, using colors or conditional formatting). No information should be loss if only the plain content (without formatting) of the cells is read.

  • Use consistent variables names, use informative and descriptive names while avoid whenever possible special characters and exceedingly long names. It is also recommended to have additional sheets with a codebook describing each variable name and what it means

Appendix: date format adjustment

Back to top
Source Code
---
title: "Read tables from file"
subtitle: "Reading a table from file and troubleshooting read errors" 
date: last-modified 
author: "G.Fraga Gonzalez"

format:
    html: 
      self-contained: true
      page-layout: full
      toc: true
      code-fold: false
      code-copy: true
      code-overflow: scroll
      code-tools: true
      number-sections: false
      title-block-style: default
---

We describe a frequent first step of reading tabular data, with emphasis in the necessary troubleshooting that should help us maintaining source tables machine-readable and ready for further processing, analysis, visualizations and sharing. In this example we read a table from a *real-project* that was filled in by the researchers during the data collection. The information in this table is required to interpret the data collected for each subject.

# Why is this relevant?

Research almost always involves manipulating tabular data, merging tables from different sources, transposing or reshaping table formats and passing data to analysis software. 

**Reproducible** research requires these operations to be performed programmatically and not manually, to reduce errors and to allow others to reproduce them.

**Interoperability** of data means that we should maximize the possibility of using the data with different software and operating systems without losing information. 

# Read input file

First we need to load some libraries. The packages containing these libraries should have been installed before using `install.packages()`.

```{r load packages}
#| warning: false
library(readxl)     # for reading excel (another package is openxlsx) 
library(dplyr)      # data manipulation grammar like the pipe %>%, join operations, 
library(kableExtra) # to render simple and pretty static tables 

```

Then we specify the fullpath of the excel file. For this we use the path relative to this R Quarto project. The location of the *R quarto project file* is defined by `getwd()`. By using `file.path()` to specify the path we avoid hardcoding the folder separators increasing interoperability across operating systems.

```{r}
#| warning: FALSE
fileinput <- file.path(getwd(),'example_data','Example_subject_info.xlsx')
```

Then we read the content of a specific sheet in the file using `read_excel()` of the package readxl. There are several R packages to read excel to do this, we chose this package as it seems to be good at interacting with series of packages (interoperability).

```{r readTableFile}
#| warning: FALSE
tbl_subj <- readxl::read_excel(fileinput, sheet='Control_room_sheet')
```

# Check if the table was read correctly

When reading a table with code you should **always** check for unexpected behavior like omission of rows or column, or incorrect formatting. There are several ways of doing this.

::: callout-warning
## Warning

It is not enough to just check that the dimensions of your table in R are the same as in excel!

See below
:::

## Inspect properties and structure

Check *dimensions, object type, variable names and data types*. You can use `str()` to display the internal structure of the object created when reading the file. The first line of the output specifies \[n rows x n columns\], type of object (e.g., data frame). The list also show data type (e.g., num = numeric, chr = characters) , dimensions and a few cases for each variable. If you have a large data set you can limit the number of output characters.

```{r}
#| warning: FALSE
#| results: markup 
str(tbl_subj, nchar.max = 35)

```

## Inspect the table

To inspect the table, we usually work with an Integrated Development Environment like (e.g., Rstudio) where we can click on a variable and a window will allow to navigate the table. But to render the table in documents like this, recommended good for sharing, you can use `kable()` to render the table. In this example we also add a scroll box to make it compact.

```{r}
knitr::kable(
  tbl_subj
) %>%
  kable_paper(bootstrap_options = c("basic","hover")) %>%
  scroll_box(height= "400px")

```

# Troubleshooting

## Find the errors

Here we can already observe some unexpected errors. The errors in this example are not uncommon when we are given a table:

-   The variable 'Cannula length (mm)' would be expected to be numeric but it is read as character (*chr)* and the number format seems odd , with too many decimals
-   The variables 'Injection 1 JP time (HH:MM)' and 'Injection 2 JP time (HH:MM)' do not fullfill the format indicated in the variable name, of two groups of 2 digits separated by colon.
-   We note that there are missing values (NA), those are the blank cells in the excel file.
-   Note: str() make give output like `__truncated__` to indicate that not all the cases are displayed on the screen. This is just for display and it doesn't mean that text is in the table or that the full column was not read.

## Identify the source of the problem

### Is the source table correct?

If we open this table in excel we see the correct values (e.g.,the column 'cannula length (mm)' contains values like 2.2 or 2.3, and 'Injection JP time (HH:MM)' values like 18:45). So we know this is a problem when reading the table in R (likely to happen in other software as well).

### One or multiple errors ?

Several columns were not correctly read but it seems the problem is the same. They are all read as characters when should have been read in a different format. Do they have something in common?

### Inspect columns with the wrong formats

Let us print the unique values of these columns. The function `unique()` outputs the values that do not repeat in the column:

```{r}
#| warning: FALSE
#| results: markup 
unique(tbl_subj$`Cannula length (mm)`)
unique(tbl_subj$`Injection 1 JP time (HH:MM)`)
unique(tbl_subj$`Injection 2 JP time (HH:MM)`)

```

The three columns contained some cells with missing values that the researchers coded with the string "N/A" . Note that some contain also the value *NA*, which is the symbol that R uses to represent missing values (means *not available*), they are not surrounded by double quotes because they are not a character type cells, unlike "N/A". Cells with NA are blank in excel.

Here, because researchers wrote a character string "N/A" in some cells, the entire column was read as having character type data, which prevented from reading numbers and date/time cells correctly.

## Identify solutions

So we have a problem when reading a table in R that we don't have if we use excel. What can we do?

### Code fix: specify column types when reading

Most packages allow to specify your column types. Below we specify that the read_excel function should guess the format for all columns except the columns we had troubles with (the package allows to specify *text*, *numeric*, *guess* or *date*)

-   PRO: if we specify all formats then it is clear from reading the code what are the formats of each variable
-   CON: this may not always be a convenient solution for large tables with many columns.\




SPECIFY NA STRINGS 

```{r}
#| warning: FALSE
#| results: markup
tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',col_types = c("guess","guess","guess","numeric","guess","guess","guess","guess","guess","guess","date","guess","guess","date","guess"))

```

```{r}
#| warning: FALSE
#| results: markup
tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',na = c("","N/A"))


knitr::kable(
  tbl
) %>%
  kable_paper(bootstrap_options = c("basic","hover")) %>%
  scroll_box(height= "400px")


str(tbl)
```
### Source fix: correct the source table in excel

In this example it would just be easier to go to the source table and delete those "N/A" entries, the cells will be then blank and read as NA by R without affecting the column format. This modification can be done with a script if the table is large

-   PRO: the source will be read correctly in any new script.
-   CON: if the table is too complex it will add a new step ( a 'correcting' script) from original data, if the it is simple and it is done manually there is a small chance for human error

## Choose the most practical and safe solution

In this example, the source table required manual entries and inspections by researchers because they had to fill some on-site information from the lab. Thus, the most practical solution seems to delete 'N/A' entries and instruct researchers to leave empty those cells instead. This does not compromise safety as the table would have required manual inspection and entries anyway. Note that this would have been needed for consistency in the table, even if this format-read issue (some cells had been left empty while some others had 'N/A', this is inconsistent formatting unless it is required to convey different information).

# Preventing future issues
Consider the following points to favor interoperability and minimize reading errors: 

-   *Keep formats consistent*. If tables are filled in by several users, have explicit instructions (e.g., how to represent missing values). Whenever possible avoid mix formats within a variable

-   *Avoid encoding information in formatting*, for example, using colors or conditional formatting). No information should be loss if only the plain content (without formatting) of the cells is read.

-    *Use consistent variables names*, use informative and descriptive names while avoid whenever possible special characters and exceedingly long names. It is also recommended to have additional sheets with a *codebook* describing each variable name and what it means


# Appendix: date format adjustment
Created by the Center for Reproducible Science of the University of Zurich.
 
This website is and will always be free, licensed under the CC BY internationa license 4.0 License.