Read tables from file
Reading a table from file and troubleshooting read errors
We describe a frequent first step of reading tabular data, with emphasis in the necessary troubleshooting that should help us maintaining source tables machine-readable and ready for further processing, analysis, visualizations and sharing. In this example we read a table from a real-project that was filled in by the researchers during the data collection. The information in this table is required to interpret the data collected for each subject.
Why is this relevant?
Research almost always involves manipulating tabular data, merging tables from different sources, transposing or reshaping table formats and passing data to analysis software.
Reproducible research requires these operations to be performed programmatically and not manually, to reduce errors and to allow others to reproduce them.
Interoperability of data means that we should maximize the possibility of using the data with different software and operating systems without losing information.
Read input file
First we need to load some libraries. The packages containing these libraries should have been installed before using install.packages()
.
Then we specify the fullpath of the excel file. For this we use the path relative to this R Quarto project. The location of the R quarto project file is defined by getwd()
. By using file.path()
to specify the path we avoid hardcoding the folder separators increasing interoperability across operating systems.
Then we read the content of a specific sheet in the file using read_excel()
of the package readxl. There are several R packages to read excel to do this, we chose this package as it seems to be good at interacting with series of packages (interoperability).
tbl_subj <- readxl::read_excel(fileinput, sheet='Control_room_sheet')
Check if the table was read correctly
When reading a table with code you should always check for unexpected behavior like omission of rows or column, or incorrect formatting. There are several ways of doing this.
It is not enough to just check that the dimensions of your table in R are the same as in excel!
See below
Inspect properties and structure
Check dimensions, object type, variable names and data types. You can use str()
to display the internal structure of the object created when reading the file. The first line of the output specifies [n rows x n columns], type of object (e.g., data frame). The list also show data type (e.g., num = numeric, chr = characters) , dimensions and a few cases for each variable. If you have a large data set you can limit the number of output characters.
str(tbl_subj, nchar.max = 35)
tibble [34 × 15] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:34] "JP01" "JP02" "JP03" "JP04" ...
$ Sex : chr [1:34] "Male" "Male" "Male" "Female" ...
$ Body weight [g] : num [1:34] 27.5 26.6 27.5 20.2 | __truncated__ ...
$ Cannula length (mm) : chr [1:34] "2.2999999999999998" "2.2999999999999998" "2.2999999999999998" "2.2999999999999998" ...
$ Dilution : chr [1:34] "1.5x" "1.5x" "1.5x" "1.5x" ...
$ Concentration : chr [1:34] "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" ...
$ Withdrawn CA in syringe (uL): num [1:34] 5 5 5 5 5 5 5 5 17 17 ...
$ Infused CA in mannitol (uL) : num [1:34] NA NA NA NA NA NA NA NA 0.2 0.2 ...
$ Injection 1 rate (uL/min) : num [1:34] 0.1 0.1 0.1 0.1 0.1 | __truncated__ ...
$ Injection 1 volume (uL) : num [1:34] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 5 5 ...
$ Injection 1 JP time (HH:MM) : chr [1:34] "0.62847222222222221" "0.76111111111111107" "0.87152777777777779" "0.46875" ...
$ Injection 2 rate (uL/min) : num [1:34] NA NA NA NA NA NA NA NA NA NA ...
$ Injection 2 volume (uL) : chr [1:34] NA NA NA NA ...
$ Injection 2 JP time (HH:MM) : chr [1:34] NA NA NA NA ...
$ Bubbles? : logi [1:34] NA NA NA NA NA NA ...
Inspect the table
To inspect the table, we usually work with an Integrated Development Environment like (e.g., Rstudio) where we can click on a variable and a window will allow to navigate the table. But to render the table in documents like this, recommended good for sharing, you can use kable()
to render the table. In this example we also add a scroll box to make it compact.
knitr::kable(
tbl_subj
) %>%
kable_paper(bootstrap_options = c("basic","hover")) %>%
scroll_box(height= "400px")
Name | Sex | Body weight [g] | Cannula length (mm) | Dilution | Concentration | Withdrawn CA in syringe (uL) | Infused CA in mannitol (uL) | Injection 1 rate (uL/min) | Injection 1 volume (uL) | Injection 1 JP time (HH:MM) | Injection 2 rate (uL/min) | Injection 2 volume (uL) | Injection 2 JP time (HH:MM) | Bubbles? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JP01 | Male | 27.50 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.62847222222222221 | NA | NA | NA | NA |
JP02 | Male | 26.60 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.76111111111111107 | NA | NA | NA | NA |
JP03 | Male | 27.50 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.87152777777777779 | NA | NA | NA | NA |
JP04 | Female | 20.20 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.46875 | NA | NA | NA | NA |
JP05 | Female | 20.20 | 2.2000000000000002 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.62569444444444444 | NA | NA | NA | NA |
JP06 | Female | 20.30 | 2.2000000000000002 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | NA | NA | NA | NA | NA |
JP07 | Male | 26.40 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.70833333333333337 | NA | NA | NA | NA |
JP08 | Male | 24.60 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 0.53125 | NA | NA | NA | NA |
JP09 | Male | 26.80 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 0.5805555555555556 | NA | NA | NA | NA |
JP10 | Male | 25.70 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 0.68958333333333333 | NA | NA | NA | NA |
JP11 | Male | 24.40 | 2.2999999999999998 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 0.78125 | NA | NA | NA | NA |
JP12 | Male | 25.30 | 2.2999999999999998 | 1x | 320 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 0.86388888888888893 | NA | NA | NA | NA |
JP13 | Male | 24.70 | 2.2000000000000002 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 5.0 | 0.4680555555555555 | 0.50 | 5 | 0.51041666666666663 | NA |
JP14 | Male | 27.00 | 2.2000000000000002 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 13.38 | 0.50 | 2.5 | 0.61736111111111114 | NA |
JP15 | Male | 24.00 | 2.2000000000000002 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.68333333333333324 | 0.50 | 2.5 | 0.72638888888888886 | NA |
JP16 | Male | 25.80 | 2.2000000000000002 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.84027777777777779 | 0.50 | 2.5 | 0.8847222222222223 | NA |
JP17 | Male | 26.40 | 2.2000000000000002 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.95694444444444438 | 0.50 | 2.5 | 0.98611111111111116 | NA |
JP18 | Male | 26.60 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.41736111111111113 | NA | NA | NA | NA |
JP19 | Male | 26.90 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.53749999999999998 | NA | NA | NA | NA |
JP20 | Male | 26.20 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 0.65208333333333335 | 0.50 | 2.5 | 0.69097222222222221 | NA |
JP21 | Female | 21.00 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.74513888888888891 | NA | NA | NA | NA |
JP22 | Female | 19.50 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.8666666666666667 | 0.25 | 2.5 | 0.93819444444444444 | NA |
JP23 | Female | 19.60 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.97777777777777775 | NA | NA | NA | NA |
JP24 | Female | 19.56 | 2.2999999999999998 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.47222222222222227 | NA | NA | NA | NA |
JP25 | Female | 19.63 | 2.2999999999999998 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.54861111111111105 | NA | NA | NA | NA |
JP26 | Female | 19.71 | 2.2999999999999998 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.70694444444444438 | NA | NA | NA | NA |
JP27 | Female | 19.20 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.83611111111111114 | NA | NA | NA | NA |
JP28 | Female | 20.20 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.92986111111111114 | 0.25 | 5 | 4.1666666666666666E-3 | NA |
JP29 | Female | 18.60 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.41180555555555554 | NA | NA | NA | NA |
JP30 | Female | 19.03 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 0.50763888888888886 | 0.25 | ca. 2.5 | 0.57430555555555551 | NA |
JP31 | Female | 20.50 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 0.61249999999999993 | NA | NA | NA | NA |
JP32 | Female | 20.20 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 0.7006944444444444 | 0.25 | ca. 2.5 | 0.76944444444444438 | NA |
JP33 | Female | 21.20 | 2.2000000000000002 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 0.7993055555555556 | 0.25 | ca. 2.5 | 0.86597222222222225 | NA |
JP34 | Female | 19.21 | N/A | NA | NA | NA | NA | NA | NA | N/A | NA | NA | N/A | NA |
Troubleshooting
Find the errors
Here we can already observe some unexpected errors. The errors in this example are not uncommon when we are given a table:
- The variable ‘Cannula length (mm)’ would be expected to be numeric but it is read as character (chr) and the number format seems odd , with too many decimals
- The variables ‘Injection 1 JP time (HH:MM)’ and ‘Injection 2 JP time (HH:MM)’ do not fullfill the format indicated in the variable name, of two groups of 2 digits separated by colon.
- We note that there are missing values (NA), those are the blank cells in the excel file.
- Note: str() make give output like
__truncated__
to indicate that not all the cases are displayed on the screen. This is just for display and it doesn’t mean that text is in the table or that the full column was not read.
Identify the source of the problem
Is the source table correct?
If we open this table in excel we see the correct values (e.g.,the column ‘cannula length (mm)’ contains values like 2.2 or 2.3, and ‘Injection JP time (HH:MM)’ values like 18:45). So we know this is a problem when reading the table in R (likely to happen in other software as well).
One or multiple errors ?
Several columns were not correctly read but it seems the problem is the same. They are all read as characters when should have been read in a different format. Do they have something in common?
Inspect columns with the wrong formats
Let us print the unique values of these columns. The function unique()
outputs the values that do not repeat in the column:
unique(tbl_subj$`Cannula length (mm)`)
[1] "2.2999999999999998" "2.2000000000000002" "N/A"
unique(tbl_subj$`Injection 1 JP time (HH:MM)`)
[1] "0.62847222222222221" "0.76111111111111107" "0.87152777777777779"
[4] "0.46875" "0.62569444444444444" NA
[7] "0.70833333333333337" "0.53125" "0.5805555555555556"
[10] "0.68958333333333333" "0.78125" "0.86388888888888893"
[13] "0.4680555555555555" "13.38" "0.68333333333333324"
[16] "0.84027777777777779" "0.95694444444444438" "0.41736111111111113"
[19] "0.53749999999999998" "0.65208333333333335" "0.74513888888888891"
[22] "0.8666666666666667" "0.97777777777777775" "0.47222222222222227"
[25] "0.54861111111111105" "0.70694444444444438" "0.83611111111111114"
[28] "0.92986111111111114" "0.41180555555555554" "0.50763888888888886"
[31] "0.61249999999999993" "0.7006944444444444" "0.7993055555555556"
[34] "N/A"
unique(tbl_subj$`Injection 2 JP time (HH:MM)`)
[1] NA "0.51041666666666663" "0.61736111111111114"
[4] "0.72638888888888886" "0.8847222222222223" "0.98611111111111116"
[7] "0.69097222222222221" "0.93819444444444444" "4.1666666666666666E-3"
[10] "0.57430555555555551" "0.76944444444444438" "0.86597222222222225"
[13] "N/A"
The three columns contained some cells with missing values that the researchers coded with the string “N/A” . Note that some contain also the value NA, which is the symbol that R uses to represent missing values (means not available), they are not surrounded by double quotes because they are not a character type cells, unlike “N/A”. Cells with NA are blank in excel.
Here, because researchers wrote a character string “N/A” in some cells, the entire column was read as having character type data, which prevented from reading numbers and date/time cells correctly.
Identify solutions
So we have a problem when reading a table in R that we don’t have if we use excel. What can we do?
Code fix: specify column types when reading
Most packages allow to specify your column types. Below we specify that the read_excel function should guess the format for all columns except the columns we had troubles with (the package allows to specify text, numeric, guess or date)
- PRO: if we specify all formats then it is clear from reading the code what are the formats of each variable
- CON: this may not always be a convenient solution for large tables with many columns.
SPECIFY NA STRINGS
tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',col_types = c("guess","guess","guess","numeric","guess","guess","guess","guess","guess","guess","date","guess","guess","date","guess"))
tbl <- readxl::read_excel(fileinput, sheet='Control_room_sheet',na = c("","N/A"))
knitr::kable(
tbl
) %>%
kable_paper(bootstrap_options = c("basic","hover")) %>%
scroll_box(height= "400px")
Name | Sex | Body weight [g] | Cannula length (mm) | Dilution | Concentration | Withdrawn CA in syringe (uL) | Infused CA in mannitol (uL) | Injection 1 rate (uL/min) | Injection 1 volume (uL) | Injection 1 JP time (HH:MM) | Injection 2 rate (uL/min) | Injection 2 volume (uL) | Injection 2 JP time (HH:MM) | Bubbles? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JP01 | Male | 27.50 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 15:05:00 | NA | NA | NA | NA |
JP02 | Male | 26.60 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 18:16:00 | NA | NA | NA | NA |
JP03 | Male | 27.50 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 20:55:00 | NA | NA | NA | NA |
JP04 | Female | 20.20 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 11:15:00 | NA | NA | NA | NA |
JP05 | Female | 20.20 | 2.2 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 15:01:00 | NA | NA | NA | NA |
JP06 | Female | 20.30 | 2.2 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | NA | NA | NA | NA | NA |
JP07 | Male | 26.40 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 17:00:00 | NA | NA | NA | NA |
JP08 | Male | 24.60 | 2.3 | 1.5x | 480 mg Ba/ml | 5 | NA | 0.10 | 2.5 | 1899-12-31 12:45:00 | NA | NA | NA | NA |
JP09 | Male | 26.80 | 2.3 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 1899-12-31 13:56:00 | NA | NA | NA | NA |
JP10 | Male | 25.70 | 2.3 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 1899-12-31 16:33:00 | NA | NA | NA | NA |
JP11 | Male | 24.40 | 2.3 | 1.5x | 480 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 1899-12-31 18:45:00 | NA | NA | NA | NA |
JP12 | Male | 25.30 | 2.3 | 1x | 320 mg Ba/ml | 17 | 0.2 | 0.50 | 5.0 | 1899-12-31 20:44:00 | NA | NA | NA | NA |
JP13 | Male | 24.70 | 2.2 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 5.0 | 1899-12-31 11:14:00 | 0.50 | 5 | 1899-12-31 12:15:00 | NA |
JP14 | Male | 27.00 | 2.2 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1900-01-13 09:07:12 | 0.50 | 2.5 | 1899-12-31 14:49:00 | NA |
JP15 | Male | 24.00 | 2.2 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 16:24:00 | 0.50 | 2.5 | 1899-12-31 17:26:00 | NA |
JP16 | Male | 25.80 | 2.2 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 20:10:00 | 0.50 | 2.5 | 1899-12-31 21:14:00 | NA |
JP17 | Male | 26.40 | 2.2 | 1x | 320 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 22:58:00 | 0.50 | 2.5 | 1899-12-31 23:40:00 | NA |
JP18 | Male | 26.60 | 2.2 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 10:01:00 | NA | NA | NA | NA |
JP19 | Male | 26.90 | 2.2 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 12:54:00 | NA | NA | NA | NA |
JP20 | Male | 26.20 | 2.2 | 0.75x | 240 mg Ba/ml | 17 | 0.5 | 0.50 | 2.5 | 1899-12-31 15:39:00 | 0.50 | 2.5 | 1899-12-31 16:35:00 | NA |
JP21 | Female | 21.00 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 17:53:00 | NA | NA | NA | NA |
JP22 | Female | 19.50 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 20:48:00 | 0.25 | 2.5 | 1899-12-31 22:31:00 | NA |
JP23 | Female | 19.60 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 23:28:00 | NA | NA | NA | NA |
JP24 | Female | 19.56 | 2.3 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 11:20:00 | NA | NA | NA | NA |
JP25 | Female | 19.63 | 2.3 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 13:10:00 | NA | NA | NA | NA |
JP26 | Female | 19.71 | 2.3 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 16:58:00 | NA | NA | NA | NA |
JP27 | Female | 19.20 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 20:04:00 | NA | NA | NA | NA |
JP28 | Female | 20.20 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 22:19:00 | 0.25 | 5 | 1899-12-31 00:06:00 | NA |
JP29 | Female | 18.60 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 09:53:00 | NA | NA | NA | NA |
JP30 | Female | 19.03 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 0.5 | 0.25 | 5.0 | 1899-12-31 12:11:00 | 0.25 | ca. 2.5 | 1899-12-31 13:47:00 | NA |
JP31 | Female | 20.50 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 1899-12-31 14:42:00 | NA | NA | NA | NA |
JP32 | Female | 20.20 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 1899-12-31 16:49:00 | 0.25 | ca. 2.5 | 1899-12-31 18:28:00 | NA |
JP33 | Female | 21.20 | 2.2 | 0.75x | 240 mg Ba/ml | 10 | 1.0 | 0.25 | 5.0 | 1899-12-31 19:11:00 | 0.25 | ca. 2.5 | 1899-12-31 20:47:00 | NA |
JP34 | Female | 19.21 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
str(tbl)
tibble [34 × 15] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:34] "JP01" "JP02" "JP03" "JP04" ...
$ Sex : chr [1:34] "Male" "Male" "Male" "Female" ...
$ Body weight [g] : num [1:34] 27.5 26.6 27.5 20.2 20.2 20.3 26.4 24.6 26.8 25.7 ...
$ Cannula length (mm) : num [1:34] 2.3 2.3 2.3 2.3 2.2 2.2 2.3 2.3 2.3 2.3 ...
$ Dilution : chr [1:34] "1.5x" "1.5x" "1.5x" "1.5x" ...
$ Concentration : chr [1:34] "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" "480 mg Ba/ml" ...
$ Withdrawn CA in syringe (uL): num [1:34] 5 5 5 5 5 5 5 5 17 17 ...
$ Infused CA in mannitol (uL) : num [1:34] NA NA NA NA NA NA NA NA 0.2 0.2 ...
$ Injection 1 rate (uL/min) : num [1:34] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.5 0.5 ...
$ Injection 1 volume (uL) : num [1:34] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 5 5 ...
$ Injection 1 JP time (HH:MM) : POSIXct[1:34], format: "1899-12-31 15:05:00" "1899-12-31 18:16:00" ...
$ Injection 2 rate (uL/min) : num [1:34] NA NA NA NA NA NA NA NA NA NA ...
$ Injection 2 volume (uL) : chr [1:34] NA NA NA NA ...
$ Injection 2 JP time (HH:MM) : POSIXct[1:34], format: NA NA ...
$ Bubbles? : logi [1:34] NA NA NA NA NA NA ...
Source fix: correct the source table in excel
In this example it would just be easier to go to the source table and delete those “N/A” entries, the cells will be then blank and read as NA by R without affecting the column format. This modification can be done with a script if the table is large
- PRO: the source will be read correctly in any new script.
- CON: if the table is too complex it will add a new step ( a ‘correcting’ script) from original data, if the it is simple and it is done manually there is a small chance for human error
Choose the most practical and safe solution
In this example, the source table required manual entries and inspections by researchers because they had to fill some on-site information from the lab. Thus, the most practical solution seems to delete ‘N/A’ entries and instruct researchers to leave empty those cells instead. This does not compromise safety as the table would have required manual inspection and entries anyway. Note that this would have been needed for consistency in the table, even if this format-read issue (some cells had been left empty while some others had ‘N/A’, this is inconsistent formatting unless it is required to convey different information).
Preventing future issues
Consider the following points to favor interoperability and minimize reading errors:
Keep formats consistent. If tables are filled in by several users, have explicit instructions (e.g., how to represent missing values). Whenever possible avoid mix formats within a variable
Avoid encoding information in formatting, for example, using colors or conditional formatting). No information should be loss if only the plain content (without formatting) of the cells is read.
Use consistent variables names, use informative and descriptive names while avoid whenever possible special characters and exceedingly long names. It is also recommended to have additional sheets with a codebook describing each variable name and what it means