Data cleaning is a critical step of data analysis, especially considering the messy nature of real-world data, which often includes duplicates, errors, incomplete entries, and irrelevant formats. Addressing these issues is essential for producing accurate, reliable, and reproducible results. However, data cleaning can pose a substantial barrier in data analysis due to the time-consuming nature of the process.
{cleanepi} is an R package designed specifically to address this challenge by offering tools to clean, curate, and standardize datasets. Tailored specifically for epidemiological data and compatible with data frame-like structures, {cleanepi} offers a suite of functions designed to streamline common data cleaning tasks.
This vignette provides a comprehensive guide to the functionalities encapsulated within {cleanepi}. It provides users with detailed insights into each function’s purpose and practical usage, equipping them with the tools necessary to navigate and manipulate cluttered datasets effectively.
The main function in {cleanepi} is
clean_data()
that can perform the following tasks:
NA
.character
columns into Date
if the
column actually contains values of type Date
to some extent
(default is at least 60% of the values are Date
).In addition, the package also has two surrogate functions: 1.
scan_data()
: scan the input data to determine the percent
of missing, numeric, character, logical and date values in every
character column of the input data frame. 2.
print_report()
: print the data cleaning report.
# IMPORTING THE TEST DATASET
test_data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
---|---|---|---|---|---|---|---|
PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | 1 |
PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | 1 |
PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | -99 |
PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | 1 |
P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | 2 |
PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | -99 | May 02, 2021 | 2 |
PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | 1 |
PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | 2 |
PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | 1 |
PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | 2 |
Field_names | missing | numeric | date | character | logical |
---|---|---|---|---|---|
study_id | 0 | 0.0 | 0.0 | 1 | 0 |
event_name | 0 | 0.0 | 0.0 | 1 | 0 |
country_name | 0 | 0.0 | 0.0 | 1 | 0 |
date.of.admission | 0 | 0.0 | 1.0 | 0 | 0 |
dateOfBirth | 0 | 0.1 | 0.9 | 0 | 0 |
date_first_pcr_positive_test | 0 | 0.0 | 1.0 | 0 | 0 |
In {cleanepi}, every cleaning operation is encapsulated within a module, with detailed descriptions provided in the package design vignette. Each module also specifies the parameters required for its main function, outlined in the sections below.
The clean_data()
function, as described above, can take
the following arguments:
data.frame
or
linelist
.standardize_column_names()
function.NA
. The elements
of the list are the inputs for the replace_missing_values()
function.remove_duplicates()
function.remove_constants()
function.standardize_dates()
function.check_subject_ids()
.convert_to_numeric()
.clean_using_dictionary()
function.check_date_sequence()
function.It is important to note that only cleaning operations that are explicitly defined will be performed.
In the following chunk, we define a set of cleaning operations that we want to perform on the input data.
# PARAMETERS FOR REPLACING MISSING VALUES WITH NA
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")
# PARAMETERS FOR COLUMN NAMES STANDARDIZATION
standardize_column_names <- list(keep = NULL, rename = NULL)
# PARAMETERS FOR DUBLICATES DETECTION AND REMOVAL
remove_duplicates <- list(target_columns = NULL)
# PARAMETERS FOR STANDARDING DATES
standardize_dates <- list(
target_columns = NULL,
error_tolerance = 0.4,
format = NULL,
timeframe = as.Date(c("1973-05-29", "2023-05-29")),
orders = list(
world_named_months = c("Ybd", "dby"),
world_digit_months = c("dmy", "Ymd"),
US_formats = c("Omdy", "YOmd")
)
)
# PARAMETERS FOR STANDARDING SUBJECT IDs
standardize_subject_ids <- list(
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1, 100),
nchar = 7
)
# CONVERT THE 'sex' COLUMN INTO NUMERIC
to_numeric <- list(target_columns = "sex", lang = "en")
# PARAMETERS FOR CONSTANT COLUMNS, EMPTY ROWS AND COLUMNS REMOVAL
remove_constants <- list(cutoff = 1)
# LAOD THE DATA DICTIONARY
dictionary <- readRDS(
system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)
# CLEAN THE INPUT DATA FRAME
cleaned_data <- clean_data(
data = test_data,
remove_constants = remove_constants,
replace_missing_values = replace_missing_values,
remove_duplicates = remove_duplicates,
standardize_dates = standardize_dates,
standardize_subject_ids = standardize_subject_ids,
to_numeric = to_numeric,
dictionary = dictionary
)
cleaning column names
replacing missing values with NA
removing the constant columns, empty rows and columns
removing duplicated rows
No duplicates were found.
standardising date columns
checking subject IDs format
Warning: Detected incorrect subject ids at lines: 3, 5, 7
Use the correct_subject_ids() function to adjust them.
converting sex, en into numeric
performing dictionary-based cleaning
It returns the cleaned dataset. The report generated from the data
cleaning operations is a list object
that is attached to
the cleaned data and can be accessed using the attr()
function. The report contains details of each cleaning operation that
was performed during the process. However, users can access the report
using the code below:
# ACCESS THE DATA CLEANING REPORT
report <- attr(cleaned_data, "report")
# SUMMARIZE THE REPORT OBJECT
summary(report)
#> Length Class Mode
#> colnames 2 data.frame list
#> missing_values_replaced_at 1 -none- character
#> constant_data 4 data.frame list
#> out_of_range_dates 3 data.frame list
#> multi_format_dates 5 data.frame list
#> incorrect_subject_id 2 data.frame list
#> converted_into_numeric 1 -none- character
The report can also be displayed in an HTML format using the
print_report()
function as shown below:
{cleanepi} luckily provides users with the flexibility to call a specific function if they wish to perform that particular task individually. This approach allows users to have more control over the data cleaning process and to apply additional data cleaning functions as needed.
For example, some data cleaning operations, such as renaming columns,
removing empty rows and columns, removing columns with the same values
across all rows, and standardizing date columns, play a central role in
standardizing the epidemiological data. These operations can be
performed within the framework of the clean_data()
function
or alone by calling the relevant function.
This setup offers users both convenience and flexibility, as they can perform all the cleaning operations at once, or execute them individually according to their specific needs.
Some datasets contain constant columns (columns with the same values
across all rows), and/or empty rows and columns (rows or columns where
all values are missing i.e NA
). The
remove_constants()
function can be used to remove such
“noise”. The function takes the following argument:
0
and
1
, to be used when removing empty rows and columns. When
provided, only rows and columns where the percent of missing data is
greater than this cut-off will removed. Rows and columns with 100%
missing values will be remove by default.study_id | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
---|---|---|---|---|
PS001P2 | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | 1 |
PS002P2 | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | 1 |
PS004P2-1 | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | -99 |
PS003P2 | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | 1 |
P0005P2 | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | 2 |
PS006P2 | 17/02/2021 | -99 | May 02, 2021 | 2 |
PB500P2 | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | 1 |
PS008P2 | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | 2 |
PS010P2 | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | 1 |
PS011P2 | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | 2 |
The remove_constants()
function returns a dataset where
all constant columns, empty rows and columns are iteratively removed.
Note that when the first iteration of constant data removal results in a
dataset with new empty rows and/or columns and constant columns, this
process will be carried on several times until there is no more constant
data. Rows and columns that were deleted at any iterations will be
reported in the report object.
The syntax used to name the columns of a dataset during its creation
depends on many factors such as the language, the naming convention,
etc. We provide, in {cleanepi}, the
standardize_column_names()
function to clean up column
names and convert them to more intuitive formats. It performs many
sub-tasks including: replacing a space, dot, or hyphen between two words
with underscore; converting camel-cases to snake-cases; substituting
foreign characters with their corresponding English characters; and
splitting along word into multiple short words by capital characters
within, if any, and connecting them with underscores. The function can
take the following arguments:
linelist_tags
, to maintain the tagged column names. The
Default is NULL
.new_name = "old_name"
. If not provided, all
columns will undergo standardization.# IMPORT AND PRINT THE INITAL COLUMN NAMES
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
print(colnames(data))
col_name_cleaning [1] "study_id" "event_name"
col_name_cleaning [3] "country_code" "country_name"
col_name_cleaning [5] "date.of.admission" "dateOfBirth"
col_name_cleaning [7] "date_first_pcr_positive_test" "sex"
# KEEP 'date.of.admission' AS IS
cleaned_data <- standardize_column_names(
data = data,
keep = "date.of.admission"
)
print(colnames(data))
col_name_cleaning [1] "study_id" "event_name"
col_name_cleaning [3] "country_code" "country_name"
col_name_cleaning [5] "date.of.admission" "dateOfBirth"
col_name_cleaning [7] "date_first_pcr_positive_test" "sex"
# KEEP 'date.of.admission' AS IS, BUT RENAME 'dateOfBirth' AND 'sex' TO
# 'DOB' AND 'gender' RESPECTIVELY
cleaned_data <- standardize_column_names(
data = data,
keep = "date.of.admission",
rename = c(DOB = "dateOfBirth", gender = "sex")
)
print(colnames(data))
col_name_cleaning [1] "study_id" "event_name"
col_name_cleaning [3] "country_code" "country_name"
col_name_cleaning [5] "date.of.admission" "dateOfBirth"
col_name_cleaning [7] "date_first_pcr_positive_test" "sex"
By providing the function with these parameters, the users can redefine the name of the columns and get easy to work with column names. This enables a more comprehensive naming system that is tailored to the needs of the user.
NA
It is common to have missing values in an input dataset. By default,
R expects missing values to be represented by NA
, However,
this is not always the case as some dataset can contain a specific
character string that denotes the missing value. In the presence of such
a scenario, user can call the replace_missing_values()
function to substitute these missing values with NA
. This
will make the data suitable for any data science operations. The
function takes the following arguments:
linelist
object, this parameter can be set to linelist_tags
if you
wish to replace missing values with NA
on tagged columns
only. The default value NULL
i.e. replace missing values
across all columns.cleanepi::common_na_strings
. However, if the
missing values string in the columns of interest is not included in this
predefined vector, it can be used as the value for this argument.# VISUALIZE THE PREDEFINED VECTOR OF MISSING CHARACTERS
print(cleanepi::common_na_strings)
default_missing_values [1] "missing" "NA" "N A" "N/A"
default_missing_values [5] "#N/A" "NA " " NA" "N /A"
default_missing_values [9] "N / A" " N / A" "N / A " "na"
default_missing_values [13] "n a" "n/a" "na " " na"
default_missing_values [17] "n /a" "n / a" " a / a" "n / a "
default_missing_values [21] "NULL" "null" "" "\\?"
default_missing_values [25] "\\*" "\\." "not available" "Not Available"
default_missing_values [29] "NOt available" "not avail" "Not Avail" "nan"
default_missing_values [33] "NAN" "not a number" "Not A Number"
# REPLACE ALL OCCURENCES OF "-99" WITH NA IN THE "sex" COLUMN
cleaned_data <- replace_missing_values(
data = readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")),
target_columns = "sex",
na_strings = "-99"
)
# REPLACE ALL OCCURENCES OF "-99" WITH NA FROM ALL COLUMNS
cleaned_data <- replace_missing_values(
data = readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")),
target_columns = NULL,
na_strings = "-99"
)
study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
---|---|---|---|---|---|---|---|
PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | 1 |
PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | 1 |
PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | NA |
PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | 1 |
P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | 2 |
PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | NA | May 02, 2021 | 2 |
PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | 1 |
PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | 2 |
PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | 1 |
PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | 2 |
The default date format in R is Ymd
(the ISO8601
format). However, it is very common to encounter date values that are
written differently from this. Also, there are cases where a column in a
data frame contains both values of type Date
,
character
or others.
The standardize_dates()
function provides a
comprehensive set of options for converting date columns into a
specified format and handling various scenarios, such as different date
formats and mixed data types in a column.
Entries which cannot be processed result in NA
. An error
threshold can be used to define the maximum number of resulting
NA
(i.e. entries without an identified date) that can be
tolerated. If this threshold is exceeded, the original vector is
returned.
The function expects the following arguments:
NA
(optional).NA
values (non date values) that can be allowed in a
converted column. Default is 40% i.e. 0.4
.orders <- list(
quarter_partial_dates = c("Y", "Ym", "Yq"),
world_digit_months = c("ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm",
"dmY", "mdY", "mYd", "dYm"),
world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb",
"bdY", "bYd", "Ybd", "Ydb"),
us_format = c("Omdy", "YOmd")
)
⚠️ The
error_tolerance
must be used with caution. When it is set, and the percentage of non-date values (NA i.e. values that were not converted into date) in a character column is greater than this threshold, the column will be returned as it is. The values outside of this timeframe can be accessed from the report object via itout_of_range_dates
element.
The value for the
orders
argument can be modified to suit the user’s needs. Other date formats can be specified too. For instance, if you want to prioritize American-style dates with numeric months, you can switch the second and third elements of the default orders as shown below:
# GIVE PRIORITY TO AMERICAN-STYLE DATES
us_ord <- orders[c(1L, 3L, 2L)]
# ADD A FORMAT WITH HOURS TO THE EXISTING orders
# THIS WILL ALLOW FOR THE CONVERSION OF VALUES SUCH AS "2014_04_05_23:15:43"
# WHEN THEY APPEAR IN THE TARGET COLUMNS.
orders$ymdhms <- c("Ymdhms", "Ymdhm")
This function provides users with the flexibility to standardize date
columns in their dataset according to specified requirements, including
format
, timeframe
, and
error tolerance
for conversion from character to date
columns.
# STANDARDIZE VALUES IN THE 'date_first_pcr_positive_test' COLUMN
test_data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
head(test_data$date_first_pcr_positive_test)
date_standardisation [1] "Dec 01, 2020" "Jan 01, 2021" "Feb 11, 2021" "Feb 01, 2021" "Feb 16, 2021"
date_standardisation [6] "May 02, 2021"
res <- standardize_dates(
data = test_data,
target_columns = "date_first_pcr_positive_test",
format = NULL,
timeframe = NULL,
error_tolerance = 0.4,
orders = list(
world_named_months = c("Ybd", "dby"),
world_digit_months = c("dmy", "Ymd"),
US_formats = c("Omdy", "YOmd")
)
)
This function returns the input dataset where the (specified) columns are converted into Date if the condition is met.
study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
---|---|---|---|---|---|---|---|
PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | 2020-12-01 | 1 |
PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | 2021-01-01 | 1 |
PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | 2021-02-11 | -99 |
PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | 2021-02-01 | 1 |
P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | 2021-02-16 | 2 |
PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | -99 | 2021-05-02 | 2 |
PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | 2021-02-19 | 1 |
PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | 2021-09-20 | 2 |
PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | 2021-02-26 | 1 |
PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | 2021-03-03 | 2 |
It also adds two or three elements to the report object:
field | idx | world_named_months | world_digit_months | US_formats |
---|---|---|---|---|
date.of.admission | 1 | NA | 2020-12-01 | 2020-01-12 |
date.of.admission | 4 | NA | 2021-02-11 | 2021-11-02 |
date.of.admission | 9 | NA | 2021-03-02 | 2021-02-03 |
date.of.admission | 10 | NA | 2021-03-05 | 2021-05-03 |
dateOfBirth | 1 | NA | 1972-01-06 | 1972-06-01 |
dateOfBirth | 7 | NA | 1989-03-11 | 1989-11-03 |
dateOfBirth | 8 | NA | 1976-05-10 | 1976-10-05 |
dateOfBirth | 10 | NA | 1991-08-02 | 1991-02-08 |
The check_subject_ids()
function is designed to identify
rows from the input dataset where the ids don’t comply with the expected
subject ids format. It expects the following parameters:
By providing these parameters, the function becomes a versatile tool for data cleaning, ensuring that the user is alerted on the presence of unexpected, missing and duplicated subject ids. When using the function, make sure to tailor the parameters according to the specific requirements of your dataset and the expected characteristics of the subject IDs.
# DETECT AND REMOVE INCORRECT SUBJECT IDs
res <- check_subject_ids(
data = readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")),
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1L, 100L),
nchar = 7L
)
subject_ids_standardisation Warning: Detected incorrect subject ids at lines: 3, 5, 7
subject_ids_standardisation Use the correct_subject_ids() function to adjust them.
# EXTRACT REPORT
report <- attr(res, "report")
# SUMMARIZE THE REPORT OBJECT
summary(report)
subject_ids_standardisation Length Class Mode
subject_ids_standardisation incorrect_subject_id 2 data.frame list
The check_subject_ids()
function returns the input
dataset and send a warming when there are some incorrect ids.
In addition to detecting undesirable subject ids, the function will
also look for missing and duplicated IDs. As the result of this, the
report made from this operation might contain two extra elements:
missing_ids (a vector of row indexes where there is a
missing IDs) and duplicated_ids (a data frame of rows
with the duplicated IDs). Use the print_report()
function
to display the report made from this operation.
After the detection of the incorrect subject ids using the
check_subject_ids()
, use the
correct_subject_ids()
to replace non complying ids with the
correct ones. The function requires a data frame with the following two
columns:
# IMPORT THE INPUT DATA
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# GENERATE THE CORRECTION TABLE
correction_table <- data.frame(
from = c("P0005P2", "PB500P2", "PS004P2-1"),
to = c("PB005P2", "PB050P2", "PS004P2"),
stringsAsFactors = FALSE
)
# PERFORM THE CORRECTION
dat <- correct_subject_ids(
data = data,
target_columns = "study_id",
correction_table = correction_table
)
The check_date_sequence()
function verifies the order of
sequences in date event columns within a dataset. It ensures that the
values in the specified date columns follow the desired chronological
order. Here are the arguments accepted by the function:
target_columns = c("date_of_infection", "date_of_admission", "date_of_death")
.By utilizing these arguments, the check_date_sequence()
function facilitates the validation of date sequences within a dataset,
ensuring data integrity and accuracy for further analysis. Additionally,
it offers flexibility by allowing users to choose whether to remove rows
with incorrect sequences or store them for further examination in the
report object.
# DETECT ROWS WITH INCORRECT DATE SEQUENCE
res <- check_date_sequence(
data = readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")),
target_columns = c("date_first_pcr_positive_test", "date.of.admission")
)
check_date_order Warning: Detected 10 incorrect date sequences at line(s): 1, 2, 3, 4, 5, 6, 7,
check_date_order 8, 9, 10
# EXTRACT THE REPORT
report <- attr(res, "report")
# SUMMARIZE THE REPORT OBJECT
summary(report)
check_date_order Length Class Mode
check_date_order incorrect_date_sequence 2 data.table list
The check_date_sequence()
function returns the input
dataset, augmented with an attributes named as
incorrect_date_sequence
if there are rows with incorrect
date sequences. This attribute highlights any discrepancies found in the
date sequences, enabling users to take appropriate actions. Use the
print_report()
function to display the report made from
this operation.
In certain scenarios, the input data contains columns where the
number are written in letters. For instance, the ages of a study
participants can be written in letters. Similarly, a column can contain
values written in both numbers and letters (an age column where some
values are written in numbers and others in letters). The
convert_to_numeric()
function offers the framework to
convert all numbers written in letters from a given column into numeric.
It takes the following arguments:
linelist_tags
if the tagged columns are the one to be
converted into numeric. When target_columns = NULL
, the
function uses the output form the scan_data()
function to
identify the columns where the proportion of numeric values is at least
twice as the percentage of character values. Those columns will be the
columns of interest and the character values in them will be converted
into numeric. Note that any string in such column that can not be
converted into numeric will be set to NA
in the resulting
data."en", "fr", or "es"
.# CONVERT THE 'age' COLUMN IN THE TEST LINELIST DATA
dat <- readRDS(system.file("extdata", "messy_data.RDS", package = "cleanepi"))
head(dat$age, 10L)
check_date_order [1] "37" "seventy-four" "17" "3" "37"
check_date_order [6] NA "6" "26" "eleven" "44"
dat <- convert_to_numeric(
data = dat,
target_columns = "age",
lang = "en"
)
head(dat$age, 10L)
check_date_order [1] 37 74 17 3 37 NA 6 26 11 44
Some columns in a data frame might contain numeric values that
represents the number of days elapsed between two events. For instance,
the recruitment day of individuals in a study can be stored as a numeric
column where the numeric values are the count of days between when they
are recruited and when there were admitted in the hospital. The actual
dates when the individuals were recruited can be retrieved using the
convert_numeric_to_date()
function. This function can take
the following parameters:
linelist
object, this parameter can be set to
linelist_tags
if tagged variables are the target
columns.TRUE
) or not
(FALSE
). The default is TRUE
.data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")) %>%
standardize_dates(target_columns = "date.of.admission")
# CREATE THE RECRUITMENT DATE COLUMNS
data$recruitment_date <- sample(20:50, nrow(data), replace = FALSE)
# RETRIVE THE DATE INDIVIDUALS WERE RECRUITED
data <- convert_numeric_to_date(
data = data,
target_columns = "recruitment_date",
ref_date = "date.of.admission",
forward = TRUE
)
# RETRIVE THE DATE INDIVIDUALS WERE RECRUITED
data <- convert_numeric_to_date(
data = data,
target_columns = "recruitment_date",
ref_date = as.Date("2019-10-13"),
forward = FALSE
)
The function returns the input data where the values in the target columns(s) are converted into Date. This enables the usage of {cleanepi}’s functions that operates on Date columns as well as the powerful functions that can be used to manipulate Dates from the {base} R or {lubridate} packages.
The find_duplicates()
function serves the purpose of
identifying duplicated rows within a given dataset. It accepts the
following parameters:
NULL
is passed, duplicates will be detected across all
columns of the dataset. Notably, if the input dataset is a
linelist
object, target_columns
can be set to
linelist_tags
specifically to identify duplicates across
the tagged variables only.By leveraging the find_duplicates()
function with
appropriate parameters, users can efficiently pinpoint duplicated rows
within their datasets, either across all columns or selectively across
tagged variables in a linelist
object.
# IMPORT A `linelist` DATA
data <- readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
# SHOW THE TAGGED VARIABLES
linelist::tags(data)
$date_onset
[1] "dt_onset"
$date_reporting
[1] "dt_report"
$gender
[1] "sex"
$outcome
[1] "outcome"
# FIND DUPLICATES ACROSS ALL COLUMNS EXCEPT THE SUBJECT IDs COLUMN
all_columns <- names(data)
target_columns <- all_columns[all_columns != "id"]
dups <- find_duplicates(
data = data,
target_columns = target_columns
)
No duplicates were found.
# FIND DUPLICATES ACROSS TAGGED VARIABLES
dups <- find_duplicates(
data = data,
target_columns = "linelist_tags"
)
Found 57 duplicated rows in the dataset. Please consult the report for more details.
Upon execution, the find_duplicates()
function
identifies all duplicated rows either based on all columns or those
specified, and stores them in the report. In addition to the existing
columns, it appends two extra columns to the dataset:
row_id
: Contains indexes of the duplicated rows from
the original input dataset.group_id
: Contains unique identifiers assigned to each
duplicated group, which is defined as a set of rows sharing identical
values in the designated columns of interest.By including these extra columns, users gain insights into the specific rows identified as duplicates and their corresponding group identifiers, enabling efficient analysis and management of duplicated data within the dataset.
# VISUALIZE THE DUPLICATES
report <- attr(dups, "report")
duplicates <- report$duplicated_rows
duplicates %>%
kableExtra::kbl() %>%
kableExtra::kable_paper("striped", font_size = 14, full_width = FALSE) %>%
kableExtra::scroll_box(height = "200px", width = "100%",
box_css = "border: 1px solid #ddd; padding: 5px; ",
extra_css = NULL,
fixed_thead = TRUE)
row_id | group_id | dt_onset | dt_report | sex | outcome |
---|---|---|---|---|---|
26 | 1 | 2015-05-21 | 2015-06-03 | M | Alive |
33 | 1 | 2015-05-21 | 2015-06-03 | M | Alive |
55 | 2 | 2015-05-30 | 2015-06-06 | M | Alive |
62 | 2 | 2015-05-30 | 2015-06-06 | M | Alive |
23 | 3 | 2015-05-31 | 2015-06-02 | M | Dead |
24 | 3 | 2015-05-31 | 2015-06-02 | M | Dead |
99 | 4 | 2015-05-31 | 2015-06-09 | M | Alive |
105 | 4 | 2015-05-31 | 2015-06-09 | M | Alive |
27 | 5 | 2015-06-01 | 2015-06-03 | M | Alive |
31 | 5 | 2015-06-01 | 2015-06-03 | M | Alive |
44 | 6 | 2015-06-01 | 2015-06-06 | F | Alive |
60 | 6 | 2015-06-01 | 2015-06-06 | F | Alive |
65 | 7 | 2015-06-01 | 2015-06-07 | F | Alive |
73 | 7 | 2015-06-01 | 2015-06-07 | F | Alive |
78 | 7 | 2015-06-01 | 2015-06-07 | F | Alive |
82 | 7 | 2015-06-01 | 2015-06-07 | F | Alive |
85 | 7 | 2015-06-01 | 2015-06-07 | F | Alive |
47 | 8 | 2015-06-02 | 2015-06-06 | F | Dead |
51 | 8 | 2015-06-02 | 2015-06-06 | F | Dead |
56 | 9 | 2015-06-02 | 2015-06-06 | M | Alive |
58 | 9 | 2015-06-02 | 2015-06-06 | M | Alive |
61 | 9 | 2015-06-02 | 2015-06-06 | M | Alive |
110 | 10 | 2015-06-02 | 2015-06-10 | F | Alive |
122 | 10 | 2015-06-02 | 2015-06-10 | F | Alive |
54 | 11 | 2015-06-03 | 2015-06-06 | F | Alive |
63 | 11 | 2015-06-03 | 2015-06-06 | F | Alive |
68 | 12 | 2015-06-04 | 2015-06-07 | F | Alive |
71 | 12 | 2015-06-04 | 2015-06-07 | F | Alive |
79 | 12 | 2015-06-04 | 2015-06-07 | F | Alive |
87 | 12 | 2015-06-04 | 2015-06-07 | F | Alive |
102 | 13 | 2015-06-04 | 2015-06-09 | F | Alive |
108 | 13 | 2015-06-04 | 2015-06-09 | F | Alive |
96 | 14 | 2015-06-06 | 2015-06-09 | F | Alive |
107 | 14 | 2015-06-06 | 2015-06-09 | F | Alive |
103 | 15 | 2015-06-06 | 2015-06-09 | M | Alive |
104 | 15 | 2015-06-06 | 2015-06-09 | M | Alive |
132 | 16 | 2015-06-08 | 2015-06-12 | M | Alive |
133 | 16 | 2015-06-08 | 2015-06-12 | M | Alive |
135 | 16 | 2015-06-08 | 2015-06-12 | M | Alive |
129 | 17 | 2015-06-10 | 2015-06-12 | M | Alive |
138 | 17 | 2015-06-10 | 2015-06-12 | M | Alive |
46 | 18 | NA | 2015-06-06 | M | Alive |
49 | 18 | NA | 2015-06-06 | M | Alive |
69 | 19 | NA | 2015-06-07 | M | Alive |
80 | 19 | NA | 2015-06-07 | M | Alive |
91 | 20 | NA | 2015-06-08 | M | Alive |
95 | 20 | NA | 2015-06-08 | M | Alive |
119 | 21 | NA | 2015-06-10 | M | Alive |
121 | 21 | NA | 2015-06-10 | M | Alive |
155 | 22 | NA | 2015-06-16 | F | Alive |
161 | 22 | NA | 2015-06-16 | F | Alive |
156 | 23 | NA | 2015-06-16 | M | Alive |
157 | 23 | NA | 2015-06-16 | M | Alive |
158 | 23 | NA | 2015-06-16 | M | Alive |
159 | 23 | NA | 2015-06-16 | M | Alive |
160 | 23 | NA | 2015-06-16 | M | Alive |
162 | 23 | NA | 2015-06-16 | M | Alive |
To eliminate duplicated rows from a dataset, the
remove_duplicates()
function can be employed. This function
internally utilizes the find_duplicates()
function and
expects the following parameters:
NULL
, the function will detect
duplicates across all columns. If the input dataset is a
linelist
object, setting this parameter to
linelist_tags
will identify duplicates across the tagged
variables only.# REMOVE DUPLICATE ACROSS TAGGED COLUMNS ONLY.
res <- remove_duplicates(
data = readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
),
target_columns = "linelist_tags"
)
#> Found 57 duplicated rows in the dataset. Please consult the report for more details.
Upon execution, the remove_duplicates()
function returns
the input dataset without duplicated rows removed (if found). The
details about the duplicates removal operation are stored in the report
object that is attached to the output object. When duplicates are found,
this report will contain the following elements:
By examining these elements within the report, users gain insights into the specific duplicated rows, those that were removed, and the columns used to identify the duplicates, thus facilitating transparency and documentation of the duplicates removal process.
# ACCESS THE REPORT
report <- attr(res, "report")
# SUMMARIZE THE REPORT OBJECT
summary(report)
Length Class Mode
duplicated_rows 6 grouped_df list
duplicates_checked_from 1 -none- character
removed_duplicates 5 grouped_df list
Use the print_report()
function to display the report
made from this operation.
The output from find_duplicates()
function can also be
passed to remove_duplicates()
function to specify which
duplicated rows to be removed.
# DETECT DUPLICATES FROM TAGGED COLUMNS
dups <- find_duplicates(
data = readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
),
target_columns = "linelist_tags"
)
find_and_remove_dups Found 57 duplicated rows in the dataset. Please consult the report for more details.
# EXTRACT THE DUPLICATES
report <- attr(dups, "report")
duplicates <- report$duplicated_rows
# REMOVE FIRST OCCURRENCE OF DUPLICATED ROWS
dups_index_to_remove <- duplicates[["row_id"]][seq(1L, nrow(dups), 2L)]
dups_index_to_remove <- dups_index_to_remove[!is.na(dups_index_to_remove)]
no_dups <- data[-dups_index_to_remove, ]
The clean_using_dictionary()
function offers a
convenient way to replace the options in a data frame or linelist with
their corresponding values stored in a data dictionary. The function
expects the following arguments:
clean_using_dictionary()
function relies on
functions from this package.options | values | grp | orders |
---|---|---|---|
1 | male | sex | 1 |
2 | female | sex | 2 |
The add_to_dictionary()
function is a useful tool for
expanding the coverage of a data dictionary by defining options that are
present in the input data but not originally included in the dictionary.
This function enables users to dynamically update the dictionary to
accommodate new values encountered in the dataset. In addition to the
current data dictionary the function takes the arguments defined
below:
option, value, grp, order: the values for the options to be added in the data dictionary. The example below shows how this function is used.
By employing the add_to_dictionary()
function, users can
ensure that the data dictionary remains comprehensive and aligned with
the evolving nature of the input dataset, thereby enhancing the accuracy
and completeness of data interpretation and analysis. In the example
below, we add -99
to our test data dictionary,
test_dictionary
.
# READING IN THE DATA
data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
# ADD THE EXTRA OPTION TO THE DICTIONARY
dictionary <- add_to_dictionary(
dictionary = dictionary,
option = "-99",
value = "unknow",
grp = "sex",
order = NULL
)
options | values | grp | orders |
---|---|---|---|
1 | male | sex | 1 |
2 | female | sex | 2 |
-99 | unknow | sex | 3 |
# PERFORM THE DICTIONARY-BASED SUBSTITUTION
cleaned_df <- clean_using_dictionary(
data = data,
dictionary = dictionary
)
study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
---|---|---|---|---|---|---|---|
PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | male |
PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | male |
PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | unknow |
PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | male |
P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | female |
PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | -99 | May 02, 2021 | female |
PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | male |
PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | female |
PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | male |
PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | female |
The timespan()
function computes the time span between
two elements of type Date. The resulting time span can be expressed in
“years”, “months”, “weeks”, or “days”, depending on the user-specified
unit. The functions can take the following arguments:
Ymd
i.e. 2024-01-31
. The time span
will calculated between these values and the end_date
defined below.Sys.Date()
.span
.span_remainder_unit = NULL
.With these arguments, the function offers flexibility in determining the time span in different units. It facilitates various analytics tasks where the time span computation is a necessary component, providing users with the ability to customize the output according to their specific requirements.
# IMPORT DATA, REPLACE MISSING VALUES WITH 'NA' & STANDARDIZE DATES
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")) %>%
replace_missing_values(
target_columns = "dateOfBirth",
na_strings = "-99"
) %>%
standardize_dates(
target_columns = "dateOfBirth",
error_tolerance = 0.0,
format = "%m/%d/%Y" # nolint: nonportable_path_linter.
)
#> Warning: Using %m/%d/%Y to standardize target columns...
# CALCULATE INDIVIDUAL AGE IN YEARS FROM THE 'dateOfBirth' COLUMN AND SEND THE
# REMAINDER IN MONTHS
age <- timespan(
data = data,
target_column = "dateOfBirth",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "age_in_years",
span_remainder_unit = "months"
)
# CALCULATE THE TIME SPAN IN YEARS BETWEEN INDIVIDUALS 'dateOfBirth' AND THE DAY
# THEY TESTED POSITIVE
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
data <- data %>%
replace_missing_values(
target_columns = "dateOfBirth",
na_strings = "-99"
) %>%
standardize_dates(
target_columns = c("date_first_pcr_positive_test", "date.of.admission"),
error_tolerance = 0.0,
format = NULL
) %>%
timespan(
target_column = "date.of.admission",
end_date = "date_first_pcr_positive_test",
span_unit = "years",
span_column_name = "elapsed_time",
span_remainder_unit = NULL
)
The timespan()
function augments the input dataset by
adding one or two extra columns containing age-related information.
These additional columns are as follows:
Calculated time span in the specified scale: Contains the calculated time span in the specified unit (“years”, “months”, “weeks”, or “days”).
Remaining number of days: Indicates the remaining number of “days” or “weeks” or “months” after calculating the time span, representing the fractional part of the time span calculation. This column is included if needed, and provides additional granularity in the time span representation.
study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex | elapsed_time |
---|---|---|---|---|---|---|---|---|
PS001P2 | day 0 | 2 | Gambia | 2020-12-01 | 06/01/1972 | 2020-12-01 | 1 | 0.0000000 |
PS002P2 | day 0 | 2 | Gambia | 2021-01-28 | 02/20/1952 | 2021-01-01 | 1 | -0.0739220 |
PS004P2-1 | day 0 | 2 | Gambia | 2021-02-15 | 06/15/1961 | 2021-02-11 | -99 | -0.0109514 |
PS003P2 | day 0 | 2 | Gambia | 2021-02-11 | 11/11/1947 | 2021-02-01 | 1 | -0.0273785 |
P0005P2 | day 0 | 2 | Gambia | 2021-02-17 | 09/26/2000 | 2021-02-16 | 2 | -0.0027379 |
PS006P2 | day 0 | 2 | Gambia | 2021-02-17 | NA | 2021-05-02 | 2 | 0.2026010 |
PB500P2 | day 0 | 2 | Gambia | 2021-02-28 | 11/03/1989 | 2021-02-19 | 1 | -0.0246407 |
PS008P2 | day 0 | 2 | Gambia | 2021-02-22 | 10/05/1976 | 2021-09-20 | 2 | 0.5749487 |
PS010P2 | day 0 | 2 | Gambia | 2021-03-02 | 09/23/1991 | 2021-02-26 | 1 | -0.0109514 |
PS011P2 | day 0 | 2 | Gambia | 2021-03-05 | 02/08/1991 | 2021-03-03 | 2 | -0.0054757 |
NA