---
title: "Introduction to epidemiological case definitions with diyar"
date: "`r format(Sys.Date(), '%d %B %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to epidemiological case definitions with diyar}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r include=FALSE}
library(diyar)
```
In epidemiological analyses, an exact distinction between temporal events is not always possible. 
Therefore, static but reasonable cut-offs are used to distinguish one occurrence of an event from another. 
This is an important aspect of most case definitions. 
An example is distinguishing repeat or recurrent infections from the first occurrence of that infection.

This vignette provides a brief introduction to how this is done with `episodes()`.

We begin by reviewing `event_dt` below. 
This is a sample dataset of 11 infection events and includes data on the location and type of infection.
```{r warning=FALSE}
event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
event_dt <- data.frame(date = event_dt)
event_dt$attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
event_dt$attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2", 
                     "Ward 1", "Ward 1", "Ward 3","Ward 3", "Ward 2")

event_dt
```

The simplest case definitions are those where only one event is retained, making every other event a duplicate `(D)` of the index event `(C)`.
```{r warning=FALSE}
event_dt$ep1 <- episodes(event_dt$date, episode_type = "fixed")
event_dt
```
As seen above, we can use `episodes()` to create a group identifier unique to each occurrence.
Each group is referred to as an episode.
The identifier is stored as an `S4` (`epid`) class with useful information about each group in its slots. This information can be accessed with `as.data.frame()` or `as.list()`.

We can build on this case definition by adding a temporal boundary within which events are considered part of the same occurrence.
For example, we can specify that only events within 4 days (3-day difference) of an index event are considered part of the same occurrence.
```{r warning=FALSE}
event_dt$ep2 <- episodes(event_dt$date, 3, episode_type = "fixed")
event_dt
```
We can see that every event after the fourth day has now been assigned to a different episode, each with its own index event. This type of episode is referred to as a `"fixed"` episode.
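
The same grouping logic can be illustrated without `diyar`. The chunk below is a minimal base R sketch, not the package's implementation. `sketch_fixed()` is a hypothetical helper that assumes the dates are sorted and that each new index event is the first date falling outside the previous window.

```{r warning=FALSE}
# Hypothetical helper: assign fixed-window group numbers to sorted dates
sketch_fixed <- function(dates, len) {
  ids <- integer(length(dates))
  current_id <- 0L
  index_dt <- as.Date(NA)
  for (i in seq_along(dates)) {
    # Start a new group when there is no index yet,
    # or when the date is more than `len` days after the current index
    if (is.na(index_dt) || as.numeric(dates[i] - index_dt) > len) {
      current_id <- current_id + 1L
      index_dt <- dates[i]
    }
    ids[i] <- current_id
  }
  ids
}

sketch_fixed(seq(as.Date("2021-01-01"), as.Date("2021-01-11"), by = 1), len = 3)
```

For 11 consecutive days and a length of `3`, this sketch starts a new group on the fifth and ninth days.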

`episodes()` can also produce `"rolling"` (recurring) episodes. See examples of each below.
```{r warning=FALSE}
event_dt$ep3 <- episodes(event_dt$date, 3, episode_type = "rolling")
event_dt$ep4 <- episodes(event_dt$date, 3, 
                         episode_type = "rolling", reference_event = "all_record")
event_dt
```

Here, we see the creation (or rather categorisation) of recurrent events `(R)`.
These are recurrences of index events, as opposed to their duplicates. For example, `ep3` initially ends on `2021-01-04`. However, for rolling and recursive episodes, the function takes the additional step of looking for other events within 4 days of `2021-01-04`. Records within this period are then categorised as recurrent events of the `2021-01-04` event, which itself is a duplicate of the `2021-01-01` event. This chaining process continues until there are no more events within the period of recurrence or it is stopped by the `rolls_max` argument.
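
The effect of chaining on group membership can be sketched in base R. This is a simplification rather than the package's algorithm: it assumes sorted dates and the same length for the case and recurrence windows, in which case a group only ends when the gap between consecutive events exceeds that length. `sketch_rolling()` and the dates used are hypothetical, and the sketch does not label events as cases, duplicates or recurrences.

```{r warning=FALSE}
# Hypothetical helper: a new group starts whenever the gap to the
# previous event is larger than `len` days
sketch_rolling <- function(dates, len) {
  gaps <- as.numeric(diff(dates))
  cumsum(c(TRUE, gaps > len))
}

# Four events two days apart, then a gap of eight days
sketch_gapped <- as.Date("2021-01-01") + c(0, 2, 4, 6, 14, 16)
sketch_rolling(sketch_gapped, len = 3)
```

Here the first four events form one group even though the fourth is 6 days after the first, because each is within 3 days of the one before it.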

Even though `ep1` has the same outcome (one episode) as `ep3` and `ep4`, they are very different case definitions, so the structure of their episodes differs.
This is demonstrated below by using `schema()` to visualise the identifiers.

```{r warning=FALSE, fig.width=7, message=FALSE}
schema(event_dt$ep1, seed = 2, show_labels = c("case_nm", "length_arrow"))
schema(event_dt$ep3, seed = 2, show_labels = c("case_nm", "length_arrow"))
schema(event_dt$ep4, seed = 2, show_labels = c("case_nm", "length_arrow"))
```

You may have noticed that this process starts from the earliest event and proceeds forward in time.
If the `from_last` argument is set to `TRUE`, the process will happen in the opposite direction.
```{r warning=FALSE, fig.width=7}
event_dt$ep5 <- episodes(event_dt$date, 3, episode_type = "rolling", from_last = TRUE)
event_dt
```
We can see that in `ep5` the last record is now the index of the episode.

So far, we've discussed the different types of episodes and the direction of time in which they are tracked.
The next important part of the tracking process is the selection of index events. 
This determines which record is used as the reference point when checking for other events within its window of occurrence.
By default, this will be the earliest record from the direction of the tracking process, but this can be changed to a custom preference using the `custom_sort` argument.

To demonstrate this, let's preferentially use `"RTI"` infection events as index events for episodes.

```{r warning=FALSE, fig.width=7}
custom_pref <- ifelse(event_dt$attr_1 == "RTI", 1, 2)
event_dt$ep6 <- episodes(event_dt$date, 3, custom_sort = custom_pref, 
                         skip_order = 1)
event_dt$ep7 <- episodes(event_dt$date, 3, custom_sort = custom_pref, 
                         skip_order = 1, from_last = TRUE)
```

```{r warning=FALSE, fig.width=7}
schema(event_dt$ep6, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep6@case_nm), " (", event_dt$attr_1, ")"))
schema(event_dt$ep7, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep7@case_nm), " (", event_dt$attr_1, ")"))
```
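
The selection of an index event by preference can be sketched in base R as an ordering problem. This is an illustration under simple assumptions, not the internals of `episodes()`: the record with the lowest preference value is picked first, with ties broken by date. All `sketch_*` names are hypothetical and the values mirror `event_dt`.

```{r warning=FALSE}
sketch_dates <- as.Date("2021-01-01") + 0:10
sketch_type <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI",
                 "BSI", "RTI", "RTI", "BSI", "RTI")
sketch_pref <- ifelse(sketch_type == "RTI", 1, 2)

# Lowest preference value first, then earliest date
sketch_index <- order(sketch_pref, sketch_dates)[1]
sketch_dates[sketch_index]

# Records 0 to 3 days after the index, in the direction of tracking
sketch_offset <- as.numeric(sketch_dates - sketch_dates[sketch_index])
sketch_dates[sketch_offset >= 0 & sketch_offset <= 3]
```

In this sketch the first index is the earliest `"RTI"` record, so the two records before it fall outside its forward-looking window.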

Another main aspect of the tracking process is the duration between events.
This is handled by the `case_length` and `recurrence_length` arguments.
A length of `3` means that events within 4 days (3-day difference) of the index event will be linked together as part of the same occurrence.
Note that this is 4 days in the direction of the tracking process. Therefore, events in the opposite direction will not be considered. We can see this above in the schema for `ep6`.

If records in both directions need to be considered, then a range can be provided as shown with `ep8`.
Additionally, multiple periods from the index event can be specified by providing a list of lengths or ranges as shown in `ep9`.

```{r warning=FALSE, fig.width=7}
event_dt$ep8 <- episodes(event_dt$date, number_line(-3, 3), episode_type = "fixed", 
                         custom_sort = custom_pref, skip_order = 1)
event_dt$ep9 <- episodes(event_dt$date, list(number_line(-2, -1),
                                             number_line(4, 5),
                                             number_line(7, 9)), episode_type = "fixed", 
                         custom_sort = custom_pref, 
                         skip_order = 1, 
                         skip_if_b4_lengths = TRUE)
```

```{r warning=FALSE, fig.width=7}
schema(event_dt$ep8, seed = 2, show_labels = c("length_arrow"), 
       custom_label = paste0(decode(event_dt$ep8@case_nm), " (", event_dt$attr_1, ")"))
schema(event_dt$ep9, seed = 2, show_labels = c("length_arrow", "length"))
```
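
How a length, a range and several ranges translate into dates can be sketched with offsets from an index event. The index date and the helper `sketch_in_range()` are hypothetical, and this is not how `number_line` objects are handled internally.

```{r warning=FALSE}
sketch_days <- as.Date("2021-01-01") + 0:10
sketch_idx_dt <- as.Date("2021-01-05")
# Day difference between each record and the index event
sketch_diff <- as.numeric(sketch_days - sketch_idx_dt)
sketch_in_range <- function(x, lower, upper) x >= lower & x <= upper

# A single length of 3: the index date and the 3 days after it
sketch_days[sketch_in_range(sketch_diff, 0, 3)]
# A range of -3 to 3: 3 days either side of the index date
sketch_days[sketch_in_range(sketch_diff, -3, 3)]
# Several ranges: a date qualifies if it falls in any of them
sketch_days[sketch_in_range(sketch_diff, -2, -1) | sketch_in_range(sketch_diff, 4, 5)]
```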

It's possible to track episodes among events with known start and end points, instead of those occurring at a single point in time. It is worth noting that the reference point for such events is their end point. Therefore, using a `case_length` of `1` when working with such events is interpreted as 2 days (1-day difference) after the end point of the interval. By default, other events that occur on or after the start point but before the end point of the index event will not be captured. If this is required, use `index_window()` to get the range required to capture such events, and supply it to the `case_length` or `recurrence_length` arguments. See the examples below.
```{r warning=FALSE}
# Dummy data of hospital stays
data(hospital_admissions)
dfr <- hospital_admissions[c("admin_dt", "discharge_dt")]
dfr$admin_period <- number_line(dfr$admin_dt, dfr$discharge_dt)
                                  
# Group overlapping hospital stays
dfr$ep_len1 <- index_window(dfr$admin_period)
dfr$ep1 <- episodes(date = dfr$admin_period, case_length = dfr$ep_len1)

# Group overlapping hospital stays and those within 21 days of the end point of an index hospital stay 
dfr$ep_len2 <- expand_number_line(index_window(dfr$admin_period), 20, "right")
dfr$ep2 <- episodes(date = dfr$admin_period, case_length =  dfr$ep_len2)
```

```{r warning=FALSE, fig.width=7}
dfr[c("admin_period", "ep_len1", "ep_len2", "ep1", "ep2")]
schema(dfr$ep1, seed = 2, show_labels = "length_arrow")
schema(dfr$ep2, seed = 2, show_labels = "length_arrow")
```
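
The role of the end point as the reference can be sketched with plain date arithmetic. The chunk below is an assumption-laden illustration, not the package's logic: it uses one hypothetical hospital stay, treats the other events as single dates, and expresses each window as day differences from the end point of the stay.

```{r warning=FALSE}
sketch_start <- as.Date("2021-01-01")
sketch_end <- as.Date("2021-01-05")
sketch_other <- as.Date(c("2021-01-03", "2021-01-06", "2021-01-20", "2021-02-01"))

# Day difference between each event and the end point of the stay
sketch_from_end <- as.numeric(sketch_other - sketch_end)

# A length of 1: only the end point and the day after it
sketch_from_end >= 0 & sketch_from_end <= 1

# A range reaching back to the start point captures events during the stay
sketch_lower <- -as.numeric(sketch_end - sketch_start)
sketch_from_end >= sketch_lower & sketch_from_end <= 0

# Extending that range 20 days to the right adds events shortly after the stay
sketch_from_end >= sketch_lower & sketch_from_end <= 20
```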

Currently, the `diyar` package recognises 7 mutually exclusive and 2 mutually inclusive ways a pair of intervals can overlap. Please see `vignette("number_line")` for more details about this. Using `case_overlap_methods` and `recurrence_overlap_methods`, the episode tracking process can be made to only capture intervals that overlap by specific methods. See the example below.

```{r warning=FALSE}
# Wrapper function for a fixed episode
episodes_wf <- function(x){
  epids <- episodes(date = dfr$admin_period, 
                    sn = hospital_admissions$rd_id, 
                    case_length = index_window(dfr$admin_period), 
                    case_overlap_methods = x)
  return(epids)
}

# Methods 
methods <- list(
  # Identical intervals
  exact = "exact",
  # Intervals with their start or end points within another
  across = "across",
  # Intervals with aligned start points
  aligns_start = "aligns_start",
  # Intervals with aligned end points
  aligns_end = "aligns_end",
  # Intervals with start points that align with the end point of another, and vice versa
  chain = "chain",
  # Intervals occurring completely within others
  inbetween = "inbetween",
  # A combination of `chain` and `inbetween` methods
  cb1 = "chain|inbetween",
  # A combination of `exact`, `chain` and `inbetween` methods
  cb2 = "exact|chain|inbetween",
  # A combination of `across`, `chain` and `aligns_end` methods
  cb3 = "across|chain|aligns_end"
)

epids <- lapply(methods, episodes_wf)
names(epids) <- methods

# Use `schema()` to visualise each.
epids
```
See `overlap_methods$options` for a list of supported options.
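
To make the distinctions concrete, the chunk below classifies a pair of intervals using plain comparisons of their start and end points. `sketch_overlap()` is a hypothetical helper written for this illustration. It covers only the six single methods listed above, assumes each interval's start is not after its end, and is not the package's own logic.

```{r warning=FALSE}
# Hypothetical helper: s1/e1 and s2/e2 are the start and end points of two intervals
sketch_overlap <- function(s1, e1, s2, e2) {
  if (s1 == s2 && e1 == e2) return("exact")
  # No shared points at all
  if (e1 < s2 || e2 < s1) return("none")
  # The end point of one is the start point of the other
  if (e1 == s2 || e2 == s1) return("chain")
  if (s1 == s2) return("aligns_start")
  if (e1 == e2) return("aligns_end")
  # One interval lies completely within the other
  if ((s1 < s2 && e2 < e1) || (s2 < s1 && e1 < e2)) return("inbetween")
  # Otherwise only one end of an interval falls within the other
  "across"
}

sketch_overlap(1, 5, 1, 5)
sketch_overlap(1, 5, 5, 9)
sketch_overlap(1, 9, 3, 5)
sketch_overlap(1, 5, 3, 9)
```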

Additional matching criteria (separate from temporal links) can be implemented by passing a `sub_criteria` object to the `case_sub_criteria` and `recurrence_sub_criteria` arguments. See `vignette("links")` for more details about `sub_criteria` objects. In the example below, we specify that only index events that are an RTI or occurred in Ward 2 can start an episode.

```{r warning=FALSE}
match_funx1 <- function(x, y) y == "RTI"
match_funx2 <- function(x, y) y == "Ward 2"
# Sub-criteria 1 - Matching source of infection OR patient location
sub_cri_1 <- sub_criteria(event_dt$attr_1, event_dt$attr_2, 
                          operator = "or", 
                          match_funcs = c(match_funx1, match_funx2))

event_dt$ep10 <- episodes(event_dt$date, case_length = 5, case_sub_criteria = sub_cri_1)
```

```{r warning=FALSE, fig.width=7}
schema(event_dt$ep10, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))
```

Alternatively, we can specify that BSI events or those that occurred in Ward 1 should not be linked to any episode. Note that such events can still start episodes unless this is also restricted through `y`, as was done for `event_dt$ep10`.
```{r warning=FALSE}
match_funx3 <- function(x, y) !(x$attr_1 == "BSI" | x$attr_2 == "Ward 1")
equal_funx <- function(x, y) TRUE
# Sub-criteria 2 - Exclude BSI events and events that occurred in Ward 1
sub_cri_2 <- sub_criteria(attrs(attr_1 = event_dt$attr_1, 
                                attr_2 = event_dt$attr_2), 
                          operator = "and", 
                          match_funcs = match_funx3, 
                          equal_funcs = equal_funx)

event_dt$ep11 <- episodes(event_dt$date, case_length = 5, case_sub_criteria = sub_cri_2)
```

```{r warning=FALSE, fig.width=7}
schema(event_dt$ep11, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))
```

We can even combine both conditions into a nested condition.
```{r warning=FALSE}
combined_sub_cri <- sub_criteria(sub_cri_1, sub_cri_2, operator = "and")
event_dt$ep12 <- episodes(event_dt$date, case_length = 5, 
                          case_sub_criteria = combined_sub_cri)
```

```{r warning=FALSE, fig.width=7}
schema(event_dt$ep12, seed = 2, show_labels = c("length_arrow"),
       custom_label = paste0(event_dt$attr_1, " in ", event_dt$attr_2))
```

A `sub_criteria` can be very useful when complex rules are required as part of the case definition.

Finally, everything we've just discussed can be done separately for different subsets of the dataset by using the `strata` argument. `data_source` lets us keep track of the source of records in each episode. `data_links` places limits on episode creation based on their `data_source`. The `group_stats` argument creates useful information for each episode, which is stored in the `epid` object. To demonstrate these, we'll modify our first example. 

```{r warning=FALSE}
event_dt$ep13 <- episodes(event_dt$date, episode_type = "fixed", 
                          strata = event_dt$attr_2, data_source = event_dt$attr_1, 
                          group_stats = TRUE)

as.data.frame(event_dt$ep13)
```
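
The effect of `strata` can be pictured as running the same process independently within each subset. The base R sketch below makes that assumption explicit for the simplest case definition, where the earliest record in each stratum is the index and all others are duplicates. The data and `sketch_*` names are hypothetical, and the labels are not the package's own case categories.

```{r warning=FALSE}
sketch_df <- data.frame(
  date = as.Date("2021-01-01") + 0:5,
  ward = c("Ward 1", "Ward 2", "Ward 1", "Ward 3", "Ward 2", "Ward 3")
)

# One group per stratum
sketch_df$group <- match(sketch_df$ward, unique(sketch_df$ward))

# The earliest record in each stratum is treated as the index
sketch_first <- ave(as.numeric(sketch_df$date), sketch_df$ward, FUN = min)
sketch_df$role <- ifelse(as.numeric(sketch_df$date) == sketch_first,
                         "index", "duplicate")
sketch_df
```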