{farrago}: an R Package of odds-and-ends

After working with R both professionally and through hobby projects, I have accumulated an assortment of coding ‘odds-and-ends’: snippets that I have found useful enough to collate and document. {farrago} is an R package serving as a personal collection of tools to assist with data workflows and analysis, with a focus on health surveillance and epidemiological data; however, it may have utility for other audiences as well.

The functions are organized by general operation type, such as calculations, conversions, content creation, data transfer, and plotting. As the package continues to grow, I anticipate that particular function themes will be unified under their own distinct packages. This will ensure that the number of dependencies is minimized, making each package simpler to maintain.

Functions that have been topics of previous posts (e.g. stow() and retrieve()) are included in {farrago}. Further development and fixes are routinely incorporated into the package (and are not typically deserving of additional blog spotlight). The package content continues to expand, and other functions in {farrago} may warrant attention in future blog posts. A few of particular interest are given a cursory introduction below…

File transfer

Uploading and downloading files via SFTP is a common operation and, in most instances, it is possible to automate these steps. When working on Windows, WinSCP is a common tool for this purpose, and its support for script/batch files helps streamline the process. It is a small leap to have R write these batch files and pass them to a pre-established WinSCP configuration. After the initial setup in WinSCP, {farrago} can upload to and download from these connections through the transfer_winscp() function:

drop_location <- 'C:/PATH/TO/DESIRED/FOLDER/'

transfer_winscp(file = 'my_rmt_file.csv',
                direction = 'download',
                connection = 'sftp://myusername:mypwd@hostlocation.ca/',
                rmt_path = './location/',
                drop_location = drop_location)

my_transferred_file <- read.csv(paste0(drop_location, 'my_rmt_file.csv'))
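Under the hood, the idea is roughly the following: write a WinSCP script file from R and hand it to the WinSCP command-line client. This is a sketch of the general approach, not the function's actual internals; the script path and the assumption that winscp.com is on the system PATH (and that the host key has already been accepted in WinSCP) are mine.

```r
# Sketch: build a WinSCP script file and run it (assumes winscp.com is on
# the PATH and the host key was accepted during the initial WinSCP setup)
script <- c(
  'open sftp://myusername:mypwd@hostlocation.ca/',        # connect
  'get ./location/my_rmt_file.csv C:\\PATH\\TO\\DESIRED\\FOLDER\\',  # download
  'exit'
)
writeLines(script, 'transfer_script.txt')
system2('winscp.com', '/script=transfer_script.txt')
```

Wrapping this pattern in a function is essentially what makes the transfer reproducible and schedulable.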

Pesky dates

When working with data it is hard to escape dates, and in several instances they are frustrating to deal with.

One such instance is determining flu weeks and seasons each year…

# Sample date list for flu seasons
date_list <- c('2022-01-01',
               '2021-01-01',
               '2020-08-30',
               '2020-09-01',
               '2020-09-23',
               '2020-01-01',
               '2019-12-31',
               '2018-01-01', 
               '2017-01-01', 
               '2016-01-01')

# Return just the week and season category
convert_date2fluseason(date_list, return_values = c('week', 'season'))
## $week
##  [1] 52  1 36 36 39  1  1  1  1  1
## 
## $season
##  [1] "2021-2022" "2020-2021" "2020-2021" "2020-2021" "2020-2021" "2019-2020"
##  [7] "2019-2020" "2017-2018" "2016-2017" "2015-2016"
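The season assignment can be approximated with {lubridate}'s epidemiological week helpers. The sketch below is my own rough reconstruction, assuming CDC epi weeks and a season rollover at week 36; the actual convert_date2fluseason() may handle edge weeks around the new year differently (as the week output above suggests).

```r
library(lubridate)

# Rough approximation of the season logic (an assumption, not farrago's code):
# dates from epi week 36 onward belong to the season starting that epi year
approx_flu_season <- function(dates) {
  dates <- as.Date(dates)
  wk <- epiweek(dates)
  yr <- epiyear(dates)
  start <- ifelse(wk >= 36, yr, yr - 1)
  paste(start, start + 1, sep = "-")
}

approx_flu_season(c('2020-08-30', '2020-01-01'))
# [1] "2020-2021" "2019-2020"
```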

Another example is determining holidays, which can be useful to schedule reports…

# Find Labour Day (first Monday in September)
list_weekdays(2021, 9, 'Monday')[1]
## [1] "2021-09-06"
# Find the observed holiday for Christmas Day
find_observedday(2021, month = 12, day = 25)
## [1] "2021-12-27"
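The observed-holiday rule can be sketched in a few lines. The rule below (Saturday holidays observed the following Monday, Sunday holidays the next day) is an assumption on my part that matches the example above; some jurisdictions shift Saturday holidays back to Friday instead.

```r
# Sketch of a common observed-holiday rule (an assumption, not farrago's
# documented logic): Saturday -> following Monday, Sunday -> next day
approx_observed <- function(year, month, day) {
  d <- as.Date(sprintf('%d-%02d-%02d', year, month, day))
  wd <- as.POSIXlt(d)$wday  # 0 = Sunday, 6 = Saturday
  if (wd == 6) d + 2 else if (wd == 0) d + 1 else d
}

approx_observed(2021, 12, 25)
# [1] "2021-12-27"
```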

Cohorts

Several operations are useful for working with cohorts but can be difficult to code efficiently in R, especially if they require for loops. This includes identifying overlapping time-frames, collapsing cohort entry/exit by a threshold, and assigning episode periods across repeat observations.

test_data <- tibble::tribble(~grp_id, ~date_start, ~date_end,
                      1, '2020-01-01', '2020-01-02',
                      1, '2020-01-03', '2020-01-04',
                      1, '2020-01-04', '2020-09-02',
                      2, '2020-01-01', '2020-09-02',
                      2, '2020-09-10', '2020-09-20',
                      2, '2020-09-21', '2020-09-22',
                      3, '2020-01-01', '2020-01-02',
                      3, '2020-01-02', '2020-01-20',
                      3, '2020-01-21', '2020-01-22',
                      3, '2020-01-22', '2020-04-02',
                      3, '2020-04-22', '2021-04-22',
                      3, '2021-06-09', '2021-06-22') %>%
   dplyr::mutate_at(dplyr::vars(contains('date')), lubridate::ymd)

# Assign an incremental id to overlapping rows (i.e. a row whose date_start
# is on or before the previous date_end shares that row's id)
test_data$overlap_id <- identify_overlap(data = test_data,
                                         grp_id = grp_id,
                                         date_start = date_start,
                                         date_end = date_end)

# Collapse rows into one group when the gap between date_end and the next date_start is at most 1 day
test_data$collapsed_grp <- collapse_timesteps(data = test_data,
                                              grp_id = grp_id,
                                              date_start = date_start,
                                              date_end = date_end,
                                              threshold = 1)

# Create episode period based upon start dates and a threshold of 10
test_data$episode_period <- assign_episode(data = test_data,
                                           grp_id = grp_id,
                                           date = date_start,
                                           threshold = 10)

knitr::kable(test_data) %>% kableExtra::scroll_box(width = "100%", box_css = 'padding: 5px;')
grp_id  date_start  date_end    overlap_id  collapsed_grp  episode_period
     1  2020-01-01  2020-01-02           0              1               1
     1  2020-01-03  2020-01-04           1              1               1
     1  2020-01-04  2020-09-02           1              1               1
     2  2020-01-01  2020-09-02           0              1               1
     2  2020-09-10  2020-09-20           1              2               2
     2  2020-09-21  2020-09-22           2              2               3
     3  2020-01-01  2020-01-02           0              1               1
     3  2020-01-02  2020-01-20           0              1               1
     3  2020-01-21  2020-01-22           1              1               2
     3  2020-01-22  2020-04-02           1              1               2
     3  2020-04-22  2021-04-22           2              2               3
     3  2021-06-09  2021-06-22           3              3               4
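For readers curious how such operations avoid for loops: the collapse step can be written with the classic lag()/cumsum() idiom in {dplyr}. This is a sketch of the general technique, not the package's actual internals, assuming rows are sorted within each group and a gap threshold of 1 day.

```r
library(dplyr)

# Start a new group whenever the gap from the previous row's end exceeds
# the threshold; cumsum() of that flag yields an incrementing group id
test_data %>%
  group_by(grp_id) %>%
  arrange(date_start, .by_group = TRUE) %>%
  mutate(gap = as.numeric(date_start - lag(date_end, default = first(date_start))),
         grp = cumsum(gap > 1) + 1) %>%
  ungroup()
```

With these data this reproduces the collapsed_grp column above; truly overlapping intervals (where date_end is not monotone) may additionally need a running maximum of date_end.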

Create and summarise groups

Although this can be achieved easily with packages like {dplyr}, the create_hypercube() function makes it simple to create summarised tables that include total rows for the provided grouping columns.

create_hypercube(mtcars, columns = c('gear', 'cyl'), drop_sum_columns = 'cyl')
## # A tibble: 12 x 3
##    gear    cyl     n
##    <chr> <dbl> <dbl>
##  1 3         4     1
##  2 4         4     8
##  3 5         4     2
##  4 Sum       4    11
##  5 3         6     2
##  6 4         6     4
##  7 5         6     1
##  8 Sum       6     7
##  9 3         8    12
## 10 4         8     0
## 11 5         8     2
## 12 Sum       8    14
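For reference, roughly the same table can be assembled with {dplyr} and {tidyr} alone. This is a sketch of the equivalent logic, not what create_hypercube() does internally; row ordering and column types may differ from the output above.

```r
library(dplyr)
library(tidyr)

# Count every gear/cyl combination, fill absent ones with 0,
# then append a 'Sum' row within each cyl group
mtcars %>%
  count(gear = as.character(gear), cyl) %>%
  complete(gear, cyl, fill = list(n = 0)) %>%
  group_by(cyl) %>%
  group_modify(~ tibble::add_row(.x, gear = 'Sum', n = sum(.x$n))) %>%
  ungroup()
```

The convenience of the packaged function is mainly in not having to rebuild this boilerplate for every combination of grouping columns.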

When working with ages or binning numeric values into discrete categories, it is convenient to have automatic labeling: just assign the breaks and let create_breaks() do the rest.

data <- data.frame(val = c(-1, 0, 10, 5, 999, 9))

breaks <- c(0, 1, 10, 50, 100)

data$cat <- create_breaks(data$val, breaks, TRUE)

data
##   val   cat
## 1  -1    <0
## 2   0     0
## 3  10 10-49
## 4   5   1-9
## 5 999 >=100
## 6   9   1-9
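For comparison, base R's cut() covers similar ground but requires spelling out the labels and the open-ended intervals by hand; the automatic labeling is what create_breaks() saves you. A sketch producing the same categories as above:

```r
# Manual equivalent with cut(): -Inf/Inf catch values outside the breaks,
# and right = FALSE makes intervals closed on the left, e.g. [10, 50)
cut(c(-1, 0, 10, 5, 999, 9),
    breaks = c(-Inf, 0, 1, 10, 50, 100, Inf),
    labels = c('<0', '0', '1-9', '10-49', '50-99', '>=100'),
    right = FALSE)
```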

If you have used SAS, you may be familiar with multilabel format summaries. In {farrago}, create_multilevel_factor() can simulate this behavior and create a dataset with combined category levels. Below is an example of how overlapping age groups can be added and summarised.

# Sample data with groups and conditions
example_data <- tibble::tibble(group = c('Area1', 'Area1', 'Area1', 'Area2', 'Area3', 'Area3'),
                               condition = factor(c('<5', '5-10', '11-20', '<5', '5-10', '20+'), ordered = FALSE))

# Create dataset with additional levels defined
new_data <- create_multilevel_factor(example_data,
                         target_col = 'condition',
                         new_levels = list('NewGrp:0-10' = c('<5', '5-10'), 'NewGrp:5-20' = c('11-20', '5-10')), # Define combined categories
                         collapse = FALSE,
                         track = TRUE)

# Summarise the multi-level grouping by their condition
tmptbl <- addmargins(table(new_data$group, new_data$condition), margin = 1)
knitr::kable(tmptbl)
        <5  11-20  20+  5-10  NewGrp:0-10  NewGrp:5-20
Area1    1      1    0     1            2            2
Area2    1      0    0     0            1            0
Area3    0      0    1     1            1            1
Sum      2      1    1     2            4            3

Wrap-up

If the examples above have kept your attention, I encourage you to explore the package and its documentation more thoroughly on the repository. Due to the scope of the package, it is simply hosted on GitHub.

Allen O'Brien
Infectious Disease Epidemiologist

I am an epidemiologist with a passion for teaching and all things data.
