
Migrating to botor

The new version of RStudio on the Analytical Platform uses botor to access our Amazon S3 data storage, replacing s3tools. This guidance gives some hints on how to get going with botor and migrate code that uses s3tools.

This guidance assumes you have experience with renv. If you require a recap or guidance, please see either the renv documentation from the platform guidance or the official renv documentation.


Installation

Using Analytical Tools [RStudio 4.1.7] (RStudio 4.0.5 or later):

# install renv if it doesn't exist on your system
if(!"renv" %in% installed.packages()[, "Package"]) install.packages("renv") 
# Remove bare = TRUE if you'd like to move your existing packages over to 
# renv. This is not a good idea if you're migrating from s3tools as
# renv will attempt to install that library.
renv::init(bare = TRUE) 
# Tell renv to use Python and set up a virtual environment
renv::use_python()
# Install reticulate so we can make calls to Python libraries, required by
# botor
renv::install('reticulate')
# Install the Python library, boto3, used by botor to access S3 
reticulate::py_install('boto3')
# Install botor itself
renv::install('botor')

If you get an error

Error in main_process_python_info() : 
  function 'Rcpp_precious_remove' not provided by package 'Rcpp'

then install a later version of Rcpp using

renv::install('Rcpp@1.0.7')

and restart your R session.

On earlier test versions it’s not quite that simple:

First open your project, and in the console run

# install renv if it doesn't exist on your system
if(!"renv" %in% installed.packages()[, "Package"]) install.packages("renv") 
# Initialise the renv environment. Remove bare = TRUE if you'd like to move 
# your existing packages over to renv, but keep it set to TRUE if you're 
# already using botor, dbtools or reticulate otherwise it will point to the 
# wrong or a non-existent Python.
renv::init(bare = TRUE)

Then in the terminal run

python3 -m venv renv/venv --without-pip --system-site-packages

to create a Python virtual environment.

Finally, in the RStudio console run the remaining lines:

renv::use_python('renv/venv/bin/python')
renv::install('reticulate')
reticulate::py_install('boto3')
renv::install('botor')

If this process goes wrong, run

renv::deactivate()

in the R console, then

rm -rf renv renv.lock .Rprofile requirements.txt

in the terminal, restart your R session, and start again.

You should now be able to use library(botor) as usual, and renv::snapshot() to lock the R and Python library versions for recreation by collaborators or within a deployment.

For more on renv see the documentation, particularly on Using Python with renv. The reticulate documentation is also likely to be useful.
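As a quick sanity check that the wiring worked (a sketch, assuming the installation steps above completed), you can ask reticulate whether it can see the boto3 module before loading botor:

```r
# Confirm reticulate's Python environment exposes boto3;
# py_module_available() returns TRUE if the module can be imported
library(reticulate)
py_module_available("boto3")
```

If this returns FALSE, re-run reticulate::py_install('boto3') and check which Python renv is pointing at with reticulate::py_config().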

Usage

Reading data

Reading from S3 is handled by the s3_read function. This takes a full S3 path, which must include the prefix s3://, and a read function as parameters. For example, to read a csv file run

# For a dataframe,
df <- s3_read("s3://alpha-mybucket/my_data.csv", read.csv)
# for a tibble,
t <- s3_read("s3://alpha-mybucket/my_data.csv", readr::read_csv)
# and for a data.table
dt <- s3_read("s3://alpha-mybucket/my_data.csv", data.table::fread)

Other formats can be handled in a similar fashion.

sas_data <- s3_read("s3://alpha-mybucket/my_data.sas7bdat", haven::read_sas)
feather_data <- s3_read("s3://alpha-mybucket/my_data.feather", 
                        feather::read_feather)
json_data <- s3_read('s3://botor/example-data/mtcars.json', jsonlite::fromJSON)

See ?botor::s3_read for additional options.

openxlsx::read.xlsx will only open files with the .xlsx suffix, but s3_read uses a temporary file without a suffix. To get around this, either use readxl::read_excel or use the following wrapper for openxlsx:

s3_read_xlsx <- function(s3_path, ...) {
    temp_loc <- tempfile(fileext=".xlsx")
    botor::s3_download_file(s3_path, temp_loc)
    wb <- openxlsx::read.xlsx(temp_loc, ...)
    unlink(temp_loc)
    return(wb)
}
s3_read_xlsx("s3://alpha-mybucket/my_data.xlsx")

For large data files there’s a handy option to uncompress compressed files.

df <- s3_read("s3://alpha-mybucket/my_data.csv.gz", read.csv, 
              extract = "gzip")

To download a file to the local filesystem run

s3_download_file("s3://alpha-mybucket/my_data.csv", "my_local_data.csv")

Writing data

Similarly, writing to S3 is handled by the s3_write function. This takes as parameters an R object, a write function (whose file argument gives the path to write the data to), and a full S3 path. For example,

s3_write(df, readr::write_csv, "s3://alpha-mybucket/my_data.csv")
s3_write(df, haven::write_sas, "s3://alpha-mybucket/my_data.sas7bdat")
s3_write(df, openxlsx::write.xlsx, "s3://alpha-mybucket/my_data.xlsx")

This won’t work for functions such as feather::write_feather that use an alternative name for the file parameter. Instead, write to a temporary file and upload it:

s3_write_feather <- function(df, s3_path, ...) {
    temp_loc <- tempfile(fileext=".feather")
    feather::write_feather(df, temp_loc, ...)
    botor::s3_upload_file(temp_loc, s3_path)
    unlink(temp_loc)
}
s3_write_feather(df, "s3://alpha-mybucket/my_data.feather")

Data can also be compressed before writing.

s3_write(df, readr::write_csv, "s3://alpha-mybucket/my_data.csv.gz",
         compress = "gzip")

To upload a file from the local filesystem run

s3_upload_file("my_local_data.csv", "s3://alpha-mybucket/my_data.csv")

Other useful functions

s3_ls lists the objects matching an S3 path prefix, though the full bucket name must be present.

# List all the objects in a bucket
s3_ls("s3://alpha-mybucket")
# List all the objects in the bucket's data directory
s3_ls("s3://alpha-mybucket/data/")
# List all the objects in the bucket's data directory beginning with p
s3_ls("s3://alpha-mybucket/data/p")
# Won't work
s3_ls("s3://alpha-mybu")

This returns a data.frame; the uri column is probably the most useful, as it contains full paths that can be passed to s3_read.
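For example, the uri column can be filtered with base R before reading. The listing below is mocked for illustration (a real call needs bucket access); only the uri column is relied on, which s3_ls does return:

```r
# Mocked s3_ls() result for illustration; a real call would be
# listing <- botor::s3_ls("s3://alpha-mybucket/data/")
listing <- data.frame(
  uri = c("s3://alpha-mybucket/data/a.csv",
          "s3://alpha-mybucket/data/b.feather"),
  size = c(1024L, 2048L)
)
# Keep only the csv objects, ready to pass to botor::s3_read
csv_uris <- listing$uri[grepl("\\.csv$", listing$uri)]
csv_uris
```

Each element of csv_uris is a full s3:// path, so it can be fed straight into s3_read or lapply(csv_uris, botor::s3_read, fun = read.csv).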

s3_exists checks whether an object exists, which is useful because s3_write will overwrite an existing object unconditionally.

if (!s3_exists("s3://alpha-mybucket/my_data.csv")) {
    s3_write(df, readr::write_csv, "s3://alpha-mybucket/my_data.csv")
}

For further information consult the botor documentation.

Note: to turn off the debugging messages produced by the botor library, raise the logging threshold:

library(logger)
log_threshold(WARN, namespace = 'botor')

Migrating from s3tools

Replacement functions

This section contains functions that can be used within legacy code to replace s3tools calls. All of these functions need botor and its dependencies installed; please see Installation for guidance.

read_using

read_using <- function(FUN, s3_path, ...) {
  # add the s3:// prefix if the user omitted it
  s3_path <- paste0("s3://", gsub('^s3://', "", s3_path))
  # preserve the file extension, as some read functions require it
  file_ext <- paste0('.', tolower(tools::file_ext(s3_path)))
  # download to a temporary file and read from there
  tmp <- tempfile(fileext = file_ext)
  botor::s3_download_file(s3_path, tmp, force = TRUE)
  FUN(tmp, ...)
}
Examples
read_using(
  FUN = openxlsx::read.xlsx,
  s3_path = 's3://alpha-my_bucket/my_data.xlsx',
  startRow = 1,
  sheet = 2
)

read_using(FUN=readxl::read_excel, s3_path="alpha-test-team/mpg.xlsx")

s3_path_to_full_df

# if you are using a file with .gz, .bz or .xz extension, please use
# botor::s3_read directly
s3_path_to_full_df <- function(s3_path, ...) {
  # trim s3:// if included by the user
  s3_path <- paste0('s3://', gsub('^s3://', "", s3_path))
  # fileexts accepted by s3_read
  accepted_direct_fileext <- c('csv' = read.csv, 
                               'json' = jsonlite::fromJSON,
                               'jsonl' = jsonlite::stream_in,
                               'rds' = readRDS,
                               'sas7bdat' = haven::read_sas,
                               'sav' = haven::read_spss,
                               'dta' = haven::read_stata)
  # specify all other accepted filetypes
  excel_filepaths <- c('xlsx', 'xls', 'xlsm')
  accepted_fileext <- c(names(accepted_direct_fileext), excel_filepaths)
  fileext <- tolower(tools::file_ext(s3_path))
  # error if an unsupported filetype is entered
  if(!fileext %in% accepted_fileext) {
    stop(paste0("Invalid filetype entered. Please confirm that your file",
                " extension is one of the following: ", 
                paste0(accepted_fileext, collapse = ', '), ". \n ",
                "Alternatively, use botor directly to read in your file."))
  }
  # if the extension maps straight onto a reader, use s3_read to parse
  # the data
  if(fileext %in% names(accepted_direct_fileext)) {
    tryCatch({
      botor::s3_read(s3_path, fun = accepted_direct_fileext[[fileext]])
    },
    error = function(cond){
      stop("\nError, file cannot be parsed. \nYou either don't have access to this bucket, or are using an invalid s3_path argument (the s3_path you've entered needs correcting).")
    })
  } else {
    # Excel formats go via read_using, which preserves the file suffix
    tryCatch({
      read_using(FUN = readxl::read_excel, s3_path = s3_path, ...)
    },
    error = function(cond){
      stop("\nError, file cannot be parsed. \nYou either don't have access to this bucket, or are using an invalid s3_path argument (the s3_path you've entered needs correcting).")
    })
  }
}

Examples
s3_path_to_full_df('s3://alpha-everyone/mtcars_boto.csv')
s3_path_to_full_df(
    "alpha-everyone/my_excel_workbook.xlsx", 
    sheet = 2)
s3_path_to_full_df(
    "alpha-everyone/folder1/test.xlsm", 
    sheet = 1)

s3_path_to_preview_df

The libraries stringr and magrittr (for the %>% pipe) need to be installed.

s3_path_to_preview_df <- function(s3_path, ...) {
  s3_path <- stringr::str_replace(s3_path, "s3://", "")
  split_path <- stringr::str_split(s3_path, "/")[[1]]
  bucket <- split_path[1]
  key <- stringr::str_c(split_path[-1], collapse = "/")
  fext <- tolower(tools::file_ext(key))
  if (!(fext %in% c("csv", "tsv"))) {
    message(stringr::str_glue("Preview not supported for {fext} files"))
    NULL
  } else {
    tryCatch({
      # fetch only the first 12KB of the object, then keep the first 5 rows
      client <- botor::botor()$client("s3")
      obj <- client$get_object(Bucket = bucket, Key = key,
                               Range = "bytes=0-12000")
      obj$Body$read()$decode() %>%
        textConnection() %>%
        read.csv() %>%
        head(n = 5)
    },
    error = function(c) {
      message("Could not read ", s3_path)
      stop(c)
    })
  }
}
Examples
s3_path_to_preview_df('alpha-mybucket/my_data.csv')

download_file_from_s3

download_file_from_s3 <- function(s3_path, local_path, overwrite = FALSE) {
  # trim s3:// if included by the user and add it back in where required
  s3_path <- paste0("s3://", gsub('^s3://', "", s3_path))
  if (!(file.exists(local_path)) || overwrite) {
    # create any missing local directories in the path
    local_path_folders <- stringr::str_extract(local_path, ".*/")
    if(!is.na(local_path_folders)) {
      dir.create(local_path_folders, showWarnings = FALSE, recursive = TRUE)
    }
    # download the file
    tryCatch({
      botor::s3_download_file(s3_path, local_path, force = overwrite)
    },
    error = function(cond){
      stop("\nError, file cannot be found. \nYou either don't have access to this bucket, or are using an invalid s3_path argument (file does not exist).")
    })
  } else {
    stop(paste0("The file already exists locally and you didn't specify",
                " overwrite=TRUE"))
  }
}

Examples
download_file_from_s3("alpha-everyone/mtcars_boto.csv", 
                      "local_folder/mtcars_boto.csv", overwrite = TRUE)

write_df_to_csv_in_s3

write_df_to_csv_in_s3 <- function(df, s3_path, overwrite = FALSE,
                                  multipart = "unused", ...) { 
  # add errors
  if(!any(grepl('data.frame', class(df)))) {
    stop("df entered isn't a valid dataframe object")
  }
  if(tools::file_ext(s3_path) != 'csv') {
    stop("s3_path entered is either not a csv or is missing the .csv suffix")
  }
  # trim s3:// if included by the user - removed so we can supply both 
  # alpha-... and s3://alpha - and then add again
  s3_path <- paste0("s3://", gsub('^s3://', "", s3_path))
  if(!overwrite & botor::s3_exists(s3_path)) {
    stop("s3_path entered already exists and overwrite is FALSE")
  }
  # write csv
  botor::s3_write(df, fun = write.csv, uri = s3_path, ...)
}
Examples
write_df_to_csv_in_s3(
    df = mtcars, 
    s3_path = "alpha-everyone/mtcars_boto.csv"
)
write_df_to_csv_in_s3(
    df = mtcars, 
    s3_path = "alpha-everyone/mtcars_boto.csv", 
    row.names = FALSE
)

write_df_to_table_in_s3

write_df_to_table_in_s3 <- function(df, s3_path, overwrite = FALSE,
                                  multipart = "unused", ...) { 
  # add errors
  if(!any(grepl('data.frame', class(df)))) {
    stop("df entered isn't a valid dataframe object")
  }
  if(tolower(tools::file_ext(s3_path)) != 'csv') {
    stop("s3_path entered is either not a csv or is missing the .csv suffix")
  }
  # trim s3:// if included by the user - removed so we can supply both 
  # alpha-... and s3://alpha - and then add again
  s3_path <- paste0("s3://", gsub('^s3://', "", s3_path))
  if(!overwrite & botor::s3_exists(s3_path)) {
    stop("s3_path entered already exists and overwrite is FALSE")
  }
  # write csv
  botor::s3_write(df, fun = write.table, uri = s3_path, ...)
}
Examples
write_df_to_table_in_s3(
    df = mtcars, 
    s3_path = "alpha-everyone/mtcars_boto.csv"
)
write_df_to_table_in_s3(
    df = mtcars, 
    s3_path = "alpha-everyone/mtcars_boto.csv", 
    row.names = FALSE
)

write_file_to_s3

write_file_to_s3 <- function(local_file_path, s3_path, overwrite=FALSE, 
                             multipart = "unused") {
  # ensure s3:// is present if not already
  s3_path <- paste0("s3://", gsub("^s3://", "", s3_path))
  if (overwrite || !(botor::s3_exists(s3_path))) {
    tryCatch(
      botor::s3_upload_file(local_file_path, s3_path),
      error = function(c) {
        message(paste0("Could not upload ", local_file_path, " to ", s3_path),
                appendLF = TRUE)
        stop(c, appendLF = TRUE)
      }
    )

  } else {
    stop("File already exists and you haven't set overwrite = TRUE, stopping")
  }
}

Examples
write_file_to_s3("my_local_data.csv", "s3://alpha-mybucket/my_data.csv")

list_files_in_buckets

The libraries dplyr and purrr need to be installed.

list_files_in_buckets <- function(bucket_filter = NULL, prefix = NULL,
                                  path_only = FALSE, max = "unused") {
  if (is.null(bucket_filter)) {
    stop(paste0("You must provide one or more buckets e.g. ",
                "accessible_files_df('alpha-everyone')  This function will ",
                "list their contents"))
  }
  if(!is.character(bucket_filter)) {
    stop("Supplied bucket_filter is not of class: character")
  }
  if(!is.character(prefix)&!is.null(prefix)) {
    stop("Supplied prefix is not of class: character")
  }
  list_files_in_bucket <- function(bucket) {
    # trim s3:// if included by the user - removed so we can supply both
    # alpha-... and s3://alpha-...
    bucket <- gsub('^s3://', "", bucket)
    cols_to_keep <- c("key","last_modified","size","bucket_name")
    path_prefix <- (paste0('s3://', bucket, "/", prefix))
    list <- botor::s3_ls(path_prefix)
    if (is.null(list)) {
      warning(path_prefix, ' matches 0 files')
      return(list)
    }
    list <- list[,cols_to_keep]
    list["path"] <- paste(list$bucket_name, list$key, sep = '/')
    if(is.null(prefix)) {
      return(list)
    } else {
      return(list[grepl(prefix, list$key, ignore.case = TRUE),])
    }
  }
  file_list <- dplyr::bind_rows(purrr::map(bucket_filter, 
                                           list_files_in_bucket))
  if(is.numeric(max)) file_list <- head(file_list, max)
  if (path_only) return(file_list$path)
  file_list
}
Examples
list_files_in_buckets(bucket_filter = "alpha-hmpps-covid-data-processing", 
                      prefix = 'BASS.csv')
list_files_in_buckets(bucket_filter = "alpha-hmpps-covid-data-processing", 
                      prefix = 'deaths')
# Type in the full string you want to match...
list_files_in_buckets(bucket_filter = "alpha-hmpps-covid-data-processing", 
                      prefix = 'fatalities') 
# ... or match the start of the string, so a shorter string will work
list_files_in_buckets(bucket_filter = "alpha-hmpps-covid-data-processing", 
                      prefix = 'fat') 

botor examples

Some of the replacement functions above are somewhat verbose to maintain compatibility with s3tools, so you may prefer to convert your code to use botor directly. This section attempts to provide botor equivalents for each of the examples on the s3tools homepage, which will hopefully help with converting existing code.

Which buckets do I have access to?

s3tools::accessible_buckets()

There is no obvious way to achieve this with botor; s3_list_buckets is the closest, listing all buckets in the organisation rather than only those you can access.
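A rough sketch of the nearest workaround (an assumption, not an exact equivalent: it relies on s3_ls erroring when you lack read permission on a bucket):

```r
# List every bucket in the organisation (not just those you can read)
buckets <- botor::s3_list_buckets()

# Probe a specific bucket: s3_ls() should error if access is denied,
# so wrap it in try() to get a TRUE/FALSE answer
can_read <- !inherits(
  try(botor::s3_ls("s3://alpha-everyone"), silent = TRUE),
  "try-error"
)
```

The exact error behaviour depends on the bucket policy, so treat this as a heuristic rather than a definitive access check.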

What files do I have access to?

## List all the files in the alpha-everyone bucket
s3tools::list_files_in_buckets('alpha-everyone')
# Using botor, note the s3:// must be present
botor::s3_ls('s3://alpha-everyone')

## You can list files in more than one bucket:
s3tools::list_files_in_buckets(c('alpha-everyone', 'alpha-dash'))
# botor only accepts one bucket as an argument
purrr::map(c('s3://alpha-everyone', 's3://alpha-dash'), botor::s3_ls) %>%
    dplyr::bind_rows()

## You can filter by prefix, to return only files in a folder
s3tools::list_files_in_buckets('alpha-everyone', prefix='s3tools_tests')
# Using botor the bucket and prefix are joined
botor::s3_ls('s3://alpha-everyone/s3tools_tests')

## The 'prefix' argument is used to filter results to any path that begins
## with the prefix.
s3tools::list_files_in_buckets('alpha-everyone', prefix='s3tools_tests', 
                               path_only = TRUE)
# Using botor select the uri column from the resulting data frame
botor::s3_ls('s3://alpha-everyone/s3tools_tests')$uri

## For more complex filters, you can always filter down the dataframe using
## standard R code:
library(dplyr)

## All files containing the string 'iris'
s3tools::list_files_in_buckets('alpha-everyone') %>%
    dplyr::filter(grepl("iris", path))  # Use a regular expression
# Using botor, which calls the equivalent column uri
botor::s3_ls('s3://alpha-everyone') %>%
    dplyr::filter(grepl("iris", uri))

## All excel files containing 'iris':
s3tools::list_files_in_buckets('alpha-everyone') %>%
    dplyr::filter(grepl("iris*.xls", path))
# Using botor
botor::s3_ls('s3://alpha-everyone') %>%
    dplyr::filter(grepl("iris*.xls", uri))

Reading files

csv files

df <- s3tools::s3_path_to_full_df(
    "alpha-everyone/s3tools_tests/folder1/iris_folder1_1.csv")
# Use s3_read in botor to specify the read function
df <- botor::s3_read(
    "s3://alpha-everyone/s3tools_tests/folder1/iris_folder1_1.csv",
    read.csv)

Previewing the first few rows of a file is more complicated, as it’s not directly supported by botor.

df <- s3tools::s3_path_to_preview_df(
    "alpha-moj-analytics-scratch/my_folder/10mb_random.csv")
# botor() returns a cached boto3 client, which can be used in the same
# way as using Python's boto3 library via reticulate 
client <- botor::botor()$client("s3")
obj <- client$get_object(
    Bucket = "alpha-moj-analytics-scratch", 
    Key = "my_folder/10mb_random.csv",
    Range = "bytes=0-12000")
df <- obj$Body$read()$decode() %>%
    textConnection() %>%
    read.csv() %>%
    head(n = 5)

Other file types

# Uses readxl if installed, otherwise errors
df <- s3tools::s3_path_to_full_df(
    "alpha-everyone/s3tools_tests/iris_base.xlsx")
# Use s3_read in botor to specify the read function
df <- botor::s3_read("s3://alpha-everyone/s3tools_tests/iris_base.xlsx", 
                     readxl::read_excel)

# Uses haven if installed, otherwise errors
df <- s3tools::s3_path_to_full_df(
    "alpha-everyone/s3tools_tests/iris_base.sav")
# Use s3_read in botor to specify the read function
df <- botor::s3_read("s3://alpha-everyone/s3tools_tests/iris_base.sav", 
                     haven::read_sav)

# Uses haven if installed, otherwise errors
df <- s3tools::s3_path_to_full_df(
    "alpha-everyone/s3tools_tests/iris_base.dta")
# Use s3_read in botor to specify the read function
df <- botor::s3_read("s3://alpha-everyone/s3tools_tests/iris_base.dta", 
                     haven::read_dta)

# Uses haven if installed, otherwise errors
df <- s3tools::s3_path_to_full_df(
    "alpha-everyone/s3tools_tests/iris_base.sas7bdat")
# Use s3_read in botor to specify the read function
df <- botor::s3_read("s3://alpha-everyone/s3tools_tests/iris_base.sas7bdat", 
                     haven::read_sas)

df <- s3tools::read_using(
    FUN = readr::read_csv, 
    path = "alpha-everyone/s3tools_tests/iris_base.csv")
df <- botor::s3_read("s3://alpha-everyone/s3tools_tests/iris_base.csv", 
                     readr::read_csv)

Downloading files

botor will default to not checking whether it’s overwriting files.

df <- s3tools::download_file_from_s3(
    "alpha-everyone/s3tools_tests/iris_base.csv", "my_downloaded_file.csv")
# force is the same as overwrite in s3tools::download_file_from_s3, set to
# TRUE by default
botor::s3_download_file(
    "s3://alpha-everyone/s3tools_tests/iris_base.csv",
    "my_downloaded_file.csv", force = FALSE
    )

# By default, if the file already exists you will receive an error.
# To override:
df <- s3tools::download_file_from_s3(
    "alpha-everyone/s3tools_tests/iris_base.csv", 
    "my_downloaded_file.csv", 
    overwrite = TRUE
)
# No need to change the default behaviour
df <- botor::s3_download_file(
    "s3://alpha-everyone/s3tools_tests/iris_base.csv",
    "my_downloaded_file.csv"
)

Writing data to S3

Writing files to S3

botor will not check if files written to S3 already exist.

s3tools::write_file_to_s3("my_downloaded_file.csv", 
                          "alpha-everyone/delete/my_downloaded_file.csv")
# With botor check if an S3 object already exists with s3_exists
if (!botor::s3_exists(
        "s3://alpha-everyone/delete/my_downloaded_file.csv")) {
    botor::s3_upload_file(
        "my_downloaded_file.csv", 
        "s3://alpha-everyone/delete/my_downloaded_file.csv"
    ) 
} else {
    stop("s3://alpha-everyone/delete/my_downloaded_file.csv exists")
}

# By default, if the file already exists you will receive an error.
# To override:
s3tools::write_file_to_s3("my_downloaded_file.csv", 
                          "alpha-everyone/delete/my_downloaded_file.csv", 
                          overwrite = TRUE)
# s3_upload_file will always overwrite an existing object
botor::s3_upload_file("my_downloaded_file.csv", 
                      "s3://alpha-everyone/delete/my_downloaded_file.csv")

Writing a dataframe to s3 in csv format

s3tools::write_df_to_csv_in_s3(iris, "alpha-everyone/delete/iris.csv")
# s3_write will always overwrite an existing object 
if (!botor::s3_exists("s3://alpha-everyone/delete/iris.csv")) {
    botor::s3_write(iris, write.csv, "s3://alpha-everyone/delete/iris.csv")
} else {
    stop("s3://alpha-everyone/delete/iris.csv already exists")
}

# By default, if the file already exists you will receive an error.
# To override:
s3tools::write_df_to_csv_in_s3(iris, "alpha-everyone/delete/iris.csv", 
                               overwrite = TRUE)
# s3_write will always overwrite an existing object
botor::s3_write(iris, write.csv, "s3://alpha-everyone/delete/iris.csv")