Introduction
The rcdf package is a powerful toolkit for securely working with RCDF (Encrypted Parquet) files in R. RCDF is a custom data format designed to provide strong encryption and metadata management for sensitive datasets. With rcdf, users can easily handle encrypted data, including reading, writing, and exporting data stored in this secure format.
This vignette will walk you through the key features of the package, including how to encrypt and save your data in RCDF format, how to decrypt and read RCDF files, and how to export data to other common formats.
Installation
To use the rcdf package, you’ll need to install it first. You can install the package directly from GitHub using the devtools package:
# Install the package from GitHub
devtools::install_github("yng-me/rcdf")
Once installed, you can load the package and start working with RCDF files.
Writing data to RCDF format
The core function for writing data to the RCDF format is write_rcdf(). This function encrypts your data using AES encryption, generates encrypted metadata for version control using RSA encryption, and saves the data as encrypted Parquet files inside a zip archive. This ensures that the data is stored securely and can only be decrypted using the correct key.
Usage:
write_rcdf(data, path, pub_key, ..., metadata = list())
Parameters:
- data: A list of data frames or tables to be written to RCDF format. Each element of the list represents a record.
- path: The path where the RCDF file will be written. The file will be saved with a .rcdf extension if not already specified.
- pub_key: The public RSA key used to encrypt the AES encryption keys.
- ...: Additional arguments passed to helper functions if needed.
- metadata: A list of metadata to be included in the RCDF file. Can contain system information or other relevant details.
# Sample data (list of data frames)
data <- rcdf_list()
data$table1 <- data.frame(x = 1:10, y = letters[1:10])
data$table2 <- data.frame(a = rnorm(10), b = rnorm(10))
# Sample public RSA key (for encryption)
pub_key <- file.path(system.file("extdata", package = "rcdf"), "sample-public-key.pem")
# Write the data to an RCDF file
write_rcdf(data = data, path = "path/to/rcdf_file.rcdf", pub_key = pub_key)
In this example:
- data is a list containing two data frames. These will be encrypted and saved as separate Parquet files within the RCDF.
- pub_key is the RSA public key used to encrypt the AES keys. The AES keys are used for encrypting the data in a fast and secure manner.
The write_rcdf() function will create a zip archive containing the encrypted Parquet files and metadata, then save it to path.
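Encrypting data with AES and then encrypting the AES key with an RSA public key is generally known as hybrid encryption. The sketch below illustrates that general pattern with the openssl R package. It is not rcdf's implementation: the cipher mode (AES-256-CBC), the key sizes, the freshly generated key pair, and the use of serialize() in place of Parquet files are all assumptions made for illustration.

```r
library(openssl)

# Recipient's RSA key pair, generated here for the demo
# (assumption: rcdf instead reads keys from PEM files)
rsa_key <- rsa_keygen(bits = 2048)
rsa_pub <- rsa_key$pubkey

# Data to protect, turned into raw bytes
# (assumption: rcdf encrypts Parquet files rather than serialized objects)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
payload <- serialize(df, connection = NULL)

# 1. Encrypt the payload with a random 256-bit AES key
aes_key <- rand_bytes(32)
iv <- rand_bytes(16)
ciphertext <- aes_cbc_encrypt(payload, key = aes_key, iv = iv)

# 2. Encrypt ("wrap") the AES key with the RSA public key
wrapped_key <- rsa_encrypt(aes_key, pubkey = rsa_pub)

# 3. The recipient unwraps the AES key with the private key,
#    then decrypts the payload
unwrapped_key <- rsa_decrypt(wrapped_key, key = rsa_key)
restored <- unserialize(aes_cbc_decrypt(ciphertext, key = unwrapped_key, iv = iv))

identical(restored, df)
```

The design point is that the slow RSA operation only touches a 32-byte key, while the bulk data goes through the much faster AES cipher.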
Reading and decrypting RCDF data
To read and decrypt an RCDF file, you can use the read_rcdf() function. This function extracts the encrypted Parquet files from the RCDF archive, decrypts them using the provided decryption key, and loads the data back into R as an RCDF object.
Usage:
read_rcdf(path, decryption_key, ..., password = NULL, metadata = NULL)
Parameters:
- path: A string specifying the path to the RCDF archive (zip file).
- decryption_key: The key used to decrypt the RCDF contents. This can be an RSA or AES key, depending on how the RCDF was encrypted.
- ...: Additional parameters passed to other functions, if needed.
- password: A password used for RSA decryption (optional).
- metadata: An optional metadata object containing data dictionaries and value sets. This metadata is applied to the data if provided.
# Using sample RCDF data
dir <- system.file("extdata", package = "rcdf")
rcdf_path <- file.path(dir, 'mtcars.rcdf')
private_key <- file.path(dir, 'sample-private-key.pem')
rcdf_data <- read_rcdf(path = rcdf_path, decryption_key = private_key)
rcdf_data
# Using encrypted/password protected private key
rcdf_path_pw <- file.path(dir, 'mtcars-pw.rcdf')
private_key_pw <- file.path(dir, 'sample-private-key-pw.pem')
pw <- '1234'
rcdf_data_with_pw <- read_rcdf(
  path = rcdf_path_pw,
  decryption_key = private_key_pw,
  password = pw
)
rcdf_data_with_pw
In this example:
- path is the path to the RCDF file that contains the encrypted data.
- decryption_key is the key used to decrypt the AES keys and Parquet files. If the RCDF was encrypted using RSA, you’ll need the private RSA key to decrypt it.
The read_rcdf() function returns an RCDF object, which is essentially a list of decrypted Parquet files (one for each data frame in the original data) along with metadata about the file.
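For readers unfamiliar with password-protected private keys, the following sketch shows how such a key can be created and loaded with the openssl R package. It only illustrates the idea behind the password argument; the key pair, the temporary file, and the password are made up for the demo, and this is not necessarily how read_rcdf() handles keys internally.

```r
library(openssl)

# Generate a demo key pair (assumption: real keys come from your own PEM files)
key <- rsa_keygen(bits = 2048)
key_path <- tempfile(fileext = ".pem")

# Write the private key to disk, encrypted with a password
write_pem(key, path = key_path, password = "1234")

# Loading the key back requires the same password
loaded <- read_key(key_path, password = "1234")

# The loaded private key decrypts what the matching public key encrypted
secret <- charToRaw("aes key material")
recovered <- rsa_decrypt(rsa_encrypt(secret, pubkey = key$pubkey), key = loaded)
rawToChar(recovered)
```

In real use, avoid hard-coding the password in a script; prompting for it or reading it from an environment variable is a safer choice.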
Exporting data to other formats
Once the data has been decrypted and read into R, you can export it to other formats using the write_rcdf_as() or write_rcdf_*() family of functions. These functions support a wide variety of common formats, including CSV, TSV, JSON, Parquet, Excel, Stata, SPSS, and SQLite.
Exporting data to CSV format
The write_rcdf_csv() function allows you to export data stored in an RCDF object to CSV files. This is useful when you need to share or process the data in a non-encrypted, readable format.
Usage:
write_rcdf_csv(data, path, ..., parent_dir = NULL)
Parameters:
- data: The RCDF object that contains the decrypted data. This is the data you obtained from calling read_rcdf() or other decryption methods.
- path: The target directory or file where the CSV files will be saved.
- ...: Additional arguments passed to the write.csv() function for customizing the CSV export (e.g., row names or the string used for missing values). Note that write.csv() ignores attempts to change the separator.
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_csv(data = rcdf_data, path = "path/to/output", row.names = FALSE)
This will save each table in the RCDF object as a separate CSV file in the specified directory.
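To make the one-file-per-table layout concrete, here is a rough base R equivalent operating on a plain named list of data frames. It is only a sketch: the tables list, the output directory, and the file naming (table name plus .csv) are assumptions, and write_rcdf_csv() may name or arrange its output differently.

```r
# Hypothetical stand-in for the tables held by an RCDF object
tables <- list(
  table1 = data.frame(x = 1:3, y = c("a", "b", "c")),
  table2 = data.frame(a = c(0.5, 1.5))
)

# Write into a temporary directory so the sketch leaves no stray files
out_dir <- file.path(tempdir(), "csv_export")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# One CSV file per table, named after the list element
for (name in names(tables)) {
  write.csv(tables[[name]], file.path(out_dir, paste0(name, ".csv")), row.names = FALSE)
}

list.files(out_dir)
```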
Exporting data to TSV format
The write_rcdf_tsv() function is similar to the CSV export function but writes the data as tab-separated values (TSV) files.
Usage:
write_rcdf_tsv(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object containing the data to export.
- path: The target directory or file for the output TSV files.
- ...: Additional arguments for customizing the TSV export passed to the write.table() function (e.g., setting delimiters, handling row names).
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_tsv(data = rcdf_data, path = "path/to/output", row.names = FALSE)
This function will save each data frame in the RCDF object as a separate TSV file in the target location.
Exporting data to JSON format
The write_rcdf_json() function allows you to export the decrypted RCDF data to JSON format. This is useful when working with APIs or other systems that require data in JSON.
Usage:
write_rcdf_json(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The target directory or file for saving the JSON files.
- ...: Additional arguments passed to jsonlite::toJSON() to customize the JSON export (such as specifying indentation or compactness of the JSON output).
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_json(data = rcdf_data, path = "path/to/output", pretty = TRUE)
This will convert each data frame in the RCDF object into a separate JSON file and save them in the specified directory. The pretty = TRUE option ensures that the output JSON files are human-readable with proper indentation.
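The effect of pretty = TRUE can be seen by calling jsonlite::toJSON() directly on a small data frame. This sketch uses a made-up data frame and calls jsonlite itself rather than write_rcdf_json(), so it shows the option's behavior, not the package's file output.

```r
library(jsonlite)

# Small demo data frame (not taken from an RCDF file)
df <- data.frame(x = 1:2, y = c("a", "b"))

# Default output is compact: one line, no extra whitespace
compact_json <- toJSON(df)

# pretty = TRUE adds line breaks and indentation
pretty_json <- toJSON(df, pretty = TRUE)

cat(compact_json, "\n")
cat(pretty_json, "\n")

# Both forms parse back to the same data
back <- fromJSON(pretty_json)
```

Pretty-printed files are larger, so the compact form may be preferable for big tables that only machines will read.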
Exporting data to Parquet format
The write_rcdf_parquet() function exports the decrypted data back into the Parquet format. Parquet is a columnar storage format that is highly efficient for big data processing.
Usage:
write_rcdf_parquet(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The directory or file path where the Parquet files will be saved.
- ...: Additional arguments passed to the write_parquet() function for customization, such as specifying compression type.
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_parquet(data = rcdf_data, path = "path/to/output")
This function will write each data frame in the RCDF object into separate Parquet files, storing them in the specified directory.
Exporting data to Excel format
The write_rcdf_xlsx() function is used to export the decrypted RCDF data to Excel (.xlsx) format. It’s helpful when sharing data with users who prefer spreadsheet software.
Usage:
write_rcdf_xlsx(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The directory or file path where the Excel file will be saved.
- ...: Additional arguments to customize the Excel file export in the openxlsx package.
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_xlsx(data = rcdf_data, path = "path/to/output.xlsx", sheetName = "Sheet1")
Exporting data to Stata format
The write_rcdf_dta() function allows you to export the data to Stata’s .dta file format. This is useful for users who need to work with the data in Stata.
Usage:
write_rcdf_dta(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The path where the Stata .dta file will be saved.
- ...: Additional arguments passed to the write.dta() function (e.g., specifying version of Stata).
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_dta(data = rcdf_data, path = "path/to/output")
Exporting data to SPSS format
The write_rcdf_sav() function is for exporting the decrypted RCDF data to SPSS’s .sav file format.
Usage:
write_rcdf_sav(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The path where the .sav file will be saved.
- ...: Additional arguments for customizing the SPSS file export.
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_sav(data = rcdf_data, path = "path/to/output")
Exporting data to SQLite database format
The write_rcdf_sqlite() function allows you to export the decrypted RCDF data to an SQLite database (with .db extension). Each data frame is saved as a table within the SQLite database.
Usage:
write_rcdf_sqlite(data, path, ..., parent_dir = NULL)
Parameters:
- data: The decrypted RCDF object.
- path: The path where the SQLite database file will be created.
- ...: Additional arguments for customizing the SQLite export.
- parent_dir: An optional parent directory to be included in the path where the files will be written.
write_rcdf_sqlite(data = rcdf_data, path = "path/to/output")
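The "one table per data frame" layout can be illustrated with the DBI and RSQLite packages. The sketch below works on a made-up named list of data frames and writes to a temporary file; it shows the general approach only, and whether write_rcdf_sqlite() uses these packages or these table names is an assumption.

```r
library(DBI)

# Hypothetical stand-in for the tables held by an RCDF object
tables <- list(
  table1 = data.frame(x = 1:3, y = c("a", "b", "c")),
  table2 = data.frame(a = c(0.5, 1.5))
)

# Create (or open) an SQLite database file
db_path <- tempfile(fileext = ".db")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Each data frame becomes a table named after its list element
for (name in names(tables)) {
  dbWriteTable(con, name, tables[[name]])
}

# Inspect the result, then close the connection
table_names <- dbListTables(con)
first_table <- dbReadTable(con, "table1")
dbDisconnect(con)

table_names
```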
Exporting data to multiple formats simultaneously
The write_rcdf_as() function allows you to export decrypted RCDF data into multiple file formats simultaneously.
Usage:
write_rcdf_as(data, path, formats, ...)
Parameters:
- data: A named list or RCDF object. Each element should be a table or tibble-like object (typically a dbplyr or dplyr table).
- path: The target directory where output files should be saved.
- formats: A character vector of file formats to export to. Supported formats include: "csv", "tsv", "json", "parquet", "xlsx", "dta", "sav", and "sqlite".
- ...: Additional arguments passed to the respective writer functions.
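Based on the signature above, a call would presumably look like write_rcdf_as(data = rcdf_data, path = "path/to/output", formats = c("csv", "json")). To show what fanning one call out to several writers involves, here is a base R sketch limited to CSV and TSV. The export_as() helper, the writers list, and the one-subdirectory-per-format layout are all hypothetical choices for this illustration, not part of rcdf.

```r
# Hypothetical writers keyed by format name (not rcdf functions)
writers <- list(
  csv = function(df, target) write.csv(df, target, row.names = FALSE),
  tsv = function(df, target) {
    write.table(df, target, sep = "\t", row.names = FALSE, quote = FALSE)
  }
)

# Hypothetical helper: write every table once per requested format
export_as <- function(tables, path, formats) {
  unknown <- setdiff(formats, names(writers))
  if (length(unknown) > 0) {
    stop("Unsupported format(s): ", paste(unknown, collapse = ", "))
  }
  written <- character(0)
  for (fmt in formats) {
    # Assumed layout for this sketch: one subdirectory per format
    fmt_dir <- file.path(path, fmt)
    dir.create(fmt_dir, recursive = TRUE, showWarnings = FALSE)
    for (name in names(tables)) {
      target <- file.path(fmt_dir, paste0(name, ".", fmt))
      writers[[fmt]](tables[[name]], target)
      written <- c(written, target)
    }
  }
  invisible(written)
}

tables <- list(table1 = data.frame(x = 1:3, y = c("a", "b", "c")))
out <- export_as(tables, file.path(tempdir(), "multi_export"), formats = c("csv", "tsv"))
basename(out)
```

Validating the requested formats before writing anything avoids leaving a half-finished export behind when one format name is misspelled.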