Memory: 16 GB
Chip: Apple M3
Total Number of Cores: 8 (4 performance and 4 efficiency)
At work, we primarily work in SAS. This due to SAS as being a validated language and because of the large size of the data we work with. That said, this is 2024—SAS licenses are extremely expensive and the language is not nearly as flexible as open source offerings.
I generally prefer working in R, but one of the biggest challenges I’ve faced in using R at work is the fact that while SAS is able to handle data sets that are larger than memory by using a combination of fast disk I/O and memory management, R—like Python, natively prefers to manipulate data sets in memory. This is challenging when the datasets can be 100s of gigabytes in in question.
I’ve recently heard a lot of good things about DuckDB and been reading through @hrbrmstr’s Cooking with DuckDB. DuckDB is a relational database system designed for Online analytic processing (OLAP) workloads. It can be compiled with no external dependencies—incredibly useful for embedded applications—and is known for its ability to query large databases very quickly. I decided to give it a try. This post is intended to do some basic benchmarking against data in the Sentinel Common Data Model (SCDM). It has R, Python, and Rust APIs, which will make it incredibly useful for querying large datasets from these languages1.
Getting Started
While it’s not nearly as large as the data our Data Partners hold or our internal test data, the CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use Files (SynPUFs) are publically available synthetic data in the Sentinel Common Data Model (SCDM) format, able to be used publicly, and still of a decent size. The SynPUFs are 5% synthetic samples of Medicare claims data from 2008 to 2010 and consist of 20, mutually exclusive datasets in the SAS7BDAT format.
The first challenge is that these datasets are large—e.g., the diagnosis table consists of twenty 810 MB .SAS7BDAT format files. As SAS7BDAT is a propriety format and I don’t have a SAS license on my home machine, I needed to convert these files to a format that DuckDB could read. The haven library offers read_sas(), but it requires holding the data in memory, and with 16 Gb of RAM on my MBP, binding the the subsamples for the largest datasets together and writing to parquet was a non-starter.
Fortunately, thanks to R’s purrr(), I was able to devise an inelegant solution2. I didn’t bother benchmarking this step, but it took roughly an hour on the MBP to chew through all of the datasets, appending each partition to a DuckDB table one-by-one, and then writing to parquet. Listing 1 shows the code used for this.
tables <- c("death", "demographic", "diagnosis", "dispensing", "encounter", "enrollment",
"procedure", "facility", "provider")
con <- DBI::dbConnect(duckdb::duckdb(dbdir = "~/dev/SynPUFsDuckDB/data/duckdb.duckdb"))
purrr::walk(tables, function(x) {
data <- haven::read_sas(paste0("data/", x, "_1", ".sas7bdat"))
DBI::dbExecute(con, "drop table if exists data")
duckdb::dbWriteTable(con, "data", data)
num <- 2:20
purrr::walk(num, function(y) {
data <- haven::read_sas(paste0("data/", x, "_", y, ".sas7bdat"))
duckdb::dbAppendTable(con, "data", data)
})
DBI::dbExecute(con, sprintf("copy data to '~/dev/SynPUFsDuckDB/data/%s-snappy.parquet' (format 'parquet');",
x))
})
DBI::dbExecute(con, "drop table if exists data")
DBI::dbDisconnect(con, shutdown = TRUE)Now that we have that data in a more usable format. Let’s look at what we have. I’ll start by summarizing the tables. DuckDB’s summarize feature is useful, but overkill here, so we’re going simply going to describe the tables.
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/enrollment-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| Enr_Start | DATE | YES | |||
| Enr_End | DATE | YES | |||
| MedCov | VARCHAR | YES | |||
| DrugCov | VARCHAR | YES | |||
| Chart | VARCHAR | YES |
real 0m0.012s user 0m0.008s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/demographic-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| Birth_date | DATE | YES | |||
| Sex | VARCHAR | YES | |||
| Hispanic | VARCHAR | YES | |||
| Race | VARCHAR | YES | |||
| PostalCode | VARCHAR | YES | |||
| PostalCode_date | DATE | YES |
real 0m0.011s user 0m0.007s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/dispensing-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| ProviderID | UBIGINT | YES | |||
| RxDate | DATE | YES | |||
| Rx | VARCHAR | YES | |||
| Rx_CodeType | VARCHAR | YES | |||
| RxSup | UINTEGER | YES | |||
| RxAmt | UINTEGER | YES |
real 0m0.016s user 0m0.012s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/encounter-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| EncounterID | UBIGINT | YES | |||
| ADate | DATE | YES | |||
| DDate | DATE | YES | |||
| EncType | VARCHAR | YES | |||
| FacilityID | UBIGINT | YES | |||
| Discharge_Disposition | VARCHAR | YES | |||
| Discharge_Status | VARCHAR | YES | |||
| DRG | VARCHAR | YES | |||
| DRG_Type | VARCHAR | YES | |||
| Admitting_Source | VARCHAR | YES |
real 0m0.019s user 0m0.016s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/diagnosis-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| EncounterID | UBIGINT | YES | |||
| ADate | DATE | YES | |||
| ProviderID | UBIGINT | YES | |||
| EncType | VARCHAR | YES | |||
| DX | VARCHAR | YES | |||
| DX_codetype | VARCHAR | YES | |||
| OrigDX | VARCHAR | YES | |||
| PDX | VARCHAR | YES | |||
| PAdmit | VARCHAR | YES |
real 0m0.040s user 0m0.034s sys 0m0.004s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/procedure-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| EncounterID | UBIGINT | YES | |||
| ADate | DATE | YES | |||
| ProviderID | UBIGINT | YES | |||
| EncType | VARCHAR | YES | |||
| PX | VARCHAR | YES | |||
| PX_codetype | VARCHAR | YES | |||
| OrigPX | VARCHAR | YES |
real 0m0.026s user 0m0.021s sys 0m0.003s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/death-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| PatID | UBIGINT | YES | |||
| DeathDt | DATE | YES | |||
| DtImpute | VARCHAR | YES | |||
| Source | VARCHAR | YES | |||
| Confidence | VARCHAR | YES |
real 0m0.011s user 0m0.007s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/facility-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| FacilityID | UBIGINT | YES | |||
| Facility_Location | VARCHAR | YES |
real 0m0.010s user 0m0.007s sys 0m0.002s
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
DESCRIBE SELECT *
FROM '~/dev/SynPUFsDuckDB/data/provider-snappy.parquet';"| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| ProviderID | UBIGINT | YES | |||
| Specialty | VARCHAR | YES | |||
| Specialty_CodeType | VARCHAR | YES |
real 0m0.011s user 0m0.007s sys 0m0.002s
When I initially ran this, I discovered a new challenge: As R doesn’t natively have a 8-byte (64-bit) numeric type, it coerced any column that was an integer larger than 8-bytes to a double. I didn’t want to take the chance of out of memory errors to work with these as a dataframe, so I used DuckDB to alter the type. UBIGINT is DuckDB’s unsigned 8-byte integer type.
Let’s count grab the number of records in each parquet file and unioning them together:
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
SELECT format('Enrollment') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/enrollment-snappy.parquet'
UNION
SELECT format('Demographic') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/demographic-snappy.parquet'
UNION
SELECT format('Dispensing') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/dispensing-snappy.parquet'
UNION
SELECT format('Encounter') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/encounter-snappy.parquet'
UNION
SELECT format('Diagnosis') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/diagnosis-snappy.parquet'
UNION
SELECT format('Procedure') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/procedure-snappy.parquet'
UNION
SELECT format('Death') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/death-snappy.parquet'
UNION
SELECT format('Facility') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/facility-snappy.parquet'
UNION
SELECT format('Provider') as Table
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/provider-snappy.parquet'
;"| Table | count |
|---|---|
| Death | 98152 |
| Demographic | 2224739 |
| Provider | 12803343 |
| Facility | 53686 |
| Encounter | 111738882 |
| Dispensing | 111085327 |
| Enrollment | 3668891 |
| Diagnosis | 352441529 |
| Procedure | 222919598 |
real 0m0.129s user 0m0.185s sys 0m0.025s
As this tells you, we’re working with tables that have between ~53K to 350+ M records. Hardly large compared to some of the real datasets we might work with, but reasonably sized for benchmarking.
Querying the Data
As I explored DuckDB, I ran a number of different SQL queries—some of which had practical purposes, others which were essentially cursed—as I attempted to put the tool through its passes. I was consistently impressed by the speed at which it returned answers. I decided to start with a relatively simple query of some practical use. Listing 2 illustrates SAS code which does the following:
- Create a SAS format which bins ages into age cohorts
- Calculates age as of the maximum date of the data—2010-12-31, stratifies by Sex, Ethnicity, and Race, and counts the number of records of a given age.
- Applies the label and re-aggregates count.
- Removes Sex, Ethnicity, and Race from the dataset, and re-aggregates the count.
For reference, the demographic table is smaller than the diagnosis table with:
libname synpufs "/path/obscured/SynPUFsDuckDB/data/";
proc format;
value agecat_years
. = "00. Missing"
low-<0 = "00. Negative"
0-<2 = "01. 0-1 yrs"
2-<5 = "02. 2-4 yrs"
5-<10 = "03. 5-9 yrs"
10-<15 = "04. 10-14 yrs"
15-<19 = "05. 15-18 yrs"
19-<22 = "06. 19-21 yrs"
22-<25 = "07. 22-24 yrs"
25-<35 = "08. 25-34 yrs"
35-<45 = "09. 35-44 yrs"
45-<55 = "10. 45-54 yrs"
55-<60 = "11. 55-59 yrs"
60-<65 = "12. 60-64 yrs"
65-<70 = "13. 65-69 yrs"
70-<75 = "14. 70-74 yrs"
75-high = "15. 75+ yrs"
;
run;
proc sql noprint;
create table _dem_l2_age as
select floor((intck("month",birth_date,"31Dec2010"d)-(day("31Dec2010"d)<day(birth_date)))/12) as age_years label="Age (Years)"
, sex
, hispanic
, race
, count(*) as count format=comma16.
from synpufs.demographic
group by calculated age_years, sex, hispanic, race;
quit;
proc sql noprint;
create table dem_l2_agecat_catvars as
select put(age_years,agecat_years.) as agecat_years label="Age Category (Years)"
, sex
, hispanic
, race
, sum(count) as count format=comma16.
from _dem_l2_age
group by calculated agecat_years, sex, hispanic, race
order by calculated agecat_years;
quit;
proc sql noprint;
create table dem_l2_agecat as
select agecat_years
, sum(count) as count format=comma16.
from dem_l2_agecat_catvars
group by agecat_years;
quit;
I called this on the SAS Server via the CLI with time, and got the following results:
real 0m5.178s
user 0m1.871s
sys 0m0.153s
Now let’s run the same query via DuckDB in the command line—note the time below Listing 3.
time /opt/homebrew/bin/duckdb duckdb.duckdb -markdown -c "
CREATE OR REPLACE TEMP TABLE dem_l3_tmp AS
SELECT (CAST(FLOOR(CAST(DATESUB('month', Birth_date, '2010-12-31') AS INTEGER) / 12) AS INTEGER)) as age
, sex
, hispanic
, race
, count(*) as count
FROM '~/dev/SynPUFsDuckDB/data/demographic-snappy.parquet'
GROUP BY age, sex, hispanic, race;
CREATE OR REPLACE TEMP TABLE dem_l3_agecat_catvars AS
SELECT CASE WHEN age IS NULL THEN '00. MISSING'
WHEN age < 0 THEN '00. NEGATIVE'
WHEN AGE BETWEEN 2 AND 4 THEN '02. 2-4 yrs'
WHEN AGE BETWEEN 5 AND 9 THEN '03. 5-9 yrs'
WHEN AGE BETWEEN 10 AND 14 THEN '04. 10-14 yrs'
WHEN AGE BETWEEN 15 AND 18 THEN '05. 15-18 yrs'
WHEN AGE BETWEEN 19 AND 21 THEN '06. 19-21 yrs'
WHEN AGE BETWEEN 22 AND 24 THEN '07. 22-24 yrs'
WHEN AGE BETWEEN 25 AND 34 THEN '08. 25-34 yrs'
WHEN AGE BETWEEN 35 AND 44 THEN '09. 35-44 yrs'
WHEN AGE BETWEEN 45 AND 54 THEN '10. 45-54 yrs'
WHEN AGE BETWEEN 55 AND 59 THEN '11. 55-59 yrs'
WHEN AGE BETWEEN 60 AND 64 THEN '12. 60-64 yrs'
WHEN AGE BETWEEN 65 AND 69 THEN '13. 65-69 yrs'
WHEN AGE BETWEEN 70 AND 74 THEN '14. 70-74 yrs'
ELSE '15. 75+' END as agecat_years
, sex
, hispanic
, race
, sum(count) as count
FROM dem_l3_tmp
GROUP BY agecat_years, sex, hispanic, race
ORDER BY agecat_years;
CREATE OR REPLACE TABLE dem_l3_agecat AS
SELECT
agecat_years,
SUM(count) AS count
FROM
dem_l3_agecat_catvars
GROUP BY
agecat_years
ORDER BY
agecat_years;
"
real 0m0.059s
user 0m0.283s
sys 0m0.012s
As you can see, this results in a runtime that is a fraction of what SAS is delivering. For the sake of completeness, let’s compare the results.
| agecat_years | count |
|---|---|
| 08. 25-34 yrs | 24747 |
| 09. 35-44 yrs | 49660 |
| 10. 45-54 yrs | 100906 |
| 11. 55-59 yrs | 70964 |
| 12. 60-64 yrs | 80540 |
| 13. 65-69 yrs | 342196 |
| 14. 70-74 yrs | 460971 |
| 15. 75+ | 1094755 |
Obs agecat_years count
1 08. 25-34 yrs 24,747
2 09. 35-44 yrs 49,660
3 10. 45-54 yrs 100,906
4 11. 55-59 yrs 70,964
5 12. 60-64 yrs 80,540
6 13. 65-69 yrs 342,196
7 14. 70-74 yrs 460,971
8 15. 75+ yrs 1,094,755
Let’s try something a little more cursed. Listing 4 shows a query which attempts to calculate the number of diagnoses per Encounter per Patient. It’s going to output 111706063 rows, so I’m not going to bother printing results, just the runtime.
time /opt/homebrew/bin/duckdb duckdb.duckdb -c "
CREATE OR REPLACE TEMP TABLE dia_l2_enc_patid AS
SELECT a.PatID
, a.EncounterID
, count(b.DX) as count
FROM '~/dev/SynPUFsDuckDB/data/encounter-snappy.parquet' as a
JOIN '~/dev/SynPUFsDuckDB/data/diagnosis-snappy.parquet' as b
ON (a.PatID = b.PatID and a.EncounterID = b.EncounterID)
GROUP BY a.EncounterID, a.PatID;"
real 0m12.780s
user 0m50.538s
sys 0m12.321s
We’re joining a table with 352,441,529 records to a table with 111,738,882 M records. Each time I run this query on my laptop, it takes seconds. I gave up on SAS after about 45 minutes.
Final Thoughts
DuckDB yields amazing performance even on machine without a ton of horsepower and it’s ability to quickly query larger than memory datasets will make it much easier to work with larger than memory datasets in languages like R and Python. The fact that it is designed for embedded applications with minimal external dependencies makes it a good candidate for deployment in distributed research networks, and it’s use of fairly standard SQL makes migrating to it incredibly easy.
What about R?
There are a number of methods for running a query and returning an R object. They deserve their own post.
Footnotes
I am aware of arrow and pola.rs and I will eventually benchmark these versus DuckDB.↩︎
A better solution would have been to use SAS to convert the SynPUF data from the SAS7BDAT format to JSON or CSV, however, I don’t have a SAS license for my personal machine, my work laptop doesn’t have the hard drive space left to hold it, and it VPN and bandwidth constraints would have made it challenging to remove from my work environment.↩︎
Citation
@online{scarnecchia2024,
author = {Scarnecchia, D.},
title = {Manipulating {Sentinel} {Data} with {DuckDB}},
date = {2024-03-31},
url = {https://musings.mountainherder.xyz/posts/2024-04-03_duckdb-r-sentinel-data},
langid = {en}
}