Introducing the claimsdb package for teaching and learning

health care
claims
Medicare
databases
Published

February 12, 2022

Diagram of claims database tables with arrows to show data relationships

In the fields of health informatics and health services research, health insurance claims data are a valuable resource to help answer questions about health care access and financing. However, claims data in the real world often contains both sensitive (protected health information) and proprietary (trade secrets) elements. For most students and educators seeking opportunities to learn how to use claims data, there are few available sources for practice.

To help with this problem, claimsdb 📦 provides easy access to a sample of health insurance enrollment and claims data from the Centers for Medicare and Medicaid Services (CMS) Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF), as a set of relational tables or as an in-memory database using DuckDB. All the data is contained within a single package, so users do not need to download any external data. This package is inspired by and based on the starwarsdb package.

The data are structured as actual Medicare claims data but are fully “synthetic,” after a process of alterations meant to reduce the risk of re-identification of real Medicare beneficiaries. The synthetic process that CMS adopted changes the co-variation across variables, so analysts should be cautious about drawing inferences about the actual Medicare population.

The data included in claimsdb comes from 500 randomly selected 2008 Medicare beneficiaries from Sample 2 of the DE-SynPUF, and it includes all the associated claims for these members for 2008-2009. CMS provides resources, including a codebook, FAQs, and other documents with more information about this data.

To introduce claimsdb, this post covers the following topics: - Installation of the claimsdb package - How to setup a remote database with the included data - Examples of how to use claimsdb for analysis - An overview of the limitations of the data included in package

Installation and components of the claimsdb package

You can install the development version of claimsdb from GitHub with:

# install.packages("devtools")
devtools::install_github("jfangmeier/claimsdb")

You can then load claimsdb alongside your other packages. We will be using the tidyverse packages in our later examples.

library(tidyverse)
library(claimsdb)

The tables are available after loading the claimsdb package. This includes a schema that describes each of the tables and the included variables from the CMS DE-SynPUF. You can see that claimsdb includes five tables. One of the tables, bene contains beneficiary records, while the others include specific types of claims data.

schema
# A tibble: 5 x 5
  TABLE      TABLE_TITLE                                TABLE~1 UNIT_~2 PROPER~3
  <chr>      <chr>                                      <chr>   <chr>   <list>  
1 bene       CMS Beneficiary Summary DE-SynPUF          Synthe~ Benefi~ <tibble>
2 carrier    CMS Carrier Claims DE-SynPUF               Synthe~ claim   <tibble>
3 inpatient  CMS Inpatient Claims DE-SynPUF             Synthe~ claim   <tibble>
4 outpatient CMS Outpatient Claims DE-SynPUF            Synthe~ claim   <tibble>
5 pde        CMS Prescription Drug Events (PDE) DE-Syn~ Synthe~ claim   <tibble>
# ... with abbreviated variable names 1: TABLE_DESCRIPTION, 2: UNIT_OF_RECORD,
#   3: PROPERTIES

You can access details on the variables in each of the tables like in this example with the inpatient table. You can see that this table contains 35 fields, including a beneficiary code to identify members across tables as well as detailed information on each inpatient claim.

schema %>% 
  filter(TABLE == "inpatient") %>% 
  pull(PROPERTIES)
[[1]]
# A tibble: 35 x 3
   VARIABLE                 TYPE    DESCRIPTION                                 
   <chr>                    <chr>   <chr>                                       
 1 DESYNPUF_ID              string  Beneficiary Code                            
 2 CLM_ID                   string  Claim ID                                    
 3 SEGMENT                  numeric Claim Line Segment                          
 4 CLM_FROM_DT              date    Claims start date                           
 5 CLM_THRU_DT              date    Claims end date                             
 6 PRVDR_NUM                string  Provider Institution                        
 7 CLM_PMT_AMT              numeric Claim Payment Amount                        
 8 NCH_PRMRY_PYR_CLM_PD_AMT numeric NCH Primary Payer Claim Paid Amount         
 9 AT_PHYSN_NPI             string  Attending Physician National Provider Ident~
10 OP_PHYSN_NPI             string  Operating Physician National Provider Ident~
# ... with 25 more rows

Access claims data as a remote database

Many organizations store claims data in a remote database, so claimsdb also includes all of the tables as an in-memory DuckDB database. This can be a great way to practice working with this type of data, including building queries with dplyr code using dbplyr.

To setup the in-memory database, you need to create a database connection using claims_connect() and create connections to each of the tables you want to use.

library(dbplyr)

# Setup connection to duckDB database
con <- claims_connect()

# Setup connections to each of the enrollment and claims tables in the database
bene_rmt <- tbl(con, "bene")
inpatient_rmt <- tbl(con, "inpatient")
outpatient_rmt <- tbl(con, "outpatient")
carrier_rmt <- tbl(con, "carrier")
pde_rmt <- tbl(con, "pde")

You can then preview your connection to each of the remote tables.

# Preview the prescription drug event table in the database
pde_rmt
# Source:   table<pde> [?? x 8]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      PDE_ID    SRVC_DT    PROD_~1 QTY_D~2 DAYS_~3 PTNT_~4 TOT_R~5
   <chr>            <chr>     <date>     <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 00E040C6ECE8F878 83014463~ 2008-12-20 492880~      30      30       0      10
 2 00E040C6ECE8F878 83594465~ 2009-04-25 529590~      20      30       0       0
 3 00E040C6ECE8F878 83144465~ 2009-09-22 000834~      80      30      40      80
 4 00E040C6ECE8F878 83614464~ 2009-10-03 634810~      60      10       0      10
 5 00E040C6ECE8F878 83014461~ 2009-11-16 511294~      60      30       0      20
 6 00E040C6ECE8F878 83234464~ 2009-12-11 580160~     270      30       0      10
 7 014F2C07689C173B 83294462~ 2009-09-14 009045~      90      30       0      60
 8 014F2C07689C173B 83874466~ 2009-10-11 596040~      40      20       0     570
 9 014F2C07689C173B 83314462~ 2009-11-24 510790~      30      30       0     150
10 014F2C07689C173B 83874467~ 2009-12-29 511293~      30      30       0      10
# ... with more rows, and abbreviated variable names 1: PROD_SRVC_ID,
#   2: QTY_DSPNSD_NUM, 3: DAYS_SUPLY_NUM, 4: PTNT_PAY_AMT, 5: TOT_RX_CST_AMT

Examples of using claimsdb for analysis

To analyze and explore the claims and beneficiary data, you can execute your own SQL code on the database using the DBI package.

DBI::dbGetQuery(
  con, 
  paste0(
    'SELECT "DESYNPUF_ID", "BENE_BIRTH_DT"',
    'FROM "bene"',
    'WHERE "BENE_SEX_IDENT_CD" = 1.0',
    'LIMIT 10'
  )
)
        DESYNPUF_ID BENE_BIRTH_DT
1  001115EAB83B19BB    1939-12-01
2  03ADA78C0FEF79F4    1923-02-01
3  040A12AB5EAA444C    1943-10-01
4  0507DE00BC6E6CD6    1932-07-01
5  05672CCCED56BCAD    1937-07-01
6  060CDE3A044F64BA    1943-02-01
7  061B5B3D9A459675    1932-04-01
8  08C8E0A0C6EAC884    1954-06-01
9  09EEB5C4C4FAEF10    1935-04-01
10 0A58C6D6B9BE67CF    1942-07-01

However, in the following examples, we will use the dbplyr package which translates dplyr code into SQL that can be executed against the database. You can see that the results below with dplyr functions match the results above that used a SQL query.

bene_rmt %>% 
  filter(BENE_SEX_IDENT_CD == 1) %>% 
  select(DESYNPUF_ID, BENE_BIRTH_DT)
# Source:   SQL [?? x 2]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      BENE_BIRTH_DT
   <chr>            <date>       
 1 001115EAB83B19BB 1939-12-01   
 2 03ADA78C0FEF79F4 1923-02-01   
 3 040A12AB5EAA444C 1943-10-01   
 4 0507DE00BC6E6CD6 1932-07-01   
 5 05672CCCED56BCAD 1937-07-01   
 6 060CDE3A044F64BA 1943-02-01   
 7 061B5B3D9A459675 1932-04-01   
 8 08C8E0A0C6EAC884 1954-06-01   
 9 09EEB5C4C4FAEF10 1935-04-01   
10 0A58C6D6B9BE67CF 1942-07-01   
# ... with more rows

We can use the show_query() function to see the SQL code that dbplyr created and that it closely matches the SQL code above.

bene_rmt %>% 
  filter(BENE_SEX_IDENT_CD == 1) %>% 
  select(DESYNPUF_ID, BENE_BIRTH_DT) %>% 
  show_query()
<SQL>
SELECT DESYNPUF_ID, BENE_BIRTH_DT
FROM bene
WHERE (BENE_SEX_IDENT_CD = 1.0)

First, a note about working with dates/times in databases

dbplyr is an amazing tool for working with databases, especially if you want to use many functions from the dplyr and tidyr packages. However, it does not currently have SQL translations for all functions in the tidyverse family of packages. For example, the lubridate package’s date and time functions work on local dataframes but cannot be translated to work on remote tables at this time. In the example below, you can see that the lubridate function year() (for parsing the year from a date) works on the local dataframe but generates an error on the remote table with the same data.

bene %>% 
  transmute(
    DESYNPUF_ID,
    BENE_BIRTH_DT,
    BIRTH_YEAR = lubridate::year(BENE_BIRTH_DT)
  )
# A tibble: 998 x 3
   DESYNPUF_ID      BENE_BIRTH_DT BIRTH_YEAR
   <chr>            <date>             <dbl>
 1 001115EAB83B19BB 1939-12-01          1939
 2 0018A1975BC0EE4F 1935-11-01          1935
 3 00E040C6ECE8F878 1932-11-01          1932
 4 014F2C07689C173B 1935-07-01          1935
 5 029A22E4A3AAEE15 1921-05-01          1921
 6 03410742A11DDC52 1943-07-01          1943
 7 03ADA78C0FEF79F4 1923-02-01          1923
 8 040A12AB5EAA444C 1943-10-01          1943
 9 043AAAE41C9A37B7 1924-03-01          1924
10 0507DE00BC6E6CD6 1932-07-01          1932
# ... with 988 more rows
bene_rmt %>% 
  transmute(
    DESYNPUF_ID,
    BENE_BIRTH_DT,
    BIRTH_YEAR = lubridate::year(BENE_BIRTH_DT)
  )
# Source:   SQL [?? x 3]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      BENE_BIRTH_DT BIRTH_YEAR
   <chr>            <date>             <dbl>
 1 001115EAB83B19BB 1939-12-01          1939
 2 0018A1975BC0EE4F 1935-11-01          1935
 3 00E040C6ECE8F878 1932-11-01          1932
 4 014F2C07689C173B 1935-07-01          1935
 5 029A22E4A3AAEE15 1921-05-01          1921
 6 03410742A11DDC52 1943-07-01          1943
 7 03ADA78C0FEF79F4 1923-02-01          1923
 8 040A12AB5EAA444C 1943-10-01          1943
 9 043AAAE41C9A37B7 1924-03-01          1924
10 0507DE00BC6E6CD6 1932-07-01          1932
# ... with more rows

Fortunately, dbplyr allows you pass along SQL functions in your dplyr code, and it will include these functions in the generated query. For date/time functions, we need to consult the documentation from DuckDB on date functions. To parse a part of a date (e.g., the year), we need to use the date_part() function for DuckDB.

The function to do this task may vary across database backends, so if you are doing this with a different database (Oracle, SQL Server, etc.), you will need to read the documentation for that database management system.

bene_rmt %>% 
  transmute(
    DESYNPUF_ID,
    BENE_BIRTH_DT,
    BIRTH_YEAR = date_part('year', BENE_BIRTH_DT)
  )
# Source:   SQL [?? x 3]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      BENE_BIRTH_DT BIRTH_YEAR
   <chr>            <date>             <dbl>
 1 001115EAB83B19BB 1939-12-01          1939
 2 0018A1975BC0EE4F 1935-11-01          1935
 3 00E040C6ECE8F878 1932-11-01          1932
 4 014F2C07689C173B 1935-07-01          1935
 5 029A22E4A3AAEE15 1921-05-01          1921
 6 03410742A11DDC52 1943-07-01          1943
 7 03ADA78C0FEF79F4 1923-02-01          1923
 8 040A12AB5EAA444C 1943-10-01          1943
 9 043AAAE41C9A37B7 1924-03-01          1924
10 0507DE00BC6E6CD6 1932-07-01          1932
# ... with more rows

Example 1: which members had the highest prescription drug costs for 2008?

For this first example, we are going to identify the beneficiaries with the highest total prescription drug costs in 2008. We need to use the pde table that has claims on prescription drug events and the bene table that has beneficiary records. We create an object that is the aggregated costs for prescription drugs at the beneficiary level in 2008. Note that we had to use date_part() to parse the year from the service date.

# Calculate rx costs for utilizing members in 2008
rx_costs_rmt <- pde_rmt %>% 
  mutate(BENE_YEAR = date_part('year', SRVC_DT)) %>% 
  filter(BENE_YEAR == 2008) %>% 
  group_by(BENE_YEAR, DESYNPUF_ID) %>% 
  summarize(TOTAL_RX_COST = sum(TOT_RX_CST_AMT, na.rm = T), .groups = "drop") %>% 
  ungroup()

rx_costs_rmt
# Source:   SQL [?? x 3]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   BENE_YEAR DESYNPUF_ID      TOTAL_RX_COST
       <dbl> <chr>                    <dbl>
 1      2008 00E040C6ECE8F878            10
 2      2008 03ADA78C0FEF79F4          5020
 3      2008 040A12AB5EAA444C           120
 4      2008 043AAAE41C9A37B7           810
 5      2008 05672CCCED56BCAD             0
 6      2008 061B5B3D9A459675           270
 7      2008 08BB74BA9DFD5C06            20
 8      2008 08C8E0A0C6EAC884          5860
 9      2008 09EEB5C4C4FAEF10          1590
10      2008 0A58C6D6B9BE67CF           930
# ... with more rows

Then we join the aggregated cost data to the beneficiary table. This is necessary because the pde table does not include beneficiaries who didn’t use any prescription drugs. After joining the table we reassign missing cost data to zero for those beneficiaries with no utilization. We can then use collect() to retrieve the results as a local dataframe.

# Join the rx costs data to the beneficiary file and include members with no costs
rx_bene_rmt <- bene_rmt %>% 
  filter(BENE_YEAR == 2008) %>% 
  select(
    BENE_YEAR,
    DESYNPUF_ID
  ) %>% 
  left_join(
    rx_costs_rmt, by = c("BENE_YEAR", "DESYNPUF_ID")
  ) %>% 
  mutate(TOTAL_RX_COST = ifelse(is.na(TOTAL_RX_COST), 0, TOTAL_RX_COST)) %>% 
  arrange(desc(TOTAL_RX_COST))

rx_bene_rmt
# Source:     SQL [?? x 3]
# Database:   DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
# Ordered by: desc(TOTAL_RX_COST)
   BENE_YEAR DESYNPUF_ID      TOTAL_RX_COST
       <dbl> <chr>                    <dbl>
 1      2008 484FAAB99C7F90F2         10410
 2      2008 1187122DCAD04B48          9300
 3      2008 5199E41C3B2B36AD          9110
 4      2008 801D1702FC0C105F          8210
 5      2008 92158DA24D7A8799          7910
 6      2008 A705B8508EB5747D          7380
 7      2008 A9142A5D9895479C          6780
 8      2008 D20C21DD9F8177D0          6590
 9      2008 08C8E0A0C6EAC884          5860
10      2008 EBE9CC1D92AA56C0          5690
# ... with more rows
# You can use collect() to bring the results into a local dataframe
rx_bene_df <- rx_bene_rmt %>% 
  collect()

rx_bene_df
# A tibble: 500 x 3
   BENE_YEAR DESYNPUF_ID      TOTAL_RX_COST
       <dbl> <chr>                    <dbl>
 1      2008 484FAAB99C7F90F2         10410
 2      2008 1187122DCAD04B48          9300
 3      2008 5199E41C3B2B36AD          9110
 4      2008 801D1702FC0C105F          8210
 5      2008 92158DA24D7A8799          7910
 6      2008 A705B8508EB5747D          7380
 7      2008 A9142A5D9895479C          6780
 8      2008 D20C21DD9F8177D0          6590
 9      2008 08C8E0A0C6EAC884          5860
10      2008 EBE9CC1D92AA56C0          5690
# ... with 490 more rows

Example 2: what percent of beneficiaries received an office visit within 30 days of discharge from a hospital?

In the next example, we are identifying which beneficiaries had an office visit within 30 days of being discharged. We will start with the inpatient table that contains records for all inpatient stays, including when a beneficiary was discharged. We create an object that includes the beneficiary ID, the discharge date, and the date 30 days after discharge. Note that for DuckDB we need to coerce “30” to an integer to calculate the new date.

# Identify all member discharge dates and the dates 30 days after discharge from the inpatient table
ip_discharges <- inpatient_rmt %>% 
  transmute(
    DESYNPUF_ID, 
    DSCHRG_DT = NCH_BENE_DSCHRG_DT,
    DSCHRG_DT_30 = NCH_BENE_DSCHRG_DT + as.integer(30)) %>% 
  distinct()

ip_discharges
# Source:   SQL [?? x 3]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      DSCHRG_DT  DSCHRG_DT_30
   <chr>            <date>     <date>      
 1 014F2C07689C173B 2009-09-20 2009-10-20  
 2 029A22E4A3AAEE15 2008-01-25 2008-02-24  
 3 060CDE3A044F64BA 2008-10-18 2008-11-17  
 4 08BB74BA9DFD5C06 2009-03-07 2009-04-06  
 5 08BB74BA9DFD5C06 2009-03-08 2009-04-07  
 6 08C8E0A0C6EAC884 2008-02-22 2008-03-23  
 7 08C8E0A0C6EAC884 2009-05-28 2009-06-27  
 8 08C8E0A0C6EAC884 2009-06-14 2009-07-14  
 9 08C8E0A0C6EAC884 2009-07-07 2009-08-06  
10 08C8E0A0C6EAC884 2009-08-23 2009-09-22  
# ... with more rows

Next, we need to identify office visits from the carrier table. I created a vector of five office visit codes for this example. Since these codes must match the values in the “HCPCS” columns, we reshape the table with pivot_longer() then filter for the office visit codes.

# Create vector of office visit codes
off_vis_cds <- as.character(99211:99215)

# Identify members who had an office visit from the carrier table
off_visit <- carrier_rmt %>% 
  select(DESYNPUF_ID, CLM_ID, CLM_FROM_DT, matches("HCPCS")) %>% 
  pivot_longer(cols = matches("HCPCS"), names_to = "LINE", values_to = "HCPCS") %>% 
  filter(HCPCS %in% off_vis_cds) %>% 
  distinct(DESYNPUF_ID, CLM_FROM_DT) %>% 
  mutate(OFFICE_VISIT = 1)

off_visit
# Source:   SQL [?? x 3]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID      CLM_FROM_DT OFFICE_VISIT
   <chr>            <date>             <dbl>
 1 00E040C6ECE8F878 2009-01-17             1
 2 00E040C6ECE8F878 2009-07-20             1
 3 00E040C6ECE8F878 2009-07-26             1
 4 029A22E4A3AAEE15 2008-01-25             1
 5 029A22E4A3AAEE15 2008-01-30             1
 6 029A22E4A3AAEE15 2008-08-28             1
 7 029A22E4A3AAEE15 2008-11-19             1
 8 029A22E4A3AAEE15 2009-04-01             1
 9 029A22E4A3AAEE15 2009-04-12             1
10 029A22E4A3AAEE15 2009-05-13             1
# ... with more rows

Finally, we join the office visits object to the discharges object. We can use the sql_on option in left_join() to inject some custom SQL to join when the office visit date is within 30 days of the discharge date.

# Join discharges data and office visit data
discharge_offvis <- 
  ip_discharges %>% 
  left_join(
    off_visit,
    sql_on = paste0(
      "(LHS.DESYNPUF_ID = RHS.DESYNPUF_ID) AND",
      "(RHS.CLM_FROM_DT >= LHS.DSCHRG_DT) AND",
      "(RHS.CLM_FROM_DT <= LHS.DSCHRG_DT_30)"
      )
  )

discharge_offvis
# Source:   SQL [?? x 6]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID.x    DSCHRG_DT  DSCHRG_DT_30 DESYNPUF_ID.y    CLM_FROM_DT OFFIC~1
   <chr>            <date>     <date>       <chr>            <date>        <dbl>
 1 029A22E4A3AAEE15 2008-01-25 2008-02-24   029A22E4A3AAEE15 2008-01-25        1
 2 029A22E4A3AAEE15 2008-01-25 2008-02-24   029A22E4A3AAEE15 2008-01-30        1
 3 060CDE3A044F64BA 2008-10-18 2008-11-17   060CDE3A044F64BA 2008-11-01        1
 4 1187122DCAD04B48 2009-04-06 2009-05-06   1187122DCAD04B48 2009-04-07        1
 5 1187122DCAD04B48 2009-04-06 2009-05-06   1187122DCAD04B48 2009-04-27        1
 6 1187122DCAD04B48 2009-04-06 2009-05-06   1187122DCAD04B48 2009-05-04        1
 7 12D6FF0C18764D0D 2009-05-08 2009-06-07   12D6FF0C18764D0D 2009-05-19        1
 8 12D6FF0C18764D0D 2009-05-08 2009-06-07   12D6FF0C18764D0D 2009-05-24        1
 9 144C187653FBBE83 2008-06-08 2008-07-08   144C187653FBBE83 2008-06-22        1
10 144C187653FBBE83 2008-06-08 2008-07-08   144C187653FBBE83 2008-07-04        1
# ... with more rows, and abbreviated variable name 1: OFFICE_VISIT

After the join is complete, we can calculate the share of discharges with a timely office visit.

# Find the percent of members with a office visit within 30 days of discharge
discharge_offvis %>% 
  distinct(
    DESYNPUF_ID.x,
    DSCHRG_DT,
    OFFICE_VISIT
  ) %>% 
  mutate(OFFICE_VISIT = ifelse(is.na(OFFICE_VISIT), 0, OFFICE_VISIT)) %>% 
  summarize(
    OFV_RATE = mean(OFFICE_VISIT, na.rm = T)
  )
# Source:   SQL [1 x 1]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
  OFV_RATE
     <dbl>
1    0.540

Example 3: how well are beneficiaries filling their hypertension drug prescriptions?

For this final example, we need to identify hypertension medications from the pde table and calculate medication adherence rates for each beneficiary. To isolate the hypertension medications, we can borrow from the HEDIS medication list for ACE inhibitors and ARB medications (which are commonly used to treat hypertension). This medication list is for 2018/2019, so it likely includes new drugs that did not exist in 2008/2009 and may not include older drugs that are no longer used (ideally it’s best to use external lists that match the time frame of the claims data).

hedis_wb <- tempfile()

download.file(
  url = "https://www.ncqa.org/hedis-2019-ndc-mld-directory-complete-workbook-final-11-1-2018-3/",
  destfile = hedis_wb,
  mode = "wb"
)

hedis_df <- readxl::read_excel(
  path = hedis_wb,
  sheet = "Medications List to NDC Codes"
)

hyp_ndc_df <- 
  hedis_df %>% 
  filter(`Medication List` == "ACE Inhibitor/ARB Medications") %>% 
  select(
    LIST = `Medication List`,
    PRODUCTID = `NDC Code`
  )

hyp_ndc_df
# A tibble: 3,230 x 2
   LIST                          PRODUCTID  
   <chr>                         <chr>      
 1 ACE Inhibitor/ARB Medications 00093512501
 2 ACE Inhibitor/ARB Medications 00093512505
 3 ACE Inhibitor/ARB Medications 00185005301
 4 ACE Inhibitor/ARB Medications 00185005305
 5 ACE Inhibitor/ARB Medications 00247213730
 6 ACE Inhibitor/ARB Medications 00378044301
 7 ACE Inhibitor/ARB Medications 00781189201
 8 ACE Inhibitor/ARB Medications 13811062810
 9 ACE Inhibitor/ARB Medications 13811062850
10 ACE Inhibitor/ARB Medications 21695032630
# ... with 3,220 more rows

One of the cool features of dbplyr is that we can copy this local dataframe to the database as a temporary table using the copy_to() function.

hyp_ndc_rmt <- copy_to(con, hyp_ndc_df, overwrite = T)

hyp_ndc_rmt
# Source:   table<hyp_ndc_df> [?? x 2]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   LIST                          PRODUCTID  
   <chr>                         <chr>      
 1 ACE Inhibitor/ARB Medications 00093512501
 2 ACE Inhibitor/ARB Medications 00093512505
 3 ACE Inhibitor/ARB Medications 00185005301
 4 ACE Inhibitor/ARB Medications 00185005305
 5 ACE Inhibitor/ARB Medications 00247213730
 6 ACE Inhibitor/ARB Medications 00378044301
 7 ACE Inhibitor/ARB Medications 00781189201
 8 ACE Inhibitor/ARB Medications 13811062810
 9 ACE Inhibitor/ARB Medications 13811062850
10 ACE Inhibitor/ARB Medications 21695032630
# ... with more rows

With the hypertension codes in the database, we can use inner_join() to find the matching drugs from the pde table.

pde_hyp_rmt <- 
  pde_rmt %>% 
  inner_join(
    hyp_ndc_rmt, by = c("PROD_SRVC_ID" = "PRODUCTID")
  )

pde_hyp_rmt
# Source:   SQL [?? x 9]
# Database: DuckDB 0.5.1 [Josh@Windows 10 x64:R 4.1.0/:memory:]
   DESYNPUF_ID   PDE_ID SRVC_DT    PROD_~1 QTY_D~2 DAYS_~3 PTNT_~4 TOT_R~5 LIST 
   <chr>         <chr>  <date>     <chr>     <dbl>   <dbl>   <dbl>   <dbl> <chr>
 1 03ADA78C0FEF~ 83554~ 2008-02-01 510790~      30      30       0      10 ACE ~
 2 040A12AB5EAA~ 83974~ 2009-11-27 638740~      90      30       0      10 ACE ~
 3 043AAAE41C9A~ 83024~ 2008-08-11 002472~     200      90      10      80 ACE ~
 4 0D540BEBC45A~ 83454~ 2009-02-18 604290~      60      90      10      10 ACE ~
 5 0D540BEBC45A~ 83874~ 2009-06-15 661050~      30      30       0      10 ACE ~
 6 0E0A0107787B~ 83714~ 2008-06-30 661050~     100      90      70      60 ACE ~
 7 109D67D11477~ 83064~ 2008-11-21 134110~      30      60       0      70 ACE ~
 8 109D67D11477~ 83034~ 2009-07-03 666850~      30      30       0      10 ACE ~
 9 10D75CDD5B4A~ 83804~ 2008-07-01 002472~      90      90       0      10 ACE ~
10 1183CA4884F8~ 83934~ 2009-01-25 001850~      90      20       0     110 ACE ~
# ... with more rows, and abbreviated variable names 1: PROD_SRVC_ID,
#   2: QTY_DSPNSD_NUM, 3: DAYS_SUPLY_NUM, 4: PTNT_PAY_AMT, 5: TOT_RX_CST_AMT

We can now use collect() to retrieve the data as a local dataframe. As a dataframe, we can now use the AdhereR package to calculate the medication possession ratio (MPR) for members who filled a hypertension medication. MPR is a commonly used metric of medication adherence, measuring if beneficiaries have gaps between their prescription fills.

We can see the MPR for each member who filled one of the medications, and we can calculate the median and mean MPRs across these beneficiaries.

# install.packages("AdhereR")
library(AdhereR)

pde_hyp_df <- pde_hyp_rmt %>% 
  collect()

hyp_adhere <- CMA4(
  data = pde_hyp_df,
  ID.colname = 'DESYNPUF_ID',
  event.date.colname = 'SRVC_DT',
  event.duration.colname = 'DAYS_SUPLY_NUM',
  date.format = "%Y-%m-%d"
)

hyp_adhere_cma <- hyp_adhere$CMA %>% tibble()

hyp_adhere_cma
# A tibble: 134 x 2
   DESYNPUF_ID         CMA
   <chr>             <dbl>
 1 03ADA78C0FEF79F4 0.0411
 2 040A12AB5EAA444C 0.0411
 3 043AAAE41C9A37B7 0.123 
 4 0D540BEBC45ADCA7 0.164 
 5 0E0A0107787B32AA 0.123 
 6 109D67D114778917 0.123 
 7 10D75CDD5B4AD3B0 0.123 
 8 1183CA4884F8A0A8 0.0274
 9 11C45CF0DDD03ACE 0.0822
10 1226AD9944384B64 0.0411
# ... with 124 more rows
hyp_adhere_cma %>% 
  summarize(
    median_mpr = median(CMA),
    mean_mpr = mean(CMA)
  )
# A tibble: 1 x 2
  median_mpr mean_mpr
       <dbl>    <dbl>
1     0.0411   0.0848

Data Limitations

While data included in claimsdb is useful for many types of analyses, it does include a few notable limitations. - As mentioned earlier, the data is a small sample (500 beneficiaries) and is not intended to be representative of the Medicare population. In addition, the data is synthetic and should not be used for drawing inferences on the Medicare population. - Since the data is more than 10 years old, it doesn’t capture newer medications or procedures. It also includes procedure codes that have been retired or replaced. This is a challenge when applying external code lists that are much newer. - The diagnosis fields in the data use the International Classification of Diseases, Ninth Revision (ICD-9), but the United States converted to ICD-10 in 2015. If you are interesting in a mapping between ICD-9 and ICD-10, CMS has resources to consider. - The Medicare population is mostly Americans aged 65 and over, so the data will not have claims on certain specialties such as pediatrics or maternity care.