This article does not contain any real patient data. All patient data has been simulated but formatted to match the structure of CPRD Aurum data.

3 Worked example for data extraction

3.1 Step 1: Defining a cohort

We have provided simulated patient, observation and drugissue files which will be utilised in the worked example. The names of the files follow the naming convention given in section 2.1, and the column names of the data match the real Aurum data. Numeric variables were simulated at random as integers between 1 and 100, date variables as a date between 01/01/1900 and 01/01/2000, gender as an integer 1 or 2, and year of birth as an integer between 1900 and 2000. Patient id and practice id were assigned manually. These files are contained in the inst/aurum_data directory of rcprd. After installing rcprd, this directory can be accessed using the command system.file("aurum_data", package = "rcprd"). This contains data on 12 fake patients, split across two patient files (set1 and set2) and three observation and drugissue files (all set1):

#devtools::install_github("alexpate30/rcprd")
#install.packages("rcprd") NOT YET ON CRAN
library(rcprd)
#devtools::load_all()
list.files(system.file("aurum_data", package = "rcprd"), pattern = ".txt")
#> [1] "aurum_allpatid_set1_extract_drugissue_001.txt"  
#> [2] "aurum_allpatid_set1_extract_drugissue_002.txt"  
#> [3] "aurum_allpatid_set1_extract_drugissue_003.txt"  
#> [4] "aurum_allpatid_set1_extract_observation_001.txt"
#> [5] "aurum_allpatid_set1_extract_observation_002.txt"
#> [6] "aurum_allpatid_set1_extract_observation_003.txt"
#> [7] "aurum_allpatid_set1_extract_patient_001.txt"    
#> [8] "aurum_allpatid_set2_extract_patient_001.txt"
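The set number and file type can be recovered from these file names with a regular expression. The following is a minimal sketch in base R, assuming the naming pattern visible in the listing above; the helper name parse_aurum_filename is hypothetical and not part of rcprd.

```r
# Hypothetical helper (not part of rcprd): split file names of the form
# aurum_allpatid_set<k>_extract_<filetype>_<nnn>.txt into their components.
parse_aurum_filename <- function(files) {
  pattern <- "^aurum_allpatid_set([0-9]+)_extract_([a-z_]+)_([0-9]+)\\.txt$"
  # Stop early if any name does not follow the assumed pattern.
  stopifnot(all(grepl(pattern, files)))
  data.frame(
    file     = files,
    set      = as.integer(sub(pattern, "\\1", files)),
    filetype = sub(pattern, "\\2", files),
    part     = as.integer(sub(pattern, "\\3", files)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_aurum_filename(c(
  "aurum_allpatid_set1_extract_observation_002.txt",
  "aurum_allpatid_set2_extract_patient_001.txt"
))
parsed
```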

The first step in most analyses is creating and defining a cohort of individuals, which will involve working with the patient files. Data from the patient files can be combined using the extract_cohort function. This will look in the directory specified through the filepath argument for any file containing “patient” in the file name. All such files will be read in and concatenated into a single dataset. In some circumstances, researchers may be provided with a list of patids which meet their inclusion/exclusion criteria. In this case, these can be specified through the patids argument (which requires a character vector). Suppose the individuals meeting the inclusion criteria are those with patid = 1, 3, 4 and 6. We would then specify:

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"), patids = as.character(c(1,3,4,6)))
str(pat)
#> 'data.frame':    4 obs. of  12 variables:
#>  $ patid         : chr  "1" "3" "4" "6"
#>  $ pracid        : int  49 98 53 54
#>  $ usualgpstaffid: chr  "6" "43" "72" "11"
#>  $ gender        : int  2 1 2 1
#>  $ yob           : int  1984 1930 1915 1914
#>  $ mob           : int  NA NA NA NA
#>  $ emis_ddate    : Date, format: "1976-11-21" "1972-06-01" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1913-07-02" ...
#>  $ patienttypeid : int  58 81 10 85
#>  $ regenddate    : Date, format: "1996-08-25" "1997-04-24" ...
#>  $ acceptable    : int  1 1 0 1
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1912-04-27" ...

In other circumstances, a user may need to apply the inclusion and exclusion criteria themselves. In this case, one would initially create a patient file for all individuals.

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"))
str(pat)
#> 'data.frame':    12 obs. of  12 variables:
#>  $ patid         : chr  "1" "2" "3" "4" ...
#>  $ pracid        : int  49 79 98 53 62 54 49 79 98 53 ...
#>  $ usualgpstaffid: chr  "6" "11" "43" "72" ...
#>  $ gender        : int  2 1 1 2 2 1 2 1 1 2 ...
#>  $ yob           : int  1984 1932 1930 1915 1916 1914 1984 1932 1930 1915 ...
#>  $ mob           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ emis_ddate    : Date, format: "1976-11-21" "1979-02-14" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1929-02-23" ...
#>  $ patienttypeid : int  58 21 81 10 45 85 58 21 81 10 ...
#>  $ regenddate    : Date, format: "1996-08-25" "1945-03-19" ...
#>  $ acceptable    : int  1 0 1 0 0 1 1 0 1 0 ...
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1932-02-05" ...

The cohort of individuals would then be defined by applying study-specific inclusion/exclusion criteria, for example, all individuals with more than 1 day of valid follow-up, aged 65+, after 1st January 2000. Such criteria can be applied solely using the information available in the patient files. In this example, we define the individuals that met the inclusion criteria to be those with patid = 1, 3, 4 and 6.

pat <- subset(pat, patid %in% c(1,3,4,6))
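In a real study the subset would be derived from the patient file rather than typed in. The following is a minimal sketch in base R of one way such criteria might be applied. The toy data, the use of year of birth alone to approximate age, and the definition of valid follow-up as time between the later of registration start and study start and the registration end date are all assumptions made for illustration.

```r
# Toy patient data using column names from the patient file
# (values invented for illustration).
pat_toy <- data.frame(
  patid        = c("1", "2", "3"),
  yob          = c(1930L, 1950L, 1925L),
  regstartdate = as.Date(c("1990-05-01", "1995-01-01", "1980-03-15")),
  regenddate   = as.Date(c("2005-06-30", "2010-01-01", "1999-12-31")),
  stringsAsFactors = FALSE
)

study_start <- as.Date("2000-01-01")

# Assumption: age is approximated from year of birth only.
age_at_start <- as.integer(format(study_start, "%Y")) - pat_toy$yob

# Assumption: follow-up starts at the later of study start and registration start.
fup_start <- pmax(pat_toy$regstartdate, rep(study_start, nrow(pat_toy)))
fup_days  <- as.numeric(pat_toy$regenddate - fup_start)

# Keep individuals aged 65+ with more than 1 day of follow-up.
cohort_toy <- pat_toy[age_at_start >= 65 & fup_days > 1, ]
cohort_toy$patid
```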

Once the cohort has been defined, the next step is to extract medical/prescription data for these individuals.

3.2 Step 2: Reading in data and creating an SQLite database

Data for individuals in the cohort of interest are extracted from the .txt files and put into an SQLite database. This SQLite database is stored on a fixed storage device and can be queried when defining an analysis-ready dataset.

3.2.1 Add individual files to SQLite database using `add_to_database`

The function add_to_database can be used to add individual files to the SQLite database. Start by defining and connecting to your SQLite database. In this article we create a temporary database, but in practice this would be a permanent storage location. Specifically, file.path(tempdir(), "temp.sqlite") would be replaced by the desired file path and SQLite database name.

aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

Next, we add medical diagnoses data from the observation files to this database using the add_to_database function. The simulated raw data provided with rcprd can be accessed using the system.file function. The vector of patient ids that defines the cohort is specified through the subset_patids argument. Only data with patids matching this argument will be added to the SQLite database. The filetype argument selects an appropriate function for reading in the .txt files, and also defines the name of the table in the SQLite database that the files are added to. Note that for the first file, overwrite = TRUE is specified to create a new table. For the second and third files, append = TRUE is specified to append to an existing table.

add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_001.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, overwrite = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_002.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_003.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)

We can then query this database by selecting all rows from the observation table and printing only the first 3. More details on how to query an SQLite database from within R are available in the documentation for the R package RSQLite (Müller et al. 2024).

RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74
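The create-then-append pattern and basic querying can also be illustrated directly with DBI and RSQLite. The following is a minimal sketch on an in-memory database with invented data; it does not use rcprd and is not a description of how add_to_database is implemented.

```r
library(DBI)

# In-memory SQLite database, used here only for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

file1 <- data.frame(patid = c("1", "3"), medcodeid = c("111", "222"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(patid = c("4", "6"), medcodeid = c("333", "111"),
                    stringsAsFactors = FALSE)

# First file: create (or replace) the table.
dbWriteTable(con, "observation", file1, overwrite = TRUE)
# Later files: append rows to the existing table.
dbWriteTable(con, "observation", file2, append = TRUE)

# Restrict the number of rows within the SQL statement itself.
first_three <- dbGetQuery(con, "SELECT * FROM observation LIMIT 3")

# Parameterised query for a single patient.
pat1 <- dbGetQuery(con, "SELECT * FROM observation WHERE patid = ?",
                   params = list("1"))

dbDisconnect(con)
```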

Note that when reading the raw data into R, the dates are converted into date formats, with an underlying numeric value where day 0 is 01/01/1970. When saved to the SQLite database, it is the underlying numeric values which are saved, hence the dates now appear as numeric values. Next, the prescription data from the drugissue files is added to a table called drugissue. A single SQLite database may contain more than one table, so this data is added to a different table within the same SQLite database.

add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_001.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, overwrite = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_002.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_003.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)

Again this table can be queried, by selecting all rows from the drugissue table, and only printing the first 3.

RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92
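The numeric issuedate and enterdate values above are day counts relative to 01/01/1970, so they can be converted back to dates in base R when needed. A minimal sketch, using the three issuedate values printed above:

```r
# Day counts as stored in the SQLite database (values from the output above).
issuedate_num <- c(-16118, -13322, -8677)

# Convert back to Date objects; day 0 is 1970-01-01.
issuedate <- as.Date(issuedate_num, origin = "1970-01-01")
issuedate

# The conversion is reversible.
as.numeric(issuedate)
```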

Listing the tables in the SQLite database shows there are now two, named observation and drugissue.

RSQLite::dbListTables(aurum_extract)
#> [1] "drugissue"   "observation"

The add_to_database function allows specification of filetype = c("observation", "drugissue", "referral", "problem", "consultation", "hes_primary","death"), each corresponding to a specific function for reading in the corresponding .txt files with correct formatting. The "hes_primary" option corresponds to the primary diagnoses file in linked HES APC data. The "death" option corresponds to the death file in the linked ONS data. To add other files to the SQLite database, a user-defined function for reading in the raw .txt file can be specified through extract_txt_func, and a table name can be specified through tablename. This allows the user to add any .txt file to their SQLite database.
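As an illustration of what a user-defined reader might look like, the following is a minimal sketch in base R. It assumes that the function passed to extract_txt_func takes a file path and returns a data frame, and that the file is tab-delimited; both should be checked against the rcprd documentation. The function name read_custom_txt and the columns eventdate and code are hypothetical.

```r
# Hypothetical reader for a tab-delimited .txt file that is not one of the
# built-in file types. Assumption: it receives a file path and returns a
# data frame.
read_custom_txt <- function(filepath) {
  out <- read.table(filepath, header = TRUE, sep = "\t",
                    colClasses = "character", stringsAsFactors = FALSE)
  # Convert the (hypothetical) date column, stored as dd/mm/yyyy in this toy file.
  out$eventdate <- as.Date(out$eventdate, format = "%d/%m/%Y")
  out
}

# Toy file written to a temporary location to show the reader in action.
tmp <- tempfile(fileext = ".txt")
writeLines(c("patid\teventdate\tcode",
             "1\t01/02/2003\tA10",
             "3\t15/06/2010\tB20"), tmp)
custom <- read_custom_txt(tmp)
str(custom)
```

Reading patid as character avoids loss of precision for long identifiers, which matches the character patid columns shown in the str() output earlier.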

Finally, when manually adding files in this manner, it is good practice to close the connection to the SQLite database once finished.

RSQLite::dbDisconnect(aurum_extract)

3.2.2 Add all relevant files to SQLite database using `cprd_extract`

In practice, there will be a high number of files to add to the SQLite database and adding each one using add_to_database would be cumbersome. We now repeat the extraction but using the cprd_extract function, which is a wrapper for add_to_database, and will add all the files in a specified directory that contain a string matching the specified file type. Start by creating a connection to the database:

aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

We then use cprd_extract to add all the observation files into the SQLite database. If the connection (aurum_extract) is to an existing database, which is the case here, it will be overwritten when running cprd_extract. The directory containing the files should be specified using filepath. It will only read in and add files with the text string specified in filetype, which takes values in c("observation", "drugissue", "referral", "problem", "consultation"). We then query the first three rows of this database, and note they are the same as previously.

### Extract data
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "observation", subset_patids = c(1,3,4,6), use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:40.655626
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:40.680561
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:40.703337
#>   |                                                                              |======================================================================| 100%

### Query first three rows
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74

The process is then repeated for the drugissue files.

### Extract data
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "drugissue", subset_patids = c(1,3,4,6), use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_001.txt 2024-11-11 22:30:40.777615
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_002.txt 2024-11-11 22:30:40.801986
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_003.txt 2024-11-11 22:30:40.823978
#>   |                                                                              |======================================================================| 100%

### List tables
RSQLite::dbListTables(aurum_extract)
#> [1] "drugissue"   "observation"

### Query first three rows
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92

### Disconnect
RSQLite::dbDisconnect(aurum_extract)

The string to match on, the function to read in the raw data, and the name of the table in the SQLite database can be altered using the str_match, extract_txt_func and tablename arguments respectively. Note that this function may run for a considerable period of time when working with the entire CPRD Aurum database, and it is therefore not recommended to run it interactively. While creation of the SQLite database may be time-consuming, subsequent queries will be far more efficient, so this is short-term pain for long-term gain.

3.2.3 Add all relevant files to SQLite database in a computationally efficient manner using the `set` functionality

When the number of patients in your cohort is very large (for example millions, or tens of millions), the add_to_database function may perform very slowly. This is because for each observation in the file being added to the SQLite database, add_to_database checks whether the patid is contained in the vector subset_patids (a vector of length 20,000,000 in our case). We can utilise the structure of the CPRD Aurum data to speed up this process. If the data has the set naming convention (see section 2.1), we know that we only need to search for patids from subset_patids that are in the corresponding patient file. For example, when reading in file aurum_allpatid_set1_extract_observation_00Y.txt (for any Y), we only need to check whether patid is in the vector of patids from subset_patids that are also in aurum_allpatid_set1_extract_patient_001.txt, which is a much smaller vector. This can reduce the computation time for add_to_database and cprd_extract.

To achieve this, the subset_patids object should be a data frame with two required columns. The first column should be patid and the second should be set, reporting the value of set which the patient belongs to. The first step is therefore to create a patient file which has an extra variable set, the number following the text string set in the name of the patient file containing data for that patient. When reading in the patient files to create a cohort, this can be done by specifying set = TRUE. In this example, all individuals in our cohort come from the file with string set1, and therefore this variable is the same for all individuals in this cohort; however, this will not be the case in practice.

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"), patids = as.character(c(1,3,4,6)), set = TRUE)
str(pat)
#> 'data.frame':    4 obs. of  13 variables:
#>  $ patid         : chr  "1" "3" "4" "6"
#>  $ pracid        : int  49 98 53 54
#>  $ usualgpstaffid: chr  "6" "43" "72" "11"
#>  $ gender        : int  2 1 2 1
#>  $ yob           : int  1984 1930 1915 1914
#>  $ mob           : int  NA NA NA NA
#>  $ emis_ddate    : Date, format: "1976-11-21" "1972-06-01" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1913-07-02" ...
#>  $ patienttypeid : int  58 81 10 85
#>  $ regenddate    : Date, format: "1996-08-25" "1997-04-24" ...
#>  $ acceptable    : int  1 1 0 1
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1912-04-27" ...
#>  $ set           : num  1 1 1 1

The patient file read in is the same as previously, with the addition of the set column. This file can be reduced to just the patid and set columns, and used as the input to subset_patids when running the add_to_database and cprd_extract functions. When extracting data from observation files with set1 in the name, it will only search for patient ids with set == 1 in the data.frame provided to subset_patids.

### Create connection to SQLite database
aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

### Add observation files
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "observation", 
             subset_patids = pat, 
             use_set = TRUE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:40.965948
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:40.990244
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:41.012684
#>   |                                                                              |======================================================================| 100%

### Add drugissue files
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "drugissue", 
             subset_patids = pat, 
             use_set = TRUE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_001.txt 2024-11-11 22:30:41.038475
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_002.txt 2024-11-11 22:30:41.065845
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_003.txt 2024-11-11 22:30:41.087364
#>   |                                                                              |======================================================================| 100%

### Query first three rows of each table
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92

Note that there is no difference compared to the previously extracted SQLite databases. The computational gains from applying the subsetting in this manner will not be realised in this example. We do not close the connection, as we will now move onto querying the database to extract variables for creating an analysis-ready dataset.
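To illustrate where the computational gain comes from, the following is a minimal sketch in base R of the idea behind set-based subsetting. The data are invented and the code is an illustration of the principle only, not the implementation used inside rcprd.

```r
# Toy lookup table with the two required columns, patid and set
# (values invented for illustration).
lookup <- data.frame(patid = c("1", "3", "4", "6", "7", "9"),
                     set   = c(1, 1, 1, 1, 2, 2),
                     stringsAsFactors = FALSE)

# Split the cohort patids by set once, up front.
patids_by_set <- split(lookup$patid, lookup$set)

# Toy rows from a file with "set1" in its name.
obs_set1 <- data.frame(patid = c("1", "2", "3", "5"),
                       value = c(10, 20, 30, 40),
                       stringsAsFactors = FALSE)

# Only the patids belonging to set 1 are searched, not the whole cohort.
keep <- obs_set1$patid %in% patids_by_set[["1"]]
obs_set1[keep, ]
```

With millions of patients spread over many sets, each lookup is made against a vector that is only a fraction of the size of the full cohort.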

3.3 Step 3: Querying the SQLite database to extract variables

Once the data has been extracted and stored in an SQLite database, it can be queried to create variables of interest. The normal process for extracting variables from electronic health records is to create code lists, each a group of codes which denote the same condition. The database is then queried for observations with medical codes matching those in the code list, and a variable is defined based on this query. This may be a binary variable indicating whether an individual has any record of a given code, the most recent test result with the given code, or something much more complex. In CPRD Aurum, medical diagnoses and tests are identified from the observation file using medcodeids, and prescription data is identified from the drugissue file using prodcodeids. Creation of code lists is an important step of data extraction, and we refer elsewhere for details on best practice for developing code lists, and the limitations of working with code lists (Williams et al. 2019, 2017; Watson et al. 2017; Gulliford et al. 2009; Matthewman et al. 2024). The functions in this section are split into three groups:

Functions for extracting common variable types
Functions for extracting specific variables
Functions for database queries and custom variable extraction

These functions extract and query the data relative to an index date. The index date may be a fixed date (e.g. 1st January 2010), a date which is different for each individual (e.g. date age 50 reached), or a combination of the two (e.g., maximum of 1st January 2010 and date aged 50 reached). Note, if the inclusion/exclusion criteria for the cohort are dependent on medical diagnoses or prescriptions, the functions in this section will be necessary in order to apply these criteria, and further reduce the cohort (step 2.2).
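The three kinds of index date can be constructed in base R before calling the extraction functions. The following is a minimal sketch with invented data. Because only year of birth is available in this toy cohort, birthdays are assumed to fall on 1 July, which is an assumption made purely for illustration.

```r
# Toy cohort (values invented for illustration).
cohort_toy <- data.frame(patid = c("1", "2", "3"),
                         yob   = c(1955L, 1962L, 1970L),
                         stringsAsFactors = FALSE)

# Fixed index date, the same for everyone.
fixed_date <- as.Date("2010-01-01")

# Individual-specific date: date age 50 reached.
# Assumption: only year of birth is known, so birthdays are set to 1 July.
date_age50 <- as.Date(paste0(cohort_toy$yob + 50L, "-07-01"))

# Combination: the later of the two dates for each individual.
cohort_toy$indexdt <- pmax(date_age50, rep(fixed_date, nrow(cohort_toy)))
cohort_toy
```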

3.3.1 Functions for extracting common variable types

There are functions to extract three common variable types, history of condition/medication prior to index date (extract_ho), time from the index date until first occurrence of a medical code/prescription or censoring (extract_time_until), and most recent test result(s) in a given time frame and valid range relative to the index date (extract_test_data).

The first, extract_ho, extracts a binary variable based on whether an individual has a specified code recorded prior to the index date. This can be applied to search for a history of medical diagnoses or prescriptions. The index date must be a variable in the cohort dataset, and is specified through the indexdt argument.

### Define codelist
codelist <- "187341000000114"

### Add an index date to cohort
pat$fup_start <- as.Date("01/01/2020", format = "%d/%m/%Y")

### Extract a history of type variable using extract_ho
ho <- extract_ho(cohort = pat, 
                 codelist_vector = codelist, 
                 indexdt = "fup_start", 
                 db_open = aurum_extract, 
                 tab = "observation",
                 return_output = TRUE)
str(ho)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ patid: chr  "1" "3" "4" "6"
#>  $ ho   : int  1 0 0 1
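The logic behind a 'history of' variable can be sketched in base R on toy data. This is an illustration of the concept only, not the code used by extract_ho; in particular, treating "prior to" as strictly before the index date is an assumption.

```r
# Toy cohort and toy query result (values invented for illustration).
cohort_toy <- data.frame(patid = c("1", "3", "4"),
                         indexdt = as.Date("2020-01-01"),
                         stringsAsFactors = FALSE)
obs_toy <- data.frame(patid = c("1", "4", "4"),
                      obsdate = as.Date(c("2015-03-01", "2021-06-01", "2022-01-01")),
                      stringsAsFactors = FALSE)

# Attach each individual's index date to their observations.
obs_idx <- merge(obs_toy, cohort_toy, by = "patid")

# Assumption: "prior to" is taken as strictly before the index date.
prior <- obs_idx[obs_idx$obsdate < obs_idx$indexdt, ]

# 1 if any qualifying record exists, 0 otherwise.
cohort_toy$ho <- as.integer(cohort_toy$patid %in% prior$patid)
cohort_toy
```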

The second is extract_time_until, which defines a time-to-event/survival variable. This has two components: the time until the first record of a specified code or censoring, and an indicator for whether the event was observed or censored. To derive a variable of this type the cohort must also contain a censoring date variable, which is specified through censdt.

### Add a censoring date to cohort
pat$fup_end <- as.Date("01/01/2024", format = "%d/%m/%Y")

### Extract a time until variable using extract_time_until
time_until <- extract_time_until(cohort = pat, 
                                 codelist_vector = codelist, 
                                 indexdt = "fup_start", 
                                 censdt = "fup_end",
                                 db_open = aurum_extract, 
                                 tab = "observation",
                                 return_output = TRUE)
str(time_until)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ patid        : chr  "1" "3" "4" "6"
#>  $ var_time     : num  1461 1461 1461 1461
#>  $ var_indicator: num  0 0 0 0
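In the output above no events were found, so every individual is censored at 1461 days, the number of days between 01/01/2020 and 01/01/2024. The following is a minimal sketch in base R of how the two components might be derived on toy data where one individual does have an event. It is an illustration of the concept, not the implementation of extract_time_until, and counting only events strictly after the index date is an assumption.

```r
# Toy cohort and toy events (values invented for illustration).
cohort_toy <- data.frame(patid = c("1", "3"),
                         fup_start = as.Date("2020-01-01"),
                         fup_end   = as.Date("2024-01-01"),
                         stringsAsFactors = FALSE)
events_toy <- data.frame(patid = c("1", "1"),
                         obsdate = as.Date(c("2019-05-01", "2021-01-01")),
                         stringsAsFactors = FALSE)

# Days from index date to the first event after the index date (NA if none).
event_time <- vapply(seq_len(nrow(cohort_toy)), function(i) {
  d <- events_toy$obsdate[events_toy$patid == cohort_toy$patid[i] &
                            events_toy$obsdate > cohort_toy$fup_start[i]]
  if (length(d) == 0) NA_real_ else as.numeric(min(d) - cohort_toy$fup_start[i])
}, numeric(1))

# Days from index date to censoring.
cens_time <- as.numeric(cohort_toy$fup_end - cohort_toy$fup_start)

# Event is observed only if it occurs before or at the censoring date.
observed <- !is.na(event_time) & event_time <= cens_time
cohort_toy$var_time      <- ifelse(observed, event_time, cens_time)
cohort_toy$var_indicator <- as.numeric(observed)
cohort_toy
```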

The third is extract_test_data, which will extract the most recent test result in a given time frame. The number of days before and after the index date to search for results are specified through time_prev and time_post respectively. Test results are identified from the observation file, using code lists. Lower and upper bounds can also be specified for the extracted data through lower_bound and upper_bound.

### Extract test data using extract_test_data
test_data <- extract_test_data(cohort = pat, 
                          codelist_vector = codelist, 
                          indexdt = "fup_start", 
                          db_open = aurum_extract,
                          time_post = 0,
                          time_prev = Inf,
                          return_output = TRUE)
str(test_data)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ patid: chr  "1" "3" "4" "6"
#>  $ value: num  84 NA NA 28
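The selection of a most recent valid test result can be sketched in base R on toy data. This is an illustration of the concept rather than the implementation of extract_test_data; whether the window and bounds are inclusive is an assumption here.

```r
index_date  <- as.Date("2020-01-01")
time_prev   <- 365 * 5   # days before the index date to search
time_post   <- 0         # days after the index date to search
lower_bound <- 10
upper_bound <- 100

# Toy test results (values invented for illustration).
tests_toy <- data.frame(
  patid   = c("1", "1", "1", "6"),
  obsdate = as.Date(c("2012-01-01", "2018-06-01", "2019-09-01", "2019-01-15")),
  value   = c(50, 60, 500, 28),
  stringsAsFactors = FALSE
)

# Keep results inside the time window and the valid range.
in_window <- tests_toy$obsdate >= index_date - time_prev &
  tests_toy$obsdate <= index_date + time_post
in_range  <- tests_toy$value >= lower_bound & tests_toy$value <= upper_bound
valid <- tests_toy[in_window & in_range, ]

# Most recent valid result per individual: sort newest first, keep first row.
valid <- valid[order(valid$patid, valid$obsdate, decreasing = TRUE), ]
most_recent <- valid[!duplicated(valid$patid), c("patid", "value")]
most_recent
```

For patid 1 the 2012 result falls outside the window and the 2019 result falls outside the valid range, so the 2018 result is the one retained.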

More than one observation can be returned by specifying numobs. Metadata of the test result, such as the unit of measurement, date recorded, and the medical code, can be returned by setting numunitid = TRUE. A variation of this function, extract_test_data_var, returns the standard deviation of the test data within the specified time and value range. Once all the variables of interest have been extracted, they can be merged into an analysis-ready dataset (step 4).

### Recursive merge
analysis.ready.pat <- Reduce(function(df1, df2) merge(df1, df2, by = "patid", all.x = TRUE), list(pat[,c("patid", "gender", "yob")], ho, time_until, test_data)) 
analysis.ready.pat
#>   patid gender  yob ho var_time var_indicator value
#> 1     1      2 1984  1     1461             0    84
#> 2     3      1 1930  0     1461             0    NA
#> 3     4      2 1915  0     1461             0    NA
#> 4     6      1 1914  1     1461             0    28

3.3.2 Functions for extracting specific variables

There are also a number of functions that can be used to extract specific variables:

extract_bmi: Derives BMI scores. Requires specification of separate codelists for BMI, height, and weight.
extract_cholhdl_ratio: Derives total cholesterol/high-density lipoprotein ratio. Requires specification of separate codelists for total cholesterol/high-density lipoprotein ratio, total cholesterol, and high-density lipoproteins.
extract_diabetes: Derives a categorical variable for history of type 1 diabetes, history of type 2 diabetes or no history of diabetes. Requires specification of separate codelists for type 1 and type 2 diabetes. Individuals with codes for both are designated as type 1.
extract_smoking: Derives a categorical variable for smoking status. Requires specification of separate codelists for non-smoker, ex-smoker, light smoker, moderate smoker and heavy smoker. If the most recent smoking status is non-smoker, but there are historical codes which indicate smoking, then the individual will be classified as an ex-smoker.

It was deemed that these variables required custom functions because their definitions did not fit into any of the variable types from section 3.3.1. In each case, a number of steps are taken to clean or manipulate the data in order to get the desired output. For example, height measurements recorded in centimetres are converted to metres in order to calculate BMI scores. This is done through the use of the numunitid variable in the observation file. For both BMI and cholesterol/high-density lipoprotein ratio, the variable can either be identified directly, or calculated from the component measures. In each case, the component parts must be recorded in the specified time range relative to the index date. For smoking status, if an individual's most recent medical observation was recorded as a non-smoker, but their medical record shows previous smoking, the most recent record is changed to ex-smoker. The steps for cleaning the data and extracting these variables are provided in the vignette titled Details-on-algorithms-for-extracting-specific-variables. However, it is important to state that the correct way to define a variable may change from study to study. Therefore, when using these functions to extract variables, we encourage taking the time to ensure that the way the variable is extracted matches the definition in one's study.
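Two of the cleaning steps described above can be sketched in base R on toy data. This is a simplified illustration under stated assumptions, not the algorithm used by extract_bmi or extract_smoking: the unit labels "cm" and "m" are hypothetical stand-ins for values that would be identified via numunitid, and the index date and time window are ignored.

```r
## BMI from component measures --------------------------------------------
# Hypothetical unit labels; in practice units are identified via numunitid.
height <- data.frame(value = c(172, 1.80), unit = c("cm", "m"),
                     stringsAsFactors = FALSE)
weight_kg <- c(70, 81)

# Convert heights recorded in centimetres to metres.
height_m <- ifelse(height$unit == "cm", height$value / 100, height$value)

# BMI = weight (kg) / height (m)^2
bmi <- weight_kg / height_m^2

## Smoking status reclassification ----------------------------------------
# Toy smoking records, one row per record (values invented for illustration).
smoking_toy <- data.frame(
  patid   = c("1", "1", "2"),
  obsdate = as.Date(c("2010-01-01", "2018-01-01", "2017-05-01")),
  status  = c("heavy smoker", "non-smoker", "non-smoker"),
  stringsAsFactors = FALSE
)

classify_smoking <- function(records) {
  records <- records[order(records$obsdate), ]
  latest  <- records$status[nrow(records)]
  ever_smoked <- any(records$status != "non-smoker")
  # A most recent "non-smoker" record with earlier smoking codes becomes "ex-smoker".
  if (latest == "non-smoker" && ever_smoked) "ex-smoker" else latest
}

status_by_patid <- vapply(split(smoking_toy, smoking_toy$patid),
                          classify_smoking, character(1))
status_by_patid
```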

3.3.3 Functions for database queries and custom variable extraction

These functions are utilised internally in the functions from sections 3.3.1 and 3.3.2. They have been provided to more easily enable package users to write their own functions for extracting variables that are not covered in the previous two sections.

The db_query function will query the SQLite database for observations where the medcodeid or prodcodeid is in a specified codelist. For example, we can query the observation table for all codes with medcodeid of 187341000000114.

db_query <- db_query(db_open = aurum_extract,
                     tab = "observation",
                     codelist_vector = "187341000000114")

db_query
#>     patid consid pracid  obsid obsdate enterdate staffid parentobsid
#>    <char> <char>  <int> <char>   <num>     <num>  <char>      <char>
#> 1:      1     42      1     81   -5373      4302      85          35
#> 2:      6     40      1     41  -14727     -6929      98          80
#>          medcodeid value numunitid obstypeid numrangelow numrangehigh probobsid
#>             <char> <num>     <int>     <int>       <num>        <num>    <char>
#> 1: 187341000000114    84        79        67          24           22         5
#> 2: 187341000000114    28        20         5          41           97        92

The combine_query_boolean function will assess whether each individual in a specified cohort (pat) has an observation in the queried data (obtained using db_query) within a specified time frame from the index date, returning a 0/1 vector. The cohort must contain a variable called indexdt containing the index date. This function is useful when defining ‘history of’ type variables, where we want to know if there is any record of a given condition prior to the index date.

### Add an index date to pat
pat$indexdt <- as.Date("01/01/2020", format = "%d/%m/%Y")

### Combine query with cohort creating a boolean variable denoting 'history of'
combine.query.boolean <- combine_query_boolean(cohort = pat,
                                               db_query = db_query,
                                               query_type = "med")
  
combine.query.boolean
#> [1] 1 0 0 1
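
For readers writing their own functions, the logic of a 'history of' indicator can be sketched in base R as follows. This is a conceptual analogue rather than the code inside combine_query_boolean. It assumes that obsdate holds the number of days since 1970-01-01 (R's numeric representation of a Date) and that the time frame is "any record on or before the index date". The helper has_history is hypothetical.

```r
## Sketch only: a base R analogue of a 'history of' indicator
cohort <- data.frame(patid = c("1", "3", "4", "6"),
                     indexdt = as.Date("2020-01-01"))

## Queried observations; obsdate assumed to be days since 1970-01-01
query <- data.frame(patid = c("1", "6"),
                    obsdate = c(-5373, -14727))

## Hypothetical helper: 1 if a patient has any record on or before index date
has_history <- function(cohort, query) {
  merged <- merge(cohort, query, by = "patid")
  in_window <- merged$obsdate <= as.numeric(merged$indexdt)
  as.integer(cohort$patid %in% merged$patid[in_window])
}

ho <- has_history(cohort, query)
ho
```

With these toy inputs, which mirror the queried data shown above, the sketch returns 1 0 0 1, one element per row of the cohort and in the same order.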

The combine_query function will merge a cohort with the queried data and return a specified number of observations (numobs) within a specified time frame from the index date. This is useful when extracting test data and requiring access to the values of the tests, or when specifying variables that require more than one observation within a certain time frame (e.g. two prescriptions within a month prior to index date). For queries from the observation table, the query type can be specified as "med" or "test". Inputting query_type = "med" will just return the date of the observations and the medcodeid.

### Combine query with cohort retaining most recent three records
combine.query <- combine_query(cohort = pat,
                               db_query = db_query,
                               query_type = "med",
                               numobs = 3)
  
combine.query
#>     patid       medcodeid obsdate
#>    <char>          <char>   <num>
#> 1:      1 187341000000114   -5373
#> 2:      6 187341000000114  -14727
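
The idea of retaining the most recent numobs records per individual can be sketched in base R as below. This is an illustration of the concept only, not the implementation of combine_query. The helper keep_most_recent and its argument index_day are hypothetical, and the toy data contains extra records for patient 1 so that the truncation is visible.

```r
## Sketch only: keep the most recent `numobs` records per patient before index
keep_most_recent <- function(query, index_day, numobs = 1) {
  ## Restrict to records on or before the index date
  query <- query[query$obsdate <= index_day, ]
  ## Sort by patient, most recent record first
  query <- query[order(query$patid, -query$obsdate), ]
  ## Rank records within patient, 1 = most recent
  rank_within <- ave(seq_along(query$patid), query$patid, FUN = seq_along)
  out <- query[rank_within <= numobs, ]
  rownames(out) <- NULL
  out
}

## Toy queried data; obsdate assumed to be days since 1970-01-01
query <- data.frame(patid = c("1", "1", "1", "6"),
                    medcodeid = "187341000000114",
                    obsdate = c(-5373, -6000, -7000, -14727))

recent <- keep_most_recent(query,
                           index_day = as.numeric(as.Date("2020-01-01")),
                           numobs = 2)
recent
```

Here patient 1 has three records, of which the two most recent are kept, while patient 6 keeps their single record.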

For query_type = "test", the value and other relevant information will also be returned, and observations with NA values will be removed (although this can be altered through the argument value_na_rm). We then close the connection to the database.

### Combine query with cohort retaining most recent three records, including test values
combine.query <- combine_query(cohort = pat,
                               db_query = db_query,
                               query_type = "test",
                               numobs = 3)
  
combine.query
#>     patid       medcodeid obsdate value numunitid numrangelow numrangehigh
#>    <char>          <char>   <num> <num>     <int>       <num>        <num>
#> 1:      1 187341000000114   -5373    84        79          24           22
#> 2:      6 187341000000114  -14727    28        20          41           97

### Disconnect
RSQLite::dbDisconnect(aurum_extract)

If the query was from the drugissue table, then query_type = "drug" should be specified, and the date of the observations and the prodcodeid will be returned. The functions in this section do little processing of the extracted data, and further manipulation is required in order to define most variables.
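
As an example of the further manipulation that may be required, the following base R sketch flags individuals with at least two prescriptions in the 30 days prior to the index date. It operates on toy data rather than on output from combine_query. The date column name issuedate, the product code and the 30 day window are all assumptions made for illustration.

```r
## Sketch only: flag patients with >= 2 prescriptions in the 30 days before index
cohort <- data.frame(patid = c("1", "3", "6"),
                     indexdt = as.Date("2020-01-01"))

## Toy drug records; column name issuedate and the prodcodeid are assumptions
drug <- data.frame(patid = c("1", "1", "3", "6", "6"),
                   prodcodeid = "123",
                   issuedate = as.Date(c("2019-12-10", "2019-12-28",
                                         "2019-12-20",
                                         "2019-10-01", "2019-12-30")))

## Days between each prescription and the index date
merged <- merge(cohort, drug, by = "patid")
days_before <- as.numeric(merged$indexdt - merged$issuedate)
in_window <- days_before >= 0 & days_before <= 30

## Count prescriptions in the window for each patient, in cohort order
n_presc <- table(factor(merged$patid[in_window], levels = cohort$patid))
cohort$two_presc <- as.integer(as.vector(n_presc) >= 2)
cohort
```

Only patient 1 has two prescriptions inside the window. Patient 6 has two prescriptions overall, but one falls more than 30 days before the index date.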

3.3.4 Saving extracted variables directly to a disk drive, and utilising rcprd's suggested directory system

So far all extracted variables (using functions from sections 3.3.1 and 3.3.2) have been read into the R workspace by specifying return_output = TRUE. When working with large cohorts it may be preferable to save the output directly onto a disk drive, by specifying out_save_disk = TRUE. The file path to save the output can be specified manually through the out_filepath argument. However, if this argument is left as NULL, rcprd will attempt to save the extracted variable into a directory “data/extraction/” relative to the working directory. The name of the file itself will be dependent on the variable name specified through the argument varname. This can be a very convenient way to save the output directly to disk without having to repeatedly specify file paths and file names.

There is similar functionality when specifying the codelists. Codelists can be specified in two ways. The first is to read the codelist into R as a character vector and then specify it through the argument codelist_vector, which has been done in all the previous examples. Alternatively, codelists stored on the disk drive can be referred to through the codelist argument in many rcprd functions, but this requires a specific underlying directory structure. The codelist on the disk drive must be stored in a directory called “codelists/analysis/” relative to the working directory. The codelist must be a .csv file, and contain a column medcodeid, prodcodeid or ICD10 depending on the table being queried. The input to the argument codelist should just be a character string of the name of the file (excluding the suffix ‘.csv’). The codelist_vector argument will take precedence over the codelist argument if both are specified.
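
When preparing or checking such a file by hand, it can help to read the codes back in as character strings, since long identifiers may otherwise be parsed as numbers and lose precision. The sketch below writes a codelist into a temporary directory and reads it back. It makes no claim about how rcprd reads the file internally, and the second code in the example is a made-up 17 digit identifier.

```r
## Sketch only: write a codelist to a temporary directory and read it back
codelist_dir <- file.path(tempdir(), "codelists", "analysis")
dir.create(codelist_dir, recursive = TRUE, showWarnings = FALSE)

## Second medcodeid is hypothetical, chosen to be too long for exact numeric storage
codelist <- data.frame(medcodeid = c("187341000000114", "12345678901234567"))
write.csv(codelist, file.path(codelist_dir, "mylist.csv"), row.names = FALSE)

## colClasses = "character" stops long identifiers being parsed as numbers
codelist_in <- read.csv(file.path(codelist_dir, "mylist.csv"),
                        colClasses = "character")
codelist_vector <- codelist_in$medcodeid
codelist_vector
```

The resulting character vector is in the form expected by the codelist_vector argument used in the earlier examples.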

Finally, there is similar functionality for accessing the SQLite database internally, rather than having to 1) open a connection, 2) use this as an input in the functions, and then 3) remember to close the connection. Instead, if the SQLite database is stored in a directory “data/sql/” relative to the working directory, the SQLite database can be referred to by name (a character string) with the argument db. A connection to the SQLite database will be opened internally within the function call, the SQLite database will be queried, and then the connection closed. Alternatively, a SQLite database stored anywhere on the disk drive can be accessed by specifying the full filepath (character string) with the argument db_filepath.
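
The open, query, close pattern described above can be sketched with DBI and RSQLite as follows. This is a generic illustration under the assumption that both packages are installed, not the code used inside rcprd. The helper query_sqlite is hypothetical and the database is a toy one created in a temporary file.

```r
## Sketch only: open a connection, run a query, and always close the connection
query_sqlite <- function(db_filepath, sql) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_filepath)
  ## on.exit ensures the connection is closed even if the query fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, sql)
}

## Create a toy database in a temporary file
db_file <- tempfile(fileext = ".sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
DBI::dbWriteTable(con, "observation",
                  data.frame(patid = c("1", "6"),
                             medcodeid = "187341000000114"))
DBI::dbDisconnect(con)

## Query by file path, with no connection object left open afterwards
res <- query_sqlite(
  db_file,
  "SELECT patid FROM observation
   WHERE medcodeid = '187341000000114' ORDER BY patid"
)
res
```

Wrapping the disconnect in on.exit means the caller never handles the connection object, which is the convenience that the db and db_filepath arguments provide.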

This workflow is advantageous as it avoids hard-coded file paths, which is beneficial when moving code onto another computer system. Furthermore, once codelists and the SQLite database have been created and stored in the appropriate folders, they can simply be referred to by name, resulting in an easier workflow. The function create_directory_system() will create the directory system required to use rcprd in this way. To avoid repetition of the previous section, this is showcased just once using the extract_ho function. For the sake of this example, we start by setting the working directory to a temporary directory, tempdir(). To maintain the new working directory across multiple R markdown code chunks, we use knitr::opts_knit$set. To follow this section, the user should simply set their working directory as usual using setwd().

## Set working directory
knitr::opts_knit$set(root.dir = tempdir())

Next, the create_directory_system() function can be used to generate the required directory structure.

suppressMessages(
  create_directory_system()
)

file.exists(file.path(tempdir(), "data"))
#> [1] TRUE
file.exists(file.path(tempdir(), "codelists"))
#> [1] TRUE
file.exists(file.path(tempdir(), "code"))
#> [1] TRUE
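
For reference, a roughly equivalent manual setup in base R is sketched below. It covers only the directories named in this section; create_directory_system() may create others. The root folder rcprd_sketch is a hypothetical name, used so that the sketch does not interfere with the example above.

```r
## Sketch only: create the directories named in this section under a temp root
root <- file.path(tempdir(), "rcprd_sketch")
dirs <- file.path(root, c("data/extraction",
                          "data/sql",
                          "codelists/analysis",
                          "code"))

## recursive = TRUE also creates the parent folders (data, codelists)
created <- vapply(dirs, dir.create, logical(1),
                  recursive = TRUE, showWarnings = FALSE)

all(dir.exists(dirs))
```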

An SQLite database called “mydb.sqlite” is then created in the “data/sql” directory, using the same data from the previous examples:

## Open connection
aurum_extract <- connect_database("data/sql/mydb.sqlite")

## Add data to SQLite database using cprd_extract
cprd_extract(db = aurum_extract,
             filepath = system.file("aurum_data", package = "rcprd"),
             filetype = "observation", use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:42.75966
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:42.784241
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:42.809921
#>   |                                                                              |======================================================================| 100%

## Disconnect
RSQLite::dbDisconnect(aurum_extract)

Finally, a codelist called mylist.csv is created and saved into the codelists/analysis/ directory.

### Define codelist
codelist <- data.frame(medcodeid = "187341000000114")

### Save codelist
write.csv(codelist, "codelists/analysis/mylist.csv")

The mydb.sqlite database can now be queried to create a ‘history of’ type variable using the codelist mylist.csv, with the output saved directly onto the disk drive.

extract_ho(cohort = pat,
           codelist = "mylist",
           indexdt = "fup_start",
           db = "mydb",
           tab = "observation",
           return_output = FALSE,
           out_save_disk = TRUE)

Note that in order to run extract_ho here, a connection to the SQLite database did not need to be created, the codelist did not need to be in the R workspace, and there is no output from this function. Instead the extracted variable has been saved onto the disk drive in an .rds file, and can be read in using:

readRDS("data/extraction/var_ho.rds")
#>   patid ho
#> 1     1  1
#> 3     3  0
#> 4     4  0
#> 6     6  1

This setup can be used in conjunction with any of the functions from step 3 (e.g. extract_test_var, extract_time_until or db_query).

3.3.5 Extracting longitudinal data/time varying covariates

All of the functions in sections 3.3.1 and 3.3.2 have the option to extract data at a given time point post index date (specified through the t argument). This allows users to extract data at fixed intervals, which can be utilised for longitudinal analyses where time-varying covariates are required. If saving the extracted variables directly to the disk drive (out_save_disk = TRUE), the time point at which the data was extracted, t, will be added to the file name by default.
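
The idea of extracting a variable at fixed intervals can be sketched in base R as below. This is a conceptual illustration on toy data, not a call to the rcprd functions. The helper history_at is hypothetical, and it assumes that t is measured in days after the index date.

```r
## Sketch only: evaluate a 'history of' indicator at several times after index
cohort <- data.frame(patid = c("1", "3"),
                     indexdt = as.Date("2020-01-01"))
obs <- data.frame(patid = c("1", "3"),
                  obsdate = as.Date(c("2019-06-01", "2020-09-15")))

## Hypothetical helper: t is assumed to be the number of days after index date
history_at <- function(cohort, obs, t = 0) {
  merged <- merge(cohort, obs, by = "patid")
  seen <- merged$patid[merged$obsdate <= merged$indexdt + t]
  data.frame(patid = cohort$patid,
             t = t,
             ho = as.integer(cohort$patid %in% seen))
}

## Stack the results for t = 0, 365 and 730 days into a long format dataset
long <- do.call(rbind,
                lapply(c(0, 365, 730), history_at,
                       cohort = cohort, obs = obs))
long
```

Patient 3 has no record at the index date but acquires one during the first year of follow-up, so their indicator changes from 0 to 1 between t = 0 and t = 365. This is the kind of time-varying covariate that repeated extraction is intended to capture.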

rcprd: An R package to simplify the extraction and processing of CPRD data, and create analysis-ready datasets

1 Introduction

2 Data Structure and Extraction Process

2.1 Structure of CPRD Aurum data

2.2 Recommended process for extraction

3 Worked example for data extraction

3.1 Step 1: Defining a cohort

3.2 Step 2: Reading in data and creating an SQLite database

3.2.1 Add individual files to SQLite database using `add_to_database`

3.2.2 Add all relevant files to SQLite database using `cprd_extract`

3.2.3 Add all relevant files to SQLite database in a computationally efficient manner using the `set` functionality.

3.3 Step 3: Querying the SQLite database to extract variables

3.3.1 Functions for extracting common variable types

3.3.2 Functions for extracting specific variables

3.3.3 Functions for database queries and custom variable extraction

3.3.4 Saving extracted variables directly to a disk drive, and utilising rcprd's suggested directory system

3.3.5 Extracting longitudinal data/time varying covariates

4 Discussion

5 References
