rcprd contains a number of functions which extract specific variables, namely:
extract_BMI
extract_cholhdl_ratio
extract_diabetes
extract_smoking
The algorithms underpinning the extraction of these variables are given in section 2. A summary of the unit measurements recorded for these variables is given in section 3.
extract_BMI
Extraction of BMI requires the user to specify three codelists: one
for BMI scores (codelist_bmi), one for height measurements
(codelist_height) and one for weight measurements (codelist_weight).
All the BMI, height and weight measurements for each patient in the
cohort of interest are then extracted. The algorithm is as follows:
extract_cholhdl_ratio
Extraction of cholesterol/HDL ratio requires the user to specify
three codelists: one for cholesterol/HDL ratio measurements
(codelist_ratio), one for total cholesterol measurements
(codelist_chol) and one for HDL measurements (codelist_hdl). All the
cholesterol/HDL ratio, total cholesterol and HDL measurements for each
patient in the cohort of interest are then extracted. The algorithm is
as follows:
extract_diabetes
Extraction of diabetes status requires the user to specify two
codelists: one for type 1 diabetes (codelist_type1) and another for
type 2 diabetes (codelist_type2). This variable is not treated as a
"history of" type variable and extracted using extract_ho because
individuals will often have a generic code such as diabetes mellitus,
which would be used to identify type 2 diabetes, but will also have a
specific code such as type 1 diabetes mellitus. This algorithm treats
the two as mutually exclusive, and assigns individuals with a code for
both type 1 and type 2 diabetes as having type 1 diabetes. The
algorithm is as follows:
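The precedence rule described above can be illustrated with a minimal sketch in base R. This is not the rcprd implementation: the helper classify_diabetes, its logical inputs and the status labels are hypothetical, and the sketch assumes each patient has already been flagged for the presence of a code from each codelist.

```r
# Hypothetical helper, not part of rcprd: combine two logical flags into a
# single mutually exclusive diabetes status, with type 1 taking precedence.
classify_diabetes <- function(has_type1, has_type2) {
  status <- rep("Absent", length(has_type1))
  status[has_type2] <- "Type2"
  status[has_type1] <- "Type1"  # overrides type 2 when both codes are present
  factor(status, levels = c("Absent", "Type1", "Type2"))
}

# Toy flags for four hypothetical patients
has_type1 <- c(FALSE, TRUE, FALSE, TRUE)
has_type2 <- c(FALSE, FALSE, TRUE, TRUE)
diabetes_status <- classify_diabetes(has_type1, has_type2)
```

The fourth patient has codes from both codelists and is assigned type 1, matching the rule in the text.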
extract_smoking
Extraction of smoking status requires the user to specify five
codelists: one for non-smoker (codelist_non), one for ex-smoker
(codelist_ex), one for light smoker (codelist_light), one for moderate
smoker (codelist.moderate) and one for heavy smoker (codelist_heavy).
For records identified using the light, moderate or heavy smoker code
lists, the value variable, which represents the number of cigarettes
smoked per day, is used to modify the outputted smoking status
variable. This is to maximise the number of observations that are
defined in the same way (< 10 a day is light, 10 - 19 a day is
moderate, > 19 a day is heavy). The value variable for observations
recorded as ex-smoker often denotes the number of cigarettes per day
the individual used to smoke, so these data are not used to alter the
smoking status. If an individual's most recent record is non-smoker,
but the individual has previous records which indicate a history of
smoking, the smoking status is altered from non-smoker to ex-smoker.
The algorithm is as follows:
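The two rules described in this paragraph (re-grading current smokers by cigarettes per day, and relabelling non-smokers with a smoking history as ex-smokers) can be sketched in base R as below. This is an illustration only, not the rcprd code: the helper names, the category labels and the assumption that a patient's records are supplied ordered from oldest to most recent are all hypothetical.

```r
# Hypothetical helper: re-grade current-smoker records using cigarettes per day.
# Records with a missing value keep the category implied by their code list;
# non-smoker and ex-smoker records are never altered.
regrade_smoker <- function(category, value) {
  current <- category %in% c("light", "moderate", "heavy")
  use_value <- current & !is.na(value)
  category[use_value & value < 10] <- "light"
  category[use_value & value >= 10 & value <= 19] <- "moderate"
  category[use_value & value > 19] <- "heavy"
  category
}

# Hypothetical helper: smoking status for one patient, whose records are
# assumed to be ordered from oldest to most recent.
smoking_status <- function(category, value) {
  graded <- regrade_smoker(category, value)
  latest <- graded[length(graded)]
  # Any earlier record other than non-smoker indicates a history of smoking
  ever_smoked <- any(graded[-length(graded)] != "non")
  if (latest == "non" && ever_smoked) "ex" else latest
}

status_a <- smoking_status(c("heavy", "light", "non"), c(25, NA, NA))
status_b <- smoking_status(c("non", "light"), c(NA, 15))
status_c <- smoking_status("ex", 20)
```

In the second example the record comes from the light smoker code list, but a value of 15 cigarettes per day places it in the moderate band. In the third, the value attached to an ex-smoker record is ignored.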
In this section we report the different units of measurement that the test data for the above variables may be recorded in. The unit of measurement is denoted by the numunitid variable in the observation file, which has a corresponding lookup file in the CPRD data. We queried the observation data for a large cohort of individuals aged 18 - 85 between 2005 and 2020 using the code lists provided within the inst/codelists directory of rcprd.
list.files(system.file("codelists", package = "rcprd"))
#> [1] "edh_bmi_medcodeid.csv" "edh_chol_medcodeid.csv"
#> [3] "edh_cholhdl_ratio_medcodeid.csv" "edh_hdl_medcodeid.csv"
#> [5] "edh_sbp_medcodeid.csv" "height_medcodeid.csv"
#> [7] "weight_medcodeid.csv"
The test data was searched separately using each code list, and the resulting unit measurements that were recorded in more than 0.01% of the query are presented. The results of this were fed into the programs for deriving each of those variables (see section 2). When defining a variable, the aim is to convert all measurements to the same scale/unit of measurement. For example, height measurements in metres should be converted to the same scale as those recorded in centimetres. However, for some records, the unit of measurement might be unusual or missing, meaning it is unclear how to convert the value onto the desired scale. We do not exclude observations with such unit measurements, as they may be correct measurements with a mis-recorded unit. Instead, when extracting variables and converting the relevant unit measurements, we define a minimum and maximum value, and exclude observations that do not fall within this range. As will be seen below, the proportion of observations with an unclear unit of measurement is small (with the exception of cholesterol/HDL ratio, which is a special case).
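As a minimal sketch of the range-based exclusion described above, the base R snippet below keeps observations with an unclear or missing unit and drops only those whose value falls outside a plausible range. The toy data frame and the bounds of 10 and 70 are illustrative assumptions, not values used by rcprd.

```r
# Toy BMI observations; column names mirror the observation file, but the
# values and the bounds below are illustrative assumptions only.
obs <- data.frame(
  patid = c(1, 2, 3, 4),
  value = c(24.5, 310, 27.1, 2.2),
  numunitid = c(157, NA, NA, 288)
)

lower_bound <- 10
upper_bound <- 70

# Keep observations inside the plausible range, whatever their recorded unit
in_range <- !is.na(obs$value) & obs$value >= lower_bound & obs$value <= upper_bound
kept <- obs[in_range, ]
```

Patient 3 has a missing unit but a plausible value and is retained, whereas patients 2 and 4 are excluded on the basis of their values alone.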
numunitid | n | Description | prop |
---|---|---|---|
65 | 49 | 1/1 | 0.15 |
218 | 47 | mmol/L | 0.14 |
219 | 429 | mmol/mmol | 1.32 |
260 | 284 | ratio | 0.88 |
292 | 132 | Unk UoM | 0.41 |
405 | 18 | UNKNOWN UNITS | 0.06 |
421 | 24 | . | 0.07 |
986 | 154 | Not given | 0.47 |
4424 | 728 | CHOL/HDL | 2.24 |
NA | 30580 | NA | 94.24 |
The most common is ‘NA’ (78.95%). The second most common is ‘ratio’ (12.49%) then 1/1 (2.33%). The confusion about unit of measurement is likely due to the fact that this ratio has no unit of measurement, because total cholesterol and high-density lipoprotein have the same unit of measurement. All measurements are therefore assumed to be in the same unit of measurement (ratio).
numunitid | n | Description | prop |
---|---|---|---|
218 | 67168 | mmol/L | 92.74 |
NA | 5256 | NA | 7.26 |
The majority of unit measurements are mmol/L (96.35%) or NA (3.54%). All observations are therefore assumed to be recorded in mmol/L.
numunitid | n | Description | prop |
---|---|---|---|
182 | 80 | MG/DL | 0.16 |
218 | 46573 | mmol/L | 92.04 |
288 | 12 | units | 0.02 |
NA | 3936 | NA | 7.78 |
The majority of unit measurements are mmol/L (94.21%) or NA (5.67%). All observations are therefore assumed to be recorded in mmol/L.
numunitid | n | Description | prop |
---|---|---|---|
108 | 21 | Body Mass Index | 0.02 |
157 | 64943 | kg/m2 | 52.02 |
288 | 21 | units | 0.02 |
359 | 35 | Kg/m? | 0.03 |
568 | 99 | BMI | 0.08 |
657 | 105 | Kg/m² | 0.08 |
NA | 59555 | NA | 47.70 |
The majority of unit measurements are kg/m2 (39.58%), Kg/m² (1.26%) or NA (58.08%). All observations are therefore assumed to be recorded in kg/m2.
numunitid | n | Description | prop |
---|---|---|---|
156 | 148873 | kg | 99.42 |
827 | 80 | Kgs | 0.05 |
NA | 773 | NA | 0.52 |
The majority of unit measurements are kg (98.67%) or NA (1.24%). Most observations are therefore assumed to be recorded in kg; however, we also know that numunitid \(\in \{1691, 2318, 2997, 6265\}\) refers to stone. Observations with these units of measurement are therefore converted to kg.
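The stone-to-kilogram conversion can be sketched in base R as follows. The numunitid values are those listed above; the toy data and the column name weight_kg are hypothetical, and the conversion factor follows from 1 stone = 14 lb and 1 lb = 0.45359237 kg.

```r
# numunitid values that the text identifies as stone
stone_ids <- c(1691, 2318, 2997, 6265)
kg_per_stone <- 14 * 0.45359237  # 6.35029318 kg

# Toy weight observations (illustrative values only)
weights <- data.frame(
  value = c(80, 12, 75),
  numunitid = c(156, 1691, NA)
)

# %in% returns FALSE for NA, so records with a missing unit are treated as kg
is_stone <- weights$numunitid %in% stone_ids
weights$weight_kg <- ifelse(is_stone, weights$value * kg_per_stone, weights$value)
```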
numunitid | n | Description | prop |
---|---|---|---|
210 | 174602 | mm Hg | 53.44 |
212 | 11013 | mm/Hg | 3.37 |
215 | 12296 | mm[Hg] | 3.76 |
216 | 128757 | mmHg | 39.41 |
NA | 66 | NA | 0.02 |
The majority of unit measurements are cm (96.82%), m (1.7%), metres (0.02%) or NA (0.37%). All observations with a numunitid not corresponding to metres will be assumed to be in centimetres, and converted to metres to enable estimation of BMI.
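A minimal base R sketch of this height conversion, followed by the standard BMI calculation (weight in kg divided by height in metres squared), is given below. The unit column holding the lookup description is a hypothetical stand-in for the numunitid lookup, and the data are toy values; this is not the rcprd implementation.

```r
# Toy height observations; `unit` is a hypothetical column holding the
# description looked up from numunitid.
heights <- data.frame(
  value = c(172, 1.80, 165),
  unit = c("cm", "m", NA)
)

# Anything not explicitly recorded in metres is assumed to be in centimetres
in_metres <- heights$unit %in% c("m", "metres")
heights$height_m <- ifelse(in_metres, heights$value, heights$value / 100)

# BMI = weight (kg) / height (m)^2, using toy weights for the same patients
weight_kg <- c(70, 81, 60)
bmi <- weight_kg / heights$height_m^2
```

In practice the converted heights would still be subject to the minimum and maximum value checks described earlier, since a value recorded in metres with a missing unit would otherwise be divided by 100.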
numunitid | n | Description | prop |
---|---|---|---|
210 | 174602 | mm Hg | 53.44 |
212 | 11013 | mm/Hg | 3.37 |
215 | 12296 | mm[Hg] | 3.76 |
216 | 128757 | mmHg | 39.41 |
NA | 66 | NA | 0.02 |
While there is no dedicated algorithm for SBP, we still present the results from the database query for this variable. All measurements are in mmHg.