---
title: "Co-occurrence Matrices and PMI-SVD Embeddings"
author: "Thomas Charlon"
date: "2025-01-17"
bibliography: cooc_pmi_svd.bib
output:
  html_vignette:
    number_sections: true
  pdf_document:
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Co-occurrence Matrices and PMI-SVD Embeddings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Summary

The nlpembeds package provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI) and singular value decomposition (SVD) embeddings. In the biomedical and clinical setting, one challenge is the huge size of databases of electronic health records (EHRs), *e.g.* when analyzing data of millions of patients over tens of years. To address this, the package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data, representing millions of patients over tens of years. PMI-SVD embeddings are extensively used, *e.g.* in [Hong C. (2021)](https://pubmed.ncbi.nlm.nih.gov/34707226/).

# Background

Co-occurrence, PMI and SVD are methods to produce vector representations of concepts, *i.e.* embeddings, similarly to the popular word2vec method [@mikolov2013efficient]. However, while word2vec is highly parallelized, it may lack stability, *i.e.* repeated runs of the same computation can produce significantly different results. To improve stability, one can increase the number of iterations, but it may be preferable to compute co-occurrence, PMI (or more precisely, single-shift positive PMI), and SVD [@levy2014neural].

The nlpembeds package aims to provide efficient methods to compute co-occurrence matrices and PMI-SVD. In the biomedical and clinical setting, one challenge is the huge size of databases, *e.g.* when analyzing data of millions of patients over tens of years. To address this, the nlpembeds package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data.

\newpage

# Co-occurrence matrix

Consider the following data object:

```{r cooc_data}
library(nlpembeds)

df_ehr = data.frame(Patient = c(1, 1, 2, 1, 2, 1, 1, 3, 4),
                    Month = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    Parent_Code = c('C1', 'C2', 'C2', 'C1', 'C1', 'C1', 'C2',
                                    'C3', 'C4'),
                    Count = 1:9)

df_ehr
```

This represents our patient-level data: `Patient` is the patient identifier, `Parent_Code` is the diagnosis that was performed (or medication prescribed, etc.), `Month` is the year/month of the diagnosis, and `Count` is the number of times the diagnosis was performed during that month.

We can compute a monthly co-occurrence matrix, which gives us:

```{r cooc_call}
spm_cooc = build_df_cooc(df_ehr)
spm_cooc
```

The data represents the following:

* At month 1, patient 1 had the code C1 once and the code C2 twice, and patient 2 had the code C2 three times.
* At month 2, patients 1 and 2 had the code C1 4 and 5 times, respectively.
* At month 3, patient 1 had the codes C1 and C2 6 and 7 times, respectively.
* At month 4, patients 3 and 4 had the codes C3 and C4 8 and 9 times, respectively.

The monthly co-occurrence between two codes is defined as the minimum of their two counts within a month. The co-occurrences are first computed for each patient and each month, and then summed over patients and months.
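As a quick sanity check of this definition (a plain base R illustration; `build_df_cooc` is what should be used in practice), the C1-C2 entry of the matrix above can be recomputed by hand, before decomposing it month by month below:

```{r cooc_check}
# for each patient and month where both codes appear, take the minimum of the
# two counts, then sum over patients and months
df_c1 = subset(df_ehr, Parent_Code == 'C1')
df_c2 = subset(df_ehr, Parent_Code == 'C2')
df_both = merge(df_c1, df_c2, by = c('Patient', 'Month'))
sum(pmin(df_both$Count.x, df_both$Count.y))
```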
\newpage

Let's now decompose the computation of the co-occurrence between codes C1 and C2, to make it easily understandable. The codes co-occurred in months 1 and 3 in patient 1, so we can decompose the co-occurrence matrices as follows:

* The codes co-occurred min(1, 2) = 1 time in month 1

```{r cooc_month1}
cooc_1 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 1),
                       min_code_freq = 0)
cooc_1
```

* The codes co-occurred min(6, 7) = 6 times in month 3

```{r cooc_month2}
cooc_2 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 3))
cooc_2
```

When summed up, we get a co-occurrence of 7 between C1 and C2.

```{r cooc_sum}
cooc_1 + cooc_2
```

Note that although codes C3 and C4 were both present during month 4, they were not present in the same patient, so their co-occurrence is 0.

\newpage

# PMI

We can then compute PMI on that object:

```{r pmi}
m_pmi = get_pmi(spm_cooc)
m_pmi
```

Here:

* PMI(C1, C2) is `log(7 * 59 / ((16 + 7) * (12 + 7)))`
* PMI(C1, C1) is `log(16 * 59 / (16 + 7) ^ 2)`

where `16 + 7` and `12 + 7` are the marginal counts of C1 and C2 (diagonal plus off-diagonal co-occurrences), and 59 is the sum of all marginal counts.

I acknowledge here Dr. Doudou Zhou and Dr. Ziming Gan for their contributions to the implementation of the PMI and SVD computations.

# SVD

We then call SVD on the PMI matrix, and feed the `m_svd` object (*i.e.* the embedding matrix) to downstream analyses, such as knowledge graphs.

```{r svd}
m_svd = get_svd(m_pmi, embedding_dim = 2)
m_svd
```

The SVD is computed with randomized SVD [@erichson2016randomized] from the rsvd package, an efficient approximation of truncated SVD in which only the first principal components are returned. The rsvd author suggests that the number of dimensions requested `k` should follow `k < n / 4`, where `n` is the number of features, for the approximation to be efficient, and that otherwise one should rather use SVD or truncated SVD. Furthermore, we select only the positive principal vectors (further details in the documentation of the function).

For EHR data, I recommend an SVD rank between 1000 and 2000 when analyzing codified and natural language processing (NLP) data (15k - 25k features), and a lower rank if only analyzing codified data (300 - 800 dimensions for 5k - 10k codified features). Known pairs can be used to compute AUC and select the best performing rank, *e.g.* with the `fit_embeds_kg` function from the [kgraph package](https://CRAN.R-project.org/package=kgraph) or with the [kgraph Shiny app](https://celehs.connect.hms.harvard.edu/kgraph/). The kgraph package also includes a dataset of known pairs of NLP concepts derived from PrimeKG [@chandak2022building] (further introduced in its vignette).

When selecting the number of dimensions, one should balance two objectives: a higher AUC and a lower number of dimensions, as a large number of dimensions can lead to overfitting while a lower number of dimensions acts as a regularization. The aim is thus to identify the range after which increasing the number of dimensions no longer significantly increases the AUC. For generalization and reproducibility in other cohorts, it may be better to accept a slightly smaller AUC with a lower number of dimensions. An edge case would be, for example, an AUC of 0.732 at 1000 dimensions and 0.75 at 2000 dimensions: in my opinion, either choice would be fine, and perhaps a more thorough investigation could be performed.
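To make the embeddings above a bit more concrete, the sketch below computes, in plain base R, the pairwise cosine similarities between the embedding vectors in `m_svd`: related codes such as C1 and C2 are expected to score higher than unrelated ones. This is only an illustration of how such an embedding matrix is typically used downstream; it is not a function of the package, and dedicated packages such as kgraph provide proper evaluation against known pairs.

```{r svd_similarity}
# pairwise cosine similarities between the rows of the embedding matrix
# (plain base R illustration, not a package function)
m_norm = m_svd / sqrt(rowSums(m_svd ^ 2))
m_norm %*% t(m_norm)
```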
\newpage

# Out-of-memory SQL databases

To efficiently compute co-occurrence matrices from large databases of tens of gigabytes, two important optimizations are available:

* Batching by patients
* Code subsets based on dictionaries

## Batching by patients

To perform batching by patients, the input should be a SQL file, which will be read by batches of patients, with the results aggregated along the way. This is performed with the `sql_cooc` function.

To demonstrate how it is used, we first write the example object above as a SQL database. It is important to name the columns and the SQL table as here (*i.e.* four columns `Patient`, `Month`, `Parent_Code`, `Count`, and table name `df_monthly`). Also, we index the table by patients and write the unique codes as a second table, `df_uniq_codes` (this is optional: it is done automatically in `sql_cooc` if the table `df_uniq_codes` is not found in the input SQL file and the `autoindex` parameter is set to TRUE). It is a good idea to also order your table by patients before writing it, as it makes read accesses more efficient; in some of my experiments it made the computation up to 4 times faster.

```{r sql_data}
library(RSQLite)

test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr, overwrite = TRUE)

###
# optional, done automatically by sql_cooc if table 'df_uniq_codes' not found
# and parameter autoindex set to TRUE
dbExecute(test_db, "CREATE INDEX patient_idx ON df_monthly (Patient)")

df_uniq_codes = unique(df_ehr['Parent_Code'])
dbWriteTable(test_db, 'df_uniq_codes', df_uniq_codes, overwrite = TRUE)
###

dbDisconnect(test_db)
```

We can then call the `sql_cooc` function on that filename and specify a new filename for the output. For each batch, information about the current memory usage is printed, which can be helpful for debugging memory usage and getting an idea of the computation progress from the log file.

```{r sql_cooc}
output_db_path = tempfile()
sql_cooc(input_path = test_db_path, output_path = output_db_path)
```

\newpage

Once the computation is complete, we read the co-occurrence sparse matrix from the output SQL file.

```{r read_sql}
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)

spm_cooc
```

As previously, we can then feed it to `get_pmi`.

```{r sql_pmi}
m_pmi = get_pmi(spm_cooc)
m_pmi
```

Or transform it into a classic matrix.

```{r read_sql_cooc}
spm_cooc = build_spm_cooc_sym(spm_cooc)
m_cooc = as.matrix(spm_cooc)
m_cooc
```

That is the minimum call. Two important parameters come into play here:

* `n_batch`: the number of patients to include in each batch
* `n_cores`: the number of cores on which the computation will be parallelized

For both parameters, the higher the value, the faster the computation and the more RAM required, so they can be fine-tuned to fit and optimize for your machine specifications. As an example, `n_batch = 300` and `n_cores = 6` should be able to run on 16 GB machines if the number of codes is not too large (*e.g.* if only considering rolled-up codified data).
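As a sketch of what a call on a real out-of-memory database might look like (the file paths below are placeholders and the values are illustrative), these two parameters are simply passed to `sql_cooc`:

```{r sql_cooc_tuned, eval = FALSE}
# illustrative only: paths are placeholders, and n_batch / n_cores should be
# tuned to the available RAM and number of cores, as discussed above
sql_cooc(input_path = 'monthly_counts.db', output_path = 'monthly_cooc.db',
         n_batch = 300, n_cores = 6)
```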
\newpage

## Code subsets based on dictionaries

When the number of unique codes is large, *e.g.* when analyzing NLP concepts, one may want to perform a first subset of the codes by providing a dictionary of codes to include. Three parameters are available here:

* `exclude_code_pattern`: Pattern of code prefixes to exclude (will be used in SQL, appended by '%'). For example, 'AB'.
* `exclude_dict_pattern`: Used in combination with parameter `codes_dict`. Pattern of code prefixes to exclude, except if they are found in `codes_dict` (will be used in grep, prefixed by '^'). For example, 'C[0-9]'.
* `codes_dict_fpaths`: Used in combination with `exclude_dict_pattern`. Filepaths to the dictionaries defining the codes that should not be excluded by `exclude_dict_pattern`. The first column of each file must define the code identifiers.

To demonstrate the behavior, we rename the first two codes in the object above to two NLP concepts present in the dictionaries provided with the package.

```{r dicts_data}
library(magrittr) # provides the %<>% assignment pipe

df_ehr$Parent_Code %<>% ifelse(. == 'C1', 'C0000545', .)
df_ehr$Parent_Code %<>% ifelse(. == 'C2', 'C0000578', .)

df_ehr
```

We write it as a SQL file.

```{r dicts_data_write}
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr)
dbDisconnect(test_db)
```

The code is then called like this:

```{r dicts_cooc}
codes_dict_fpaths = list.files(system.file('dictionaries', package = 'nlpembeds'),
                               full.names = TRUE)

sql_cooc(input_path = test_db_path, output_path = output_db_path,
         exclude_dict_pattern = 'C[0-9]', codes_dict_fpaths = codes_dict_fpaths,
         autoindex = TRUE, overwrite_output = TRUE)
```

\newpage

Finally, we read the co-occurrence sparse matrix from that SQL file.

```{r dicts_cooc_read}
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)

spm_cooc
```

As previously, we can then feed it to `get_pmi`.

```{r oom_sql_pmi}
# m_pmi = get_pmi(spm_cooc)
# m_pmi
```

Or transform it into a classic matrix.

```{r oom_read_sql_cooc}
# spm_cooc = build_spm_cooc_sym(spm_cooc)
# m_cooc = as.matrix(spm_cooc)
# m_cooc
```

\newpage

# Running on HPC servers

## Installation

If using a CentOS server, nlpembeds may not install correctly and you may need to use v2.4.0 of RcppAlgos. One can install it this way, before installing nlpembeds:

```{r}
# remotes::install_git('https://github.com/jwood000/RcppAlgos.git', ref = 'v2.4.0')
```

## Parameter tuning

The `n_batch` patients are parallelized on the `n_cores` available, so `n_batch` needs to be larger than `n_cores`, and it is best to have at least 2-3 patients per core. The aggregated sparse matrix grows logarithmically in size, so if you are close to the RAM limit, your job may fail only after several hours. The number of codes has a quadratic impact on memory, so `n_batch` and `n_cores` should be optimized depending on the number of codes considered.

Assuming 250 GB of RAM available and ~30k codes, I would recommend `n_batch = 75` and `n_cores = 25`, and estimate that computing the co-occurrence matrix for ~60k patients should take about 3 hours. The parameters can be scaled roughly linearly with the amount of RAM, meaning that with 500 GB available you could use `n_batch = 150` and `n_cores = 50`, and it should run in about 1.5 hours. If only considering 10-15k codes, you can most likely use `n_batch = 600` and `n_cores = 20`. If considering more than 30k codes, you will need to reduce `n_batch` and/or `n_cores`.

\newpage

# References