Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generate_celltype_data struggling with a large matrix: "too many elements (>= 2^31) selected along dimension 1 of array" #91

Open
ainefairbrother opened this issue Nov 29, 2023 · 7 comments
Labels

Comments

@ainefairbrother
Copy link

ainefairbrother commented Nov 29, 2023

1. Bug description

When running generate_celltype_data() on a gene-by-cell matrix of 34807 x 786896, I get a large array error. My input object (exp) is sparse DelayedMatrix object of type "double".

Console output

100 core(s) assigned as workers (156 reserved).
Converting to sparse matrix.
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'drop': too many elements (>= 2^31) selected along dimension 1 of array

Session info

R version 4.3.0 (2023-04-21)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/drihome/MRAineFairbrotherBrowne/miniconda3/envs/R/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] scuttle_1.10.1              DelayedMatrixStats_1.22.1
 [3] HDF5Array_1.28.1            rhdf5_2.44.0
 [5] DelayedArray_0.26.6         S4Arrays_1.0.4
 [7] Matrix_1.6-1                DESeq2_1.40.2
 [9] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
[11] Biobase_2.60.0              GenomicRanges_1.52.0
[13] GenomeInfoDb_1.36.1         IRanges_2.34.1
[15] S4Vectors_0.38.1            MatrixGenerics_1.12.2
[17] matrixStats_1.0.0           lubridate_1.9.2
[19] forcats_1.0.0               stringr_1.5.0
[21] dplyr_1.1.3                 purrr_1.0.1
[23] readr_2.1.4                 tidyr_1.3.0
[25] tibble_3.2.1                ggplot2_3.4.3
[27] tidyverse_2.0.0             ewceData_1.8.0
[29] ExperimentHub_2.8.1         AnnotationHub_3.8.0
[31] BiocFileCache_2.8.0         dbplyr_2.3.3
[33] BiocGenerics_0.46.0         EWCE_1.8.2
[35] RNOmni_1.0.1.2

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.7                magrittr_2.0.3
  [3] fs_1.6.3                      zlibbioc_1.46.0
  [5] vctrs_0.6.3                   memoise_2.0.1
  [7] RCurl_1.98-1.12               ggtree_3.8.2
  [9] rstatix_0.7.2                 htmltools_0.5.5
 [11] curl_5.0.1                    broom_1.0.5
 [13] Rhdf5lib_1.22.0               gridGraphics_0.5-1
 [15] basilisk_1.12.1               htmlwidgets_1.6.2
 [17] HGNChelper_0.8.1              plyr_1.8.8
 [19] plotly_4.10.3                 cachem_1.0.8
 [21] mime_0.12                     lifecycle_1.0.3
 [23] pkgconfig_2.0.3               R6_2.5.1
 [25] fastmap_1.1.1                 GenomeInfoDbData_1.2.10
 [27] shiny_1.7.4.1                 digest_0.6.33
 [29] aplot_0.2.2                   colorspace_2.1-0
 [31] patchwork_1.1.2               AnnotationDbi_1.62.2
 [33] rprojroot_2.0.3               grr_0.9.5
 [35] RSQLite_2.3.1                 ggpubr_0.6.0
 [37] beachmat_2.16.0               filelock_1.0.2
 [39] timechange_0.2.0              fansi_1.0.4
 [41] httr_1.4.6                    abind_1.4-5
 [43] compiler_4.3.0                here_1.0.1
 [45] withr_2.5.0                   bit64_4.0.5
 [47] backports_1.4.1               BiocParallel_1.34.2
 [49] orthogene_1.6.1               carData_3.0-5
 [51] DBI_1.1.3                     homologene_1.4.68.19.3.27
 [53] ggsignif_0.6.4                rappdirs_0.3.3
 [55] tools_4.3.0                   ape_5.7-1
 [57] interactiveDisplayBase_1.38.0 httpuv_1.6.11
 [59] glue_1.6.2                    rhdf5filters_1.12.1
 [61] nlme_3.1-163                  promises_1.2.0.1
 [63] grid_4.3.0                    anndata_0.7.5.6
 [65] reshape2_1.4.4                generics_0.1.3
 [67] gtable_0.3.4                  tzdb_0.4.0
 [69] hms_1.1.3                     data.table_1.14.8
 [71] car_3.1-2                     utf8_1.2.3
 [73] XVector_0.40.0                BiocVersion_3.17.1
 [75] pillar_1.9.0                  yulab.utils_0.1.0
 [77] babelgene_22.9                limma_3.56.2
 [79] later_1.3.1                   treeio_1.24.3
 [81] lattice_0.21-8                bit_4.0.5
 [83] tidyselect_1.2.0              locfit_1.5-9.8
 [85] Biostrings_2.68.1             stringi_1.7.12
 [87] lazyeval_0.2.2                ggfun_0.1.3
 [89] yaml_2.3.7                    codetools_0.2-19
 [91] BiocManager_1.30.21.1         ggplotify_0.1.2
 [93] cli_3.6.1                     xtable_1.8-4
 [95] reticulate_1.30               munsell_0.5.0
 [97] zellkonverter_1.10.1          Rcpp_1.0.11
 [99] dir.expiry_1.8.0              gprofiler2_0.2.2
[101] png_0.1-8                     ellipsis_0.3.2
[103] assertthat_0.2.1              blob_1.2.4
[105] basilisk.utils_1.12.1         sparseMatrixStats_1.12.2
[107] bitops_1.0-7                  viridisLite_0.4.2
[109] tidytree_0.4.5                scales_1.2.1
[111] crayon_1.5.2                  rlang_1.1.1
[113] KEGGREST_1.40.0

@Al-Murphy
Copy link
Collaborator

Al-Murphy commented Nov 30, 2023

Hey! Yep definitely a size issue, are you running the function with the as_DelayedArray = TRUE? By default it is FALSE, this might help. It's hard to debug exactly where you are running into the size issue without having access to the data.

Alan.

@bschilder
Copy link
Collaborator

bschilder commented Nov 30, 2023

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

@ainefairbrother
Copy link
Author

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

This is a server with ~133TB storage, 1TB RAM

Architecture and CPU info:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255

Linux kernal version: 5.15.0-73-generic

@ainefairbrother
Copy link
Author

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")

@Al-Murphy
Copy link
Collaborator

You will need to create and share a subset of your data which replicates the issue if you want us to debug this, the function works with the as_DelayedArray = TRUE with other datasets:

> # Load the single cell data
> cortex_mrna <- ewceData::cortex_mrna()
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
> # Use only a subset to keep the example quick
> expData <- cortex_mrna$exp#[1:100, ]
> l1 <- cortex_mrna$annot$level1class
> l2 <- cortex_mrna$annot$level2class
> annotLevels <- list(l1 = l1, l2 = l2)
> fNames_ALLCELLS <- EWCE::generate_celltype_data(
+   exp = expData,
+   annotLevels = annotLevels,
+   groupName = "allKImouse",
+   as_DelayedArray = TRUE
+ )
1 core(s) assigned as workers (10 reserved).
Converting to sparse matrix.
Converting to DelayedArray.
+ Calculating normalized mean expression.
Converting to sparse matrix.
Converting to sparse matrix.
+ Calculating normalized specificity.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
+ Saving results ==>  /tmp/Rtmpc3QW1s/ctd_allKImouse.rda

Subsetting on the number of genes and cells should be relatively quick to do.

@bschilder
Copy link
Collaborator

bschilder commented Dec 1, 2023

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")

It looks like this is coming from somewhere within DelayedArray, but can't quite pinpoint where or why yet.

Also, @ainefairbrother , can you confirm that your exp DelayedArray object is backed at time you're running the function? ie saved to disk and only reading in the each chunk into memory as-needed. That will ensure you're fully benefitting from the DelayedArray capabilities.

This can be done using this set of functions:
https://rdrr.io/bioc/HDF5Array/man/saveHDF5SummarizedExperiment.html

See here for more info:
https://www.bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html

@ainefairbrother
Copy link
Author

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")

It looks like this is coming from somewhere within DelayedArray, but can't quite pinpoint where or why yet.

Also, @ainefairbrother , can you confirm that your exp DelayedArray object is backed at time you're running the function? ie saved to disk and only reading in the each chunk into memory as-needed. That will ensure you're fully benefitting from the DelayedArray capabilities.

This can be done using this set of functions: https://rdrr.io/bioc/HDF5Array/man/saveHDF5SummarizedExperiment.html

See here for more info: https://www.bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html

Hi Brian - yes, can confirm that it's backed at the time of running the function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants