Create mass_dataset class

Here, you can learn how to create mass_dataset class object using tidymass.

Data preparation

The massdataset class object can be used to store the untargeted metabolomics data.

Let’s first prepare the data objects according to the attached figure for each file.

1. `sample_info` (required)

The columns sample_id (sample ID), injection.order (injection order of samples), class (Blank, QC, Subject, etc), group (case, control, etc) are required.

2. `variable_info` (required)

The columns variable_id (variable ID), mz (mass to charge ratio), rt (retention time, unit is second) are required.

3. `expression_data` (required)

Columns are samples are rows are features (variables).

The column names of expression_data should be completely same with sample_id in sample_info, and the row names of expression_data should be completely same with variable_id in variable_info.

4. `sample_info_note` (optional)

This is the metadata for sample_info.

5. `variable_info_note` (optional)

This is the metadata for variable_info.

Download demo data

Here we use the demo data from masssprocesser package. The demo data can be downloaded here.

Download this data and uncompress it. And then set the path where you put the folder as working directory.

Then prepare data.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
peak_table_pos = readr::read_csv("feature_table/Peak_table_pos.csv")
#> Rows: 1612 Columns: 39
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): variable_id
#> dbl (38): mz, rt, bl20210902_10, bl20210902_11, bl20210902_13, bl20210902_14...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
peak_table_neg = readr::read_csv("feature_table/Peak_table_neg.csv")
#> Rows: 5486 Columns: 39
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): variable_id
#> dbl (38): mz, rt, X20210902_neg04, X20210902_neg05, X20210902_neg06, X202109...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sample_info_pos = readr::read_csv("feature_table/sample_info_pos.csv")
#> Rows: 36 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): sample_id, class, group
#> dbl (1): injection.order
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_info_neg = readr::read_csv("feature_table/sample_info_neg.csv")
#> Rows: 36 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): sample_id, class, group
#> dbl (1): injection.order
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Variable information and expression data are in the peak table. Let’s separate them.

expression_data_pos = 
  peak_table_pos %>% 
  dplyr::select(-c(variable_id:rt)) %>% 
  as.data.frame()

variable_info_pos = 
  peak_table_pos %>% 
  dplyr::select(variable_id:rt) %>% 
  as.data.frame()

rownames(expression_data_pos) = variable_info_pos$variable_id

expression_data_neg = 
  peak_table_neg %>% 
  dplyr::select(-c(variable_id:rt)) %>% 
  as.data.frame()

variable_info_neg = 
  peak_table_neg %>% 
  dplyr::select(variable_id:rt) %>% 
  as.data.frame()

rownames(expression_data_neg) = variable_info_neg$variable_id

colnames(expression_data_pos) == sample_info_pos$sample_id
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
colnames(expression_data_neg) == sample_info_neg$sample_id
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

The orders of sample_id in sample_info and column names of expression_data are different.

expression_data_pos = 
  expression_data_pos[,sample_info_pos$sample_id]

expression_data_neg = 
  expression_data_neg[,sample_info_neg$sample_id]

colnames(expression_data_pos) == sample_info_pos$sample_id
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE
colnames(expression_data_neg) == sample_info_neg$sample_id
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE

Create `mass_data` class object

Then we can create mass_data class object using create_mass_dataset() function.

library(massdataset)
#> massdataset 1.0.34 (2024-09-10 06:51:30.673357)
#> 
#> Attaching package: 'massdataset'
#> The following object is masked from 'package:stats':
#> 
#>     filter

object_pos =
  create_mass_dataset(
    expression_data = expression_data_pos,
    sample_info = sample_info_pos,
    variable_info = variable_info_pos
  )
  
object_pos
#> -------------------- 
#> massdataset version: 1.0.34 
#> -------------------- 
#> 1.expression_data:[ 1612 x 36 data.frame]
#> 2.sample_info:[ 36 x 4 data.frame]
#> 36 samples:bl20210902_3 bl20210902_4 bl20210902_5 ... bl20210902_37 bl20210902_38
#> 3.variable_info:[ 1612 x 3 data.frame]
#> 1612 variables:M86T44_POS M90T638_POS M91T631_POS ... M1197T265_POS M1198T265_POS
#> 4.sample_info_note:[ 4 x 2 data.frame]
#> 5.variable_info_note:[ 3 x 2 data.frame]
#> 6.ms2_data:[ 0 variables x 0 MS2 spectra]
#> -------------------- 
#> Processing information
#> 1 processings in total
#> create_mass_dataset ---------- 
#>       Package         Function.used                Time
#> 1 massdataset create_mass_dataset() 2024-09-25 17:21:43

Then negative mode.

object_neg =
  create_mass_dataset(
    expression_data = expression_data_neg,
    sample_info = sample_info_neg,
    variable_info = variable_info_neg
  )
  
object_neg
#> -------------------- 
#> massdataset version: 1.0.34 
#> -------------------- 
#> 1.expression_data:[ 5486 x 36 data.frame]
#> 2.sample_info:[ 36 x 4 data.frame]
#> 36 samples:X20210902_neg03 X20210902_neg04 X20210902_neg05 ... X20210902_neg37 X20210902_neg38
#> 3.variable_info:[ 5486 x 3 data.frame]
#> 5486 variables:M74T229_NEG M88T115_NEG M100T631_NEG ... M1199T22_NEG M1199T180_NEG
#> 4.sample_info_note:[ 4 x 2 data.frame]
#> 5.variable_info_note:[ 3 x 2 data.frame]
#> 6.ms2_data:[ 0 variables x 0 MS2 spectra]
#> -------------------- 
#> Processing information
#> 1 processings in total
#> create_mass_dataset ---------- 
#>       Package         Function.used                Time
#> 1 massdataset create_mass_dataset() 2024-09-25 17:21:43

Then save them for next analysis.

save(object_pos, file = "feature_table/object_pos")
save(object_neg, file = "feature_table/object_neg")

Export `mass_dataset` class object to csv for xlsx

export_mass_dataset(object = object_pos,
                    file_type = "xlsx",
                    path = "feature_table/demo_data_pos")

Then all the data will be in the feature_table/demo_data_pos folder.

Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.0
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Asia/Singapore
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] magrittr_2.0.3     masstools_1.0.13   massdataset_1.0.34 lubridate_1.9.3   
#>  [5] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
#>  [9] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
#> [13] tidyverse_2.0.0   
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-8                pbapply_1.7-2              
#>   [3] remotes_2.5.0               rlang_1.1.4                
#>   [5] clue_0.3-65                 GetoptLong_1.0.5           
#>   [7] matrixStats_1.3.0           compiler_4.4.1             
#>   [9] reshape2_1.4.4              png_0.1-8                  
#>  [11] vctrs_0.6.5                 ProtGenerics_1.36.0        
#>  [13] pkgconfig_2.0.3             shape_1.4.6.1              
#>  [15] crayon_1.5.3                fastmap_1.2.0              
#>  [17] XVector_0.44.0              utf8_1.2.4                 
#>  [19] rmarkdown_2.28              tzdb_0.4.0                 
#>  [21] preprocessCore_1.66.0       UCSC.utils_1.0.0           
#>  [23] bit_4.0.5                   xfun_0.47                  
#>  [25] MultiAssayExperiment_1.30.3 zlibbioc_1.50.0            
#>  [27] cachem_1.1.0                GenomeInfoDb_1.40.1        
#>  [29] jsonlite_1.8.8              DelayedArray_0.30.1        
#>  [31] BiocParallel_1.38.0         parallel_4.4.1             
#>  [33] cluster_2.1.6               R6_2.5.1                   
#>  [35] bslib_0.8.0                 stringi_1.8.4              
#>  [37] RColorBrewer_1.1-3          limma_3.60.4               
#>  [39] GenomicRanges_1.56.1        jquerylib_0.1.4            
#>  [41] Rcpp_1.0.13                 bookdown_0.40              
#>  [43] SummarizedExperiment_1.34.0 iterators_1.0.14           
#>  [45] knitr_1.48                  IRanges_2.38.1             
#>  [47] igraph_2.0.3                Matrix_1.7-0               
#>  [49] timechange_0.3.0            tidyselect_1.2.1           
#>  [51] abind_1.4-5                 rstudioapi_0.16.0          
#>  [53] yaml_2.3.10                 affy_1.82.0                
#>  [55] doParallel_1.0.17           codetools_0.2-20           
#>  [57] blogdown_1.19               plyr_1.8.9                 
#>  [59] lattice_0.22-6              Biobase_2.64.0             
#>  [61] withr_3.0.1                 evaluate_0.24.0            
#>  [63] zip_2.3.1                   circlize_0.4.16            
#>  [65] BiocManager_1.30.25         affyio_1.74.0              
#>  [67] pillar_1.9.0                MatrixGenerics_1.16.0      
#>  [69] foreach_1.5.2               stats4_4.4.1               
#>  [71] MALDIquant_1.22.3           MSnbase_2.30.1             
#>  [73] ncdf4_1.23                  generics_0.1.3             
#>  [75] RCurl_1.98-1.16             vroom_1.6.5                
#>  [77] rprojroot_2.0.4             S4Vectors_0.42.1           
#>  [79] hms_1.1.3                   munsell_0.5.1              
#>  [81] scales_1.3.0                glue_1.7.0                 
#>  [83] lazyeval_0.2.2              tools_4.4.1                
#>  [85] mzID_1.42.0                 vsn_3.72.0                 
#>  [87] QFeatures_1.14.2            mzR_2.38.0                 
#>  [89] openxlsx_4.2.7              XML_3.99-0.17              
#>  [91] grid_4.4.1                  impute_1.78.0              
#>  [93] MsCoreUtils_1.16.1          colorspace_2.1-1           
#>  [95] GenomeInfoDbData_1.2.12     PSMatch_1.8.0              
#>  [97] cli_3.6.3                   fansi_1.0.6                
#>  [99] S4Arrays_1.4.1              ComplexHeatmap_2.20.0      
#> [101] AnnotationFilter_1.28.0     pcaMethods_1.96.0          
#> [103] gtable_0.3.5                sass_0.4.9                 
#> [105] digest_0.6.37               BiocGenerics_0.50.0        
#> [107] SparseArray_1.4.8           rjson_0.2.22               
#> [109] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [111] httr_1.4.7                  here_1.0.1                 
#> [113] statmod_1.5.0               GlobalOptions_0.1.2        
#> [115] bit64_4.0.5                 MASS_7.3-61

Create mass_dataset class

Data preparation

1. sample_info (required)

2. variable_info (required)

3. expression_data (required)

4. sample_info_note (optional)

5. variable_info_note (optional)