5 Create mass_dataset class

This chapter shows how to create a mass_dataset class object using tidymass.

5.1 Data preparation

A mass_dataset class object stores untargeted metabolomics data.

First, prepare the data objects for each file as shown in the attached figure.

5.1.1 `sample_info` (required)

The columns sample_id (sample ID), injection.order (injection order of samples), class (Blank, QC, Subject, etc.), and group (case, control, etc.) are required.
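As a minimal sketch, a sample_info table with the four required columns could look like the one below. The sample IDs and values are made up for illustration, and the uniqueness check is an assumption based on sample_id having to match the column names of expression_data.

```r
# Toy sample_info with the four required columns (all values are illustrative).
sample_info <- data.frame(
  sample_id = c("blank_1", "qc_1", "subject_1", "subject_2"),
  injection.order = 1:4,
  class = c("Blank", "QC", "Subject", "Subject"),
  group = c("Blank", "QC", "case", "control"),
  stringsAsFactors = FALSE
)

# Check that no required column is missing.
required_cols <- c("sample_id", "injection.order", "class", "group")
missing_cols <- setdiff(required_cols, colnames(sample_info))
missing_cols  # an empty character vector means all required columns are present

# Assumption: sample IDs should be unique, since they map to column names.
ids_unique <- anyDuplicated(sample_info$sample_id) == 0
```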

5.1.2 `variable_info` (required)

The columns variable_id (variable ID), mz (mass-to-charge ratio), and rt (retention time, in seconds) are required.
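Because rt is expected in seconds, a table exported in minutes needs converting first. The sketch below assumes a hypothetical input column named rt_min; the feature IDs and values are illustrative only.

```r
# Toy variable table whose retention time is in minutes.
# `rt_min` is a hypothetical column name used only for this example.
variable_info <- data.frame(
  variable_id = c("feature_1", "feature_2", "feature_3"),
  mz = c(86.0964, 90.0549, 91.0577),
  rt_min = c(0.73, 10.63, 10.52),
  stringsAsFactors = FALSE
)

# Convert minutes to seconds and keep only the three required columns.
variable_info$rt <- variable_info$rt_min * 60
variable_info <- variable_info[, c("variable_id", "mz", "rt")]

variable_info
```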

5.1.3 `expression_data` (required)

Columns are samples and rows are features (variables).

The column names of expression_data must exactly match sample_id in sample_info, and the row names of expression_data must exactly match variable_id in variable_info.
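One way to verify these requirements before building the object is a small helper that reports mismatched IDs and whether the orders agree. This is a base R sketch; check_ids() is a hypothetical helper, not part of tidymass.

```r
# Hypothetical helper: report ID mismatches between the three tables.
check_ids <- function(expression_data, sample_info, variable_info) {
  sample_id <- as.character(sample_info$sample_id)
  variable_id <- as.character(variable_info$variable_id)
  list(
    samples_not_in_expression = setdiff(sample_id, colnames(expression_data)),
    samples_not_in_sample_info = setdiff(colnames(expression_data), sample_id),
    variables_not_in_expression = setdiff(variable_id, rownames(expression_data)),
    variables_not_in_variable_info = setdiff(rownames(expression_data), variable_id),
    same_sample_order = identical(colnames(expression_data), sample_id),
    same_variable_order = identical(rownames(expression_data), variable_id)
  )
}

# Toy data: same IDs everywhere, but the sample columns are in a different order.
expression_data <- data.frame(
  s2 = c(10, 20),
  s1 = c(30, 40),
  row.names = c("v1", "v2")
)
sample_info <- data.frame(sample_id = c("s1", "s2"), stringsAsFactors = FALSE)
variable_info <- data.frame(variable_id = c("v1", "v2"), stringsAsFactors = FALSE)

res <- check_ids(expression_data, sample_info, variable_info)
res$same_sample_order    # FALSE: columns need reordering
res$same_variable_order  # TRUE
```

Separating "missing IDs" from "wrong order" makes it clear whether a simple reordering is enough or whether the input files themselves disagree.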

5.1.4 `sample_info_note` (optional)

This is the metadata for sample_info.

5.1.5 `variable_info_note` (optional)

This is the metadata for variable_info.
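The objects printed later in this chapter report a 4 x 2 sample_info_note and a 3 x 2 variable_info_note, which suggests one row per column of the table being described. The sketch below builds such tables by hand. The column names name and meaning are an assumption here and should be checked against the massdataset documentation before use.

```r
# Assumption: each note table has one row per described column,
# with columns `name` and `meaning` (verify against the massdataset docs).
sample_info_note <- data.frame(
  name = c("sample_id", "injection.order", "class", "group"),
  meaning = c(
    "Sample ID",
    "Injection order of samples",
    "Sample class (Blank, QC, Subject, etc.)",
    "Sample group (case, control, etc.)"
  ),
  stringsAsFactors = FALSE
)

variable_info_note <- data.frame(
  name = c("variable_id", "mz", "rt"),
  meaning = c(
    "Variable ID",
    "Mass-to-charge ratio",
    "Retention time in seconds"
  ),
  stringsAsFactors = FALSE
)

dim(sample_info_note)
dim(variable_info_note)
```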

5.2 Download demo data

Here we use the demo data from the massprocesser package. The demo data can be downloaded here.

Download the data and uncompress it, then set the folder that contains demo_data as the working directory.
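Before reading anything, it can help to confirm that the working directory is set correctly. The sketch below only uses base R and the file names that appear in this chapter; the setwd() path is a placeholder to replace with your own location.

```r
# setwd("path/to/folder/containing/demo_data")  # placeholder path, adjust as needed

# The four files read in this chapter, relative to the working directory.
expected_files <- file.path(
  "demo_data", "feature_table",
  c("Peak_table_pos.csv", "Peak_table_neg.csv",
    "sample_info_pos.csv", "sample_info_neg.csv")
)

# Report any file that cannot be found from the current working directory.
missing_files <- expected_files[!file.exists(expected_files)]
if (length(missing_files) > 0) {
  message(
    "Not found from ", getwd(), ":\n",
    paste(missing_files, collapse = "\n")
  )
}
```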

Then prepare the data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

peak_table_pos = readr::read_csv("demo_data/feature_table/Peak_table_pos.csv")

Rows: 1612 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): variable_id
dbl (38): mz, rt, bl20210902_10, bl20210902_11, bl20210902_13, bl20210902_14...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

peak_table_neg = readr::read_csv("demo_data/feature_table/Peak_table_neg.csv")

Rows: 5486 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): variable_id
dbl (38): mz, rt, X20210902_neg04, X20210902_neg05, X20210902_neg06, X202109...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sample_info_pos = readr::read_csv("demo_data/feature_table/sample_info_pos.csv")

Rows: 36 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sample_id, class, group
dbl (1): injection.order

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sample_info_neg = readr::read_csv("demo_data/feature_table/sample_info_neg.csv")

Rows: 36 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sample_id, class, group
dbl (1): injection.order

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Variable information and expression data are in the peak table. Let’s separate them.

expression_data_pos = 
  peak_table_pos %>% 
  dplyr::select(-c(variable_id:rt)) %>% 
  as.data.frame()

variable_info_pos = 
  peak_table_pos %>% 
  dplyr::select(variable_id:rt) %>% 
  as.data.frame()

rownames(expression_data_pos) = variable_info_pos$variable_id

expression_data_neg = 
  peak_table_neg %>% 
  dplyr::select(-c(variable_id:rt)) %>% 
  as.data.frame()

variable_info_neg = 
  peak_table_neg %>% 
  dplyr::select(variable_id:rt) %>% 
  as.data.frame()

rownames(expression_data_neg) = variable_info_neg$variable_id

colnames(expression_data_pos) == sample_info_pos$sample_id

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

colnames(expression_data_neg) == sample_info_neg$sample_id

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

The sample_id values in sample_info and the column names of expression_data are in different orders, so reorder the columns of expression_data to follow sample_info.

expression_data_pos = 
  expression_data_pos[,sample_info_pos$sample_id]

expression_data_neg = 
  expression_data_neg[,sample_info_neg$sample_id]

colnames(expression_data_pos) == sample_info_pos$sample_id

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE

colnames(expression_data_neg) == sample_info_neg$sample_id

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE
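Scanning a vector of TRUE values works for 36 samples but becomes tedious for larger studies. As a sketch on toy data, the same reorder-and-verify step can be reduced to a single logical result with identical(), guarded by setequal() so that mismatched ID sets fail early.

```r
# Toy data: three samples whose columns are not in sample_info order.
sample_id <- c("s1", "s2", "s3")
expression_data <- data.frame(
  s3 = 1:2,
  s1 = 3:4,
  s2 = 5:6,
  row.names = c("v1", "v2")
)

# Fail early if the ID sets differ: indexing by name would error on a
# missing ID and silently drop any extra column.
stopifnot(setequal(colnames(expression_data), sample_id))

# Reorder columns to follow sample_id; drop = FALSE keeps a data frame
# even when there is only one sample.
expression_data <- expression_data[, sample_id, drop = FALSE]

# A single TRUE/FALSE instead of one value per sample.
order_ok <- identical(colnames(expression_data), sample_id)
order_ok
```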

5.3 Create `mass_dataset` class object

Now we can create a mass_dataset class object using the create_mass_dataset() function.

library(massdataset)

massdataset 1.0.34 (2025-05-08 11:44:07.24862)


Attaching package: 'massdataset'

The following object is masked from 'package:stats':

    filter

object_pos =
  create_mass_dataset(
    expression_data = expression_data_pos,
    sample_info = sample_info_pos,
    variable_info = variable_info_pos
  )
  
object_pos

-------------------- 
massdataset version: 1.0.34 
-------------------- 
1.expression_data:[ 1612 x 36 data.frame]
2.sample_info:[ 36 x 4 data.frame]
36 samples:bl20210902_3 bl20210902_4 bl20210902_5 ... bl20210902_37 bl20210902_38
3.variable_info:[ 1612 x 3 data.frame]
1612 variables:M86T44_POS M90T638_POS M91T631_POS ... M1197T265_POS M1198T265_POS
4.sample_info_note:[ 4 x 2 data.frame]
5.variable_info_note:[ 3 x 2 data.frame]
6.ms2_data:[ 0 variables x 0 MS2 spectra]
-------------------- 
Processing information
1 processings in total
create_mass_dataset ---------- 
      Package         Function.used                Time
1 massdataset create_mass_dataset() 2025-07-20 15:43:32

Then do the same for negative mode.

object_neg =
  create_mass_dataset(
    expression_data = expression_data_neg,
    sample_info = sample_info_neg,
    variable_info = variable_info_neg
  )
  
object_neg

-------------------- 
massdataset version: 1.0.34 
-------------------- 
1.expression_data:[ 5486 x 36 data.frame]
2.sample_info:[ 36 x 4 data.frame]
36 samples:X20210902_neg03 X20210902_neg04 X20210902_neg05 ... X20210902_neg37 X20210902_neg38
3.variable_info:[ 5486 x 3 data.frame]
5486 variables:M74T229_NEG M88T115_NEG M100T631_NEG ... M1199T22_NEG M1199T180_NEG
4.sample_info_note:[ 4 x 2 data.frame]
5.variable_info_note:[ 3 x 2 data.frame]
6.ms2_data:[ 0 variables x 0 MS2 spectra]
-------------------- 
Processing information
1 processings in total
create_mass_dataset ---------- 
      Package         Function.used                Time
1 massdataset create_mass_dataset() 2025-07-20 15:43:32

Then save them for later analysis.

save(object_pos, file = "demo_data/feature_table/object_pos")
save(object_neg, file = "demo_data/feature_table/object_neg")
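Files written with save() are restored with load(), which recreates each object under its original name. The round trip is sketched below with a toy list and a temporary file, so it runs without the demo data; toy_object is a stand-in for object_pos or object_neg.

```r
# Stand-in for a mass_dataset object; any R object behaves the same way.
toy_object <- list(id = "toy", values = c(1, 2, 3))

# Save to a temporary file (no extension, as in the chapter).
tmp_file <- file.path(tempdir(), "toy_object")
save(toy_object, file = tmp_file)

# Remove the object, then restore it from disk.
rm(toy_object)
loaded_names <- load(tmp_file)

# load() returns the names of the restored objects.
loaded_names
toy_object$values
```

In a later session, calling load() on "demo_data/feature_table/object_pos" would restore the object under the name object_pos in the same way.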

5.4 Export `mass_dataset` class object to csv or xlsx

export_mass_dataset(object = object_pos,
                    file_type = "xlsx",
                    path = "demo_data/feature_table/demo_data_pos")

All the data will then be written to the demo_data/feature_table/demo_data_pos folder.

## Session information

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.3     masstools_1.0.15   massdataset_1.0.34 lubridate_1.9.3   
 [5] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
 [9] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
[13] tidyverse_2.0.0   

loaded via a namespace (and not attached):
  [1] pbapply_1.7-2               remotes_2.5.0              
  [3] rlang_1.1.4                 clue_0.3-65                
  [5] GetoptLong_1.0.5            matrixStats_1.4.1          
  [7] compiler_4.4.1              png_0.1-8                  
  [9] vctrs_0.6.5                 reshape2_1.4.4             
 [11] rvest_1.0.4                 ProtGenerics_1.36.0        
 [13] pkgconfig_2.0.3             shape_1.4.6.1              
 [15] crayon_1.5.3                fastmap_1.2.0              
 [17] XVector_0.44.0              rmarkdown_2.29             
 [19] tzdb_0.4.0                  preprocessCore_1.66.0      
 [21] UCSC.utils_1.0.0            bit_4.5.0                  
 [23] xfun_0.52                   MultiAssayExperiment_1.30.3
 [25] zlibbioc_1.50.0             GenomeInfoDb_1.40.1        
 [27] jsonlite_1.8.9              DelayedArray_0.30.1        
 [29] BiocParallel_1.38.0         parallel_4.4.1             
 [31] cluster_2.1.6               R6_2.5.1                   
 [33] stringi_1.8.4               RColorBrewer_1.1-3         
 [35] limma_3.60.6                GenomicRanges_1.56.2       
 [37] Rcpp_1.0.13-1               SummarizedExperiment_1.34.0
 [39] iterators_1.0.14            knitr_1.49                 
 [41] IRanges_2.38.1              Matrix_1.7-1               
 [43] igraph_2.1.1                timechange_0.3.0           
 [45] tidyselect_1.2.1            rstudioapi_0.17.1          
 [47] abind_1.4-8                 affy_1.82.0                
 [49] doParallel_1.0.17           codetools_0.2-20           
 [51] lattice_0.22-6              plyr_1.8.9                 
 [53] Biobase_2.64.0              withr_3.0.2                
 [55] evaluate_1.0.1              zip_2.3.1                  
 [57] xml2_1.3.6                  circlize_0.4.16            
 [59] BiocManager_1.30.25         affyio_1.74.0              
 [61] pillar_1.11.0               MatrixGenerics_1.16.0      
 [63] foreach_1.5.2               stats4_4.4.1               
 [65] MSnbase_2.30.1              MALDIquant_1.22.3          
 [67] ncdf4_1.23                  generics_0.1.3             
 [69] vroom_1.6.5                 S4Vectors_0.42.1           
 [71] hms_1.1.3                   munsell_0.5.1              
 [73] scales_1.3.0                glue_1.8.0                 
 [75] lazyeval_0.2.2              tools_4.4.1                
 [77] mzID_1.42.0                 QFeatures_1.14.2           
 [79] vsn_3.72.0                  mzR_2.38.0                 
 [81] openxlsx_4.2.7.1            XML_3.99-0.17              
 [83] grid_4.4.1                  impute_1.78.0              
 [85] MsCoreUtils_1.16.1          colorspace_2.1-1           
 [87] GenomeInfoDbData_1.2.12     PSMatch_1.8.0              
 [89] cli_3.6.3                   S4Arrays_1.4.1             
 [91] ComplexHeatmap_2.20.0       AnnotationFilter_1.28.0    
 [93] pcaMethods_1.96.0           gtable_0.3.6               
 [95] digest_0.6.37               BiocGenerics_0.50.0        
 [97] SparseArray_1.4.8           rjson_0.2.23               
 [99] htmlwidgets_1.6.4           htmltools_0.5.8.1          
[101] lifecycle_1.0.4             httr_1.4.7                 
[103] statmod_1.5.0               GlobalOptions_0.1.2        
[105] bit64_4.5.2                 MASS_7.3-61
