8   Raw data processing

 

Data preparation

Download the demo data and refer to this article for details.

The demo data were acquired in both positive and negative ionization modes. Each mode contains control, case, and QC groups. The control group and the case group each have 110 samples.
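Before processing, it can help to confirm that each group folder holds the expected number of raw files. The sketch below assumes a layout with one subfolder per group under the mode folder (for example mzxml_ms1_data/POS/Control, Case, and QC) and files with the .mzXML extension. The helper count_raw_files() is hypothetical and not part of tidymass.

```r
# Hypothetical helper: count mzXML files in each group subfolder of `path`.
# Assumes one subfolder per group (for example Control, Case, QC).
count_raw_files <- function(path) {
  group_dirs <- list.dirs(path, full.names = TRUE, recursive = FALSE)
  counts <- vapply(
    group_dirs,
    function(dir) {
      length(list.files(dir, pattern = "\\.mzxml$", ignore.case = TRUE))
    },
    integer(1)
  )
  names(counts) <- basename(group_dirs)
  counts
}

# Usage: only runs if the demo data folder exists in the working directory.
if (dir.exists("mzxml_ms1_data/POS")) {
  print(count_raw_files("mzxml_ms1_data/POS"))
}
```

If the counts for the control and case folders are not both 110, the download is probably incomplete.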

Positive mode

The massprocesser package performs the raw data processing. Please refer to this website for details.

Code

The following code runs the raw data processing.

library(tidymass)
#> Registered S3 method overwritten by 'Hmisc':
#>   method       from      
#>   vcov.default fit.models
#> ── Attaching packages ───────────────────────────── tidymass 0.99.6 ──
#> ✓ massdataset   0.99.20     ✓ metpath       0.99.4 
#> ✓ massprocesser 0.99.3      ✓ metid         1.2.4  
#> ✓ masscleaner   0.99.7      ✓ masstools     0.99.5 
#> ✓ massqc        0.99.7      ✓ dplyr         1.0.8  
#> ✓ massstat      0.99.13     ✓ ggplot2       3.3.5
#> ── Conflicts ───────────────────────────────── tidymass_conflicts() ──
#> x massdataset::apply()     masks base::apply()
#> x dplyr::collect()         masks xcms::collect()
#> x BiocGenerics::colMeans() masks massdataset::colMeans(), base::colMeans()
#> x BiocGenerics::colSums()  masks massdataset::colSums(), base::colSums()
#> x dplyr::combine()         masks MSnbase::combine(), Biobase::combine(), BiocGenerics::combine()
#> x dplyr::filter()          masks metpath::filter(), massdataset::filter(), stats::filter()
#> x dplyr::first()           masks S4Vectors::first()
#> x dplyr::groups()          masks xcms::groups()
#> x S4Vectors::intersect()   masks BiocGenerics::intersect(), massdataset::intersect(), base::intersect()
#> x dplyr::lag()             masks stats::lag()
#> x masstools::mz_rt_match() masks massdataset::mz_rt_match()
#> x dplyr::rename()          masks S4Vectors::rename(), massdataset::rename()
#> x BiocGenerics::rowMeans() masks massdataset::rowMeans(), base::rowMeans()
#> x BiocGenerics::rowSums()  masks massdataset::rowSums(), base::rowSums()
process_data(
  path = "mzxml_ms1_data/POS",
  polarity = "positive",
  ppm = 10,
  peakwidth = c(10, 60),
  threads = 4,
  output_tic = FALSE,
  output_bpc = FALSE,
  output_rt_correction_plot = FALSE,
  min_fraction = 0.5,
  group_for_figure = "QC"
)
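The ppm = 10 argument is a relative mass tolerance, so the absolute m/z window it allows grows with m/z. As a small worked illustration of that arithmetic (the helper ppm_to_da() is hypothetical and not a massprocesser function):

```r
# Hypothetical helper: convert a ppm tolerance into an absolute
# m/z tolerance (in Da) at a given m/z value.
ppm_to_da <- function(mz, ppm) {
  mz * ppm / 1e6
}

# Absolute tolerance at m/z 100, 500, and 1000 for ppm = 10
ppm_to_da(c(100, 500, 1000), ppm = 10)
# 0.001 0.005 0.010
```

At 10 ppm, a peak at m/z 500 can therefore deviate by about 0.005 Da. A suitable value depends on the mass accuracy of the instrument.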

Results

All results are written to the folder mzxml_ms1_data/POS/Result. More information can be found here.

You can load the object directly. It is a mass_dataset class object.

load("mzxml_ms1_data/POS/Result/object")
object
#> -------------------- 
#> massdataset version: 0.99.8 
#> -------------------- 
#> 1.expression_data:[ 10149 x 259 data.frame]
#> 2.sample_info:[ 259 x 4 data.frame]
#> 3.variable_info:[ 10149 x 3 data.frame]
#> 4.sample_info_note:[ 4 x 2 data.frame]
#> 5.variable_info_note:[ 3 x 2 data.frame]
#> 6.ms2_data:[ 0 variables x 0 MS2 spectra]
#> -------------------- 
#> Processing information (extract_process_info())
#> create_mass_dataset ---------- 
#>       Package         Function.used                Time
#> 1 massdataset create_mass_dataset() 2022-02-22 16:37:06
#> process_data ---------- 
#>         Package Function.used                Time
#> 1 massprocesser  process_data 2022-02-22 16:36:42

We can see that there are 10,149 metabolic features in positive mode.

Use the plot_adjusted_rt() function to get an interactive plot of the adjusted retention times.

load("mzxml_ms1_data/POS/Result/intermediate_data/xdata2")
# Set group_for_figure to a group name to show only that group,
# or to "all" to show all samples.
plot <-
  massprocesser::plot_adjusted_rt(
    object = xdata2,
    group_for_figure = "QC",
    interactive = TRUE
  )
plot

Negative mode

Negative mode data are processed in the same way as positive mode data.

Code

The call is the same as for positive mode, with path pointing to the NEG folder and polarity set to "negative".

massprocesser::process_data(
  path = "mzxml_ms1_data/NEG",
  polarity = "negative",
  ppm = 10,
  peakwidth = c(10, 60),
  threads = 4,
  output_tic = FALSE,
  output_bpc = FALSE,
  output_rt_correction_plot = FALSE,
  min_fraction = 0.5,
  group_for_figure = "QC"
)

Results

The results have the same structure as in positive mode.

load("mzxml_ms1_data/NEG/Result/object")
object
#> -------------------- 
#> massdataset version: 0.99.8 
#> -------------------- 
#> 1.expression_data:[ 8804 x 259 data.frame]
#> 2.sample_info:[ 259 x 4 data.frame]
#> 3.variable_info:[ 8804 x 3 data.frame]
#> 4.sample_info_note:[ 4 x 2 data.frame]
#> 5.variable_info_note:[ 3 x 2 data.frame]
#> 6.ms2_data:[ 0 variables x 0 MS2 spectra]
#> -------------------- 
#> Processing information (extract_process_info())
#> create_mass_dataset ---------- 
#>       Package         Function.used                Time
#> 1 massdataset create_mass_dataset() 2022-02-22 16:38:19
#> process_data ---------- 
#>         Package Function.used                Time
#> 1 massprocesser  process_data 2022-02-22 16:38:02

We can see that there are 8,804 metabolic features in negative mode.
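Both result files restore a variable named object, so calling load() on the negative mode file in the same session replaces the positive mode object. One way to keep both is to load each file into its own environment. This is a minimal sketch in base R; the helper load_named() and the names object_pos and object_neg are illustrative and not part of tidymass.

```r
# Hypothetical helper: load a saved file into a private environment
# and return a single variable from it, so nothing in the global
# environment is overwritten.
load_named <- function(file, name = "object") {
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env, inherits = FALSE)
}

pos_file <- "mzxml_ms1_data/POS/Result/object"
neg_file <- "mzxml_ms1_data/NEG/Result/object"

# Only runs if both result files exist in the working directory.
if (file.exists(pos_file) && file.exists(neg_file)) {
  object_pos <- load_named(pos_file)
  object_neg <- load_named(neg_file)
}
```

With both objects in memory under distinct names, later steps can treat the two modes separately without reloading.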

Session information

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.0.8          metid_1.2.4          metpath_0.99.4      
#>  [4] massstat_0.99.13     ggfortify_0.4.14     massqc_0.99.7       
#>  [7] masscleaner_0.99.7   xcms_3.16.1          MSnbase_2.20.4      
#> [10] ProtGenerics_1.26.0  S4Vectors_0.32.3     mzR_2.28.0          
#> [13] Rcpp_1.0.8           Biobase_2.54.0       BiocGenerics_0.40.0 
#> [16] BiocParallel_1.28.3  massprocesser_0.99.3 ggplot2_3.3.5       
#> [19] masstools_0.99.5     massdataset_0.99.20  tidymass_0.99.6     
#> [22] magrittr_2.0.2      
#> 
#> loaded via a namespace (and not attached):
#>   [1] blogdown_1.7                tidyr_1.2.0                
#>   [3] missForest_1.4              knitr_1.37                 
#>   [5] DelayedArray_0.20.0         data.table_1.14.2          
#>   [7] rpart_4.1.16                KEGGREST_1.34.0            
#>   [9] RCurl_1.98-1.5              doParallel_1.0.17          
#>  [11] generics_0.1.2              snow_0.4-4                 
#>  [13] leaflet_2.1.0               preprocessCore_1.56.0      
#>  [15] mixOmics_6.18.1             RANN_2.6.1                 
#>  [17] proxy_0.4-26                future_1.23.0              
#>  [19] tzdb_0.2.0                  xml2_1.3.3                 
#>  [21] lubridate_1.8.0             ggsci_2.9                  
#>  [23] SummarizedExperiment_1.24.0 assertthat_0.2.1           
#>  [25] tidyverse_1.3.1             viridis_0.6.2              
#>  [27] xfun_0.29                   hms_1.1.1                  
#>  [29] jquerylib_0.1.4             evaluate_0.15              
#>  [31] DEoptimR_1.0-10             fansi_1.0.2                
#>  [33] dbplyr_2.1.1                readxl_1.3.1               
#>  [35] igraph_1.2.11               DBI_1.1.2                  
#>  [37] htmlwidgets_1.5.4           MsFeatures_1.3.0           
#>  [39] rARPACK_0.11-0              purrr_0.3.4                
#>  [41] ellipsis_0.3.2              RSpectra_0.16-0            
#>  [43] crosstalk_1.2.0             backports_1.4.1            
#>  [45] bookdown_0.24               ggcorrplot_0.1.3           
#>  [47] MatrixGenerics_1.6.0        vctrs_0.3.8                
#>  [49] remotes_2.4.2               here_1.0.1                 
#>  [51] withr_2.4.3                 ggforce_0.3.3              
#>  [53] itertools_0.1-3             robustbase_0.93-9          
#>  [55] checkmate_2.0.0             cluster_2.1.2              
#>  [57] lazyeval_0.2.2              crayon_1.5.0               
#>  [59] ellipse_0.4.2               pkgconfig_2.0.3            
#>  [61] tweenr_1.0.2                GenomeInfoDb_1.30.0        
#>  [63] nnet_7.3-17                 rlang_1.0.1                
#>  [65] globals_0.14.0              lifecycle_1.0.1            
#>  [67] affyio_1.64.0               extrafontdb_1.0            
#>  [69] fastDummies_1.6.3           MassSpecWavelet_1.60.0     
#>  [71] modelr_0.1.8                cellranger_1.1.0           
#>  [73] randomForest_4.7-1          rprojroot_2.0.2            
#>  [75] polyclip_1.10-0             matrixStats_0.61.0         
#>  [77] Matrix_1.4-0                reprex_2.0.1               
#>  [79] base64enc_0.1-3             GlobalOptions_0.1.2        
#>  [81] png_0.1-7                   viridisLite_0.4.0          
#>  [83] rjson_0.2.21                clisymbols_1.2.0           
#>  [85] bitops_1.0-7                pander_0.6.4               
#>  [87] Biostrings_2.62.0           shape_1.4.6                
#>  [89] stringr_1.4.0               parallelly_1.30.0          
#>  [91] robust_0.7-0                readr_2.1.2                
#>  [93] jpeg_0.1-9                  gridGraphics_0.5-1         
#>  [95] scales_1.1.1                plyr_1.8.6                 
#>  [97] zlibbioc_1.40.0             compiler_4.1.2             
#>  [99] RColorBrewer_1.1-2          pcaMethods_1.86.0          
#> [101] clue_0.3-60                 rrcov_1.6-2                
#> [103] cli_3.2.0                   affy_1.72.0                
#> [105] XVector_0.34.0              listenv_0.8.0              
#> [107] patchwork_1.1.1             pbapply_1.5-0              
#> [109] htmlTable_2.4.0             Formula_1.2-4              
#> [111] MASS_7.3-55                 tidyselect_1.1.1           
#> [113] vsn_3.62.0                  stringi_1.7.6              
#> [115] forcats_0.5.1.9000          yaml_2.3.4                 
#> [117] latticeExtra_0.6-29         MALDIquant_1.21            
#> [119] ggrepel_0.9.1               grid_4.1.2                 
#> [121] sass_0.4.0                  tools_4.1.2                
#> [123] parallel_4.1.2              circlize_0.4.14            
#> [125] rstudioapi_0.13             MsCoreUtils_1.6.0          
#> [127] foreach_1.5.2               foreign_0.8-82             
#> [129] gridExtra_2.3               farver_2.1.0               
#> [131] mzID_1.32.0                 ggraph_2.0.5               
#> [133] rvcheck_0.2.1               digest_0.6.29              
#> [135] BiocManager_1.30.16         GenomicRanges_1.46.1       
#> [137] broom_0.7.12                ncdf4_1.19                 
#> [139] httr_1.4.2                  ComplexHeatmap_2.10.0      
#> [141] colorspace_2.0-2            rvest_1.0.2                
#> [143] XML_3.99-0.8                fs_1.5.2                   
#> [145] IRanges_2.28.0              splines_4.1.2              
#> [147] yulab.utils_0.0.4           graphlayouts_0.8.0         
#> [149] ggplotify_0.1.0             plotly_4.10.0              
#> [151] fit.models_0.64             jsonlite_1.7.3             
#> [153] tidygraph_1.2.0             corpcor_1.6.10             
#> [155] R6_2.5.1                    Hmisc_4.6-0                
#> [157] pillar_1.7.0                htmltools_0.5.2            
#> [159] glue_1.6.1                  fastmap_1.1.0              
#> [161] class_7.3-20                codetools_0.2-18           
#> [163] pcaPP_1.9-74                mvtnorm_1.1-3              
#> [165] furrr_0.2.3                 utf8_1.2.2                 
#> [167] lattice_0.20-45             bslib_0.3.1                
#> [169] tibble_3.1.6                zip_2.2.0                  
#> [171] openxlsx_4.2.5              Rttf2pt1_1.3.9             
#> [173] survival_3.2-13             limma_3.50.0               
#> [175] rmarkdown_2.11              munsell_0.5.0              
#> [177] e1071_1.7-9                 GetoptLong_1.0.5           
#> [179] GenomeInfoDbData_1.2.7      iterators_1.0.14           
#> [181] impute_1.68.0               haven_2.4.3                
#> [183] reshape2_1.4.4              gtable_0.3.0               
#> [185] extrafont_0.17