## R-package 'RepeatedHighDim' (V 2.2.0)
## Documentation and Tutorial
by Klaus Jung (Documentation last modified: 05 April 2024)

### 1. Overview & Download

Many molecular high-throughput experiments result in high-dimensional data matrices in which the number of features (represented in the rows) is much larger than the number of samples (represented in the columns). Since multiple features are measured on the same experimental unit, the data can be regarded as repeated measurements.

The R-package 'RepeatedHighDim' comprises a selection of functions for different aspects of the analysis of high-dimensional repeated measurements. In particular, it provides functions for
- outlier detection,
- differential expression analysis,
- self-contained gene-set tests,
- and the generation of binary random data.

This documentation details the functionality of the package and guides through the example code. In the code examples, input is printed in red, output is printed in blue color. Download is available from The Comprehensive R Archive Network.

### 2. Outlier detection

A first step in exploratory data analysis of high-dimensional data is unsupervised clustering of samples in a lower-dimensional space, for example by means of principal component analysis (PCA). This analysis step can also be used to detect outlying samples. RepeatedHighDim implements the graphical approach of gemplots in the space of the first three principal components (Kruppa & Jung, 2017). This method extends the idea of bagplots, the two-dimensional version of boxplots (Rousseeuw et al., 1999). The advantage of the presented approach is that the data of multiple groups can be represented in the same PCA plot, while outlier detection is performed separately for each group. As an example, consider data from 400 experimental units from two experimental groups in a three-dimensional space:
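A minimal sketch of such a data set in base R is given here. The equal split into 200 units per group, the normal distributions, the group means of 0 and 1, and the object names `D1` and `D2` are assumptions made for illustration only.

```r
# Simulate two groups of 200 units each in three dimensions (assumed setup).
set.seed(42)
n <- 200

# Group 1: standard normal coordinates
D1 <- data.frame(x = rnorm(n, 0, 1), y = rnorm(n, 0, 1), z = rnorm(n, 0, 1))

# Group 2: shifted by one unit in every coordinate
D2 <- data.frame(x = rnorm(n, 1, 1), y = rnorm(n, 1, 1), z = rnorm(n, 1, 1))

# Both groups together hold 400 experimental units
nrow(D1) + nrow(D2)
```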
Next, we replace one observation in each group by an outlier:
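One way this replacement could look is sketched below. The simulated data, the row indices 17 and 99, and the outlier coordinates are illustrative assumptions, not values taken from this documentation.

```r
# Recreate two small three-dimensional groups (assumed setup) so that the
# snippet runs on its own.
set.seed(42)
n <- 200
cols <- c("x", "y", "z")
D1 <- as.data.frame(matrix(rnorm(3 * n, 0, 1), n, 3, dimnames = list(NULL, cols)))
D2 <- as.data.frame(matrix(rnorm(3 * n, 1, 1), n, 3, dimnames = list(NULL, cols)))

# Overwrite one observation per group with a point far from the group centre
D1[17, ] <- c(4, 5, 6)
D2[99, ] <- -c(3, 4, 5)

D1[17, ]
D2[99, ]
```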
Before the gemplots can be calculated and visualized, some graphical parameters need to be specified:
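A sketch of what such parameters might be, using only base R: a grid size for the depth calculation and semi-transparent colours for the gems. The variable names and the specific values are assumptions; in an rgl session the colours would additionally be passed on to the rgl plotting calls.

```r
# Number of grid points per axis used for the depth calculation (assumed value)
grid.size <- 20

# Semi-transparent colours; alpha = 100 out of 255 keeps the points inside
# the gems visible
red  <- rgb(200, 100, 100, alpha = 100, maxColorValue = 255)
blue <- rgb(100, 100, 200, alpha = 100, maxColorValue = 255)
yel  <- rgb(255, 255, 102, alpha = 100, maxColorValue = 255)

c(red = red, blue = blue, yel = yel)
```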
Now, the inner and outer gems can be calculated and visualized for both groups in an interactive rgl window:
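The following wrapper sketches how these steps might be chained for one group. It assumes that the package exports `gridfun`, `hldepth`, `depmed`, `bag`, `loop` and `gem` with the arguments and return fields used here, as recalled from the package manual; please check the help pages before relying on it. The wrapper name `draw_gemplot` and its arguments are hypothetical. The depth calculation can be time-consuming, so the function is only defined here and not called.

```r
# Hypothetical helper: computes and draws the inner gem (bag) and outer gem
# (loop) for one group and returns the row indices flagged as outliers.
# Function and field names of RepeatedHighDim are assumptions; see the manual.
draw_gemplot <- function(D, col_gem, col_points = "green", grid.size = 20) {
  stopifnot(requireNamespace("RepeatedHighDim", quietly = TRUE),
            requireNamespace("rgl", quietly = TRUE))

  G   <- RepeatedHighDim::gridfun(D, grid.size = grid.size)  # 3D grid
  G$H <- RepeatedHighDim::hldepth(D, G, verbose = FALSE)     # halfspace depth
  dm  <- RepeatedHighDim::depmed(G)                          # depth median
  B   <- RepeatedHighDim::bag(D, G)                          # inner gem
  L   <- RepeatedHighDim::loop(D, B, dm = dm)                # outer gem

  out <- L$outliers == 1
  rgl::points3d(D[!out, 1], D[!out, 2], D[!out, 3], col = col_points)
  if (any(out)) {
    # Label outliers with their row index
    rgl::text3d(D[out, 1], D[out, 2], D[out, 3],
                texts = as.character(which(out)), col = "black")
  }
  RepeatedHighDim::gem(B$coords, B$hull, col_gem)
  RepeatedHighDim::gem(L$coords.loop, L$hull.loop, col_gem)
  invisible(which(out))
}

# Possible usage in an interactive session, with one data frame per group:
# rgl::open3d()
# draw_gemplot(D1, "red")
# draw_gemplot(D2, "blue")
# rgl::axes3d()
```

Calling the helper once per group reflects the idea that both groups share one plot while outliers are determined separately for each group.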
[Figure: the gemplots and the outliers]

### 3. Differential expression analysis

The effect measure typically used in differential expression analysis is the log fold change (logFC). While some analysts rely on p-values for the ranking and selection of genes, others rely on the logFC. However, while p-values incorporate information about variance and sample size, the logFC is based only on mean expression levels. Calculating confidence intervals for the logFC allows the logFC to be used as a complement to p-values (Jung et al., 2011). RepeatedHighDim provides functions to calculate confidence intervals given logFCs and their standard errors. Furthermore, confidence intervals can be FDR-adjusted, analogous to the FDR adjustment of p-values. Thus, logFCs with confidence intervals and p-values can lead to the same selections. The following example takes logFCs and their standard errors from the R-package limma.
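To illustrate the underlying arithmetic without a limma fit, the base-R sketch below simulates logFCs and standard errors and computes unadjusted and adjusted intervals by hand. The simulated values, the degrees of freedom, and all variable names are assumptions. The adjustment shown, which widens the intervals to level 1 - R * alpha / m, where R is the number of genes selected by the Benjamini-Hochberg procedure among m genes, is one common construction and may differ in detail from what the package functions do.

```r
set.seed(123)
m     <- 1000   # number of genes (assumed)
df    <- 18     # residual degrees of freedom (assumed)
alpha <- 0.05

# Simulated logFCs (100 genes with a true effect) and standard errors
logFC <- c(rnorm(100, mean = 2), rnorm(m - 100, mean = 0, sd = 0.3))
se    <- runif(m, 0.2, 0.5)

# Raw two-sided p-values and unadjusted (1 - alpha) confidence intervals
p_raw  <- 2 * pt(-abs(logFC / se), df)
q_raw  <- qt(1 - alpha / 2, df)
ci_raw <- cbind(lower = logFC - q_raw * se, upper = logFC + q_raw * se)

# BH-adjusted p-values and intervals at the matching adjusted level
p_bh      <- p.adjust(p_raw, method = "BH")
n_sel     <- max(sum(p_bh <= alpha), 1)   # number of selected genes (at least 1)
alpha_adj <- n_sel * alpha / m
q_adj     <- qt(1 - alpha_adj / 2, df)
ci_adj    <- cbind(lower = logFC - q_adj * se, upper = logFC + q_adj * se)

# Selection by adjusted p-value versus selection by adjusted interval
sel_p  <- p_bh <= alpha
sel_ci <- ci_adj[, "lower"] > 0 | ci_adj[, "upper"] < 0
table(sel_p, sel_ci)
```

In this construction a gene is selected by its adjusted p-value exactly when its adjusted interval excludes zero, which is the sense in which both criteria can lead to the same selections.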
[Figure: volcano plot with CIs for the logFC]

### 4. Gene set analysis

Besides studying expression changes of individual genes, a typical step in transcriptomics data analysis is to study effects of gene-sets. The most common type of gene-set analysis is enrichment analysis, which relies on the results of differential expression analysis: the occurrence of gene-set members among the top-ranked genes is compared to the occurrence of members among the non-top-ranked genes. Hence, enrichment methods are sometimes referred to as 'competitive' gene-set tests. In contrast, 'self-contained' gene-set tests directly compare the expression profile of the gene-set between different experimental groups (Jung et al., 2011). Consider gene expression data of a subset of 100 genes from two experimental groups. The expression data of this subset can simply be passed to the function RHighDim. In addition, the user needs to specify whether samples are paired or unpaired:
The summary can be displayed as follows:
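A sketch of the test call and its summary on simulated data follows. The matrices are oriented with features in rows and samples in columns, as described in the overview. The sample sizes, the small mean shift, and the exact signatures `RHighDim(X1, X2, paired = ...)` and `summary_RHD(...)` are assumptions recalled from the package manual; the package calls are skipped if the package is not installed.

```r
set.seed(1)
d  <- 100   # genes in the gene-set
n1 <- 10    # samples in group 1 (assumed)
n2 <- 10    # samples in group 2 (assumed)

# Rows = genes, columns = samples; group 2 has a small mean shift
X1 <- matrix(rnorm(d * n1, 0, 1),   nrow = d, ncol = n1)
X2 <- matrix(rnorm(d * n2, 0.1, 1), nrow = d, ncol = n2)

if (requireNamespace("RepeatedHighDim", quietly = TRUE)) {
  # Unpaired design: samples in X1 and X2 come from different units
  RHD <- RepeatedHighDim::RHighDim(X1, X2, paired = FALSE)
  RepeatedHighDim::summary_RHD(RHD)
}
```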
In case there are missing values, which sometimes occurs in protein expression data (Jung et al., 2014), the function GlobTestMissing can be used, which performs a permutation test. Here is an example with a subset of 100 proteins and 10 percent missing values.
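Such an example might look as follows. The data are simulated, with 10 percent of the entries in each group set to `NA` at random. The sample sizes, the number of permutations, and the signature `GlobTestMissing(X1, X2, nperm = ...)` are assumptions recalled from the package manual; the call is skipped if the package is not installed.

```r
set.seed(1)
d   <- 100   # proteins
n1  <- 10    # samples in group 1 (assumed)
n2  <- 10    # samples in group 2 (assumed)
tau <- 0.1   # proportion of missing values

# Rows = proteins, columns = samples
X1 <- matrix(rnorm(d * n1, 0, 1),   nrow = d, ncol = n1)
X2 <- matrix(rnorm(d * n2, 0.1, 1), nrow = d, ncol = n2)

# Set a random 10 percent of the entries in each matrix to missing
X1[sample(d * n1, round(tau * d * n1))] <- NA
X2[sample(d * n2, round(tau * d * n2))] <- NA

if (requireNamespace("RepeatedHighDim", quietly = TRUE)) {
  # Permutation test; a larger nperm gives a finer p-value resolution
  res <- RepeatedHighDim::GlobTestMissing(X1, X2, nperm = 100)
  print(res)
}
```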
### 5. Correlated binary data

Correlated binary variables regularly occur in biomedical research. In order to perform simulations with such data, methods for their artificial generation are necessary. This package provides functionality to sample a matrix of correlated binary data from a specified distribution with given covariance matrix and marginal probabilities. The generation is based on a genetic algorithm, in which a start matrix with the specified marginal probabilities is modified until the specified correlation structure is approached (Kruppa et al., 2018). In the following example, a representative matrix with specified marginal probabilities and specified correlation matrix is generated. Then, a random sample is drawn from this matrix.
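The idea can be illustrated with a strongly simplified base-R sketch. It is not the package's genetic algorithm but a plain stochastic search: entries are swapped within a column, which keeps the marginal probabilities fixed, and a swap is kept only if it brings the empirical correlation matrix closer to the target. The marginal probabilities, the target correlations, the matrix size, and the number of iterations are assumptions chosen for illustration.

```r
set.seed(1)
p <- c(0.5, 0.6, 0.3)                      # target marginal probabilities
R_target <- matrix(c(1.0, 0.3, 0.1,
                     0.3, 1.0, 0.2,
                     0.1, 0.2, 1.0), 3, 3) # target correlation matrix
k <- 1000                                  # rows of the representative matrix

# Start matrix: column j holds exactly round(k * p[j]) ones in random order
X <- sapply(p, function(pj) {
  ones <- round(k * pj)
  sample(rep(c(1, 0), c(ones, k - ones)))
})

# Distance between empirical and target correlation structure
err <- function(M) max(abs(cor(M) - R_target))
e0 <- err(X)
e  <- e0

for (iter in 1:5000) {
  j   <- sample(ncol(X), 1)
  idx <- sample(k, 2)
  if (X[idx[1], j] == X[idx[2], j]) next   # swap would change nothing
  X_new <- X
  X_new[idx, j] <- X[rev(idx), j]          # swap two entries within column j
  e_new <- err(X_new)
  if (e_new < e) {                         # keep only improving swaps
    X <- X_new
    e <- e_new
  }
}

# Draw a random sample of 10 rows from the representative matrix
S <- X[sample(k, 10, replace = TRUE), ]
round(cor(X), 2)
```

Because swaps happen only within a column, the column means of `X` stay equal to `p` throughout, and only the correlation structure changes.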
### Other contributors

Many thanks to
- Jochen Kruppa (University of Applied Sciences Osnabrück)
- Sergej Ruff (University of Veterinary Medicine Hannover)
### References

- Jung K, Becker B, Brunner B and Beissbarth T (2011). Comparison of Global Tests for Functional Gene Sets in Two-Group Designs and Selection of Potentially Effect-causing Genes. *Bioinformatics*, **27**, 1377-1383.
- Jung K, Dihazi H, Bibi A, Dihazi GH and Beissbarth T (2014). Adaption of the Global Test Idea to Proteomics Data with Missing Values. *Bioinformatics*, **30**, 1424-30.
- Jung K, Friede T and Beißbarth T (2011). Reporting FDR analogous confidence intervals for the log fold change of differentially expressed genes. *BMC bioinformatics*, **12**, 1-9.
- Kruppa J and Jung K (2017). Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots. *BMC bioinformatics*, **18(1)**, 1-10.
- Kruppa J, Lepenies B and Jung K (2018). A genetic algorithm for simulating correlated binary data from biomedical research. *Computers in biology and medicine*, **92**, 1-8.
- Rousseeuw PJ, Ruts I and Tukey JW (1999). The bagplot: a bivariate boxplot. *The American Statistician*, **53(4)**, 382-387.
### Contact

Prof. Dr. Klaus Jung
University of Veterinary Medicine Hannover, Foundation
Institute for Animal Breeding and Genetics
Bünteweg 17p, D-30559 Hannover, Germany
klaus.jung@tiho-hannover.de
www.klausjung-lab.de