Vasily V. Grinev
Candidate of Sciences in Biology, Associate Professor,
Scientific Head of the Sector of Human Molecular Genetics.
More information about our software development activity is available in the GitHub repository at https://github.com/VGrinev/transcriptome-analysis.
The RNAexploreR App (online test mode) provides a computational pipeline for analysing and predicting possible RNA variants based on graph models of the organization of human genes. The data analysis pipeline consists of five steps, arranged from left to right along the bookmarks panel:
Step 1 - PCA. Principal component analysis of the exon features. Selection of the principal components that together explain 95% of the variation in the data.
Step 2 - HC exons. Hierarchical clustering of exons characterized by the selected components. Splitting of exons into clusters and assignment of a unique Latin letter (from a to z) to each cluster.
Step 3 - Experimental RNA. Transformation of experimental RNA transcripts into the labels of the clusters in which the corresponding exons are located. Removal of duplicates.
Step 4 - Predicted RNA. Transformation of theoretical RNA transcripts into the labels of the clusters in which the corresponding exons are located. Removal of duplicates.
Step 5 - HC RNA. Hierarchical clustering of the pool of unique experimental and theoretical transcripts. Visualization of the results as a dendrogram and generation of the list of selected RNAs.
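Steps 3 and 4 amount to re-encoding every transcript as a string of cluster labels and dropping repeated strings. The following is a minimal Python sketch of that idea, not the RNAexploreR implementation; the exon identifiers, the exon-to-cluster mapping and the function name encode_transcripts are hypothetical.

```python
def encode_transcripts(transcripts, exon_to_cluster):
    """Map each transcript (a list of exon IDs) to a string of cluster labels,
    keeping only the first occurrence of each distinct label string."""
    seen = set()
    unique = []
    for exons in transcripts:
        label = "".join(exon_to_cluster[e] for e in exons)
        if label not in seen:
            seen.add(label)
            unique.append(label)
    return unique


# Hypothetical result of Step 2: exon ID -> cluster letter (a to z).
exon_to_cluster = {"e1": "a", "e2": "b", "e3": "a", "e4": "c"}

# Hypothetical transcripts given as ordered lists of exon IDs.
transcripts = [["e1", "e2", "e4"], ["e3", "e2", "e4"], ["e1", "e4"]]

labels = encode_transcripts(transcripts, exon_to_cluster)
```

Here the first two transcripts use different exons from the same clusters, so they collapse into one label string, which is what makes the later clustering of experimental and theoretical transcripts comparable.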
A high-level R function for collecting the leading-edge genes from GSEA output.
A set of high-level R functions for detecting significant open reading frames in nucleotide sequences and identifying premature translation termination codons.
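As an illustration of the two tasks named above, the Python sketch below scans the three forward frames for ATG-to-stop open reading frames and then flags a premature termination codon. It is not the authors' R code: the significance testing of ORFs is omitted, and the rule used in has_ptc (stop codon more than 50 nt upstream of the last exon-exon junction) is an assumption borrowed from the commonly cited convention, not something stated on this page. All function and parameter names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 0-based with end exclusive and the stop
    codon included, for ATG...stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def has_ptc(orf_end, last_junction, min_distance=50):
    """Assumed rule: the stop codon is premature if it ends more than
    min_distance nucleotides upstream of the last exon-exon junction."""
    return last_junction - orf_end > min_distance


orfs = find_orfs("CCATGAAATAGGG")
```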
R-based software for a user-defined non-ranked gene enrichment test. The user only needs to provide two files containing i) query and ii) subject gene sets. A query gene set (one or more) is a list of user-specified genes to be analysed (for instance, differentially expressed genes detected by limma). A subject gene set (one or more) is a list of reference genes of some specific category (for example, a KEGG pathway gene set). From these two sets of genes, the software calculates the standard fold enrichment score and the odds ratio. It also calculates accompanying statistics to assess the significance of the fold enrichment. The null hypothesis can be tested by one or more of five approaches: Fisher's exact test, Pearson's chi-squared test, the exact binomial test, the hypergeometric test and a random sampling test. Finally, two methods are included to adjust the p-values for multiple comparisons: i) Benjamini and Hochberg's method, which controls the false discovery rate, and ii) Holm's method, which controls the family-wise error rate.
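The quantities named above can be written down compactly. The Python sketch below computes the fold enrichment, the odds ratio and an upper-tail hypergeometric p-value from four counts, and implements the Benjamini-Hochberg and Holm adjustments. It is an illustration under stated assumptions rather than the software's own code: the background size N is a hypothetical parameter (the description does not say how the background is defined), and the function names are invented for this example.

```python
from math import comb


def enrichment(k, n, K, N):
    """k: query genes found in the subject set; n: query set size;
    K: subject set size; N: assumed background (universe) size."""
    fold = (k / n) / (K / N)
    # 2x2 table: both sets, query only, subject only, neither.
    a, b, c, d = k, n - k, K - k, N - n - K + k
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    # Upper-tail hypergeometric probability P(X >= k).
    p = sum(comb(K, i) * comb(N - K, n - i)
            for i in range(k, min(n, K) + 1)) / comb(N, n)
    return fold, odds_ratio, p


def benjamini_hochberg(pvals):
    """Step-up adjustment that controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm(pvals):
    """Step-down adjustment that controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # from the smallest p-value up
        running_max = max(running_max, min(1.0, pvals[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


fold, odds_ratio, p_value = enrichment(k=5, n=10, K=50, N=1000)
```

In R, the same adjustments are available through p.adjust with method "BH" or "holm"; the explicit loops here only make the two procedures visible.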
A new high-level R function, annotateEpigeneticFeatures, for fast annotation of experimentally detected exon-exon junctions with distances to the nearest epigenetic marks. The function accepts two input files: i) a tab-delimited TXT file with genomic coordinates of the experimentally detected exon-exon junctions, and ii) a tab-delimited TXT file with genomic coordinates of the epigenetic marks to be analysed. From these input data, annotateEpigeneticFeatures identifies the nearest upstream and downstream epigenetic marks for both the 5' and the 3' splice site of each exon-exon junction and calculates the respective distances. Finally, the function returns i) a data frame that contains the input experimental data and a set of new metadata columns with annotations, and ii) comprehensive descriptive statistics on the distances to the nearest epigenetic marks.
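The core lookup, finding the nearest mark on each side of a splice site, can be done with a binary search over sorted mark positions. The Python sketch below shows this under simplifying assumptions that are not taken from annotateEpigeneticFeatures itself: marks are reduced to single positions on one chromosome, strand is ignored, and "upstream" simply means a lower coordinate. The function name nearest_marks is hypothetical.

```python
from bisect import bisect_left, bisect_right


def nearest_marks(site, marks):
    """marks: sorted list of mark positions on the same chromosome.
    Returns (distance to the nearest mark at a lower coordinate,
             distance to the nearest mark at a higher coordinate),
    with None where no such mark exists."""
    i = bisect_left(marks, site)   # marks[:i] are strictly below the site
    j = bisect_right(marks, site)  # marks[j:] are strictly above the site
    upstream = site - marks[i - 1] if i > 0 else None
    downstream = marks[j] - site if j < len(marks) else None
    return upstream, downstream


# Hypothetical mark positions and one junction given by its two splice sites.
marks = [100, 500, 900]
junction = {"donor": 450, "acceptor": 950}
annotation = {name: nearest_marks(pos, marks) for name, pos in junction.items()}
```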
A set of new R functions for annotating experimentally detected exon-exon junctions according to modes (or types) of alternative splicing. The package includes: i) the function hnapRNA for calculating a hypothetical "non-alternative" precursor RNA based on reference RNA annotations for the gene(s) of interest, ii) the generic high-level function hnapRNAgenerator for integrative control of the main function hnapRNA, iii) the function modeEEJs for final classification of experimentally detected exon-exon junctions, and iv) the function modeStatistics for calculating summary statistics on modes of alternative splicing.
A new high-level R function, exonTypes, for fast reference-based functional annotation of experimentally detected exons. As input, the function uses a TxDb-like SQLite database of reference transcriptional models of genes and a tab-delimited TXT file with experimentally detected exons. The input TXT file should include four mandatory fields: i) seqnames (name of the chromosome or scaffold with the prefix "chr"), ii) start (start genomic coordinate of the exon), iii) end (end genomic coordinate of the exon), and iv) strand (strand of the exon). The function describes the status of each experimental exon according to five functional types: 5'UTR, CDS, 3'UTR, non-coding and/or multi-type exon. Finally, the function returns a GRanges object that contains the input experimental data and five new metadata columns with annotations.
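The classification idea can be sketched as an interval-overlap test against reference regions of each type. The Python example below is a simplified illustration, not the exonTypes implementation: it assumes the reference regions have already been extracted into plain coordinate lists for one chromosome and strand, it returns a single label instead of five metadata columns, and the region coordinates and function names are hypothetical.

```python
def overlaps(a, b):
    """True if the closed intervals a=(start, end) and b=(start, end)
    share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_type(exon, regions):
    """exon: (start, end); regions: dict mapping a type name to a list of
    (start, end) reference intervals. Returns the single matching type,
    'multi-type' if several types overlap, or None if nothing overlaps."""
    hits = [name for name, intervals in regions.items()
            if any(overlaps(exon, iv) for iv in intervals)]
    if not hits:
        return None
    return hits[0] if len(hits) == 1 else "multi-type"


# Hypothetical reference regions on one chromosome and strand.
regions = {
    "5'UTR": [(100, 150)],
    "CDS": [(151, 400), (600, 800)],
    "3'UTR": [(801, 900)],
    "non-coding": [(2000, 2500)],
}

types = [exon_type(e, regions)
         for e in [(120, 140), (140, 200), (2100, 2200), (1000, 1100)]]
```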
A new high-level R function, overlapJunctions, for calculating overlaps between reference and experimentally detected exon-exon junctions. The input file for overlapJunctions may contain output from the limma/diffSplice or JunctionSeq pipelines with data on differential usage of exon-exon junctions in two (or more) experimental conditions. The function returns a list that includes the consolidated results of the calculations.
A new high-level R function, splDistance, for consolidating global statistics on splicing distances. The input file for splDistance may contain output from the limma/diffSplice or JunctionSeq pipelines with data on differential usage of exon-exon junctions in two (or more) experimental conditions. The function produces an object of the new class splicingDistances that contains the input data as well as all calculated statistics.
A high-level R function for filtering reads with malformed CIGAR strings out of BAM files. The current experimental version of the function works only with BAM (not SAM) files and handles only two types of bad CIGAR: i) a CIGAR operation has zero length; ii) the CIGAR M operator maps off the end of the reference. In addition, users can specify their own list of unwanted reads.
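The two checks can be expressed on a CIGAR string directly. The Python sketch below works on plain strings rather than BAM records and is only an illustration of the logic, not the R function: the name bad_cigar and its arguments are hypothetical, and the second check uses the full reference span of the alignment (operations M, D, N, = and X consume the reference, as defined in the SAM specification) rather than the M operator alone.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def bad_cigar(cigar, pos, ref_length):
    """cigar: CIGAR string; pos: 1-based leftmost mapping position;
    ref_length: length of the reference sequence.
    Returns a reason string for a bad CIGAR, or None if both checks pass."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if any(n == 0 for n, _ in ops):
        return "zero-length op"
    ref_span = sum(n for n, op in ops if op in REF_CONSUMING)
    if pos + ref_span - 1 > ref_length:
        return "maps off end of reference"
    return None


# Hypothetical reads as (name, CIGAR, position) on a 100-nt reference.
reads = [("r1", "10M0I5M", 1), ("r2", "50M", 60), ("r3", "50M", 51)]
kept = [name for name, cigar, pos in reads if bad_cigar(cigar, pos, 100) is None]
```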
A high-performance, high-level R function for filtering transcripts assembled by Cufflinks (or similar tools). The filtering procedure includes the following steps: i) removing unstranded transcripts; ii) removing transcripts that match two different strands; iii) removing records from non-canonical chromosomes; iv) removing one-exon transcripts; v) removing transcripts that are too short; vi) removing transcripts with exon(s) that are too short; vii) removing transcripts with intron(s) that are too short; and viii) removing transcripts with too low abundance. Each step is controlled by its own arguments. The results are stored in a GTF/GFF file or in a local SQLite database.
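Most of the listed steps reduce to simple per-transcript predicates. The Python sketch below illustrates steps i and iii to viii on a dictionary-based transcript record; step ii is omitted because a single strand field cannot represent a transcript that matches two strands. Everything here is an assumption made for illustration: the record layout, the function name, the threshold values and the human chromosome set are hypothetical and are not the defaults of the R function.

```python
# Assumed canonical chromosomes (human, "chr" prefix).
CANONICAL = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def keep_transcript(tx, min_length=200, min_exon=25, min_intron=50, min_fpkm=1.0):
    """tx: dict with 'chrom', 'strand', 'fpkm' and 'exons', the latter a
    coordinate-sorted list of (start, end) pairs, 1-based and closed."""
    exons = tx["exons"]
    if tx["strand"] not in ("+", "-"):      # i) unstranded
        return False
    if tx["chrom"] not in CANONICAL:        # iii) non-canonical chromosome
        return False
    if len(exons) < 2:                      # iv) one-exon transcript
        return False
    exon_lengths = [end - start + 1 for start, end in exons]
    intron_lengths = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    if sum(exon_lengths) < min_length:      # v) transcript too short
        return False
    if min(exon_lengths) < min_exon:        # vi) exon too short
        return False
    if min(intron_lengths) < min_intron:    # vii) intron too short
        return False
    return tx["fpkm"] >= min_fpkm           # viii) abundance too low


good = {"chrom": "chr1", "strand": "+", "fpkm": 2.0,
        "exons": [(100, 299), (400, 599)]}
```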
Identification of PTC in transcripts
This is R code for fast annotation of transcripts with premature termination codons.
This is R code for easy and fast conversion of the standard (linear) transcriptional models of genes into directed acyclic weighted exon graphs. The code makes it possible to i) transform a GTF/GFF file with gene annotations into a local SQLite database, ii) retrieve the metadata from the GTF/GFF file (if there is any relevant information), iii) reconstruct the exon graph (from a TranscriptDb object) as a list of edges, iv) assign weights (from the metadata) to the edges, v) convert the list of edges into a directed acyclic weighted exon graph (as an object of class igraph) and vi) save the exon graph as a list of edges with weights in a tab-delimited TXT file.
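Steps iii, iv and vi can be illustrated without any genomics library: each pair of consecutive exons in a transcript becomes a directed edge, and edges shared by several transcripts accumulate weight. The Python sketch below is an illustration only. Summing per-transcript weights is an assumption (the description does not say how metadata weights are combined), and the exon identifiers, function names and column names are hypothetical.

```python
from collections import defaultdict


def exon_graph_edges(transcripts, weights=None):
    """transcripts: dict mapping a transcript ID to its exons in
    transcription order. weights: optional dict of per-transcript weights
    (default 1.0 each). Returns {(from_exon, to_exon): summed weight}."""
    edges = defaultdict(float)
    for tx_id, exons in transcripts.items():
        w = 1.0 if weights is None else weights[tx_id]
        for src, dst in zip(exons, exons[1:]):
            edges[(src, dst)] += w
    return dict(edges)


def edges_to_tsv(edges):
    """Serialize the edge list as tab-delimited text with a header row."""
    rows = ["from\tto\tweight"]
    rows += [f"{src}\t{dst}\t{w}" for (src, dst), w in sorted(edges.items())]
    return "\n".join(rows)


# Hypothetical gene with two transcripts sharing their first two exons.
transcripts = {"t1": ["e1", "e2", "e3"], "t2": ["e1", "e2", "e4"]}
edges = exon_graph_edges(transcripts)
table = edges_to_tsv(edges)
```

Because exons are always connected in transcription order, the resulting graph has no cycles as long as the exon order is consistent across transcripts.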
This is a consolidated R-based wrapper for analysis of differential RNA splicing with linear modelling. The pipeline includes the following basic steps: 1) loading the primary counts matrix into the R workspace; 2) filtering the primary counts matrix; 3) wrapping the counts matrix in a digital gene expression object; 4) estimating the normalization (scaling) factors to calculate an effective size for each RNA-seq library using the “trimmed mean of M-values” normalization method; 5) applying voom normalization and transformation to counts data that show some degree of heteroscedasticity; 6) fitting linear models to the normalized and transformed counts; 7) analysing differential splicing; 8) visualizing and inspecting the results (including volcano plots); 9) consolidating and saving the results of interest.
A set of R functions for identifying significant open reading frames in nucleotide sequences using a multinomial model.
Subjunc-based alignment of the RNA-seq reads and identification of exon-exon junctions
This is an R-based wrapper for alignment of RNA-seq reads and identification of exon-exon junctions with the seed-and-vote approach described in “Liao Y., Smyth G. K., Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013 May 1;41(10):e108. doi: 10.1093/nar/gkt214”.
Analysis of the ChIP-seq data
This R-based code is designed to identify significant peaks in ChIP-seq data. The pipeline includes indexing of the reference genome, local alignment of the DNA-seq reads, peak calling with a fully Bayesian hidden Markov model and Monte Carlo simulations, filtering of peaks by posterior probability and depth/coverage, and saving of the final results.
Analysis of the MALDI-TOF spectra
This R-based code is designed to assess the similarity of MALDI-TOF spectra. The pipeline includes loading raw data from .mzXML file(s) into the R workspace, preprocessing every spectrum, creating a consolidated matrix of spectra, comparing spectra pairwise by different approaches, calculating basic statistics, and preparing and saving the numerical results as standard tab-delimited tables and plots. The similarity of the spectra is estimated using Pearson's r, Spearman's rho, Euclidean distance, principal component analysis and/or the spectral angle mapper. Finally, samples can be clustered according to the selected metric of spectral similarity.
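Four of the listed similarity measures have short closed-form definitions. The Python sketch below computes them for two spectra that are assumed to be already preprocessed and binned onto a common m/z axis, so that they are intensity vectors of equal length. It illustrates the measures themselves and is not the pipeline's R code; the example intensities are invented.

```python
from math import acos, sqrt


def pearson(x, y):
    """Pearson's correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(v):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def spectral_angle(x, y):
    """Angle in radians between the two intensity vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return acos(max(-1.0, min(1.0, dot / norm)))  # clamp rounding error


# Hypothetical binned intensities; the second spectrum is a scaled copy.
s1 = [1.0, 2.0, 3.0, 4.0]
s2 = [2.0, 4.0, 6.0, 8.0]
```

For these two spectra the correlation-based measures and the spectral angle report identical shape, while the Euclidean distance is non-zero because it is sensitive to overall intensity.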
CelNetAnalyzer is a simple-to-use Java-based software package for topological analysis of large undirected cellular networks. The software is operated through a graphical user interface and returns a comprehensive list of topological indices. The returned list of structural metrics is oriented towards cellular networks and includes degree and neighbourhood, clustering, distance, centrality and heterogeneity indices, as well as simple cycles, compositional complexity and the Shannon information entropy of the network. Comparative studies have shown that CelNetAnalyzer calculates these parameters significantly faster than competing tools, thanks to parallelization and to enhanced and newly developed algorithms. CelNetAnalyzer is an open-source project and is freely distributed for non-commercial use. CelNetAnalyzer requires Java Platform, Standard Edition 6 or higher. The downloadable archive contains the GUI version of the software, the source code, the user manual, a test network and the results of the topological analysis of this network.
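To make a few of these indices concrete, the Python sketch below computes node degrees, the local clustering coefficient and a Shannon entropy for a small undirected network stored as an adjacency dictionary. It is not CelNetAnalyzer code (which is written in Java), and the entropy is taken over the degree distribution, which is an assumption: the page does not say how CelNetAnalyzer defines the entropy of a network. The example network and function names are hypothetical.

```python
from collections import Counter
from math import log2


def degrees(adj):
    """Number of neighbours of every node."""
    return {v: len(nb) for v, nb in adj.items()}


def clustering(adj, v):
    """Local clustering coefficient: the fraction of pairs of neighbours
    of v that are themselves connected."""
    nb = adj[v]
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))


def degree_entropy(adj):
    """Shannon entropy (in bits) of the degree distribution (assumed
    definition, see the note above)."""
    n = len(adj)
    counts = Counter(len(nb) for nb in adj.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


# Hypothetical undirected network: a triangle A-B-C with a pendant node D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
```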
Page updated: 01.03.2019 20:50