Genome-Wide SNP Data Analysis

SNPHarvester -- SNPHarvester is a method for detecting SNP–SNP interactions in GWA studies. It creates multiple paths in which the visited SNP groups tend to be statistically associated with the disease, and then harvests the SNP groups that pass the statistical tests. This greatly reduces the number of SNPs to consider, so existing tools can then be applied directly to detect epistatic interactions.

MegaSNPHunter -- MegaSNPHunter takes case-control genotype data as input and produces a ranked list of multi-SNP interactions. The whole genome is first partitioned into multiple short subgenomes. A boosting tree classifier based on multi-SNP interactions is built for each subgenome and then used to measure the importance of SNPs. The method keeps the relatively more important SNPs from all subgenomes and lets them compete with each other in the same way at the next level. The competition terminates when the number of selected SNPs is less than the size of a subgenome.
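The multi-level competition described above can be sketched as follows. This is only an illustration of the partition, rank, keep, and repeat structure, not MegaSNPHunter's code: the function names, the keep_fraction parameter, and the toy data are all hypothetical, and a simple single-SNP score (difference in mean genotype between cases and controls) stands in for the importance that the real tool derives from a boosting tree classifier over multi-SNP interactions.

```python
import random

def association_score(genotypes, labels, snp):
    # Stand-in importance (assumption): absolute difference in mean genotype
    # between cases (label 1) and controls (label 0). The real tool uses a
    # boosting tree classifier instead.
    cases = [g[snp] for g, y in zip(genotypes, labels) if y == 1]
    controls = [g[snp] for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def hierarchical_select(genotypes, labels, subgenome_size, keep_fraction=0.5):
    # keep_fraction is a hypothetical knob; the source does not say how many
    # SNPs each subgenome passes on to the next level.
    if subgenome_size < 2 or not 0 < keep_fraction < 1:
        raise ValueError("need subgenome_size >= 2 and 0 < keep_fraction < 1")

    def score(snp):
        return association_score(genotypes, labels, snp)

    snps = list(range(len(genotypes[0])))
    # Stop once fewer SNPs remain than fit in one subgenome.
    while len(snps) >= subgenome_size:
        survivors = []
        for start in range(0, len(snps), subgenome_size):
            block = snps[start:start + subgenome_size]
            ranked = sorted(block, key=score, reverse=True)
            n_keep = max(1, int(len(block) * keep_fraction))
            survivors.extend(ranked[:n_keep])
        snps = survivors
    return sorted(snps, key=score, reverse=True)

# Toy data: 50 cases, 50 controls, 40 SNPs coded 0/1/2; SNP 7 is made causal.
random.seed(0)
labels = [1] * 50 + [0] * 50
genotypes = [[random.choice([0, 1, 2]) for _ in range(40)] for _ in labels]
for row, y in zip(genotypes, labels):
    row[7] = 2 if y == 1 else 0

selected = hierarchical_select(genotypes, labels, subgenome_size=8)
print(selected)
```

In this toy run the planted SNP survives every level because it always wins its own block, which is the behaviour the competition scheme relies on.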

SNPRuler -- SNPRuler is a learning approach based on predictive rule inference for finding disease-associated epistatic interactions in genome-wide case-control studies.

BOOST -- BOolean Operation based Screening and Testing (BOOST) is a method for detecting gene-gene interactions. It makes it feasible to examine all pairwise interactions in genome-wide case-control studies. Interaction analyses were carried out on seven data sets from the Wellcome Trust Case Control Consortium. Each analysis took less than 60 hours to completely evaluate all pairs of roughly 360,000 SNPs on a standard 3.0 GHz desktop with 4 GB of memory running Windows XP.
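To put the figures above in perspective, the number of SNP pairs and the implied minimum throughput can be worked out directly. This is plain arithmetic on the numbers quoted in the description, treating "roughly 360,000" as exactly 360,000 and "less than 60 hours" as a 60-hour upper bound; it is not a benchmark of BOOST itself.

```python
def pair_count(n_snps):
    # Number of unordered SNP pairs: n choose 2.
    return n_snps * (n_snps - 1) // 2

n_snps = 360_000              # approximate figure quoted in the description
pairs = pair_count(n_snps)
seconds = 60 * 3600           # 60-hour upper bound on the running time
min_rate = pairs / seconds    # lower bound on pairs evaluated per second

print(f"{pairs:,} pairs")                       # 64,799,820,000 pairs
print(f"at least {min_rate:,.0f} pairs/second") # at least 299,999 pairs/second
```

In other words, an exhaustive scan at this scale means testing about 65 billion pairs, or roughly 300,000 pairs per second on a single desktop.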

GBOOST -- GBOOST is a GPU implementation of BOOST based on the CUDA technology by Nvidia.

PBOOST -- PBOOST is a GPU-based tool for parallel permutation tests in genome-wide association studies.

GBOOST 2.0 -- GBOOST 2.0 is a GPU implementation of an advanced BOOST method (with covariate adjustment) based on the CUDA technology by Nvidia.

PLA -- Piecewise-constant and Low-rank Approximation for Multi-sample aCGH Data Analysis.

RPower -- RPower is an R package to estimate power and determine the sample size for replication studies of genome-wide association studies. The power estimation method is based on Empirical Bayes, which is used to reduce the "winner's curse" bias in the primary study. The package can also estimate the credible interval of the power.

RRate -- RRate is an R package that implements a replication rate (RR) estimation method. The replication rate is the Bayesian probability of replicating a statistically significant association in a GWAS.

Jlfdr -- Jlfdr is an R package to implement a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). This method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate (Fdr) at a certain level.

RFdr -- RFdr is an R package to implement a novel method to determine significance levels for two-stage GWASs. It finds the most powerful significance levels when controlling the false discovery rate (Fdr) in the two-stage study at a certain level.

RRIntCC -- RRIntCC is a region-based interaction detection method based on an LD contrast test for case-control studies in genomic analysis.

Mass Spectrometry Data Analysis

Peptidequant -- Peptidequant is an optimization-based peptide quantification tool. It is designed to address two challenges in peptide abundance estimation: peptide overlapping and peak intensity variation.

SyncPro -- SyncPro is a visualization package showing multiple processed mass spectrometry data sets simultaneously.

ECL (deprecated) -- ECL is an exhaustive search tool for identifying cross-linked peptides using the whole database.

ECL2 -- ECL2 is an advanced version of ECL. It has linear computational complexity and supports multi-threaded computation and multiple variable modifications.

PIPI -- PIPI is a tool for identifying peptides with unlimited PTM types.

Xolik -- Xolik is an efficient tool for identifying cross-linked peptides in a large sequence database.

PST -- PST is a parallel simulation tool for studying open search methods in the identification of peptides with post-translational modifications (PTMs).

ECL-PF -- ECL-PF is a highly sensitive search engine for cleavable XL-MS data.

ECL 3.0 -- ECL 3.0 is a sensitive peptide identification tool for both non-cleavable and cleavable cross-linking mass spectrometry.

PIPI2 -- PIPI2 is a sensitive tag-based database search tool for identifying peptides with multiple PTMs.

SeaPIC -- SeaPIC is a screening tool for investigating PTMs in cross-linking datasets.

Computer Vision & Image Processing

DECOLOR -- A tool for automatic object detection using motion information.

Source code

Mass Spectrometry Data Analysis

Optimization Based Peptide Mass Fingerprinting for Protein Mixture Identification

  • Two effective algorithms that identify protein mixtures using single-stage MS data.
  • Source code and test data can be found here.

Improving Peptide Identification with Single-Stage Mass Spectrum Peaks

  • A regularization-based method for adjusting the original peptide identification scores with single-stage MS data.
  • Source code and test data can be found here.

A Partial Set Covering Model for Protein Mixture Identification Using Mass Spectrometry Data

  • A unified optimization model that identifies protein mixtures using both MS data and MS/MS data.
  • Source code and test data can be found here.
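
For readers unfamiliar with the partial set covering idea named in the title above, here is a minimal sketch of the general problem: choose a small set of proteins whose predicted peaks explain at least a given fraction of the observed peaks. It uses the textbook greedy heuristic, which is an assumption made for illustration and not necessarily how the authors solve their model; the function name, the coverage parameter, and the toy peak identifiers are all hypothetical.

```python
def greedy_partial_cover(candidates, observed, coverage=0.9):
    # candidates: dict mapping a protein name to the set of peaks it predicts.
    # observed:   set of peaks seen in the spectrum.
    # coverage:   fraction of observed peaks that must be explained
    #             (hypothetical parameter for this sketch).
    target = coverage * len(observed)
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while len(covered) < target and remaining:
        # Pick the protein explaining the most still-unexplained peaks;
        # iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining),
                   key=lambda p: len((remaining[p] & observed) - covered))
        gain = (remaining[best] & observed) - covered
        if not gain:
            break  # no remaining protein explains anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

# Toy example: peaks are just integer identifiers.
observed = set(range(1, 11))
candidates = {
    "A": {1, 2, 3, 4, 5},
    "B": {5, 6, 7, 8},
    "C": {9},
    "D": {1, 2},
}
chosen, covered = greedy_partial_cover(candidates, observed, coverage=0.9)
print(chosen, sorted(covered))
```

Allowing partial coverage means a few unexplained peaks (for example, noise) do not force extra proteins into the answer, which is the practical appeal of the partial variant over full set cover.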