  • Publication
    Solving the differential peak calling problem in ChIP-seq data
    (2016)
    Allhoff, M.
    Gene expression, the process of selectively reading genetic information, is a mechanism essential to life in all known organisms. Key players in the regulation of gene expression are proteins that interact with DNA. DNA-protein interaction sites are now analyzed genome-wide with chromatin immunoprecipitation followed by sequencing (ChIP-seq). ChIP-seq makes it possible to assign a discrete value to each genomic location, where the value corresponds to the strength of the protein binding event. Peaks, that is, regions with a signal higher than expected by chance, correspond to the protein-DNA interaction sites. Detecting such peaks is the fundamental computational challenge in ChIP-seq analysis. Like every complex wet lab protocol, ChIP-seq is subject to a wide range of potential biases. To reduce the effect of unwanted biases, ChIP-seq experiments are often replicated, which helps to distinguish between biological and random events and to verify the reliability of all experimental steps. Complex ChIP-seq based studies emphasize the demand for methods that compare replicated ChIP-seq signals associated with distinct biological conditions. These studies investigate the differential peak calling problem, which is the subject of current biological and medical research. Solving this problem leads to a deeper understanding of gene expression regulation. Several computational challenges arise when detecting differential peaks (DPs). First, the shape of ChIP-seq peaks depends on the underlying protein of interest. For ChIP-seq data of histone modifications, the DNA-protein interactions occur in mid-size to large domains. Here, domains can span several hundred base pairs and may have intricate patterns of gains and losses of ChIP-seq signal within the same domain. In contrast, ChIP-seq signals from transcription factors mostly occur as small isolated peaks.
Second, artefacts, which arise due to the complexity of the ChIP-seq protocol, produce signals with distinct signal-to-noise ratios, even when the experiments are performed in the same lab and follow the same protocol. Furthermore, different sequencing depths between samples complicate the comparison of their ChIP-seq signals. Hence, a robust normalization method for the ChIP-seq signals is required. Finally, clinical samples, where patients have distinct genetic backgrounds, introduce further variation to the ChIP-seq signals. Moreover, replicated ChIP-seq experiments introduce further complexity, which has to be reflected by the use of sophisticated statistical models. Current differential peak calling methods fail to cover all listed challenges. They apply heuristic signal segmentation strategies, such as window-based approaches, to identify DPs. There are only a few attempts to normalize ChIP-seq data. Furthermore, most methods do not support replicates. Hence, there is a clear need for computational methods that address all challenges. In this thesis, we propose ODIN and THOR, algorithms to determine changes of protein-DNA complexes for distinct cellular conditions in ChIP-seq experiments without and with replicates. We apply a statistical model (a hidden Markov model) to call DPs and to handle replicates. We also introduce a novel normalization strategy that is based on control regions. These features lead to comprehensive algorithms that accurately call DPs in ChIP-seq signals. Moreover, the evaluation of differential peak calling algorithms is an open problem. The research community lacks both a direct metric to rate the algorithms and data sets with a genome-wide map of DNA-protein interaction sites that can serve as gold standards. We propose two alternative approaches for the evaluation. First, we present indirect metrics to quantify DPs by taking advantage of gene expression data. Second, we use simulation to create customized artificial gold standards.