About the GATC Platform



Introduction


Datasets obtained from microarray analyses at the Erasmus MC Rotterdam and elsewhere are growing exponentially in size and number. Within the research area of breast, endometrium, and prostate cancer alone, we already have more than 10,000,000 ratio data points and have identified hundreds of genes potentially relevant to cancer progression. Grouping and prioritizing all these genes for further biochemical analysis is clearly beyond unaided human cognition. This process must become high-throughput and automated. The criteria for selecting the few genes for further pre-clinical testing or more in-depth basic research can be phrased as common biological questions, such as:
Many of the answers to these questions can be found by searching (combinations of) public databases including PubMed, OMIM, GeneCards, Entrez, SwissProt, KEGG, CGAP, GO, Ensembl, and many others. Functional relationships can also be retrieved and predicted using programs such as GSEA, Ingenuity, OmniViz, text mining programs such as Anni, and many other related knowledge databases. Therefore, data mining of linked databases and prediction programs can provide answers to or support our quest for the biological mechanisms underlying cancer progression.

A second essential part of the new biology era is the integration and mining of different datasets (from the same samples or disease). Linking exon array data to SNP array data and deep sequencing records is still a major challenge. The amount of information within exon arrays on alternative splicing or alternative promoter usage is already a lot to handle; how, then, can it be linked automatically to SNP or paired-end sequencing data?

Objectives:

The GATC Platform set out in 2002 to generate a data warehouse to perform automated and batch-wise searches of differentially expressed genes by linking our microarray data to many different public databases using the Sequence Retrieval System (SRS). Since then, various programs and tools (so-called Mappers) have been created by bioinformaticians and bioinformatics students to extract relevant biological information from our different array and sequencing platforms. Our objectives are:
  1. Establishment of the GATC Portal for high throughput and automated queries within databases:
    1. collect, link, and update informative databanks
    2. integrate the databases within a data retrieval and mining system
    3. integrate visualization tools
  2. Research on and development of novel data mining/visualization tools to integrate them in the GATC Portal

Programs and Tools:

Several programs, database systems and tools have been developed by bioinformaticians within the Erasmus MC. Some of these programs have been published and are readily available. Others will be made available upon request. Please contact Guido Jenster (g.jenster@erasmusmc.nl) for more information.

1. The SRS data warehouse system (SRS 7.1.3)

(Antoine Veldhoven, Don de Lange)

In order to collect, link, and mine databases, we use the Sequence Retrieval System, originally developed to query common databases such as UniGene, SwissProt, MedLine, OMIM and about 1000 other databases (Etzold T, Argos P. Comput Appl Biosci. 1993 Feb;9(1):49-57). SRS has proven very useful for linking all these databases together and for integrating your own datasets, such as microarray data or protein expression analyses (Veldhoven et al., BMC Bioinformatics. 2005 Jul 27;6:192). Since all data is linked, even cross-species, we can, for example, very easily retrieve the genes/proteins that are upregulated in human prostate cancer AND identified as orthologous serum proteins in a mouse serum protein sequencing database. The data warehouse is mainly used for retrieving expression and ratio data of many genes from many different microarray studies to create a single table with all available information.




2. The Auto-Upload Tool

(Antoine Veldhoven, Don de Lange)

The Auto-Upload Tool has been developed for users of SRS to upload their data table (typically array data) into SRS (Veldhoven et al., BMC Bioinformatics. 2005 Jul 27;6:192). The database can be linked to the other databases in SRS and user permission can be set.




3. SRS Output Converters

(Don de Lange, Farshid Arasteh Samani)

Unfortunately, SRS 7.1.3 has some serious bugs with respect to exporting results (tables of linked entries) for further processing in, for example, Excel. XML-2-Text and TXT-2-Text converters were developed to fix these problems; they also provide additional options such as averaging of probesets from the same gene.
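
The probeset-averaging option mentioned above can be illustrated with a minimal sketch. It is not the converters' actual code: the function name, the row layout (probeset ID, gene symbol, value) and the use of an arithmetic mean are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by_gene(rows):
    """Average the values of all probesets that map to the same gene.

    rows: iterable of (probeset_id, gene_symbol, value) tuples.
    This row layout is hypothetical, not the SRS export format.
    """
    grouped = defaultdict(list)
    for _probeset_id, gene_symbol, value in rows:
        grouped[gene_symbol].append(value)
    # One averaged value per gene.
    return {gene: mean(values) for gene, values in grouped.items()}
```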


4. AltSplice Mapper

(Barry van der Mast)

From exon array expression data, one can retrieve much more information than just gene expression. Based on the relative expression level of each individual probeset (often corresponding to one particular exon), one can find evidence for alternative splicing, alternative promoter usage, gene fusion and partial gene deletions. The program reads Affymetrix Exon array tab-delimited input files and normalizes exon expression as compared to the different arrays provided to the Mapper. If exon expression in one of the samples is deviant by a factor you determine, the program will report that particular gene. In addition, groups of exons can be defined to represent the beginning, middle and end of a gene. The relative ratios between these groups are calculated and outliers reported. Such abnormal ratios can indicate alternative promoter usage, gene fusions or partial deletions.
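
The deviation test described above could look roughly like the following sketch. The source does not specify how the Mapper normalizes, so this assumes one plausible scheme: each exon is divided by the gene-level mean in the same sample, and the result is compared with the median across samples. The function name and data layout are likewise hypothetical.

```python
from statistics import mean, median


def deviant_exons(exon_expr, factor=2.0):
    """Report exons whose relative usage deviates in a single sample.

    exon_expr: {exon_id: [expression per sample]} for one gene,
               linear scale, all values > 0 (hypothetical layout).
    factor:    user-chosen fold deviation that triggers a report.
    Returns a list of (exon_id, sample_index, fold) tuples.
    """
    n_samples = len(next(iter(exon_expr.values())))
    # Gene-level signal per sample: mean over all exons of the gene.
    gene_level = [
        mean(values[s] for values in exon_expr.values())
        for s in range(n_samples)
    ]
    hits = []
    for exon_id, values in exon_expr.items():
        # Relative exon usage: exon signal / gene-level signal per sample.
        relative = [v / g for v, g in zip(values, gene_level)]
        reference = median(relative)
        for s, r in enumerate(relative):
            fold = r / reference
            if fold >= factor or fold <= 1 / factor:
                hits.append((exon_id, s, round(fold, 2)))
    return hits
```

Comparing against the median across samples keeps a single aberrant array from shifting the reference; the same idea extends to the beginning, middle and end exon groups by averaging each group before taking ratios.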




5. Anni 2.0

(Rob Jelier, Martijn Schuemie, Jan Kors)

Within the Department of Medical Informatics (Erasmus MC), an online tool has been developed to aid the biomedical researcher with a broad range of information needs (Jelier et al., Genome Biology 2008 Jun 12, 9(6):R96). Anni provides an ontology-based interface to MEDLINE and retrieves documents and associations for several classes of biomedical concepts, including genes, drugs and diseases, with established text-mining technology. In the published study, Anni's usability is illustrated by applying the tool to two use cases: interpretation of a set of differentially expressed genes, and literature-based knowledge discovery.




6. CoPub Mapper

(Blaise Alako, Antoine Veldhoven, Sjozef van Baal, Ton Rullmann, Jan Polman)

CoPub Mapper is a program established in collaboration between the Erasmus MC and Organon N.V. (currently part of Schering-Plough) (Alako et al., BMC Bioinformatics 2005, 1471-2105-6-51). The CoPub Mapper database can be used to identify genes co-published with other genes and keywords such as diseases, tissues, and gene ontology terms (molecular function, biological process, cellular component). These searches can be performed using a single gene, a single keyword, or using a batch of genes. An updated version is available at http://services.nbic.nl/cgi-bin/copub/CoPub.pl.




7. GO Mapper

(Marcel Smid, Lambert Dorssers)

GO Mapper is an application that uses the functional annotation of the Gene Ontology consortium to analyze gene expression data (Smid and Dorssers, Bioinformatics 2004, 20(16):2618-2625). The application loads microarray or SAGE derived data, links the genes to GO-terms and then uses the actual measured expression level to weigh the associated terms. The aggregate of the weighted GO-terms shapes the functional profile of the experiment. The weight of a term is calculated as an Expression Quotient (EQ); In short, the EQ is calculated as the quotient of the average expression ratio of the genes associated with a GO-term and the average expression ratio of all genes measured on the array. The GO Mapper can be used for analysis of human, mouse, rat, zebrafish, fruitfly and yeast data.
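
The EQ definition above translates directly into a few lines of code. This is a minimal sketch of that quotient only, not GO Mapper itself; the function name, the dictionary inputs and the term labels used in examples are assumptions.

```python
from statistics import mean


def expression_quotients(ratios, gene_to_terms):
    """Compute an Expression Quotient (EQ) per GO term.

    ratios:        {gene: expression ratio} for all genes on the array.
    gene_to_terms: {gene: [GO terms]} annotation (hypothetical layout).
    EQ = mean ratio of the genes annotated with the term
         / mean ratio of all genes measured on the array.
    """
    overall = mean(ratios.values())
    by_term = {}
    for gene, terms in gene_to_terms.items():
        if gene in ratios:  # ignore annotated genes that were not measured
            for term in terms:
                by_term.setdefault(term, []).append(ratios[gene])
    return {term: mean(values) / overall for term, values in by_term.items()}
```

An EQ above 1 indicates that the genes linked to a term are, on average, expressed more strongly than the array as a whole.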




8. Position Mapper

(Anouk Heine)

Position Mapper compares two lists of genomic features (genes, miRNAs, SNPs, etc.) for overlap based on chromosome and base pair position information provided. This program was developed to identify which miRNAs are located within known genes and which miRNAs are represented by the probesets on Affymetrix Exon arrays.
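
A position-based comparison of this kind can be sketched as follows. The tuple layout, the inclusive coordinates and the function name are assumptions for illustration; the actual program's input format is not described here.

```python
from collections import defaultdict


def overlapping_features(list_a, list_b):
    """Return (name_a, name_b) pairs of features that overlap.

    Each feature is (name, chromosome, start, end) with inclusive
    base pair coordinates (hypothetical layout).
    """
    # Index the second list by chromosome to avoid cross-chromosome checks.
    by_chrom = defaultdict(list)
    for name, chrom, start, end in list_b:
        by_chrom[chrom].append((start, end, name))

    pairs = []
    for name_a, chrom, start_a, end_a in list_a:
        for start_b, end_b, name_b in by_chrom.get(chrom, []):
            # Two intervals overlap when each starts before the other ends.
            if start_a <= end_b and start_b <= end_a:
                pairs.append((name_a, name_b))
    return pairs
```

For very large lists, sorting each chromosome's features or using an interval tree would scale better than this pairwise check.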




9. Prostate Cancer Homozygous Deletion Mapper (PCHD)

(Don de Lange, Paul Verhagen)

In order to identify short recurrent sequences near genomic regions of frequent DNA changes (such as homozygous deletions), PCHD was written to store HD and LOH information and to compare sequences within these regions to one another. The PCHD database has a user-friendly interface and can be utilized to upload single sequence locations or large SNP data in a batch-wise fashion. Data from this database can be exported as a BED file to be visualized in the UCSC Genome Browser. For PCHD Blast, the Blast algorithm was adapted to handle FASTA sequences batch-wise and generate a customized output format.
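
As an illustration of the BED export step, the sketch below writes regions as four-column BED lines. BED uses a 0-based start and an exclusive end; the assumption that the stored regions are 1-based and inclusive, the tuple layout and the function name are all hypothetical and not taken from PCHD.

```python
def to_bed_lines(regions):
    """Convert regions to tab-separated BED lines.

    regions: iterable of (chromosome, start, end, label), assumed here
             to be 1-based with an inclusive end (hypothetical layout).
    BED columns: chrom, chromStart (0-based), chromEnd (exclusive), name.
    """
    lines = []
    for chrom, start, end, label in regions:
        # Shift only the start: 1-based inclusive -> 0-based half-open.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return lines
```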




10. GeneLoc Mapper

(Don de Lange)

GeneLoc Mapper is a tool that visualizes gene expression data on the genome sequence, with genes drawn in the viewer as arrows and colored based on their expression level or ratio. The program can extract genes from the dataset based on their differential expression, gene location, gene orientation and distance to one another. GeneLoc Mapper was developed to determine whether neighboring genes are often co-regulated or co-expressed in case of tissue-specific expression or cell stimulations (for example upon growth hormone activation). The program has been instrumental in identifying loci of co-expressed genes that might be under the control of locus control regions or have arisen through gene duplication events (an example can be found in: Van der Heul et al., Prostate, 2009 in press).




11. Neighbor Mapper

(Tessa Dorival)

To determine whether a specific genomic feature (such as a gene, miRNA, etc.) is located within a region of genomic loss or amplification, a program was developed to calculate the average (mean) level of probe intensities of SNPs surrounding the feature of interest. Two lists need to be provided: one contains the SNP intensity data from SNP arrays, and the other contains the features of interest. Both lists must contain the genomic location of their entries. Using this program, a batch search can be performed of, for example, all miRNAs to identify their neighboring SNPs and calculate their average intensities. This will identify miRNAs within regions of loss or amplification. By selecting the number of SNPs on either side, the size of the region surrounding the feature can be varied. Using this attribute, Neighbor Mapper can also identify genes (or other features) located on or near DNA breakpoints.
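
The neighbor calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the program's code: the function name, the tuple layout and the use of a single position per feature are hypothetical.

```python
from bisect import bisect_left
from statistics import mean


def neighbor_mean(snps, chrom, position, flank=5):
    """Mean intensity of up to `flank` SNPs on either side of a feature.

    snps:     list of (chromosome, position, intensity) tuples
              (hypothetical layout for the SNP array data).
    chrom:    chromosome of the feature of interest.
    position: base pair position of the feature.
    Returns None when the chromosome carries no SNPs.
    """
    on_chrom = sorted((pos, value) for c, pos, value in snps if c == chrom)
    positions = [pos for pos, _ in on_chrom]
    # Index of the first SNP at or beyond the feature position.
    i = bisect_left(positions, position)
    window = on_chrom[max(0, i - flank): i + flank]
    return mean(value for _, value in window) if window else None
```

Raising `flank` widens the region that is averaged, which smooths noise but blurs the boundaries of small losses or amplifications.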




12. Venn Mapper and Venn Mapper for SRS

(Marcel Smid, Don de Lange, Antoine Veldhoven)

Venn Mapper is a program that compares heterologous microarray data sets (e.g. from different platforms), based on the number of common, differentially expressed genes (Smid et al., Bioinformatics, 2003, 19(16),p2065-2071). The application loads microarray data (gene expression ratios) and determines which genes are up- or down-regulated by a user-defined ratio cut-off level. For each experiment, lists of differentially expressed genes are computed. Every list is compared to every other list, and the number of co-occurring genes is calculated. With the use of the binomial distribution, so-called z-values can be assigned to the overlap found between two lists. Venn Mapper for SRS is a special version of Venn Mapper: a web-based interface between the Venn Mapper microarray analysis software and the Sequence Retrieval System (SRS). The main purpose of this software is to let the user easily select databases (known as libraries) from SRS, choose the fields of interest (possibly from multiple databases) and automatically start an analysis by Venn Mapper. Venn Mapper for SRS also includes a clear file management option screen (Veldhoven et al., BMC Bioinformatics. 2005 Jul 27;6:192).
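
One way to attach a z-value to the overlap between two lists is sketched below. It treats each gene in the first list as a binomial trial whose success probability is the fraction of all genes present in the second list. This is an assumed formulation for illustration; the published Venn Mapper calculation may differ in detail, and the function name and arguments are hypothetical.

```python
from math import sqrt


def overlap_z(list_a, list_b, n_genes):
    """z-value for the overlap of two gene lists under a binomial model.

    list_a, list_b: differentially expressed genes from two experiments.
    n_genes:        number of genes the two experiments have in common
                    (assumed to be known by the caller).
    """
    a, b = set(list_a), set(list_b)
    observed = len(a & b)
    p = len(b) / n_genes                # chance that a random gene is in list_b
    expected = len(a) * p               # binomial mean
    sd = sqrt(len(a) * p * (1 - p))     # binomial standard deviation
    if sd == 0:
        raise ValueError("z-value is undefined for empty or all-inclusive lists")
    return (observed - expected) / sd
```

A large positive z-value means the two experiments share more differentially expressed genes than chance would suggest; a negative value means fewer.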

    


13. Short Sequence Location Mapper

(Bas Pigmans)

Short Sequence Location Mapper (SSLM) is a program written to analyse and visualize the location of short sequences on a larger sequence.
The program loads the sequences and aligns them, after which graphs and tables can be generated for one or more datasets at the same time.
SSLM typically works with a BLAST data output file of short Solexa reads mapped to RNA subjects (for example pre-miRNA sequences). The analysis can be performed on either one sequence in the dataset or on different types of RNA. The generated output is on-screen or in tab-separated, FASTA, text or TIFF format. SSLM does not itself perform the BLAST search that links the short sequences to the larger sequences they originate from.



GATC Platform © 2002 - 2012 Erasmus MC.
Page Last Updated : Mon, 01 Feb 2010 20:40:28 +0100