RegulonDB

Evidence Classification in RegulonDB


Scientific knowledge advances incrementally. At any point, broad conclusions rest on assertions held with varying degrees of confidence. RegulonDB classifies the evidence supporting a particular assertion essentially by the methods used to generate it. We do so to make explicit the complex mixture of more or less well-supported specific claims that underlie broader conclusions (Weiss et al., 2013).

We classify the evidence supporting knowledge as ‘Weak’, ‘Strong’ or ‘Confirmed’.

Weak evidence: A single piece of evidence with ambiguous conclusions, where alternative explanations, indirect effects, or potential false positives are prevalent; computational predictions also fall into this class. Examples are gel mobility shift assays with cell extracts and gene expression analysis.
Strong evidence: A single piece of evidence showing direct physical interaction, or solid genetic evidence, with a low probability of alternative explanations; for instance, footprinting with purified protein or site mutation.
Confirmed: Assigned to objects supported by at least two independent types of strong evidence that exclude each other's false positives. This approach follows the methods used in scientific research to validate results and exclude alternative explanations.

Confidence is assigned in two stages:

Stage I: each single piece of evidence is classified as weak or strong.
Stage II: data are validated by integrating multiple pieces of evidence in a process termed ‘Analytical Cross-Validation’. The section Stage II. Analytical Cross-Validation describes how weak high-throughput (HT) evidence is upgraded to strong, and strong evidence to ‘confirmed’.

Stage I. Classification of Individual Evidence Types
Description   
Each single piece of evidence is classified as weak or strong (see above), depending on the confidence level of the associated methodology.

1. Promoters and transcription start sites (TSSs)   
In bacteria, a promoter is the DNA region specifically bound by RNA polymerase to initiate transcription.
A TSS is the first nucleotide that is transcribed. Different methods identify either promoters or TSSs; the two are classified jointly here.
Strong Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 1.1 RNA polymerase footprinting | RPF | Classical experiment |
| 1.2 In vitro transcription assay | TA | Classical experiment |
| 1.3 Transcription initiation mapping (example: primer extension, S1 mapping, 5' RACE) | TIM | Classical experiment |
| 1.4 RNA-seq using two enrichment strategies for primary transcripts and consistent biological replicates (example: use of terminator exonuclease and differential ligation of adaptors) | RS-EPT-CBR | High-throughput protocol |
| 1.5 RNA-seq using two enrichment strategies for primary transcripts, consistent biological replicates, and evidence for a non-coding gene | RS-EPT-ENCG-CBR | High-throughput protocol |
| 1.6 Cross-validation (GEA/GS) | CV(GEA/GS) | Independent cross-validation |
| 1.7 Cross-validation (GEA/ROMA) | CV(GEA/ROMA) | Independent cross-validation |
| 1.8 High-throughput transcription initiation mapping | HTIM | nd |

Weak Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 1.9 Automated inference of promoter position (example: computational prediction) | AIPP | Computational prediction or inference |
| 1.10 ChIP analysis | CHIP | High-throughput protocol |
| 1.11 ROMA | ROMA | High-throughput protocol |
| 1.12 RNA-seq | RS | High-throughput protocol |
| 1.13 Human inference of promoter position (example: identification of a possible promoter by an expert reading the sequence) | HIPP | Human inference |

2. Regulatory interactions   
A regulatory interaction is defined, depending on the type of evidence, either as the interaction between a transcription factor (TF) and its regulated gene (TF-gene) or, more specifically, as the interaction between a TF and its DNA binding site.

Strong Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 2.1 In vitro transcription assay | TA | Classical experiment |
| 2.2 ChIP analysis and statistical validation of TFBSs | CHIP-SV | High-throughput protocol |
| 2.3 Cross-validation (GEA/GS) | CV(GEA/GS) | Independent cross-validation |
| 2.4 Cross-validation (GEA/ROMA) | CV(GEA/ROMA) | Independent cross-validation |

Weak Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 2.5 ChIP analysis (example: ChIP-chip, ChIP-seq) | CHIP | High-throughput protocol |
| 2.6 Mapping of signal intensities (example: RNA-seq or microarray analysis) | MSI | High-throughput protocol |
| 2.7 ROMA | ROMA | High-throughput protocol |

3. Transcription factor functional conformation    
Most dedicated TFs have two conformations: one bound non-covalently to an allosteric metabolite or covalently modified by phosphorylation (the holo conformation), and one as a free protein or multimer (the apo conformation). There are exceptions to this statement. We call the functional conformation the one that is able to bind its specific binding sites and carry out its activation or repression activity. As evidence for the functional conformation, the experiments below have to be considered both with and without the effector.
Evidence Code Evidence Category
4. Transcription units
Strong Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 4.1 Mapping of signal intensities, evidence for a single gene, consistent biological replicates | MSI-ESG-CBR | High-throughput protocol |
| 4.2 Paired-end di-tagging | PET | High-throughput protocol |

Weak Evidence

| Evidence type | Evidence code | Evidence category |
|---|---|---|
| 4.3 Mapping of signal intensities | MSI | High-throughput protocol |
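
Stage I amounts to a lookup from evidence code to confidence level. The following is a minimal sketch for the promoter and TSS codes of section 1. The names `Confidence`, `PROMOTER_EVIDENCE` and `classify_single` are illustrative and are not part of RegulonDB; only the codes and their levels are taken from the classification above.

```python
from enum import Enum


class Confidence(Enum):
    """Stage I confidence levels (hypothetical type, for illustration only)."""
    WEAK = "weak"
    STRONG = "strong"


# Promoter/TSS evidence codes and their Stage I level, transcribed from section 1.
PROMOTER_EVIDENCE = {
    "RPF": Confidence.STRONG,
    "TA": Confidence.STRONG,
    "TIM": Confidence.STRONG,
    "RS-EPT-CBR": Confidence.STRONG,
    "RS-EPT-ENCG-CBR": Confidence.STRONG,
    "CV(GEA/GS)": Confidence.STRONG,
    "CV(GEA/ROMA)": Confidence.STRONG,
    "HTIM": Confidence.STRONG,
    "AIPP": Confidence.WEAK,
    "CHIP": Confidence.WEAK,
    "ROMA": Confidence.WEAK,
    "RS": Confidence.WEAK,
    "HIPP": Confidence.WEAK,
}


def classify_single(code: str) -> Confidence:
    """Return the Stage I level of one promoter evidence code."""
    try:
        return PROMOTER_EVIDENCE[code]
    except KeyError:
        # Unknown codes are rejected rather than silently treated as weak.
        raise ValueError(f"unknown promoter evidence code: {code!r}") from None
```

A call such as `classify_single("RPF")` returns `Confidence.STRONG`, while `classify_single("RS")` returns `Confidence.WEAK`. The same pattern would apply to regulatory interactions and transcription units with their own tables.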


Stage II. Analytical Cross-Validation
Analytical cross-validation is an active evaluation of confidence that integrates multiple independent types of evidence in order to confirm individual objects and mutually exclude false positives. It follows the same principles that wet-lab scientists apply: data are confirmed by repetition on the one hand, and by additional experimental strategies that exclude alternative explanations on the other.

Analytical cross-validation requires that the combined methods are independent, that is, that they do not share major sources of false positives or common raw materials. This approach makes it possible to evaluate high-throughput (HT) data: objects supported by two types of independent weak evidence are classified as strong evidence. It also introduces a third confidence score, "confirmed": objects supported by two types of independent strong evidence are classified as confirmed. The confirmed score describes the most reliable data, which resemble the gold-standard data in RegulonDB.
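
The two upgrade rules can be sketched as a small function. This is an illustration under stated assumptions, not RegulonDB's implementation: the function name `combined_confidence` is hypothetical, all inputs are assumed to be mutually independent, and a cross-validated pair of weak evidence types is assumed to count as one strong unit that can be combined further. That last assumption is suggested by three-method codes such as CV(GEA/ROMA/SM) but is not stated as a general rule.

```python
def combined_confidence(n_weak: int, n_strong: int) -> str:
    """Confidence for an object backed by mutually independent evidence types.

    n_weak and n_strong count the independent weak and strong evidence types.
    """
    if n_weak < 0 or n_strong < 0:
        raise ValueError("counts must be non-negative")
    # Rule 1: two independent weak types together are classified as strong.
    # Assumption: such a pair then behaves like one strong unit.
    strong_units = n_strong + n_weak // 2
    # Rule 2: two independent strong types are classified as confirmed.
    if strong_units >= 2:
        return "confirmed"
    if strong_units == 1:
        return "strong"
    return "weak" if n_weak == 1 else "none"
```

Under these assumptions, two weak types give "strong", two strong types give "confirmed", and two weak types plus one strong type also give "confirmed". Whether four weak types alone should reach "confirmed" follows from the assumption above rather than from the text.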

Description

For each object, the types of evidence that can be combined with each other to allow an upgrade to confirmed confidence are given. Any two methods from different rows can be combined.
Types of evidence in the same row cannot be combined with each other. For instance, different protocols for transcription initiation mapping cannot be combined for cross-validation, because they all use mRNA as the starting material and therefore share a common source of false positives, namely RNA processing or degradation.
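
The row rule can be sketched as a check that two methods belong to different rows. The row assignment in `ROWS` is hypothetical: the row labels are invented for illustration, and the grouping of the RNA-seq codes with TIM is inferred from the promoter codes listed in this document, not taken from an explicit table. The function names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical row assignment for strong promoter evidence. Methods in the
# same row are assumed to share a source of false positives (here: mRNA).
ROWS = {
    "RPF": "footprinting",
    "SM": "site mutation",
    "TA": "in vitro transcription",
    "TIM": "mRNA-based mapping",
    "RS-EPT-CBR": "mRNA-based mapping",
    "RS-EPT-ENCG-CBR": "mRNA-based mapping",
}


def can_combine(a: str, b: str) -> bool:
    """Two evidence types can be cross-validated only if their rows differ."""
    return ROWS[a] != ROWS[b]


def valid_pairs(codes):
    """All unordered pairs of codes that may be combined, sorted by name."""
    return sorted(
        tuple(sorted(pair))
        for pair in combinations(codes, 2)
        if can_combine(*pair)
    )
```

With this assumed grouping, `valid_pairs(ROWS)` contains pairs such as `("RPF", "TIM")` and `("SM", "TA")` but not `("RS-EPT-CBR", "TIM")`, since both of the latter methods start from mRNA.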

Cross-validation of TF binding sites and promoters requires that the exact location of the object is specified for each individual evidence.

Evidence codes: Each combination of two types of independent evidence is described as an evidence code, of the type CV(EC1/EC2). For instance, the evidence code for the combination of genomic SELEX (GSELEX) and gene expression analysis (GEA) is CV(GSELEX/GEA), that for the combination of footprinting (BPP) with site mutation (SM) is CV(BPP/SM).
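
The CV(EC1/EC2) notation can be built and taken apart mechanically. The sketch below is illustrative only: the helper names `make_cv_code` and `parse_cv_code` are hypothetical, and the parser assumes that component codes contain no slashes or parentheses, so nested codes are not handled.

```python
import re

# "CV(" + two or more slash-separated component codes + ")".
_CV_PATTERN = re.compile(r"^CV\(([^()/]+(?:/[^()/]+)+)\)$")


def make_cv_code(codes):
    """Join two or more evidence codes into a cross-validation code."""
    if len(codes) < 2:
        raise ValueError("cross-validation needs at least two evidence codes")
    return "CV(" + "/".join(codes) + ")"


def parse_cv_code(code):
    """Split a cross-validation code into its component evidence codes."""
    match = _CV_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a cross-validation code: {code!r}")
    return match.group(1).split("/")
```

For the examples in the text, `make_cv_code(["GSELEX", "GEA"])` gives `"CV(GSELEX/GEA)"` and `parse_cv_code("CV(BPP/SM)")` gives `["BPP", "SM"]`. Three-method codes such as CV(RPF/GEA/GS) parse the same way.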

1. Promoters and transcription start sites (TSSs)
Confirmed Evidence
Objects supported by two types of independent strong evidence are classified as confirmed.
 
CV(GEA/ROMA/SM) GEA: Gene expression analysis
ROMA: ROMA
SM: Site mutation
CV(GEA/SM/GS) GEA: Gene expression analysis
SM: Site mutation
GS: genomic SELEX
CV(RPF/GEA/GS) RPF: RNA polymerase footprinting
GEA: Gene expression analysis
GS: genomic SELEX
CV(RPF/GEA/ROMA) RPF: RNA polymerase footprinting
GEA: Gene expression analysis
ROMA: ROMA
CV(RPF/RS-EPT-CBR) RPF: RNA polymerase footprinting
RS-EPT-CBR: RNA-seq using two enrichment strategies for primary transcripts and consistent biological replicates
CV(RPF/RS-EPT-ENCG-CBR) RPF: RNA polymerase footprinting
RS-EPT-ENCG-CBR: RNA-seq using two enrichment strategies for primary transcripts, consistent biological replicates, and evidence for a non-coding gene.
CV(RPF/SM) RPF: RNA polymerase footprinting
SM: Site mutation
CV(RPF/TA) RPF: RNA polymerase footprinting
TA: In vitro transcription assay
CV(RPF/TIM) RPF: RNA polymerase footprinting
TIM: Transcription initiation mapping
CV(RS-EPT-CBR/SM) RS-EPT-CBR: RNA-seq using two enrichment strategies for primary transcripts and consistent biological replicates
SM: Site mutation
CV(RS-EPT-CBR/TA) RS-EPT-CBR: RNA-seq using two enrichment strategies for primary transcripts and consistent biological replicates
TA: In vitro transcription assay
CV(RS-EPT-ENCG-CBR/SM) RS-EPT-ENCG-CBR: RNA-seq using two enrichment strategies for primary transcripts, consistent biological replicates, and evidence for a non-coding gene.
SM: Site mutation
CV(RS-EPT-ENCG-CBR/TA) RS-EPT-ENCG-CBR: RNA-seq using two enrichment strategies for primary transcripts, consistent biological replicates, and evidence for a non-coding gene.
TA: In vitro transcription assay
CV(SM/TA) SM: Site mutation
TA: In vitro transcription assay
CV(SM/TIM) SM: Site mutation
TIM: Transcription initiation mapping
CV(TA/TIM) TA: In vitro transcription assay
TIM: Transcription initiation mapping
2. Regulatory interactions
Strong Evidence
Objects supported by two types of independent weak evidence are classified as strong.
 
CV(GEA/GS) GEA: Gene expression analysis
GS: genomic SELEX
CV(GEA/ROMA) GEA: Gene expression analysis
ROMA: ROMA
Confirmed Evidence
Objects supported by two types of independent strong evidence are classified as confirmed.
 
CV(CHIP-SV/GEA/GS) CHIP-SV: ChIP analysis and statistical validation of TFBSs
GEA: Gene expression analysis
GS: genomic SELEX
CV(CHIP-SV/GEA/ROMA) CHIP-SV: ChIP analysis and statistical validation of TFBSs
GEA: Gene expression analysis
ROMA: ROMA
CV(CHIP-SV/RPF) CHIP-SV: ChIP analysis and statistical validation of TFBSs
RPF: RNA polymerase footprinting
CV(CHIP-SV/SM) CHIP-SV: ChIP analysis and statistical validation of TFBSs
SM: Site mutation
CV(GEA/ROMA/SM) GEA: Gene expression analysis
ROMA: ROMA
SM: Site mutation
CV(GEA/SM/GS) GEA: Gene expression analysis
SM: Site mutation
GS: genomic SELEX
CV(RPF/GEA/GS) RPF: RNA polymerase footprinting
GEA: Gene expression analysis
GS: genomic SELEX
CV(RPF/GEA/ROMA) RPF: RNA polymerase footprinting
GEA: Gene expression analysis
ROMA: ROMA
CV(RPF/SM) RPF: RNA polymerase footprinting
SM: Site mutation
3. Transcription units
Confirmed Evidence
Objects supported by two types of independent strong evidence are classified as confirmed.
 
CV(LTED/PM) LTED: Length of transcript experimentally determined
PM: Polar mutation
CV(PET/PM) PET: paired end di-tagging
PM: Polar mutation

