Evidence Classification in RegulonDB

Scientific knowledge advances incrementally. At any point, we base broad conclusions on assertions of varying degrees of confidence. RegulonDB classifies the evidence supporting a particular assertion essentially by the method used to generate it. We do so to make explicit the complex mixture of more or less well-supported specific claims that underlie broader conclusions (Weiss et al., 2013).

We classify the evidence supporting knowledge as ‘Weak’, ‘Strong’ or ‘Confirmed’.

Weak evidence: Single evidence with more ambiguous conclusions, where alternative explanations, indirect effects, or potential false positives are prevalent; computational predictions also fall into this class. Examples are gel mobility shift assays with cell extracts and gene expression analysis.
Strong evidence: Single evidence with direct physical interaction or solid genetic evidence with a low probability for alternative explanations; for instance, footprinting with purified protein or site mutation.
Confirmed: Assigned to objects that are supported by at least two independent types of strong evidence whose false positives mutually exclude each other. This approach is based essentially on the methods used to validate results and exclude alternative explanations in scientific research.

Confidence is assigned in two stages:

In Stage I, we classify single evidence as weak or strong.
In Stage II, we validate data by integrating multiple evidence in a process termed ‘Analytical Cross-Validation’. The cross-validation of weak evidence, including high-throughput (HT) data, to strong evidence, and of strong evidence to ‘confirmed’, is described under Stage II. Analytical Cross-Validation.

Stage I. Classification of Individual Evidence Types
Single evidence is classified into weak or strong evidence (see above), depending on the confidence level of the associated methodologies.

1. Promoters and transcription start sites (TSSs)   
In bacteria, a promoter is defined as the DNA region specifically bound by RNA polymerase to initiate transcription.
A TSS is the precise first nucleotide that is transcribed. Different methods identify either promoters or TSSs; the two are classified jointly here.
Evidence Code Evidence Category
2. Regulatory interactions   
A regulatory interaction is defined, depending on the type of evidence, as the transcription factor (TF)-regulated gene interaction (TF-gene), or more specifically as the TF-DNA binding site interaction.
Evidence Code Evidence Category
3. Transcription factor functional conformation    
Most dedicated TFs have two conformations: one with a non-covalently bound allosteric metabolite or a covalent phosphorylation (the holo conformation), and one as a free protein or multimer (the apo conformation). There are exceptions to this statement. We call the functional conformation the one that is capable of binding to its specific binding sites and performing its activation or repression activity. As evidence for the functional conformation, the experiments below have to be considered both with and without effector.
Evidence Code Evidence Category
4. Transcription units
Evidence Code Evidence Category

Stage II. Analytical Cross-Validation
Analytical cross-validation is an active evaluation of confidence and integrates multiple evidence by combining independent types of evidence, with the intention to confirm individual objects and mutually exclude false positives. It follows the same principles of science as applied by wet-lab scientists, where data are confirmed by repetitions on the one hand, and by additional experimental strategies to exclude alternative explanations on the other.

Analytical cross-validation requires that the combined methods be independent, that is, that they do not share major sources of false positives or common raw materials. This approach makes it possible to evaluate high-throughput (HT) data: objects that are supported by two types of independent weak evidence are classified as strong evidence. Furthermore, it allows us to introduce a third confidence score, ‘confirmed’: objects that are supported by two types of independent strong evidence are classified as confirmed. The confirmed score describes the most reliable data, which resemble the gold standard data in RegulonDB.


For each object, the types of evidence that can be combined with each other to allow an upgrade to confirmed confidence are given. Any two methods from different rows can be combined.
Types of evidence in the same row cannot be combined with each other. For instance, different protocols for transcription initiation mapping cannot be combined with each other for cross-validation, since these methods use mRNA as the starting material and therefore share a common source of false positives, which is RNA processing or degradation.
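
The upgrade rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RegulonDB's implementation: the `Evidence` class, the function name `cross_validate`, and the `group` labels (which stand in for the rows of combinable evidence types) are hypothetical. The sketch applies only the two rules stated in the text and does not chain upgrades.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    code: str   # evidence code, e.g. "BPP" or "SM"
    level: str  # Stage I result: "weak" or "strong"
    group: str  # hypothetical row label; same group = shared false-positive source


def cross_validate(evidence):
    """Return 'confirmed', 'strong', 'weak', or None for a list of Evidence."""
    # Evidence from the same group cannot be combined, so count groups, not items.
    strong_groups = {e.group for e in evidence if e.level == "strong"}
    weak_groups = {e.group for e in evidence if e.level == "weak"}
    if len(strong_groups) >= 2:
        return "confirmed"  # two independent types of strong evidence
    if strong_groups or len(weak_groups) >= 2:
        return "strong"     # one strong, or two independent types of weak evidence
    if weak_groups:
        return "weak"
    return None             # no evidence at all


# Footprinting (BPP) and site mutation (SM) are the strong examples from the text;
# the group names "binding" and "mutation" are assumptions for illustration.
footprint = Evidence("BPP", "strong", "binding")
mutation = Evidence("SM", "strong", "mutation")
print(cross_validate([footprint, mutation]))
```

Two pieces of evidence that share a group, such as two transcription initiation mapping protocols that both start from mRNA, contribute only one group to the count and therefore do not trigger an upgrade.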

Cross-validation of TF binding sites and promoters requires that the exact location of the object is specified for each individual evidence.

Evidence codes: Each combination of two types of independent evidence is described by an evidence code of the form CV(EC1/EC2). For instance, the evidence code for the combination of genomic SELEX (GSELEX) and gene expression analysis (GEA) is CV(GSELEX/GEA), and the code for the combination of footprinting (BPP) with site mutation (SM) is CV(BPP/SM).
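
As a minimal sketch of the CV(EC1/EC2) format, the following Python helpers build and parse such codes. The function names `make_cv_code` and `parse_cv_code` are hypothetical, and the sketch assumes that individual evidence codes contain no slashes or parentheses. It keeps the two codes in the order given, since the text does not state an ordering rule.

```python
import re

# Matches "CV(EC1/EC2)", where EC1 and EC2 contain no "/", "(" or ")".
_CV_PATTERN = re.compile(r"^CV\(([^/()]+)/([^/()]+)\)$")


def make_cv_code(ec1, ec2):
    """Combine two individual evidence codes into a cross-validation code."""
    return f"CV({ec1}/{ec2})"


def parse_cv_code(code):
    """Split a cross-validation code back into its two evidence codes."""
    match = _CV_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a cross-validation evidence code: {code!r}")
    return match.group(1), match.group(2)


print(make_cv_code("GSELEX", "GEA"))  # CV(GSELEX/GEA)
print(parse_cv_code("CV(BPP/SM)"))    # ('BPP', 'SM')
```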