NLP for facilitating and accelerating curation in RegulonDB

To facilitate and accelerate our biocuration work, we began research in Natural Language Processing (NLP), an active field that develops approaches to detect, extract, and organize knowledge from biomedical literature. We started in 2014 with an approach to curate the experimental contrasting variable in growth conditions using the OntoGene text mining system. We call this assisted or semiautomatic curation, since the system selects a set of candidate sentences and a curator makes the final manual decision. By “growth condition” we mean here exclusively the contrast, or variable, that differs between the control and the experimental condition (Gama-Castro S et al. 2014).
More recently, we collected a data set of manually validated sentences containing knowledge of regulatory interactions between transcription factors (TFs) and genes, some of which include growth conditions, for training classification models with machine learning techniques. The idea is to facilitate curation by using predictive models to process article collections and to filter and prioritize relevant sentences.
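As an illustration of how a predictive model can prioritize sentences for a curator, the following is a minimal sketch using a bag-of-words Naive Bayes classifier written with the Python standard library only. It is not the model used in RegulonDB; the choice of Naive Bayes, the label names "relevant" and "other", the placeholder names such as TF_A and gene_x, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    # Lowercase word tokens; underscores are kept so placeholder names stay whole.
    return re.findall(r"[a-z0-9_]+", sentence.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.class_counts = Counter()            # label -> number of sentences
        self.vocab = set()

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            tokens = tokenize(sentence)
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def prob(self, sentence, label):
        # Posterior probability of `label`; tokens unseen in training are ignored.
        total = sum(self.class_counts.values())
        scores = {}
        for y, n in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)
            for token in tokenize(sentence):
                if token in self.vocab:
                    score += math.log((self.word_counts[y][token] + self.alpha) / denom)
            scores[y] = score
        top = max(scores.values())
        exp = {y: math.exp(s - top) for y, s in scores.items()}
        return exp[label] / sum(exp.values())


# Hypothetical training sentences; a real model would be trained on validated data.
train = [
    ("TF_A represses transcription of gene_x", "relevant"),
    ("TF_B activates expression of gene_y under anaerobic conditions", "relevant"),
    ("Cells were harvested by centrifugation", "other"),
    ("The protein was purified to homogeneity", "other"),
]
model = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])

# Rank unseen candidate sentences so the most likely relevant ones come first.
candidates = [
    "Samples were stored and harvested later",
    "TF_C represses expression of gene_z",
]
ranked = sorted(candidates, key=lambda s: model.prob(s, "relevant"), reverse=True)
for sentence in ranked:
    print(f"{model.prob(sentence, 'relevant'):.2f}  {sentence}")
```

In a curation setting, the ranked list would be presented to a curator, who still makes the final decision on each sentence.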
In addition, we have worked on the extraction of information about the biological processes of regulated genes and the structural domains of TFs, to support the preparation of our extensive TF summaries. We used manual summaries to train an automatic summarizer that collects sentences concerning these TF properties from article collections (Méndez-Cruz CF et al. 2017). Below, we show the manual and automatic summaries of ArgR. An implementation of this summarizer is available on GitHub.

Manual summary

ArgR has two domains: The N-terminal domain, which contains a winged-helix-turn-helix DNA-binding motif and the C-terminal domain, which contains a motif that binds L-arginine and a motif for oligomerization. Based on cross-linking analysis of wild-type and mutant ArgR proteins, it has been shown that the C-terminus is more important in cer/Xer site-specific recombination than in DNA-binding.
ArgR complexed with L-arginine represses the transcription of several genes involved in biosynthesis and transport of arginine, transport of histidine, and its own synthesis and activates genes for arginine-catabolism. ArgR is also essential for a site-specific recombination reaction that resolves plasmid ColE1 multimers to monomers and is necessary for plasmid stability.

Automatic summary

Results The domain structure of ArgR. The mutagenesis results of two laboratories have shown that the ArgR subunit is made up of two functional regions: a basic N-terminal half responsible for DNA-binding and an acidic C-terminal half responsible for both oligomerization and for binding arginine (Burke et al., 1994; Tian & Maas, 1994).
We overexpressed the C-terminal domain of ArgR (ArgRc) corresponding to amino acids 80 to 156 in a T 7 polymerase-driven system and purified the protein to homogeneity .
Discussion The C-terminal domain of ArgR forms a hexameric protein core that contains the binding sites for L-arginine and provides a central, symmetric scaffold for six DNA-binding domains.
In addition to regulating the transcription of arginine biosynthetic genes, ArgR plays an obligatory role in a site-specific recombination reaction that resolves ColE1-like plasmid multimers to monomers and is necessary for plasmid stability (Stirling et al. , 1988) .

We regard the knowledge in RegulonDB as a training set for generating and evaluating NLP approaches and tools that we expect will benefit research on E. coli and, potentially, on other microbial organisms.

NLP resources

a) Regulatory interactions. As a product of the work described above, we have created two NLP resources that we make available to the BioNLP community, especially for tasks of automatic classification, passage detection, and relation extraction. The first is a data set of validated sentences divided into two classes. These sentences were obtained from 142 articles concerning transcriptional regulation.

Description                                        Class   Total instances   File
Regulatory interactions without growth condition   RI      896               Download
Regulatory interactions with growth condition      RI+GC   253               Download
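
The instance counts above imply that the two classes are unbalanced, which matters when the data set is used to train and evaluate a classifier. The short sketch below computes the class shares and the accuracy of a trivial majority-class baseline from the counts in the table. Treating the two classes as labels of a single binary classification task is an assumption made here for illustration.

```python
# Instance counts taken from the table of regulatory-interaction sentences.
counts = {"RI": 896, "RI+GC": 253}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class reaches this accuracy,
# so any trained model should be compared against it.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, n in counts.items():
    print(f"{label}: {n} sentences ({shares[label]:.1%})")
print(f"Total: {total} sentences")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

Because roughly one sentence in five belongs to the RI+GC class, per-class precision and recall are likely to be more informative than overall accuracy.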

b) TF summaries. The second resource is a data set of manual summaries for 178 TFs in text format. These can serve as samples of high-quality curated knowledge covering several properties of TFs, for developing user-oriented multi-document summarizers in automatic text summarization research.

Description                                  Total instances   File
Manual summaries of 178 TFs in text format   178               Download