Introduction

    Antimicrobial peptides (AMPs), which are naturally encoded by genes and generally contain 12–100 amino acids, are crucial components of the innate immune system and can protect the host from various pathogenic bacteria as well as viruses. In recent years, the widespread use of antibiotics has driven the rapid emergence of antibiotic-resistant microorganisms, which often cause serious infections. Owing to their broad spectrum of antimicrobial activity, AMPs are active against a variety of targets, such as Gram-positive and Gram-negative bacteria, fungi, viruses, and parasites, as well as tumors. Interest has therefore grown in discovering natural AMPs that could enable the development of new antibiotics. Given the importance of developing potential drugs, several databases and tools dedicated to the annotation of AMPs have been proposed in the past few years. Meanwhile, the advent of high-throughput technologies has led to a surge of molecular biology data in both volume and scope. For instance, mass spectrometry (MS) has been widely applied in proteomic studies to profile thousands of peptides in one experiment. Additionally, next-generation sequencing (NGS) has been applied to generate large-scale sequencing reads from food, water, soil, air, and clinical specimens in order to identify microbiota and their functions based on metagenomics and metatranscriptomics, respectively. Rapidly advancing biotechnologies now make it possible to examine the genome, transcriptome, and proteome comprehensively. Yet extracting meaningful information from this vast sea of data and approaching biological functions from a systems biology perspective have become the Holy Grail of bioinformatics. We were thus motivated to design a database-assisted platform (dbAMP) that provides comprehensive functional and physicochemical analyses of AMPs based on large-scale transcriptome and proteome data.
The highlights of dbAMP are listed as follows:

1. Integrated resource of AMPs: Figure 1 displays the system flow of data collection, integration, analysis, and presentation in dbAMP. A total of 28,709 AMPs were collected from databases and manually curated from the literature. After the removal of redundant sequences by mapping to UniProtKB protein entries, dbAMP contains 9,062 experimentally verified AMPs along with their functional activities obtained from 19,647 organisms. Table 1 shows that over 20 types of functional activities of AMPs are included in dbAMP.

Figure 1. Schematic illustration of data collection, integration, analysis, and presentation in dbAMP.


Table 1. Comparison of data statistics of AMPs with their functional activities between dbAMP and other AMP databases.




3. Identification of AMPs of different species on proteome data: In addition to providing a user-friendly interface for browsing the collected AMPs, dbAMP uses all the experimentally verified AMPs to build species-specific AMP prediction models based on random forest (RF). Figure 3 presents the amino acid composition of AMPs from different species. This investigation indicated that the amino acid composition of AMPs differs markedly among plants, fish, and mammals. Table 2 shows that the proposed RF models reach high prediction accuracy on proteome data from different species.

Figure 3. Amino acid composition of AMPs from different species.
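The amino acid composition compared in Figure 3 can be computed for a single peptide as a 20-element frequency profile. The following is a minimal sketch only: the function name `amino_acid_composition` is hypothetical, and the text above does not state which features the RF models actually use, so treating composition as a model input is an assumption.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a peptide.

    Non-standard characters (for example X or gaps) are ignored.
    Hypothetical helper for illustration; not part of dbAMP.
    """
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Made-up peptide, used only to show the output format.
    profile = amino_acid_composition("GLKKLLKKAG")
    for aa, frac in profile.items():
        if frac > 0:
            print(f"{aa}: {frac:.2f}")
```

Profiles computed this way for peptides grouped by source organism could then be averaged per group to compare, for instance, plant and mammalian AMPs.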


Table 2. Cross-validation and independent testing results of the generated AMP prediction models for multiple species.




5. Enhanced design of dbAMP web interface: To enable comprehensive analyses of AMPs, dbAMP provides users with an enhanced web interface. Users can browse all the AMPs and submit RNA sequencing reads or MS/MS-identified peptides to dbAMP; the system then identifies known AMPs with their functional activities and discovers novel AMPs using the predictive models. Table 3 compares the system functionalities of dbAMP with those of other AMP databases. dbAMP is now freely accessible via http://csb.cse.yzu.edu.tw/~dbAMP/.

Table 3. Comparison of system functionality between dbAMP and other AMP databases.

2. Comprehensive analyses for functional and physicochemical properties: Growing interest in the functional and physicochemical investigation of AMPs motivated the mapping of all AMPs onto protein entries of UniProtKB and the Protein Data Bank (PDB) based on sequence identity. This mapping enables users to examine amino acid composition, solvent-accessible surface area, functional domains, secondary structure, antimicrobial potency against target species, hydrophobicity, and the composition of positively and negatively charged residues. Additionally, 6,338 proteins interacting with AMPs were integrated into dbAMP for the analysis of potential targets of antimicrobial resistance. As displayed in Figure 2, dbAMP provides comprehensive functional and physicochemical analyses for the human antimicrobial peptide elafin.

Figure 2. A case study of the human antimicrobial peptide elafin with comprehensive structural and functional analyses, including amino acid composition, solvent-accessible surface area, functional domains, secondary structure, antimicrobial potency against target species, hydrophobicity, the composition of positively and negatively charged residues, and AMP-protein interaction.
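Two of the sequence-level properties listed above, hydrophobicity and the composition of charged residues, can be illustrated with a short calculation. This is a sketch under stated assumptions: the text does not say which hydrophobicity scale dbAMP uses, so the Kyte-Doolittle hydropathy scale is swapped in here, and only K/R are counted as positive and D/E as negative (histidine is left out). The function name `physicochemical_summary` is hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")  # assumption: histidine not counted as charged
NEGATIVE = set("DE")


def physicochemical_summary(sequence):
    """Summarise charge composition and mean hydropathy of a peptide.

    Hypothetical helper for illustration; not part of dbAMP.
    """
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    pos = sum(r in POSITIVE for r in residues)
    neg = sum(r in NEGATIVE for r in residues)
    return {
        "length": n,
        "positive_fraction": pos / n,
        "negative_fraction": neg / n,
        "net_charge": pos - neg,  # simple residue count, not pH-dependent
        "mean_hydropathy": sum(KYTE_DOOLITTLE[r] for r in residues) / n,
    }


if __name__ == "__main__":
    # Made-up peptide, used only to show the output format.
    print(physicochemical_summary("GLKKLLKKAG"))
```

The net charge here is a plain residue count; a pH-dependent estimate would need pKa values and is outside the scope of this sketch.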




4. Large-scale detection of AMPs on transcriptome data: In this work, all the amino acid sequences of AMPs were transformed into DNA sequences in order to implement an efficient Docker-based pipeline for discovering AMPs from next-generation sequencing (NGS) data using the Bowtie2 program. Figure 4 shows the flowchart of using the developed Docker container to detect AMPs in transcriptome sequencing datasets. To demonstrate this scheme for AMP discovery, RNA sequencing samples of four Taiwanese oolong teas (Dayuling, Alishan, Jinxuan, and Oriental Beauty), obtained from NCBI SRA under accession number SRP113601, were subjected to quality control and sequence alignment against dbAMP entries. As presented in Figure 5, a total of 8,194 (6.5%), 26,220 (6.1%), 5,703 (5.8%), and 106,183 (7.8%) RNA reads from the four samples could be mapped to AMPs with 100% sequence identity.

Figure 4. Flowchart of using the developed Docker container to detect AMPs in next-generation sequencing data.
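The step of transforming AMP amino acid sequences into DNA sequences can be sketched as a back-translation that writes a FASTA reference suitable for indexing by an aligner. The text does not describe how dbAMP resolves codon degeneracy, so this sketch assumes one fixed codon per amino acid from the standard genetic code; the names `back_translate` and `to_fasta` and the entry identifiers are hypothetical.

```python
# One representative codon per amino acid (standard genetic code).
# Assumption: the source does not say how codon degeneracy is handled.
CODON_FOR = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}


def back_translate(peptide):
    """Back-translate a peptide into one possible coding DNA sequence."""
    try:
        return "".join(CODON_FOR[res] for res in peptide.upper())
    except KeyError as err:
        raise ValueError(f"non-standard residue: {err.args[0]}") from None


def to_fasta(peptides):
    """Format {identifier: peptide} as FASTA text of back-translated DNA."""
    lines = []
    for name, peptide in peptides.items():
        lines.append(f">{name}")
        lines.append(back_translate(peptide))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up identifiers and peptides, used only to show the format.
    print(to_fasta({"example_1": "GLKKLLKKAG", "example_2": "MKWV"}), end="")
```

Because most amino acids have several synonymous codons, a single fixed back-translation combined with 100% identity matching would miss reads that use other codons; a production pipeline would need to account for this, for example by enumerating alternatives or aligning in protein space.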


Figure S9. Distribution of AMPs in four oolong teas. (A) The distribution of anti-Gram-positive and anti-Gram-negative AMPs in plants. (B) The distribution of anti-Gram-positive and anti-Gram-negative AMPs in bacteria. (C) The distribution of Gram-positive and Gram-negative bacteria in four oolong teas.
