Advances in molecular biology techniques, such as high-throughput DNA sequencing and DNA microarrays, have led to huge amounts of biological data being produced every day, at an increasing rate. To understand the mechanisms of life, it is crucial to interpret these data and to unravel the patterns hidden within them. In this research project, we apply our neural hardware to pattern recognition in biological sequences. We believe that a hardware approach can offer an enormous speed advantage over conventional, PC-based systems.
In biology, we have to deal with long sequences built from a small alphabet. For genomic data, the alphabet consists of the four nucleotides A, C, G, and T. Proteins are sequences built from 20 amino acids. Often, these sequences must be searched for a specific subsequence. For example, one might be interested in the locations where a certain molecular binding site occurs in a genome.
Fig 1: Searching a sequence for a specific pattern. The network shown was trained to recognize the pattern ACGCCTTCAAT.
Currently, we are focusing on applying our hardware to such database searches. The idea is to train a network to recognize a certain subsequence, as shown in figure 1. The database is presented to the network sequentially in the form of small windows. Whenever the pattern of interest is found, the network gives a positive response. With our hardware, the human genome can be searched for a pattern in only a few minutes. Because the hardware operates in parallel, many patterns can be searched for simultaneously. In other words, the computing time is the same whether we are interested in one pattern or in hundreds.
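The windowed search described above can be illustrated in software. The following is only a minimal sketch, not the hardware implementation: it assumes a one-hot encoding of the four nucleotides and replaces the trained network with hand-set weights (one output unit per pattern, whose weights are simply the encoded pattern). The names `one_hot` and `scan` are hypothetical and chosen for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # the four nucleotides


def one_hot(seq):
    """Encode a nucleotide string as a flat vector with 4 inputs per position."""
    vec = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq):
        vec[i, ALPHABET.index(base)] = 1.0
    return vec.ravel()


def scan(sequence, patterns):
    """Slide a window over the sequence and report where each pattern responds.

    Assumption: each output unit is a simple threshold unit whose weights are
    the one-hot encoded pattern, standing in for a trained network.
    """
    width = len(patterns[0])
    assert all(len(p) == width for p in patterns), "patterns must share one window size"
    weights = np.stack([one_hot(p) for p in patterns])  # one row per pattern
    hits = {p: [] for p in patterns}
    for start in range(len(sequence) - width + 1):
        window = one_hot(sequence[start:start + width])
        scores = weights @ window  # matching positions, all patterns at once
        for pattern, score in zip(patterns, scores):
            if score >= width:  # positive response only on an exact match
                hits[pattern].append(start)
    return hits


if __name__ == "__main__":
    # Toy database containing the pattern from figure 1 at position 2.
    print(scan("TTACGCCTTCAATGGA", ["ACGCCTTCAAT", "CCTTCAATGGA"]))
```

Every pattern is scored by the same matrix product for each window, which loosely mirrors the idea that additional patterns are evaluated side by side. In this software sketch the cost still grows with the number of patterns; the constant search time described above is a property of the parallel hardware.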
Biological sequence data are often noisy or even partly unknown. We have already shown that our networks can be trained to cope with noisy data, as illustrated in figure 2.
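To make the notion of noise tolerance concrete, the sketch below shows one simple way a detector can still respond when a base is wrong or unknown. It is an assumption made for illustration, not the training procedure used for our networks: unknown bases are written as N, contribute no evidence, and the response threshold is lowered by a hand-chosen number of allowed mismatches. The function names and the `max_mismatches` parameter are hypothetical.

```python
def match_score(window, pattern):
    """Count positions where the window agrees with the pattern.

    An unknown base (written 'N' here, by assumption) never equals a
    pattern base, so it simply adds no evidence.
    """
    return sum(1 for w, p in zip(window, pattern) if w == p)


def noisy_scan(sequence, pattern, max_mismatches=1):
    """Return start positions where at most max_mismatches positions disagree."""
    width = len(pattern)
    threshold = width - max_mismatches  # lowered response threshold
    return [
        start
        for start in range(len(sequence) - width + 1)
        if match_score(sequence[start:start + width], pattern) >= threshold
    ]


if __name__ == "__main__":
    # One base of the figure 1 pattern is unknown in this toy sequence.
    print(noisy_scan("TTACGCNTTCAATGG", "ACGCCTTCAAT", max_mismatches=1))
```

With `max_mismatches=0` the same sequence produces no response, which shows the trade-off: a lower threshold tolerates more noise but also admits more false positives.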