HammingNN: a neural network based pattern classifier; results with genomics datasets

As a medical student in the early 1980s, I was very excited when we were introduced to the physiology underlying neurology. Having previously graduated in Electrical Engineering (Waterloo, 1970) and with several years of work experience in computer systems engineering, I immediately saw the potential for modeling neural networks using computers.

Since medical school, I have been working on a project to develop such a model, and while much remains to be done, I believe that my results so far are very promising.

In essence, what I have is a nearest neighbour classifier that uses Hamming distances. It works effectively with both discrete and continuous data, and because it uses a neural network paradigm, it is tolerant of missing data. In common with other nearest neighbour classifiers, it can be trained in a single pass through the training data. A trained classifier occupies a very small memory footprint and requires very little processing power to run, so its energy consumption is very low.
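
To make this concrete, here is a minimal Python sketch of a Hamming-distance nearest-neighbour classifier of this general kind, in which attributes are assumed to have already been discretized, missing values simply contribute nothing to the distance, and a query case is assigned the majority class of its k nearest training cases. The names and choices below are illustrative only, not the actual HammingNN implementation.

    # Minimal sketch of a Hamming-distance nearest-neighbour classifier.
    # Illustrative only; not the actual HammingNN code.
    from collections import Counter

    def hamming(a, b):
        """Count attribute positions where two cases differ; skip missing values."""
        return sum(1 for x, y in zip(a, b)
                   if x is not None and y is not None and x != y)

    def classify(query, cases, labels, k=1):
        """Return the majority class among the k nearest cases by Hamming distance."""
        ranked = sorted(range(len(cases)), key=lambda i: hamming(query, cases[i]))
        votes = Counter(labels[i] for i in ranked[:k])
        return votes.most_common(1)[0][0]

    # Toy usage: each case holds two already-discretized attributes.
    cases  = [(0, 1), (0, 2), (3, 4), (3, None)]
    labels = ['A', 'A', 'B', 'B']
    print(classify((0, 1), cases, labels, k=3))   # -> 'A'

Note that training here is nothing more than storing the discretized cases, which is why a single pass through the training data suffices.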

You can experiment with the classifier (running on my home server, with a rudimentary web interface) here:

http://holder66.dyndns.org:8080

While the current version of my classifier (programmed in Python) works only on static datasets, an earlier version, in Forth, was developed and tested to work also with time-based, or sequential, data, such as speech. There is good reason to believe that the mammalian brain does many things using sequential data classification, including recognizing images and remembering the properties of what it perceives in the real world.

I believe that my paradigm closely approximates the operation of pattern classification networks in the central nervous system. Because it was developed with eventual translation into simple logic circuitry in mind, i.e. without floating-point processors, it may offer a development path toward simulating significant CNS functionality.
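
One way to see why the paradigm lends itself to simple hardware: once each case is encoded as a fixed-width bit pattern (for example, one bit per discretized attribute value), the Hamming distance between two cases reduces to an XOR followed by a population count, with no floating-point arithmetic anywhere. The Python sketch below illustrates this; the one-hot encoding is an assumption made for illustration, not necessarily the encoding HammingNN uses.

    # Hamming distance as XOR + popcount on plain integers -- an illustration of
    # why the computation needs no floating-point hardware. Encoding is hypothetical.

    def encode(bins, n_bins_per_attr):
        """Pack a tuple of bin indices into a single one-hot bit pattern."""
        word = 0
        for position, b in enumerate(bins):
            word |= 1 << (position * n_bins_per_attr + b)
        return word

    def hamming_bits(word_a, word_b):
        """Bit-level Hamming distance between two encoded cases."""
        return bin(word_a ^ word_b).count("1")

    a = encode((0, 2, 1), n_bins_per_attr=4)
    b = encode((0, 3, 1), n_bins_per_attr=4)
    print(hamming_bits(a, b))   # -> 2: the one differing attribute sets two bits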

While I remain excited about this path of ongoing development, the current version of my pattern classifier has a number of immediate applications. For example, if trained on suitable clinical data, it can be used by physicians and other healthcare professionals as a clinical decision tool, in place of the algorithms now being used. It would provide improved accuracy, including when data are missing, and could be updated easily when more cases are added to the training dataset. Because of its tiny footprint, many diverse tools could be accommodated at the same time and stored locally on devices such as smartphones.

My classifier is also very effective at reducing the dimensionality of very large datasets. For example, microarrays used in genomics studies typically provide thousands of data points for each case. An example is provided in the paper by Pomeroy et al [Pomeroy SL, Tamayo P, Gaasenbeek M et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415:436-442]. They looked at three datasets, ranging from 34 to 60 cases, with floating-point values for 7130 genes in each case. The table below compares the classification results obtained by Pomeroy et al (which I have highlighted in the pdf) with my results. My classification accuracies are equal or better, and are obtained with a tiny subset of the thousands of genes.

 

Dataset | Pomeroy et al results | HammingNN classifier results (parameters) | # genes used (out of 7130)
A       | 35/42 correct         | 37/42 correct (g,y,n=5,k=2,i=52)          | 52
B       | 33/34 correct         | 33/34 correct (s,y,n=12,i=76)             | 76
C       | 13/60 incorrect       | 9/60 incorrect (s,y,n=6,i=4)              | 4

 

Please note that for Dataset C, Pomeroy et al used 8 genes to obtain their result, while my classifier obtained greater accuracy using only 4 genes. I believe that a nearest neighbour classifier can achieve this degree of accuracy with such a small subset of the available input variables because of the contrast enhancement provided by discretizing continuous variables into a relatively small number of “slices”. This mimics a process that occurs in biological nervous systems; for example, the cochlea of the inner ear discretizes audio frequencies into a relatively small set of bands.
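
To illustrate the kind of “slicing” described above, the sketch below discretizes a single continuous variable (for example, one gene’s expression values across several cases) into a small number of equal-width bins; whether this matches the exact binning rule used in HammingNN, or the role of the n parameter shown in the table, is an assumption.

    # Equal-width "slicing" of a continuous variable; illustrative only.
    def slice_values(values, n_slices):
        """Discretize floats into slice indices 0..n_slices-1; None stays missing."""
        present = [v for v in values if v is not None]
        lo, hi = min(present), max(present)
        width = (hi - lo) / n_slices or 1.0
        return [None if v is None
                else min(int((v - lo) / width), n_slices - 1)
                for v in values]

    expression = [0.12, 0.95, 0.33, None, 0.78]   # one gene across five cases
    print(slice_values(expression, n_slices=4))   # -> [0, 3, 1, None, 3]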

As a physician, I am aware of many potential applications for this tool in the healthcare field. But clearly there are many other areas, such as maintenance planning or diagnosis, quality assurance, loan and mortgage approval, and so on, where it could be profitably applied.
