A Parallel Implementation of Neuro-Evolution with a Simple Neural Network

by Andrés Santiago Pérez-Bergquist

Introduction

For a great many tasks, identifying whether or not it is being performed correctly is quite easy, but actually coming up with an explicit procedure for performing the task is quite difficult. In these cases, one can make use of a variety of learning techniques which develop implicit algorithms by adapting to test cases given. Many of these techniques for developing algorithms rely on neural networks as their mode of computation, and one possible way of adapting a neural network to a problem is via neuro-evolution. In this technique, some encoding scheme is created for representing a neural network, large numbers of networks are created and evaluated, and those that show the most promise are taken and "bred" as if their representation were the DNA of living organisms, giving rise to a new generation of networks, some of which hopefully perform better than their precursors. Over time, a network will be developed that will be capable of performing well on test cases, hopefully indicating that it will perform well on actual data. The specific nature of the work involved lends itself very well to parallelization, and can actually benefit from being split up into multiple subtasks with limited communication.

Objective

To generate via neuro-evolution a simple neural network capable of performing reasonably well at a biologically-motivated pattern-recognition task.

Implementation

The target task for the evolved networks to accomplish was the successful recognition of a center-on surround-off pattern, which many cells in the human visual cortex do. In this task, input consists of points in a two-dimensional plane, and the network is supposed to respond positively when the stimulus is near the middle of the field of the vision, and negatively when the stimulus is in the surrounding region. Specifically, the input points were all in a circle of radius two. The network had two possible outputs, true and false. Points in the inner circle of radius should be true, points in the outer circle should be false. The region outside the outer circle is a don't care. While the program is running, it periodically shows graphically the performance of the best networks developed so far, with true regions in red, false regions in white, and the two circles superimposed in black. The following diagram is what an ideal network's output would look like:

The actual networks used are two layer, with eight effective hidden neurons. Each of these neurons has as input the x and y values of the point to consider, and a constant input of 1, known as the bias. If the weighted sum of these inputs is positive, the neuron outputs a one, otherwise it outputs a zero. There is an additional special neuron whose purpose is to propagate the bias signal to the output neuron. The output neuron computes a weighted sum of the inputs from the nine neurons before it, and outputs true if this sum is positive, false otherwise. Each of the eight hidden neurons can be used to detect if the input point is on either side of an arbitrary line in the plane, and the output neuron can then base its decision on where the point lies with respect to these lines. With this setup, the best approximation possible to the desired function would be a regular octahedron in place of the inner circle.

The parameters of each of the hidden neurons are kept in a gene, which holds the weights for the x input, y input, and bias input, and the weight the output neuron gives to the hidden neuron's output. For the bias unit, the first three are fixed at 0, 0, and 1, ensuring it always outputs a 1. A network has two genes for every hidden unit, and the corresponding values of each parameter from each of the two genes are added to obtain the values to use for that hidden neuron when classifying points with that network. (It should be noted that the use of diploid genes in neuro-evolution is not something I have encountered in the existing body of work, where haploid seems to be the standard, but rather something I was trying as an innovation.) These are initially set randomly to a floating-point value uniformly distributed between -1.0 and 1.0. When a new network is created by breeding two parents, one of each gene from each parent is randomly picked, just like chromosomes from diploid organisms engaging in sexual reproduction. There is a small chance of a mutation taking place at this point, which results in the addition of a small value uniformly distributed between -0.1 and 0.1 to each of the four parameters of one of the copies of a gene (or just the output weight, in the case of the bias propagation unit).

There are ten separate processors, arranged in a ring. (Since the problem being solved does not intrinsically suggest a topology, and the flux can be altered as one wishes, the cheapest topology was chosen.) Each one initially create a population of 20 random networks. They are then evaluated based on their performance on 50 points, 25 each inside inner circle and outside it, with the angular values evenly distributed about the circle. The number of points they label correctly is their score. The scores for the top ten at each processor are displayed on the screen in columns, with perfect scores of fifty highlighted in red. Each processor then selects two parents. One is always drawn randomly from the top ten at that processor. The other is drawn from the top five at either that processor or one of its neighbors (with the sequence being two from here, one from left, two from here, one from the right, repeat). The offspring of the two networks is evaluated and inserted into the population. The lowest-scoring member of the population is then deleted. This is repeated until the top five networks at that processor have perfect scores. This is done to make sure that a single network getting a perfect score isn't just luck, but rather the result of having genes that lead to reasonable solutions. The processors do not bother to synchronize between generations, as this is not needed and introduces time-wasting overhead. Simple semaphores are used to ensure that no one tries to read a population of networks while it is being written to.

Every hundred generations, the processors also take time out to generate a picture of the region the current top-performing network recognizes, and to dump the values of the weights of that network (the sums of each pair of genes), which are printed to the screen in alternating black and dark gray (for increased clarity), with the bias propagation unit first. This is not done more often because it is computationally expensive. Since the listing of the scores of the networks is updated every ten generations, the image below may lag the actual performance of the current best network.

The Program

The commented source code based on the above design resides in the following files, which are best viewed in a monospaced font with tabs set to 4 spaces:

The applet itself is available for running through this link. When run, it usually produces some decent though not exceptional results within a minute or two, using MRJ 2.2 on a 240 MHz 603ev.

Analysis

Even though it has managed to find a tolerable solution every time I ran it, it does not always manage to do so on all processors. Sometimes, one processor with a solution that is substantially different from that of its neighbors will sit there with a score in the high 40s and not improve. It has apparently hit an adaptive peak, and getting a better solution would require moving off that peak and experiencing a temporary decrease in performance, something which is not allowed under the current evolution scheme used, which is strictly hill-climbing. Since it is substantially different from its neighbors, they provide no help, as any crossbreeds have genes that try coding for different functions, wind up with garbage neurons that are barely able to do anything at all, and get booted out of the population instantly. Upping the mutation rate would help this, but would affect the ability to converge on actual solutions. A potential remedy would be an adaptive mutation rate that increased when performance stagnated. Similar techniques have been widely explored in neural network research.

In any case, the technique used was quite amenable to parallelization. There was no need for centralized control, just a limited local locking mechanism, which was simplified by that fact that any given piece of data might have several potential readers but only one possible writer. Each processor can do its work separately without concern of what the other processors are doing (or even if they're still there at all), removing any sort of sequential bottlenecks whatsoever. The amount of flux between adjacent processors can be adjusted by means of the rate of cross-population interbreeding, though this actually affects the dynamics of the solution-finding process. Deviating too far from some ideal value could impact the entire system's ability to find solutions. Excessively high rates of flux effectively turn the entire system into a single population. Aside from the increased flux, there is less opportunity for different solutions to be tried and for speciation to occur; all the populations will look very similar, leading to an increased chance of getting stuck on a local optimum that is not a global optimum. The breaking into separate populations with moderate interbreeding that is necessitated by parallel processing is actually a useful feature that a monoprocessing solution would probably employ as well. On the other hand, if every population is very isolated from the others, the potential for genes that are useful in one population to be useful when combined with the very different genes in a neighboring population is quite low, effectively segregating the different populations entirely. (Just as most biological species can't interbreed.) In this case, when one of them gets stuck, it can't really be helped by its neighbors. In any case, any adjustments to the rate of interbreeding only affect the flux by a constant factor, so they're not that important to speed, and should be tuned for optimum correctness.

With respect to the size of the network, it can be scaled arbitrarily large without any changes, as each processor is only concerned about a very small local neighborhood. While adding additional processors does increase the probability that at least one will converge to a solution, and do so sooner, it rapidly becomes a situation of diminishing returns. In general, finding a solution to a given problem will require some average amount of work on a single population. Thus, while there are no technical reasons not to scale up, the actual practical value of increasing the network size is limited. Since convergence is a probabilistic endeavor that isn't even guaranteed to work, the only real way to determine the relationship between processors and time would be to run many trials each of many different numbers of processors and graph the time it took in each case.