Life is by far the best-known two-dimensional cellular automaton, invented by the mathematician John Horton Conway several decades ago. It consists of an infinite two-dimensional square grid, each of whose cells may be in either an off or an on state. At each time step, each cell deterministically sets its next state based on its current state and the states of its eight immediate neighbors. Off cells turn on if exactly three neighbors are on, and otherwise remain off; on cells remain on if exactly two or three neighbors are also on, and otherwise turn off. (The inspiration is that on cells breed other on cells, but cells can die of loneliness or overcrowding.) These simple rules produce amazingly complex behavior, with a variety of interesting, beautiful, and organic-looking patterns that can spread through the space. In fact, Life has been proven to be equivalent to a universal Turing machine. It was initially studied by hand, though as soon as computer simulations became possible, they were far preferable, being orders of magnitude faster.
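The update rule just described can be sketched in a few lines (assuming 1 = on and 0 = off; the class and method names here are illustrative, not taken from the actual applet):

```java
// Minimal sketch of the Life update rule described above.
public class LifeRules {
    // state: 1 = on, 0 = off; liveNeighbors: count of on cells among the 8 neighbors
    public static int nextState(int state, int liveNeighbors) {
        if (state == 0) {
            // An off cell turns on only with exactly three on neighbors.
            return (liveNeighbors == 3) ? 1 : 0;
        }
        // An on cell survives with two or three on neighbors; otherwise it dies.
        return (liveNeighbors == 2 || liveNeighbors == 3) ? 1 : 0;
    }
}
```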
The two-dimensional grid nature of the problem immediately suggests that a mesh of processors would be an ideal way to handle it. Since each cell needs to know only its own state and the states of its eight immediate neighbors to determine its next state, information does not need to propagate at unreasonable speeds across the mesh. In terms of bisection bandwidth, a mesh that is k processors wide at the bisection point needs to send k elements of information across the boundary in each direction during each generation. Thus, if there are n processors in a square grid, the bisection bandwidth is O(sqrt(n)), a perfect match for a mesh.
At the extreme end of granularity, each cell could have its own processor, but this is ludicrously expensive, due to the very large number of cells needed for an interesting problem. A better solution is to allocate each processor a contiguous rectangular group of cells, dividing the entire space into tiles. To allow for arbitrary scaling, the solution lets the dimensions of the region each processor handles be specified as a run-time parameter.
In the ideal version of Life, it truly operates in an infinite plane. Actually doing this, however, requires possibly extending the patch of space under consideration indefinitely, as pieces of the pattern expand ever outward. (Indeed, there exist finite starting patterns that spawn an infinite number of on cells.) Such extension does not lend itself well to allocating each processor a certain space to consider and then just running, because either each processor must expand and change the area it considers, or the new areas must be assigned in an unbalanced and noncontiguous fashion, increasing bisection bandwidth (though not asymptotically). Two solutions come to mind. First, a finite space could be considered, and any cells that would be generated outside it ignored. This raises the problem of how to treat the boundary cells, as they have fewer than eight neighbors; additionally, a very interesting moving pattern could simply drop off the edge of the world. Alternatively, a finite space could be considered, but with the edges connected to each other to form a torus. This means that all cells are treated alike (as far as they are concerned, there are no edges) and that no patterns are lost, though in an overly small space, patterns that grow large enough could collide with themselves, resulting in different evolution than in a truly infinite space. This risk is treated as acceptable. In any case, the torus is easily accomplished by linking the opposite edges of the processor mesh, creating a wraparound mesh.
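The wraparound arithmetic is simple in practice. A minimal sketch of counting on neighbors on a toroidal grid, assuming the whole grid sits in a single int[rows][cols] array (in the real program each processor holds only its own tile and reads its neighbors' edges):

```java
// Sketch of neighbor counting on a toroidal (wraparound) grid.
public class TorusNeighbors {
    public static int countOnNeighbors(int[][] grid, int r, int c) {
        int rows = grid.length, cols = grid[0].length;
        int count = 0;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue; // skip the cell itself
                // Adding rows/cols before taking the modulus keeps indices
                // non-negative, so each edge wraps to the opposite side.
                int nr = (r + dr + rows) % rows;
                int nc = (c + dc + cols) % cols;
                count += grid[nr][nc];
            }
        }
        return count;
    }
}
```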
Each individual processor needs to keep track of two main items of information: the current and previous states of its section of the grid. Once the current state has been calculated, it is displayed, and then replaces the previous state and is not modified again. When calculating the current state at the edges of its section, a processor needs read-only access to the previous states held by its neighboring processors. Since each processor writes only to its own current state and reads only from previous states, there is no possibility of read-write or write-write conflicts or race conditions when accessing memory.
However, each processor needs to be certain that it reads the actual previous state of its neighboring processors, and not some arbitrary state held in the variable containing it. That is, the processors need to stay in step: no processor can start computing generation n+1 until all its neighbors are done computing generation n. Since we want the visual display to show an actual generation, and not fragments from different ones, the requirement is actually more restrictive: no processor can start on generation n+1 before all processors are done with generation n. This is easily implemented by means of a synchronization barrier (the code for which is provided in the textbook), at which all processors must wait upon completing a generation before beginning the next one. This ensures that all processors remain at the same generation.
Even within a generation, additional synchronization is needed. The only time the previous generation is modified is when the current generation is copied back over it. Obviously, this cannot take place while neighboring processors are still reading from the old previous generation, and it must complete before any neighboring processor attempts to read from the new current generation. So, all processors compute the next generation in parallel, synchronize, age the current generation into the previous one, synchronize again, and repeat indefinitely.
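The compute/synchronize/age/synchronize cycle can be sketched as follows. The actual program uses the barrier code from the textbook; here java.util.concurrent.CyclicBarrier stands in for it, and the class name, grid dimensions, thread count, and blinker seed are all illustrative:

```java
import java.util.concurrent.CyclicBarrier;

// Sketch of the per-thread loop: compute, barrier, age (copy back), barrier.
public class LifeWorker implements Runnable {
    static final int ROWS = 4, COLS = 4, GENERATIONS = 2;
    static int[][] prev = new int[ROWS][COLS]; // read-only during compute phase
    static int[][] curr = new int[ROWS][COLS]; // each thread writes its own band
    static CyclicBarrier barrier;

    final int loRow, hiRow; // this thread's band of rows [loRow, hiRow)

    LifeWorker(int loRow, int hiRow) { this.loRow = loRow; this.hiRow = hiRow; }

    public void run() {
        try {
            for (int gen = 0; gen < GENERATIONS; gen++) {
                // Phase 1: compute this band of the current generation from prev.
                for (int r = loRow; r < hiRow; r++)
                    for (int c = 0; c < COLS; c++) {
                        int n = neighbors(prev, r, c);
                        curr[r][c] = (prev[r][c] == 1)
                                ? ((n == 2 || n == 3) ? 1 : 0)
                                : ((n == 3) ? 1 : 0);
                    }
                barrier.await(); // all threads finished computing this generation
                // Phase 2: age the current generation into prev (own band only).
                for (int r = loRow; r < hiRow; r++)
                    System.arraycopy(curr[r], 0, prev[r], 0, COLS);
                barrier.await(); // all threads finished aging; safe to proceed
            }
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static int neighbors(int[][] g, int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue;
                count += g[(r + dr + ROWS) % ROWS][(c + dc + COLS) % COLS];
            }
        return count;
    }

    // Runs two threads for two generations; a blinker has period two,
    // so the final state should match the initial one.
    public static int[][] simulate() {
        prev = new int[ROWS][COLS];
        curr = new int[ROWS][COLS];
        prev[1][0] = prev[1][1] = prev[1][2] = 1; // seed a horizontal blinker
        barrier = new CyclicBarrier(2);
        Thread a = new Thread(new LifeWorker(0, 2));
        Thread b = new Thread(new LifeWorker(2, 4));
        a.start(); b.start();
        try { a.join(); b.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return prev;
    }
}
```

Note that neither phase touches data another thread is writing: phase 1 reads prev and writes a thread's own band of curr, and phase 2 copies a thread's own band of curr into its own band of prev, with a barrier separating the two.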
This still leaves the task of displaying the current generation. While the act of displaying could in theory be done in parallel, this is not feasible. Displaying graphics requires graphics hardware that is separate from the main processor and is generally not parallel (except for high-end 3D graphics cards). Additionally, within the model of a Java applet, all drawing must be done from the applet's paint() routine, which is passed a graphics object to draw to. While multiple drawing threads could be kept running, waiting for the command from the applet's thread, the graphics context would need to be distributed to every thread, requiring a one-to-all broadcast every generation. Thus, the actual drawing is best done sequentially. However, determining what to draw can be done in parallel: each processor can look at the current state of its local section, determine the bitmapped image to draw to the screen, and cache it for future use by the sequential paint() routine of the applet. As an added benefit, this implements buffering instead of direct-to-screen drawing, yielding extra speed even if the entire grid is handled by only one processor, and making the drawing process look smoother. However, while the image is being drawn, it cannot be changed, so the drawing process must be protected from both the aging of the current generation and the computation of the next one by another synchronization barrier.
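The parallel pre-rendering step might look like the following sketch, where each thread rasterizes its tile into an off-screen image for the sequential paint() routine to copy later; the class name, cell-size parameter, and color choices are assumptions, not details from the original applet:

```java
import java.awt.Color;
import java.awt.Graphics;
import java.awt.image.BufferedImage;

// Sketch of a thread pre-rendering its tile into an off-screen image.
public class TileRenderer {
    public static BufferedImage render(int[][] tile, int cellSize) {
        int rows = tile.length, cols = tile[0].length;
        BufferedImage img = new BufferedImage(cols * cellSize, rows * cellSize,
                                              BufferedImage.TYPE_INT_RGB);
        Graphics g = img.getGraphics();
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                // On cells are drawn black, off cells white.
                g.setColor(tile[r][c] == 1 ? Color.BLACK : Color.WHITE);
                g.fillRect(c * cellSize, r * cellSize, cellSize, cellSize);
            }
        g.dispose();
        // paint() then only needs graphics.drawImage(img, x, y, null) per tile,
        // with no per-cell work on the applet's thread.
        return img;
    }
}
```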
While the entire program runs as an infinite loop, calculating generations and displaying them until the user gets bored and terminates it, it does require some beginning. The initial Life grid is seeded with a random pattern of 70% off cells and 30% on cells. This gives a high density of interesting patterns to watch evolve, while giving them the breathing room to do so. Each processor can generate this pattern for its own section in parallel, completely independently of the other processors. However, no processor can calculate the next generation until its neighbors have completed their initial seed patterns, so a synchronization point is needed between initialization and the main loop.
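The seeding itself is a single random draw per cell. A sketch, with illustrative names and an explicit seed for reproducibility (the real applet presumably draws from an unseeded source of randomness):

```java
import java.util.Random;

// Sketch of the 70/30 random seeding; each thread can run this
// independently on its own tile with its own Random instance.
public class Seeder {
    public static int[][] seed(int rows, int cols, long randomSeed) {
        Random rng = new Random(randomSeed);
        int[][] grid = new int[rows][cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                grid[r][c] = (rng.nextDouble() < 0.3) ? 1 : 0; // ~30% on cells
        return grid;
    }
}
```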
Additionally, the applet needs to know when it should redraw the screen display. This can be accomplished by notifying it, but the notification should happen only once per cycle through the loop. This is done by arbitrarily choosing the first processor to reach the synchronization barrier to perform the notification. All the others simply wait while their previously computed bitmaps are drawn to the screen.
The applet itself is available for running through this link, which lets one set up the options for how many threads to use, how large an area to give each thread, and how large to make each cell when drawing to the screen.
Due to the unusual nature of the work being done, there are no test cases to run and no permanent output to examine. However, observing the display of the running program for a while allows anyone familiar with Life to verify what is happening. All of the simple stable structures and oscillators that appear are those that typically arise from a random initial state. Additionally, gliders arise with some frequency; these are small y-shaped groups that travel diagonally across the grid at a net rate of one cell every four generations until they encounter something. Close attention was paid to the seams between processors, as any mistake in the special-case code for each edge and corner would result in abnormal behavior there. Throughout the testing process, as features were added and the program was run repeatedly, no visible errors were detected.
It was intended that the applet be tested on multiprocessor machines to see if increasing the number of threads, up to the number of processors, would result in faster generation times. Unfortunately, the only multiprocessors I had access to were dual-processor Linux boxes, and Netscape's notoriously flaky JVM refused to do anything other than display a blank box when faced with the program. Multiple other JVMs on other platforms had no problem whatsoever (Metrowerks Java and MRJ on Mac OS, Microsoft's JVM on Windows).
While there is not much that could be done to improve the algorithm for finding the next generation, as it involves nothing more than neighbor sums computed in a pass over the array in no particular order, some possibilities do exist. Since certain small patterns tend to crop up fairly often, they could be stored in some sort of hash table and looked up to avoid recomputing several sums. Of course, finding the right patterns to include in the table (which gets slower as it grows) and devising a hash function faster than the brute-force computation are both quite nontrivial.
If viewing the actual process of evolving is deemed less important than just finding out what eventual steady state or cycle something settles down to, then the hashing alternative becomes more attractive, as one can store jumps of many generations and fill in the unusual parts via the standard method, potentially allowing one to find the end result of large complex patterns very quickly. With significantly less effort, one could take the current algorithm and simply make it display the pattern only once every n generations, so as to help remove any potential display bottlenecks. This still allows one to see how the pattern is evolving overall, albeit not at every instant, while still hurrying along towards the end product faster than is possible now.
In any case, the current solution scales quite well. With the exception of entering and exiting the synchronization barriers, everything being done now would need to be done by a strict uniprocessor solution as well. Specifically, if there are n cells total, then each cell needs a constant amount of work, so a uniprocessor does O(n) work. Splitting the n cells among p processors means that each processor does n/p + v work, where v is the constant overhead incurred by the synchronization barriers. Thus, across all processors the system does p(n/p + v) = n + pv total work, which is O(n) as long as p <= n. That is, the system is cost-optimal even if each and every cell is given its own processor. Even if accessing data on neighboring processors takes longer, this only increases the time spent by a constant multiplier, which does not affect the asymptotic performance. In the limit where each cell has its own processor, not only is this cost-optimal, but the problem cannot be solved any faster, as each cell needs a certain amount of sequential work to determine its state in the next generation. So, this problem is about as ideal for parallelizing as it could get.