A Parallel Distributed Client-Server Implementation of a Brute-Force RSA Cracker

by Andrés Santiago Pérez-Bergquist

Introduction

Modern cryptography is based upon various algorithms generally involving primality and the factoring of huge numbers. The most famous of these is the RSA Cryptosystem, which was the first widely-used public key cipher, in which there are two complementary keys, and a message encoded with one can only be decoded by means of its partner (and not itself). The secrecy of the RSA system depends upon the enormous amounts of time required to factor the numbers involved. If sufficiently small numbers are used as keys, then a particular use of the RSA cipher becomes vulnerable to brute-force attacks that attempt to try all possible decoding keys until the correct one is found. The feasibility of such an attack is highly dependent on the amount of brute force available to the attacker. When RSA Inc. released the first of their challenges to break certain encoded messages that they made public, it at first seemed that no one without a mainframe could possibly hope to crack the keys, and that no one would waste a valuable supercomputer on such a task. However, clever folk realized that this was a highly parallelizable task with virtually no flux, and could thus be handled equally well by hordes of low-cost processors. Additionally, it was realized that ordinary home computers sit idle much of the time, and that if one could get enough of them working together in parallel, they could outmatch anything Cray ever built, at least on this task. Thus was born distributed.net, an attempt to harness the latent computing power of all the computers on the world. It consists of a client-server architecture, where a client program is installed on each computer, which then connects to a central server to obtain a quantum of work to do, does it, and returns the results when fetching the next quantum. The fact that the amount of information needing to be transferred is quite small makes this possible over the relatively glacial communication speeds of the Internet. The effectiveness of this approach has led to the breaking of several cipher challenges, up to 56 bits in length, and the spawning of a variety of other projects, such as SETI@Home, which follow the same architecture.

Objective

To produce a simplified model of a computer network, and then implement a client-server architecture showing how multiple computers could coordinate efforts to break via brute-force an RSA-encoded message with the first few characters known.

Implementation

The first task was to create the simulated network. The size of the network was arbitrarily set at 10 nodes--much smaller than that of distributed.net, but enough to show the principles at work in the simulation. These ten nodes were embedded randomly in a 200x200 space, and bidirectional links were inserted from each node to its nearest three or four neighbors. This resulted in a connected network every time the program was run. Each node consisted of two actual processes. One was a task equivalent to an OS and its network drivers that accepted messages destined for other nodes, routed messages along their way, and recognized messages intended for the node it was running on. The other was representative of an actual program running on the computer which occasionally required interacting with the network. The latter was eventually replaced with the client and server under development, but was originally a simple process that periodically sent out network traffic semi-randomly, preferring nodes with which it had just interacted with, and which responded to incoming messages by occasionally sending one back. This generated enough traffic with the general properties of actual traffic to be able to adjust the network to function correctly.

Each node knows to which of the other nodes it had direct connections, but nothing more. Each maintains a routing table estimating the time in milliseconds to send a message to any given node through each of its neighors. The values are initially set based on the distance between nodes, since the time to transmit a message between two nodes is simulated as dependent on the length of the link between them, but this distance information is not used or actually known by the node itself. The times in the table are later adjusted by sending back a response upon receiving a message, and thus letting the origin node know how long the message took to arrive. Early routing algorithms tried had problems. Primarily, the network would settle into sending messages by paths much longer than necessary, and nodes could become confused and bounce messages between themselves forever, without sending them on their proper way. These problems were eventually solved. The algorithm used is non-deterministic. It keeps track of the estimated time to send a message to a given destination through any of its nodes, and chooses to send the message through the node with the fastest time about two-thirds of the time. The rest of the time, it considers the next fastest node and uses it two-thirds of the remaining time, and so on, creating an exponential probability distribution. Nodes which a message has not visited before are always given priority over nodes which have been visited. By using this algorithm, the network quickly gets a feel for how messages should be routed, and generally sends them in the right direction. The way it is set up encourages each node to explore all the paths and use those that actually work, and prevents even bad data from permanently affecting network performance, as grossly incorrect time estimates will be fixed as soon as a message or two actually gets through. The fact that it is not very intelligent about routing overall probably means that it would not scale well to much larger network sizes, but this is not particularly relevant, as the network itself is not what was being studied by the project.

The network is displayed graphically, with each node as a numbered black dot in a position correspoding to its location. Gray lines between nodes designate links, with ones actively being used to transmit a message highlighted in pink. Red circles behind each node represent the number of queued messages waiting to leave that node, with each pixel radius beyond 3-pixel radius black dot representing one message. When a message reaches its destination, the path it followed it logged to the console, and when a response to a message gets back to the originating node, this is noted as well, though the entire path is not given.

Cracking an RSA-encoded message was first done with a single-threaded application that creates a set of RSA keys, takes the string "This is the decrypted message.", and encodes it. Then, it pretends to only know what the first four letters of the original text are, much like the actual RSA challenges, which consist of a coded message and the first few plaintext characters. It simply starts trying every possible key value, starting at zero and going up to the modulus, until it hits on one that successfully decodes the message, detected by mapping the encrypted opening letters to the actual opening letters. For the case used in the code, approximately a 22-bit modulus, a working decoding key is 130397. When the key is found, it is used to decode the entire message, which is then output, and the total time spent on the task is also output.

For the final program, the work was split between a client and a server. The server always runs on node Zero, and all the other nodes run clients. The server begins running a program essentially identical to the single-threaded version, creating and encrypting the message, until it reaches the point where the single-threaded program would begin using brute force. The server then waits for requests from the clients. When the clients start up, the first thing they do is create a request for work to do and send it to the server. Upon receiving one of these requests, the server parcels out a block of 10,000 contiguous numbers and assigns them to the requesting node as its current workload. When the node receives the fulfillment of its request, it logs it to the output and attempts every key in that block as a decoding key, just as the single-threaded program did. If it should find a decoding key, it immediately sends a message to the server, and then shuts down, as there's nothing else to do. When the server receives a message that contains not a request for work but a working decoding key (or, alternately, when the the last block has been assigned, but not all blocks have been completed), it stops honoring requests for work, as they are unneeded, and proceeds to decode the message itself, finishing in the same way the single-processor program does. No further work takes place, though requests fro work may still come in from other processors for a short while afterwards.

The Program

The commented source code based on the above design resides in the following files, which are best viewed in a monospaced font with tabs set to 4 spaces:

The applet itself is available for running through this link, as well as the single-processor version here; both provide most of their output through the console.

As to accuracy, both programs successfully recover the original phrase. In addition, earlier versions used only numbers instead of text, and successfully recovered those using a variety of different keys of lengths from 16 to 24 bits.

When both programs were run on Apple's MRJ 2.1 on a 240 MHZ 603e UMAX c600, the simple version took between 25 and 27 seconds to run, while the network version took between 26 and 43 seconds to run, with the network topology apparently playing an important factor in the run time. It should be noted that best-case run-times, even including the overhead from simulating the network, were almost identical. Thus, under ideal conditions, the overhead is almost zero, as one would expect. If the overhead were a problem, one could always increase the size of the blocks given out each time. On an actual network where the work really was done in parallel, phenomanal speedups almost equal to the number of processors present should be observed, and indeed are in the actual distributed.net setup.

Ideally, the applet would have been tested on multiprocessor machines to make certain that the network version would indeed run faster than the simple version, but, unfortunately, the only multiprocessors I had access to were dual-processor Linux boxes, and Netscape's outdated JVM yet again refused to do anything other than display a blank box when faced with the program. Multiple other JVMs on other platforms showed no problem whatsoever.

Analysis

The problem in question was completely parallel, in that each individual key could theoretically be tried at the same time by a different processor, because all the information needed for each attempt is present at the beginning of the problem. There is literally no flux. The motion of information during the calculations is not actually part of the problem, but rather meta-information dealing with the distribution of work. While this is not strictly necessary in the version of the problem I simulated, in reality different computers might have different speeds, be needed for other things, and added to or taken away from the network periodically, requiring a dynamic distribution of work that is not entirely mapped out ahead of time. Still the amount of communications work needed compared to the amount of computational work is microscopic, and can be reduced as needed by increasing the size of the work blocks allotted, reducing the need to communicate.

The initial encryption was not done in parallel, but this wasn't really part of the problem, since in an actual situation one would be attempting to break an already-encoded message whose plaintext one did not already have. The final decoding could also be done in parallel, since decoding one chunk does not depend on the other chunks, but the actual time required to decode the message once the key is known is so small that parallelizing it would be of little use. Given the relatively large communication times an actual network yields, just doing it on the server is probably faster, in reality, at least.

In any case, the current solution scales just about perfectly. Increasing the size of the network (given a proper routing algorithm) allows one to increase the number of processors working on the problem without adding any computational overhead. The only overhead is in communications, since larger networks have larger diameters. But, given that an actual problem of this sort needs computational time measured in hundreds of thousands of processor-years, a few seconds latency over the internet won't really be noticed. Plus, there's nothing preventing one from communicating and cracking at the same time. Once the first block has been received, it can be started upon immediately, and new blocks can be obtained while still working on old block, as well as results being uploaded while working on new blocks. Most computers have separate sub-processors or controllers for their network interface cards, freeing the main processor to do useful work in the meantime. Even if the overhead from communications does become a problem, increasing the granularity by increasing the block size cuts the communications overhead by a factor equal to the increase in the block size. In short, it really doesn't get any better than this; theory points to this being the ideal task to parallelize, and practice (on order of hundreds of thousands of computers) bears it out.