Wednesday, November 2, 2011

Lazy Mandelbrots in Clojure

When learning a new language it is often beneficial to select a small fun task that one has previously applied to other languages.

Rendering the Mandelbrot set is one such small task that I find fun; I am a sucker for results that are visually appealing! Such rendering, while simple to implement, is computationally expensive. Rendering the set as fast as possible can be a useful exercise to understand the performance capabilities of the language.

My experience so far with Clojure has been a good one. The runtime has a bijouxesque quality to it, the Clojure 1.3 runtime jar is about 3.4 MB. I thought the syntax would be an issue but one quickly sees through the brackets and braces when using a good editor. The learning curve for the Clojure syntax is not much different to that of the JSON syntax.

All code for rendering the Mandelbrot set can be found here on github.

First, a function needs to be defined that determines if a point, say c, in the complex plane is a member of the Mandelbrot set. Basically, a point c is a member of the set if iterations of z_{n+1} = z_n^2 + c remain bounded, i.e. no matter how large n gets, z_n never goes beyond a certain value. In addition, if c is not a member then a value can be assigned that reflects the rate at which the sequence of z values tends towards infinity; this value is commonly used to assign colours to points outside of the set.

Using Clojure 1.3 the following function is the fastest I could implement:

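A minimal sketch of this kind of function, assuming the point c is passed as separate real and imaginary parts together with a maximum iteration limit (the exact escape value returned for non-members is illustrative), is:

;; Sketch: cr/ci are the real and imaginary parts of c and limit is the
;; maximum number of iterations.  Returns 0 for members of the set,
;; otherwise a value indicating how quickly the sequence escapes.
(defn mandelbrot ^long [^double cr ^double ci ^long limit]
  (loop [zr 0.0 zi 0.0 i 0]
    (if (== i limit)
      0
      (let [zr2 (* zr zr)
            zi2 (* zi zi)]
        (if (> (+ zr2 zi2) 4.0)
          (- limit i)
          (recur (+ (- zr2 zi2) cr)
                 (+ (* 2.0 zr zi) ci)
                 (inc i)))))))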

If the function returns zero then c is a member of the set, otherwise a value is returned that represents how fast the sequence tends towards infinity.

Notice the primitive type hints (see here for more details). This is not ideal; I would prefer to utilize complex number types from Clojure contrib rather than having to split out the real and imaginary parts, for example:

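A sketch of that more appealing style, using a two-element vector [re im] as a stand-in for a proper complex number type and iterate to build the lazy sequence of z values (the helper names here are illustrative), is:

;; Illustrative helpers: a vector [re im] stands in for a complex type.
(defn c-add [[ar ai] [br bi]]
  [(+ ar br) (+ ai bi)])

(defn c-mul [[ar ai] [br bi]]
  [(- (* ar br) (* ai bi)) (+ (* ar bi) (* ai br))])

(defn magnitude-squared [[r i]]
  (+ (* r r) (* i i)))

;; iterate produces the infinite lazy sequence of z values;
;; take-while/take bound how much of it is realized.
(defn mandelbrot-lazy [c limit]
  (let [zs (iterate #(c-add (c-mul % %) c) [0.0 0.0])
        n  (count (take limit (take-while #(<= (magnitude-squared %) 4.0) zs)))]
    (if (== n limit) 0 (- limit n))))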

which is similar to the implementation here. I especially like the use of the iterate function to create the lazy sequence of z values. However, this approach proved to be slow. I don't recall the exact numbers, but there was about a 15x difference between using the more appealing, but slower, function on Clojure 1.2 and using the less appealing, but faster, function on Clojure 1.3. Note that Clojure 1.3 does have some significant performance improvements for arithmetic operations; however, using the more appealing function on Clojure 1.3 still proved to be too slow.

Regardless of the two approaches, I like Clojure's approach to tail recursion using loop and recur.

Next, let's define a lazy sequence of values of the mandelbrot function that range over an area of the complex plane:

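A sketch of such a function, assuming the image-space rectangle is given by an offset (x0, y0) plus a width and height, and the mapping into the complex plane by a lower corner [lr li] and a per-sample distance d, is:

;; Sketch: maps each image co-ordinate (x, y) linearly onto the complex
;; plane and pairs it with the value of the mandelbrot function.
(defn mandelbrot-seq [x0 y0 w h [lr li] d limit]
  (for [y (range y0 (+ y0 h))
        x (range x0 (+ x0 w))]
    [x y (mandelbrot (+ lr (* x d)) (+ li (* y d)) limit)]))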

Since this is a lazy sequence, no calculations will occur until values are extracted from that sequence. This function applies a linear transformation from co-ordinates in some space, say that of a graphical image, to co-ordinates in the complex plane.

The sequence returned will be of the form ([x y limit] [x y limit] ...).

Then, let's define a function that returns a lazy sequence of sequences given a bounding area in complex space, a and b, and the number of values, size, that should be calculated along either the real or imaginary axis:

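A sketch of mandelbrot-seqs along these lines, assuming it also takes the number of sub-sequences n and the iteration limit as arguments (the argument order is illustrative, and it uses the ranges helper shown a little further down), is:

;; Sketch: normalize the bound given by a and b, derive the sample
;; distance d plus the width w and height h, and split the area into n
;; horizontal strips, each backed by its own lazy mandelbrot-seq.
(defn mandelbrot-seqs [n size a b limit]
  (let [[ar ai] a
        [br bi] b
        lr (min ar br) ur (max ar br)        ; normalized real bounds
        li (min ai bi) ui (max ai bi)        ; normalized imaginary bounds
        d  (/ (max (- ur lr) (- ui li)) (double size))
        w  (long (Math/round (/ (- ur lr) d)))
        h  (long (Math/round (/ (- ui li) d)))
        strip (long (Math/ceil (/ h (double n))))]
    [w h
     (for [[y0 y1] (ranges h strip)]          ; ranges is defined below
       [0 y0 w (- y1 y0)
        (mandelbrot-seq 0 y0 w (- y1 y0) [lr li] d limit)])]))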

The bound represented by a and b is normalized into a lower and upper bound. Given the size and the lower and upper bounds, a distance d is calculated, along with the width, w, and height, h (the number of values to be calculated along the real and imaginary axes respectively; the total number of values will be the product of the width and height). If w = size then h <= size; otherwise, if h = size then w <= size.

The sequence returned will be of the form:
[w h
  ([x y w h ([x y limit] [x y limit] ...)]
   [x y w h ([x y limit] [x y limit] ...)]
 ...)]
where n defines the number of sub-sequences. In this particular case the function splits the bounded area into n regions, each represented by a lazy sequence, and for each sub-sequence the sub-bounds are declared. The reasons for this will become apparent later on. Again, note that when this function is called no calculations will be performed until values are extracted from the sequences.

As a slight aside, notice the ranges function:

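One way to write ranges in terms of partition, matching the example below, is:

;; Split the right-open interval [0, l) into consecutive right-open
;; sub-intervals of size s; the final interval may be shorter.
(defn ranges [l s]
  (partition 2 1 (concat (range 0 l s) [l])))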

I started writing this function and then realized partition did exactly what I wanted, which is: given a right-open interval [0, l), return the sequence of non-intersecting right-open sub-intervals of size s, optionally ending in a right-open interval whose size is less than s and whose upper endpoint is l. For example:
mandel.mandelbrot=> (ranges 10 3)
((0 3) (3 6) (6 9) (9 10))
The lesson being, "there is probably a function for that".

Given the mandelbrot-seqs function we can start rendering Mandelbrots. Here is a function to render as text:

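A sketch of such a text renderer over a single sub-sequence, mapping members of the set to a space and everything else to a digit (the character mapping and the exact output formatting are illustrative), is:

;; Sketch: takes size, the two corners a and b, and the iteration limit,
;; renders one strip (n = 1) and partitions the characters into lines of
;; width w.
(defn text-mandel [size a b limit]
  (let [[w _ [[_ _ _ _ points]]] (mandelbrot-seqs 1 size a b limit)]
    (->> points
         (map (fn [[_ _ v]]
                (if (zero? v) \space (char (+ (int \0) (min 9 v))))))
         (partition w)
         (map #(apply str %)))))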

This can produce output like the following:
mandel.mandelbrot=> (text-mandel 40 [-2 -1] [0.5 1] 10)
(888887777666666666666655554431   5555666
 888877766666666666665555554421  24555566
 8888776666666666666555555443    03445556
 8887766666666666665555544331     2444556
 8887766666666666655555433321     1234455
 8877666666666665555544 0 1         23205
 88766666666666555544430                4
 88766666666665544444320                3
 8866666666655444444330                03
 876666666553233433321                 02
 866666555443 12112211                  
 866655554432      00                  
 8655555444320                          
 855555444321                           1
 8555543321                             3
 8443233220                            13
 8                                     23
 8443233220                            13
 8555543321                             3
 855555444321                           1
 8655555444320                          
 866655554432      00                  
 866666555443 12112211                  
 876666666553233433321                 02
 8866666666655444444330                03
 88766666666665544444320                3
 88766666666666555544430                4
 8877666666666665555544 0 1         23205
 8887766666666666655555433321     1234455
 8887766666666666665555544331     2444556
 8888776666666666666555555443    03445556
 888877766666666666665555554421  24555566)
nil
In this case we don't need to split into multiple sub-sequences, and we are just interested in the width, w, or the number of characters on a line, so the sequence can be partitioned by that value.

Alternatively, we can render as an image, which gets more interesting. An image will contain many more pixels than there are characters in the textual output, so it will be more expensive to create. The rendering algorithm is, however, easy to parallelize, and this is the reason why multiple sub-sequences are supported.

If I have a machine with 4 CPUs I can map the problem to four lazy sequences, let each core process a sequence to produce an image, and reduce those images to one larger image.

Here is the code to produce a BufferedImage:

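A sketch of image-mandel, matching the signature used in the timing run further down (n parallel parts, image size, the two corners and the iteration limit), might look like this; the way the strips are combined here is illustrative:

(import '(java.awt Graphics2D)
        '(java.awt.image BufferedImage))

(declare image-mandel-doseq) ; defined below

;; Sketch: render each strip to its own image in parallel with pmap,
;; then copy the strips sequentially into one full-size image.
(defn image-mandel [n size a b limit]
  (let [[w h sub-seqs] (mandelbrot-seqs n size a b limit)
        image (BufferedImage. w h BufferedImage/TYPE_INT_RGB)
        ^Graphics2D g (.createGraphics image)]
    (doseq [[x y sub-image] (pmap image-mandel-doseq sub-seqs)]
      (.drawImage g ^BufferedImage sub-image (int x) (int y) nil))
    (.dispose g)
    image))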

For the case where the problem is split into n parts, the pmap function is used to map the function image-mandel-doseq in parallel over the lazy sequences. This produces a sequence of images, which are then reduced into one large image by a further sequential map operation.

The image-mandel-doseq function is where the work is performed to set pixels in an image:

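A sketch of image-mandel-doseq, assuming each sub-sequence carries its offset and dimensions and that the colour derived from the escape value is a simple grey scale (purely illustrative), is:

(import '(java.awt.image BufferedImage WritableRaster))

;; Sketch: allocate a sub-image for the strip and set each pixel from
;; the lazily calculated escape values via WritableRaster.setPixel.
(defn image-mandel-doseq [[x0 y0 w h points]]
  (let [^BufferedImage image (BufferedImage. w h BufferedImage/TYPE_INT_RGB)
        ^WritableRaster raster (.getRaster image)]
    (doseq [[x y v] points]
      (let [c (int (mod (* 32 v) 256))]
        ;; int and ints keep the setPixel call free of reflection
        (.setPixel raster (int (- x x0)) (int (- y y0))
                   (ints (int-array [c c c])))))
    [x0 y0 image]))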

The doseq repeatedly invokes WritableRaster.setPixel. Notice the type hint ^BufferedImage and the calls to the int and ints functions. To ensure that Clojure knows which setPixel method to call it is necessary to convert the values; otherwise a costly call by reflection will result.

When interacting with Java libraries I recommend setting the *warn-on-reflection* flag to true; the compiler will then warn when reflection is used to make calls:
(set! *warn-on-reflection* true)
So how does the parallel execution work out on my MacBook Pro with an Intel Core i7?
user=> (defn t [n f]
  (for [_ (range 0 n)]
    (let [start (. System (nanoTime))
          ret (f)]
      (/ (double (- (. System (nanoTime)) start)) 1000000.0))))
#'user/t
user=> (for [i (range 1 8)]
  (/ (apply + (t 10 #(mandel.mandelbrot/image-mandel i 1024 [-2 -1] [0.5 1] 256))) 10))
(679.6881000000001 467.689 451.8355 366.8658 361.23620000000005 400.1187 535.1956)
The system profiler says my machine has 2 cores, but sysctl -a says there are 4 CPUs. The results are not quite what I was expecting: nearly a 2x increase when the number of parallel tasks is four or five. Hmm... I don't quite understand this result. I need to dig deeper to work out what is going on.

I would love a graphically aware repl that could render the image that is returned from the image-mandel function, kind of like a poor man's Mathematica repl :-) An alternative is to develop a web service and render the image in a browser. This is really simple with compojure and ring. The image can be viewed by executing lein ring server.

The Renderable protocol of compojure was extended to support rendering of images:

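A sketch of that extension, assuming compojure's Renderable protocol lives in compojure.response with a two-argument render method, and buffering the PNG bytes into a byte array as mentioned below, is:

(require '[compojure.response :as response])
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(javax.imageio ImageIO))

;; Sketch: encode the image as a PNG into a byte array and return a
;; ring response map whose body is an input stream over those bytes.
(extend-protocol response/Renderable
  BufferedImage
  (render [image _request]
    (let [baos (ByteArrayOutputStream.)]
      (ImageIO/write image "png" baos)
      {:status  200
       :headers {"Content-Type" "image/png"}
       :body    (ByteArrayInputStream. (.toByteArray baos))})))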

Ideally I would like to directly stream out the image rather than buffer it to a byte array, but this requires some tweaks to ring, which may make it in after the 1.0 release; see the discussion here.

Finally, the pièce de résistance is deployment of the application to CloudBees by executing lein cloudbees deploy. Browse to here (if it takes a while to load, that is because the application has hibernated and it takes some time to wake up). Developers forking the project can easily deploy to CloudBees if they have an account by tweaking the project.clj. For more detailed instructions on how to use lein with CloudBees see my previous blog entry.

4 comments:

  1. Nice intro to Clojure sequences. :)

The 2 versus 4 CPU count will likely be due to HyperThreading. You will likely have two physical cores, each capable of running two threads simultaneously (known on Intel CPUs as HyperThreading). Look for "machdep.cpu.core_count" vs "machdep.cpu.thread_count" in sysctl.

    For certain tasks running >1 thread per physical core can be beneficial. The idea is that per-thread performance will drop (as now 2 threads are sharing a core), but the aggregate system performance will go up. If you saw the performance plateau (or increase only modestly) once you passed 2 threads, then this could be the explanation.

    http://en.wikipedia.org/wiki/Hyper-threading

2. @pauldoo: Thanks! I think that explains the timing results.

    $ sysctl -a | egrep "\.core_count|\.thread_count"
    machdep.cpu.core_count: 2
    machdep.cpu.thread_count: 4

    The performance increase between 1 and 2 threads is much larger than between 2 and 4 threads.

  3. You might be interested in this other parallel version:
    simple mandelbot

4. @TCP: Thanks for the pointer! I dunno whether Graphics2D is thread-safe and whether there is some contention when setting the color and rendering the pixel (the plot function is not thread safe). That may explain why you did not get a larger speed-up; that, and perhaps the coordination overhead of pmap is high.
