That link you (David) posted does seem to work. It looks like they have quite a small recurrent neural net going, but it seems like fun. I'll certainly revisit it when I get round to implementing a recurrent net myself.
For the next few weeks I plan to keep playing with the existing network for pattern recognition. I think there's quite a lot of room for improvement in the way I obtain my training data sets; there may well be much better ways of doing things. I've hardly played with learning rates at all, just sticking with the bog-standard default (0.01). I also suspect there are much better network architectures (hopefully achievable with just the existing layer types).
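Just to illustrate the learning-rate point, here's a toy sketch of what a sweep looks like: a made-up quadratic objective with plain gradient descent, nothing to do with my actual code or network.

```cpp
#include <cstdio>

// Toy stand-in for a loss surface: f(w) = (w - 3)^2, gradient 2(w - 3).
double grad(double w) { return 2.0 * (w - 3.0); }

int main() {
    // Sweep a few step sizes around the 0.01 default I've been using.
    const double rates[] = {0.001, 0.01, 0.1, 1.5};
    for (double lr : rates) {
        double w = 0.0;                  // same starting point each run
        for (int step = 0; step < 100; ++step)
            w -= lr * grad(w);           // plain gradient-descent update
        std::printf("lr = %-5g -> w = %g after 100 steps (target 3)\n",
                    lr, w);
    }
    return 0;
}
```

Too small a rate and it barely moves; too large and it diverges outright. I'd expect the real network to behave similarly, just far less cleanly.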
So for the moment my focus is on making these things perform really well (as opposed to fast).
Just in reference to your paper link: although I can see how much fun it would be to write one's own GPU kernels, I'd probably go down NVIDIA's cuDNN route. From personal experience there's quite an art to writing really efficient GPU code, and I'd be pretty tempted to leverage someone else's, especially if it's basically doing the same calculations (which it probably would be).
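To give a feel for what the cuDNN route looks like, here's a minimal sketch (untested on my part, and just a sigmoid activation forward pass rather than anything from our nets); the point is that the actual GPU implementation is entirely NVIDIA's problem:

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // One tensor descriptor shared by input and output: 1 image, 1 channel, 4x4.
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 1, 4, 4);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_SIGMOID,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    float host[16] = {0};                      // dummy input data
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(host));
    cudaMalloc(&d_out, sizeof(host));
    cudaMemcpy(d_in, host, sizeof(host), cudaMemcpyHostToDevice);

    // The whole "kernel": cuDNN picks the GPU implementation for us.
    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_in,
                           &beta, desc, d_out);

    cudaFree(d_in);
    cudaFree(d_out);
    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    return 0;
}
```

(I've ignored all the returned status codes for brevity; real code would check every cudnnStatus_t and cudaError_t.)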
I'm out of the UK for a couple of weeks (and without internet), so I won't be doing anything on this until I return.
Anyway, those are my thoughts.
Kind regards,
Julian.