Deep Context through Parallel Processing
Released 5 February 2014 by Existor Ltd
Recently our users have noticed that Cleverbot is getting smarter. It is always gaining data, but in January 2014 it also made a significant computational leap. Over the past year we have been adapting the Cleverbot algorithm to run "in parallel" on graphics cards (see image). We went live with our first server in autumn 2013, and as of last week graphics cards are handling nearly all Cleverbot requests.
We've had some great user feedback, so we've decided to share some of our techniques and discoveries. This article describes what parallel processing is, and how we use graphics cards to implement it for Cleverbot.
Using The GPU
At its core, the Cleverbot algorithm fuzzily compares strings of text against its massive database of over 170 million lines of conversation. 170 million is Big Data. Searching through this many rows of text using normal database techniques takes too much time and memory. Over the years, we have created several unique, custom-designed optimisations to make it work. Even then, we need the latest and fastest SSD drives, and as much RAM as our servers can take.
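To give a flavour of what "fuzzily compares" means - this is an illustrative sketch, not Cleverbot's actual scoring function - here is a toy scorer that rates two lines by the fraction of words they share:

    #include <algorithm>
    #include <set>
    #include <sstream>
    #include <string>

    // Toy fuzzy matcher: score two lines of text by the fraction of
    // words they have in common. 1.0 means every word matched.
    float fuzzyScore(const std::string &a, const std::string &b) {
        std::set<std::string> wordsA, wordsB;
        std::istringstream sa(a), sb(b);
        std::string w;
        while (sa >> w) wordsA.insert(w);
        while (sb >> w) wordsB.insert(w);
        int shared = 0;
        for (const auto &word : wordsA)
            if (wordsB.count(word)) ++shared;
        // Normalise by the longer line so length mismatches lower the score.
        size_t longer = std::max(wordsA.size(), wordsB.size());
        return longer > 0 ? (float)shared / longer : 0.0f;
    }

Doing anything like this naively, 170 million times per request, is exactly the kind of work that motivated the optimisations above.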
A top-of-the-range computer has a CPU (Central Processing Unit) with several cores. An Intel quad-core has four, but each core can run multiple threads, so it appears to have 8 or 12 of them. A graphics card, on the other hand, features a GPU (Graphics Processing Unit) with hundreds or thousands of less powerful processing units. A year ago we decided to investigate whether we could make use of graphics cards for Cleverbot.
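The CUDA runtime will report these figures for any Nvidia card; a minimal query (assuming device 0) looks like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // ask about device 0
        printf("%s\n", prop.name);
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  VRAM:                  %zu MB\n", prop.totalGlobalMem >> 20);
        return 0;
    }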
Programming In Parallel
For any project, the first step in using the GPU is seeing whether your particular task can be divided into sub-tasks which can run in parallel. For example, imagine a hostel with 20 people and 1 really powerful shower. If each shower takes 2 minutes, it will take 40 minutes for everyone to have a shower, because they have to shower in sequence. Compare that to a hostel with 10 weak showers. Now 10 people can shower at the same time, in parallel. Even if each shower takes 10 minutes, the total time taken will be only 20 minutes.
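The arithmetic behind the analogy is just ceiling division - the number of "rounds" times the time per round. As a throwaway helper:

    // Total time for `people` to shower using `showers` parallel showers,
    // each shower taking `minutes` minutes.
    int totalMinutes(int people, int showers, int minutes) {
        int rounds = (people + showers - 1) / showers;  // ceiling division
        return rounds * minutes;
    }
    // totalMinutes(20, 1, 2)   == 40   (one powerful shower, in sequence)
    // totalMinutes(20, 10, 10) == 20   (ten weak showers, in parallel)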
We realised that our task could be quite nicely divided into parallel sub-tasks. The first step in Cleverbot is to find a couple of million loosely matching rows out of those 170 million. We usually do this with database indices, caches and all sorts of other tricks. For us the GPU is like a hostel with 170 million guests and 1024 mediocre showers.
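On the GPU this kind of search maps naturally onto one thread per row. The sketch below is ours for illustration - the flattened row layout, kernel name and toy character-level score stand in for Cleverbot's real, far more involved, data structures and matching:

    // Each GPU thread scores one database row against the user's input.
    // Rows are flattened into one char array, with offsets[i] marking
    // where row i starts (offsets has numRows + 1 entries).
    __global__ void scoreRows(const char *rows, const int *offsets,
                              int numRows, const char *input, int inputLen,
                              float *scores) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= numRows) return;
        const char *line = rows + offsets[row];
        int lineLen = offsets[row + 1] - offsets[row];
        // Toy score: fraction of positions where the characters agree.
        int len = min(lineLen, inputLen);
        int matches = 0;
        for (int i = 0; i < len; ++i)
            if (line[i] == input[i]) ++matches;
        scores[row] = len > 0 ? (float)matches / len : 0.0f;
    }

Launched with 1024 threads per block, (numRows + 1023) / 1024 blocks cover every row.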
Working out how to divide our algorithm into parallel chunks took several months. At several points in the algorithm, data must be put into the GPU's memory (VRAM, video RAM) or extracted out of it, because some parts of the algorithm have to be done in sequence and/or are much faster on the CPU. At these and other times the 1024 tasks have to be synchronised. To revisit the analogy: every hour, all 1024 showers have to finish at the exact same instant so that the boiler can be checked, but between boiler checks they can get out of sync - it doesn't matter if one person takes 11 minutes and another takes only 9.
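Continuing the scoreRows sketch above with illustrative names (the host and device buffers hInput, dRows, dOffsets, dInput, dScores and hScores are assumed to have been allocated and, where needed, filled earlier), one of those round trips looks roughly like this:

    // Copy fresh data into VRAM, launch the parallel work, then wait for
    // every thread to finish before reading the results back out.
    cudaMemcpy(dInput, hInput, inputLen, cudaMemcpyHostToDevice);  // into VRAM
    scoreRows<<<(numRows + 1023) / 1024, 1024>>>(dRows, dOffsets, numRows,
                                                 dInput, inputLen, dScores);
    cudaDeviceSynchronize();  // the "boiler check": all tasks finish here
    cudaMemcpy(hScores, dScores, numRows * sizeof(float),
               cudaMemcpyDeviceToHost);                        // out of VRAM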
Determining the synchronisation points, and then limiting them to as few as possible, was a real challenge. Limiting and controlling the amount of data transfer (which is slow on the GPU) was another. And in the first month, just understanding what made it crash was a challenge in itself, with lots of screen freezes and reboots.
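For anyone attempting the same, wrapping every CUDA call in an error check - the macro below is a standard pattern, not anything Cleverbot-specific - turns many of those mystery freezes into readable messages:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess) {                                  \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                        __FILE__, __LINE__, cudaGetErrorString(err));  \
                exit(1);                                               \
            }                                                          \
        } while (0)

    // Kernels don't return errors directly, so check the launch and then
    // synchronise to surface errors from inside the kernel:
    // CUDA_CHECK(cudaGetLastError());
    // CUDA_CHECK(cudaDeviceSynchronize());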
Manufacturers And Languages
Once we had it working in theory, we started to investigate hardware. There are two main manufacturers of graphics cards: AMD and Nvidia. And there are two languages for programming graphics cards: OpenCL and CUDA. OpenCL is an open standard, maintained by the same group as OpenGL, and works on both types of cards. CUDA is proprietary to Nvidia. We started our development using OpenCL, because we happened to have a computer available with a supported AMD graphics card.
For serving, we eventually moved to CUDA on Nvidia cards. For our purposes, we realised that memory was the limiting factor: to put 170 million lines of text into VRAM we needed as much of it as possible. The choice was between the Nvidia GeForce GTX Titan and the AMD Radeon HD 7990. Both have 6GB of VRAM and can run up to 1024 tasks in parallel. We opted for Titans because we could fit 3 of them in one computer (see image above), and we couldn't with the Radeons.
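Free VRAM is easy to check at runtime, which matters when you are trying to squeeze a whole database into it:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Ask the driver how much VRAM is free right now on this device.
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);
        printf("VRAM: %zu MB free of %zu MB total\n",
               freeBytes >> 20, totalBytes >> 20);
        return 0;
    }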
Developing in OpenCL on Nvidia seemed to work well, but we found we could only access 4GB of the 6GB of memory. The only way to access all 6GB was to run two separate OpenCL applications, each using 3GB. This turns out to be a limitation of Nvidia's OpenCL implementation. We translated our code into CUDA and could then access all 6GB.
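To illustrate the difference, a single allocation past that 4GB ceiling is unremarkable in CUDA (assuming a 64-bit build and a card with enough free memory):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Try to allocate 5 GiB in one block - beyond the 4GB ceiling we
        // hit under Nvidia's OpenCL implementation.
        size_t fiveGiB = 5ULL << 30;
        char *dBuffer = nullptr;
        cudaError_t err = cudaMalloc(&dBuffer, fiveGiB);
        if (err == cudaSuccess) {
            printf("allocated %zu MB in one block\n", fiveGiB >> 20);
            cudaFree(dBuffer);
        } else {
            fprintf(stderr, "allocation failed: %s\n", cudaGetErrorString(err));
        }
        return 0;
    }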
Deep Context
What has parallel processing meant for us? When compared side by side, our new serving is up to 25 times more efficient than our old serving. Previously we had to take lots of shortcuts: when servers were busy, we wouldn't use the whole 170 million rows, but only a small fraction of them.
Now we can serve every request from all 170 million rows, and we can do deeper data analysis. Context is key for Cleverbot: we don't just look at the last thing you said, but at much of the conversation history. With parallel processing we can do deep context matching.
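As a minimal sketch of the idea - the decay function and names are ours for illustration, not the real formula - a context-aware match might weight each line of history by how recent it is:

    #include <math.h>

    // lineScores[i] is the match score for the i-th most recent line of
    // conversation history (i = 0 is the latest exchange). Recent lines
    // count for more; the weight halves with each step back in time.
    float contextScore(const float *lineScores, int historyLen) {
        float total = 0.0f, weightSum = 0.0f;
        for (int i = 0; i < historyLen; ++i) {
            float weight = exp2f(-(float)i);  // 1, 0.5, 0.25, ...
            total += weight * lineScores[i];
            weightSum += weight;
        }
        return weightSum > 0.0f ? total / weightSum : 0.0f;
    }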
We also need fewer servers, which has reduced costs. Except for electricity - that's gone up!
Next Steps
With our GPU serving in place, we are now looking at introducing new algorithms into Cleverbot's core processing, such as more semantic analysis and deep learning. We also plan to make use of our Cleverscript software to give Cleverbot more of a memory. All of this is possible through our new style of serving.