I’m pleased to begin the first of a series of blog posts on parallel processing, a topic that is no longer optional in the world of machine learning and AI. In this introductory note, we’ll survey a field which, for better or for worse, is becoming more complex every day.
Today we are bumping into barriers imposed by the laws of physics that prevent CPUs from going much faster just by cranking up the clock speed. The only solution to this limitation of individual processors is to have more of them. Today, your servers probably have dozens of processor cores. Indeed, a smartphone may have ten. This advance does not come without a price. In the 1960s and 1970s programming could be wild and woolly. Only slowly did better development environments and better-understood programming practices begin to impose order and keep the more aggressive bugs at bay. Today, it is the wild west once again as developers try to come to grips with new development environments and, making matters worse, computer hardware that changes faster than software used to change.
To be sure, there are many and varied approaches to spreading a problem across multiple processors, each with its own advantages and disadvantages. In this installment, we will take an overview of types of parallel processing.
The idea of distributing work across many computers dates back almost as far as the electronic digital computer itself. There were some interesting and sometimes surprising developments, but it was not until the so-called “personal computer” became a cheap commodity item that distributed multitasking became the compelling choice for large-scale and high-performance computing. Today, all of the 500 fastest computers in the world are distributed multiprocessor systems. As an aside, every one of those 500 runs the Linux operating system, as has been the case on each TOP500 list since November 2017.
There is multithreading the illusion and multithreading the performance booster. Multithreading was introduced long before it was common for a computer to have multiple CPUs and long before multicore chips even existed. Its primary job was to keep the CPU busy while some tasks waited on slow devices, and to give users the experience of a responsive system that didn’t make them wait. Without multiple cores, however, multithreading was actually a drain on raw computing performance: internally the system was still running one set of instructions at a time, and it also had to pay the overhead of switching from thread to thread.
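To make the "keep the CPU busy" idea concrete, here is a minimal Python sketch. It assumes that `time.sleep` is an acceptable stand-in for waiting on a disk or network, and the names `fake_io_task` and `run_concurrently` are invented for this illustration. Four tasks that each wait 0.2 seconds should finish in roughly 0.2 seconds overall rather than 0.8, even on a single core, because a waiting thread leaves the processor free for the others.

```python
import threading
import time


def fake_io_task(name, delay, results):
    """Pretend to wait on a disk or network for `delay` seconds."""
    time.sleep(delay)  # the thread is idle here, so other threads can run
    results[name] = f"{name} done"


def run_concurrently(task_names, delay=0.2):
    """Start one thread per task, wait for all of them, and time the batch."""
    results = {}
    threads = [
        threading.Thread(target=fake_io_task, args=(name, delay, results))
        for name in task_names
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_concurrently(["a", "b", "c", "d"])
print(sorted(results), f"{elapsed:.2f}s")
```

Nothing here computes any faster. The gain comes entirely from overlapping the waiting, which is the "illusion" side of multithreading.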
With multiple cores, multithreading can become a performance booster. While the overhead of switching between threads has not gone away, with multiple cores multithreading can provide true parallel processing, with several tasks being executed simultaneously. One of the primary goals of machine learning (ML) and artificial intelligence (AI) platforms such as TensorFlow or PaddlePaddle is to take full advantage of the available processor cores in a system.
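The usual pattern for putting several cores to work is to split the data into chunks, hand one chunk to each worker, and combine the partial results. The sketch below shows that pattern with Python's standard `concurrent.futures` module; the function names and the sum-of-squares workload are made up for the example. One loud caveat: in the standard CPython interpreter, the global interpreter lock generally keeps pure-Python threads from running bytecode on several cores at once, so treat this as an illustration of the decomposition rather than a guaranteed speedup. Numeric libraries typically do the heavy lifting in native code, where the same pattern can run truly in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """The work done by one worker on its share of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=None):
    """Split `values` into roughly equal chunks, one per worker, and combine."""
    values = list(values)
    workers = workers or os.cpu_count() or 1
    chunk_size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
        return sum(partial_sums)


print(parallel_sum_of_squares(range(1000)))
```

The answer is the same as a plain loop would give. Only the way the work is divided changes.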
A heterogeneous platform involves at least two significantly different processing systems. The simplest is one that you have probably used yourself. High-end gaming systems have for years included graphics processing units (GPUs) that consist largely of many parallel floating-point processors. (In this case “many” can be in the thousands.) It is part of the genius of NVIDIA to have recognized early on that there would be a market for this processing muscle among those needing computing power but uninterested in the slaughter of video game demons. These GPUs perform precisely the sort of matrix operations required in wholesale quantities by ML and AI.
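Why do matrix operations suit thousands of small processors so well? A rough way to see it is to write matrix multiplication out by hand. The plain Python sketch below is not GPU code, and `matmul` is just an illustrative helper, but it shows that every entry of the result depends only on one row of the first matrix and one column of the second. None of the entries needs any other entry, so in principle they can all be computed at the same time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    # Entry (i, j) uses only row i of `a` and column j of `b`, so all
    # rows * cols entries are independent of one another.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


product = matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
```

Here the entries are computed one after another. A GPU would instead assign them to many floating-point units at once.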
It wouldn’t be fair to call field-programmable gate arrays, or FPGAs, the “new kid on the block” since they have been commercially available since 1984. Only recently, however, have FPGAs become a critical component of modern high-performance computer systems. FPGAs consist of arrays of components that can be organized (and reorganized) into working digital circuits by special software code. You write code, but the output is a hardware configuration, not a program. In the data center, they make “reconfigurable computing” possible: complex (and expensive!) hardware can be reorganized on the fly to best suit the task at hand. These devices are also making their presence felt in what AI folks refer to as “edge” computing, where hardware can be constructed specifically for image recognition, video analysis, and speech understanding and synthesis.
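To give a feel for what "organized (and reorganized)" means, here is a toy Python model of a single configurable logic cell. It is only an analogy: real FPGAs are normally described in a hardware description language such as Verilog or VHDL and configured with vendor tools, and the `LookupTable` class below is invented for this post. The idea it borrows is that a cell can be a small lookup table, so loading different configuration bits turns the same cell into a different logic gate.

```python
class LookupTable:
    """A 2-input logic cell whose behavior is set by 4 configuration bits."""

    def __init__(self, config_bits):
        self.configure(config_bits)

    def configure(self, config_bits):
        """Load a new truth table: outputs for inputs 00, 01, 10, 11."""
        if len(config_bits) != 4 or any(bit not in (0, 1) for bit in config_bits):
            raise ValueError("expected four bits, each 0 or 1")
        self.config_bits = tuple(config_bits)

    def output(self, a, b):
        """Look up the output for input bits a and b."""
        return self.config_bits[(a << 1) | b]


AND_BITS = (0, 0, 0, 1)
XOR_BITS = (0, 1, 1, 0)

cell = LookupTable(AND_BITS)  # the cell starts life as an AND gate
and_outputs = [cell.output(a, b) for a in (0, 1) for b in (0, 1)]

cell.configure(XOR_BITS)      # same cell, now reconfigured as an XOR gate
xor_outputs = [cell.output(a, b) for a in (0, 1) for b in (0, 1)]

print(and_outputs, xor_outputs)
```

A real device wires a very large number of such cells together, and the configuration decides both what each cell does and how the cells connect.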
Personally, I think perhaps the greatest contribution of FPGAs will ultimately come from the power they provide to clever and inventive individuals. In the past, you could only design and create your own chip if you had access to a multibillion-dollar silicon fabrication plant, and a budget to match. Now, engineers and developers can get started designing their own digital circuitry for the cost of a couple of trips to Starbucks. It will be interesting to see what they create.
In the next installment, we will take a closer look at distributed multiprocessing: its history and development, and what it has to offer today.