Tuesday, September 03, 2013

Multithreading and Multiprocessing in Python

Recently someone asked me about multithreading in Python. The last time I wrote any multithreaded code in Python was at least five years ago, and even then I only used the thread module for some very basic tasks. While brushing up on the topic, I came across some very useful articles. I am using this blog post to link to them for my future reference.

Multithreading in CPython is quite limited because of the Global Interpreter Lock (GIL), which lets only one thread execute Python bytecode at a time. Multiprocessing seems more promising.

Because of the GIL, multithreading in Python gives no speedup for tasks that are CPU bound, such as mathematical computations. Multithreading is best used in programs that are IO intensive, because a thread that is blocked waiting on IO releases the GIL and lets other threads run. The IO operation could be on disk (listing files and directories, reading or writing files, etc.) or over the network (communicating with websites). The important modules for this are thread (renamed _thread in Python 3) and threading. thread is the low-level API with limited capabilities, while threading is the high-level API and provides more functionality. This tutorial provides a quick overview of both modules via simple examples.
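To make the IO-bound case concrete, here is a minimal sketch of my own (it is not taken from the linked tutorial). It assumes Python 3 and uses time.sleep as a stand-in for a blocking disk or network call; the names fake_download and run_threads are made up for illustration.

```python
import threading
import time


def fake_download(name, delay, results):
    """Hypothetical stand-in for an IO-bound call.

    time.sleep blocks without holding the GIL, much like a real
    socket or file read would.
    """
    time.sleep(delay)
    results[name] = "done"


def run_threads(names, delay=0.3):
    """Start one thread per name and wait for all of them to finish."""
    results = {}
    threads = [
        threading.Thread(target=fake_download, args=(name, delay, results))
        for name in names
    ]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_threads(["a", "b", "c", "d"])
    # The waits overlap, so the total should be close to one delay,
    # not four of them.
    print(results, "in %.2f seconds" % elapsed)
```

Because the four simulated waits overlap, the elapsed time should come out close to a single delay rather than the sum of all four.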

An important decision that you have to make when writing multithreaded programs is whether the parent (main) thread should wait for its child threads to finish. This is required if you ever have to reconcile the output from all child threads to determine the next course of action for the parent thread. You make the parent wait by calling the join() method, available on every thread object, which blocks the caller until that thread terminates. I found the textual representation in this stack exchange answer to be very intuitive and useful for understanding join() for different kinds of threads.
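A small sketch of that pattern, assuming Python 3 (the helper names square and sum_of_squares are mine, chosen only for illustration): each child thread writes into its own slot of a shared list, and the parent calls join() on every child before it combines the results.

```python
import threading


def square(n, out, index):
    """Child thread work: store n squared in this thread's own slot."""
    out[index] = n * n


def sum_of_squares(numbers):
    """Fan the work out to child threads, then reconcile in the parent."""
    out = [None] * len(numbers)
    threads = [
        threading.Thread(target=square, args=(n, out, i))
        for i, n in enumerate(numbers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        # Block until this child has terminated. Without the join,
        # the parent could read `out` while some slots are still None.
        t.join()
    return sum(out)


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3]))  # 1 + 4 + 9 = 14
```

If the parent does not need the children's output, it can skip join() altogether. Marking a thread as a daemon (thread.daemon = True before start()) additionally lets the program exit without waiting for it.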

To overcome the limitations of multithreading in Python, the multiprocessing module side-steps the GIL by using subprocesses instead of threads. Each process runs its own interpreter with its own GIL, which allows programmers to fully leverage multiple cores or processors on a machine. This article does a good job of explaining OS forking and the multiprocessing module in Python.
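Here is a minimal sketch of what that looks like, assuming Python 3.3 or later (where Pool can be used as a context manager). The function count_primes is a hypothetical CPU-bound task I made up for illustration; it is not from the linked article.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def parallel_count(limits, workers=2):
    """Spread the limits over a pool of worker processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        # Each call to count_primes runs in a separate process,
        # so the calls do not compete for a single GIL.
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # The guard matters: on platforms that spawn rather than fork,
    # child processes re-import this module.
    print(parallel_count([10, 100, 1000]))  # [4, 25, 168]
```

The worker function has to live at module level so it can be pickled and sent to the child processes. Arguments and return values are also pickled, so this approach pays off mainly when the computation outweighs the cost of moving data between processes.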

Finally, most programmers are interested in multithreading and multiprocessing mainly to speed up the execution of their programs. In Python, it is important to understand the difference between these two concepts, or you could end up slowing the program down instead of speeding it up. I found this article to be very helpful in showing the difference between the two. It generates benchmarks using simple examples. Additionally, the following blog post explains how Python sees and uses cores on a machine.
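If you want to try a comparison like this yourself, a rough sketch could look like the following. This is my own toy benchmark, not the one from the linked article, and it assumes Python 3.3 or later. It relies on multiprocessing.pool.ThreadPool, a thread-backed pool with the same interface as multiprocessing.Pool, so the same timing helper works for both. The names burn and timed are hypothetical.

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def burn(n):
    """CPU-bound busy loop: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(pool_cls, jobs, workers=4):
    """Run burn over jobs with the given pool class.

    Returns (results, elapsed_seconds).
    """
    start = time.time()
    with pool_cls(workers) as pool:
        results = pool.map(burn, jobs)
    return results, time.time() - start


if __name__ == "__main__":
    jobs = [2000000] * 4
    for pool_cls in (ThreadPool, Pool):
        results, seconds = timed(pool_cls, jobs)
        print("%s: %.2f seconds" % (pool_cls.__name__, seconds))
```

On a multi-core machine I would expect the process pool to finish this CPU-bound workload faster than the thread pool, but the actual numbers depend on the hardware, the Python version, and the size of the jobs, so treat any single run as indicative only.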

In summary, if your Python program does a lot of IO, use multithreading, but if you are looking to speed up the CPU-bound work of your program, use multiprocessing.