Single-threaded vs Multi-threading vs Multi-processing in Python

We will run a few simulated workloads to understand the performance differences between single-threaded, multi-threaded and multi-process execution in Python.

We simulate the workload with two kinds of task:

  1. a CPU-bound task (count down to zero from a given number)
  2. an IO-bound task (perform an HTTP GET for a given URL)
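The simulation lives in `local_nb_utils`, which is not shown here. A minimal sketch of what the two tasks could look like (the function names are illustrative, not the module's actual API):

```python
import urllib.request

def cpu_bound_task(n):
    # CPU-bound: count down to zero from a given number.
    while n > 0:
        n -= 1
    return n

def io_bound_task(url, timeout=10):
    # IO-bound: perform an HTTP GET; the caller mostly waits on the network.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```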

Along the way we will learn about the GIL (Global Interpreter Lock) and alternative Python interpreters, by counting to 255 million and downloading a few webpages.

In [1]:
from IPython.display import HTML

HTML('''
<script src='//code.jquery.com/jquery-3.3.1.min.js'></script>
<script>
code_show=false; 
function code_toggle() {
 if (code_show){
 $('div.input').show();
 $('div .jp-CodeCell .jp-Cell-inputWrapper').show();
 } else {
 $('div.input').hide();
 $('div .jp-CodeCell .jp-Cell-inputWrapper').hide();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Code on/off"></form>''')
Out[1]:
In [2]:
import datetime
import pandas as pd
import plotly.express as px
In [3]:
import local_nb_utils
In [4]:
# reloader
from importlib import reload
reload(local_nb_utils)
Out[4]:
<module 'local_nb_utils' from '/home/codein/src/git/gitlab_poc/jupyter/temp/local_nb_utils.py'>

Workload

We have three types of workload:

1. a purely IO-bound workload.
2. a purely CPU-bound workload.
3. a combination of IO and CPU work randomly shuffled together.

    io_work_load = get_io_work_load(load_size=load_size)
    cpu_work_load = get_cpu_work_load(load_size=load_size)
    io_and_cpu_work_load = io_work_load + cpu_work_load
    random.shuffle(io_and_cpu_work_load)
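The `get_*_work_load` helpers are defined in `local_nb_utils` and not shown; a plausible sketch, assuming each work item is a (work type, argument) pair and an illustrative calibration value:

```python
import random

def get_cpu_work_load(load_size, count_from=850_000):
    # count_from is an assumed calibration value, not the notebook's actual one.
    return [('cpu', count_from) for _ in range(load_size)]

def get_io_work_load(load_size, url='https://example.com'):
    # url is a placeholder; the notebook's real target URLs are not shown.
    return [('io', url) for _ in range(load_size)]

io_work_load = get_io_work_load(load_size=300)
cpu_work_load = get_cpu_work_load(load_size=300)
io_and_cpu_work_load = io_work_load + cpu_work_load
random.shuffle(io_and_cpu_work_load)
```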

Run Simulated work load

In [5]:
runtimes = local_nb_utils.run_simulated_work_load()
Load Size:300
io_work_load single thread
io_work_load 5 threads
io_work_load 5 process
cpu_work_load single thread
cpu_work_load 5 threads
cpu_work_load 5 process
io_and_cpu_work_load single thread
io_and_cpu_work_load 5 threads
io_and_cpu_work_load 5 process
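`run_simulated_work_load` dispatches each workload three ways: on a single thread, on 5 threads, and on 5 processes. The three strategies can be sketched with `concurrent.futures` (a sketch, not the module's actual code):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def run_single_thread(work_load):
    # Run every task back to back on the main thread and time the whole run.
    start = time.perf_counter()
    results = [task() for task in work_load]
    return results, time.perf_counter() - start

def run_with_pool(work_load, pool_cls, workers=5):
    # pool_cls is ThreadPoolExecutor or ProcessPoolExecutor; the pool keeps
    # up to `workers` tasks in flight at once.
    start = time.perf_counter()
    with pool_cls(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in work_load]
        results = [f.result() for f in futures]
    return results, time.perf_counter() - start
```

For an IO-bound workload (tasks that mostly wait on the network), a 5-worker thread pool finishes in roughly a fifth of the single-threaded time.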
In [6]:
process = []
tasks = []
for runtime in runtimes:
    _process = {}
    for key,value in runtime.items():
        if key != 'results':
            _process[key] = value
    process.append(_process) 
    for result in runtime['results']:
        result['work_load_label'] = runtime['work_load_label']
        result['work_load_size'] = runtime['work_load_size']
        result['process_duration'] = runtime['process_duration']
        result['process_start_time'] = runtime['process_start_time']
        tasks.append(result)
        
tasks_df = pd.DataFrame(tasks)
process_df = pd.DataFrame(process)
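pandas can do the task-flattening loop above in one call with `json_normalize` (field names follow the cells above; the `runtimes` value here is a synthetic stand-in):

```python
import pandas as pd

# Synthetic stand-in for the `runtimes` structure produced by the run.
runtimes = [
    {
        'work_load_label': 'io_work_load single thread',
        'work_load_size': 2,
        'process_duration': 1.5,
        'process_start_time': 0.0,
        'results': [
            {'work_type': 'io', 'work_duration': 0.7, 'work_start_time': 0.0},
            {'work_type': 'io', 'work_duration': 0.8, 'work_start_time': 0.7},
        ],
    },
]

# record_path expands the nested per-task rows; meta copies the process-level
# context onto each task row (the same effect as the loop above).
tasks_df = pd.json_normalize(
    runtimes,
    record_path='results',
    meta=['work_load_label', 'work_load_size',
          'process_duration', 'process_start_time'],
)
```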
In [7]:
work_load_label_order = [
    'io_work_load single thread',
    'io_work_load 5 threads',
    'io_work_load 5 process',
    'cpu_work_load single thread',
    'cpu_work_load 5 threads',
    'cpu_work_load 5 process',
    'io_and_cpu_work_load single thread',
    'io_and_cpu_work_load 5 threads',
    'io_and_cpu_work_load 5 process',
]

Summary of all processes

  • We calibrate io_work_load single thread and cpu_work_load single thread to take roughly the same time, so the combined run simulates a 50:50 CPU and IO workload.
In [8]:
fig = px.bar(
    process_df,
    y='work_load_label',
    x='process_duration',
    text='process_duration',
    category_orders={"work_load_label": work_load_label_order},
    facet_col_wrap=3,
    orientation='h',
)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.show()
In [9]:
def fromtimestamp(timestamp):
    return datetime.datetime.fromtimestamp(timestamp)

def set_microsecond_zero(dt):
    return dt.replace(microsecond=0)
In [10]:
tasks_df['work_start_time_sec'] = tasks_df['work_start_time'].apply(fromtimestamp).apply(set_microsecond_zero)
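The two-step `apply` above can also be done vectorized, assuming `work_start_time` holds Unix timestamps in seconds (as the use of `fromtimestamp` implies):

```python
import pandas as pd

# Synthetic stand-in for tasks_df['work_start_time'] (Unix epoch seconds).
work_start_time = pd.Series([1_600_000_000.25, 1_600_000_001.75])

# Convert epoch seconds to datetimes, then truncate to whole seconds.
# Note: unlike datetime.fromtimestamp, this yields timezone-naive UTC values.
work_start_time_sec = pd.to_datetime(work_start_time, unit='s').dt.floor('s')
```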
In [11]:
def print_work_load_summary(work_load_label):
    fig = px.bar(
        process_df.query('work_load_label == @work_load_label'),
        y='work_load_label',
        x='process_duration',
        text='process_duration',
        category_orders={"work_load_label": work_load_label_order},
        facet_col_wrap=3,
        orientation='h',
        height=175
    )
    fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    fig.show()    

def print_work_load_details(work_load_label):    
    _df = tasks_df.query('work_load_label == @work_load_label')
    fig = px.bar(
        _df,
        x='work_start_time_sec',
        y='work_duration',
        color='work_type',
        facet_col="work_load_label",
    )
    fig.show()
    

Process Details

We have captured the duration of each unit of work. We can use this to drill down into how each process performs the combined IO and CPU workload.

Single Threaded

  • The single-threaded process needed 24 seconds to complete the workload.
In [12]:
print_work_load_summary('io_and_cpu_work_load single thread')
  • Each bar below represents 1 second and all the work processed in that second. Since the entire run took 24 seconds there are 24 bars.
  • Note that all CPU-bound work items are the same size, i.e. each takes the same duration to complete.
In [13]:
print_work_load_details('io_and_cpu_work_load single thread')

Threads vs Process for IO and CPU work load

Threads

  • 5 threads needed 16 seconds to process the workload.
In [14]:
print_work_load_summary('io_and_cpu_work_load 5 threads')
  • Each bar below represents 1 second and all the work processed in that second, one bar for each second of the run.
In [15]:
print_work_load_details('io_and_cpu_work_load 5 threads')

Process

  • 5 processes needed only 5.7 seconds to process the workload.
In [16]:
print_work_load_summary('io_and_cpu_work_load 5 process')
  • Since the 5 processes needed only 5.7 seconds, there are only 5 full bars. Multiprocessing squeezed more work into every second than the multi-threaded run.
In [17]:
print_work_load_details('io_and_cpu_work_load 5 process')

Key Takeaways

  1. The multi-processing approach is faster because each Python process gets its own interpreter and memory space, so the GIL is not a bottleneck.
  2. Threads can be blocked and starved of CPU time by long-running CPU-bound operations, since only one thread can hold the GIL at a time.
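Takeaway 1 can be seen directly with the countdown task: on CPython, splitting a CPU-bound count across two threads gives no speedup, because only one thread can hold the GIL at a time (a small demo; the count of 5,000,000 is arbitrary):

```python
import threading
import time

def countdown(n):
    # Pure CPU work: count down to zero.
    while n > 0:
        n -= 1

N = 5_000_000

start = time.perf_counter()
countdown(N)
single = time.perf_counter() - start

# Split the same count across two threads.
start = time.perf_counter()
threads = [threading.Thread(target=countdown, args=(N // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f'single thread: {single:.2f}s, two threads: {threaded:.2f}s')
```

On CPython the two-thread run typically takes as long as (or longer than) the single-threaded run, which is exactly why the multi-process runs above win for CPU-bound work.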