Concurrent Computations on Multicore Processors
Threads & Subprocesses

Christian Zielinski
18th June 2015, PyCon SG 2015
Introduction
Who am I?
Currently a Ph.D. student at Nanyang Technological University
Working in computational physics, i.e. simulation of elementary particles
with Monte Carlo methods (“lattice quantum chromodynamics”)
Heavy user of Python, numpy and scipy for data analysis and
evaluation of numerical algorithms
First experiments with Python around ten years ago
Slides available on:
www.github.com/czielinski/pyconsg2015
Why concurrency?
Python is awesome, but slow (if you don't use C/C++ extensions)
Single CPU core performance improves only slowly over time
But modern processors have an increasing number of cores
Blocking operations like I/O cause idle time
Better to do computations in the meantime
Concurrency helps to mitigate these problems
Potentially significant speed-ups
Concurrency in Python
How to make use of concurrency in Python?
For stock Python, typically use either:
Threads via the threading module
Subprocesses via the multiprocessing module
Both provide convenient high-level interfaces
threading and multiprocessing have similar APIs
There are great third-party modules to go beyond a single CPU:
mpi4py, pycuda, pyopencl (but not today's topic)
Global Interpreter Lock (GIL)
The Global Interpreter Lock (GIL) prevents more than one thread
from executing Python bytecode at the same time
The Global Interpreter Lock:
makes the CPython interpreter explicitly thread-safe
simplifies CPython’s code base
prevents true parallel threading
is released during I/O
The GIL is implementation dependent
CPython and PyPy have a GIL
Jython and IronPython have no GIL
Read more on https://wiki.python.org/moin/GlobalInterpreterLock
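As a small illustration (not from the original slides) of the GIL being released during blocking calls: four sleeping threads finish in about one second of wall time rather than four.

import threading
import time

def io_task():
    time.sleep(1)  # blocking call; the GIL is released while waiting

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Elapsed time is roughly 1 s, not 4 s, since the sleeps overlap
print("Elapsed: {:.1f} s".format(time.time() - start))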
Threads
Threads are accessible via the threading module
Can be spawned quickly
Share memory
CPython's threads are real POSIX/Windows threads
Threads can be useful for I/O-heavy problems
Do useful work during blocking operations
Usually not suited for CPU-bound problems
The GIL prevents true parallel computations
(with a few exceptions, e.g. some numpy functions)
Threads introduce additional overhead
Use a Lock to define critical sections,
i.e. code that requires mutual exclusion of access
Simple threading example
import threading as th

def worker(lock, num):
    cubed = num ** 3
    # A worker should actually avoid I/O ...
    with lock:
        print("{} cubed is {}".format(num, cubed))

lock = th.Lock()
my_threads = []
for i in range(4):
    t = th.Thread(target=worker, args=(lock, i))
    my_threads.append(t)
    t.start()
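Note that this example only starts the workers and never waits for them; calling t.join() on each thread in my_threads would block until all of them have finished.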
Subprocesses (I)
Subprocesses are accessible via the multiprocessing module
Each subprocess is a self-sufficient interpreter instance
No memory sharing by default
Spawning takes a long time and incurs significant overhead
Every subprocess has its own GIL, allowing for true parallelism
Subprocesses (II)
Subprocesses are suited for CPU-bound problems
Can do computations truly in parallel to utilize several cores
Usually not useful to spawn more than one subprocess per physical core
Not suited for I/O-bound problems
Cannot, for example, read files from an HDD any faster
To keep cores busy, data needs to be fed in quickly enough
Potential memory bandwidth bottleneck for a large number of cores
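To follow the one-subprocess-per-physical-core guideline, the available core count can be queried at runtime; a minimal sketch (note that cpu_count() typically reports logical cores, which may exceed the number of physical ones):

import multiprocessing as mp

if __name__ == '__main__':
    # Logical core count; may exceed physical cores with hyper-threading
    num_workers = mp.cpu_count()
    pool = mp.Pool(processes=num_workers)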
Simple multiprocessing example
import multiprocessing as mp

def worker(lock, num):
    cubed = num ** 3
    # A worker should actually avoid I/O ...
    with lock:
        print("{} cubed is {}".format(num, cubed))

lock = mp.Lock()
my_procs = []
for i in range(4):
    p = mp.Process(target=worker, args=(lock, i))
    my_procs.append(p)
    p.start()
Scaling and Amdahl’s law
Using N cores does not automatically reduce the runtime by a factor of N
Some code sections are inherently serial, like I/O
Some algorithms are difficult to parallelize, e.g. recursive functions
One usually parallelizes only the performance-critical parts of the code
Amdahl’s law [Amdahl ’67]
Parallelizing a program with serial runtime $T_1$ using $N$ processes and a
serial code fraction $f \in [0, 1]$ (by runtime) results in a runtime of:

$$ T_N = T_1 \cdot \left( f + \frac{1 - f}{N} \right) $$
So for sufficiently large N, runtime is dominated by serial code sections!
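The resulting speedup $T_1 / T_N$ is easy to evaluate in code; a small sketch (the function name is mine):

def amdahl_speedup(f, n):
    # Speedup T_1 / T_N for serial fraction f and n processes
    return 1.0 / (f + (1.0 - f) / n)

print(amdahl_speedup(0.05, 100))    # ~16.8 with 100 cores
print(amdahl_speedup(0.05, 10**6))  # ~20.0: the limit for f = 5% is 1/f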
Figure: Amdahl's law. Speedup factor versus number of processes for serial fractions f = 10%, 5% and 1%, compared to perfect scaling.
Some general remarks
Split up larger problems into smaller subproblems of appropriate size
(see the Pool.map sketch below)
If too small, performance is degraded due to overhead
If too large, load balancing becomes problematic
Arguing about the correctness of a parallel program is hard
The order of computations is not fixed
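In the multiprocessing API introduced below, one knob for this trade-off is the chunksize parameter of Pool.map, which sets how many items a worker receives at a time; a minimal sketch:

from multiprocessing import Pool

def cube(num):
    return num ** 3

if __name__ == '__main__':
    p = Pool()
    # Large chunks reduce scheduling overhead; small chunks improve
    # load balancing when items take varying amounts of time
    res = p.map(cube, range(100000), chunksize=1000)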
API of the multiprocessing module
Worker pools
The simplest parallelization strategy is to use a worker pool
Accessible via multiprocessing.Pool
Important methods:
map (map_async) for (asynchronous) parallel map
apply_async for asynchronous function evaluation
Limitations:
Can only deal with picklable objects
Cannot use lambda expressions, nested functions or class methods
Can use global functions, global classes and partial functions via
functools.partial (see the sketch below)
Pool will not work in interactive sessions
(the __main__ module has to be importable by the children)
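For instance, a two-argument global function can be reduced to a picklable one-argument function for Pool.map via functools.partial; a minimal sketch (the helper power is mine):

from functools import partial
from multiprocessing import Pool

def power(base, exponent):
    return base ** exponent

if __name__ == '__main__':
    cube = partial(power, exponent=3)  # picklable, unlike a lambda
    p = Pool()
    print(p.map(cube, range(10)))
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]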
Parallel map
from multiprocessing import Pool

def cube(num):
    return num ** 3

if __name__ == '__main__':
    # Can also specify the processes parameter
    p = Pool()
    # Equivalent of serial map(cube, range(10))
    res = p.map(cube, range(10))
    print(res)
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
Note: This overly simplified example runs slower than the serial version
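The asynchronous variant map_async returns immediately with a result object instead of blocking; a minimal sketch of the same computation:

from multiprocessing import Pool

def cube(num):
    return num ** 3

if __name__ == '__main__':
    p = Pool()
    async_res = p.map_async(cube, range(10))
    # Do other work here while the pool computes
    print(async_res.get())  # get() blocks until the result is ready
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]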
Asynchronous function evaluation
from multiprocessing import Pool

def double(arg):
    return 2 * arg

def cube(num):
    return num ** 3

if __name__ == '__main__':
    p = Pool()
    f1_get = p.apply_async(cube, args=(6,))
    f2_get = p.apply_async(double, args=('haha',))
    # Do other work here
    res = [f1_get.get(), f2_get.get()]
    print(res)
    # Output: [216, 'hahahaha']
Sharing data
Shared memory maps
To share state with a Process use shared memory maps:
from multiprocessing import Process, Array

def cube_arr(arr):
    for i in range(len(arr)):
        arr[i] = arr[i] ** 3

if __name__ == '__main__':
    shared_arr = Array('i', range(10))  # typecode 'i' for signed integer
    p = Process(target=cube_arr, args=(shared_arr,))
    p.start()
    # Do other work here
    p.join()
    print(shared_arr[:])
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
Server processes and managers
Alternatively use a Manager for more complex objects:
from multiprocessing import Process, Manager

def modify_names(names):
    names['Alice'] = 42
    if 'Bob' in names:
        names['Bob'] += 1

if __name__ == '__main__':
    with Manager() as m:
        shared_dict = m.dict()
        shared_dict['Bob'] = 7
        p = Process(target=modify_names, args=(shared_dict,))
        p.start()
        # Do other work here
        p.join()
        print(shared_dict)  # Output: {'Bob': 8, 'Alice': 42}
Interprocess communication
Pipes
To communicate between two processes we use a Pipe:
from multiprocessing import Process, Pipe

def cube(my_pipe):
    data = my_pipe.recv()
    my_pipe.send([n ** 3 for n in data])

if __name__ == '__main__':
    parent_pipe, child_pipe = Pipe()
    p = Process(target=cube, args=(child_pipe,))
    p.start()
    parent_pipe.send(range(10))  # Send work
    # Do other work here
    print(parent_pipe.recv())
    p.join()
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
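Pipe() returns a pair of connection objects and is duplex by default, so both ends can send and receive. Each end should only be used by one process at a time; data may become corrupted if two processes write to the same end simultaneously.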
Queues
For several processes we can use a Queue:
from multiprocessing import Process, Queue

def place_work(q):
    q.put(range(10))

def process_work(q):
    data = q.get()
    q.put([n ** 3 for n in data])

if __name__ == '__main__':
    queue = Queue()
    p1 = Process(target=place_work, args=(queue,))
    p2 = Process(target=process_work, args=(queue,))
    p1.start(); p2.start(); p1.join(); p2.join()
    print(queue.get())
    # Output: [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
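Items put on a Queue are pickled and passed to the other processes through an underlying pipe; all required locking is handled internally, which is what makes queues safe for multiple producers and consumers.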
Guidelines (I)
Most important: Only parallelize where necessary!
Try to avoid sharing state if possible
If you need to share:
Use Value and Array for simple data (see the sketch below)
Use a Manager server process for more complex objects
For interprocess communication use Pipe and Queue
Pipes for fast point-to-point communication
Queues are multi-producer, multi-consumer FIFO queues
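A minimal sketch of sharing a simple counter via Value, analogous to the Array example above:

from multiprocessing import Process, Value

def increment(counter):
    with counter.get_lock():  # a Value carries its own lock
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # typecode 'i' for signed integer
    procs = [Process(target=increment, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # Output: 4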
Guidelines (II)
Explicitly pass resources to subprocesses and avoid global variables,
as they can lead to inconsistencies (Windows)
Ensure that the main module can be safely imported by a new Python
instance (Windows)
Use an if __name__ == '__main__' guard to avoid side effects
Avoid Process.terminate, as it might break access to shared resources
Read more on:
https://docs.python.org/2/library/multiprocessing.html#programming-guidelines
Time for a short demo!
Midpoint rule for numerical integration
$$ \int_a^b f(x)\,dx \approx h \sum_{k=1}^{N} f\left( a + \left( k - \frac{1}{2} \right) h \right), \qquad h = \frac{b - a}{N} $$
Allows for straightforward parallel evaluation
$$ \int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx, \qquad c \in \mathbb{R} $$
Can split up a large integration range into several smaller ones, which
can be evaluated in parallel (see the sketch below)
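The actual demo code is not included in the slides; a minimal sketch of how such a parallel midpoint integration could look (the integrand f, the per-chunk resolution and the chunking scheme are my own choices):

from multiprocessing import Pool

def f(x):
    return x ** 2

def midpoint_rule(bounds):
    # Midpoint rule on a single subinterval [a, b]
    a, b = bounds
    n = 10000  # midpoint evaluations per subinterval
    h = (b - a) / n
    return h * sum(f(a + (k - 0.5) * h) for k in range(1, n + 1))

if __name__ == '__main__':
    a, b, num_chunks = 0.0, 1.0, 4
    edges = [a + (b - a) * i / num_chunks for i in range(num_chunks + 1)]
    chunks = list(zip(edges[:-1], edges[1:]))  # subintervals of [a, b]
    p = Pool()
    # Sum of the subinterval integrals; ~0.3333 for x^2 over [0, 1]
    print(sum(p.map(midpoint_rule, chunks)))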
Questions & Answers
(if time permits)
Slides on:
www.github.com/czielinski/pyconsg2015