并行编程模型
在计算机科学中,并行编程模型(Parallel programming model)是并行计算机架构的抽象化,通过它可方便的表达算法和它们在程序中的合成。判别编程模型的价值可以通过它的通用性:在各种不同架构上能表达多大范围的不同问题,和它的性能:编译的程序在执行时有多高的效率[1]。并行编程模型的实现形式可以是从“顺序编程”语言中调用的函式库,作为现存语言的扩展,或作为全新的语言。
围绕特定编程模型的共识是很重要的,这可导致建造不同的并行计算机来支持这个模型,从而促进软件的可移植性。在这个意义上,编程模型被称为在硬件和软件之间的桥接[2]。
并行编程模型的分类
并行编程模型的分类可以宽泛的在两个领域主题下进行:进程交互和问题分解[3][4][5]。
进程交互
进程交互有关于并行进程能够籍此相互通信的机制。最常见的交互形式是共享内存和消息传递。
共享内存
共享内存是在进程间传递数据的高效方式。在共享内存模型中,并行进程共享它们可以异步读与写的全局地址空间。异步并发访问可能导致竞争条件,和用来避免它们的机制如:锁、信号量和监视器。常规的多核处理器直接支持共享内存,很多并行编程语言和库在设计上利用了它,比如采用分叉会合模型的:Cilk、OpenMP和线程建造块[6]。
消息传递
在消息传递模型中,并行进程通过消息传递相互交换数据。这种通信可以是异步的,就是说消息可以在接收者准备好之前发出;或是同步的,就是说消息发出前接收者必须准备好。通信顺序进程(CSP)形式化了使用同步通信通道来连接进程的消息传递,并引出了重要的语言如:Occam、Limbo和Go。与之相对,演员模型使用异步消息传递,并被采用于如下语言的设计中:D、Scala和SALSA[7]。
分布式内存
分布式内存指称一类多处理器计算机系统,其中每个处理器都有自己私有的内存,计算任务只能在本地数据上运算,如果需要远程数据,计算任务必须与一个或多个远程处理器通信。在分布式内存系统编程中的关键要点是如何把数据分布到各个内存上;依赖于所解决的问题,数据可以静态分布,也可以在节点间移动;数据可以在需要时移动,也可以事先推入新的节点。
MPI规定了用于分布式内存系统的通信协议,支持点到点通信和集体通信(collective communication)二者。MPI还是消息传递API,带有对其特征在任何实现中必须如何表现的协议和语义规定[8]。MPI的目标是高性能、可伸缩性和可移植性,目前仍是高性能计算领域中统治性的模型[9]。此外还有支持单边通信(one-sided communication)的分区全局地址空间模型。
数据并行
数据并行模型关注进行运算所在的数据集,典型的是正规结构的数组。一组任务将在这些数据上运算,但是单独的处于在不相交的分区中。数据并行通常对应SPMD编程模型[12],相应执行模型对应费林分类法中的SIMD(例如AVX扩展)或MIMD(例如Xeon Phi)[13],还有GPGPU采用的SIMT[14] (例如NVIDIA Tesla)。
术语
并行计算模型是计算模型的一大范畴,包括:细胞自动机、PRAM机、LogP机、佩特里网、进程网和交互网等。计算模型是用来分析计算进程代价的一种抽象化,利用它分析并行算法性能,可以不取决于特定实现和技术所特有的各种变化,并行算法一般而言是针对特定计算模型而编写的,为PRAM机编写的伪码通常会采用某种For循环形式的并发编程构造[17]。
编程模型指称一种编程样式,即通过看起来像库调用的方式引发执行。例子包括POSIX的Pthreads库和Apache Hadoop中的MapReduce。在这二者情况下,执行模型都符合这个库所用语言的语法却不能按照其语义来理解。不同于计算模型,编程模型特别暗含着对硬件或软件实现的实际考虑[18]。
在并行计算中,执行模型经常必须暴露硬件特征来达成高性能。并行硬件有大量的变种导致了同时需要类似数量的并行执行模型。对每个执行模型都建立一门新语言是不实际的,因此常见的实践都是通过某个API来引发并行执行模型的行为。并行编程语言可以基于一种或一个组合的编程模型。例如,高性能Fortran基于共享内存交互和数据并行问题分解,而Go提供共享内存交互和消息传递交互。
并行编程模型
这里列出的编程模型是可称为桥接模型的计算机的抽象模型[2],它提供了在一个机器的物理实现和编程者可获得的这个机器的抽象概念之间的桥梁;换句话说,它意图在硬件和软件工程师之间提供共同的理解层面。成功的编程模型可以在现实中有效的实现并被编程者有效的作为目标;特别是应当有可能用典型的高级语言编译器生成良好的代码。从编程者的角度来看,这种桥接并行编程模型一般典型的位于Pthreads、IPC、MPI等之上,而在OpenMP、OpenACC等之下。
名称 | 交互类别 | 分解类别 | 实现及用例 |
---|---|---|---|
分叉会合模型 | 共享内存 | 数据 | Cilk,OpenMP,线程建造块[6] |
整体同步并行模型 | 共享内存 | 数据[19] | BSPlib[20]:MulticoreBSP[21]、BSPonMPI[22],Apache Giraph[23],Apache Hama[24] |
分区全局地址空间模型 | 分布式内存 | 数据 | SHMEM[25],Coarray Fortran,Chapel,X10 |
通信顺序进程模式 | 同步消息传递 | 任务 | Occam,Ada,Go |
演员模型 | 异步消息传递 | 任务 | Erlang[26],D,Scala,很多的框架和库实现 |
异构计算框架 | 共享内存 | 数据 | OpenCL,CUDA[27],SYCL,Pocl[28] |
串流处理/数据流程范型 | 管道 | 数据[1] | Brook[29],Apache Flink,RaftLib,TensorFlow,Lustre |
高级综合电子设计自动化 | 通道 | 任务 | C转HDL:Handel-C、SystemC |
OpenCL将计算系统视为组成自一组“计算设备”,它们可以是CPU或是附加的“加速器”比如GPU。它定义了一种类C语言用来写程序。在OpenCL设备上执行的函数叫做“内核”[30]:17。一个单一的计算设备典型的组成自一些“计算单元”,它们依次又包含很多“处理元素”(PE)。一个单一的内核执行可以在所有或多个PE上并行运行。OpenCL定义了API,允许运行于主机上的程序,启动在计算设备上的内核,并管理设备内存,它至少在概念上分离于主机内存。用OpenCL语言写的程序预期会被即时编译,所以使用OpenCL的应用程序在针对各种设备的实现之间是可移植的[31]。
FPGA可以被用来解决任何可计算的问题,这通过用FPGA能实现软微处理器的事实就可轻易的证明。它们的好处在于对某些应用它们明显的要更快速,因为它们有着并行本质和在对用在特定处理上的逻辑门的数目方面的优化[32]。近年来开始兴起使用OpenCL编程来利用FPGA提供的性能和能耗效率。OpenCL允许编程者用C语言编码并把FPGA组合函数作为使用OpenCL构造的OpenCL计算内核的目标[33]。
MapReduce是通过并行、分布式算法在集群上处理和生成键/值对形式的大数据集的编程模型和有关实现[34] ,Apache Hadoop中将它与HDFS分立实现[35]。MapReduce受到了在函数式编程范型中常用的map和reduce函数的启发[36],但是它们在MapReduce框架中的用途不同于它们在起初形式中那样[37]。
并行编程模型还有很多,比如:马里兰大学学院市分校依据PRAM计算模型,建立了指令级并行显式多线程的多处理器计算机和编程语言XMTC[38],实现了Spawn-Join范型[39]。
引用
- Skillicorn, David B., "Models for practical parallel computation", International Journal of Parallel Programming, 20.2 133–158 (1991), https://www.ida.liu.se/~chrke55/papers/modelsurvey.pdf (页面存档备份,存于)
- Leslie G. Valiant. (PDF). Communications of the ACM. August, 1990, 33 (8): 103–111 [2019-12-05]. (原始内容 (PDF)存档于2019-08-11).
- John E. Savage, Models of Computation: Exploring the Power of Computing, 2008, Chapter 7 (Parallel Computation), http://cs.brown.edu/~jes/book/ (页面存档备份,存于)
- Ian Foster, Designing and Building Parallel Programs, 1995, Section 1.3, "A Parallel Programming Model", http://www.mcs.anl.gov/~itf/dbpp/text/node9.html (页面存档备份,存于)
- Blaise Barney, Introduction to Parallel Computing, "Models", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Models (页面存档备份,存于)
- Michael McCool; James Reinders; Arch Robison. (PDF). Elsevier. 2013 [2019-12-12]. (原始内容存档 (PDF)于2018-11-23).
Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines.
- SALSA(页面存档备份,存于)
- Gropp, William; Lusk, Ewing; Skjellum, Anthony. . Parallel Computing. 1996, 22 (6): 789–828. doi:10.1016/0167-8191(96)00024-5.
- Sur, Sayantan; Koop, Matthew J.; Panda, Dhabaleswar K. . . ACM. 4 August 2017: 105. ISBN 978-0769527000. doi:10.1145/1188455.1188565.
- Ian Foster, Designing and Building Parallel Programs, 1995, Section 2.2, "Partitioning", http://www.mcs.anl.gov/~itf/dbpp/text/node16.html (页面存档备份,存于)
- Blaise Barney, Introduction to Parallel Computing, "Partitioning", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning (页面存档备份,存于)
- Blaise Barney. . [2019-12-02]. (原始内容存档于2019-12-24).
SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
- Flynn, Michael J. (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (原始内容 (PDF)存档于2022-04-11).
2) The single-instruction stream-multiple-data stream(SIMD), which includes most array processes, including Solomon[2](Illiac IV).
3) Multiple-instruction stream-single-data stream type organizations(MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum(SD) passed from program step i to program step i+1(MI).
4) Multiple-Instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor."Univac[3], among other corporations, has proposed various MIMD structures. - Michael McCool; James Reinders; Arch Robison. . (PDF). Elsevier. 2013: 52 [2019-12-12]. (原始内容存档 (PDF)于2018-11-23).
There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple “threads” (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously.
. https://docs.nvidia.com/. [2019-12-18]. (原始内容存档于2019-12-01).
On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. - Blaise Barney. . [2019-12-02]. (原始内容存档于2019-12-24).
MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. ……MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partioning).
- McBurney, D. L., and M. Ronan Sleep. "Transputer-based experiments with the ZAPP architecture." PARLE Parallel Architectures and Languages Europe. Springer Berlin Heidelberg, 1987.
Hammond, Kevin. Parallel functional programming: An introduction. In International Symposium on Parallel Symbolic Computation, p. 46. 1994. “The Alfalfa project implemented serial combinators for the Intel iPSC[44]. …… Buckwheat re-implemented this model for the Encore Multimax, a shared-memory multiprocessor. ……[45].” - Blelloch, Guy. (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (原始内容存档 (PDF)于2016-03-04).
An important advance in parallel computing was the introduction of the notion of virtual models. …… Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation.
Siddhartha Chatterjee, Jan Prins. (PDF). Spring 2002 [2019-12-06]. (原始内容存档 (PDF)于2019-12-06).The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. …… In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm.
Vishkin, Uzi, (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (原始内容存档 (PDF)于2017-12-15),For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo “programming construct”, which is heavily used in these notes to express operations that are performed in parallell:
for Pi , 1 ≤ i ≤ n pardo
A(i) := B(i)
This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. …… An alternative model which is actually an alternative presentation mode, called Work-Depth, …… Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
for i , 1 ≤ i ≤ α pardo
A(i) := B(C(i))
where the time unit under consideration, consists of a sequence of α concurrent instructions, for some positive integer α. - Skillicorn, David B., and Domenico Talia, Models and languages for parallel computation, ACM Computing Surveys, 30.2 123–169 (1998), https://www.cs.utexas.edu/users/browne/CS392Cf2000/papers/ModelsOfParallelComputation-Skillicorn.pdf (页面存档备份,存于)
- Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. (PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (原始内容存档 (PDF)于2019-11-04).
Like many other communications libraries, BSPlib adopts a Single Program Multiple Data(SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process.
- BSPlib(页面存档备份,存于)
- MulticoreBSP(页面存档备份,存于)
- BSPonMPI(页面存档备份,存于)
- . [2019-12-09]. (原始内容存档于2019-12-19).
Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, …… Computation proceeds as a sequence of iterations, called supersteps in BSP. ……There is a barrier between consecutive supersteps.
- . [2019-12-09]. (原始内容存档于2019-12-18).
Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, ……It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief.
- . [2019-12-09]. (原始内容存档于2019-12-09).
The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. ……Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. …… UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models.
- Armstrong, Joe. . Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (原始内容存档于2020-06-09).
Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP.
- (PDF). 6/23/2007 [2019-12-09]. (原始内容存档 (PDF)于2018-04-17).
CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API.
. August 24, 2009 [2019-12-09]. (原始内容存档于2019-12-09).OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics.
. [2019-12-09]. (原始内容存档于2019-12-09).Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation.
. [2019-12-09]. (原始内容存档于2019-12-09).MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node.
. [2019-12-09]. (原始内容存档于2019-12-09).NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.
- Pocl(页面存档备份,存于)
- . [2019-12-09]. (原始内容存档于2019-11-19).
Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, ……A stream is a new data type addition which represents a collection of data which can be operated on in parallel. ……Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams.
- Howes, Lee. (PDF). Khronos OpenCL Working Group. November 11, 2015 [November 16, 2015]. (原始内容存档 (PDF)于2015-11-18).
- Stone, John E.; Gohara, David; Shi, Guochin. . Computing in Science & Engineering. 2010, 12 (3): 66–73. Bibcode:2010CSE....12c..66S. PMC 2964860 . PMID 21037981. doi:10.1109/MCSE.2010.69.
- . secdatabase.com. [May 6, 2018]. (原始内容存档于2018-05-07).
- . StreamComputing. 2014-09-16 [2019-12-06]. (原始内容存档于2017-01-01).
- (PDF). googleusercontent.com. [2019-12-05]. (原始内容存档 (PDF)于2020-06-15).
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
- . Apache Hadoop. [3 July 2019]. (原始内容存档于2019-12-14).
- (PDF). googleusercontent.com. [2019-12-05]. (原始内容存档 (PDF)于2020-06-15).
Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
- Lämmel, R. (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (原始内容存档 (PDF)于2019-12-07).
The following overview lists more detailed questions and summarizes our findings:
Is MAP essentially the map combinator? NO.
Is MAP essentially an application of the map combinator? NO.
Does MAP essentially serve as the argument of map? YES.
Is REDUCE essentially the reduce combinator? NO.
Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
Does REDUCE essentially serve as the argument of reduce? NO.
Does REDUCE essentially serve as the argument of map? YES.
Hence, there is no trivial correspondence between MapReduce's MAP&REDUCE and what is normally called map&reduce in functional programming. - Vishkin, Uzi, (PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757,
The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows.
For i =0 to n−1 do
X:=A( i ) ; A( i ):=B( i ) ; B( i ):=X
…… A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n, the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
For i =0 to n−1 pardo
X( i ):=A( i ) ; A( i ):=B( i ) ; B( i ):=X( i )
…… Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
spawn (0 ,n−1) {
var x
x:=A( $ ) ; A( $):=B( $ ) ; B($ ):=x
}
The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. …… Commitments to silicon of XMT include a 64-processor, 75MHz computer based on field-programmable gate array(FPGA) technology [29], and 64-processor ASIC 10mm X 10mm chip using IBM’ s 90nm technology, pictured in Figure5. A basic yet stable compiler has also been developed. …… XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 year. - Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, , : 140–151, 1998 [2019-12-12], (原始内容存档于2017-09-21),
This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm …… . …… We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is more involved model concept enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactive, threads.
进一步阅读
- Blaise Barney, , Lawrence Livermore National Laboratory
- Murray I. Cole., (PDF), University of Glasgow
- J. Darlinton; M. Ghanem; H. W. To, (PDF), In Programming Models for Massively Parallel Computers. IEEE Computer Society Press, 1993
- Ian Foster, , Argonne National Laboratory