Introduction to High Performance Computing
Introduction to High Performance Computing

Francisco Almeida and Francisco de Sande
Departamento de Estadística, I.O. y Computación
Universidad de La Laguna

La Laguna, February 12, 2004
Questions
Why Parallel Computers?
How Can the Quality of the Algorithms Be Analyzed?
How Should Parallel Computers Be Programmed?
Why the Message Passing Programming Paradigm?
Why the Shared Memory Programming Paradigm?
OUTLINE

Introduction to Parallel Computing
Performance Metrics
Models of Parallel Computers
The MPI Message Passing Library: Examples
The OpenMP Shared Memory Library: Examples
Improvements in black hole detection using parallelism
Why Parallel Computers?

Applications demanding more computational power:
Artificial Intelligence
Weather Prediction
Biosphere Modeling
Processing of Large Amounts of Data (from sources such as satellites)
Combinatorial Optimization
Image Processing
Neural Networks
Speech Recognition
Natural Language Understanding
etc.
Figure: cost/performance evolution of SUPERCOMPUTERS from the 1960s through the 1990s.
Top500
www.top500.org
Performance Metrics
Speed-up
Ts = Sequential Run Time: time elapsed between the beginning and the end of the program's execution on a sequential computer.
Tp = Parallel Run Time: Time that elapses from the moment that a parallel computation starts to the moment that the last processor finishes the execution.
Speed-up: S = T*s / Tp ≤ p
T*s = Time of the best sequential algorithm to solve the problem.
Speed-up

Figure (Linear and Actual Speed-up): the linear speed-up line; speed-up vs. number of processors (both axes 0-100).
Speed-up

Figure (Linear and Actual Speed-up): linear speed-up and the measured curve Actual Speed-up 1; speed-up vs. number of processors (0-100).
Speed-up

Figure: linear speed-up, Actual Speed-up 1 and Actual Speed-up 2; speed-up vs. number of processors (0-100).
Speed-up

Figure: linear speed-up, Actual Speed-up 1 and Actual Speed-up 2, with the optimal number of processors marked on each measured curve; speed-up vs. number of processors (0-100).
Efficiency
In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm.

Efficiency: a measure of the fraction of time for which a processing element is usefully employed.

E = (Speed-up / p) x 100%
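As a minimal illustration (the timings below are assumed values, not measurements from the talk), speed-up and efficiency can be computed directly from the definitions:

#include <stdio.h>

int main(void) {
    double ts = 100.0;   /* assumed sequential run time (seconds)      */
    double tp = 6.5;     /* assumed parallel run time on p processors  */
    int p = 16;
    double speedup = ts / tp;                 /* S = Ts / Tp            */
    double efficiency = speedup / p * 100.0;  /* E = (S / p) x 100%     */
    printf("S = %.2f, E = %.1f%%\n", speedup, efficiency);
    return 0;
}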
Efficiency

Figure: Optimum Efficiency, Actual Efficiency 1 and Actual Efficiency 2; efficiency (%) vs. number of processors (0-100).
Amdahl's Law

Amdahl's law attempts to give a maximum bound for the speed-up, from the nature of the algorithm chosen for the parallel implementation.
Seq = proportion of time the algorithm needs to spend in purely sequential parts.
Par = proportion of time that might be done in parallel.
Seq + Par = 1 (where 1 is for algebraic simplicity).

Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p)

With p = 1000:

Seq (%)   Seq     Par     Maximum Speed-up
0.1       0.001   0.999   500.25
0.5       0.005   0.995   166.81
1         0.01    0.99     90.99
10        0.1     0.9       9.91
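A small C sketch of the formula above; the table's values can be reproduced with Seq = 0.001, 0.005, 0.01, 0.1 and p = 1000:

#include <stdio.h>

/* Maximum speed-up predicted by Amdahl's law:
   S(p) = 1 / (Seq + Par/p), with Seq + Par = 1. */
double amdahl(double seq, int p) {
    return 1.0 / (seq + (1.0 - seq) / p);
}

int main(void) {
    double seqs[] = { 0.001, 0.005, 0.01, 0.1 };
    int i;
    for (i = 0; i < 4; i++)
        printf("Seq = %.3f -> max speed-up = %.2f\n",
               seqs[i], amdahl(seqs[i], 1000));
    return 0;
}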
Example
A problem to be solved many times over several different inputs: evaluate F(x,y,z) with
x in {1, ..., 20}; y in {1, ..., 10}; z in {1, ..., 3}.
The total number of evaluations is 20*10*3 = 600.
The cost to evaluate F at one point (x, y, z) is t.
The total running time is t * 600. If t equals 3 hours,
the total running time for the 600 evaluations is 1800 hours ≈ 75 days.
Speed-up

Figure: linear speed-up, Actual Speed-up 1, Actual Speed-up 2 and a super-linear speed-up curve; speed-up vs. number of processors (0-100).
Models
The Sequential Model
RAM
Von Neumann
The RAM model expresses computations on von Neumann architectures.
The von Neumann architecture is universally accepted for sequential computations.
The Parallel Model
Computational Models: PRAM
Architectural Models: BSP, LogP
Programming Models: PVM, MPI, HPF, Threads, OpenMP
Parallel Architectures
Address-Space Organization
Digital AlphaServer 8400
Hardware
• C4-CEPBA
• 10 Alpha 21164 processors
• 2 GB memory
• 8.8 Gflop/s
• Shared memory
• Bus topology
SGI Origin 2000
Hardware
• C4-CEPBA
• 64 R10000 processors
• 8 GB memory
• 32 Gflop/s
• Distributed shared memory
• Hypercube topology
The SGI Origin 3000 Architecture (1/2): jen50.ciemat.es

160 MIPS R14000 processors at 600 MHz, on 40 nodes with 4 processors each
Data and instruction cache on-chip
Irix Operating System
Hypercube network
The SGI Origin 3000 Architecture (2/2)
cc-NUMA memory architecture
1 Gflop/s peak speed
8 MB external cache
1 GB main memory per processor
1 TB hard disk
Beowulf Computers
• Distributed Memory
• COTS: Commercial-Off-The-Shelf computers
Towards Grid Computing….
Source: www.globus.org & updated
The Parallel Model
Computational Models: PRAM
Architectural Models: BSP, LogP
Programming Models: PVM, MPI, HPF, Threads, OpenMP
Parallel Architectures
Drawbacks that arise when solving problems using parallelism

Parallel programming is more complex than sequential programming.
Results may vary as a consequence of intrinsic non-determinism.
New problems appear: deadlocks, starvation...
It is more difficult to debug parallel programs.
Parallel programs are less portable.
The Message Passing Model
The Message Passing Model
Figure: several processors connected by an interconnection network; processes communicate with explicit Send(parameters) and Recv(parameters) operations.
MPI
Figure: MPI gathers the experience of earlier message-passing systems (EUI, p4, PVM, Express, Zipcode, CMMD, PARMACS) and of parallel libraries, parallel applications and parallel languages.
MPI
What is MPI?
• Message Passing Interface standard
• The first standard and portable message passing library with good performance
• "Standard" by consensus of MPI Forum participants from over 40 organizations
• Finished and published in May 1994, updated in June 1995

What does MPI offer?
• Standardization - on many levels
• Portability - to existing and new systems
• Performance - comparable to vendors' proprietary libraries
• Richness - extensive functionality, many quality implementations
A Simple MPI Program: hello.c

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  printf("Hello from processor %d of %d\n", name, p);
  MPI_Finalize();
  return 0;
}

$> mpicc -o hello hello.c
$> mpirun -np 4 hello
Hello from processor 2 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Hello from processor 0 of 4
A Simple MPI Program: helloms.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d ", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}

$> mpirun -np 4 helloms
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4 greetings from process 1!
greetings from process 2!
greetings from process 3!
Linear Model to Predict Communication Performance
Time to send n bytes: t(n) = τ·n + β (τ: per-byte transfer time; β: startup/latency time).

Figure: measured communication times (seconds, logarithmic scale) vs. message size, with linear fits:
CRAY T3E: t(n) ≈ 5E-08·n + 5E-05
BEOULL:   t(n) ≈ 7E-07·n + 0.0003
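A linear model like this is typically fitted from ping-pong measurements. The following is a hedged sketch of such a benchmark (message size, repetition count and the use of half the round-trip time are illustrative assumptions, not the actual measurement code behind the figure):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    static char buf[1000000];            /* message buffer                   */
    int rank, n = 1000000, reps = 100, i;
    double t0, t1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {                 /* ping: send, then wait for echo   */
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {          /* pong: echo the message back      */
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)                       /* half a round trip ≈ τ·n + β      */
        printf("t(%d bytes) = %g s\n", n, (t1 - t0) / (2.0 * reps));
    MPI_Finalize();
    return 0;
}

Repeating this for several message sizes n and fitting a straight line yields the τ and β of each network.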
Performance Prediction: Fast, Gigabit Ethernet, Myrinet

Figure: measured communication times (seconds) vs. message size (bytes), shown on linear and logarithmic scales, with linear fits:
Fast:    y = 9E-08·x + 0.0007
Giga:    y = 5E-08·x + 0.0003
Myrinet: y = 4E-09·x + 2E-05
Basic Communication Operations
One-to-all Broadcast and Single-node Accumulation

One-to-all broadcast: one processor sends the same message M to all p processors.
Single-node accumulation (the dual operation): the messages of all processors are combined, step by step (steps 1 to p), at a single node.
Broadcast on Hypercubes

Figure: first and second steps of a broadcast on a 3-dimensional hypercube (nodes 0-7). In each step, every node that already has the message sends it to its neighbour across the next dimension.
Broadcast on Hypercubes

Figure: third step of the broadcast; after log2(8) = 3 steps all eight nodes have the message.
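The doubling pattern of the figures can be written down directly with point-to-point messages. A minimal sketch (assuming the root is rank 0 and the number of processes is a power of two; this is an illustration, not a replacement for MPI_Bcast):

#include "mpi.h"

/* Recursive-doubling broadcast on a hypercube of p = 2^d processes:
   before step k the informed nodes are ranks < 2^k; each of them
   sends to its neighbour across dimension k (rank XOR 2^k). */
void hypercube_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, p, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;
        if (rank < mask && partner < p)               /* informed: send     */
            MPI_Send(buf, count, type, partner, 0, comm);
        else if (rank >= mask && rank < (mask << 1))  /* receive this step  */
            MPI_Recv(buf, count, type, partner, 0, comm, MPI_STATUS_IGNORE);
    }
}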
MPI Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

Broadcasts a message from the process with rank "root" to all other processes of the group.
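A minimal usage sketch (assumed example, in the style of the π program shown later):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 100;                     /* only the root knows n   */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d: n = %d\n", rank, n);       /* every rank prints 100   */
    MPI_Finalize();
    return 0;
}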
Reduction on Hypercubes

@ is a commutative and associative operator; Ai resides in processor i.
Every processor has to obtain A0 @ A1 @ ... @ A(p-1).

Figure: reduction on a 3-dimensional hypercube (nodes 000-111). In each step, partner nodes exchange their partial results across one dimension and combine them; after log2(p) steps every node holds the full result A0 @ A1 @ ... @ A7.
Reductions with MPI

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

Reduces values on all processes to a single value on the root process.

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

Combines values from all processes and distributes the result back to all of them.
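As a usage illustration (a minimal assumed example, not from the slides), every process contributes one value and all of them receive the combined result:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, p, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    /* every process contributes its rank; all receive the total */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: sum = %d (expected %d)\n", rank, sum, p * (p - 1) / 2);
    MPI_Finalize();
    return 0;
}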
All-to-All Broadcast and Multinode Accumulation

All-to-all broadcast: every processor i sends its own message Mi to all the others, so each processor ends up with M0, M1, ..., Mp.
Multinode accumulation (the dual operation) underlies reductions and prefix sums.
MPI Collective Operations

MPI Operator   Operation
---------------------------------------------------------------
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical and
MPI_BAND bitwise and
MPI_LOR logical or
MPI_BOR bitwise or
MPI_LXOR logical exclusive or
MPI_BXOR bitwise exclusive or
MPI_MAXLOC max value and location
MPI_MINLOC min value and location
Computing π: Sequential

π = ∫₀¹ 4 / (1 + x²) dx

double x, pi = 0.0, h;
long i, n = ...;
...
h = 1.0 / (double) n;
for (i = 0; i < n; i++) {
  x = (i + 0.5) * h;
  pi += f(x);          /* f(x) = 4.0 / (1.0 + x*x) */
}
pi *= h;
Computing π: Parallel

π = ∫₀¹ 4 / (1 + x²) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = name; i < n; i += numprocs) {
  x = h * (i + 0.5);
  mypi += f(x);
}
mypi *= h;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
The Master Slave Paradigm
Master
Slaves
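The diagram above can be realized as a task farm; below is a hedged C sketch (the tags, do_task() and the task count are hypothetical placeholders, not code from the talk). The master deals out task indices and collects results; each slave loops until a stop tag arrives.

#include <stdio.h>
#include "mpi.h"

#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int k) { return (double) k * k; }  /* placeholder */

int main(int argc, char *argv[]) {
    int rank, p, k, src, ntasks = 100;
    double res;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (rank == 0) {                              /* master               */
        int sent = 0, done = 0;
        for (src = 1; src < p && sent < ntasks; src++) {
            MPI_Send(&sent, 1, MPI_INT, src, TAG_WORK, MPI_COMM_WORLD);
            sent++;
        }
        while (done < sent) {                     /* collect, refill      */
            MPI_Recv(&res, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            done++;
            if (sent < ntasks) {
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                sent++;
            }
        }
        for (src = 1; src < p; src++)             /* tell slaves to stop  */
            MPI_Send(&src, 0, MPI_INT, src, TAG_STOP, MPI_COMM_WORLD);
    } else {                                      /* slave                */
        while (1) {
            MPI_Recv(&k, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            res = do_task(k);
            MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}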
Condor. University of Wisconsin-Madison. www.cs.wisc.edu/condor

A problem to be solved many times over several different inputs.
The problem to be solved is computationally expensive.
Condor

Condor is a specialized workload management system for compute-intensive jobs.
Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations.
In many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.
As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.
What is OpenMP?
Application Program Interface (API) for shared memory parallel programming: what the application programmer inserts into code to make it run in parallel. Addresses only shared memory multiprocessors.
Directive-based approach with library support.
Concept of a base language and extensions to the base language.
OpenMP is available for Fortran 90 / 77 and C / C++.
OpenMP is not...
Not automatic parallelization: the user explicitly specifies parallel execution, and the compiler does not ignore user directives, even if they are wrong.
Not just loop-level parallelism: functionality to enable coarse-grained parallelism.
Not a research project: only practical constructs that can be implemented with high performance in commercial compilers. The goal of parallel programming: application speedup.
Simple/minimal, with opportunities for extensibility.
Why OpenMP?
Parallel programming landscape before OpenMP:
There was a standard way to program distributed memory computers (MPI and PVM), but no standard API for shared memory programming.
Several vendors had directive-based APIs for shared memory programming: Silicon Graphics, Cray Research, Kuck & Associates, Digital Equipment...
All were different and vendor proprietary; sometimes similar but with different spellings.
Most were targeted at loop-level parallelism, with limited functionality - mostly for parallelizing loops.
Why OpenMP? (cont.)
Commercial users and high-end software vendors have a big investment in existing code: they are not very eager to rewrite their code in new languages, and there are performance concerns about new languages.
End result: users who wanted portability were forced to program shared memory machines using MPI - library based, with good performance and scalability, but sacrificing the built-in shared memory advantages of the hardware.
Both require a major investment in time and money: the entire program needs to be rewritten, and new features need to be curtailed during conversion.
The OpenMP API

Multi-platform shared-memory parallel programming
OpenMP is portable: supported by Compaq, HP, IBM, Intel, SGI, Sun and others on Unix and NT
Multi-Language: C, C++, F77, F90
Scalable
Loop Level parallel control
Coarse grained parallel control
The OpenMP API
Single source parallel/serial programming: OpenMP is not intrusive (to the original serial code). Instructions appear in comment statements for Fortran and through pragmas for C/C++.
Incremental implementation: OpenMP programs can be implemented incrementally, one subroutine (function) or even one do (for) loop at a time.

!$omp parallel do
      do i = 1, n
        ...
      enddo

#pragma omp parallel for
for (i = 0; i < n; i++) { ... }
Threads
Multithreading: sharing a single CPU between multiple tasks (or "threads") in a way designed to minimise the time required to switch threads.
This is accomplished by sharing as much as possible of the program execution environment between the different threads, so that very little state needs to be saved and restored when changing thread.
Threads share more of their environment with each other than do tasks under multitasking.
Threads may be distinguished only by the value of their program counters and stack pointers, while sharing a single address space and set of global variables.
OpenMP Overview: How do threads interact?
OpenMP is a shared memory model: threads communicate by sharing variables.
Unintended sharing of data causes race conditions. Race condition: when the program's outcome changes as the threads are scheduled differently.
To control race conditions, use synchronization to protect data conflicts.
Synchronization is expensive, so: change how data is accessed to minimize the need for synchronization (see the sketch below).
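A minimal sketch of the point above (an assumed example, not from the slides): without protection, the concurrent updates of sum would race; the reduction clause (or a critical section) removes the conflict.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, n = 1000000;
    double sum = 0.0;
    /* without reduction(+:sum), the concurrent "sum +=" updates race */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += 1.0;
    printf("sum = %.0f (expected %d)\n", sum, n);
    return 0;
}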
OpenMP Parallel Computing Solution Stack

User layer: End User, Application
Programming layer (OpenMP API): Directives, OpenMP Library, Environment Variables
System layer: Runtime Library, OS/system support for shared memory
Reasoning about programming
Programming is a process of successive refinement of a solution relative to a hierarchy of models.
The models represent the problem at different levels of abstraction: the top levels express the problem in the original problem domain; the lower levels represent the problem in the computer's domain.
The models are informal, but detailed enough to support simulation.
Source: J.-M. Hoc, T.R.G. Green, R. Samurcay and D.J. Gilmore (eds.), Psychology of Programming, Academic Press Ltd., 1990
Layers of abstraction in Programming

Domains: Problem → Algorithm → Source Code → Computation → Hardware
Models bridging them: Specification (Problem→Algorithm), Programming (Algorithm→Source Code), Computational (Source Code→Computation), Cost (Computation→Hardware)

OpenMP only defines these two! (the Programming and Computational models)
The OpenMP Computational Model

OpenMP was created with a particular abstract machine or computational model in mind:
Multiple processing elements
A shared address space with "equal-time" access for each processor
Multiple lightweight processes (threads) managed outside of OpenMP (the OS or some other "third party")

Figure: Proc 1 ... Proc N all connected to a shared address space.
The OpenMP programming model

Fork-join parallelism: the master thread spawns a team of threads as needed. Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program.

Figure: master thread, parallel regions, and a nested parallel region.
So, how good is OpenMP?

A high quality programming environment supports transparent mapping between models.
OpenMP does this quite well for the models it defines:
Programming model: threads forked by OpenMP map onto threads in modern OSs.
Computational model: multiprocessor systems with cache coherency map onto the OpenMP shared address space.
What about the cost model?
OpenMP doesn't say much about the cost model; programmers are left to their own devices.
Real systems have memory hierarchies, but OpenMP's assumed machine model doesn't: caches mean some data is closer to some processors, and scalable multiprocessor systems organize their RAM into modules - another source of NUMA.
OpenMP programmers must deal with these issues as they optimize performance on each platform and scale to run onto larger NUMA machines.
What about the specification model?
Programmers reason in terms of a specification model as they design parallel algorithms.
Some parallel algorithms are natural in OpenMP: the specification models implied by loop-splitting and SPMD algorithms map well onto OpenMP's programming model.
Some parallel algorithms are hard for OpenMP: recursive problems and list processing are challenging for OpenMP's models.
Is OpenMP a "good" API?

Model: Bridges between domains
Specification: Fair (5)
Programming and Computational: Good (9 and 8)
Cost: Poor (3)

Overall score: OpenMP is "OK" (6), but it sure could be better!
OpenMP today
Hardware Vendors
Compaq/Digital (DEC)
Hewlett-Packard (HP)
IBM
Intel
Silicon Graphics
Sun Microsystems
3rd Party Software Vendors
Absoft
Edinburgh Portable Compilers (EPC)
Kuck & Associates (KAI)
Myrias
Numerical Algorithms Group (NAG)
Portland Group (PGI)
The OpenMP components

Directives
Environment Variables
Shared / Private Variables
Runtime Library
OS ‘Threads’
The OpenMP directives
Directives are special comments in the language: Fortran fixed form: !$OMP, C$OMP, *$OMP; Fortran free form: !$OMP
Special comments are interpreted by OpenMP compilers.

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do I=1,n
        x = w*(I-0.5)
        sum = sum + f(x)
      end do
      pi = w*sum
      print *, pi
      end

A comment in Fortran, but interpreted by OpenMP compilers. Don't worry about details now!
The OpenMP directives
Directives look like a comment (sentinel / pragma syntax): F77/F90: !$OMP directive_name [clauses]; C/C++: #pragma omp pragma_name [clauses]
They declare the start and end of multithread execution, control work distribution, control how data is brought into and taken out of parallel sections, and control how data is written/read inside sections.
The OpenMP environment variables
OMP_NUM_THREADS - number of threads to run in a parallel section
MPSTKZ - size of stack to provide for each thread
OMP_SCHEDULE - control the schedule of parallel loop executions:
setenv OMP_SCHEDULE "STATIC, 5"
setenv OMP_SCHEDULE "GUIDED, 8"
setenv OMP_SCHEDULE "DYNAMIC"
OMP_DYNAMIC
OMP_NESTED
Shared / Private variables
Shared variables can be accessed by all of the threads.
Private variables are local to each thread. In a 'typical' parallel loop, the loop index is private, while the data being indexed is shared:

!$omp parallel do
!$omp& shared(X,Y,Z), private(I)
      do I = 1, 100
        Z(I) = X(I) + Y(I)
      end do
!$omp end parallel do
OpenMP Runtime routines
Writing a parallel section of code is a matter of asking two questions: How many threads are working in this section? Which thread am I?
Other things you may wish to know: How many processors are there? Am I in a parallel section? How do I control the number of threads?
Control the execution by setting directive state.
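For example, the standard runtime routines answer these questions directly (a minimal C sketch):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("processors available: %d\n", omp_get_num_procs());
    #pragma omp parallel
    {
        /* "How many threads are working? Which thread am I?" */
        printf("thread %d of %d (in parallel section: %d)\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_in_parallel());
    }
    return 0;
}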
OS 'threads'
In the case of Linux, it needs to be installed with an SMP kernel.
Not a good idea to assign more threads than CPUs available:
omp_set_num_threads(omp_get_num_procs())
A simple example: computing π

π = ∫₀¹ 4 / (1 + x²) dx ≈ (1/N) · Σ_{i=0..N−1} 4 / (1 + xᵢ²),   with xᵢ = (i + 0.5) / N
Computing π

double t, pi = 0.0, w;
long i, n = 100000000;
w = 1.0 / n;
...
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...
Computing π: #pragma omp directives in C (ignored by non-OpenMP compilers)

double t, pi = 0.0, w;
long i, n = 100000000;
w = 1.0 / n;
...
#pragma omp parallel for reduction(+:pi) private(i,t)
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...
Computing π on a SunFire 6800 (results figure)
Compiling OpenMP programs

OpenMP directives are ignored by default. Example, SGI Irix platforms:

f90 -O3 foo.f
cc -O3 foo.c

OpenMP directives are enabled with "-mp". Example, SGI Irix platforms:

f90 -O3 -mp foo.f
cc -O3 -mp foo.c
Fortran example

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j)=i+(j-1)*n
        enddo
      enddo
c$omp end parallel
      end

OpenMP directives used:
c$omp parallel [clauses]
c$omp end parallel

Parallel clauses include:
default(none|private|shared)
private(...)
shared(...)
Fortran example

(the same f77_parallel code as above, repeated on the slide)
Each arrow denotes one thread; all threads perform an identical task.
Fortran example

(the same f77_parallel code as above, repeated on the slide)
With default scheduling, thread a works on j = 1:5, thread b on j = 6:10, thread c on j = 11:15 and thread d on j = 16:20.
The Foundations of OpenMP

OpenMP: a parallel programming API
Layers of abstractions or models used to understand and use OpenMP
Parallelism: working with concurrency
Summary of OpenMP Basics

Parallel Region (to create threads): C$omp parallel / #pragma omp parallel
Worksharing (to split up work between threads): C$omp do / #pragma omp for; C$omp sections / #pragma omp sections; C$omp single / #pragma omp single; C$omp workshare
Data Environment (to manage data sharing): directive: threadprivate; clauses: shared, private, lastprivate, reduction, copyin, copyprivate
Synchronization: directives critical, barrier, atomic, flush, ordered, master
Runtime functions / environment variables
Improvements in black hole detection using parallelism

Francisco Almeida¹, Evencio Mediavilla², Álex Oscoz² and Francisco de Sande¹

¹ Departamento de Estadística, I.O. y Computación, Universidad de La Laguna, Tenerife, Canary Islands, SPAIN
² Instituto de Astrofísica de Canarias

Dresden, September 3, 2003
Introduction (1/3)
Very frequently there is a divorce between computer scientists and researchers in other scientific disciplines.
This work collects the experiences of a collaboration between researchers coming from two different fields: astrophysics and parallel computing.
We present different approaches to the parallelization of a scientific code that solves an important problem in astrophysics: the detection of supermassive black holes.
Introduction (2/3)
The IAC co-authors initially developed a Fortran77 code solving the problem.
The execution time of this original code was not acceptable, and that motivated them to contact researchers with expertise in the parallel computing field.
We knew in advance that these scientific programmers deal with compute-intensive, time-consuming sequential codes that are not difficult to tackle using HPC techniques.
Researchers with a purely scientific background are interested in these techniques, but they are not willing to spend time learning about them.
Introduction (3/3)
One of our constraints was to introduce the minimum amount of changes in the original code, even with the knowledge that some optimizations could be done in the sequential code.
Preserving the use of the NAG functions was another restriction in our development.
Outline
The Problem: black holes and quasars; the method: gravitational lensing; fluctuations in quasar light curves; mathematical formulation of the problem
Sequential code
Parallelizations: MPI, OpenMP, mixed MPI/OpenMP
Computational results
Conclusions
Black holes
Supermassive black holes (SMBH) are supposed to exist in the nuclei of many, if not all, galaxies.
Some of these objects are surrounded by a disk of material continuously spiraling towards the deep gravitational potential pit of the SMBH and releasing huge quantities of energy, giving rise to the phenomena known as quasars (QSO).
The accretion disk
Quasars (Quasi-Stellar Objects, QSO)

QSOs are currently believed to be the most luminous and distant objects in the universe.
QSOs are the cores of massive galaxies with supergiant black holes that devour stars at a rapid rate, enough to produce the amount of energy observed by a telescope.
The method
We are interested in objects of dimensions comparable to the Solar System in galaxies very far away from the Milky Way.
Objects of this size cannot be directly imaged, so alternative observational methods are used to study their structure.
The method we use is the observation of QSO images affected by a microlensing event to study the structure of the accretion disk.
Gravitational Microlensing
Gravitational lensing (the attraction of light by matter) was predicted by General Relativity and observationally confirmed in 1919.
If light from a QSO passes through a galaxy located between the QSO and the observer, it is possible that a star in the intervening galaxy crosses the QSO light beam.
The gravitational field of the star amplifies the light emission coming from the accretion disk (gravitational microlensing).
Microlensing Quasar-Star

Figure: microlensing of a quasar by a star - observer, quasar, macroimages and microimages.
Microlensing
The phenomenon is more complex because the magnification is not due to a single isolated microlens; rather, it is a collective effect of many stars.
As the stars are moving with respect to the QSO light beam, the amplification varies during the crossing.
Microlensing
Double Microlens
Multiple Microlens

Magnification pattern in the source plane, produced by a dense field of stars in the lensing galaxy. The color reflects the magnification as a function of the quasar position: the sequence blue-green-red-yellow indicates increasing magnification.
Q2237+0305 (1/2)
So far the best example of a microlensed quasar is the quadruple quasar Q2237+0305
Q2237+0305 (2/2)
Fluctuations in Q2237+0305 light curves
Lightcurves of the four images of Q2237+0305 over a period of almost ten years
The changes in relative brightness are very obvious
Q2237+0305
In Q2237+0305, thanks to the unusually small distance between the observer and the lensing galaxy, the microlensing events have a timescale of the order of months.
We observed Q2237+0305 from October 1999 for approximately 4 months at the Roque de los Muchachos Observatory.
Fluctuations in light curves

The curve representing the change in luminosity of the QSO with time depends on the position of the star and also on the structure of the accretion disk.
Microlens-induced fluctuations in the observed brightness of the quasar contain information about the light-emitting source (the size of the continuum region or the broad-line region of the quasar, its brightness profile, etc.).
Hence, from a comparison between observed and simulated quasar microlensing we can draw conclusions about the accretion disk.
Our goal is to model light curves of QSO images affected by a microlensing event to study the unresolved structure of the accretion disk
Q2237+0305 light curves (2/2)
Mathematical formulation (1/5)

Leaving aside the physical meaning of the different variables, the function modeling the dependence of the observed flux on time t can be written as: [the model formula was an image and is not recoverable from this transcript]
where the relevant parameter is the ratio between the outer and inner radii of the accretion disk (a fixed value is adopted for it).
Mathematical formulation (2/5)

where G is the function: [formula not recoverable from this transcript]
To speed up the computation, G has been approximated using MATHEMATICA.
Mathematical formulation (3/5)

Our goal is to estimate the values of the five parameters of the observed flux - among them B, C and t0 (the remaining symbols were lost in transcription) - by fitting the model to the observational data.
Mathematical formulation (4/5)

Specifically, we look for the values of the 5 parameters that minimize the error between the theoretical model and the observational data according to a chi-square criterion:

χ² = Σ_{i=1..N} [ F_obs(tᵢ) − F(tᵢ) ]² / σᵢ²

where N is the number of data points corresponding to times tᵢ (i = 1, 2, ..., N), F(tᵢ) is the theoretical function evaluated at time tᵢ, and σᵢ is the observational error associated with each data value.
Mathematical formulation (5/5)

To minimize χ² we use the e04ccf NAG routine, which only requires evaluation of the function and not of its derivatives.
The determination of the minimum in the 5-parameter space depends on the initial conditions, so we consider a 5-dimensional grid of starting points, with m sampling intervals in each variable, and for each one of the points of this grid we compute a local minimum.
Finally, we select the absolute minimum among them.
The sequential code

      program seq_black_hole
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k1=1, m
       do k2=1, m
        do k3=1, m
         do k4=1, m
          do k5=1, m
c           Initialize starting point x(1), ..., x(5)
            call jic2(nfit,x,fx)
            call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
            if (fx improves best fx) then
              update(best(x, fx))
            endif
          enddo
         enddo
        enddo
       enddo
      enddo
      end
The sequential code

(the same seq_black_hole code as above, repeated on the slide)
Sequential Times
On a Sun Blade 100 workstation running Solaris 5.0, using the native Fortran77 Sun compiler (v. 5.0) with full optimizations, this code takes:
5.89 hours for sampling intervals of size m=4
12.45 hours for sampling intervals of size m=5
On an SGI Origin 3000, using the native MIPSpro F77 compiler (v. 7.4) with full optimizations:
0.91 hours for m=4
2.74 hours for m=5
Loop transformation

      program seq_black_hole2
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      integer k1, k2, k3, k4, k5
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k = 1, m**5
c       Index transformation
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
      end
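The index transformation that collapses the five nested loops can be done with integer division and remainder; a sketch (a hypothetical helper, shown in C for brevity):

/* Map a flat index k in [0, m^5) back to 5 loop indices in [1, m].
   This is the "index transformation" used in the collapsed loop. */
void unflatten(int k, int m, int idx[5]) {
    int j;
    for (j = 4; j >= 0; j--) {
        idx[j] = k % m + 1;   /* Fortran-style 1-based index */
        k /= m;
    }
}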
MPI & OpenMP (1/2)

In the last years, OpenMP and MPI have been universally accepted as the standard tools to develop parallel applications.
OpenMP is a standard for shared memory programming: it uses a fork-join model and is mainly based on compiler directives added to the code that indicate to the compiler the regions of code to be executed in parallel.
MPI uses an SPMD model. Processes can read and write only to their respective local memory; data are copied across local memories using subroutine calls. The MPI standard defines the set of functions and procedures available to the programmer.
MPI & OpenMP (2/2)

Each of these two alternatives has both advantages and disadvantages, and very frequently it is not obvious which one should be selected for a specific code:
MPI programs run on both distributed and shared memory architectures, while OpenMP runs only on shared memory.
The abstraction level is higher in OpenMP.
MPI is particularly adaptable to coarse-grain parallelism; OpenMP is suitable for both coarse- and fine-grain parallelism.
While it is easy to obtain a parallel version of a sequential code in OpenMP, it usually requires a higher level of expertise in the case of MPI. A parallel code can be written in OpenMP just by inserting proper directives in a sequential code, preserving its semantics, while in MPI major changes are usually needed.
Portability is higher in MPI.
MPI-OpenMP hybridization (1/2)
Hybrid codes match the architecture of SMP clusters.
SMP clusters are an increasingly popular platform.
MPI may suffer from efficiency problems in shared memory architectures.
MPI codes may need too much memory.
Some vendors have attempted to extend MPI in shared memory, but the result is not as efficient as OpenMP.
OpenMP is easy to use, but it is limited to the shared memory architecture.
MPI-OpenMP hybridization (2/2)
A hybrid code may provide better scalability, or simply enable a problem to exploit more processors.
It is not necessarily faster than pure MPI/OpenMP: it depends on the code, the architecture, and how the programming models interact.
MPI code

      program black_hole_mpi
      include 'mpif.h'
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c     Data input
c     Initialize best solution
      do k = myid, m**5 - 1, numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
      call MPI_FINALIZE( ierr )
      end
MPI code

(the same black_hole_mpi code as above, repeated on the slide)
OpenMP code

      program black_hole_omp
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = 0, m**5 - 1
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
!$OMP END PARALLEL DO
      end
OpenMP code

(the same black_hole_omp code as above, repeated on the slide)
OpenMP code

(the same black_hole_omp code as above, repeated on the slide)
Hybrid MPI - OpenMP code

      program black_hole_mpi_omp
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = myid, m**5 - 1, mpi_numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
c     Reduce the OpenMP best solution
!$OMP END PARALLEL DO
c     Reduce the MPI best solution
      call MPI_FINALIZE(ierr)
      end
Hybrid MPI - OpenMP code

(the same black_hole_mpi_omp code as above, repeated on the slide)
The SGI Origin 3000 Architecture (1/2): jen50.ciemat.es

160 MIPS R14000 processors at 600 MHz, on 40 nodes with 4 processors each
Data and instruction cache on-chip
Irix Operating System
Hypercube network
The SGI Origin 3000 Architecture (2/2)
cc-NUMA memory architecture
1 Gflop/s peak speed
8 MB external cache
1 GB main memory per processor
1 TB hard disk
Computational results (m=4)

As we do not have exclusive-mode access to the architecture, the times correspond to the minimum time from 5 different executions.
The figures for the mixed-mode code (label MPI-OpenMP) correspond to the minimum times obtained for different combinations of MPI processes / OpenMP threads.

Procs   Time (secs.)                   Speedup
        MPI      OMP      MPI-OMP     MPI     OMP     MPI-OMP
2       1674.9   1670.5   1671.6      2.0     2.0     2.0
4       951.1    930.6    847.4       3.4     3.5     3.9
8       496.1    485.7    481.9       6.6     6.7     6.8
16      257.0    257.2    255.3       12.7    12.7    12.8
32      133.9    134.0    133.7       24.4    24.4    24.4
Parallel execution time (m=4)

Figure: execution time (secs.) of the MPI, OpenMP and MPI-OpenMP versions vs. number of processors (up to 32).
Speedup (m=4)

Figure: speedup of the MPI, OpenMP and MPI-OpenMP versions vs. number of processors (up to 32).
Results for mixed MPI-OpenMP

Figure: execution time (sec.) of the mixed version for different combinations of MPI processes (2, 4, 8, 16, 32) and OpenMP threads (1, 2, 4, 8, 16, 32).
Computational results (m=5)

Procs   Time (secs.)                   Speedup
        MPI      OMP      MPI-OMP     MPI     OMP     MPI-OMP
2       4957.0   5085.9   4973.1      2.0     1.9     2.0
4       2504.9   2614.5   2513.0      3.9     3.8     3.9
8       1261.2   1372.4   1265.1      7.8     7.2     7.8
16      640.3    755.8    642.9       15.4    13.1    15.4
32      338.1    400.1    339.2       29.2    24.7    29.1
Parallel execution time (m=5)

Figure: execution time (secs.) of the MPI, OpenMP and MPI-OpenMP versions vs. number of processors (up to 32).
Speedup (m=5)

Figure: speedup of the MPI, OpenMP and MPI-OpenMP versions vs. number of processors (up to 32).
Results for mixed MPI-OpenMP (m=5)

Figure: execution time (secs.) of the mixed version for different combinations of MPI processes (2, 4, 8, 16, 32) and OpenMP threads (1, 2, 4, 8, 16, 32).
Conclusions (1/3)
The computational results obtained from all the parallel versions confirm the robustness of the method.
For the case of non-expert users and the kind of codes we have been dealing with, we believe that the MPI parallel versions are easier and safer.
In the case of OpenMP, the proper usage of data scope attributes for the variables involved may be a handicap for users without parallel programming expertise.
The higher current portability of the MPI version is another factor to be considered.
Conclusions (2/3)
The mixed MPI/OpenMP parallel version is the most expertise-demanding.
Nevertheless, as has been stated by several authors, and even in the case of a hybrid architecture, this version does not offer any clear advantage, and it has the disadvantage that the combination of processes/threads has to be tuned.
Conclusions (3/3)
We have completed a first step of cooperation in applying HPC techniques to improve performance in astrophysics codes.
The scientific aim of applying HPC to computationally-intensive codes in astrophysics has been successfully achieved.
Conclusions and Future Work
The relevance of our results does not come directly from the particular application chosen, but from showing that parallel computing techniques are the key to tackling large-scale real problems in the mentioned scientific field.
From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to some other astrophysical challenge problems.
Supercomputing Centers
http://www.ciemat.es/
http://www.cepba.upc.es/
http://www.epcc.ed.ac.uk/
MPI links
Message Passing Interface Forum: http://www.mpi-forum.org/
MPI: The Complete Reference: http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
Parallel Programming with MPI, by Peter Pacheco: http://www.cs.usfca.edu/mpi/
OpenMP links
http://www.openmp.org/
http://www.compunity.org/
Introduction to High Performance Computing

Francisco Almeida and Francisco de Sande
(falmeida, fsande)@ull.es
Departamento de Estadística, I.O. y Computación
Universidad de La Laguna

La Laguna, February 12, 2004

Thank you!