
Vector Addition: Hands-on

Class 2
http://fisica.cab.cnea.gov.ar/gpgpu/index.php/en/icnpg/clases

F. D. Colavecchia (in absentia A. B. Kolton)

Questions collected in class 1

(1) What happens if the vectors to add are "very" large?

(2) How do I know which card my job ran on? (DONE)

(3) What arguments can a kernel receive, and what is allowed inside a kernel?

(4) How do I measure the time spent on CPU ↔ GPU transfers?

What happens if the vectors to add are very large?

● Problem 1
● Problem 2

Problem 1

Device 0: "GeForce GT 620M"
  CUDA Driver Version / Runtime Version: 5.0 / 5.0
  CUDA Capability Major/Minor version number: 2.1
  Total amount of global memory: 1024 MBytes (1073479680 bytes)
  ( 2) Multiprocessors x ( 48) CUDA Cores/MP: 96 CUDA Cores
  etc.

    // SUMA-Vectores
    #define N 200000000
    int main()
    {
        ...
        /* device memory allocation */
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, sizeof(float) * N);
        cudaMalloc((void**)&d_B, sizeof(float) * N);
        ...
    }

WHAT IS THE PROBLEM?


● Hint: 1 float = 4 bytes

CPU → MemTotal: 3932884 kB.
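The hint can be turned into a quick back-of-the-envelope check. The sketch below is plain host-side C++ with no CUDA calls, so it also builds outside nvcc. The helper names `bytes_needed` and `fits_in_memory` are made up for this illustration, and the check ignores whatever memory the driver or the display already occupy on the card.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes occupied by `vectors` arrays of `n` floats each.
std::uint64_t bytes_needed(std::uint64_t n, std::uint64_t vectors)
{
    assert(sizeof(float) == 4);  // the hint from the slide: 1 float = 4 bytes
    return n * vectors * sizeof(float);
}

// Hypothetical helper: true if `requested` bytes fit in `capacity` bytes.
bool fits_in_memory(std::uint64_t requested, std::uint64_t capacity)
{
    return requested <= capacity;
}

// Figures quoted on the slides.
const std::uint64_t GT620M_GLOBAL_MEM = 1073479680ULL;      // bytes, device properties
const std::uint64_t HOST_MEM_TOTAL    = 3932884ULL * 1024;  // bytes, /proc/meminfo reports kB
```

With N = 200000000, one vector takes 800000000 bytes, which on paper fits in the 1073479680 bytes of the GT 620M. d_A and d_B together take 1600000000 bytes, which does not, so the second cudaMalloc is the one expected to fail.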

Problem 1

SUMA-Vectores, main.cu

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <cuda.h>
    #include "vector_io.h"
    #include "vector_ops.h"

    #ifndef N
    #define N 1000000000
    #endif

    #ifndef VECES
    #define VECES 10
    #endif

● Experiment with N
● HANDLE_ERROR()
● Device Properties
● /proc/meminfo

Problem 1 → error handling

    #define N 200000000
    int main()
    {
        ...
        /* device memory allocation */
        float *d_A, *d_B;
        cudaError_t error;
        error = cudaMalloc((void**)&d_A, sizeof(float) * N);
        if (error != cudaSuccess) {
            printf("cudaMalloc d_A error %d, line(%d)\n", error, __LINE__);
            exit(EXIT_FAILURE);
        }
        error = cudaMalloc((void**)&d_B, sizeof(float) * N);
        ...
    }

SEE: CUDA Runtime API

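Problem 1 → error handling

    ...
    #define N 200000000
    int main()
    {
        ...
        /* device memory allocation */
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, sizeof(float) * N);
        cudaMalloc((void**)&d_B, sizeof(float) * N);
        checkCUDAError("allocating d_A and d_B");
        ...
    }

    /* reports the last error recorded by the CUDA runtime, if any */
    void checkCUDAError(const char *msg)
    {
        cudaError_t err = cudaGetLastError();
        if (cudaSuccess != err) {
            fprintf(stderr, "Cuda error : %s: %s.\n", msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }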


Problem 1 → error handling

    ...
    #include "curso.h"
    #define N 200000000
    int main()
    {
        ...
        /* device memory allocation */
        float *d_A, *d_B;
        cudaError_t error;
        HANDLE_ERROR(cudaMalloc((void**)&d_A, sizeof(float) * N));
        HANDLE_ERROR(cudaMalloc((void**)&d_B, sizeof(float) * N));
        ...
        HANDLE_ERROR(cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(d_B, h_B, sizeof(float) * N, cudaMemcpyHostToDevice));
        ...
    }

HANDLE_ERROR (from "CUDA by Example") → MACRO: the preprocessor replaces it with a code fragment.
http://gcc.gnu.org/onlinedocs/cpp/index.html#Top
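To make the macro idea concrete, here is a minimal sketch of how an error-checking macro of this kind can be built. It is not the contents of curso.h or of the book's header: `FakeStatus`, `describe_failure`, `handle_status` and `CHECK_STATUS` are hypothetical names, and a small enum stands in for cudaError_t so that the example compiles without CUDA.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for cudaError_t, so the sketch builds without CUDA.
enum FakeStatus { FAKE_SUCCESS = 0, FAKE_OUT_OF_MEMORY = 2 };

// Builds the diagnostic text; kept separate so it can be inspected on its own.
std::string describe_failure(FakeStatus status, const char *file, int line)
{
    assert(file != nullptr);
    return std::string("error ") + std::to_string(static_cast<int>(status)) +
           " in " + file + " at line " + std::to_string(line);
}

// Does nothing on success; on failure reports where it happened and exits.
void handle_status(FakeStatus status, const char *file, int line)
{
    if (status != FAKE_SUCCESS) {
        std::fprintf(stderr, "%s\n", describe_failure(status, file, line).c_str());
        std::exit(EXIT_FAILURE);
    }
}

// The preprocessor pastes in the file name and line number of each call site.
#define CHECK_STATUS(call) (handle_status((call), __FILE__, __LINE__))
```

Because the substitution happens before compilation, `__FILE__` and `__LINE__` refer to the place where `CHECK_STATUS(...)` is written, which a plain function call could not know on its own.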

Problem 1 → error handling

    ...
    #include <helper_cuda.h>
    #define N 200000000
    int main()
    {
        ...
        /* device memory allocation */
        float *d_A, *d_B;
        checkCudaErrors(cudaMalloc((void**)&d_A, sizeof(float) * N));
        checkCudaErrors(cudaMalloc((void**)&d_B, sizeof(float) * N));
        ...
        checkCudaErrors(cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice));
        checkCudaErrors(cudaMemcpy(d_B, h_B, sizeof(float) * N, cudaMemcpyHostToDevice));
        ...
    }


Problem 2

    #define dim 40000000
    /* Vector addition. The result is left in the first argument */
    int vector_ops_suma_par(float *v1, float *v2)
    {
        dim3 nThreads(512);
        //dim3 nBlocks((dim / nThreads.x) + (dim % nThreads.x ? 1 : 0)); // alternative
        dim3 nBlocks((dim + nThreads.x - 1) / nThreads.x);

        kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
        …
    }

Device 0: "GeForce GT 620M"
  Total amount of global memory: 1024 MBytes (1073479680 bytes)
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

WHAT IS THE PROBLEM?


Hint: 1 float = 4 bytes
Hint: nBlocks = ?

It would show up before Problem 1!
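The second hint can be worked out with the same ceiling division the code uses for nBlocks. This is a host-side sketch only: the helper names `blocks_needed` and `grid_is_launchable` are hypothetical, and the limit is the value printed in the device properties above for the x dimension of a grid.

```cpp
#include <cassert>

// Limit from the device properties above for the x dimension of a grid.
const unsigned int MAX_GRID_DIM_X = 65535;

// Ceiling division: smallest number of blocks whose threads cover `n` elements.
unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: can a 1D grid with that many blocks be launched at all?
bool grid_is_launchable(unsigned int blocks)
{
    return blocks >= 1 && blocks <= MAX_GRID_DIM_X;
}
```

For dim = 40000000 and 512 threads per block this gives 78125 blocks, above the 65535 limit, while the two vectors need only 320000000 bytes, well within the card's global memory. That is why this failure would appear at a smaller size than the one in Problem 1.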

Problem 2 → gridDim

    #define dim 40000000
    ...
    /* Vector addition. The result is left in the first argument */
    int vector_ops_suma_par(float *v1, float *v2)
    {
        dim3 nThreads(512);
        dim3 nBlocks(512);
        kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
        checkCUDAError("kernel_suma launch");
        …
    }

    /* adds each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        /* each thread handles several elements, one grid-width apart */
        while (id < dim) {
            v1[id] = v1[id] + v2[id];
            id += blockDim.x * gridDim.x;
        }
    }

Problem 2 → gridDim

Let stride = blockDim.x * gridDim.x, the total number of threads in the grid.

Thread 0 computes:
    v1[0] = v1[0] + v2[0];
    v1[stride] = v1[stride] + v2[stride];             (if stride < dim)
    v1[2*stride] = v1[2*stride] + v2[2*stride];       (if 2*stride < dim)

Thread id computes:
    v1[id] = v1[id] + v2[id];
    v1[id+stride] = v1[id+stride] + v2[id+stride];         (if id+stride < dim)
    v1[id+2*stride] = v1[id+2*stride] + v2[id+2*stride];   (if id+2*stride < dim)

Problem 2 → gridDim

[Figure: the DATA array, of length dim, drawn against the GRID (blocks of threads), which covers gridDim.x*blockDim.x elements at a time; each thread advances with id += blockDim.x * gridDim.x until it passes dim.]

This serializes each thread's work...
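The access pattern of this loop can be checked on the CPU. The sketch below is an emulation, not GPU code: the function name `emulate_kernel_suma` is made up, and the two outer loops stand in for blockIdx.x and threadIdx.x, which the GPU would run in parallel rather than one after another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of kernel_suma with a fixed-size grid. The outer loops play
// the role of blockIdx.x and threadIdx.x; the inner while is the kernel body.
void emulate_kernel_suma(std::vector<float> &v1, const std::vector<float> &v2,
                         std::size_t grid_dim, std::size_t block_dim)
{
    assert(v1.size() == v2.size());
    assert(grid_dim > 0 && block_dim > 0);
    const std::size_t dim = v1.size();
    const std::size_t stride = block_dim * grid_dim;  // threads in the whole grid

    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            std::size_t id = thread + block * block_dim;
            while (id < dim) {  // same loop as in the kernel
                v1[id] = v1[id] + v2[id];
                id += stride;
            }
        }
    }
}
```

Since every index below dim is reached by exactly one (block, thread) pair, each element is added once no matter how small the grid is compared with dim. The price is that each thread runs several iterations in sequence.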

How do I know which card my job ran on?

    int main()
    {
        cudaDeviceProp deviceProp;
        int dev;
        cudaGetDevice(&dev);                          /* device currently in use */
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);
        ....
    }

CUDA Runtime API

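What kinds of arguments can a kernel receive?

    /* adds each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            v1[id] = v1[id] + v2[id];
        }
    }
    ...
    kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
    ...

● v1, v2: pointers to memory allocated on the device (GPU).
● dim: a host variable. It is copied to the device (constant memory).
● v1[id], v2[id]: dereferencing these pointers is fine inside the kernel; it would be incorrect in a host function.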


    #define dim 10000000
    ...
    /* adds each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            v1[id] = v1[id] + v2[id];
        }
    }
    ...
    kernel_suma<<<nBlocks, nThreads>>>(v1, v2);
    ...

And what if dim is not passed as an argument?

MACRO: the preprocessor replaces it with a code fragment.
http://gcc.gnu.org/onlinedocs/cpp/index.html#Top


    struct punto {
        float a, b;
    };
    ...
    /* adds dim vectors in the plane ... */
    __global__ void kernel_suma(punto *w1, punto *w2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            w1[id].a = w1[id].a + w2[id].a;
            w1[id].b = w1[id].b + w2[id].b;
        }
    }
    ...
    punto *w1, *w2;
    cudaMalloc((void**)&w1, sizeof(punto) * N);
    cudaMalloc((void**)&w2, sizeof(punto) * N);
    ...

The limit on the size of the arguments is 4 KB.


    struct Parametros {
        int dim;
        float numero;
    };
    ...
    /* adds dim vectors in the plane ... */
    __global__ void kernel_suma(punto *w1, punto *w2, Parametros par)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < par.dim) {
            w1[id].a = w1[id].a + w2[id].a;
            w1[id].b = w1[id].b + w2[id].b;
        }
    }
    ...
    Parametros params;
    params.dim = N;
    params.numero = 83.2f;
    kernel_suma<<<nBlocks, nThreads>>>(w1, w2, params);

The limit on the size of the arguments is 4 KB.
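A rough way to see how far this launch is from the 4 KB limit is to add up the sizes of what is passed. This is a host-side sketch under simplifying assumptions: `kernel_suma_arg_bytes` and `remaining_arg_bytes` are hypothetical helpers, the structs are copied from the slides, and any alignment padding between arguments is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Limit quoted on the slide for the total size of kernel arguments.
const std::size_t MAX_KERNEL_ARG_BYTES = 4096;

struct punto {
    float a, b;
};

struct Parametros {
    int dim;
    float numero;
};

// Hypothetical helper: bytes taken by the arguments (punto*, punto*, Parametros),
// ignoring alignment padding between them.
std::size_t kernel_suma_arg_bytes()
{
    return 2 * sizeof(punto *) + sizeof(Parametros);
}

// Hypothetical helper: bytes still available for further by-value arguments.
std::size_t remaining_arg_bytes(std::size_t used)
{
    assert(used <= MAX_KERNEL_ARG_BYTES);
    return MAX_KERNEL_ARG_BYTES - used;
}
```

The pointers cost only a few bytes each regardless of N, because the arrays themselves stay in global memory. What counts against the limit is whatever is passed by value, such as the Parametros struct.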

What is "allowed" inside a kernel?

CUDA C PROGRAMMING GUIDE

Source files compiled with nvcc can contain a mix of HOST and DEVICE code.

● HOST: supports all of standard C++.
● DEVICE: supports part of it (see E.1. Code Samples), with some restrictions (E.2. Restrictions).

cat /usr/local/cuda-5.5/samples/*/*/*.cu | grep -A 5 "__global__" | less

cat /usr/local/cuda-5.5/samples/*/*/*.h | grep -A 5 "__global__" | less

BROWSE THE CUDA SAMPLES

What is "allowed" inside a kernel?

Check the forums. If Mark Harris says it, it is gospel and absolutely clear.

http://stackoverflow.com/questions/8302506/parameters-to-cuda-kernels

http://stackoverflow.com/questions/9309195/copying-a-struct-containing-pointers-to-cuda-device/9323898#9323898


Time spent transferring from CPU to GPU?

    #! /bin/bash
    #
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #
    # request the gpu.q queue
    #$ -q gpu.q
    #
    # request one card
    #$ -l gpu=1
    #
    # run the binary
    /usr/local/cuda-5.5/bin/nvprof ./main

nvprof: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Experiment with SUMA-Vectores:
● Change NVECES
● Change N
● Make the computation more intensive