CUDA-1

I plan to document my process of learning CUDA here. I have seen how powerful CUDA can be, and it seems like a very useful tool, so I am taking some time to learn the basics and will then gradually practice through hands-on work.

Hello world!

  • CUDA requires writing kernel functions, and you have to tell the compiler that they run on the device rather than on the host.

Here is an example.

#include <cstdio>

// __global__ marks this function as a kernel that runs on the device.
__global__ void kernel(void){
}

int main(void){
    kernel<<<1,1>>>();           // launch the (empty) kernel on the device
    printf("Hello, world!\n");   // this printf runs on the host
    return 0;
}

There are two things to note here:

  • __global__ tells the compiler that this function is to be run on the device.

  • <<<1,1>>> — the meaning of these launch parameters is left for later, because it requires introducing blocks, dimensions, and so on (a quick preview sketch follows below).
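
The preview, as a minimal sketch (the grid and block sizes here are placeholders, not from the original example): the general launch form is kernel<<<gridDim, blockDim>>>(args), and the two values can also be given as dim3.

#include <cuda_runtime.h>

__global__ void kernel(void){
}

int main(void){
    dim3 grid(1);     // number of blocks in the grid
    dim3 block(1);    // number of threads per block
    kernel<<<grid, block>>>();   // equivalent to kernel<<<1,1>>>() above
    cudaDeviceSynchronize();     // kernel launches are asynchronous; wait here
    return 0;
}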

Next is an example of a kernel function that takes arguments. Also note that a kernel function cannot have a return value.

#include "../common/book.h"

__global__ void add( int a, int b, int *c ) { 
    *c = a + b;
}

int main( void ) { 
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 2, 7, dev_c );

    HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}


Here HANDLE_ERROR is an error-checking helper defined in the header book.h. You can see that the process breaks down into roughly the following steps:

  1. Allocate memory on the device.
  2. Compute on the device.
  3. Copy the result from the device back to the host (and finally free the device memory).

cudaMalloc, cudaMemcpy, and cudaFree take care of allocating, copying, and releasing that memory. The difference from their host counterparts is that cudaMalloc tells the CUDA runtime to allocate the memory on the device: its first argument is a pointer to a pointer that will receive the device address, and its second is the size in bytes. cudaMemcpy copies data between host and device, and its last argument (here cudaMemcpyDeviceToHost) specifies the direction of the copy.
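
book.h comes with the book's sample code and is not shown here. As a rough sketch (not necessarily the book's exact implementation), HANDLE_ERROR can be thought of as a macro that checks the cudaError_t returned by a runtime call and aborts with a readable message on failure:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Rough equivalent of the book's HANDLE_ERROR: check the cudaError_t returned
// by a CUDA runtime call and abort with a readable message if it failed.
#define HANDLE_ERROR(call)                                              \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "%s in %s at line %d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)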

Querying Devices

This part is mainly about querying information about the CUDA devices in the system.

Here is some example code.

#include "../common/book.h"

int main( void ) { 
    cudaDeviceProp  prop;

    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i< count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        printf( "   --- General Information for device %d ---\n", i );
        printf( "Name:  %s\n", prop.name );
        printf( "Compute capability:  %d.%d\n", prop.major, prop.minor );
        printf( "Clock rate:  %d\n", prop.clockRate );
        printf( "Device copy overlap:  " );
        if (prop.deviceOverlap)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n");
        printf( "Kernel execution timeout :  " );
        if (prop.kernelExecTimeoutEnabled)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );

        printf( "   --- Memory Information for device %d ---\n", i );
        printf( "Total global mem:  %ld\n", prop.totalGlobalMem );
        printf( "Total constant Mem:  %ld\n", prop.totalConstMem );
        printf( "Max mem pitch:  %ld\n", prop.memPitch );
        printf( "Texture Alignment:  %ld\n", prop.textureAlignment );

        printf( "   --- MP Information for device %d ---\n", i );
        printf( "Multiprocessor count:  %d\n",
                    prop.multiProcessorCount );
        printf( "Shared mem per mp:  %ld\n", prop.sharedMemPerBlock );
        printf( "Registers per mp:  %d\n", prop.regsPerBlock );
        printf( "Threads in warp:  %d\n", prop.warpSize );
        printf( "Max threads per block:  %d\n",
                    prop.maxThreadsPerBlock );
        printf( "Max thread dimensions:  (%d, %d, %d)\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1],
                    prop.maxThreadsDim[2] );
        printf( "Max grid dimensions:  (%d, %d, %d)\n",
                    prop.maxGridSize[0], prop.maxGridSize[1],
                    prop.maxGridSize[2] );
        printf( "\n" );
    }   
}


Then compile and run it with nvcc -o enum_gpu enum_gpu.cu followed by ./enum_gpu, and the printed information looks like this:

   --- General Information for device 0 ---
Name:  GeForce GTX 1060 6GB
Compute capability:  6.1
Clock rate:  1733500
Device copy overlap:  Enabled
Kernel execution timeout :  Enabled
   --- Memory Information for device 0 ---
Total global mem:  6371475456
Total constant Mem:  65536
Max mem pitch:  2147483647
Texture Alignment:  512
   --- MP Information for device 0 ---
Multiprocessor count:  10
Shared mem per mp:  49152
Registers per mp:  65536
Threads in warp:  32
Max threads per block:  1024
Max thread dimensions:  (1024, 1024, 64)
Max grid dimensions:  (2147483647, 65535, 65535)


You can see that the maximum number of threads per block is 1024, along with the maximum sizes in each of the three thread dimensions and the maximum grid dimensions. These are core CUDA concepts and will be covered in detail later. Of course, these properties are not only for printing out; sometimes you need them to choose a suitable device.
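
To illustrate that last point, here is a minimal sketch (the compute-capability requirement is just a made-up example): fill a cudaDeviceProp with the properties you care about, ask the runtime to pick a matching device with cudaChooseDevice, and then make it current with cudaSetDevice.

#include <cstring>
#include <cuda_runtime.h>

int main(void){
    // Describe only the properties we care about; zero out the rest.
    cudaDeviceProp prop;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 6;    // example requirement: compute capability 6.x or higher
    prop.minor = 0;

    int dev;
    cudaChooseDevice(&dev, &prop);   // let the runtime pick the closest match
    cudaSetDevice(dev);              // use that device for subsequent CUDA calls
    return 0;
}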
