Computing ROC with PyTorch and CUDA

I previously implemented calling CUDA from MXNet and used it to plot an ROC curve; now I am trying the same thing with PyTorch. The process feels more involved and the code is longer, so this post summarizes it for future reference when writing PyTorch-CUDA interfaces, and also records the bugs I ran into along the way.

Step 1: write the CUDA kernel

The code is as follows; the main body is the same as in the previous blog post.

#ifdef __cplusplus
extern "C" {
#endif

#include <stdio.h>
#include <float.h>
#include "roc_kernel.h"
#define NUM 1024

__global__ void roc_kernel(const float *score, const int q_num, const int g_num, float *histo1, float *histo2, const float *labels_q, const float *labels_g){
    __shared__ float temp1[NUM];
    __shared__ float temp2[NUM];

    //initialize
    temp1[threadIdx.x]=0;
    temp2[threadIdx.x]=0;
    //syncthreads
    __syncthreads();

    int j = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    //grid-stride loop over all q_num*g_num score pairs
    while(j<q_num*g_num){
        //map a score in [-1, 1] to a bin index in [0, NUM-1]
        atomicAdd(&(temp1[(int)(1.0*(score[j]+1)/2*(NUM-1))]), 1.0f);
        //only count the pair in temp2 when the query and gallery labels match
        atomicAdd(&(temp2[(int)(1.0*(score[j]+1)/2*(NUM-1))]), labels_q[j/g_num]==labels_g[j%g_num]);
        j += stride;
    }

    __syncthreads();

    //flush the per-block histograms into global memory (relies on blockDim.x == NUM)
    atomicAdd(&(histo1[threadIdx.x]), temp1[threadIdx.x]);
    atomicAdd(&(histo2[threadIdx.x]), temp2[threadIdx.x]);

}

void _roc(const float *score, const int q_num, const int g_num, float *histo1, float *histo2, const float *labels_q, const float *labels_g){
    roc_kernel<<<1024, 1024>>>(score, q_num, g_num, histo1, histo2, labels_q, labels_g);
}


#ifdef __cplusplus
}
#endif


This kind of kernel has come up before; the main thing is to keep the arithmetic straight. I pass in more parameters here than strictly necessary. The difference from the earlier code is that histo1, histo2, labels_q and labels_g used to be int *, whereas now they are float *. This was the source of one bug: when I first left them as int *, it errored out with a message that seemed to say the THCudaTensor type needs to be float.
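
As a side note on that arithmetic: the bin index in the kernel maps a cosine score in [-1, 1] linearly onto [0, NUM-1]. A rough numpy equivalent of that one expression (the helper name is mine, purely for illustration):

import numpy as np

NUM = 1024

def bin_index(score):
    # mirrors (int)(1.0*(score+1)/2*(NUM-1)) in roc_kernel.cu:
    # a score of -1 lands in bin 0, a score of 1 lands in bin NUM-1
    return ((score + 1.0) / 2.0 * (NUM - 1)).astype(np.int32)

# bin_index(np.array([-1.0, 0.0, 1.0])) -> [0, 511, 1023]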

The void _roc in the code above is there so that external code can call the kernel conveniently. The header file "roc_kernel.h" contains:

#ifndef _ROC_KERNEL
#define _ROC_KERNEL

#ifdef __cplusplus
extern "C" {
#endif

void _roc(const float *score, const int q_num, const int g_num, float *histo1, float *histo2, const float *labels_q, const float *labels_g);

#ifdef __cplusplus
}
#endif


#endif


Once these two files are written, compile them with nvcc into an object file:

nvcc -c -o roc_kernel roc_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_61

The last flag specifies the compute capability, which differs from GPU to GPU.
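
If you are not sure which value to use, one way to check (assuming PyTorch can already see the GPU) is:

import torch

# prints a tuple such as (6, 1), which corresponds to -arch=sm_61
print(torch.cuda.get_device_capability(0))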

Write the glue code between CUDA and PyTorch

There is no such step in MXNet; presumably MXNet already takes care of it internally.

#include <THC/THC.h>
#include <TH/TH.h>
#include <stdio.h>

#include "cuda/roc_kernel.h"


extern THCState *state;

void gpu_roc(THCudaTensor *score, THCudaTensor *histo1, THCudaTensor *histo2, THCudaTensor *labels_q, THCudaTensor *labels_g){
    THArgCheck(THCudaTensor_isContiguous(state, score), 0, "must be contiguous");
    THArgCheck(THCudaTensor_isContiguous(state, histo1), 1, "must be contiguous");
    THArgCheck(THCudaTensor_isContiguous(state, histo2), 2, "must be contiguous");
    THArgCheck(THCudaTensor_isContiguous(state, labels_q), 3, "must be contiguous");
    THArgCheck(THCudaTensor_isContiguous(state, labels_g), 4, "must be contiguous");

    const int q_num = THCudaTensor_size(state, score, 0); 
    const int g_num = THCudaTensor_size(state, score, 1); 

    printf("nums of query: %d\n", q_num);
    printf("nums of gallery: %d\n", g_num);


    _roc(THCudaTensor_data(state, score), q_num, g_num,
        THCudaTensor_data(state, histo1),
        THCudaTensor_data(state, histo2),
        THCudaTensor_data(state, labels_q),
        THCudaTensor_data(state, labels_g));



}

When passing the arguments at the end, THCudaTensor_data(state, score) can also first be assigned to a float * and then passed in, i.e. float *score_data = THCudaTensor_data(state, score); and then pass score_data.

The corresponding header file roc_cuda.h contains just one line:

void gpu_roc(THCudaTensor *score, THCudaTensor *histo1, THCudaTensor *histo2, THCudaTensor *labels_q, THCudaTensor *labels_g);

build

Write a build.py containing:

import os
import torch
from torch.utils.ffi import create_extension

if torch.cuda.is_available():
    sources = ['src/roc_cuda.c']
    headers = ['src/roc_cuda.h']
    defines = [('WITH_CUDA', None)]

    with_cuda = True

this_file = os.path.dirname(os.path.realpath(__file__))

print(this_file)
extra_objects = ['src/cuda/roc_kernel']  # the object file produced by the nvcc command above
extra_objects = [os.path.join(this_file, fname) for fname in extra_objects]

ffi = create_extension(
    '_ext.roc',
    headers = headers,
    sources = sources,
    define_macros = defines,
    relative_to = __file__,
    with_cuda=with_cuda,
    extra_objects=extra_objects
    )   

if __name__ == "__main__":
    ffi.build()


At this point the directory layout looks like this:

.
├── build.py
├── roc11.py
└── src
    ├── cuda
    │   ├── do.sh
    │   ├── roc_kernel
    │   ├── roc_kernel.cu
    │   └── roc_kernel.h
    ├── roc_cuda.c
    └── roc_cuda.h


Then run python build.py, which produces:

/mnt/data1/cuda_roc_pytorch
generating /tmp/tmp2bcmkyqk/_roc.c
setting the current directory to '/tmp/tmp2bcmkyqk'
running build_ext
building '_roc' extension
creating mnt
creating mnt/data1
creating mnt/data1/cuda_roc_pytorch
creating mnt/data1/cuda_roc_pytorch/src
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -m64 -fPIC -fPIC -DWITH_CUDA -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/pengkun/anaconda3/envs/py35_tf19/include/python3.5m -c _roc.c -o ./_roc.o
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -m64 -fPIC -fPIC -DWITH_CUDA -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/pengkun/anaconda3/envs/py35_tf19/lib/python3.5/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/pengkun/anaconda3/envs/py35_tf19/include/python3.5m -c /mnt/data1/cuda_roc_pytorch/src/roc_cuda.c -o ./mnt/data1/cuda_roc_pytorch/src/roc_cuda.o
gcc -pthread -shared -L/home/pengkun/anaconda3/envs/py35_tf19/lib -Wl,-rpath=/home/pengkun/anaconda3/envs/py35_tf19/lib,--no-as-needed ./_roc.o ./mnt/data1/cuda_roc_pytorch/src/roc_cuda.o /mnt/data1/cuda_roc_pytorch/src/cuda/roc_kernel -L/home/pengkun/anaconda3/envs/py35_tf19/lib -lpython3.5m -o ./_roc.so

The directory layout now becomes:

.
├── build.py
├── _ext
│   ├── __init__.py
│   └── roc
│       ├── __init__.py
│       └── _roc.so
├── roc11.py
└── src
    ├── cuda
    │   ├── do.sh
    │   ├── roc_kernel
    │   ├── roc_kernel.cu
    │   └── roc_kernel.h
    ├── roc_cuda.c
    └── roc_cuda.h
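
Before moving on to the full test, a quick sanity check that the extension built and is importable (my own check, run from the project root, not part of the original workflow):

from _ext import roc

# the function declared in roc_cuda.h should be exposed as an attribute
print(roc.gpu_roc)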



Test

The test script is roc11.py:

# ===========================================
# --coding:UTF-8 --
# file: roc11.py
# author: ZhengPengkun
# date: 2018-12-22
# email: pkzhengmath@pku.edu.cn
# description: 
# ===========================================
import torch
import numpy as np
import time
import datetime
import matplotlib.pyplot as plt 
import os, sys 
from _ext import roc 
#plt.figure(figsize=(100,80), dpi=100)

num = 1024  # histogram size; must match NUM in roc_kernel.cu
def run(file1, file2):
    query = np.load(file1)
    gallery = np.load(file2)
    label_q = query["label"].astype(np.float32)
    label_g = gallery["label"].astype(np.float32)
    feature_q = query["feature"].astype(np.float32)
    feature_g = gallery["feature"].astype(np.float32)
    q_size, dim1 = feature_q.shape
    g_size, dim2 = feature_g.shape
    assert dim1 == dim2
    feature_q = torch.from_numpy(feature_q).cuda()
    feature_g = torch.from_numpy(feature_g).cuda()
    score = torch.matmul(feature_q, torch.t(feature_g))
    label_q = torch.from_numpy(label_q).contiguous().cuda()
    label_g = torch.from_numpy(label_g).contiguous().cuda()
    histo1 = torch.zeros((num,)).contiguous().cuda() 
    histo2 = torch.zeros((num,)).contiguous().cuda()
    roc.gpu_roc(score, histo1, histo2, label_q, label_g) 
    histo1 = histo1.cpu().numpy()
    histo2 = histo2.cpu().numpy()
    num_real_true = histo2.sum()
    num_real_false = q_size*g_size-num_real_true
    print("num_real_true, false", num_real_true, num_real_false)
    print(time.time()-start) 
    histo1 = np.cumsum(histo1[::-1])
    histo2 = np.cumsum(histo2[::-1])
    tpr = 1.0*histo2/num_real_true
    fpr = 1.0*(histo1-histo2)/num_real_false
    plt.plot(fpr, tpr, 'b')  
    plt.xlim([1e-8,1.01])
    plt.ylim([-0.01, 1.01])
    plt.xscale("log")
    plt.xlabel("fpr")
    plt.ylabel("tpr")
    plt.grid(True)
    plt.savefig(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")+".png")
    #plt.show()
if __name__ == "__main__":
    start = time.time()
    run(sys.argv[1], sys.argv[2])
                                             

There was a bug here that went unnoticed for a long time: arrays loaded with numpy come back as float64, while the extension expects float32. The errors kept complaining about something like CUDA out-of-bounds access, and I checked the code several times without finding any out-of-bounds indexing. Only today did it occur to me that a dtype mismatch might be the cause, and that turned out to be exactly the problem. Images read with cv2 are numpy arrays too, but they cannot be passed in directly either, for the same reason.
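
A cheap way to guard against this kind of mismatch (a small helper of my own, not in the original script) is to normalize the dtype before handing anything to the extension:

import numpy as np
import torch

def to_cuda_float(arr):
    # the extension expects contiguous float32 CUDA tensors;
    # numpy often gives float64, and cv2 images come back as uint8
    if arr.dtype != np.float32:
        arr = arr.astype(np.float32)
    return torch.from_numpy(arr).contiguous().cuda()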

The commented-out #plt.figure(...) line gives a larger figure, but it slows things down noticeably, by several seconds rather than a trivial amount.
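
As a side note on the two np.cumsum lines in roc11.py: reversing the histograms and accumulating gives, for each score threshold, how many pairs scored at or above it, which is exactly what tpr and fpr need. A minimal standalone sketch with toy numbers (three bins instead of 1024):

import numpy as np

histo_all  = np.array([5., 3., 2.])    # all pairs per score bin (low -> high)
histo_true = np.array([1., 2., 2.])    # same-label pairs per score bin
cum_all  = np.cumsum(histo_all[::-1])  # pairs scoring >= each threshold
cum_true = np.cumsum(histo_true[::-1])
tpr = cum_true / histo_true.sum()
fpr = (cum_all - cum_true) / (histo_all.sum() - histo_true.sum())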

The fpr and tpr computation could also be done inside roc_cuda.c, but numpy's np.cumsum is efficient and convenient enough. The final test output:

num_real_true, false 10934.0 212054770.0
2.1989805698394775

In the previous blog post I saved the score matrix to disk first and then computed from it; this time the score matrix is computed inside the script as well. The result is about the same as the MXNet one.

Other notes

Everything above is written against pytorch 0.4.0; I have not tried the latest 1.0.0 yet, which is the next goal. The same workflow can of course also be adapted to PyTorch's C++ extension mechanism.
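
For the record, an untested sketch of what that newer route might look like with torch.utils.cpp_extension; the binding file roc_binding.cpp is hypothetical and would need a PYBIND11_MODULE wrapper around the kernel launcher, so treat this purely as a direction rather than working code:

from torch.utils.cpp_extension import load

# compiles and loads the extension on the fly using the system nvcc
roc_ext = load(name="roc_ext",
               sources=["src/roc_binding.cpp", "src/cuda/roc_kernel.cu"],
               verbose=True)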
