CUDA notes¶

Tutorials¶

Error Handling¶

unified memory profiling failed¶

Turn off unified memory profiling when running nvprof (see http://blog.csdn.net/u010837794/article/details/64443679):

nvprof --unified-memory-profiling off ./add_cuda

Strange Output¶

Pay attention to the size of the vectors passed to cudaMemcpy(): the count argument is a number of bytes, not a number of elements.

Also, don't mix up the order of the dimensions.
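A minimal sketch of the size point, assuming a plain float vector; the helper name `round_trip` is made up for illustration. The byte count is computed once as `n * sizeof(float)` and reused for the allocation and both copies, which avoids passing an element count by mistake.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies h_in to the device and back into h_out.
// Returns true when every CUDA call succeeded and the data came back intact.
bool round_trip(const std::vector<float> &h_in, std::vector<float> &h_out) {
    const size_t n = h_in.size();
    const size_t bytes = n * sizeof(float);  // cudaMemcpy counts bytes, not elements
    h_out.assign(n, 0.0f);

    float *d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return false;

    bool ok =
        cudaMemcpy(d_buf, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(h_out.data(), d_buf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_buf);
    return ok && h_out == h_in;
}
```

Passing `n` instead of `bytes` would copy only a quarter of the floats, which is one way to end up with output that looks strange rather than failing outright.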

cudaMemcpy fails¶

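Check the order of the destination and source arguments: cudaMemcpy(dst, src, count, kind) takes the destination first, and kind has to match the direction of the copy.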
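One way to make the argument order harder to get wrong is to wrap each direction in a small function and report failures right away. This is only a sketch; `check`, `CUDA_CHECK`, `upload`, and `download` are hypothetical names, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print where a CUDA call failed, then pass the error through.
static cudaError_t check(cudaError_t err, const char *what, const char *file, int line) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                     file, line, what, cudaGetErrorString(err));
    }
    return err;
}
#define CUDA_CHECK(call) check((call), #call, __FILE__, __LINE__)

// Host -> device: destination (device pointer) first, source (host pointer) second.
cudaError_t upload(double *d_dst, const double *h_src, size_t n) {
    return CUDA_CHECK(cudaMemcpy(d_dst, h_src, n * sizeof(double), cudaMemcpyHostToDevice));
}

// Device -> host: destination (host pointer) first, source (device pointer) second.
cudaError_t download(double *h_dst, const double *d_src, size_t n) {
    return CUDA_CHECK(cudaMemcpy(h_dst, d_src, n * sizeof(double), cudaMemcpyDeviceToHost));
}
```

With the direction baked into the function name, a swapped pair of pointers tends to stand out at the call site, and a failing copy prints its error string instead of passing silently.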

cudaDeviceReset¶

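Think twice before adding the following call. It destroys all allocations and resets all state on the current device for the current process, so any device pointer still in use becomes invalid.

cudaDeviceReset();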

free(): invalid next size (fast/normal)¶

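This glibc message usually means the heap has been corrupted, typically by writing past the end of a malloc'd buffer. See http://blog.sina.com.cn/s/blog_77f1e27f01019qq9.html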

Error [an illegal memory access was encountered]¶

Most likely an out-of-bounds array access. Also, be careful with pointers to a double value:

DO NOT

double *pone;

and DO NOT

double one;
double *pone = &one;

YOU MUST

double *pone = (double*)malloc(sizeof(double));
*pone = 1.0;
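The note does not say where `pone` is dereferenced. The sketch below assumes the scalar is read inside a kernel, in which case the pointer handed to the kernel should point to device memory, here obtained with cudaMalloc. The names `scale` and `scale_on_device` are made up for illustration, and `n` is assumed to be positive.

```cuda
#include <cuda_runtime.h>

// Multiplies every element of x by the scalar that `factor` points to.
// `factor` is dereferenced on the device, so it must be a device-accessible pointer.
__global__ void scale(double *x, const double *factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // bounds check: out-of-range threads do nothing
        x[i] *= *factor;
    }
}

// Hypothetical host wrapper: scales h_x in place through the GPU (assumes n > 0).
cudaError_t scale_on_device(double *h_x, int n, double factor) {
    double *d_x = nullptr;
    double *d_factor = nullptr;
    const size_t bytes = n * sizeof(double);

    cudaError_t err = cudaMalloc(&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_factor, sizeof(double));
    if (err == cudaSuccess) err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_factor, &factor, sizeof(double), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_x, d_factor, n);
        err = cudaGetLastError();   // catches launch-configuration errors
    }
    // This copy waits for the kernel, so errors raised while it ran surface here.
    if (err == cudaSuccess) err = cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_factor);
    return err;
}
```

The `if (i < n)` guard covers the out-of-bounds case mentioned above, and the scalar lives in memory the kernel is allowed to read.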

Question 1¶

Repeated runs give different results
About the two for loops 3.

help¶

Use of cudamalloc(). Why the double pointer? - Stack Overflow
Element-by-element vector multiplication with CUDA
Is there a cuda function to copy a row from a Matrix in column major?
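“invalid configuration argument” error for the call of CUDA kernel? Each block dimension has its own maximum (the linked post cites 512 threads for x and y; the z limit is 64), but the product x*y*z is capped separately by the device's maximum threads per block (the post cites 768, with 1024 on some hardware). Exceeding either limit triggers this error. http://blog.csdn.net/dcrmg/article/details/54850766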
Incomplete output from printf() called on device
Understanding __threadfence in CUDA
Call cublas API from kernel
CUDA Memory Hierarchy
In a CUDA kernel, how do I store an array in “local thread memory”?
cublas handle reuse
Does __syncthreads() synchronize all threads in the grid?
How to call device function in CUDA with fewer threads
CUDA threads for inner loop
CUDA printf() crashes when large number of threads and blocks are launched
Intro to image processing with CUDA
CUDA parallelizing a nested for loop
CUDA kernel - nested for loop
For nested loops with CUDA
Converting “for” loops into cuda parallelized code
What does “Misaligned address error” mean?
memory allocation inside a CUDA kernel
NCSA GPU programming tutorial
Complicated for loop to be ported to a CUDA kernel
CUDA error message : unspecified launch failure
Extracting matrix columns with CUDA?
Converting C/C++ for loops into CUDA
GPU study notes series
What is multithreading good for?
CUDA frequently asked questions and answers
cudaStreamSynchronize vs CudaDeviceSynchronize vs cudaThreadSynchronize
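For the “invalid configuration argument” entry in the list above, the limits can be read from the device instead of being remembered. A minimal sketch, assuming device 0 by default; `block_fits` is a hypothetical helper name.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if `block` respects both the per-dimension
// limits and the total threads-per-block limit of device `dev`.
bool block_fits(dim3 block, int dev = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;

    if (block.x == 0 || block.y == 0 || block.z == 0) return false;

    const bool per_dim_ok =
        block.x <= static_cast<unsigned>(prop.maxThreadsDim[0]) &&
        block.y <= static_cast<unsigned>(prop.maxThreadsDim[1]) &&
        block.z <= static_cast<unsigned>(prop.maxThreadsDim[2]);

    // Each dimension can be within its own limit while the product is too large.
    const unsigned long long total =
        static_cast<unsigned long long>(block.x) * block.y * block.z;

    return per_dim_ok &&
           total <= static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}
```

For example, `dim3(16, 16, 1)` is 256 threads and fits, while `dim3(1024, 1024, 1)` may pass the per-dimension check on recent GPUs but fails on the product.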

use gsl in GNU¶

How to use the GNU scientific library (gsl) in nvidia Nsight eclipse

Getting started with parallel MCMC¶

Getting started with parallel MCMC

Multiple definitions Error¶

Some similar problems and explanations:

First Try: separate definition and implementations¶

According to Separate Compilation and Linking of CUDA C++ Device Code, it seems reasonable to split device code that lives in a header together with its implementation into a pure header file (declarations) and a separate implementation file.

Templates, however, cannot be split this way; refer to How to define a template class in a .h file and implement it in a .cpp file and Why can’t templates be within extern “C” blocks?

Second Try: add `extern "C"`¶

A reference about extern "C": extern “C” {} in C++ projects

Several functions share the same name but take different parameter lists, so the compiler reports

more than one instance of overloaded function "gauss1_pdf" has "C" linkage

In short, overloading is a C++ feature and C linkage does not support it; refer to More than one instance overloaded function has C linkage.
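A small sketch of why the error appears. The signatures of `gauss1_pdf` are assumptions (the note does not show them), and the `_std` / `_general` names are hypothetical. With C++ linkage the two overloads are mangled into different symbols; with C linkage there is no mangling, so each function needs its own name.

```cuda
#include <cmath>

// C++ linkage: overloads are fine, the compiler mangles them into distinct symbols.
// Assumed signature: 1-D Gaussian density with mean mu and standard deviation sigma.
double gauss1_pdf(double x, double mu, double sigma) {
    const double sqrt_2pi = 2.5066282746310005;  // sqrt(2 * pi)
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * sqrt_2pi);
}

// Assumed overload: standard normal density.
double gauss1_pdf(double x) {
    return gauss1_pdf(x, 0.0, 1.0);
}

// C linkage: names are not mangled, so two functions cannot share one name.
// Giving each a unique name avoids the "has C linkage" error.
extern "C" double gauss1_pdf_std(double x) {
    return gauss1_pdf(x);
}

extern "C" double gauss1_pdf_general(double x, double mu, double sigma) {
    return gauss1_pdf(x, mu, sigma);
}
```

Putting both `gauss1_pdf` overloads directly inside an `extern "C"` block is what produces the message quoted above.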

Last Try: add `inline`¶

Success!

Refer to C/C++ “inline” keyword in CUDA device-side code
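A sketch of what the header might look like after this change. The header name, the include guard, and the signature of `gauss1_pdf` are assumptions made for illustration. Marking the function `inline` allows the same definition to appear in every translation unit that includes the header without the linker reporting multiple definitions.

```cuda
// gauss.cuh (hypothetical header name), included from several .cu files
#ifndef GAUSS_CUH
#define GAUSS_CUH

#include <cmath>
#include <cuda_runtime.h>

// `inline` permits identical definitions of this function in several
// translation units. Assumed signature: 1-D Gaussian density.
__host__ __device__ inline double gauss1_pdf(double x, double mu, double sigma) {
    const double sqrt_2pi = 2.5066282746310005;  // sqrt(2 * pi)
    const double z = (x - mu) / sigma;
    return exp(-0.5 * z * z) / (sigma * sqrt_2pi);
}

#endif  // GAUSS_CUH
```

Unlike the split into declaration and implementation files, this keeps the definition in the header, so it should also work for templates, which are already exempt from the one-definition restriction in the same way.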