CUDA notes¶
Tutorials¶
- An Even Easier Introduction to CUDA
- https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/
- https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
- http://docs.nvidia.com/cuda/cusolver/index.html
Error Handling¶
unified memory profiling failed¶
See http://blog.csdn.net/u010837794/article/details/64443679. The workaround is to turn off unified memory profiling:
nvprof --unified-memory-profiling off ./add_cuda
Strange Output¶
Pay attention to the size of the vectors when calling cudaMemcpy() (the count argument is in bytes, not elements), and don't mix up the order of the dimensions.
cudaMemcpy fails¶
Check the order of the destination and source arguments: cudaMemcpy(dst, src, count, kind) takes the destination first.
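Both pitfalls above (size and argument order) can be seen in one round trip. This is a minimal sketch, not the code from these notes: the function name `roundtrip_copy` and the test data are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies n doubles to the device and back.
// Returns true if every CUDA call succeeded and the data survived intact.
bool roundtrip_copy(int n) {
    std::vector<double> h_in(n), h_out(n, 0.0);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5 * i;

    // The count is in bytes, so multiply the element count by sizeof(double).
    const size_t bytes = static_cast<size_t>(n) * sizeof(double);

    double *d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return false;

    // Destination first, source second, then the direction flag.
    cudaError_t err = cudaMemcpy(d_buf, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));

    cudaFree(d_buf);
    return err == cudaSuccess && h_in == h_out;
}
```

Calling `roundtrip_copy(1024)` from `main` should return true on a working device. Passing `n` instead of `bytes` would copy only part of the data, which is one way to get the strange output described above.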
cudaDeviceReset¶
Think twice before adding the following call; it destroys all allocations and state of the current device in the current process.
cudaDeviceReset();
free(): invalid next size (fast/normal)¶
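This usually means the heap was corrupted, for example by writing past the end of a malloc'd buffer; see http://blog.sina.com.cn/s/blog_77f1e27f01019qq9.html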
Error [an illegal memory access was encountered]¶
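Most likely an array index is out of bounds. Also be careful with pointers to a single double value.
DO NOT leave the pointer uninitialized:
double *pone;
and DO NOT point it at a local variable:
double one;
double *pone = &one;
Allocate the value instead:
double *pone = (double*)malloc(sizeof(double));
*pone = 1.0;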
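One common way to hit this error is to hand a kernel a pointer that the device cannot access. The sketch below assumes the scalar is dereferenced inside a kernel launched from the host; the kernel `scale` and the wrapper `scale_first` are hypothetical names, not taken from these notes.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reads a scalar through a pointer. The pointer must refer to memory the
// device can access, and the bounds check keeps threads inside the array.
__global__ void scale(double *x, const double *factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= *factor;
}

// Hypothetical wrapper: scales n ones by `factor` on the GPU and returns
// the first element, or -1.0 if any CUDA call fails. Assumes n > 0.
double scale_first(double factor, int n) {
    std::vector<double> h_x(n, 1.0);
    const size_t bytes = static_cast<size_t>(n) * sizeof(double);
    double *d_x = nullptr, *d_factor = nullptr;
    double result = -1.0;

    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_factor, sizeof(double)) != cudaSuccess) {
        cudaFree(d_x);
        return -1.0;
    }
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);
    // Passing &factor (a host stack address) straight to the kernel is the
    // kind of pointer that triggers the illegal memory access; copy it first.
    cudaMemcpy(d_factor, &factor, sizeof(double), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, d_factor, n);

    // Errors raised while the kernel runs surface at the next sync point.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&result, d_x, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_factor);
    return err == cudaSuccess ? result : -1.0;
}
```

Removing the `if (i < n)` check reproduces the out-of-bounds case whenever `n` is not a multiple of the block size.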
Question 1¶
- Repeated runs give different results
- About the two for loops
help¶
- Use of cudamalloc(). Why the double pointer? - Stack Overflow
- Element-by-element vector multiplication with CUDA
- Is there a cuda function to copy a row from a Matrix in column major?
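- “invalid configuration argument” error for the call of CUDA kernel? Each block dimension has its own limit (512 threads in x and y and 64 in z on the hardware described), but the product of the three must also stay within the per-block maximum (768 there, 1024 on some hardware). http://blog.csdn.net/dcrmg/article/details/54850766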
- Incomplete output from printf() called on device
- Understanding __threadfence in CUDA
- Call cublas API from kernel
- CUDA Memory Hierarchy
- In a CUDA kernel, how do I store an array in “local thread memory”?
- cublas handle reuse
- Does __syncthreads() synchronize all threads in the grid?
- How to call device function in CUDA with fewer threads
- CUDA threads for inner loop
- CUDA printf() crashes when large number of threads and blocks are launched
- Intro to image processing with CUDA
- CUDA parallelizing a nested for loop
- CUDA kernel - nested for loop
- For nested loops with CUDA
- Converting “for” loops into cuda parallelized code
- What does “Misaligned address error” mean?
- memory allocation inside a CUDA kernel
- NCSA GPU programming tutorial
- Complicated for loop to be ported to a CUDA kernel
- CUDA error message : unspecified launch failure
- Extracting matrix columns with CUDA?
- Converting C/C++ for loops into CUDA
- GPU study notes series
- What is multithreading good for?
- Common CUDA questions and answers
- cudaStreamSynchronize vs CudaDeviceSynchronize vs cudaThreadSynchronize
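For the “invalid configuration argument” entry in the list above, the limits differ between devices, so it is safer to query them than to hard-code them. A minimal sketch, assuming device 0 by default; `block_fits` is a hypothetical helper name.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if `block` respects both the
// per-dimension limits and the total threads-per-block limit of device `dev`.
bool block_fits(dim3 block, int dev = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;

    // Use a wide type so the product cannot overflow.
    const unsigned long long total = 1ULL * block.x * block.y * block.z;

    return total > 0 &&
           block.x <= static_cast<unsigned>(prop.maxThreadsDim[0]) &&
           block.y <= static_cast<unsigned>(prop.maxThreadsDim[1]) &&
           block.z <= static_cast<unsigned>(prop.maxThreadsDim[2]) &&
           total <= static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}
```

Checking `block_fits(dim3(32, 32, 2))` before a launch shows the point of the note: each dimension may be within its own limit while the product (2048 threads) is still too large for the block.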
Use GSL in Nsight Eclipse¶
How to use the GNU scientific library (gsl) in nvidia Nsight eclipse
Getting started with parallel MCMC¶
Getting started with parallel MCMC
Multiple definitions Error¶
Some similar problems and explanations:
- multiple definition error c++
- Multiple c++ files causes “multiple definition” error?
- getting “multiple definition” errors with simple device function in CUDA C
- CUDA multiple definition error during linking
First Try: separate declarations and implementations¶
According to Separate Compilation and Linking of CUDA C++ Device Code, it is reasonable to split device code into a header file that contains only declarations and a separate implementation file.
Templates, however, cannot be split this way; see How to define a template class in a .h file and implement it in a .cpp file and Why can’t templates be within extern “C” blocks?
Second Try: add extern "C"¶
A reference about extern "C": extern “C” {} in C++ projects
Several functions share the same name but take different parameter lists, so the compiler reports
more than one instance of overloaded function "gauss1_pdf" has "C" linkage
In short, overloading is a C++ feature, and functions with C linkage cannot be overloaded; see More than one instance overloaded function has C linkage.
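The conflict can be sketched as follows. Only the name `gauss1_pdf` comes from the error message; the parameter lists, the function bodies, and the wrapper names `gauss1_pdf_std` and `gauss1_pdf_scaled` are assumptions made for illustration.

```cuda
#include <cmath>

// C++ linkage: both overloads can coexist because the compiler mangles
// each signature into a distinct symbol.
__host__ __device__ double gauss1_pdf(double x) {
    return exp(-0.5 * x * x) / sqrt(2.0 * 3.141592653589793);
}
__host__ __device__ double gauss1_pdf(double x, double mu, double sigma) {
    return gauss1_pdf((x - mu) / sigma) / sigma;
}

// C linkage: symbols are not mangled, so every function needs its own name.
// Declaring both overloads above inside extern "C" would produce the error
// quoted in the notes; distinct wrapper names avoid it.
extern "C" double gauss1_pdf_std(double x) {
    return gauss1_pdf(x);
}
extern "C" double gauss1_pdf_scaled(double x, double mu, double sigma) {
    return gauss1_pdf(x, mu, sigma);
}
```

This only shows why extern "C" and overloading do not mix; it does not by itself address the multiple definition error.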
Last Try: add inline¶
Success! Marking the functions defined in the header as inline removes the multiple definition error.
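A minimal sketch of the inline approach, shown as a single file for brevity. The header name `square.cuh` and the functions `square`, `square_kernel` and `square_on_device` are hypothetical, not the code from these notes.

```cuda
#include <cuda_runtime.h>

// --- would live in a shared header, e.g. square.cuh (hypothetical name) ---
// `inline` allows the same definition to appear in every translation unit
// that includes the header, so the linker no longer sees duplicates.
__host__ __device__ inline double square(double x) { return x * x; }

// --- would live in exactly one .cu file ---
__global__ void square_kernel(double *v) { *v = square(*v); }

// Squares `value` on the GPU; returns -1.0 if any CUDA call fails.
double square_on_device(double value) {
    double *d_v = nullptr;
    if (cudaMalloc(&d_v, sizeof(double)) != cudaSuccess) return -1.0;

    cudaError_t err = cudaMemcpy(d_v, &value, sizeof(double), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        square_kernel<<<1, 1>>>(d_v);
        // This blocking copy also reports errors from the kernel launch.
        err = cudaMemcpy(&value, d_v, sizeof(double), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_v);
    return err == cudaSuccess ? value : -1.0;
}
```

Without `inline`, including the header from two .cu files would define `square` twice, which is the situation the linked questions describe.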