CUDA notes


Error Handling

unified memory profiling failed


nvprof --unified-memory-profiling off ./add_cuda

Strange Output

Pay attention to the size of the vectors passed to cudaMemcpy(): the count argument is in bytes, not in elements.

Also, don’t mix up the order of the dimensions.
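A minimal host-side sketch of both pitfalls, assuming "order of dimensions" refers to how rows and columns map into a flat buffer. The helper names bytes_for, col_major_index and row_major_index are hypothetical and not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte count for n elements of type T.
// cudaMemcpy() takes its count in bytes, so pass bytes_for<double>(n), not n.
template <typename T>
inline std::size_t bytes_for(std::size_t n) {
    return n * sizeof(T);
}

// Hypothetical helper: flat index of (row, col) in a column-major matrix
// with `rows` rows (the layout cuBLAS uses).
inline std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t rows) {
    assert(row < rows);
    return col * rows + row;
}

// Hypothetical helper: flat index of (row, col) in a row-major matrix
// with `cols` columns (the usual C layout).
inline std::size_t row_major_index(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}
```

Swapping the two index helpers, or passing an element count where a byte count is expected, yields output that looks plausible but is wrong.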

cudaMemcpy fails

Check the order of the destination and source arguments: cudaMemcpy() takes the destination first and the source second.
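cudaMemcpy(dst, src, count, kind) uses the same destination-first order as std::memcpy, so a host-only sketch can show the pattern. The function name copy_like_cudaMemcpy is hypothetical; in real CUDA code the kind argument (for example cudaMemcpyHostToDevice) must also match the direction of the copy.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical helper: copies src into a fresh buffer using the
// (destination, source, size in bytes) order that cudaMemcpy() also uses.
inline std::vector<double> copy_like_cudaMemcpy(const std::vector<double>& src) {
    std::vector<double> dst(src.size(), 0.0);
    if (!src.empty()) {
        // Destination first, source second, byte count third.
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(double));
    }
    return dst;
}
```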


think twice before adding the following code


free(): invalid next size (fast/normal)

Error [an illegal memory access was encountered]



double *pone;

and DO NOT use either of these:

double one;
double *pone = &one;

double *pone = (double*)malloc(sizeof(double));
*pone = 1.0;


  1. Results differ between repeated runs
  2. About the two for loops


  1. Use of cudamalloc(). Why the double pointer? - Stack Overflow
  2. Element-by-element vector multiplication with CUDA
  3. Is there a cuda function to copy a row from a Matrix in column major?
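  4. “invalid configuration argument” error for the call of CUDA kernel? Note: the per-dimension limits on a block (512 threads in x and y and 64 in z on older hardware; 1024, 1024 and 64 on compute capability 2.0 and later) are not the only constraint. The product x*y*z must also stay within the per-block thread limit (512 on older hardware, 1024 on compute capability 2.0 and later).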
  5. Incomplete output from printf() called on device
  6. Understanding __threadfence in CUDA
  7. Call cublas API from kernel
  8. CUDA Memory Hierarchy
  9. In a CUDA kernel, how do I store an array in “local thread memory”?
  10. cublas handle reuse
  11. Does __syncthreads() synchronize all threads in the grid?
  12. How to call device function in CUDA with fewer threads
  13. CUDA threads for inner loop
  14. CUDA printf() crashes when large number of threads and blocks are launched
  15. Intro to image processing with CUDA
  16. CUDA parallelizing a nested for loop
  17. CUDA kernel - nested for loop
  18. For nested loops with CUDA
  19. CUDA kernel - nested for loop
  20. Converting “for” loops into cuda parallelized code
  21. What does “Misaligned address error” mean?
  22. memory allocation inside a CUDA kernel
  23. NCSA GPU programming tutorial
  24. Complicated for loop to be ported to a CUDA kernel
  25. CUDA error message : unspecified launch failure
  26. Extracting matrix columns with CUDA?
  27. Converting C/C++ for loops into CUDA
  28. GPU study notes series
  29. What is multithreading good for?
  30. CUDA frequently asked questions and answers
  31. cudaStreamSynchronize vs CudaDeviceSynchronize vs cudaThreadSynchronize
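The block-size constraint noted under item 4 can be checked on the host before a launch. This is only a sketch: the BlockLimits struct and block_fits function are hypothetical, and in real code the limits would come from the maxThreadsDim and maxThreadsPerBlock fields of cudaDeviceProp rather than being hard-coded.

```cpp
#include <cassert>

// Hypothetical struct mirroring the cudaDeviceProp fields that matter here.
struct BlockLimits {
    int max_x, max_y, max_z;    // cf. cudaDeviceProp::maxThreadsDim
    int max_threads_per_block;  // cf. cudaDeviceProp::maxThreadsPerBlock
};

// Returns true if a block of x*y*z threads respects both the per-dimension
// limits and the limit on the total number of threads per block.
inline bool block_fits(int x, int y, int z, const BlockLimits& lim) {
    if (x < 1 || y < 1 || z < 1) return false;
    if (x > lim.max_x || y > lim.max_y || z > lim.max_z) return false;
    // Widen before multiplying so the product cannot overflow int.
    return static_cast<long long>(x) * y * z <= lim.max_threads_per_block;
}
```

With example limits of {1024, 1024, 64, 1024}, a 32x32x1 block passes, while 32x32x2 fails even though each dimension is within its own bound.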

Use GSL (the GNU Scientific Library)

How to use the GNU scientific library (gsl) in nvidia Nsight eclipse

Getting started with parallel MCMC

Getting started with parallel MCMC

Multiple definitions Error

Some similar problems and explanations:

  1. multiple definition error c++
  2. Multiple c++ files causes “multiple definition” error?
  3. getting “multiple definition” errors with simple device function in CUDA C
  4. CUDA multiple definition error during linking

First Try: separate declarations and implementations

According to Separate Compilation and Linking of CUDA C++ Device Code, it seems reasonable to split a device-code header that also contains implementations into a pure header (declarations only) and a separate implementation file.

But templates cannot be split this way; refer to How to define a template class in a .h file and implement it in a .cpp file and Why can’t templates be within extern “C” blocks?

Second Try: add extern "C"

A reference about extern "C": extern “C” {} in C++ projects

Several functions share the same name with different parameter lists, so the compiler reports

more than one instance of overloaded function "gauss1_pdf" has "C" linkage

In short, overloading is a C++ feature that C linkage cannot express; refer to More than one instance overloaded function has C linkage.
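A small sketch of why the error appears and one way around it. The parameter lists and the Gaussian formula for gauss1_pdf are assumptions, since the notes only give the function name, and the wrapper names gauss1_pdf_std and gauss1_pdf_full are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// The overloads keep C++ linkage, where name mangling tells them apart.
// Assumed signature: normal density with mean mu and standard deviation sigma.
inline double gauss1_pdf(double x, double mu, double sigma) {
    const double inv_sqrt_2pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
    const double z = (x - mu) / sigma;
    return inv_sqrt_2pi / sigma * std::exp(-0.5 * z * z);
}

// Assumed second overload: standard normal density.
inline double gauss1_pdf(double x) {
    return gauss1_pdf(x, 0.0, 1.0);
}

// C linkage has no mangling, so each exported symbol needs its own name.
extern "C" double gauss1_pdf_std(double x) {
    return gauss1_pdf(x);
}

extern "C" double gauss1_pdf_full(double x, double mu, double sigma) {
    return gauss1_pdf(x, mu, sigma);
}
```

Putting both gauss1_pdf overloads inside one extern "C" block would reproduce the error, because both would map to the same unmangled symbol.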

Last Try: add inline


Refer to C/C++ “inline” keyword in CUDA device-side code
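A minimal sketch of what an inline-based header might look like; it is not the original code from these notes. The HOST_DEVICE macro and the functions square and cube are hypothetical. The macro lets the same header compile under nvcc (which defines __CUDACC__) and under a plain C++ compiler.

```cpp
#include <cassert>

// Under nvcc the functions are usable on host and device;
// under a plain C++ compiler the qualifiers expand to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Imagine this in a header included from several .cu/.cpp files.
// `inline` permits the same definition in every translation unit,
// so the linker no longer reports a multiple definition.
HOST_DEVICE inline double square(double x) {
    return x * x;
}

// Function templates may also be defined in a header without
// triggering multiple-definition errors.
template <typename T>
HOST_DEVICE T cube(T x) {
    return x * x * x;
}
```

Without the inline keyword, every file that includes the header would emit its own definition of square and the link step would fail.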