
CUDA notes



Error Handling

unified memory profiling failed


nvprof --unified-memory-profiling off ./add_cuda

Strange Output

Pay attention to the size of the vectors when calling cudaMemcpy(): the count argument is a size in bytes, not a number of elements.

Also, do not mix up the order of the dimensions.

cudaMemcpy fails

Check the order of the destination and source arguments: the destination comes first, as in cudaMemcpy(dst, src, count, kind).
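Both pitfalls (a wrong size and swapped arguments) can be sketched on the host with std::memcpy, which takes its destination, source and byte count in the same order as the first three arguments of cudaMemcpy. This is only a host-side analogy, not CUDA code, and the helper names copy_vector and duplicate are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy n doubles from src to dst.
// The third argument of memcpy (like the count of cudaMemcpy) is in BYTES,
// so it must be n * sizeof(double), not n.
inline void copy_vector(double *dst, const double *src, std::size_t n) {
    if (n == 0) return;                         // nothing to copy
    std::memcpy(dst, src, n * sizeof(double));  // destination first, source second
}

// Hypothetical helper: return a copy of src made through copy_vector.
inline std::vector<double> duplicate(const std::vector<double> &src) {
    std::vector<double> dst(src.size(), 0.0);
    copy_vector(dst.data(), src.data(), src.size());
    return dst;
}
```

On the device side the corresponding call would look like cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice), where d_x and h_x are assumed names for a device and a host buffer.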


think twice before adding the following code


free(): invalid next size (fast/normal)

Error [an illegal memory access was encountered]



double *pone;

and DO NOT

double one;
double *pone = &one;


double *pone = (double*)malloc(sizeof(double));
*pone = 1.0;


  1. Results differ between repeated runs
  2. On two for loops


  1. Element-by-element vector multiplication with CUDA
  2. Is there a cuda function to copy a row from a Matrix in column major?
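  3. “invalid configuration argument” error for the call of CUDA kernel? Note: each block dimension has its own thread limit (512 per row and per column, 64 in depth), but the product of the three dimensions is also capped, at 768 (1024 on some hardware). These limits depend on the device; cudaGetDeviceProperties reports them as maxThreadsDim and maxThreadsPerBlock.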
  4. Incomplete output from printf() called on device
  5. Understanding __threadfence in CUDA
  6. Call cublas API from kernel
  7. CUDA Memory Hierarchy
  8. In a CUDA kernel, how do I store an array in “local thread memory”?
  9. cublas handle reuse
  10. Does __syncthreads() synchronize all threads in the grid?
  11. How to call device function in CUDA with fewer threads
  12. CUDA threads for inner loop
  13. CUDA printf() crashes when large number of threads and blocks are launched
  14. Intro to image processing with CUDA
  15. CUDA parallelizing a nested for loop
  16. CUDA kernel - nested for loop
  17. For nested loops with CUDA
  18. CUDA kernel - nested for loop
  19. Converting “for” loops into cuda parallelized code
  20. What does “Misaligned address error” mean?
  21. memory allocation inside a CUDA kernel
  22. NCSA GPU programming tutorial
  23. Complicated for loop to be ported to a CUDA kernel
  24. CUDA error message : unspecified launch failure
  25. Extracting matrix columns with CUDA?
  26. Is there a cuda function to copy a row from a Matrix in column major?
  27. Element-by-element vector multiplication with CUDA
  28. Converting C/C++ for loops into CUDA
  29. GPU study notes series
  30. What is multithreading good for?
  31. CUDA frequently asked questions and answers
  32. cudaStreamSynchronize vs CudaDeviceSynchronize vs cudaThreadSynchronize
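The note on “invalid configuration argument” in the list above (a block must respect both the per-dimension limits and the limit on the total thread count) can be sketched as a host-side check. BlockDim, BlockLimits and block_is_valid are hypothetical names; in real CUDA code the limits would come from the maxThreadsDim and maxThreadsPerBlock fields filled in by cudaGetDeviceProperties.

```cpp
// Hypothetical plain structs standing in for dim3 and for the relevant
// cudaDeviceProp fields; this is a host-only sketch, not CUDA runtime code.
struct BlockDim {
    unsigned x, y, z;
};

struct BlockLimits {
    unsigned max_x, max_y, max_z;  // per-dimension limits (maxThreadsDim)
    unsigned max_threads;          // limit on x * y * z (maxThreadsPerBlock)
};

// Returns true when the block shape satisfies both kinds of limits.
inline bool block_is_valid(const BlockDim &b, const BlockLimits &lim) {
    if (b.x == 0 || b.y == 0 || b.z == 0) return false;
    if (b.x > lim.max_x || b.y > lim.max_y || b.z > lim.max_z) return false;
    // Use a 64-bit product so the multiplication cannot overflow.
    const unsigned long long total = 1ULL * b.x * b.y * b.z;
    return total <= lim.max_threads;
}
```

With assumed limits of 1024 x 1024 x 64 and 1024 threads per block, a 32 x 32 x 1 block passes, while 32 x 32 x 2 fails even though every single dimension is within range.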

Use GSL (GNU Scientific Library)

How to use the GNU scientific library (gsl) in nvidia Nsight eclipse

Getting started with parallel MCMC

Getting started with parallel MCMC

Multiple definitions Error

Some similar problems and explanations:

  1. multiple definition error c++
  2. Multple c++ files causes “multiple definition” error?
  3. getting “multiple definition” errors with simple device function in CUDA C
  4. CUDA multiple definition error during linking

First Try: separate declarations and implementations

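According to Separate Compilation and Linking of CUDA C++ Device Code, it is reasonable to split a device-code header that also contains the implementation into a pure header (declarations only) and a separate implementation file. Linking device code across files this way requires relocatable device code (nvcc -rdc=true).

Templates, however, cannot be split this way, because the compiler needs the full template definition wherever it is instantiated; refer to How to define a template class in a .h file and implement it in a .cpp file and Why can’t templates be within extern “C” blocks?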
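As a minimal single-file sketch of the template restriction, the comments below mark where the header and the source file would be split in a real project. The file name axpy.h and the functions axpy, use_axpy_double and use_axpy_int are hypothetical.

```cpp
// ---- would live in a header, e.g. a hypothetical "axpy.h" ----
// The full definition (not just a declaration) stays in the header, because
// the compiler must see the template body wherever it is instantiated.
template <typename T>
T axpy(T a, T x, T y) {
    return a * x + y;
}

// ---- would live in source files that include the header ----
// Each use instantiates the template for the deduced type.
inline double use_axpy_double() { return axpy(2.0, 3.0, 1.0); }  // axpy<double>
inline int use_axpy_int() { return axpy(2, 3, 1); }              // axpy<int>
```

Function templates defined in a header do not by themselves trigger a multiple definition error, since identical definitions across translation units are permitted for templates.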

Second Try: add extern "C"

A reference on extern "C": extern “C” {} in C++ projects

Several functions share the same name but have different parameter lists, so the compiler reports

more than one instance of overloaded function "gauss1_pdf" has "C" linkage

In short, overloading is a C++ feature: C linkage has no name mangling, so two functions with the same name cannot both have it. Refer to More than one instance overloaded function has C linkage.
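One possible workaround, sketched here under assumptions, is to keep the overloads with C++ linkage and expose them through extern "C" wrappers with distinct names. Only the name gauss1_pdf comes from the error message above; the parameter lists and the wrapper names gauss1_pdf_std and gauss1_pdf_general are hypothetical.

```cpp
#include <cmath>

// C++ linkage: both overloads can coexist because their names are mangled
// together with their parameter types.
double gauss1_pdf(double x, double mu, double sigma) {
    const double kPi = 3.14159265358979323846;
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * kPi));
}

double gauss1_pdf(double x) {
    return gauss1_pdf(x, 0.0, 1.0);  // standard normal density
}

// C linkage: names are not mangled, so each wrapper needs its own name.
extern "C" {
double gauss1_pdf_std(double x) { return gauss1_pdf(x); }
double gauss1_pdf_general(double x, double mu, double sigma) {
    return gauss1_pdf(x, mu, sigma);
}
}
```

Declaring both gauss1_pdf overloads inside the extern "C" block instead would reproduce the error quoted above.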

Last Try: add inline


Refer to C/C++ “inline” keyword in CUDA device-side code
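A minimal sketch of the inline approach, assuming the function is small enough to be defined directly in a header: marking it inline lets the same definition appear in every file that includes the header without a multiple definition error at link time. The macro HOST_DEVICE, the file name square.h and the function square are hypothetical; the macro only lets the same header compile with a plain C++ compiler as well as with nvcc.

```cpp
// ---- would live in a header, e.g. a hypothetical "square.h" ----
#ifdef __CUDACC__
// Compiled by nvcc: make the function callable from host and device code.
#define HOST_DEVICE __host__ __device__
#else
// Compiled by a plain C++ compiler: the qualifiers expand to nothing.
#define HOST_DEVICE
#endif

// `inline` allows this definition to be repeated in every translation unit
// that includes the header.
inline HOST_DEVICE double square(double x) {
    return x * x;
}
```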