Race conditions despite atomicAdd functions (CUDA)?


I have a problem that is parallel on two levels: I have a ton of sets of (x0, x1, y0, y1) coordinate pairs, which are turned into the variables vdx, vdy and vyy, and for each of these sets I'm trying to calculate the values of all "monomials" composed of them up to degree n (i.e. all possible combinations of different powers of them, like vdx^3*vdy*vyy^2 or vdx*1*vyy^4). These values are then added up over all the sets.


My strategy (and for now I'd just like to get it to work, it doesn't have to be optimized with multiple kernels or complex reductions, unless it really has to) is to have each thread deal with one set of coordinate pairs and calculate the values of all their corresponding monomials. Each block's shared memory holds all the monomial sums, and when the block is done, the first thread in the block adds the result to the global sum. Since each block's shared memory is accessed by all threads in all places, I'm using atomicAdd; same with the blocks and the global memory.


Unfortunately there still seems to be a race condition somewhere, since I get different results every time I run the kernel.


If it helps, I'm currently using degree = 3 and omitting one of the variables, which means that in the code below the innermost for loop (over evbl) doesn't do anything and just repeats 4 times. Indeed, the output of the kernel looks like this: 51502,55043.1,55043.1,51502,47868.5,47868.5,48440.5,48440.6,46284.7,46284.7,46284.7,46284.7,46034.3,46034.3,46034.3,46034.3,44972.8,44972.8,44972.8,44972.8,43607.6,43607.6,43607.6,43607.6,43011,43011,43011,43011,42747.8,42747.8,42747.8,42747.8,45937.8,45937.8,46509.9,46509.9,... and a (rough) pattern of 4-tuples is noticeable. But every time I run it the values are all very different.


Everything is in floats, but I'm on a compute capability 2.1 GPU, so float atomicAdd shouldn't be a problem. cuda-memcheck also reports no errors.


Can somebody with more CUDA experience give me some pointers on how to track down the race condition here?


__global__ void kernel(...) {

  extern __shared__ float s_data[];

  // just use global memory for now
  // get threadID:
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if(idx >= nPairs) return;

  // ... do some calculations to get x/y...

  // calculate vdx, vdy and vyy
  float vdx = (x1 - x0)/(float)xheight;
  float vdy = (y1 - y0)/(float)xheight;
  float vyy =  0.5*(y0 + y1)/(float)xheight;


  const int offs1 = degree + 1;
  const int offs2 = offs1 * offs1;
  const int offs3 = offs2 * offs1;
  float sol = 1.0;

  // now calculate monomial results and store in shared memory

  for(int evdx = 0; evdx <= degree; evdx++) {
    for(int evdy = 0; evdy <= degree; evdy++) {
      for(int evyy = 0; evyy <= degree; evyy++) {
        for(int evbl = 0; evbl <= degree; evbl++) {
          // monomial value: product of the powers, as described above
          sol = powf(vdx, evdx) * powf(vdy, evdy) * powf(vyy, evyy);
          atomicAdd(&(s_data[evbl + offs1*evyy + offs2*evdy +
                offs3*evdx]), sol/1000.0f);

        }
      }
    }
  }

  // now copy shared memory to global
  __syncthreads();
  if(threadIdx.x == 0) {
    for(int i = 0; i < nMonomials; i++) {
      atomicAdd(&outmD[i], s_data[i]);
    }
  }
}
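Since s_data is declared extern __shared__, its size has to be supplied as the third kernel launch parameter; for nMonomials float accumulators, the launch needs to have roughly this shape (the block size of 256 is just an illustrative choice, not taken from the original post):

```cuda
// Hypothetical launch: the dynamic shared-memory size must cover one
// float partial sum per monomial slot.
int threadsPerBlock = 256;
int blocks = (nPairs + threadsPerBlock - 1) / threadsPerBlock;
size_t smemBytes = nMonomials * sizeof(float);
kernel<<<blocks, threadsPerBlock, smemBytes>>>(/* ... */);
```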

1 Answer

#1 (4 upvotes)

You are using shared memory but you are never initializing it.

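A minimal sketch of one way to apply that fix to the posted kernel, assuming the dynamic shared-memory setup stays as-is: zero s_data cooperatively before any thread accumulates into it, and guard out-of-range threads with an if instead of an early return, so that every thread in the block reaches the __syncthreads() barriers (the early return in the posted code would otherwise make the barriers divergent):

```cuda
__global__ void kernel(/* ... same parameters as in the question ... */) {
  extern __shared__ float s_data[];

  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  // Zero the shared accumulators cooperatively: each thread clears a
  // strided subset of the nMonomials slots.
  for (int i = threadIdx.x; i < nMonomials; i += blockDim.x)
    s_data[i] = 0.0f;
  __syncthreads();  // every slot is zero before anyone accumulates

  // Guard instead of returning early, so all threads still reach the
  // barrier below even when idx >= nPairs.
  if (idx < nPairs) {
    // ... per-thread monomial loop with atomicAdd on s_data, as before ...
  }

  __syncthreads();
  if (threadIdx.x == 0) {
    for (int i = 0; i < nMonomials; i++)
      atomicAdd(&outmD[i], s_data[i]);
  }
}
```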

