Questions tagged [cuda]
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.
14,521
questions
0
votes
0
answers
23
views
Compilation Errors with CUDA Fortran and cuBLAS
I am trying to compile a Fortran program using CUDA and cuBLAS as per an example from the NVIDIA HPC SDK documentation. My setup includes an NVIDIA A100 GPU, and I have configured the CUDA and cuBLAS ...
0
votes
1
answer
47
views
Problem with CMake and including a third-party library
I'm trying to properly configure CMake for my CUDA project. I'm using a third-party library, CGBN: https://github.com/NVlabs/CGBN/tree/master, and Catch2 for unit tests.
Basically I am trying to build ...
-2
votes
0
answers
20
views
Computing function values in parallel with Julia CUDA
I have a function f(x,y,z) defined as
global a = 1.0
function f(x, y, z)
return a*x^2 + y*z
end
How can I calculate the sum of the function values at 10000 different points using CUDA?
I asked GPT and ...
-2
votes
0
answers
19
views
CuPy - Changes in included file not picked up
I have an external CUDA kernel function which I call using a RawKernel.
This kernel function is defined in a first .cu file.
From this kernel function, I call some auxiliary __device__ functions which ...
-2
votes
0
answers
39
views
How can I use multiple CPU cores while running CUDA with one GPU? [closed]
I am trying to run a simulation programmed with CUDA on a server that has a single Intel Xeon with 36 cores and a single P4000 GPU. When I used mpirun -np 16 ./sim, it failed to run, stating that MPI ...
0
votes
0
answers
25
views
How many APIs does the CUTLASS library have?
I want to use the CUTLASS API. On the CUTLASS website, I found that it lists a lot of class templates, including the common cutlass::gemm::device::gemm, etc. The problem is that there are so ...
0
votes
0
answers
47
views
Inconsistent global memory access between blocks despite use of volatile, threadfence and disabling L1 cache
In the following minimal reproducible example for the construction of a tree, where bodies are inserted based on their position (so a 1D version of a Quad/Octree), when multiple blocks are used, some ...
-4
votes
0
answers
31
views
Problems installing the NVIDIA driver on Ubuntu 22.04.2 LTS [closed]
The NVIDIA driver installation failed. I tried to install the NVIDIA driver with the command below:
sudo apt install nvidia-driver-545-open
but I got the following log:
Building for 6.5.0-44-generic 6.5.1-...
1
vote
0
answers
36
views
Unable to include thrust/host_vector.h and others with CUDA 12.5
This test program compiled fine with CUDA 12.4 and lower, but fails to compile with 12.5.1:
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <iostream>
int main() {
...
-4
votes
1
answer
44
views
Continuously getting the error: 'nvidia-smi' is not recognized as an internal or external command, operable program or batch file [closed]
Disclaimer: I am not very experienced with Python.
I have been trying to set up SAM (Segment Anything Model by Meta), but have been running into issues with installing PyTorch. I have followed ...
0
votes
0
answers
43
views
What are the risks of increasing cudaLimitDevRuntimePendingLaunchCount?
I encountered an error while using dynamic parallelism:
launch failed because launch would exceed
cudaLimitDevRuntimePendingLaunchCount
To resolve this issue, I increased ...
0
votes
1
answer
46
views
Can CUDA Thrust Kernels operate in parallel on multiple streams?
I am attempting to launch thrust::fill on two different device vectors in parallel on different CUDA streams. However, when I look at the kernel launches in Nsight Systems, they appear to be ...
-2
votes
0
answers
29
views
My benchmark and Nsight Compute don't agree which kernel is faster [closed]
I have two CUDA convolution kernels which convolve a 1024x1024 input with a 3x3 mask. I ran both of them 1000 times. The average execution time of kernel 1 is better than that of kernel 2 according ...
0
votes
1
answer
56
views
cuBLAS matrix multiplication with row-major data
I read some related posts here and succeeded in multiplying row-major matrices with cuBLAS:
A*B (column-major) = B*A (row-major)
I wrote a wrapper to do this so that I can pass row ...
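For readers unfamiliar with the operand-swap trick this excerpt refers to, here is a minimal CPU-only sketch in C++. It is not the asker's wrapper. `gemm_col_major` is a hypothetical stand-in for a column-major GEMM routine such as cuBLAS provides, and `gemm_row_major` is an illustrative name. The idea is that a row-major buffer holding X, read as column-major, is X^T, so asking the column-major routine for B^T * A^T leaves C = A * B in the output buffer in row-major order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU stand-in for a column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), with leading dimensions lda, ldb, ldc.
void gemm_col_major(int m, int n, int k,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = acc;
        }
    }
}

// Illustrative row-major wrapper: C (m x n) = A (m x k) * B (k x n),
// all three stored row-major. Swapping the operands and the m/n
// dimensions makes the column-major routine compute C^T = B^T * A^T,
// which is exactly C laid out in row-major order.
void gemm_row_major(int m, int n, int k,
                    const float* A, const float* B, float* C) {
    gemm_col_major(n, m, k, B, n, A, k, C, n);
}
```

With a real column-major library, the same swap of operands, dimensions, and leading dimensions would apply. No data is copied or explicitly transposed.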
1
vote
0
answers
49
views
Cannot determine Numba type of <class 'clr._internal.CLRMetatype'>
I am very new to CUDA...
I have written a module in .NET 6.0 and I need to speed up its execution using CUDA on an Ubuntu machine. The method I need to call is defined as:
namespace ...