The grid size is determined by the input size when a CUDAFunction is called; if you create the CUDAFunction with a block size of 1024 and then call it on a list of length 8192, it will launch 8 blocks. Inside the CUDA kernel, memory should be addressed along the lines of:
foo = data[threadIdx.x + blockIdx.x * blockDim.x];
This is exemplified by the very first basic example in the CUDAFunctionLoad documentation page.
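One detail that indexing line glosses over: the grid is rounded up to whole blocks, so when the input length is not a multiple of the block size, some threads land past the end of the data. A kernel using this pattern would therefore guard the access (the kernel name `scale` and its arguments are illustrative, not from the documentation; `mint` is the machine-integer type CUDALink makes available in kernel source):

    __global__ void scale(float *data, mint n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)        /* skip threads beyond the input length */
            data[i] *= 2.0f;
    }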
You can also set the number of threads explicitly when calling the function. I'm not on a CUDA-capable machine at the moment, but I believe the syntax for the 1D case is:
fun = CUDAFunctionLoad[src, name, argtypes, 1024];
fun[args, 8192]
(for higher dimensions, the block/grid specification would be a list of dimensions).
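For reference, here is how the pieces fit together in a complete sketch. This is untested and loosely modeled on the basic example in the CUDAFunctionLoad documentation, so treat the kernel name, type specification, and memory qualifier as approximate:

    Needs["CUDALink`"]

    src = "
      __global__ void addTwo(mint *data, mint length) {
          int index = threadIdx.x + blockIdx.x * blockDim.x;
          if (index < length) data[index] += 2;
      }";

    addTwo = CUDAFunctionLoad[src, "addTwo",
       {{_Integer, _, "InputOutput"}, _Integer}, 1024];

    addTwo[Range[8192], 8192]

Here 1024 is the block size and the grid size is inferred from the input length; passing an extra final argument to `addTwo` would set the total number of threads explicitly.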
Also worth noting: the 1024 limit on the first dimension of the block size is a hardware limitation (devices of compute capability 2.x and later allow up to 1024 threads per block; older devices allow only 512), not a Mathematica limitation.