
Offload thrust::sort to GPU with use of cached allocator

As promised, I've worked a bit on enabling thrust::sort on the GPU with a cached allocator (as suggested by @lmosimann) to avoid repeated cudaMalloc calls, which can be a measurable overhead. The cached allocator could potentially be improved further, e.g. by periodically freeing unused blocks; this is just an initial implementation to give an idea of how something like this could work. Offloading the sorting to the GPU yields a significant speedup in multiple parts, so 🚀
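For illustration, here is a minimal sketch of the caching idea (not the code in this MR): freed blocks are kept in a free list keyed by size and reused on the next allocation instead of going back to the driver. Plain `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree`, and the class name `cached_allocator` is hypothetical.

```cpp
#include <cstdlib>
#include <map>

// Sketch of a cached allocator: thrust::sort requests temporary scratch
// space on every call, so instead of allocating and freeing each time,
// deallocated blocks are parked in a free list and handed back out when
// a later request fits. malloc/free stand in for cudaMalloc/cudaFree.
class cached_allocator {
public:
  void* allocate(std::size_t n) {
    // Reuse the smallest cached block of at least n bytes, if any.
    auto it = free_blocks_.lower_bound(n);
    if (it != free_blocks_.end()) {
      void* p = it->second;
      allocated_[p] = it->first;  // remember the block's true size
      free_blocks_.erase(it);
      return p;
    }
    void* p = std::malloc(n);  // real version: cudaMalloc
    allocated_[p] = n;
    return p;
  }

  void deallocate(void* p) {
    // Return the block to the cache instead of freeing it.
    auto it = allocated_.find(p);
    free_blocks_.emplace(it->second, p);
    allocated_.erase(it);
  }

  ~cached_allocator() {
    // Release everything at shutdown (real version: cudaFree).
    for (auto& kv : free_blocks_) std::free(kv.second);
    for (auto& kv : allocated_) std::free(kv.first);
  }

  std::size_t cached_blocks() const { return free_blocks_.size(); }

private:
  std::multimap<std::size_t, void*> free_blocks_;  // size -> parked block
  std::map<void*, std::size_t> allocated_;         // live allocations
};
```

In the Thrust setting, an allocator like this is typically wrapped to satisfy Thrust's allocator requirements and passed through an execution policy (e.g. `thrust::cuda::par(alloc)`), so consecutive sorts reuse the same scratch buffer.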
