hmax2 in CUDA

You can implement hmax2, which computes the element-wise maximum of two half2 values, using bit operations.
If you target sm_80 or newer, there is a built-in __hmax2 function.
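A minimal sketch of the idea, assuming the usual trick of subtracting the two inputs, reading the sign bit of each lane, and turning those bits into a __byte_perm selector. The names hmax2_sketch, hmax2_kernel and hmax2_on_device are hypothetical, and the PTX instruction count quoted in this post was not measured on this sketch. It needs half arithmetic, so compile for sm_53 or newer.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical name, chosen to avoid clashing with the built-in __hmax2.
__device__ inline __half2 hmax2_sketch(const __half2 a, const __half2 b) {
    // Lane-wise difference: the sign bit of a lane is set when a < b there.
    const __half2 diff = __hsub2(a, b);

    unsigned ua, ub, ud;
    memcpy(&ua, &a, sizeof(ua));
    memcpy(&ub, &b, sizeof(ub));
    memcpy(&ud, &diff, sizeof(ud));

    // Bit 15 is the sign of the low lane, bit 31 the sign of the high lane.
    const unsigned sign = ud & 0x80008000u;
    const unsigned lo = (sign >> 13) & 0x004u;  // bit 15 -> bit 2
    const unsigned hi = (sign >> 21) & 0x400u;  // bit 31 -> bit 10

    // 0x3210 keeps every byte of a. Adding 4 to a selector nibble takes the
    // same byte from b instead. Multiplying by 0x11 copies each flag to the
    // second byte of its lane.
    const unsigned sel = 0x3210u | ((lo | hi) * 0x11u);

    const unsigned ur = __byte_perm(ua, ub, sel);
    __half2 r;
    memcpy(&r, &ur, sizeof(r));
    return r;
}

// Tiny test kernel: converts floats to half2, applies the sketch, converts back.
__global__ void hmax2_kernel(const float2 a, const float2 b, float2* out) {
    const __half2 r = hmax2_sketch(__floats2half2_rn(a.x, a.y),
                                   __floats2half2_rn(b.x, b.y));
    *out = __half22float2(r);
}

// Host helper for trying the sketch (error checking omitted for brevity).
float2 hmax2_on_device(const float2 a, const float2 b) {
    float2* d_out = nullptr;
    float2 h_out = make_float2(0.0f, 0.0f);
    cudaMalloc(&d_out, sizeof(float2));
    hmax2_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

NaN inputs get no special handling in this sketch: whichever operand the sign bit of the difference selects is returned as-is.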

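__byte_perm(x, y, s) treats x and y as a pool of eight input bytes (bytes 0 to 3 from x, bytes 4 to 7 from y) and builds a 32-bit result from four of them. Each output byte is picked by a 3-bit index stored in one of the four low nibbles of the selector s.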
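To make the selection rule concrete, here is a plain host-side reference model. byte_perm_ref is a hypothetical helper written for illustration and is not part of CUDA; it assumes only the low 3 bits of each selector nibble matter.

```cuda
#include <cstdint>

// Host-side reference model of __byte_perm, for illustration only.
// byte_perm_ref is a hypothetical helper, not a CUDA API.
inline std::uint32_t byte_perm_ref(std::uint32_t x, std::uint32_t y, std::uint32_t s) {
    // Eight candidate bytes: indices 0..3 come from x, 4..7 from y.
    const std::uint64_t pool = (static_cast<std::uint64_t>(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Output byte i is chosen by the low 3 bits of selector nibble i.
        const std::uint32_t index = (s >> (4 * i)) & 0x7u;
        const std::uint32_t byte =
            static_cast<std::uint32_t>((pool >> (8 * index)) & 0xFFu);
        result |= byte << (8 * i);
    }
    return result;
}
```

For example, with x = 0x33221100 and y = 0x77665544, the selector 0x3210 returns x unchanged, 0x7654 returns y, and 0x3254 returns 0x33225544 (low half from y, high half from x).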

Compiled to PTX, the hmax2 code is only 9 instructions.

You can get hmin2 by changing the __hsub2 call on the first line so that the sign test is reversed, for example by swapping its operands.
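As a sketch of that change, assuming the operand swap is the intended edit: with __hsub2(b, a) the sign bit of a lane is set when b < a, so the same selector logic now picks the smaller value. The names hmin2_sketch, hmin2_kernel and hmin2_on_device are hypothetical. Compile for sm_53 or newer.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical name, chosen to avoid clashing with the built-in __hmin2.
__device__ inline __half2 hmin2_sketch(const __half2 a, const __half2 b) {
    // Operands swapped relative to the max version: sign set when b < a.
    const __half2 diff = __hsub2(b, a);

    unsigned ua, ub, ud;
    memcpy(&ua, &a, sizeof(ua));
    memcpy(&ub, &b, sizeof(ub));
    memcpy(&ud, &diff, sizeof(ud));

    const unsigned sign = ud & 0x80008000u;
    const unsigned lo = (sign >> 13) & 0x004u;  // low-lane sign -> bit 2
    const unsigned hi = (sign >> 21) & 0x400u;  // high-lane sign -> bit 10

    // Start from "all bytes of a" and switch a lane to b where its flag is set.
    const unsigned sel = 0x3210u | ((lo | hi) * 0x11u);

    const unsigned ur = __byte_perm(ua, ub, sel);
    __half2 r;
    memcpy(&r, &ur, sizeof(r));
    return r;
}

// Tiny test kernel: converts floats to half2, applies the sketch, converts back.
__global__ void hmin2_kernel(const float2 a, const float2 b, float2* out) {
    const __half2 r = hmin2_sketch(__floats2half2_rn(a.x, a.y),
                                   __floats2half2_rn(b.x, b.y));
    *out = __half22float2(r);
}

// Host helper for trying the sketch (error checking omitted for brevity).
float2 hmin2_on_device(const float2 a, const float2 b) {
    float2* d_out = nullptr;
    float2 h_out = make_float2(0.0f, 0.0f);
    cudaMalloc(&d_out, sizeof(float2));
    hmin2_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```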

Date:2020-04-15