hmax2 in CUDA

You can implement hmax2, which computes the element-wise maximum of two half2 values, using bit operations.
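A minimal sketch of what such an implementation might look like, assuming a device with native half arithmetic (compute capability 5.3 or later). The names `hmax2_sketch`, `half2_bits` and `demo` are illustrative and not taken from the post. The idea is to read the per-lane sign of `a - b` and turn it into a `__byte_perm` selector. This sketch does not treat NaN inputs specially.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Illustrative helper: view the 32 bits of a half2 as an unsigned integer.
__device__ __forceinline__ unsigned half2_bits(const half2 v) {
    return *reinterpret_cast<const unsigned*>(&v);
}

// Sketch of an element-wise max built from a subtraction and bit operations.
__device__ half2 hmax2_sketch(const half2 a, const half2 b) {
    // Per-lane sign of (a - b): set when a < b, i.e. when b must be selected.
    const half2 diff = __hsub2(a, b);
    const unsigned sign = half2_bits(diff) & 0x80008000u;
    // Move each lane's sign bit onto bit 2 of its first selector nibble
    // (bit 15 -> bit 2, bit 31 -> bit 10), then copy it to the next nibble
    // by multiplying with 0x11. Bits that land in the upper 16 bits are
    // ignored by __byte_perm.
    const unsigned pick_b = ((sign >> 13) | (sign >> 21)) * 0x11u;
    // 0x3210 selects all four bytes of a; adding 4 to a nibble selects the
    // corresponding byte of b instead.
    const unsigned selector = 0x3210u | pick_b;
    const unsigned bits = __byte_perm(half2_bits(a), half2_bits(b), selector);
    return *reinterpret_cast<const half2*>(&bits);
}

// Tiny driver kernel so the sketch can be tried from the host.
__global__ void demo(const float2 a, const float2 b, float2* out) {
    const half2 r = hmax2_sketch(__float22half2_rn(a), __float22half2_rn(b));
    *out = __half22float2(r);
}

int main() {
    float2* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float2));
    demo<<<1, 1>>>(make_float2(1.0f, -2.0f), make_float2(0.5f, 3.0f), d_out);
    float2 h_out;
    cudaMemcpy(&h_out, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    std::printf("%f %f\n", h_out.x, h_out.y);
    return 0;
}
```

For these inputs the expected per-lane maxima are 1 and 3. When both lanes compare equal the difference is +0, so the bytes of `a` are returned.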

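__byte_perm(x, y, s) builds a 32-bit value from the eight bytes of x and y: each of the four low nibbles of the selector s picks which input byte (0 to 3 from x, 4 to 7 from y) goes into the corresponding output byte.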
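To make the byte selection concrete, here is a small host-side reference model of `__byte_perm`, written in plain C++ so it can be checked without a GPU. The name `byte_perm_ref` is an illustrative assumption. The behavior follows the documented selector layout, where only the low three bits of each nibble are used.

```cpp
#include <cstdint>

// Host-side reference model of __byte_perm(x, y, s).
// byte_perm_ref is an illustrative name, not a CUDA API.
std::uint32_t byte_perm_ref(std::uint32_t x, std::uint32_t y, std::uint32_t s) {
    // Bytes 0..3 come from x and bytes 4..7 from y, least significant first.
    const std::uint64_t pool = (static_cast<std::uint64_t>(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Nibble i of the selector chooses the source of output byte i;
        // only its low 3 bits are used.
        const std::uint32_t index = (s >> (4 * i)) & 0x7u;
        const std::uint32_t byte =
            static_cast<std::uint32_t>((pool >> (8 * index)) & 0xFFu);
        result |= byte << (8 * i);
    }
    return result;
}
```

With x = 0x33221100 and y = 0x77665544, the selector 0x3210 returns x unchanged, while 0x3254 returns 0x33225544: the low 16 bits come from y and the high 16 bits from x. Swapping whole 16-bit halves this way is what lets one selector choose each half2 lane independently.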

Compiled to PTX, the hmax2 code is only 9 instructions.

You can make hmin2 by changing the __hsub2 call on the first line.
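One way to read that change, assuming the first line of the max version is `__hsub2(a, b)`, is to swap the operands so the sign bit marks the lanes where `b` is the smaller value. The sketch below is an assumption along those lines, with the illustrative name `hmin2_sketch`. Like the max version it needs compute capability 5.3 or later and does not handle NaN specially.

```cuda
#include <cuda_fp16.h>

// Sketch of an element-wise min: same bit trick, operands of __hsub2 swapped.
__device__ half2 hmin2_sketch(const half2 a, const half2 b) {
    // Per-lane sign of (b - a): set when b < a, i.e. when b must be selected.
    const half2 diff = __hsub2(b, a);
    const unsigned sign =
        *reinterpret_cast<const unsigned*>(&diff) & 0x80008000u;
    // Turn each lane's sign bit into "+4" on both selector nibbles of that lane.
    const unsigned selector =
        0x3210u | (((sign >> 13) | (sign >> 21)) * 0x11u);
    const unsigned bits =
        __byte_perm(*reinterpret_cast<const unsigned*>(&a),
                    *reinterpret_cast<const unsigned*>(&b), selector);
    return *reinterpret_cast<const half2*>(&bits);
}
```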

Date:2020-04-15