You can implement hmax2, which computes the element-wise maximum of two half2 values, using bit operations.
A built-in __hmax2 function is available if you target an architecture of sm_80 or later.
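__byte_perm(x, y, s) returns four bytes picked from the eight bytes of x and y, with each of the low four nibbles of s selecting one output byte, as shown in this image.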
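To make the byte selection concrete, here is a small host-side reference that mirrors the documented behaviour of __byte_perm. It is only an illustration: byte_perm_ref is a hypothetical name, and the function is plain C++ that also compiles inside a .cu file. It assumes, as the CUDA documentation states, that only the low three bits of each selector nibble are used and that the upper 16 bits of s are ignored.

```cpp
#include <cstdint>

// Host-side reference for the documented behaviour of __byte_perm(x, y, s).
// byte_perm_ref is a hypothetical helper used only for illustration.
inline std::uint32_t byte_perm_ref(std::uint32_t x, std::uint32_t y, std::uint32_t s) {
    // Bytes 0..3 come from x and bytes 4..7 from y, least significant first.
    const std::uint64_t pool = (static_cast<std::uint64_t>(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Output byte i is chosen by the low 3 bits of the i-th selector nibble.
        const std::uint32_t idx = (s >> (4 * i)) & 0x7u;
        const std::uint32_t byte = static_cast<std::uint32_t>((pool >> (8 * idx)) & 0xFFu);
        result |= byte << (8 * i);
    }
    return result;
}
```

With x = 0x33221100 and y = 0x77665544, the selector 0x3210 returns x unchanged, 0x7654 returns y, and 0x3254 takes the low 16 bits from y and the high 16 bits from x.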
In PTX, this code compiles to only 9 instructions.
You can make hmin2 by changing the __hsub2 call on the first line.
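As a minimal sketch of the approach described above (not necessarily the exact code the instruction count refers to), the sign bits of a half2 difference can be turned into a __byte_perm selector. The names select_by_sign, hmax2_bits and hmin2_bits are hypothetical. The sketch assumes a device with half-precision arithmetic (sm_53 or later) and reinterprets half2 values as 32-bit words.

```cuda
#include <cuda_fp16.h>

// Sketch only: all three function names below are hypothetical.
// Picks, per 16-bit half, b where diff is negative and a otherwise.
__device__ inline half2 select_by_sign(const half2 a, const half2 b, const half2 diff) {
    // Reinterpret the operands as raw 32-bit words (two 16-bit halves each).
    const unsigned ua = *reinterpret_cast<const unsigned*>(&a);
    const unsigned ub = *reinterpret_cast<const unsigned*>(&b);
    const unsigned ud = *reinterpret_cast<const unsigned*>(&diff);

    // Keep only the sign bit of each half of the difference (bits 15 and 31).
    const unsigned sign = ud & 0x80008000u;

    // Copy each sign bit onto bit 2 of the two selector nibbles covering its half:
    // bit 15 -> bits 2 and 6, bit 31 -> bits 10 and 14. Setting bit 2 of a nibble
    // adds 4 to the byte index, which switches the source from a to b.
    const unsigned pick_b =
        ((sign >> 13) | (sign >> 9) | (sign >> 21) | (sign >> 17)) & 0x4444u;

    // 0x3210 alone would return a unchanged.
    const unsigned r = __byte_perm(ua, ub, 0x3210u | pick_b);
    return *reinterpret_cast<const half2*>(&r);
}

__device__ inline half2 hmax2_bits(const half2 a, const half2 b) {
    return select_by_sign(a, b, __hsub2(a, b));  // a - b < 0 means b is larger
}

__device__ inline half2 hmin2_bits(const half2 a, const half2 b) {
    return select_by_sign(a, b, __hsub2(b, a));  // b - a < 0 means b is smaller
}
```

In this sketch the only difference between the two variants is the operand order passed to __hsub2. Because the choice depends on the sign bit of a subtraction rather than on a comparison, NaN inputs are not handled the way the built-in __hmax2 handles them.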
Date:2020-04-15