I recently released IGPRS, a tool to recover passwords for Apple iOS 4.x & 5.x and BlackBerry 5.x & 6.x backups, TrueCrypt containers and WPA/WPA2 handshakes. An x64 version of IGPRS was added today, with CPU AVX and XOP optimizations for the SHA-512 used in TrueCrypt containers.
The initial release back in February used my old approach to supporting AMD GPUs: CAL API calls and kernels written in IL. With the new GCN architecture, however, I ran into several problems. First, AMD removed the global buffer on GCN GPUs instead of emulating it via UAV (after all, it was not a problem to emulate UAV with the global buffer back in the 4xxx days). I had to waste some time figuring out how to deal with UAV, but OK, it is not that hard (don’t use INT_4, just INT_1, etc.). Later things got worse: with Catalyst 12.3 I got several random lockups with simple kernels, and I was not able to run the PBKDF2/SHA512 kernel for TrueCrypt at all; the system just locks up, no matter what. After several days of programming and debugging I got really annoyed by all this, decided to give up on CAL/IL and finally switched to OpenCL.
Things have gotten better since the last time I took a look at OpenCL. After a year (of very “hard” work, I guess) AMD made it possible to use BFI_INT and BIT_ALIGN_INT directly from OpenCL kernels (via bitselect() and amd_bitalign()). I was amazed at how easy it is to write GPU kernels for AMD cards now, with performance nearly the same as hand-written IL kernels… but I felt that way for a very short time.
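For readers wondering why bitselect() and amd_bitalign() matter for hash kernels, here is a minimal host-side sketch in plain C. It is not the IGPRS code: the helpers bitselect32() and bitalign32() are hypothetical stand-ins that model, as I understand them, the per-bit select of OpenCL's bitselect() and the 64-bit funnel shift of amd_bitalign(). The 32-bit SHA-2 Ch, Maj and rotate functions are used only as an illustration of how each collapses into a single such operation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of OpenCL bitselect(a, b, c): each result bit is taken
   from b where the mask c has a 1 bit, and from a where c has a 0 bit. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Hypothetical model of amd_bitalign(hi, lo, shift): shift the 64-bit value
   hi:lo right by (shift & 31) and keep the low 32 bits. */
static uint32_t bitalign32(uint32_t hi, uint32_t lo, uint32_t shift) {
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (shift & 31));
}

/* Textbook SHA-2 style definitions, several logic operations each. */
static uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (~x & z);
}
static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (x & z) ^ (y & z);
}
static uint32_t rotr_ref(uint32_t x, uint32_t n) {
    return (x >> n) | (x << ((32 - n) & 31)); /* valid for n in 0..31 */
}

/* The same functions expressed as one select or one funnel shift. */
static uint32_t ch_sel(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect32(z, y, x);      /* x chooses between y and z */
}
static uint32_t maj_sel(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect32(x, y, x ^ z);  /* where x and z differ, y decides */
}
static uint32_t rotr_align(uint32_t x, uint32_t n) {
    return bitalign32(x, x, n);       /* rotating is aligning x with itself */
}

/* Checks the identities on a few sample words and all rotation counts. */
static void check_identities(void) {
    const uint32_t s[] = { 0u, 0xFFFFFFFFu, 0x12345678u, 0xDEADBEEFu, 0x80000001u };
    const int n = (int)(sizeof s / sizeof s[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                assert(ch_sel(s[i], s[j], s[k]) == ch_ref(s[i], s[j], s[k]));
                assert(maj_sel(s[i], s[j], s[k]) == maj_ref(s[i], s[j], s[k]));
            }
    for (int i = 0; i < n; i++)
        for (uint32_t r = 0; r < 32; r++)
            assert(rotr_align(s[i], r) == rotr_ref(s[i], r));
}
```

Calling check_identities() from a test harness exercises all three identities. In an actual OpenCL kernel the two helpers would be replaced by the built-in calls, and whether the compiler really emits BFI_INT and BIT_ALIGN_INT for them still has to be checked in the generated ISA.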
I ran into nearly every kind of bug once I tried to implement more advanced algorithms: the AMD OpenCL compiler produces inefficient code, it simply locks up on complex kernels, it doesn’t know how to use the hardware capabilities of the GPUs properly, and some kernels (after “optimizations” done by the compiler) simply produce incorrect results. It even replaces vector calculations with scalar ones (trying to favor the GCN architecture, I guess), which results in very poor performance on VLIW4/5 GPUs. Now I can’t decide which is more annoying: fighting the OpenCL compiler and checking the intermediate IL/ISA in the hope of proper code generation, or still writing kernels in IL, because there you can at least control a bit more. Or was my old idea of writing my own GPU assembler for AMD GPUs (very time consuming, but) a much better thing to do after all?..
After I got a question about SHA-512 performance on my blog, I decided to take a closer look at the ISA produced by AMD’s OpenCL compiler and was totally disappointed with the results. More information about SHA-512 performance on CPUs and GPUs will be in my next post.
