So, one day after the previous post on April 16… (What? And two years later as well?! Can’t be!)
Should I get back to my earlier promise, “More information about SHA-512 performance on CPUs and GPUs will be in my next post”? Surprisingly enough, it’s still an interesting topic, as AMD has not fixed anything in int64 code generation for GPUs. I guess “stability” is everything for them: even though kernels using SHA-512 (i.e. password recovery for TrueCrypt) are about three times slower than they should be, it’s “don’t touch anything if it works”. So (again!) manual binary kernel hacking is required to get good performance. Or hooking into the .CL -> .IL -> ISA sequence and patching it on the fly. AMD GPU programming is always so-o-o FUN!
Or should I write about how 2013 was almost a waste, as some very weird (to say the least) people from ElcomSoft (a Russian company) tried to sue me (a Russian citizen) and my business partner (also a Russian citizen) in a US court, using forged documents in the process? They tried to use a US patent as a “steamroller” to bring the lawsuit to the US instead of Russia, but failed to do so. It’s actually a very interesting and long story. But nah, not for today.
So, let’s move to more pleasant themes. On February 18, 2014 NVIDIA announced a new architecture: Maxwell. While the GTX 750 and GTX 750 Ti are middle-range GPUs, they are quite interesting. I was able to purchase a 750 Ti OC version on April 1, ran some simple tests with it (welcome back, 4+ years old ighashgpu!) and was simply amazed. 640 ALUs running at 1.15 GHz show 2100 M/s for single MD5 and 918 M/s for single SHA1, thus beating the GTX580 and GTX680, and even questioning GTX780 performance! I’ve updated my GPU estimation page, but it was unclear to me what exactly NVIDIA changed in the architecture to make Maxwell that fast.
Apparently SHF (circular shift == AMD’s bitalign), implemented in Titan GPUs, is present in Maxwell as well (Titan being SM 3.5 and Maxwell 5.0, it’s no surprise). This boosts performance a lot, but MD5 hashing speed does not increase as much as SHA1 speed. I made an assumption that this is because AMD’s MD5 implementation uses bitselect (== BFI_INT) instructions a lot, while SHA1 does not depend on them as much. So I thought that Maxwell was missing bitselect() entirely but showed good performance with SHA1 because all 128 ALU units within an SMM can perform a circular shift.
Well, apparently I was wrong. And right, at some points. Recently NVIDIA released the “CUDA 6 Production Release” with updated documentation. According to it, an SMM (containing 128 ALUs) can perform “only” 64 shifts per clock cycle, the same amount as Titan’s SMX (which contains 192 ALUs). OK. Being clueless, I decided to look at the disassembly produced by cuobjdump.exe, starting with the MD5 kernel:
LOP3.LUT R18, R20, R18, R19, 0xac;
IADD R17, R9, R18;
SHF.L.W R17, R17, 0x11, R17;
IADD R17, R17, R20;
LOP3.LUT R21, R19, R17, R20, 0xb8;
IADD R18, R6, R21;
SHF.L.W R18, R18, 0x16, R18;
IADD R18, R18, R17;
LOP3.LUT R21, R20, R18, R17, 0xb8;
IADD3 R19, R2, R19, R21;
SHF.L.W R19, R19, 0x7, R19;
IADD R19, R19, R18;
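For reference, each LOP3/IADD/SHF.L.W/IADD group above is one MD5 round step. A minimal C sketch of what each instruction quartet computes (variable names are mine, not from the kernel):

```c
#include <stdint.h>

/* What SHF.L.W Rd, Ra, s, Ra does: a 32-bit rotate left by s (s != 0). */
static uint32_t rotl32(uint32_t v, unsigned s) {
    return (v << s) | (v >> (32 - s));
}

/* MD5's first-round non-linear function, written as bitselect. */
#define F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))

/* One MD5 round step, matching the four-instruction groups above:
   LOP3.LUT computes F(b,c,d) in one instruction, IADD/IADD3 accumulates
   a + F + x + t, SHF.L.W rotates, and the final IADD adds b. */
static uint32_t md5_step(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                         uint32_t x, uint32_t t, unsigned s) {
    a += F(b, c, d) + x + t;   /* LOP3.LUT + IADD (or IADD3 fusing two adds) */
    return rotl32(a, s) + b;   /* SHF.L.W + IADD */
}
```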
IADD3? LOP3.LUT?! What are these? Well, the first one is not much of a puzzle: an integer addition of 3 operands, placing the result into a 4th register. But is it done in one clock cycle? Really?
LOP3.LUT raised even more questions for me: what is a LUT doing here? It was a simple expression:
#define F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
which “translates” into bitselect, but why is a LUT used? Assuming it stands for “Look-Up Table”, and LOP3 for “logical operation with 3 operands”… Then it suddenly hit me: it really is a LUT :). We provide 3 inputs to some logic function, so there are 8 possible input combinations, and the output bit for each combination is coded directly in the instruction as a “truth table” in an 8-bit immediate! So no, Maxwell has no bitselect instruction. It has an even more powerful one: LOP3.LUT!
We can replace more than 3 logical instructions with a single LUT on Maxwell. And the SHA1 implementation is actually a perfect fit for LOP3.LUT: its transformation functions are coded as:
#define F_00_19(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
#define F_20_39(b,c,d) ((b) ^ (c) ^ (d))
#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))
#define F_60_79(b,c,d) F_20_39(b,c,d)
So, that’s 3 inputs and 1 output, even for F_40_59, which takes 5 instructions. One problem: the compiler does not recognize it :). SHA1 rounds 40 to 59 compiled into:
LOP.AND R31, R24, R18;
LOP.OR R17, R24, R18;
LOP3.LUT R10, R12, R25, R26, 0x96;
SHF.L.W R26, R24, 0x5, R24;
LOP3.LUT R17, R31, R16, R17, 0xf8;
So, it’s better than the previous code, because 5 instructions turned into 3 (AND + OR + LUT), but we don’t need 3 when one is enough. The solution? OK, let’s bit-hack the executable kernel, as was done with AMD :D.
Instead of (correct):
#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))
I’ve used (incorrect but “LUT-able”):
#define F_40_59(b,c,d) (((b) | (c)) & (d))
This define compiles into something like:
LOP3.LUT R30, R25, R26, R20, 0xe0;
And now we only need to replace the 0xe0 immediate (which represents (b | c) & d) with 0xe8 (the truth table for the correct F_40_59 function). For a single SHA1 hash there are exactly 20 places to apply this patch.
After these modifications, SHA1 kernel performance increased from 918M to 981M, almost 7%! Patching the pbkdf2/sha1 kernel provides a 4-5% speed-up. But some tuning is still required.
And this should apply to all MD/SHA-based hashing schemes, including (I guess, as I haven’t checked how the compiler acts in that case) SHA256. Meaning bitcoin mining, of course ;). Need to run some more tests with it.
All in all, the Maxwell chip is awesome. And as the GTX 750 is an “entry/mid level” GPU, it really raises the question: what will a top-end NVIDIA GPU based on the Maxwell chip show? If the GTX880 really contains 3200 CUDA cores (there are such speculations), it will be a bomb!
The Accent password recovery product line was recently updated to support Maxwell-based GPUs. No bit hacking of kernels was used though, so there is still room for improvement.