## x4. Now for NVIDIA

So, after discovering the amazing LOP3.LUT (and a short discussion at NVIDIA’s devtalk forum, where I filed an RFE), I decided to take a closer look at Maxwell’s MD5 performance. I got a solid 2.2 billion hashes per second with ighashgpu, but in reality that doesn’t match expectations!

With MD5 defines like:

#define F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
#define G(b,c,d) ((((b) ^ (c)) & (d)) ^ (c))
#define H(b,c,d) ((b) ^ (c) ^ (d))
#define I(b,c,d) (((~(d)) | (b)) ^ (c))

#define R0(a,b,c,d,k,s,t) { \
a+=((k)+(t)+F((b),(c),(d))); \
a=ROTATE(a,s); \
a+=b; };

It’s obvious that we need only four instructions per round (LOP3.LUT + IADD3 + SHF + IADD). It is not possible to get rid of the last IADD for MD5 (which would reduce the whole calculation to three instructions), but it will work for MD4 (== NTLM), as I’ll show later. NVIDIA’s documentation states that one Maxwell SMM can handle 64 shifts per clock cycle, so we can assume that computing 128 SHFs (one per ALU) takes 2 clock cycles. That means 5 clock cycles per round, and we need only 45 rounds (plus several additional instructions for comparisons/block updates) to check one MD5 hash (with the “reversed” method).

So, ideally the GTX 750 Ti should show an MD5 speed of:

640 SP * 1150 MHz / (45*5) ~= 3.27 B/s
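The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and variable names are made up for illustration, and the per-instruction rates (128 per clock for LOP3.LUT/IADD3/IADD, 64 per clock for SHF) are the figures assumed in the text, not measurements.

```python
def peak_hashes_per_second(cores, clock_hz, rounds, cycles_per_round):
    """Upper bound, assuming every core finishes one round every `cycles_per_round` clocks."""
    return cores * clock_hz / (rounds * cycles_per_round)

# Clocks an SMM (128 ALUs) needs to issue one MD5 round for 128 threads:
# LOP3.LUT, IADD3 and IADD at 128 per clock, SHF at only 64 per clock.
md5_cycles_per_round = 1 + 1 + 128 / 64 + 1   # = 5.0

gtx750ti_md5 = peak_hashes_per_second(640, 1150e6, 45, md5_cycles_per_round)
print(f"{gtx750ti_md5 / 1e9:.2f} B/s")   # prints "3.27 B/s"
```

The same function can be fed other round counts and per-round costs, for example 29 rounds at 4 clocks each for MD4.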

But I got “only” 2.2 B/s. Not good. OK, let’s look at the disassembly:

SHF.L.W R15, R15, 0x7, R15;
MOV32I R21, 0x8b44f7af;

LOP3.LUT R19, R17, R15, R18, 0xb8;
SHF.L.W R16, R16, 0xc, R16;

LOP3.LUT R19, R18, R16, R15, 0xb8;

SHF.L.W R17, R17, 0x11, R17;
MOV32I R20, 0x895cd7be;

LOP3.LUT R19, R15, R17, R16, 0xb8;
SHF.L.W R18, R18, 0x16, R18;

Apparently IADD3 cannot use 32-bit immediates. So the whole sequence became “MOV reg, 32-bit imm / IADD3”, which makes the IADD3 optimization pointless here. Also notice the “IADD3 R17, R17, 0xf5bb1, R19” line. Apparently the value 0xffff5bb1 was encoded directly within the instruction, which suggests that IADD3 supports only 20-bit immediates (with sign extension).
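To make the sign-extension guess concrete, here is a small sketch. The 20-bit width is the hypothesis from the paragraph above, not a documented fact, and the helper names are invented for this example.

```python
def sign_extend(value, bits):
    """Treat the low `bits` bits of `value` as two's complement; return the 32-bit unsigned result."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return ((value ^ sign_bit) - sign_bit) & 0xFFFFFFFF

def fits_signed_imm(const32, bits=20):
    """True if a 32-bit constant survives truncation to `bits` bits plus sign extension."""
    return sign_extend(const32, bits) == const32

print(hex(sign_extend(0xF5BB1, 20)))   # 0xffff5bb1
print(fits_signed_imm(0xFFFF5BB1))     # True: can be encoded inside IADD3
print(fits_signed_imm(0x8B44F7AF))     # False: needs a separate MOV32I
```

Under this assumption a constant can live inside the instruction only when its upper 13 bits are all copies of bit 19, which is rare for MD5 round constants.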

Solution? OK, I implemented a vectorized version of MD5 for NVIDIA SM 2.1 GPUs a long time ago. Reusing this code for SM 5.0 results in:

OK, looks promising. Now the constant is moved to a register once and (re)used with IADD3 twice. So we need to increase this ratio. Going x4 with NVIDIA took some time (mostly because I’d totally forgotten how ighashgpu works :)), but results in:

Yes, it looks better now. There is still one load instruction per four IADD3s, so we need about 45*5.25 = 236.25 clock cycles (plus comparisons and block updates), and 3049M looks really close to the expected value (~3100M). I haven’t tried an x8 version because even with the x4 scheme register pressure is already high enough to cause spills.
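The effect of sharing one constant load between several hashes can be sketched as follows. It assumes, as the 45*5.25 figure does, that each MOV32I costs one extra clock and is amortized over `width` hashes; the function names are illustrative only.

```python
def md5_cycles_per_hash(width, rounds=45):
    """Clocks per MD5 when `width` interleaved hashes share one MOV32I per round.
    Assumes 5 clocks of real work per round plus 1 clock for the constant load."""
    return rounds * (5 + 1 / width)

def peak_rate(width, cores=640, clock_hz=1150e6):
    """Upper-bound hash rate for the GTX 750 Ti figures used in the text."""
    return cores * clock_hz / md5_cycles_per_hash(width)

for width in (1, 2, 4):
    print(width, md5_cycles_per_hash(width), round(peak_rate(width) / 1e6), "M/s")
```

For x4 this gives 236.25 clocks and a bound of roughly 3.1 B/s, before counting comparisons and block updates.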

Another interesting thing is that the results for AMD GPUs (VLIW4/5 + BIT_ALIGN + BFI_INT and GCN ones) are also very close to these 240-250 clock cycles per single MD5. This happens because BIT_ALIGN is performed at a rate of one instruction per clock, the logic functions are replaced by BFI_INT (aka bitselect), and some additions with zeros are simply optimized out. So effectively performance is nearly the same: 5-6 clock cycles per round.

Now moving on to MD4. Of course, it is only important these days because of NTLM, used in the Windows family. The password is converted into Unicode and hashed (only a single iteration) with MD4. Way too simple a scheme, even for ten-plus-year-old software.
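For reference, the NTLM scheme just described can be sketched on the CPU in pure Python. This is a plain, unoptimized MD4 following RFC 1320 (all 48 rounds, no “reversing”), not the GPU kernel discussed here; “Unicode” is taken to mean UTF-16LE, and the function names are chosen for this example.

```python
import struct

MASK = 0xFFFFFFFF

def rotl(x, s):
    """32-bit circular left shift (what SHF.L.W does on the GPU)."""
    return ((x << s) | (x >> (32 - s))) & MASK

# (round function, additive constant, shift amounts, message word order)
ROUNDS = (
    (lambda b, c, d: ((c ^ d) & b) ^ d, 0x00000000, (3, 7, 11, 19),
     tuple(range(16))),
    (lambda b, c, d: (b & c) | (b & d) | (c & d), 0x5A827999, (3, 5, 9, 13),
     (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)),
    (lambda b, c, d: b ^ c ^ d, 0x6ED9EBA1, (3, 9, 11, 15),
     (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15)),
)

def md4(data: bytes) -> bytes:
    bit_len = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
    # Padding: 0x80, zeros up to 56 mod 64, then the 64-bit little-endian bit length.
    data += b"\x80"
    data += b"\x00" * ((56 - len(data)) % 64)
    data += struct.pack("<Q", bit_len)
    state = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    for off in range(0, len(data), 64):
        x = struct.unpack("<16I", data[off:off + 64])
        a, b, c, d = state
        for func, const, shifts, order in ROUNDS:
            for i, k in enumerate(order):
                a = rotl((a + func(b, c, d) + x[k] + const) & MASK, shifts[i % 4])
                a, b, c, d = d, a, b, c   # rotate the register roles
        state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d))]
    return struct.pack("<4I", *state)

def ntlm(password: str) -> str:
    """NTLM hash: one MD4 iteration over the UTF-16LE encoded password."""
    return md4(password.encode("utf-16-le")).hex()

print(ntlm("password"))
```

Note that each step is just a logic function, an addition and a rotate, with no trailing `a += b` as in MD5.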

MD4’s definitions make it a perfect algorithm for Maxwell GPUs:

#define MD4_F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
#define MD4_G(b,c,d) (((b) & (c)) | ((b) & (d)) | ((c) & (d)))
#define MD4_H(b,c,d) ((b) ^ (c) ^ (d))

#define MD4_R0(a,b,c,d,k,s,t) { \
a+=((k)+(t)+MD4_F((b),(c),(d))); \
a=ROTATE(a,s); };

It’s just LOP3.LUT + IADD3 + SHF. Also, the constant value is reused for 16 rounds (unlike MD5, where each round requires a separate one). With the same “reversing” method we only need to perform 29 rounds of MD4, which leads us to a theoretical value of:

640 SP * 1150 MHz / (29*4) ~= 6.345 B/s

But ighashgpu shows “only” 4 billion. Why?.. The disassembly of the MD4_F rounds looks perfect:

SHF.L.W R28, R28, 0xb, R28;
LOP3.LUT R30, R27, R28, R26, 0xb8;

SHF.L.W R29, R29, 0x13, R29;
LOP3.LUT R32, R26, R29, R28, 0xb8;

SHF.L.W R30, R27, 0x3, R27;
LOP3.LUT R27, R28, R30, R29, 0xb8;

SHF.L.W R27, R26, 0x7, R26;
LOP3.LUT R26, R29, R27, R30, 0xb8;

Exactly three instructions per round. But the situation with the MD4_G function repeats the situation with SHA1 that I described in the previous post: while it’s possible to replace the whole construction with a single LOP3.LUT, the compiler can’t recognize it, so we have:

MOV32I R26, 0x5a827999;
LOP3.LUT R33, R32, R30, R31, 0xf8;

LOP.OR R30, R31, R29;
LOP.AND R33, R31, R29;

SHF.L.W R28, R28, 0x3, R28;
LOP3.LUT R32, R33, R30, R28, 0xf8;

LOP.OR R30, R28, R31;
LOP.AND R32, R28, R31;
SHF.L.W R27, R27, 0x5, R27;

LOP3.LUT R30, R32, R30, R27, 0xf8;
LOP.OR R30, R27, R28;

5 instructions instead of 3. After applying a patch to remove these unneeded OR+AND instructions, performance is still below the expected value: around 4.5-5 B/s. I suspect this is caused by the NTLM function body being way too short (29*3 == 87 instructions), so flow control instructions seriously affect performance. It was natural to try the x4 scheme with NTLM, and it ends up as:

Yes, I can live with these 5.9 B/s. Although there is room for improvement, of course.

So, conclusions? OK. The Maxwell architecture is simply the best among all GPUs (and yes, that includes AMD ones, because constant problems with AMD’s buggy software make fast hardware pointless). At least for hashing/cryptography purposes. And because of that I’ve got the feeling that NVIDIA won’t release any top-end Maxwell-based GPU anytime soon, because they need to get rid of the GK110 chips first. \$3K for Titan Z? Astonishing!

I’ve updated the GPU estimation table based on these new results and added new cards. The results for NVIDIA SM 3.0 based GPUs were also corrected; it seems they were too low for SHA-1 based algorithms.

As an interesting addition to all of the above, Passcovery Suite 3.0 has been released recently with GPU acceleration for PDF /R 3 and /R 4 (aka Acrobat 6-8 compatible protection) passwords. And the GTX 750 Ti is just 17% slower than the AMD 7970. Why such a small difference? Well, RC4 performance (which is the bottleneck of PDF password validation) is a totally different thing. It mainly depends on how fast the GPU’s local memory is and how it’s organized. But I’ll keep this story for another time.

## Hello, world!

So, one day after the previous post of April 16… (What? And two years later as well?! Can’t be!)

Should I get back to my previous promise, “More information about SHA-512 performance on CPUs and GPUs will be in my next post”? Surprisingly enough, it’s still an interesting topic, as AMD has not fixed anything in int64 code generation for GPUs. I guess “stability” is everything for them: even if kernels using SHA-512 (i.e. password recovery for TrueCrypt) are about three times slower than they should be, “don’t touch anything if it works”. So (again!) manual binary kernel code hacking is required to get good performance. Or hooking into the .CL -> .IL -> ISA sequence and patching it on the fly. AMD GPU programming is always so-o-o FUN!

Or should I write about why 2013 was almost a waste, as way too weird (to say the least) people from ElcomSoft (a Russian company) tried to sue me (a Russian citizen) and my business partner (also a Russian citizen) in a US court, using forged documents in the process? They tried to use a US patent as a “steamroller” to bring the lawsuit to the US instead of Russia, but failed to do so. It’s actually a very interesting and long story. But nah, not for today.

So, let’s move on to more pleasant themes. On February 18, 2014 NVIDIA announced a new architecture: Maxwell. While the GTX 750 and GTX 750 Ti are mid-range GPUs, they are quite interesting. I was able to purchase a 750 Ti OC version on April 1, ran simple tests with it (welcome back, 4+ years old ighashgpu!) and was simply amazed. 640 ALUs running at 1.15 GHz show 2100 M/s for single MD5 and 918 M/s for single SHA1. That beats the GTX580 and GTX680 and even challenges the GTX780! I updated my GPU estimation page, but it was unclear to me what exactly NVIDIA had changed in the architecture to make Maxwell that fast.

Apparently SHF (circular shift == AMD’s bitalign), implemented in Titan GPUs, is present in Maxwell as well (as Titan is SM 3.5 and Maxwell 5.0, that’s no surprise). This boosts performance a lot, but MD5 hashing speed did not increase as much as SHA1 speed. I assumed this is because AMD’s MD5 implementation uses bitselect (== BFI_INT) instructions a lot, while SHA1 doesn’t depend on them as much. So I thought that Maxwell lacks bitselect() altogether but shows good performance with SHA1 because all 128 ALUs within an SMM can perform a circular shift.

Well, apparently I was wrong. And right at some point. Recently NVIDIA released the “CUDA 6 Production Release” with updated documentation. According to it, an SMM (containing 128 ALUs) can perform “only” 64 shifts per clock cycle. The same amount as Titan’s SMX (which contains 192 ALUs). OK. Being clueless, I decided to look at the disassembly produced by cuobjdump.exe, starting with the MD5 kernel:

LOP3.LUT R18, R20, R18, R19, 0xac;
SHF.L.W R17, R17, 0x11, R17;

LOP3.LUT R21, R19, R17, R20, 0xb8;

SHF.L.W R18, R18, 0x16, R18;
LOP3.LUT R21, R20, R18, R17, 0xb8;

SHF.L.W R19, R19, 0x7, R19;

IADD3? LOP3.LUT?! What is this? Well, the first one is not much of a puzzle: integer addition of 3 operands, with the result placed into a 4th. But is it done in one clock cycle? Really?

LOP3.LUT raises even more questions for me: what is a LUT doing here? It was a simple expression:

#define F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))

Which “translates” into bitselect, but why is a LUT used? Assuming it’s a “Look Up Table”, and LOP3 is a “logical operation with 3 operands”. Then it suddenly hit me: it really is a LUT :). We provide 3 inputs to some logic function, so there are 8 possible input combinations, each with its own output bit. And these outputs are encoded directly in the instruction as a “truth table” in an 8-bit immediate! So no, Maxwell has no bitselect instruction. It has the even more powerful LOP3.LUT!
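The truth-table immediate can be derived mechanically. The sketch below uses the convention NVIDIA documents for the PTX `lop3.b32` instruction (evaluate the function on the constants 0xF0, 0xCC and 0xAA); the helper name is invented, and the operand orders shown are inferred from the immediates in the listings rather than taken from any documentation.

```python
def lop3_immediate(func):
    """8-bit truth table of a 3-input bitwise function.
    The constants 0xF0, 0xCC, 0xAA enumerate all 8 input combinations, one per bit."""
    return func(0xF0, 0xCC, 0xAA) & 0xFF

def F(b, c, d):
    return ((c ^ d) & b) ^ d   # MD5 F, i.e. bitselect: b ? c : d

print(hex(lop3_immediate(F)))                            # 0xca, operands passed as (b, c, d)
print(hex(lop3_immediate(lambda d, b, c: F(b, c, d))))   # 0xb8, operands passed as (d, b, c)
print(hex(lop3_immediate(lambda x, y, z: x ^ y ^ z)))    # 0x96, three-way XOR
```

The same function yields different immediates depending on which register lands in which operand slot, which is why bitselect shows up as both 0xb8 and 0xac in the disassembly.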

We can replace more than 3 logical instructions with a single LUT on Maxwell. And the SHA1 implementation is actually perfect for LOP3.LUT, with transformation functions coded as:

#define F_00_19(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
#define F_20_39(b,c,d) ((b) ^ (c) ^ (d))
#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))
#define F_60_79(b,c,d) F_20_39(b,c,d)

So, that’s 3 inputs and 1 output, even for F_40_59, which contains 5 instructions. One problem is that… the compiler does not recognize it :). SHA1 rounds 40 to 59 compile into:

LOP.AND R31, R24, R18;
LOP.OR R17, R24, R18;

LOP3.LUT R10, R12, R25, R26, 0x96;
SHF.L.W R26, R24, 0x5, R24;
LOP3.LUT R17, R31, R16, R17, 0xf8;

So, it’s better than the previous code, because 5 instructions are transformed into 3 (AND + OR + LUT), but we don’t need 3 when one is enough. Solution? OK, let’s bit-hack the executable kernel, as was done with AMD :D. Instead of:

#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))

I’ve used (incorrect but “LUT-able”):

#define F_40_59(b,c,d) (((b) | (c)) & (d))

This define compiles into something like:

LOP3.LUT R30, R25, R26, R20, 0xe0;

And now we only need to replace the 0xe0 immediate (which represents (b | c) & d) with 0xe8 (the truth table for the correct F_40_59 function). For a single SHA1 hash there are exactly 20 places to apply this patch.
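The patch can be sanity-checked with a small software emulation of a 3-input LUT. This is a sketch, not NVIDIA’s definition of the instruction: the `lop3` helper and the (d, b, c) operand order are assumptions inferred from the immediates shown above.

```python
import random

def lop3(a, b, c, imm):
    """Emulate a 3-input LUT on 32-bit words: at every bit position the three
    input bits form an index (a is the high bit) into the 8-bit table `imm`."""
    out = 0
    for bit in range(32):
        idx = ((a >> bit) & 1) << 2 | ((b >> bit) & 1) << 1 | ((c >> bit) & 1)
        out |= ((imm >> idx) & 1) << bit
    return out

def f_40_59(b, c, d):
    return (b & c) | ((b | c) & d)

rng = random.Random(1)
for _ in range(1000):
    b, c, d = (rng.getrandbits(32) for _ in range(3))
    # The "LUT-able" stand-in define, as compiled:
    assert lop3(d, b, c, 0xE0) == (b | c) & d
    # The same instruction after patching the immediate:
    assert lop3(d, b, c, 0xE8) == f_40_59(b, c, d)
    # The unpatched compiler output: AND + OR feeding a LUT with 0xf8.
    assert lop3(b & c, d, b | c, 0xF8) == f_40_59(b, c, d)
print("0xe8 reproduces F_40_59 in a single LUT")
```

Since 0xe8 is the majority function, which is symmetric in its inputs, the patched immediate is correct regardless of operand order; only the 0xe0 stand-in depends on it.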

After these modifications SHA1 kernel performance increased from 918M to 981M, almost 7%! Patching the pbkdf2/sha1 kernel provides a 4-5% speed-up. But some tuning is still required.

And this should apply to all MD/SHA-based hashing schemes. Including (I guess; I haven’t seen how the compiler acts in that case) SHA256. Meaning bitcoin mining, of course ;). I need to run some more tests with it.

All in all, the Maxwell chip is awesome. And as the GTX 750 is an “entry/mid level” GPU, it really raises the question: what will a top-end NVIDIA GPU based on the Maxwell chip show? If the GTX880 contains 3200 CUDA cores (there are such speculations), it will be a bomb!

The Accent password recovery product line was recently updated to support Maxwell-based GPUs. No bit hacking of kernels was used though, so there is still room for improvement.

Recently I released IGPRS, a tool to recover passwords for Apple iOS 4.x & 5.x and BlackBerry 5.x & 6.x backups, TrueCrypt containers and WPA/WPA2 handshakes. The IGPRS x64 version was added today, with CPU AVX and XOP optimizations for the SHA-512 used in TrueCrypt containers.

The initial release back in February used my old approach to supporting AMD GPUs: CAL API calls and kernels written in IL. However, with the new GCN architecture I faced several problems. First, AMD removed the global buffer for GCN GPUs (instead of emulating it via UAV; after all, it was not a problem to emulate UAV with the global buffer back in the 4xxx days). I was forced to waste some time figuring out how to deal with UAV, but OK, it is not that hard (don’t use INT_4, just INT_1, etc.). Later, however, things became worse: with Catalyst 12.3 I got several random lock-ups with simple kernels, and I was not able to run the PBKDF2/SHA512 kernel for TrueCrypt at all. The system just locks up, no matter what. After several days of programming and debugging I got really annoyed by all these things, decided to give up on CAL/IL and finally switched to OpenCL.

Things have gotten better since the last time I took a look at OpenCL: after a year (of very “hard” work, I guess) AMD made it possible to use BFI_INT and BIT_ALIGN_INT directly from OpenCL kernels (via bitselect() and amd_bitalign()). I was amazed how easy it is to write GPU kernels for AMD cards now, while their performance is nearly the same as hand-written IL kernels… but I felt that way for a very short time :D.

I faced nearly every kind of bug once I tried to implement more advanced algorithms: the AMD OpenCL compiler produces inefficient code, it simply locks up on complex kernels, it doesn’t know how to use the hardware capabilities of GPUs properly, and some kernels (after “optimizations” done by the compiler) simply produce incorrect results. It even replaces vector calculations with scalar ones (trying to favor the GCN architecture, I guess), which results in very poor performance on VLIW4/5 GPUs. Now I can’t decide which is more annoying: fighting the OpenCL compiler and checking the intermediate IL/ISA in the hope of proper code generation, or still writing kernels in IL, because there you can at least control a bit more. Or was my old idea of writing my own GPU assembler for AMD GPUs (very time-consuming, but) a much better thing to do after all?..

After I got a question about SHA-512 performance on my blog, I decided to take a closer look at the ISA produced by AMD’s OpenCL compiler and was totally disappointed with the results. More information about SHA-512 performance on CPUs and GPUs will be in my next post.

Posted in GPU programming, Password Recovery | 55 Comments

## Another Big One

Almost a year ago I wrote a post about the 5970, and this week I finally grabbed a 6990 for my own tests. As the title states:

Same ruler, almost the same size as the 5970. I already had several test results from 6990 owners, and they looked kind of weird: while the 6990 was faster than the 5970, it was still slower than I expected. My own first tests produced values like:

That’s 10% slower. I tested with SHA1 kernels, SL3, Office 07-10, WPA: everything was slower than expected. I grabbed my old program for measuring the GFLOPS of ATI GPUs and started a series of experiments. Apparently, lowering the GPU core frequency results in “closer to estimations” performance. My first guess was that there is internal throttling in the 6990, so overheating causes the performance drop. I even posted about this in the official forum, but some more experiments revealed that I wasn’t totally right. The answer was pretty simple:

Yep, by default the 6990 isn’t given enough power to work at 100% performance! This adjustment results in:

Thus, power usage for the 2nd core must also be tweaked, and we’ll see:

At last, the value I was expecting! Apparently a 5 s running time is not good enough for precise measurements, so I increased the charset size and, well, the hardware monitoring value too, as the card reaches 90C in no time (and 95C follows very soon).

Mystery solved. My several months old estimations were correct.

If you’re going to repeat the above steps with your 6990, make sure you have proper cooling and a proper PSU, as it looks like the official 375W TDP can easily become 450W, and this means A LOT of heat you need to deal with somehow.

Posted in GPU programming, Hash cracking | 40 Comments

## Another GPU?.. Ha!

Well, when I was buying a guitar I just couldn’t resist buying a tambourine. I’m not sure about the English-speaking community, but in Russian, tambourine and IT are very closely related :D.

And yes, the 6990 is going to be released soon, but at first sight it won’t be anything revolutionary (except for power usage, 450W!). The good ol’ 5970 is still an awesome option; by first estimates the 6990 will be just about 15-20% faster than the 5970.

But, anyway, who cares about GPUs, my next aim is… DRUMS!

Posted in Uncategorized | 62 Comments

## About WPA and Amazon “clouds”

For a change, this post was written in Russian. A short note caught my eye: http://habrahabr.ru/blogs/infosecurity/111488/.

The original is here, although the real original should actually be somewhere at Black Hat. Naturally, everyone worked in the spirit of the telephone game: the further along, the less truth.

The gist: Amazon has started renting out servers with GPUs. A machine with 2x Tesla M2050 costs \$2.1/hour. What is a Tesla M2050 in terms of hashing/cracking? It’s a GTX470 with a slower shader clock, 3 GB of memory, ECC support and full-speed double precision floating point. None of that is needed for hashing, yet it drives the price up to \$2600 apiece. Well, scientists will pay that kind of money, since for serious scientific computing Tesla simply has no competitors (among GPUs). But using a Tesla for hash cracking… is like buying a Ferrari for freight hauling: it has more power than a KamAZ truck and costs more, so surely it must haul furniture faster! Definitely! The table with brute-force speeds on various cards is still in the same place.

OK, suppose the final cost of the solution doesn’t matter that much to us and we only care about the speed of the remote system. The assumption is that we sit in the bushes with a low-powered laptop, capture a WPA handshake with it, send it to a remote server (Amazon or some other) and wait for the result.

Thomas Roth claims he got a speed of 400K/second while paying 28 cents per minute. That is, 0.28*60 = \$16.8/hour, and \$16.8 / \$2.1 = 8 systems rented. Or 16x Tesla M2050. According to the table mentioned above, the speed matches the calculated one quite well: 25K per Tesla. He goes on to say that cracking a neighboring network (in the original I didn’t notice anything about the network being protected by security professionals, just “protected network in his neighborhood”) took 20 minutes, which he later reduced to 6 minutes. OK, in 6 minutes he went through 6*60*400000 = 144M passwords. That range can be covered in an hour on a single ATI 5770 costing \$130. But that’s not the point.

144M passwords is about half of the “all lowercase Latin letters, 6 symbols long” range (which is 308.9M). So it is quite likely that he really did brute-force some neighboring WiFi network with a password like “miguel”. But what do you call a person who sets such passwords?..

With a more or less decent choice (upper + lowercase Latin letters + digits, 8 symbols long) the search range is 62^8 = 218 340 105 584 896 passwords. Or 17.3 years of brute-forcing at 400K per second. Or \$2.5 million for renting the Teslas for that whole time. A breakthrough, no doubt about it!

Going further: suppose there are men in grey who are keenly interested in cracking WiFi networks. These people are smart and use the most efficient GPUs for cracking, the ATI 5970. The largest 5970-based GPU farm (not Fermi :)) I know of has 200 cards. Accordingly, at full load they will show a speed of about 200 * 131 000 ~= (rounding to account for overhead) 25M PMK/s. The power consumption of the whole farm will be somewhere near 100 kW.

The same “normal” 8-symbol password would require 62^8/25M = 101 days of brute-forcing. The cost of the electricity burned has to be calculated from local rates. Not to mention separate cooling.
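The numbers in this post are easy to re-check. The sketch below only repeats the arithmetic from the text; the function names are invented, and the rates and prices are the ones quoted above.

```python
SECONDS_PER_DAY = 86_400

def keyspace(charset_size, length):
    """Number of candidate passwords of exactly `length` symbols."""
    return charset_size ** length

def brute_force_days(candidates, rate_per_second):
    """Days needed to try every candidate at a constant rate."""
    return candidates / rate_per_second / SECONDS_PER_DAY

tried = 6 * 60 * 400_000      # 144,000,000 passwords in 6 minutes at 400K/s
lower6 = keyspace(26, 6)      # 308,915,776: lowercase Latin, 6 symbols
strong8 = keyspace(62, 8)     # 218,340,105,584,896: upper + lower + digits, 8 symbols

amazon_days = brute_force_days(strong8, 400_000)
amazon_cost = amazon_days * 24 * 16.8           # 8 rented systems at $2.1/hour each
farm_days = brute_force_days(strong8, 25_000_000)

print(f"covered {tried / lower6:.0%} of the 6-symbol lowercase range")
print(f"Amazon: {amazon_days / 365.25:.1f} years, about ${amazon_cost / 1e6:.1f}M")
print(f"200x 5970 farm: {farm_days:.0f} days")
```

Each extra symbol multiplies the work by the charset size, so two more characters turn minutes into years.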

So what conclusions follow from all this? The main one: set decent passwords on your WiFi and no crackers will scare you. And thinking with your own head is also damn useful, since the PR folks never sleep…

Happy New Year to everyone, and good luck in it!

Posted in Hash cracking, In Russian, WPA | 16 Comments

## Happy New Year

It was incorrect to end this year with a post like my previous one.

So Happy New Year to everyone!

However, I can’t leave a post without any information related to the blog’s tagline. So here is some information I found interesting in the last few days.

1. I got a question recently about the AVX extension for upcoming CPUs, checked the latest docs available and suddenly found that (quote) “VI: “Vector Integer” instructions are not promoted to 256-bit” applies to every instruction needed for MD4/MD5/SHA1 hashes. This means AVX will be as useless as SSE was for password cracking, as the vector size increased from 128 to 256 bits only for floating point values. Simply ridiculous.

2. Somehow I read the disassembly of Cayman’s ISA as if it were capable of doing 32-bit integer multiplications with each of the XYZW units. Actually I was wrong: in reality all 4 of these units are now required to perform one 32-bit multiplication. So with the previous architecture it was possible to perform 4x additions/logic/bit-aligns AND a multiplication, and now a multiplication requires 4x more instructions. Not very good for classic ZIP encryption…

3. …which is currently not supported for ATI GPUs at all, because they are missing some functionality present in NVIDIA GPUs. So right now AccentZPR (v2.0 final was released recently) shows millions of passwords per second for NVIDIA GPUs and ZERO for ATI ones. A good counterexample to “Oh, ATI’s GPUs are so good and fast and cheap”…

4. For SHA-1 with a fixed charset (let’s say 10 symbols) and a fixed password length (like 15 + a 9-byte salt) it’s possible to optimize the algorithm a lot. I got 800M/s compared with ighashgpu v0.80’s 680M/s on a single 5770.

41 Comments

## Спи\$дили (Russian slang, roughly “They stole it”)

Pardon my French.

I understand that many people disrespect ighashgpu’s license agreement (and so, in fact, disrespect me) by using it in a commercial environment. It clearly states that it’s free only for personal, non-commercial use, but nobody cares.

However, that is nothing compared with some motherfu\$kers who took ighashgpu, removed all copyright notices, included it in their package and started selling it as the WORLD FIRST GPU SOLUTION FOR SL3 UNLOCKING. Seriously, are they that brain-damaged?..

I had been planning to release a SHA-1/SL3 version for some time already, as I’m constantly being asked about it (well, it’s just a single SHA-1 iteration, so surely ighashgpu is ideal for this), but now… what’s the point after all?

***

One more screenshot: a dump of the sl3bf.exe contained in the MX-KEY package:

So “many” changes, they even renamed ighashgpu to mxhashgpu. Seriously, did they think nobody would notice this?!

***

More updates. Inside sl3bf.exe, at offset 0xd10f4, starts… the ighashgpu executable. Simple as that.

Comparing ighashgpu.exe v0.80.16.1 with data inside sl3bf.exe:

[ighashgpu.exe] 524800 bytes
[ighashgpu_inside_sl3bf.exe_at_0xd10f4_offset] 522752 bytes
00000111 02 ( ) da (Ú)
00000112 06 ( ) 05 ( )
0000017c 88 (ˆ) 10 ( )
0000017d 09 ( ) 00 ( )
000002b8 88 (ˆ) 10 ( )
000002b9 09 ( ) 00 ( )
000002c1 0a ( ) 02 ( )
000002ed da (Ú) d2 (Ò)
00024870 69 (i) 6d (m)
00024872 67 (g) 78 (x)
00024924 69 (i) 6d (m)
00024926 67 (g) 78 (x)
0007c010 2a (*) 20 ( )
0007c012 2a (*) 20 ( )
0007c014 2a (*) 20 ( )
0007c016 2a (*) 20 ( )
0007c018 2a (*) 20 ( )

0007d966 54 (T) 00 ( )
0007d968 72 (r) 00 ( )
0007d96a 61 (a) 00 ( )
0007d96c 6e (n) 00 ( )
0007d96e 73 (s) 00 ( )
0007d970 6c (l) 00 ( )
0007d972 61 (a) 00 ( )
0007d974 74 (t) 00 ( )
0007d976 69 (i) 00 ( )
0007d978 6f (o) 00 ( )
0007d97a 6e (n) 00 ( )
0007d980 09 ( ) 00 ( )
0007d981 04 ( ) 00 ( )
0007d982 b0 (°) 00 ( )
0007d983 04 ( ) 00 ( )
1553 bytes out of 522752 are different
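A listing in this offset/old byte/new byte format can be produced by a plain byte-wise comparison. The sketch below is an illustration only: it is not the tool that generated the listing above, the function names are made up, and the sample buffers are short stand-ins for the two executables.

```python
def byte_diff(a: bytes, b: bytes):
    """Yield (offset, byte_a, byte_b) wherever the buffers differ.
    Only the overlapping prefix is compared."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield offset, x, y

def format_diff(a: bytes, b: bytes):
    """Render the differences as 'offset old new' lines in hex."""
    return [f"{off:08x} {x:02x} {y:02x}" for off, x, y in byte_diff(a, b)]

# Stand-in data; for the real check one buffer would be ighashgpu.exe and the
# other the contents of sl3bf.exe starting at offset 0xd10f4.
sample_a = b"ighashgpu v0.80"
sample_b = b"mxhashgpu v0.80"
for line in format_diff(sample_a, sample_b):
    print(line)
```

On the stand-in data this reports the same 69 to 6d and 67 to 78 substitutions (“ig” to “mx”) that appear in the listing.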

Full diff file is here. Copyrights removed, ig changed to mx… Awesome work indeed!

***

27-Dec-2010 update.

To summarize everything:

1. ighashgpu’s code was stolen by the MX-KEY authors. The evidence is above.

2. They clearly understand that they are violating ighashgpu’s license agreement; that’s why they removed all copyright notices. A totally stupid move on their part.

3. It is OK to use ighashgpu in SL3 unlocking solutions if that doesn’t violate ighashgpu’s license agreement. And this means:

a) you can’t use it in a commercial environment without a separate agreement with the copyright holder (so you can’t charge money for GPU brute-forcing itself, or use it in clusters that sell the results of GPU brute-forcing performed by ighashgpu)

b) you can’t modify any part of the executable file or other files contained in ighashgpu’s distribution package.

c) if you’re including ighashgpu.exe in your package, you must include ighashgpu’s license agreement in your package as well.

Right now I’m totally OK with CycloneBox’s implementation of local SL3 unlock (except for the non-included license agreement part).

This post is locked for comments; if you want to contact me, you should know how to find me.

Posted in Hash cracking, Password Recovery | 42 Comments

## x4, x5, x6…

So, if you’ve read the comments to the previous post, you know that there are some major speed-ups for MD5: processing 5 hashes in a single thread, using BFI_INT to replace 3 instructions with 1 (kind of the same optimization as the _rotl to bitalign change), etc. However, implementing BFI_INT lowered the total number of instructions required to perform a single MD5, and so the utilization percentage dropped again. But, as I wrote yesterday on the ATI devforums, there are reserves: we can process 6x MD5 per thread despite the fact that it’s VLIW5. Tried it, and yes, another speed-up is possible, though this time it’s just ~3%. However, x7 and x8 also look like good candidates to test.

Fresh version is here.

Also, utilization percentage looks interesting:

As you can see, for the newest Caymans with VLIW4 it makes nearly no difference how hashes are processed. For RV770 we reached the peak with x5, and x6 can’t significantly change this, as 98.4% is a really huge value. For RV870 there are still some options available, though the T unit of RV870’s VLIW cannot accept either BIT_ALIGN_INT or BFI_INT; that’s why, I guess, utilization is stuck around 90%.

***

Updated the table with x7 & x8 results. Probably by manually scheduling instructions it’s possible to push utilization further. I need to write a simulator for this.

As Cayman’s results aren’t looking too impressive (with the 1536@880 setup), the 5970 will stay the fastest GPU for hashing/cracking for several more months.

Posted in Uncategorized | 46 Comments