Ivan Golubev's blog

Cryptography, code optimizations, GPUs & CPUs and other

Recently I released IGPRS, a tool to recover passwords for Apple iOS 4.x & 5.x and BlackBerry 5.x & 6.x backups, TrueCrypt containers and WPA/WPA2 handshakes. An IGPRS x64 version was added today, with CPU AVX and XOP optimizations for the SHA-512 used in TrueCrypt containers.

IGPRS x64 running on HD 7970 + HD 4850

The initial release back in February used my old approach to supporting AMD GPUs: CAL API calls and kernels written in IL. However, with the new GCN architecture I ran into several problems. First, AMD removed the global buffer for GCN GPUs (instead it is emulated via a UAV; after all, it was not a problem to emulate a UAV with the global buffer back in the 4xxx days). I had to waste some time figuring out how to deal with UAVs, but OK, it is not that hard (don't use INT_4, just INT_1, etc.). Later, things got worse: with Catalyst 12.3 I got several random lock-ups with simple kernels, and I was not able to run the PBKDF2/SHA-512 kernel for TrueCrypt at all; the system just locked up, no matter what. After several days of programming and debugging I got really annoyed by all of this, decided to give up on CAL/IL and finally switched to OpenCL.

Things have gotten better since I last looked at OpenCL: after a year (of very “hard” work, I guess) AMD made it possible to use BFI_INT and BIT_ALIGN_INT directly from OpenCL kernels (via bitselect() and amd_bitalign()). I was amazed at how easy it is to write GPU kernels for AMD cards now, while their performance is nearly the same as hand-written IL kernels… but I felt that way for a very short time :D .
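
For illustration, here is roughly how these built-ins map onto the hardware instructions inside an OpenCL kernel. This is a minimal sketch with hypothetical macro names, not actual IGPRS/ighashgpu source:

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

/* 32-bit rotate left; compiles down to a single BIT_ALIGN_INT (n in 1..31) */
#define ROTL32(x, n)   amd_bitalign((x), (x), 32u - (n))

/* MD5/SHA selection function F(x,y,z) = (x & y) | (~x & z):
   bitselect(z, y, x) computes the same value and maps to a single BFI_INT */
#define MD5_F(x, y, z) bitselect((z), (y), (x))

Without these built-ins each of those expressions costs three or four ALU instructions, which is exactly what used to make hand-written IL attractive.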

I faced nearly every kind of bug once I tried to implement more advanced algorithms: the AMD OpenCL compiler produces inefficient code, it simply locks up on complex kernels, it doesn't know how to use the hardware capabilities of the GPUs properly, and some kernels (after the compiler's “optimizations”) simply produce incorrect results. It even replaces vector calculations with scalar ones (trying to favor the GCN architecture, I guess), which results in very poor performance on VLIW4/5 GPUs. Now I can't decide which is more annoying: fighting the OpenCL compiler, checking the intermediate IL/ISA and hoping for proper code generation, or sticking with kernels written in IL, where at least a few more things are under your control. Or maybe my old idea of writing my own GPU assembler for AMD GPUs was (very time-consuming but) a much better thing to do after all?..

 

After I got a question about SHA-512 performance on my blog I decided to take a closer look at the ISA produced by AMD's OpenCL compiler, and I was totally disappointed with the results. More information about SHA-512 performance on CPUs and GPUs will follow in my next post.

 

Almost a year ago I wrote a post about the 5970, and this week I finally grabbed a 6990 to test myself. As the title states:

Same ruler, almost the same size as the 5970. I already had several test results from 6990 owners and they looked kind of weird: while the 6990 was faster than the 5970, it was still slower than I expected. My own first tests produced values like:

 

That's 10% slower. I tested SHA-1 kernels, SL3, Office 07-10 and WPA; everything was slower than expected. So I grabbed my old program for measuring the GFLOPS of ATI GPUs and started a series of experiments. Apparently, lowering the GPU core frequency resulted in performance closer to the estimates. My first guess was that the 6990 throttles internally, so overheating was causing the performance drop. I even posted about this on the official forum, but some more experiments revealed that I wasn't entirely right. The answer was pretty simple:

Yep, by default the 6990 simply isn't given enough power to run at 100% performance! Adjusting this results in:

The power for the 2nd core must be tweaked as well, and then we see:

At last, the value I was expecting! Apparently a 5-second run is not enough for precise measurements, so I increased the charset size, and, well, the hardware monitoring value increased too: the card reaches 90°C in no time (and 95°C follows very soon).

Mystery solved. My estimates from several months ago were correct.

 

If you're going to repeat the above steps with your 6990, make sure you have proper cooling and a proper PSU: it looks like the official 375W TDP can easily become 450W, and that means A LOT of heat you'll need to deal with somehow.

Latest ighashgpu can be downloaded here.

Instead of buying another GPU a month ago, I chose something else:

Well, when I was buying the guitar I just couldn't resist buying a tambourine as well. I'm not sure about the English-speaking community, but in Russian IT folklore the tambourine and IT are very closely related :D .

And yes, the 6990 is going to be released soon, but at first sight it won't be anything revolutionary (except for power usage, 450W!). The good ol' 5970 is still an awesome option; by first estimates the 6990 will be only about 15-20% faster than the 5970.

But anyway, who cares about GPUs, my next aim is… DRUMS! :D

For a change, a post in Russian. A small write-up caught my eye recently: http://habrahabr.ru/blogs/infosecurity/111488/.

The original is here, although the real original should actually be somewhere at Black Hat. Naturally, everyone played broken telephone: the further the story traveled, the less truth was left in it.

The gist: Amazon has started renting out servers with GPUs. A machine with 2x Tesla M2050 costs $2.1 per hour. What is a Tesla M2050 from a hashing/cracking point of view? It is a GTX 470 with a lowered shader clock, 3 GB of memory, ECC support and full-speed double-precision floating point. None of that is needed for hashing, yet it pushes the price up to $2,600 per card. Well, scientists will pay that kind of money, because for serious scientific computing Tesla simply has no competition (among GPUs). But using a Tesla for hash cracking is like buying a Ferrari for freight hauling: it has more horsepower than a KamAZ truck and costs more, so surely it can carry furniture faster! Obviously! The table with brute-force speeds for different cards is still in the usual place.

OK, suppose the final cost of the solution doesn't matter much and all we care about is the speed of the remote system: the idea is that we sit in the bushes with a low-powered laptop, capture a WPA handshake with it, send the handshake off to a remote server (Amazon or any other) and wait for the result.

Thomas Roth claims he got a speed of 400K passwords per second while paying 28 cents per minute. That is 0.28 * 60 = $16.8/hour, which divided by $2.1 means he rented 8 systems, i.e. 16x Tesla M2050. According to the table mentioned above, that speed matches the estimate quite well: about 25K per Tesla. He then says that cracking a neighboring network (I didn't notice anything in the original about the network being protected by security professionals, just a “protected network in his neighborhood”) took 20 minutes, which he later reduced to 6 minutes. OK, in 6 minutes he tried 6 * 60 * 400,000 = 144M passwords. That range can be searched in an hour on a single $130 ATI 5770. But that's not the point.

144M passwords is roughly half of the keyspace of all lowercase Latin letters, 6 characters long (which is 308.9M). So it is quite possible that he really did brute-force some neighboring WiFi network with a password like “miguel”. But what do you call a person who sets passwords like that?..

With a more or less decent choice (uppercase + lowercase Latin letters + digits, 8 characters long) the keyspace is 62^8 = 218,340,105,584,896 passwords. That is 17.3 years of searching at 400K/s, or about $2.5 million in Tesla rent for all that time. Quite a breakthrough indeed!

Going further: suppose there are men in gray suits who are seriously interested in cracking WiFi networks. These people are smart and use the most efficient GPUs for the job, ATI 5970s. The largest GPU farm (not Fermi :) ) built on 5970s that I know of is 200 cards. At full load it would deliver roughly 200 * 131,000 ≈ 25M PMK/s (rounded down to account for overhead). The whole farm would draw somewhere around 100 kW.

The same “normal” 8-character password would then take 62^8 / 25M = 101 days of searching. The cost of the electricity burned has to be calculated from your local rates, not to mention the dedicated cooling.
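
For reference, a tiny standalone C program (my own sketch, not part of any of the tools mentioned) reproducing the arithmetic above:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t keyspace = 1;
    for (int i = 0; i < 8; i++)        /* 62^8: upper + lower Latin letters + digits, 8 chars */
        keyspace *= 62;

    double rented = keyspace / 400e3;  /* 16x Tesla M2050 at ~400K PMK/s */
    double farm   = keyspace / 25e6;   /* 200x HD 5970 farm at ~25M PMK/s */

    printf("keyspace: %llu passwords\n", (unsigned long long)keyspace);
    printf("rented Teslas: %.1f years\n", rented / (3600.0 * 24 * 365));
    printf("5970 farm: %.0f days\n", farm / (3600.0 * 24));
    return 0;
}

It prints 218340105584896 passwords, ~17.3 years and ~101 days, matching the figures above.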

So what are the conclusions? The main one: set proper passwords on your WiFi and no crackers will ever be a threat to you. Thinking with your own head is damn useful too; the PR people never sleep…

Happy New Year to everyone, and all the best in it!

It would be wrong to end the year with a post like my previous one :)

So Happy New Year to everyone!

However, I can't leave a post without any information related to the blog's tagline, so here are a few things I've found interesting over the last few days.

1. I recently got a question about the AVX extension in upcoming CPUs, checked the latest docs available and suddenly found that (quote) “VI: “Vector Integer” instructions are not promoted to 256-bit” applies to every instruction needed for MD4/MD5/SHA-1 hashes. This means AVX will be as useless for password cracking as SSE was, since the vector size grew from 128 to 256 bits only for floating-point values (see the first sketch after this list). Simply ridiculous.

2. Somehow I had read Cayman's ISA disassembly as if it were capable of doing a 32-bit integer multiplication on each of the XYZW units. Actually I was wrong: in reality all 4 of these units are now required to perform one 32-bit multiplication. So where the previous architecture could perform 4x additions/logic/bit-aligns AND a multiplication, a multiplication now takes 4x more instruction slots. Not very good for classic ZIP encryption (see the second sketch after this list)…

3. …which is currently not supported on ATI GPUs at all, because they are missing some functionality present in NVIDIA GPUs. So right now AccentZPR (v2.0 final, released recently) shows millions of passwords per second on NVIDIA GPUs and ZERO on ATI ones. A good counterexample to “Oh, ATI's GPUs are so good and fast and cheap”… :P

4. For SHA-1 with a fixed charset (let's say 10 symbols) and a fixed password length (like 15 characters plus a 9-byte salt) it's possible to optimize the algorithm a lot. I got 800M/s compared with ighashgpu v0.80's 680M/s on a single 5770.
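
Regarding point 1, here is a small illustration (my own sketch, not IGPRS/ighashgpu code) of why AVX doesn't help: only floating-point operations get 256-bit forms, while the 32-bit integer additions that MD4/MD5/SHA-1 are built on stay at the 128-bit SSE2 width (a 256-bit integer add only arrives later, with AVX2). Build with -mavx on gcc/clang.

#include <immintrin.h>
#include <stdint.h>

/* AVX: eight packed floats per instruction, useless for hashing */
void add_fp256(float *dst, const float *a, const float *b) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));
}

/* 32-bit integer adds stay at the SSE2 width: four lanes per instruction */
void add_int128(uint32_t *dst, const uint32_t *a, const uint32_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}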
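
And regarding point 2: classic ZIP (“ZipCrypto”) encryption performs an integer multiply for every byte processed, which is why slower 32-bit multiplies hurt there. A sketch of its key update, following the public APPNOTE description (again, not code from any of the products mentioned):

#include <stdint.h>

uint32_t crc32_byte(uint32_t crc, uint8_t c) {
    crc ^= c;
    for (int i = 0; i < 8; i++)                      /* bitwise CRC-32, polynomial 0xEDB88320 */
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

void update_keys(uint32_t k[3], uint8_t c) {
    k[0] = crc32_byte(k[0], c);
    k[1] = (k[1] + (k[0] & 0xFFu)) * 134775813u + 1; /* 32-bit integer multiply */
    k[2] = crc32_byte(k[2], (uint8_t)(k[1] >> 24));
}

uint8_t decrypt_byte(const uint32_t k[3]) {
    uint32_t t = (k[2] | 2u) & 0xFFFFu;
    return (uint8_t)((t * (t ^ 1u)) >> 8);           /* one more multiply per byte */
}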

More information will come next year, so stay tuned!

Pardon my French.

I understand that many people disrespect ighashgpu's license agreement (and so, in fact, disrespect me) by using it in a commercial environment. The license clearly states that the program is free only for personal, non-commercial use, but nobody cares.

However, that is nothing compared with the motherfu$kers who took ighashgpu, removed all copyright notices, included it in their package and started selling it as the WORLD'S FIRST GPU SOLUTION FOR SL3 UNLOCKING. Seriously, are they that brain-damaged?..

I had been planning to release a SHA-1/SL3 version for some time, as I'm constantly asked about it (well, it's just a single SHA-1 iteration, so ighashgpu is obviously ideal for this), but now… what's the point after all?

No download link.

***

One more screenshot: a dump of the sl3bf.exe contained in the MX-KEY package:

So “many” changes; they even renamed ighashgpu to mxhashgpu. Seriously, did they think nobody would notice?!

***

More updates. Inside sl3bf.exe, at offset 0xd10f4, starts… the ighashgpu executable. As simple as that.

Comparing ighashgpu.exe v0.80.16.1 with the data inside sl3bf.exe (columns: offset, byte in ighashgpu.exe, byte in the copy inside sl3bf.exe):

[ighashgpu.exe] 524800 bytes
[ighashgpu_inside_sl3bf.exe_at_0xd10f4_offset] 522752 bytes
00000111 02 ( ) da (Ú)
00000112 06 ( ) 05 ( )
0000017c 88 (ˆ) 10 ( )
0000017d 09 ( ) 00 ( )
000002b8 88 (ˆ) 10 ( )
000002b9 09 ( ) 00 ( )
000002c1 0a ( ) 02 ( )
000002ed da (Ú) d2 (Ò)
00024870 69 (i) 6d (m)
00024872 67 (g) 78 (x)
00024924 69 (i) 6d (m)
00024926 67 (g) 78 (x)
0007c010 2a (*) 20 ( )
0007c012 2a (*) 20 ( )
0007c014 2a (*) 20 ( )
0007c016 2a (*) 20 ( )
0007c018 2a (*) 20 ( )

0007d966 54 (T) 00 ( )
0007d968 72 (r) 00 ( )
0007d96a 61 (a) 00 ( )
0007d96c 6e (n) 00 ( )
0007d96e 73 (s) 00 ( )
0007d970 6c (l) 00 ( )
0007d972 61 (a) 00 ( )
0007d974 74 (t) 00 ( )
0007d976 69 (i) 00 ( )
0007d978 6f (o) 00 ( )
0007d97a 6e (n) 00 ( )
0007d980 09 ( ) 00 ( )
0007d981 04 ( ) 00 ( )
0007d982 b0 (°) 00 ( )
0007d983 04 ( ) 00 ( )
1553 bytes out of 522752 are different

The full diff file is here. Copyrights removed, “ig” changed to “mx”… Awesome work indeed!

***

27-Dec-2010 update.

To summarize everything:

1. ighashgpu's code was stolen by the MX-KEY authors. The evidence is above.

2. They clearly understand that they are violating ighashgpu's license agreement; that's why they removed all the copyright notices. A totally stupid move on their part.

3. It is OK to use ighashgpu in SL3 unlocking solutions as long as doing so doesn't violate ighashgpu's license agreement. And this means:

a) you can't use it in a commercial environment without a separate agreement with the copyright holder (so you can't charge money for the GPU brute-forcing itself, or use it in clusters that sell the results of GPU brute-forcing performed by ighashgpu);

b) you can't modify any part of the executable file or the other files contained in ighashgpu's distribution package;

c) if you include ighashgpu.exe in your package, you must include ighashgpu's license agreement in your package as well.

Right now I'm totally OK with CycloneBox's implementation of local SL3 unlock (except for the not-included license agreement part).

This post is locked for comments; if you want to contact me, you should know how to find me.

So, if you've read the comments on the previous post you know there are some major speed-ups for MD5: processing 5 hashes in a single thread, using BFI_INT to replace 3 instructions with 1 (much the same kind of optimization as the _rotl to bitalign change), etc. However, implementing BFI_INT lowered the total number of instructions needed for a single MD5, so the utilization percentage dropped again. But, as I wrote yesterday on the ATI devforums, there are reserves: we can process 6 MD5 hashes per thread despite the hardware being VLIW5. Tried it, and yes, another speed-up is possible, though this time it's only ~3%. x7 and x8 also look like good candidates to test.

Fresh version is here.

Also, the utilization percentages look interesting:

As you can see, for the newest Caymans with VLIW4 it makes almost no difference how the hashes are processed. For RV770 we reached the peak with x5, and x6 can't change this significantly, as 98.4% is already a really high value. For RV870 there are still some options available, though the T unit of RV870's VLIW cannot accept either BIT_ALIGN_INT or BFI_INT, which I guess is why utilization is stuck around 90%.

***

Updated the table with x7 & x8 results. By scheduling instructions manually it's probably possible to push utilization even further; I'd need to write a simulator for that.

As Cayman's results don't look too impressive (with a 1536@880 setup), the 5970 will remain the fastest GPU for hashing/cracking for several more months.

In a previous post I wrote about the 83.5% utilization percentage for MD5. While that value looks good enough, it actually isn't as good as what SHA-1 already achieves (95.5%). Back in January I tried to improve utilization by processing 5 MD5 hashes per thread (with a 5D VLIW that should obviously be the ideal setup). But either I did something wrong or the CAL compiler wasn't in the mood; either way, 5xMD5 wasn't the best option in those days, and I only got slowdowns compared to (classic) 4xMD5 vectors.

But recently Marc Bevand released Whitepixel, claiming a speed of 28.6B/sec on 4x 5970. That obviously isn't reachable with 83.5% utilization, so I ran some tests with 5xMD5 again, and this time the speed-up is there. Simple IL kernel modifications end up at 95.5% utilization for the inner/main loop, or, in other words, +12% performance. That's 2.1B/s single MD5 on a 5770 and around 7.1B/s on a 5970 (I'm too lazy right now to verify that myself). You can get the latest version of ighashgpu here (still very limited for ATI GPUs).
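
To give an idea of what “5 hashes per thread” means in practice: each work-item keeps five independent MD5 states, four packed into uint4 variables plus a fifth in scalars, so every round supplies five independent ALU operations for the VLIW5 packer to fill the x/y/z/w/t slots with. The real kernels are written in IL; the following is just an OpenCL-flavoured sketch with made-up names and data layout:

__kernel void md5_x5_sketch(__global const uint *w0_per_candidate,
                            __global uint *out)
{
    size_t gid = get_global_id(0);

    /* five independent MD5 states per work-item: four packed into uint4s... */
    uint4 a4 = (uint4)(0x67452301u), b4 = (uint4)(0xefcdab89u),
          c4 = (uint4)(0x98badcfeu), d4 = (uint4)(0x10325476u);
    /* ...plus a fifth kept in scalars, to feed the 5th (T) VLIW slot */
    uint  a1 = 0x67452301u, b1 = 0xefcdab89u,
          c1 = 0x98badcfeu, d1 = 0x10325476u;

    /* message word 0 for candidates 0..3 and for candidate 4
       (a real kernel builds all 16 words from the candidate index) */
    uint4 w0v = vload4(0, w0_per_candidate + gid * 5);
    uint  w0s = w0_per_candidate[gid * 5 + 4];

    /* one MD5 F-round step; each vector+scalar pair gives the compiler
       five independent operations to bundle into one VLIW instruction */
#define MD5_STEP_F(a, b, c, d, x, t, s) \
    (a) += ((d) ^ ((b) & ((c) ^ (d)))) + (x) + (t); \
    (a)  = rotate((a), (s)) + (b);

    MD5_STEP_F(a4, b4, c4, d4, w0v, 0xd76aa478u, (uint4)(7u));
    MD5_STEP_F(a1, b1, c1, d1, w0s, 0xd76aa478u, 7u);
    /* ...the remaining 63 steps follow the same pattern... */

    out[gid] = a4.x ^ a4.y ^ a4.z ^ a4.w ^ a1;   /* dummy output */
}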

***

In other news: we finally got the program icon from the designer, and so Accent RAR Password Recovery (beta) was released today. At last! There are a lot of things I want to write about RAR on GPUs (well, most of them negative :D ), so I'm planning a separate post about it. But as many people are asking about Fermi/68xx support for RARs, I decided to at least put this announcement here.

And yes, this program is fully commercial (cruel world indeed!), but the discount coupon I posted some time ago should work with it (if not, let me know).

It looks like the recently released 6850/6870 are just slightly tuned 57XX GPUs. There are no changes in supported instructions, cache sizes, double precision (not present, just as before) or anything else. The only difference I've found so far from reading forums and papers is that “flow control clauses don't require as many cycles [as 5XXX]”, meaning that complex kernels with a large number of control clauses may work better on 68XX compared with 57XX. But that doesn't matter for hash calculations (well, maybe a little for huge multi-hash lists). It looks like the marketing guys won, so it's the 68XX family now rather than 67XX, though from a peak-performance point of view it looks weird that the 6870 is slower than the 5850. Of course that has nothing in common with 3D gaming benchmarks. But who buys modern GPUs for gaming these days? :D

Anyway, while the 68XX looks totally boring from a programming point of view, the upcoming 69XX is a different story. It turns out that even Catalyst 10.6 can compile code for the mysterious ISA id=15, and the resulting disassembly looks very interesting: the T unit is indeed gone from ATI's thread processors, and the XYZW units can now process instructions they weren't able to handle before, like 32-bit integer multiplies. Basically this means the utilization percentage for ATI GPUs should grow, and I decided to check it.

I took two GPU kernels to analyze: PBKDF2 (the core of the algorithm is just 2x SHA-1 transforms) and single MD5. Right now (on the 5XXX family) utilization for PBKDF2 is already at 95.5%. After analyzing the disassembly for ISA id=15 it turns out it increases to 99.2%. The number of instructions is also reduced by about 1%, giving a final estimated performance gain of 4.6%.

For MD5 the results look far more impressive. Right now it's really hard to fully utilize all 5 stream cores; I ran several tests with different numbers of hashes processed simultaneously per thread and ended up with the (first and default) value of 4. Utilization in that case is only around 83.5%. But with the new 4x stream cores the 4xMD5 hashes can be perfectly vectorized, hitting 98.6% utilization. So it's an ~18% speed-up just from the 4+1 to 4x stream-core architecture change. In other words, a 69XX with 768 SP @ 850 MHz should show about 2100M single-MD5 speed, compared to the current 1870M on a 5770 (800 SP @ 850 MHz).
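
A back-of-the-envelope check of that estimate, using nothing but the numbers above (all of them assumptions about an unreleased chip):

#include <stdio.h>

int main(void) {
    double md5_5770   = 1870e6;  /* measured single-MD5 rate on a 5770, 800 SP @ 850 MHz */
    double util_vliw5 = 0.835;   /* 4xMD5 per thread on today's 4+1 VLIW5 */
    double util_vliw4 = 0.986;   /* 4xMD5 per thread on the 4x VLIW4 of ISA id=15 */

    double est = md5_5770 * (768.0 / 800.0) * (util_vliw4 / util_vliw5);
    printf("estimated 69XX (768 SP @ 850 MHz): %.0f M/s\n", est / 1e6);   /* ~2120 */
    return 0;
}

Which is where the “about 2100M” figure comes from.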

Of course, these are very premature assumptions, and there's some chance ISA id=15 will turn out to be just a myth, who knows. But if you're planning to upgrade your GPU right now (and use it mainly for GPGPU, not games), I suggest waiting for the 69XX release.

Updated speed estimates (with the above assumptions for 69XX) are available here.

An updated ighashgpu is available here. It should work with the 68XX family now; if it doesn't, run it with the /debuglog switch and send me the “CAL device N, target = XX” value.

And yet again I got bored with RAR Password Recovery. Every time I thought I was almost done: boom, more problems arrive, mainly from the ATI GPU side. It's simply ridiculous that the only Catalyst version able to produce good code for the RAR kernels is still the year-old v9.9.

Anyway, I decided to relax a bit by switching to another problem. As PBKDF2 was implemented a long time ago (for OpenOffice & WinZip/AES), and it's the main part of WPA-PSK protection, I decided to go in that direction and:

Not bad for a first version and a single-5970 system, I guess. As it's a full WPA handshake validation, the speed is a bit lower than it would be for PMK generation only (more details about PMK generation and WPA in general are here). There are still some places left to optimize, though.
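
For readers who want the exact definition being computed: the PMK is PBKDF2-HMAC-SHA1(passphrase, SSID, 4096 iterations, 32 bytes), and the handshake check then derives the PTK from the PMK and verifies the MIC, which is why full validation is slightly slower than PMK generation alone. A CPU-side reference sketch using OpenSSL (my own illustration with made-up passphrase/SSID, not ighashgpu code; build with -lcrypto):

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

int main(void) {
    const char *passphrase = "password";   /* candidate password */
    const char *ssid       = "linksys";    /* the SSID serves as the PBKDF2 salt */
    unsigned char pmk[32];

    PKCS5_PBKDF2_HMAC_SHA1(passphrase, (int)strlen(passphrase),
                           (const unsigned char *)ssid, (int)strlen(ssid),
                           4096, (int)sizeof(pmk), pmk);

    for (int i = 0; i < 32; i++)
        printf("%02x", pmk[i]);
    printf("\n");
    return 0;
}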

Does anyone know where to get a large number of real WPA handshakes with matching passwords to test things out?