Ivan Golubev's blog

Cryptography, code optimizations, GPUs & CPUs and other

Almost a year ago I wrote a post about the 5970, and this week I finally grabbed a 6990 to test myself. As the title states:

Same ruler, almost the same size as the 5970. I had already received several test results from 6990 owners and they looked kind of weird: while the 6990 was faster than the 5970, it was still slower than I expected. My own first tests produced values like:

 

That’s 10% slower. I tested with SHA1 kernels, SL3, Office 07-10, WPA: everything was slower than expected. I grabbed my old program for measuring GFLOPS of ATI GPUs and started a series of experiments. Apparently, lowering the GPU core frequency resulted in performance “closer to estimations”. My first guess was that the 6990 has internal throttling, so overheating was causing the performance drop. I even posted about this on the official forum, but further experiments revealed that I wasn’t entirely right. The answer was pretty simple:

Yep, by default the 6990 isn’t given enough power to run at 100% performance! This adjustment results in:

Thus, the power limit for the 2nd core must also be tweaked, and then we see:

At last, the value I was expecting! Apparently, a 5s running time is not long enough for precise measurements, so I increased the charset size and, well, the hardware monitoring threshold too, as the GPU reaches 90C in no time (and 95C follows very soon).

Mystery solved. My estimations from several months ago were correct.

 

If you’re going to repeat the above steps with your 6990, make sure you have proper cooling and a proper PSU, as it looks like the official 375W TDP can easily become 450W, and this means A LOT of heat that you need to deal with somehow.

Latest ighashgpu can be downloaded here.

Instead of buying another GPU a month ago, I chose something else:

Well, when I was buying the guitar I just couldn’t resist buying a tambourine. I’m not sure about the English-speaking community, but in Russian, tambourines and IT are very closely related :D .

And, yes, the 6990 is going to be released soon, but at first sight it won’t be anything revolutionary (except for power usage, 450W!). The good ol’ 5970 is still an awesome option; by first estimates the 6990 will be only about 15-20% faster than the 5970.

But, anyway, who cares about GPUs, my next aim is… DRUMS! :D

For a change, this post was originally written in Russian. I came across a small note: http://habrahabr.ru/blogs/infosecurity/111488/.

The original is here, although the real original should actually be somewhere at Black Hat. Naturally, everyone worked in the spirit of the telephone game: the further from the source, the less truth.

The gist: Amazon has started renting out servers with GPUs. A machine with 2x Tesla M2050 costs $2.1/hour. What is a Tesla M2050 in terms of hashing/cracking? It is a GTX470 with a lower shader clock, 3 GB of memory, ECC support and full-speed double precision floating point. None of that is needed for hashing, yet it drives the price up to $2600 apiece. Well, scientists will pay that kind of money, since for serious scientific computing Tesla simply has no competitors (among GPUs). But using a Tesla for hash cracking… is like buying a Ferrari for freight hauling: it has more power than a KamAZ truck and costs more, so surely it can haul furniture faster too! Absolutely! The table with brute-force speeds on different cards is in the same place.

Fine, suppose the final cost of the solution doesn’t matter much to us and we only care about the speed of the remote system. The assumption is that we sit in the bushes with a low-powered laptop, capture a WPA handshake with it, send it off to a remote server (Amazon or some other) and wait for the result.

Thomas Roth claims he got a speed of 400K/second while paying 28 cents per minute. That is 0.28*60 = $16.8/hour, and $16.8 / $2.1 = 8 systems rented, or 16x Tesla M2050. According to the table mentioned above, the speed matches the estimate: 25K per Tesla. He then says that cracking a neighboring network (in the original I didn’t notice anything about the network being protected by security professionals, just “protected network in his neighborhood”) took 20 minutes, which he later cut down to 6 minutes. OK, in 6 minutes he went through 6*60*400000 = 144M passwords. That range can be searched in an hour on a single ATI 5770 costing $130. But that’s not the point.

144M passwords is roughly half of the range “all lowercase Latin letters, 6 symbols long” (which is 308.9M). So it’s quite likely that he really did brute-force some neighboring WiFi network with a password like “miguel”. But what do you call a person who sets such passwords?..
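The arithmetic above is easy to re-check. Here is a small Python sketch using only the figures quoted in the post; the variable names are mine and purely illustrative.

```python
# Figures quoted in the post; names are illustrative, not from any tool.
rate_per_minute = 0.28       # USD paid per minute for the whole cluster
price_per_machine = 2.1      # USD per hour for one 2x Tesla M2050 machine
cluster_speed = 400_000      # passwords per second, whole cluster

machines = round(rate_per_minute * 60 / price_per_machine)
gpus = machines * 2                      # two Teslas per machine
per_gpu_speed = cluster_speed / gpus     # passwords per second per Tesla

tested = 6 * 60 * cluster_speed          # passwords tried in 6 minutes
lowercase6 = 26 ** 6                     # all lowercase Latin, 6 symbols
fraction = tested / lowercase6           # share of that range covered
rate_for_one_hour = tested / 3600        # speed needed to do it in an hour

print(machines, gpus, per_gpu_speed)
print(tested, lowercase6, round(fraction, 2), rate_for_one_hour)
```

The last value is the speed a single card would need to cover the same 144M passwords in one hour, which is the comparison made with the 5770.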

With a more or less decent choice (uppercase + lowercase Latin letters + digits, 8 symbols long) the search range is 62^8 = 218 340 105 584 896 passwords. That’s 17.3 years of searching at 400K per second, or $2.5 million for renting the Teslas for all that time. A breakthrough, no doubt!

Going further: suppose there are men in grey who are strongly interested in cracking WiFi networks. These people are smart and use the most efficient GPU for cracking, the ATI 5970. The largest 5970-based GPU farm (not Fermi :) ) known to me is 200 cards. Accordingly, at full load they would show a speed of about 200 * 131 000 ~= 25M PMK/s (rounded down to account for overhead). The power consumption of the whole farm would be somewhere near 100 kW.

That same “normal” 8-symbol password would require 62^8/25M = 101 days of searching. The cost of the electricity burned has to be calculated from local tariffs. Not to mention separate cooling.
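For anyone who wants to plug in their own charset, length or speed, the same estimate can be written as a tiny Python helper. This is only a sketch: the function names are mine, the speeds and the $16.8/hour rental rate are the figures from this post, and it assumes the whole range is searched (the worst case).

```python
SECONDS_PER_DAY = 24 * 3600

def keyspace(charset_size: int, length: int) -> int:
    """Number of candidate passwords of exactly `length` symbols."""
    return charset_size ** length

def exhaustive_search_days(charset_size: int, length: int, rate: float) -> float:
    """Days needed to try every candidate at `rate` passwords per second."""
    return keyspace(charset_size, length) / rate / SECONDS_PER_DAY

space = keyspace(62, 8)                                   # a-z, A-Z, 0-9
cloud_days = exhaustive_search_days(62, 8, 400_000)       # 16x Tesla M2050
cloud_years = cloud_days / 365
cloud_cost = cloud_days * 24 * 16.8                       # USD at $16.8/hour
farm_days = exhaustive_search_days(62, 8, 25_000_000)     # 200x 5970 farm

print(space)
print(round(cloud_years, 1), round(cloud_cost))
print(round(farm_days))
```

On average a password is found after about half of the range, so the expected times are roughly half of these values.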

So what are the conclusions? The main one: set decent passwords on your WiFi and no crackers will be a threat to you. And thinking with your own head is damn useful too; the PR people never sleep…

Happy New Year to everyone, and good luck in it!

It would have been wrong to end this year with a post like my previous one :)

So Happy New Year to everyone!

However, I can’t leave a post without any information related to the blog’s tagline. So here is some information I’ve found interesting in the last few days.

1. I recently got a question about the AVX extension for upcoming CPUs, checked the latest docs available and suddenly found that (quote) “VI: “Vector Integer” instructions are not promoted to 256-bit” applies to every instruction needed for MD4/MD5/SHA1 hashes. This means AVX will be as useless for password cracking as SSE was, since the vector size increased from 128 to 256 bits only for floating point values. Simply ridiculous.

2. Somehow I read the disassembly of Cayman’s ISA as if it were capable of doing 32-bit integer multiplications in each of the XYZW units. Actually I was wrong: in reality all 4 of these units are now required to perform one 32-bit multiplication. So with the previous architecture it was possible to perform 4x additions/logic/bit-aligns AND a multiplication, and now a multiplication requires 4x more instructions. Not very good for classic ZIP encryption…

3. …which is currently not supported on ATI GPUs at all, because they are missing some functionality present in NVIDIA GPUs. So right now AccentZPR (v2.0 final was released recently) shows millions of passwords per second for NVIDIA GPUs and ZERO for ATI ones. A good counterexample to “Oh, ATI’s GPUs are so good and fast and cheap”… :P

4. For SHA-1 with a fixed charset (let’s say 10 symbols) and a fixed password length (like 15 + 9 byte salt) it’s possible to optimize the algorithm a lot. I got 800M/s compared with ighashgpu v0.80’s 680M/s on a single 5770.

More information will come next year, so stay tuned!

Pardon my French.

I understand that many people disrespect ighashgpu’s license agreement (and so, in fact, disrespect me) by using it in a commercial environment. It clearly states that it’s free only for personal, non-commercial use, but nobody cares.

However, it is nothing compared with some motherfu$kers who took ighashgpu, removed all copyright notices, included it in their package and started selling it as WORLD FIRST GPU SOLUTION FOR SL3 UNLOCKING. Seriously, are they that brain-damaged?..

I had been planning to release a SHA-1/SL3 version for some time, as I was constantly being asked about it (well, it’s just a single SHA-1 iteration, so surely ighashgpu is ideal for this), but now… what’s the point after all?

No download link.

***

One more screenshot, a dump of sl3bf.exe contained in the MX-KEY package:

So “many” changes, they even renamed ighashgpu to mxhashgpu. Seriously, did they think nobody would notice this?!

***

More updates. Inside sl3bf.exe, at offset 0xd10f4, starts… the ighashgpu executable. Simple as that.

Comparing ighashgpu.exe v0.80.16.1 with data inside sl3bf.exe:

[ighashgpu.exe] 524800 bytes
[ighashgpu_inside_sl3bf.exe_at_0xd10f4_offset] 522752 bytes
00000111 02 ( ) da (Ú)
00000112 06 ( ) 05 ( )
0000017c 88 (ˆ) 10 ( )
0000017d 09 ( ) 00 ( )
000002b8 88 (ˆ) 10 ( )
000002b9 09 ( ) 00 ( )
000002c1 0a ( ) 02 ( )
000002ed da (Ú) d2 (Ò)
00024870 69 (i) 6d (m)
00024872 67 (g) 78 (x)
00024924 69 (i) 6d (m)
00024926 67 (g) 78 (x)
0007c010 2a (*) 20 ( )
0007c012 2a (*) 20 ( )
0007c014 2a (*) 20 ( )
0007c016 2a (*) 20 ( )
0007c018 2a (*) 20 ( )

0007d966 54 (T) 00 ( )
0007d968 72 (r) 00 ( )
0007d96a 61 (a) 00 ( )
0007d96c 6e (n) 00 ( )
0007d96e 73 (s) 00 ( )
0007d970 6c (l) 00 ( )
0007d972 61 (a) 00 ( )
0007d974 74 (t) 00 ( )
0007d976 69 (i) 00 ( )
0007d978 6f (o) 00 ( )
0007d97a 6e (n) 00 ( )
0007d980 09 ( ) 00 ( )
0007d981 04 ( ) 00 ( )
0007d982 b0 (°) 00 ( )
0007d983 04 ( ) 00 ( )
1553 bytes out of 522752 are different

The full diff file is here. Copyrights removed, ig changed to mx… Awesome work indeed!
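If you want to repeat this kind of comparison yourself, a byte-by-byte diff takes only a few lines of Python. This is just a sketch that prints differences in a layout similar to the listing above; it is not the tool that produced that listing, and the function names are made up for illustration. It compares only the common length of the two inputs.

```python
def diff_bytes(a: bytes, b: bytes):
    """Yield (offset, byte_a, byte_b) wherever the two inputs differ."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield offset, x, y

def printable(value: int) -> str:
    """Show ASCII and upper Latin-1 characters, a blank for everything else."""
    return chr(value) if 0x20 <= value < 0x7F or value >= 0xA0 else " "

def format_diff(a: bytes, b: bytes) -> list[str]:
    """One text line per differing byte: offset, then both values in hex."""
    return [
        f"{off:08x} {x:02x} ({printable(x)}) {y:02x} ({printable(y)})"
        for off, x, y in diff_bytes(a, b)
    ]

# Tiny in-memory demo; with real files you would read both with
# open(path, "rb").read() and slice the container at the embedded offset,
# e.g. data[0xd10f4:].
for line in format_diff(b"ighashgpu", b"mxhashgpu"):
    print(line)
```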

***

27-Dec-2010 update.

To summarize everything:

1. ighashgpu’s code was stolen by the MX-KEY authors. The evidence is above.

2. They clearly understand that they are violating ighashgpu’s license agreement; that’s why they removed all copyright notices. A totally stupid move on their part.

3. It is OK to use ighashgpu in SL3 unlocking solutions if this doesn’t violate ighashgpu’s license agreement. And this means:

a) you can’t use it in a commercial environment without a separate agreement with the copyright holder (so you can’t charge money for GPU brute-forcing itself or use it in clusters selling the results of GPU brute-forcing performed by ighashgpu)

b) you can’t modify any part of the executable file or other files contained in ighashgpu’s distribution package.

c) if you’re including ighashgpu.exe in your package, you must include ighashgpu’s license agreement in your package as well.

Right now I’m totally OK with CycloneBox’s implementation of local SL3 unlock (except for the missing license agreement).

This post is locked for comments; if you want to contact me, you should know how to find me.

So, if you’ve read the comments on the previous post, you know that there are some major speed-ups for MD5: processing 5 hashes in a single thread, using BFI_INT to replace 3 instructions with 1 (much the same optimization as the earlier change from _rotl to bitalign), etc. However, implementing BFI_INT lowered the total number of instructions required to perform a single MD5, and so the utilization percentage dropped again. But, as I wrote yesterday on the ATI devforums, there are reserves: we can process 6x MD5 per thread despite the fact that it’s VLIW5. I tried it and yes, another speed-up is possible, though this time it’s just ~3%. However, x7 and x8 also look like good candidates to test.
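To make the two tricks concrete, here is a plain Python model of what the instructions compute, as I understand their semantics: a bit-select (BFI_INT style) does MD5’s F function in one operation, and a funnel shift (bitalign style) does a 32-bit rotation in one operation. This is an illustration on integers, not GPU code, and the function names are chosen for this sketch only.

```python
MASK32 = 0xFFFFFFFF

def md5_f_three_ops(x: int, y: int, z: int) -> int:
    """MD5 F(x, y, z) in the usual 3-operation form: XOR, AND, XOR."""
    return z ^ (x & (y ^ z))

def bfi_int(mask: int, a: int, b: int) -> int:
    """Bit-select model: take bits of a where mask is 1, bits of b elsewhere."""
    return ((a & mask) | (b & ~mask)) & MASK32

def rotl_shifts(x: int, n: int) -> int:
    """Rotate left with shift, shift, OR (three operations)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def bitalign(hi: int, lo: int, shift: int) -> int:
    """Funnel shift model: low 32 bits of the 64-bit value hi:lo >> shift."""
    return (((hi << 32) | lo) >> (shift & 31)) & MASK32

def rotl_bitalign(x: int, n: int) -> int:
    """Rotate left by n (1..31) as a single funnel shift of x:x."""
    return bitalign(x, x, 32 - n)

x, y, z = 0x67452301, 0xEFCDAB89, 0x98BADCFE
print(md5_f_three_ops(x, y, z) == bfi_int(x, y, z))
print(all(rotl_shifts(x, n) == rotl_bitalign(x, n) for n in range(1, 32)))
```

MD5’s G function has the same select shape with the arguments rearranged, so the same single operation covers it as well.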

Fresh version is here.

Also, the utilization percentages look interesting:

As you can see, for the newest Caymans with VLIW4 it makes nearly no difference how hashes are processed. For RV770 we reached the peak with x5, and x6 can’t significantly change this, as 98.4% is a really huge value. For RV870 there are still some options available, though the T unit of RV870’s VLIW cannot accept either BIT_ALIGN_INT or BFI_INT, which is why, I guess, utilization is stuck around 90%.

***

Updated the table with x7 & x8 results. Probably, by manually scheduling instructions, it’s possible to push utilization further. I need to write a simulator for this.

As Cayman’s results aren’t looking too impressive (with a 1536@880 setup), the 5970 will stay the fastest GPU for hashing/cracking for several more months.

In the previous post I wrote about the 83.5% utilization percentage for MD5. This value, while looking good enough, in fact isn’t as good as what was already achieved for SHA-1 (95.5%). Back in January I tried to improve utilization by processing 5xMD5 hashes per thread (with a 5D VLIW it should be the ideal setup, obviously). But either I did something wrong or the CAL compiler wasn’t in the mood; either way, 5xMD5 wasn’t the best option in those days, and I only got slowdowns compared to (classic) 4xMD5 vectors.

But recently Marc Bevand released Whitepixel, and he claims a 28.6B/sec speed for 4×5970. Obviously that isn’t reachable with 83.5% utilization, so I ran some tests with 5xMD5 again, and this time the speed-up is there. Simple IL kernel modifications result in 95.5% utilization for the inner/main cycle or, in other words, +12% performance. That’s 2.1B/s single MD5 for the 5770 and around 7.1B for the 5970 (I’m too lazy right now to check this myself). You can get the latest version of ighashgpu here (still very limited for ATI GPUs).

***

In other news: we’ve finally got the program icon from the designer, so Accent RAR Password Recovery (beta) was released today. At last! There are a lot of things I want to write about RAR on GPU (well, most of them are negative :D ), so I’m planning to make a separate post about it. But as many people are asking about Fermi/68xx support for RAR, I decided to at least put this announcement here.

And, yes, this program is fully commercial (cruel world indeed!), but the discount coupon I posted some time ago should work with it (if not, let me know).

It looks like the recently released 6850/6870 are just slightly tuned 57XX GPUs. There are no changes in supported instructions, cache size, double precision (absent, as before) or anything else. The only difference I’ve found so far from reading forums and papers is that “flow control clauses don’t require as many cycles [as 5XXX]”. This means that complex kernels with a large number of control clauses may work better on 68XX compared with 57XX. But it doesn’t matter for hash calculations (well, maybe a bit for huge multi-hash lists). It looks like the marketing guys won, so it’s the 68XX family now, not 67XX, though from a peak performance point of view it looks weird that the 6870 is slower than the 5850. Of course, this has nothing to do with 3D gaming benchmarks. But who buys modern GPUs for gaming these days? :D

Anyway, while 68XX looks totally boring from a programming point of view, the upcoming 69XX is a different story. It turns out that even Catalyst 10.6 can compile code for the mysterious ISA id=15, and the resulting disassembly looks very interesting: the T unit is indeed gone from ATI’s thread processors, and the XYZW units can now process instructions they weren’t able to handle before, like 32-bit integer multiplies. It basically means that the utilization percentage for ATI GPUs should grow, and I decided to check it.

I took 2 GPU kernels to analyze: PBKDF2 (the algorithm’s core here is just 2xSHA1 transforms) and single MD5. Right now (for the 5XXX family) utilization for PBKDF2 is already at 95.5%. After analyzing the disassembly for ISA id=15, it turns out that it increases to 99.2%. The number of instructions is also reduced by about 1%, giving a final estimated performance gain of 4.6%.

For MD5 the results look way more impressive. Right now it’s really hard to fully utilize all 5 stream cores; I made several tests with different numbers of hashes processed per thread simultaneously and ended up with the (first and default) value of 4. Utilization in this case is just around 83.5%. But with the new 4x stream cores, 4xMD5 hashes can be perfectly vectorized, thus hitting a 98.6% utilization value. So it’s a ~18% speed-up just from the 4+1 to 4x stream core architecture change. In other words, if there is a 69XX with 768 SP @ 850MHz, it should show about 2100M single MD5 speed compared to the current 1870M with the 5770 (800 SP @ 850MHz).
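The estimate can be reproduced with a simple proportional model: speed scales with the number of stream processors and with utilization. The Python sketch below assumes equal clocks and an unchanged instruction count per hash, which is a simplification; the function name is made up for illustration.

```python
def scale_speed(base_speed: float, base_sp: int, base_util: float,
                new_sp: int, new_util: float) -> float:
    """Scale a measured speed by SP count and utilization (same clock assumed)."""
    return base_speed * (new_sp / base_sp) * (new_util / base_util)

# Utilization-only gain from the 4+1 to 4x stream core change.
speedup = 98.6 / 83.5

# 5770: 1870M MD5/s with 800 SP at 83.5%; hypothetical 69XX: 768 SP at 98.6%.
estimate = scale_speed(1870, 800, 83.5, 768, 98.6)

print(round((speedup - 1) * 100), round(estimate))
```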

Of course, these are very premature assumptions, and there is some chance that ISA id=15 will turn out to be just a myth, who knows. But if you’re planning to upgrade your GPU right now (and use it mainly for GPGPU, not games), I suggest waiting for the 69XX release.

Updated speed estimations (with the above assumptions for 69XX) are available here.

The updated ighashgpu is available here. It should work with the 68XX family now; if it doesn’t, run it with the /debuglog switch and send me the “CAL device N, target = XX” value.

And yet again I got bored with RAR Password Recovery. Every time I think I’m almost done, boom, more problems arrive, mainly from the ATI GPU side. It’s simply ridiculous that the only Catalyst version able to produce good code for RAR kernels is still the year-old v9.9.

Anyway, I decided to relax a bit by switching to another problem. As PBKDF2 was implemented a long time ago (for OpenOffice & WinZip/AES) and it’s the main part of WPA-PSK protection, I decided to go in this direction and:

not bad for a first version and a single 5970 system, I guess. As it’s full WPA handshake validation, the speed is a bit lower than it should be for PMK generation only (more details about PMK generation and WPA overall are here). There are still some places to optimize, though.
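For reference, the PMK generation step on its own is easy to show on a CPU. In WPA/WPA2-PSK the PMK is PBKDF2-HMAC-SHA1 over the passphrase with the SSID as salt, 4096 iterations and a 256-bit output. The Python sketch below uses the standard library only; the function name is made up, and it covers just the PMK, not the rest of the handshake validation (PTK derivation and MIC check).

```python
import hashlib

def wpa_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 32-byte WPA/WPA2-PSK pairwise master key."""
    return hashlib.pbkdf2_hmac(
        "sha1",                      # HMAC-SHA1 as the PRF
        passphrase.encode("utf-8"),  # candidate password
        ssid.encode("utf-8"),        # network name is the salt
        4096,                        # iteration count
        32,                          # 256-bit key
    )

print(wpa_pmk("password", "IEEE").hex())
```

Since a 256-bit output needs two 160-bit PBKDF2 blocks and every iteration costs two SHA-1 transforms once the HMAC pads are precomputed, one candidate costs roughly 16K SHA-1 transforms. That is why PMK rates are counted in thousands per second rather than millions.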

Does anyone know where to get a large number of real WPA handshakes with matching passwords to test things out?

Firstly, I was going to make a separate post for AccentOPR v5 beta 1, which became available at the end of July. Some things distracted me, so it didn’t happen. Then I was going to post about the results of the GTX460 I purchased on August 1. Same thing. I missed a couple more things in September too. And today, browsing the sources, I found that ighashgpu v0.90 was compiled on 18 September… and not released, of course. So, to summarize all of these things.

AccentOPR v5 running on an i7-860+HD5770+HD4770+GT8600 config

AccentOPR v5 was released in October. The interface was completely rewritten (bye-bye, Borland Builder v5, it was so annoying to link optimized assembler code and GPU support with a 10-year-old compiler!), the password recovery engine was redone, the password generators were remade, and finally password recovery runs in a totally automated way, with no need to define the same attacks all over again. I had been thinking about this concept for a long time, and finally it is programmed and working. You can dig through this blog and even find a discount coupon for AccentOPR.

***

About the GTX460: at first it looked like a cut-in-half version of GF100, and initial performance tests showed this too. However, GF104 differs a lot from the original GF100. NVIDIA finally went the superscalar way, so from some point of view GF104 is now even closer to ATI’s VLIW architecture than to GF100. Well, this is too strong a claim, of course :) . But the main thing is that you need to vectorize your code for GF104 (and 106/108) to get the full performance of the GPU, as is already the case with ATI GPUs. Processing a single hash per thread will drop performance to 2/3 (for algorithms whose instructions heavily depend on each other, like MD5), thus making the 336SP of the GTX460 look like only 224SP. There are a lot of other changes in GF104, but they aren’t as important for hash calculations as code vectorization.

New version of ighashgpu running with GTX460+GTS250:

First tests with vectorization and MD5 show a speed of over 990M for a single hash on the GTX460, compared with 770M with the old ighashgpu version. After some tuning it grows to over 1B, but that’s still lower than it should be for “real” 336SP. However, from a price/performance point of view it’s very nice. I’ve also made a new version of the estimation tables, available here. (Obviously, the word “estimations” speaks for itself. Also, clocks and SP counts for the new ATI 6XXX are unknown yet, so it’s speculation all the way.) With proper programming, GF104/6/8 can provide very nice results; however, there’s always the “lazy factor”, and I’m affected by it: I simply can’t rewrite all kernels for GF104 right now. Luckily, SHA1 isn’t as affected as MD5, and even without vectorization it shows good results (a vectorized version of SHA-1 could be better… or worse, depending on register pressure, as GF104 doesn’t have more registers than GF100 and this can be a bottleneck when vectorizing things).

One more piece of good news for Fermi owners is that it does in fact support a “fused” SHL+ADD instruction. It means that a cyclic bit rotation can be done with two instructions instead of the “old” G80’s three. Not as good as ATI’s single bitalign instruction, but good enough to provide a (theoretical) ~12% speed-up for MD5 and ~19% for SHA-1 compared with the G80-GT200 architecture. It wasn’t clear whether it works on Fermi or not, but the latest tests show 1800M+ single MD5 hash speed for a stock-clocked GTX480, which is simply impossible without the SHL+ADD combo. The bad thing is that I need to add a GF100 to my GPU collection to run tests myself, and right now I’m already packed with a bunch of other GPUs… It’s becoming a space issue.
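Why a shift-and-add can stand in for the final OR of a rotation: the two shifted halves never have a set bit in the same position, so adding them gives the same result as OR-ing them. A small Python model of the two variants follows (integers only, not GPU code; the function names are illustrative).

```python
MASK32 = 0xFFFFFFFF

def rotl_three_ops(x: int, n: int) -> int:
    """Classic rotate left: SHL, SHR, OR."""
    left = (x << n) & MASK32
    right = x >> (32 - n)
    return left | right

def rotl_two_ops(x: int, n: int) -> int:
    """Rotate left as SHR followed by one fused shift-left-and-add."""
    right = x >> (32 - n)                 # the n high bits, moved to the bottom
    return ((x << n) + right) & MASK32    # low n bits of x << n are zero: no carries

x = 0xEFCDAB89
print(all(rotl_three_ops(x, n) == rotl_two_ops(x, n) for n in range(1, 32)))
```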

You can get ighashgpu v0.90 from here. However, note that it’s very limited for ATI GPUs for now: only 3 ATI GPU kernels were rewritten. I was going to replace the other kernels too, but (almost a month after the last compilation date) it doesn’t look like it’ll happen soon. Check out history.txt & file_id.diz. On the other hand, all kernels should work for NVIDIA GPUs, though vectorization is used only for the single MD5 kernel (and sm_21 GPUs).

The speed-up for a single NTLM hash on ATI GPUs looks most impressive with v0.90:

In short, feel free to experiment, but it’s really an alpha version and more of a proof of concept for ATI GPUs.

***

Also, one more interesting thing is going to happen soon: the release of the ATI 6XXX family. Unfortunately, there are almost no details about the architecture and specs right now. It looks like ATI finally changed VLIW from 5D to 4D, but it’s unknown how this will affect performance. In the best-case scenario it won’t change (or will even become better) compared with a similar number of 5XXX SPs. However, the current GPU names (well, one more time, everything is speculation of course) don’t match the 5XXX family: the 6870 being slower than the 5870 and the top 6990 slower than the current 5970. Of course, specs can change, and it’s too early to talk about real ratios. However, the 5XXX->6XXX change won’t bring as big a performance increase as 4XXX->5XXX did, where the new architecture + bitalign provided an over 2x speed boost. With 6XXX it looks more like just “+several percent”. But probably the 6XXX family will be more suitable for GPGPU than 5XXX. We’ll see very soon, I guess.