Hello, world!

So, one day after the previous post from April 16… (What? And two years later as well?! Can’t be!)

Should I get back to my earlier promise, “More information about SHA-512 performance on CPUs and GPUs will be in my next post”? Surprisingly enough, it’s still an interesting topic, as AMD has not fixed anything in int64 code generation for GPUs. I guess “stability” is everything for them — even if kernels using SHA-512 (e.g. password recovery for TrueCrypt) are about three times slower than they should be — “don’t touch anything if it works”. So (again!) manual binary kernel code hacking is required to get good performance. Or hooking into the .CL -> .IL -> ISA sequence and patching it on the fly. AMD GPU programming is always so-o-o FUN!

 

Or should I write about why 2013 was almost a waste, as some way too weird (to say the least) people from ElcomSoft (a Russian company) tried to sue me (a Russian citizen) and my business partner (also a Russian citizen) in a US court, using forged documents in the process? They tried to use a US patent as a “steamroller” to bring the lawsuit to the US instead of Russia, but failed to do so. It’s actually a very interesting and long story. But nah, not for today.

 

So, let’s move on to more pleasant themes. On February 18, 2014 NVIDIA announced a new architecture — Maxwell. While the GTX 750 and GTX 750 Ti are middle-range GPUs, they are quite interesting. I was able to purchase a 750 Ti OC version on April 1, ran some simple tests with it (welcome back, 4+-year-old ighashgpu!) and was simply amazed. 640 ALUs running at 1.15 GHz show 2100 M/s for single MD5 and 918 M/s for single SHA1 — beating the GTX 580 and GTX 680 and even challenging GTX 780 performance! I’ve updated my GPU estimation page, but it was unclear to me what exactly NVIDIA changed in the architecture to make Maxwell that fast.

Apparently the SHF instruction (circular shift == AMD’s bitalign) implemented in Titan GPUs is present in Maxwell as well (with Titan being SM 3.5 and Maxwell 5.0, that’s no surprise). This boosts performance a lot, but MD5 hashing speed didn’t increase as much as SHA1 speed. My assumption was that this is because my AMD MD5 implementation uses bitselect (== BFI_INT) instructions a lot, while SHA1 doesn’t depend on them as much. So I thought Maxwell was missing bitselect() altogether but showed good SHA1 performance because all 128 ALU units within an SMM can perform a circular shift.

Well, apparently I was wrong. And right, up to a point. Recently NVIDIA released the “CUDA 6 Production Release” with updated documentation. According to it, an SMM (containing 128 ALUs) can perform “only” 64 shifts per clock cycle — the same amount as Titan’s SMX (which contains 192 ALUs). OK. Being clueless, I decided to look at the disassembly produced by cuobjdump.exe, starting with the MD5 kernel:

LOP3.LUT R18, R20, R18, R19, 0xac;
IADD R17, R9, R18;
SHF.L.W R17, R17, 0x11, R17;

IADD R17, R17, R20;
LOP3.LUT R21, R19, R17, R20, 0xb8;
IADD R18, R6, R21;

SHF.L.W R18, R18, 0x16, R18;
IADD R18, R18, R17;
LOP3.LUT R21, R20, R18, R17, 0xb8;

IADD3 R19, R2, R19, R21;
SHF.L.W R19, R19, 0x7, R19;
IADD R19, R19, R18;

IADD3? LOP3.LUT?! What is this? Well, the first one is not much of a puzzle — an integer addition of 3 operands, with the result placed into a 4th register. But is it really done in one clock cycle?

LOP3.LUT raised even more questions for me — what is a LUT doing here? It was a simple expression:

#define F(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))

This “translates” into bitselect, but why a LUT? Presumably it stands for “Look-Up Table”, and LOP3 for “logical operation with 3 operands”. Then it suddenly hit me — it really is a LUT :). We provide 3 inputs to some logic function, there are 8 possible input combinations, and the output for each of them is encoded directly in the instruction as a “truth table” in an 8-bit immediate! So no, Maxwell doesn’t have a bitselect instruction. It has the even more powerful LOP3.LUT!
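A small sketch of how such a truth-table immediate can be derived (the helper macro and file are my own illustration, not from the original code): evaluate the boolean function on the canonical constants 0xF0, 0xCC, 0xAA, and the 8-bit result is the table the instruction encodes. For the symmetric majority function this directly reproduces the 0xe8 value used for the patch later in the post.

/* lut_imm.c: derive LOP3.LUT-style truth-table immediates (illustrative helper). */
#include <stdio.h>

#define LUT(f)           ((unsigned)f(0xF0u, 0xCCu, 0xAAu) & 0xFFu)

#define F_MD5(b, c, d)   ((((c) ^ (d)) & (b)) ^ (d))            /* MD5 selector (bitselect) */
#define F_40_59(b, c, d) (((b) & (c)) | (((b) | (c)) & (d)))    /* SHA1 majority            */

int main(void)
{
    /* Majority is symmetric, so operand order doesn't matter: prints 0xe8. */
    printf("F_40_59: 0x%02x\n", LUT(F_40_59));
    /* Asymmetric functions give different immediates depending on which register
       the compiler places in which operand slot, which is why the values seen in
       the disassembly above (0xac, 0xb8) differ from this direct evaluation. */
    printf("F_MD5  : 0x%02x\n", LUT(F_MD5));
    return 0;
}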

On Maxwell we can replace more than 3 logical instructions with a single LUT. And the SHA1 implementation is actually perfect for LOP3.LUT — its transformation functions are coded as:

#define F_00_19(b,c,d) ((((c) ^ (d)) & (b)) ^ (d))
#define F_20_39(b,c,d) ((b) ^ (c) ^ (d))
#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))
#define F_60_79(b,c,d) F_20_39(b,c,d)

So, that’s 3 inputs and 1 output, even for F_40_59, which takes 5 instructions. The one problem is that… the compiler doesn’t recognize it :). SHA1 rounds 40 to 59 compile into:

LOP.AND R31, R24, R18;
LOP.OR R17, R24, R18;

LOP3.LUT R10, R12, R25, R26, 0x96;
SHF.L.W R26, R24, 0x5, R24;
LOP3.LUT R17, R31, R16, R17, 0xf8;

So, it’s better than the previous code, because 5 instructions were transformed into 3 (AND + OR + LUT), but we don’t need 3 when a single one is enough. Solution? OK, let’s bit-hack the executable kernel, just as was done with AMD :D.

Instead of (correct):

#define F_40_59(b,c,d) (((b) & (c)) | (((b)|(c)) & (d)))

I’ve used (incorrect but “LUT-able”):

#define F_40_59(b,c,d) (((b) | (c)) & (d))

This define compiles into something like:

LOP3.LUT R30, R25, R26, R20, 0xe0;

And now we only need to replace the 0xe0 immediate (which represents (b | c) & d) with 0xe8 (the truth table for the correct F_40_59 function). For a single SHA1 hash there are exactly 20 places to apply this patch.

After these modifications SHA1 kernel performance increased from 918M to 981M, almost 7%! Patching the PBKDF2/SHA1 kernel gives a 4-5% speed-up. But some tuning is still required.

And this should apply to all MD/SHA-based hashing schemes, including (I guess; I haven’t seen how the compiler acts in that case) SHA256. Meaning — bitcoin mining, of course ;). I need to run some more tests on it.

All in all, the Maxwell chip is awesome. And with the GTX 750 being an “entry/mid level” GPU, it really raises the question — what will a top-end Maxwell-based GPU from NVIDIA show? If the GTX 880 really contains 3200 CUDA cores (there are such speculations), it will be a bomb!

 

The Accent password recovery product line was recently updated to support Maxwell-based GPUs. No bit hacking of kernels was used though, so there is still room for improvement.


Back to business

Recently I released IGPRS — a tool to recover passwords for Apple iOS 4.x & 5.x and BlackBerry 5.x & 6.x backups, TrueCrypt containers and WPA/WPA2 handshakes. The IGPRS x64 version was added today, with CPU AVX and XOP optimizations for the SHA-512 used in TrueCrypt containers.

IGPRS x64 running at HD7970 + HD4850

The initial release back in February used my old approach to supporting AMD GPUs — CAL API calls and kernels written in IL. However, with the new GCN architecture I faced several problems. First, AMD removed the global buffer on GCN GPUs (instead of emulating it via a UAV — after all, it was not a problem to emulate a UAV with the global buffer back in the 4xxx days). I was forced to waste some time figuring out how to deal with UAVs, but OK, it is not that hard (don’t use INT_4, just INT_1, etc). Later, however, things became worse: with Catalyst 12.3 I got several random lock-ups with simple kernels, and I was not able to run the PBKDF2/SHA512 kernel for TrueCrypt at all — the system just locks up, no matter what. After several days of programming and debugging I got really annoyed by all of this, decided to give up CAL/IL, and finally switched to OpenCL.

Things got better since the last time I looked at OpenCL: after a year (of very “hard” work, I guess) AMD made it possible to use BFI_INT and BIT_ALIGN_INT directly from OpenCL kernels (via bitselect() and amd_bitalign()). I was amazed at how easy it is to write GPU kernels for AMD cards now, with performance nearly the same as hand-written IL kernels… but I felt that way for a very short time :D.
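Just to illustrate the point, here is a minimal OpenCL kernel-side sketch of how a rotate and the MD5 “choose” function map onto these instructions (the macro names and the demo kernel are mine; only bitselect() and amd_bitalign() are the real built-ins):

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

/* rotate left by n: amd_bitalign(x, x, s) rotates right by s, so use s = 32 - n (maps to BIT_ALIGN_INT) */
#define ROTL32(x, n)    amd_bitalign((x), (x), 32u - (n))

/* F(b,c,d) = (b & c) | (~b & d): bits of c where b is 1, bits of d where b is 0 (maps to BFI_INT) */
#define MD5_F(b, c, d)  bitselect((d), (c), (b))

__kernel void demo(__global uint *out, uint b, uint c, uint d)
{
    out[0] = ROTL32(MD5_F(b, c, d), 7u);
}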

I faced nearly every kind of bug once I tried to implement more advanced algorithms — the AMD OpenCL compiler produces inefficient code, it simply locks up on complex kernels, it doesn’t know how to use the hardware capabilities of GPUs properly, and some kernels (after the compiler’s “optimizations”) simply produce incorrect results. It even replaces vector calculations with scalar ones (trying to favor the GCN architecture, I guess), which results in very poor performance on VLIW4/5 GPUs. Now I can’t decide which is more annoying — fighting the OpenCL compiler while checking the intermediate IL/ISA and hoping for proper code generation, or still writing kernels in IL, where at least you can control a bit more. Or maybe my old idea of writing my own GPU assembler for AMD GPUs was (very time-consuming, but) the much better thing to do after all?..

 

After I got a question about SHA-512 performance on my blog, I decided to take a closer look at the ISA produced by AMD’s OpenCL compiler and was totally disappointed with the results. More information about SHA-512 performance on CPUs and GPUs will be in my next post.

 


Another Big One

Almost a year ago I wrote a post about the 5970, and this week I finally grabbed a 6990 for my own tests. As the title states:

Same ruler, almost the same size as the 5970. I already had several test results from 6990 owners and they looked kind of weird — while the 6990 was faster than the 5970, it was still slower than my expectations. My own first tests produced values like:

 

That’s 10% slower. I tested with SHA1 kernels, SL3, Office 07-10, WPA — everything was slower than expected. I grabbed my old program for measuring the GFLOPS of ATI GPUs and started a series of experiments. Apparently lowering the GPU core frequency resulted in performance “closer to the estimates”. My first guess was that there is internal throttling in the 6990, so overheating was causing the performance drop. I even posted on the official forum about this, but some more experiments revealed that I wasn’t entirely right. The answer was pretty simple:

Yep, by default the 6990 isn’t given enough power to run at 100% performance! This adjustment results in:

The power limit for the 2nd core must also be tweaked, and then we see:

At last, the value I was expecting! Apparently a 5-second run is not good enough for precise measurements, so I increased the charset size. And, well, keep an eye on the hardware monitoring values, as the card reaches 90°C in no time (and hits 95°C very soon after).

Mystery solved. My several-months-old estimates were correct.

 

If you’re going to repeat the above steps with your 6990, make sure you have proper cooling and a capable PSU: it looks like the official 375 W TDP can easily become 450 W, and that means A LOT of heat you need to deal with somehow.

The latest ighashgpu can be downloaded here.


Another GPU?.. Ha!

Instead of buying another GPU a month ago, I chose something else:

Well, when I was buying the guitar I just couldn’t resist buying a tambourine too. I’m not sure about the English-speaking community, but in Russian culture the tambourine and IT are very closely related :D.

And, yes, the 6990 is going to be released soon, but at first sight it won’t be anything revolutionary (except for power usage, 450W!) — good ol’ 5970 is still an awesome option; by first estimates the 6990 will be just about 15-20% faster than the 5970.

But, anyway, who cares about GPUs, my next aim is… DRUMS! :D


About WPA and Amazon “clouds”

For a change — in Russian. A small article recently caught my eye: http://habrahabr.ru/blogs/infosecurity/111488/.

The original is here, though the real original should actually be somewhere at Black Hat. Naturally it all worked like a game of broken telephone: the further the story travelled, the less truth was left in it.

The gist: Amazon started offering servers with GPUs for rent. A machine with 2x Tesla M2050 costs $2.1/hour. What is a Tesla M2050 in hashing/cracking terms? It’s a GTX470 with lowered shader clocks, 3 GB of memory, ECC support and full-speed double-precision floating point. None of that is needed for hashing, yet it drives the price up to $2,600 apiece. Well, scientists will pay that kind of money, because for serious scientific computing Tesla simply has no competitors (among GPUs). But using Tesla for hash cracking… is like buying a Ferrari for freight hauling — it has more power than a KamAZ truck and costs more, so surely it can haul furniture faster! Definitely! The table with brute-force speeds on different cards is still in the usual place.

OK, suppose the final cost of the solution doesn’t matter that much to us and all we care about is the speed of the remote system — the scenario being that we sit in the bushes with a low-powered laptop, capture a WPA handshake with it, send it off to a remote server (Amazon or any other) and wait for the result.

Thomas Roth claims he achieved a speed of 400K/second while paying 28 cents per minute. That is 0.28*60 = $16.8/hour / $2.1 = 8 rented systems, or 16x Tesla M2050. According to the table mentioned above, that speed matches the estimate quite well — 25K per Tesla. He then says that cracking a neighboring network (in the original I didn’t notice anything about the network being protected by security professionals — just a “protected network in his neighborhood”) took 20 minutes, which he later reduced to 6 minutes. OK, so in 6 minutes he went through 6*60*400,000 = 144M passwords. That range can be covered in an hour on a single ATI 5770 costing $130. But that’s not the point.

144M passwords is roughly half of the “all lowercase Latin letters, 6 characters long” keyspace (which is 308.9M). So it’s quite possible that he really did brute-force some neighboring WiFi network with a password like “miguel”. But what do you call a person who sets passwords like that?..

With a more or less decent choice (upper- and lowercase Latin letters + digits, 8 characters long) the keyspace is 62^8 = 218,340,105,584,896 passwords. That’s 17.3 years of brute-forcing at 400K/s, or $2.5 million in Tesla rental fees for all that time. Quite a breakthrough, indeed!

Going further — suppose there are men in grey suits who are seriously interested in cracking WiFi networks. These people are smart and use the most efficient GPUs for cracking — the ATI 5970. The largest 5970-based GPU farm (not Fermi :)) known to me is 200 units. At full load it would show a speed of roughly 200 * 131,000 ≈ (rounded to account for overhead) 25M PMK/s, with the whole farm drawing somewhere around 100 kW.

The same “decent” 8-character password would then require 62^8 / 25M = 101 days of brute-forcing. The cost of the electricity burned has to be computed from local rates — not to mention the dedicated cooling.
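A quick back-of-the-envelope check of the numbers above (all rates and prices are the figures from this post; the little C program is just my sketch):

/* wpa_cost.c: reproduce the keyspace/time/cost estimates from the post. */
#include <stdio.h>

int main(void)
{
    unsigned long long keyspace = 1;
    for (int i = 0; i < 8; i++) keyspace *= 62ULL;     /* 62^8 = 218,340,105,584,896 */

    double amazon_rate = 400e3;                        /* PMK/s, 16x Tesla M2050     */
    double amazon_cost_per_hour = 16.8;                /* USD, 8 instances x $2.1    */
    double farm_rate = 25e6;                           /* PMK/s, 200x HD 5970 farm   */

    double amazon_seconds = keyspace / amazon_rate;
    double farm_seconds   = keyspace / farm_rate;

    printf("keyspace           : %llu\n", keyspace);
    printf("Amazon, years      : %.1f\n", amazon_seconds / (365.0 * 24 * 3600));            /* ~17.3 */
    printf("Amazon, rent (USD) : %.0f\n", amazon_seconds / 3600.0 * amazon_cost_per_hour);  /* ~2.5M */
    printf("5970 farm, days    : %.0f\n", farm_seconds / (24 * 3600));                      /* ~101  */
    return 0;
}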

So what are the conclusions? The main one — set proper passwords on your WiFi and no cracker will ever be a threat to you. And thinking with your own head is damn useful too; the PR people never sleep…

Happy New Year to everyone, and good luck in it!


Happy New Year

It would be wrong to end this year with a post like my previous one :)

So Happy New Year to everyone!

However, I can’t leave a post without any information related to the blog’s tagline, so here are a few things I’ve found interesting over the last few days.

1. I recently got a question about the AVX extension for upcoming CPUs, checked the latest docs available and suddenly found that (quote) “VI: “Vector Integer” instructions are not promoted to 256-bit” applies to every instruction needed for MD4/MD5/SHA1 hashes. This means AVX will be as useless for password cracking as SSE was, since the vector size increased from 128 to 256 bits only for floating-point values. Simply ridiculous. (The short SSE2 sketch after this list shows which integer instructions are involved.)

2. Somehow I had read Cayman’s ISA disassembly as if it were capable of doing 32-bit integer multiplications on each of the XYZW units. Actually I was wrong: in reality all 4 of these units are now required to perform one 32-bit multiplication. So while with the previous architecture it was possible to perform 4x additions/logic/bit-aligns AND a multiplication, a multiplication now takes 4x more instruction slots. Not very good for classic ZIP encryption…

3. …which is currently not supported on ATI GPUs at all, because they’re missing some functionality present in NVIDIA GPUs. So right now AccentZPR (v2.0 final, released recently) shows millions of passwords per second on NVIDIA GPUs and ZERO on ATI ones. A good counterexample to “Oh, ATI’s GPUs are so good and fast and cheap”… :P

4. For SHA-1 with a fixed charset (let’s say 10 symbols) and a fixed password length (like 15 characters + a 9-byte salt) it’s possible to optimize the algorithm a lot. I got 800 M/s compared with ighashgpu v0.80’s 680 M/s on a single 5770.
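As mentioned in item 1, here is a minimal SSE2 sketch (my own illustration, not taken from any docs) of the integer operations one MD5 step needs; the first AVX revision widened only the floating-point counterparts of such instructions to 256 bits, so none of these benefit:

#include <emmintrin.h>   /* SSE2 integer intrinsics */

/* One MD5 "F" step on four independent hashes packed into one vector
   (rotation constant 7, as in the first step of round 1). */
static inline __m128i md5_f_step(__m128i a, __m128i b, __m128i c, __m128i d,
                                 __m128i msg, __m128i k)
{
    __m128i f = _mm_xor_si128(d, _mm_and_si128(b, _mm_xor_si128(c, d)));   /* (b & c) | (~b & d) */
    a = _mm_add_epi32(a, _mm_add_epi32(f, _mm_add_epi32(msg, k)));
    a = _mm_or_si128(_mm_slli_epi32(a, 7), _mm_srli_epi32(a, 25));         /* rotl by 7: SSE2 has no vector rotate */
    return _mm_add_epi32(a, b);
}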

More information will come next year, so stay tuned!


Спи$дили (≈ “they f$cking stole it”)

Pardon my French.

I understand that many people disrespect ighashgpu’s license agreement (and so, in fact, disrespect me) by using it in a commercial environment. It clearly states that it’s free only for personal, non-commercial use, but nobody cares.

However, that is nothing compared to the motherfu$kers who took ighashgpu, removed all copyright notices, included it in their package and started selling it as the WORLD’S FIRST GPU SOLUTION FOR SL3 UNLOCKING. Seriously, are they that brain-damaged?..

I had been planning to release a SHA-1/SL3 version for some time, as I’m constantly being asked about it (well, it’s just a single SHA-1 iteration, so surely ighashgpu is ideal for this), but now… what’s the point, after all?

No download link.

***

One more screenshot — a dump of the sl3bf.exe contained in the MX-KEY package:

So “many” changes: they even renamed ighashgpu to mxhashgpu. Seriously, did they think nobody would notice this?!

***

More updates. Inside sl3bf.exe, at offset 0xd10f4, the ighashgpu executable starts. As simple as that.

Comparing ighashgpu.exe v0.80.16.1 with data inside sl3bf.exe:

[ighashgpu.exe] 524800 bytes
[ighashgpu_inside_sl3bf.exe_at_0xd10f4_offset] 522752 bytes
00000111 02 ( ) da (Ú)
00000112 06 ( ) 05 ( )
0000017c 88 (ˆ) 10 ( )
0000017d 09 ( ) 00 ( )
000002b8 88 (ˆ) 10 ( )
000002b9 09 ( ) 00 ( )
000002c1 0a ( ) 02 ( )
000002ed da (Ú) d2 (Ò)
00024870 69 (i) 6d (m)
00024872 67 (g) 78 (x)
00024924 69 (i) 6d (m)
00024926 67 (g) 78 (x)
0007c010 2a (*) 20 ( )
0007c012 2a (*) 20 ( )
0007c014 2a (*) 20 ( )
0007c016 2a (*) 20 ( )
0007c018 2a (*) 20 ( )

0007d966 54 (T) 00 ( )
0007d968 72 (r) 00 ( )
0007d96a 61 (a) 00 ( )
0007d96c 6e (n) 00 ( )
0007d96e 73 (s) 00 ( )
0007d970 6c (l) 00 ( )
0007d972 61 (a) 00 ( )
0007d974 74 (t) 00 ( )
0007d976 69 (i) 00 ( )
0007d978 6f (o) 00 ( )
0007d97a 6e (n) 00 ( )
0007d980 09 ( ) 00 ( )
0007d981 04 ( ) 00 ( )
0007d982 b0 (°) 00 ( )
0007d983 04 ( ) 00 ( )
1553 bytes out of 522752 are different

The full diff file is here. Copyrights removed, ig changed to mx… Awesome work indeed!
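For reference, a byte-level comparison like the listing above can be produced with a few lines of C (just a sketch; the file names and the 0xd10f4 offset are the ones from this post):

/* bindiff.c: compare ighashgpu.exe with the copy embedded in sl3bf.exe. */
#include <stdio.h>

int main(void)
{
    FILE *orig = fopen("ighashgpu.exe", "rb");
    FILE *pack = fopen("sl3bf.exe", "rb");
    if (!orig || !pack) { perror("fopen"); return 1; }

    fseek(pack, 0xd10f4, SEEK_SET);            /* embedded copy starts here */

    long offset = 0, diffs = 0;
    int a, b;
    while ((a = fgetc(orig)) != EOF && (b = fgetc(pack)) != EOF) {
        if (a != b) {
            printf("%08lx %02x (%c) %02x (%c)\n", offset,
                   a, (a >= 0x20 && a < 0x7f) ? a : ' ',
                   b, (b >= 0x20 && b < 0x7f) ? b : ' ');
            diffs++;
        }
        offset++;
    }
    printf("%ld bytes are different\n", diffs);

    fclose(orig);
    fclose(pack);
    return 0;
}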

***

27-Dec-2010 update.

To summarize everything:

1. ighashgpu’s code was stolen by the MX-KEY authors. The evidence is above.

2. They clearly understand that they are violating ighashgpu’s license agreement; that’s why they removed all copyright notices. A totally stupid move on their part.

3. It is OK to use ighashgpu in SL3 unlocking solutions as long as it doesn’t violate ighashgpu’s license agreement. And this means:

a) you can’t use it in a commercial environment without a separate agreement with the copyright holder (so you can’t charge money for the GPU brute-forcing itself, or use it in clusters that sell the results of GPU brute-forcing performed by ighashgpu);

b) you can’t modify any part of the executable file or the other files contained in ighashgpu’s distribution package;

c) if you’re including ighashgpu.exe in your package, you must include ighashgpu’s license agreement in your package as well.

Right now I’m totally OK with CycloneBox’s implementation of local SL3 unlock (except for the not-included license agreement part).

This post is locked for comments; if you want to contact me, you should know how to find me.


x4, x5, x6…

So, if you’ve read the comments on the previous post, you know there are some major speed-ups for MD5 — processing 5 hashes in a single thread, using BFI_INT to replace 3 instructions with 1 (much the same kind of optimization as the _rotl-to-bitalign change), etc. However, implementing BFI_INT lowered the total number of instructions required to perform a single MD5, so the utilization percentage dropped again. But, as I wrote yesterday on the ATI devforums, there are reserves — we can process 6 MD5 hashes per thread despite the hardware being VLIW5. Tried it, and yes — another speed-up is possible, though this time it’s just ~3%. However, x7 and x8 also look like good candidates to test.

Fresh version is here.

Also, the utilization percentages look interesting:

As you can see, for the newest Caymans with VLIW4 it makes almost no difference how the hashes are processed. For RV770 we reached the peak with x5, and x6 can’t change that significantly, as 98.4% is already a huge value. For RV870 there are still some options available, though the T unit of RV870’s VLIW cannot accept either BIT_ALIGN_INT or BFI_INT, which is why I guess utilization is stuck around 90%.

***

Updated the table with x7 & x8 results. It’s probably possible to push utilization further by scheduling instructions manually. I need to write a simulator for this.

As Cayman’s results don’t look too impressive (with the 1536 SP @ 880 MHz setup), the 5970 will stay the fastest GPU for hashing/cracking for several more months.


More updates (including RAR)

In the previous post I wrote about an 83.5% utilization percentage for MD5. While that value looks good enough, it is in fact not as good as what I already had for SHA-1 (95.5%). Back in January I tried to improve utilization by processing 5 MD5 hashes per thread (with a 5D VLIW that should obviously be the ideal setup). But either I did something wrong or the CAL compiler wasn’t in the mood; in any case, 5xMD5 wasn’t the best option back then — I only got slowdowns compared to the (classic) 4xMD5 vectors.

But recently Marc Bevand released Whitepixel, and he claims a speed of 28.6B/sec for 4×5970. Obviously that isn’t reachable with 83.5% utilization, so I made some tests with 5xMD5 again, and this time the speed-up is there. Simple IL kernel modifications end up at 95.5% utilization for the inner/main loop or, in other words, +12% performance. That’s 2.1B/s single MD5 on a 5770 and around 7.1B/s on a 5970 (I’m too lazy right now to check that myself). You can get the latest version of ighashgpu here (still very limited for ATI GPUs).

***

In other news — we’ve finally got the program icon from the designer, and so Accent RAR Password Recovery (beta) was released today. At last! There are a lot of things I want to write about RAR on GPUs (well, most of them negative :D), so I’m planning to make a separate post about it. But as many people are asking about Fermi/68xx support for RARs, I’ve decided to at least post this announcement here.

And, yes, this program is fully commercial (cruel world indeed!), but the discount coupon I posted some time ago should work with it (if not — let me know).


More about ATI 6XXX

It looks like the recently released 6850/6870 are just slightly tuned 57XX GPUs. There are no changes in supported instructions, cache size, double precision (not present, just as before) or anything else. The only difference I’ve found so far, reading forums and papers, is that “flow control clauses don’t require as many cycles [as 5XXX]”. Meaning that complex kernels with a large number of control clauses may work better on 68XX compared with 57XX. But that doesn’t matter for hash calculations (well, maybe a bit for huge multi-hash lists). Looks like the marketing guys won, so it’s the 68XX rather than the 67XX family now, though from a peak-performance point of view it looks weird that the 6870 is slower than the 5850. Of course that has nothing in common with 3D gaming benchmarks. But who buys modern GPUs for gaming these days? :D

Anyway, while 68XX looks totally boring from a programming point of view, the upcoming 69XX is a different story. It turns out that even Catalyst 10.6 can compile code for the mysterious ISA id=15, and the resulting disassembly looks very interesting — the T unit is indeed gone from ATI’s thread processors, and the XYZW units can now process instructions they weren’t able to handle before, like 32-bit integer multiplies. This basically means that the utilization percentage for ATI GPUs should grow, and I decided to check it.

I took 2 GPU kernels to analyze — PBKDF2 (the algorithm’s core here is just 2x SHA1 transforms) and single MD5. Right now (for the 5XXX family) utilization for PBKDF2 is already at 95.5%. After analyzing the disassembly for ISA id=15, it turns out it increases to 99.2%. The number of instructions is also reduced by about 1%, for a final estimated performance gain of 4.6%.

For MD5 the results look way more impressive — right now it’s really hard to fully utilize all 5 stream cores. I ran several tests with different numbers of hashes processed simultaneously per thread and ended up with the (first and default) value of 4. Utilization in this case is just around 83.5%. But with the new 4x stream cores, the 4 MD5 hashes can be vectorized perfectly, hitting a 98.6% utilization value. So that’s an ~18% speed-up just from the 4+1 to 4x stream-core architecture change. In other words, if there is a 69XX with 768 SP @ 850 MHz, it should show about 2100M single-MD5 speed compared to the current 1870M on a 5770 (800 SP @ 850 MHz).
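A rough sanity check of that estimate, using only the numbers quoted in this post (the 768 SP @ 850 MHz part is, as said, pure speculation):

/* est_69xx.c: back-of-the-envelope check of the 69XX single-MD5 estimate. */
#include <stdio.h>

int main(void)
{
    double md5_5770  = 1870e6;   /* measured single-MD5 rate, 800 SP @ 850 MHz */
    double util_now  = 0.835;    /* 4+1 VLIW5, 4 hashes per thread             */
    double util_new  = 0.986;    /* assumed 4x stream cores (ISA id=15)        */

    double speedup  = util_new / util_now;                       /* ~1.18, i.e. ~18% */
    double md5_69xx = md5_5770 * (768.0 / 800.0) * speedup;

    printf("utilization speed-up: %.1f%%\n", (speedup - 1.0) * 100.0);
    printf("estimated 69XX MD5  : %.0f M/s\n", md5_69xx / 1e6);  /* ~2100 M/s */
    return 0;
}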

Of course, these are very premature assumptions, and there is some chance that ISA id=15 will turn out to be just a myth, who knows. But if you’re planning to upgrade your GPU right now (and use it mainly for GPGPU, not games), I suggest waiting for the 69XX release.

Updated speed estimations (with the above assumptions for 69XX) are available here.

An updated ighashgpu is available here. It should work with the 68XX family now; if it doesn’t — run it with the /debuglog switch and send the “CAL device N, target = XX” value to me.
