So, if you’ve read the comments on the previous post, you know there are some major speed-ups for MD5: processing 5 hashes in a single thread, using BFI_INT to replace 3 instructions with 1 (much the same optimization as the change from _rotl to bitalign), and so on. However, implementing BFI_INT lowered the total number of instructions required for a single MD5, so the utilization percentage dropped again. But, as I wrote yesterday on the ATI devforums, there are reserves: we can process 6 MD5 hashes per thread even though the hardware is VLIW5. I tried it, and yes, another speed-up is possible, though this time it’s only ~3%. Still, x7 and x8 also look like good candidates to test.
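For readers who haven't met these two instructions, here is a minimal Python sketch of what they compute and why they fit MD5. It is only a model of the bit-level semantics, not the kernel code itself; the helper names (`bfi_int`, `bit_align_int`, `md5_f_bfi` and so on) are made up for illustration, and the three-operation form of F shown for comparison is the standard textbook identity, which may not be exactly what the kernel used before.

```python
MASK32 = 0xFFFFFFFF

def bfi_int(mask, a, b):
    """Bitfield insert: bits of a where mask is 1, bits of b elsewhere."""
    return ((a & mask) | (b & ~mask)) & MASK32

def bit_align_int(hi, lo, shift):
    """Shift the 64-bit value hi:lo right by shift (0..31), keep the low 32 bits."""
    return (((hi << 32) | lo) >> (shift & 31)) & MASK32

def md5_f_plain(x, y, z):
    # MD5 round-1 function written with three bitwise operations.
    return (z ^ (x & (y ^ z))) & MASK32

def md5_f_bfi(x, y, z):
    # F selects y where x is 1 and z where x is 0: a single bitfield insert.
    return bfi_int(x, y, z)

def md5_g_bfi(x, y, z):
    # G selects x where z is 1 and y where z is 0.
    return bfi_int(z, x, y)

def rotl_plain(x, n):
    # Rotate left as two shifts and an OR.
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotl_bitalign(x, n):
    # Rotate left by n is the same as aligning x:x right by (32 - n).
    return bit_align_int(x, x, 32 - n)
```

With the MD5 initial state words as sample inputs, `md5_f_bfi(0x67452301, 0xEFCDAB89, 0x98BADCFE)` gives the same value as `md5_f_plain` with the same arguments, and `rotl_bitalign(x, 7)` matches `rotl_plain(x, 7)`. That is the whole trick: fewer instructions for the same result, which also means fewer instructions available to fill the VLIW slots.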

Fresh version is here.

Also, the utilization percentages look interesting:

As you can see, for the newest Caymans with VLIW4 it makes almost no difference how the hashes are processed. For RV770 we reached the peak with x5, and x6 can’t change this significantly, as 98.4% is already a really huge value. For RV870 there are still some options available, though the T unit of RV870’s VLIW cannot accept either BIT_ALIGN_INT or BFI_INT, which is why, I guess, utilization is stuck around 90%.
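To see how a restricted T unit can cap utilization, here is a rough slot-counting bound in Python. It is a toy model under stated assumptions, not a measurement: it treats utilization as filled slots divided by total slots, ignores data dependencies entirely, and the function name and the op counts in the example are hypothetical.

```python
from math import ceil

def utilization_upper_bound(total_ops, restricted_ops, width=5, t_slots=1):
    """Best-case slot utilization for a VLIW bundle of `width` slots.

    restricted_ops: ops that cannot be issued on the T unit
                    (assumed here to be the BIT_ALIGN_INT / BFI_INT ops).
    t_slots:        number of slots per bundle that reject restricted ops.
    """
    # Every op needs a slot, so at least total_ops / width bundles.
    by_total = ceil(total_ops / width)
    # Restricted ops only fit in the remaining (width - t_slots) slots.
    by_restricted = ceil(restricted_ops / (width - t_slots))
    min_bundles = max(by_total, by_restricted)
    return total_ops / (min_bundles * width)
```

For example, with a hypothetical 100 ops of which 88 are restricted, `utilization_upper_bound(100, 88)` needs at least 22 bundles and so cannot exceed 100/110, roughly 91%. Once restricted ops make up more than 4/5 of the mix, the T slot has to sit idle part of the time no matter how many hashes are interleaved. Passing `width=4, t_slots=0` models a bundle with no restricted slot, where this particular cap disappears.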

***

Updated the table with x7 and x8 results. By manually scheduling instructions it’s probably possible to push utilization further. I need to write a simulator for this.

As Cayman’s results don’t look too impressive (with a 1536@880 setup), the 5970 will remain the fastest GPU for hashing/cracking for several more months.