# About CPU and GPU usage for password recovery

Earlier I wrote article about CPU & GPU but it was only available in russian. Now it's good time to summarize most important things in english.

### Peak performance

As password recovery (you can also call it password cracking if you're younger than 15 or password auditing if you're older than 30 but in any case it'll be the same thing) is embarrassingly parallel problem the only thing that really matters is peak performance of the computing device we're using and how good are we at programming to reach that peak performance.

#### CPUs

Let's start with CPUs. Intel Core 2 (and its successor Core i7) is capable to perform up to three ALU instructions per clock cycle. As we want to reach highest performance possible we'll definitely need to use SSE extension (so everything written below assumes that we're using SSE). Thus we have 3 instructions with 4 32-bit operands each == 12 instructions per clock. Unfortunately not every instruction can run on this highest rate. In fact it's only possible for simple logic like OR, XOR, AND. Integer additions already limited to 2 per clock. And only one shift is possible per clock. Single precision floating point (SPFP) also can be done at rate of one instruction per clock, this applies to multiple and addition (this is important for comparison with GPU and will be used later).

AMD K10 family processors capable to do up to 2 SSE ALU instructions per clock cycle. Performance for SPFP is the same as for Intel's CPUs (though latency is significantly higher but we don't care about latency now) -- one instruction per clock. So as SPFP performance doesn't matter for password recovery we'll use only Intel's CPU later as its faster on integer operations than AMD ones.

Now computing peak performance for CPU is a trivial thing. For example, let's take **Q6600**, we have four
cores running on 2.4Ghz and each core can perform 12 instructions with 32-bit integers per clock, so we
have 4 * 2.4 * 12 = **115.2 * 10^9** operations with 32-bit integers
per second. And computing SPFP performance we'll end with 4 * 2.4 * 4 = **38.4 GFLOPS** (i.e.
10^9 floating point operations per second).

#### GPUs

##### nVidia

It was very misleading (at least for me) to read first version of nVidia's CUDA documentation when it was like "one instruction takes 4 clocks to complete". Also having 3 different clocks within single GPU (core, shader and memory) wasn't good to understand how nVidia computes their peak performance.

Fortunately documentation changed and now everything is clear. Now it sounds like "it takes 4 clocks for one multiprocessor to handle one warp". Each multiprocessor has 8 scalar processors (SP) and each warp has 32 threads, so it's 4 clocks for 8 SPs to handle 32 threads or in other words each SP can do one instruction per clock. Simple as that!

For example, peak performance for nVidia GTX260 with 192 SPs will be 192 * 1.242Ghz (it's shader clock
that matters not core clock) = **238.464 * 10^9** operations per second. It doesn't looks that good as
expected (or said by PR guys), right? But there are two more tricks to make this result looks good.
GPU can perform "multiply and add" as single instruction, thus doubles the peak performance. Also it's
possible to issue one more SPFP multiply instruction at same clock cycle, so peak performance triples and
finally reaching 238.464 * 3 = **715.392 GFLOPS**. That's the number nVidia using to when talking about
FLOPS.

Unfortunately, MAD (multiply and add) and dual-issued SPFP MUL cannot be used at all with crypto functions
like MD4, MD5, SHA1, AES, etc. We need only integer operations and thus peak performance for
**GTX260/192SP** will be **238.464 * 10^9** operations with 32-bit integers per second.

##### ATI

Information here only related to HD4800 series GPUs (anything earlier is significantly slower with integer
operations). While it usually named as HD4850 with 800 SPs in fact it's different SPs from nVidia's ones.
ATI GPU has several SIMD engines, each engine has 16 thread processors, and each processor has 5 stream cores.
Thus for HD4850 we have 10 SIMD engines * 16 * 5 = 800 stream cores. Each stream core can perform one
instruction per clock, so peak performance for **HD4850**
will be 800 * 625Mhz = **500 * 10^9** instructions per second. It's possible to do SPFP MAD as single
instruction as for nVidia's GPU so peak performance for SPFP reaches **1000 GFLOPS**. It's not possible
to dual-issue SPFP MUL on ATI GPUs and it actually not needed as FLOPS ratio already high enough.

### How many instructions needed for single MD5?

It's not hard question. One MD5_Transform (which is core of MD5 function) requires 64 iterations,
there are 4 subfunctions named F, G, H, I; each used 16 times (thus 4*16 = 64). More detailed information about
MD5 hashing can be found here. Each subfunction requires
3 (2 for H) logic operations, 4 additions and one cyclic rotation. As there no cyclic rotation in SSE and GPUs
it must be replaced by two shifts and logic OR, thus we have 4(3) logic operations, 4 additions, 2 shifts per
subfunction.
So finally we need to do 16*10 + 16*10 + 16*9 + 16*10 = **624 operations
with 32-bit integers** to perform one MD5_Transform. (In fact we also need to initialize 4 values at
beginning and finalize them with 4 additions at the end but this value is good enough for estimation as
first iteration will be optimized out due to constant initializers used, etc).

One important thing is that CPUs using two operands per instruction, that means SSE instructions performed
as *destination = destination [operation] source* while GPUs using three operands per instruction,
so it's *destination = source1 [operation] source2*. It may looks not that important but actually
it means that we need to perform 2 more MOVE instructions per each subfunction (one for logic, one
for cyclic rotation). And with such low (only 10) instructions overall it produces +20% penalty for CPUs.
So we're ends with **752 instructions with 32-bit integers** required to perform one MD5_Transform
when there no three operands instructions available.

Another important thing is that for hash cracking we don't need to perform full 64 iterations. As hash divided into 4 32-bit values it's enough to perform 61 iterations to fully compute one of this values. And abandon 3 more iterations if this value doesn't match hash we're looking for. It's pretty common optimization done by almost every hash cracker.

But moreover, if we're attacking single hash we can compute it backwards storing some intermediate computations. Without complex details it means that we can compute only 45 out of 64 iterations to check correctness of input. It usually called «hash reversing». Compared with above 61 iterations it ends as 61/45 ~= 35% speed-up.

And even more -- when generating passwords most of the input bytes for MD5 are zero. For example, for passwords with less than 8 symbols in length only 3 out of 16 input 32-bit integers are non-zero and one of them is constant. So we can safely remove one addition from almost every subfunction.

To summarize this, checking passwords with less than 8 symbols in length for single md5 hash will
require only 16*9 + 2 + 16*9 + 2 + 13*8 + 2 =
**398 instructions with 32-bit integers** (488 for CPUs).

### Merging everything together

Now we have all information needed to estimate how fast it's possible to perform MD5 on different CPUs and GPUs and compare it with real results.

Clock rates and SP count can be found here for nVidia and here for ATI GPUs.

Practical results were measured with ighashgpu program.

Name |
Cores & Speed (Ghz) |
Peak performance with integers (billions per second) |
Peak performance with SPFP (GFLOPS) |
Theoretical md5 speed (61/64 rounds) |
Theoretical single md5 hash speed (45/64 rounds) |
Practical single md5 hash speed |
Percentage of theoretical performance gained |

Intel Core2 Q6600 |
4 * 2.4 |
115.2 |
38.4 |
160.9M |
236.0M |
188M |
79.7% |

nVidia GTX260 |
192 * 1.242 |
238.464 |
715.392 |
383.7M |
599.1M |
553M |
92.3% |

nVidia GTS250 |
128 * 1.836 |
235.008 |
705.024 |
378.2M |
590.5M |
557M |
94.3% |

ATI HD4770 |
640 * 0.75 |
480 |
960 |
808.0M |
1206M |
949M |
78.7% |

ATI HD4850 |
800 * 0.625 |
500 |
1000 |
841.8M |
1256M |
980M |
78.0% |

ATI 2xHD4870x2 |
2 * 800 * 2 * 0.750 |
2400 |
4800 |
4040M |
6030M |
4600M |
76.2% |

*(Note the huge difference (especially for nVidia's GPUs) between peak integer and floating point performance.
People tends to use FLOPS term while talking about hash cracking but FLOPS means nothing here, there no
floating point calculations at all (I guess it's just shorter to write FLOPS than anything else))*.

And now it's easy to estimate speed of upcoming GT300 and HD5800 series as their peak performance already known and there no difference at architecture for integer calculations.

*(While major GT300 feature is MIMD (multiple instructions multiple data) for embarrassingly parallel
problem like hash cracking MIMD doesn't matters at all as most of the time we're executing one instruction
for all data we have).*

Name |
Cores & Speed (Ghz) |
Peak performance with integers (billions per second) |
Expected utilization percentage |
Expected speed for single md5 hash |
Expected price on release |

Radeon HD 5850 |
1440 * 0.725 |
1044 |
78% |
2046M |
$299 |

Radeon HD 5870 |
1600 * 0.850 |
1360 |
78% |
2665M |
$399 |

GeForce 350 |
384 * 1.4 |
537.6 |
95% |
1283M |
$279 |

GeForce 380 |
512 * 1.6 |
819.2 |
95% |
1955M |
$499 |

If you curious why practical results with nVidia's GPUs are better (i.e. closer to theoretical results) there is simple answer. nVidia's GPUs consists of scalar processors (can performs one instruction per clock) while Intel CPUs are superscalar processors (up to 4 instructions per clock) and ATI's GPUs are VLIW ones (5 instructions coded into one VLIW), so it's harder to code Intel and ATI to reach peak performance. From other side it means that results for Intel and ATI can be noticeably improved with better programming skills/ideas.

### Some additional comments from September 25, 2009.

1. It's plain stupid to apply above benchmarks (and estimations) to 3D software like games. Games using SPFP calculations, hash cracking -- integers only. Also for hash cracking we don't care about ROPs, TMUs, MSAA, etc, etc.

2. Release of GT300 is far in future, so shaders speed/config can change of course. However (if you've read this article), it's easy to re-estimate GT300 performance with new values.

3. ATI making awesome hardware but ATI totally don't care about GPGPU. There is nearly zero GPGPU software
available for ATI GPUs, there is zero support for developers from ATI. HD5870 already released and there still
no new CAL SDK, no OpenCL for **GPUs** available (what's the point to release OpenCL for CPUs only?
I have no idea, I guess OpenCL won't work for GPUs earlier than HD5800 and it's simply not ready yet for HD5800).
It's been CAL beta for year +, now ATI switches to OpenCL and I guess it'll be another year, now with OpenCL
beta. For CPUs only.

If you want to do some serious GPGPU calculations **right now** there are no other options --
nVidia's GPUs & CUDA is the only real choice.

4. It's really amazing for me that some people have read this article as
«OMG! He know something about **real** GT300 core clocks!!!11».
For GT300 estimations I've used the clocks from the wiki page I've
already linked above.

### Some additional comments from February 16, 2010.

More information about MD5 and SHA1 speed on ATI HD 5XXX GPUs here.

More estimations (and real results for ATI HD 5XXX GPUs) here.

**(c)Ivan Golubev http://www.golubev.com**

September, 13-19 2009

+Some comments. September, 25 2009

+Some comments. February, 16 2010

September, 13-19 2009

+Some comments. September, 25 2009

+Some comments. February, 16 2010