src/ztex/fpga-sha256crypt/CHANGES.md
The new version has improved performance and power efficiency.
Performance (at 175 MHz) in different modes, with different key lengths, with a comparison to CPUs is shown in table 1.
+--------------+--------+------------+------------+----------------+
| | | ZTEX board | i5-4210M | Intel Celeron |
| | | 1.15y | OMPx4,AVX2 |Stepping 6,SSSE3|
+--------------+--------+------------+------------+----------------+
| key_len=5 | --mask | 85.3 Kc/s | 4.22 Kc/s | 0.58 Kc/s |
| rounds=5000 | --inc | 82.1 Kc/s | 4.12 Kc/s | 0.57 Kc/s |
+--------------+--------+------------+------------+----------------+
| key_len=10 | --mask | 80.3 Kc/s | 3.96 Kc/s | 0.38 Kc/s |
| rounds=5000 | --inc | 77.4 Kc/s | 3.70 Kc/s | 0.37 Kc/s |
+--------------+--------+------------+------------+----------------+
| key_len=20 | --mask | 68.7 Kc/s | 3.37 Kc/s | 0.31 Kc/s |
| rounds=5000 | --inc | 66.0 Kc/s | 3.16 Kc/s | 0.31 Kc/s |
+--------------+--------+------------+------------+----------------+
| key_len=10 | --mask | 8.05 Kc/s | 398 c/s | 38 c/s |
| rounds=50000 | --inc | 8.03 Kc/s | 390 c/s | 38 c/s |
+--------------+--------+------------+------------+----------------+
| key_len=10 | --mask | 804 c/s | 39 c/s | N/A (it does |
| rounds=500000| --inc | 802 c/s | 39 c/s | not respond) |
+--------------+--------+------------+------------+----------------+
--------------+-------------+--------------
/ / / \
+--------+ +--------+ +--------+ |
| | | | | | |
| SHA256 | | SHA256 | | SHA256 | |
| core | | core | | core | |
| #0 | | #1 | | #2 | |
| | | | | | |
+--------+ +--------+ +--------+ |
^ ^ ^ |
\ | / |
+----------+----------- |
| |
. . . . . | . . . . . . . . . . . . . . . . . . . . . |. .
| /
| /
+---------------+ ------------- /
| | / \ /
| process_bytes | / "main" \ <------
| |<--+--| memory |
+---------------+ | \ (6 x 128 B) / <------
| ^ | \ / \
+-------+ | / ------------- |
| procb | | / +------------+
| saved | | / +------------------+ | unit_input |
| state | | / | thread_state(x6) | +------------+
+-------+ | | +------------------+ ^
| | |
+-----------+ \ +-----------------------------+ |
| procb_buf | -->| C.P.U. | |
+-----------+ | +-----------------------+ | |
^ | | integer ops. incl. | | |
\ | | 6 x 16 registers(x16b)| | |
\ | +-----------------------+ | |
---------| | |
| +----------------------+ | |
| | thread switch (x6) | | |
+--------+ | +----------------------+ | |
| unit | | | instruction pointers | | |
| output |<-------| +----------------------+ | |
| buf | | | instruction memory | | |
+--------+ | +----------------------+ | |
| +-----------------------------+ |
| /
\ /
---> to arbiter_rx from arbiter_tx
SHA256 computations are performed using specialized circuits ("cores"). Each cycle, a core computes one of 64 algorithm rounds. Additionally 8 cycles it's busy with additions at the end of the block. Several cycles before a computations is finished and output, next computation starts loading Initial Values (IVs) and pre-fetching data from core's input buffer.
internally cores store result of SHA256, allow to use previously stored result as IVs for subsequent block. That way, the core is able to handle input of any length.
So each core performs 2 blocks in (64+8)*2 = 144 cycles.
CPU runs same program in 6 independent hardware threads. Each thread has 128 bytes of "main" memory (accessible by SHA256 subsystem), 16 x 16-bit registers. Data movement, integer, execution flow control operations are available. However there's only a subset if operations typically implemented in general-purpose CPUs, enough for the task.
CPU is heavily integrated with SHA256 subsystem. It has INIT_CTX, PROCESS_BYTES, FINISH_CTX instructions that are almost equivalent to init_ctx(), process_bytes(), finish_ctx() from software library. Each instruction takes 1 cycle to execute.
SHA256 instructions store instruction data in internal buffer (procb_buf). A dedicated unit (process_bytes) takes intermediate data, fetches input data from the memory, creates 16 x 32-bit data blocks for cores, adds padding and total where necessary. It saves the state of an unfinished computation, switches to the next core after each block.
The program for the CPU is available <a href='https://github.com/openwall/john/blob/bleeding-jumbo/src/ztex/fpga-sha256crypt/sha256crypt/cpu/program.vh'>here</a>.
------------------+--------+--+--+--------+---------
/ / / / / / \
+-----------+ +-----------+ +-----------+ |
| | | | X X X | | |
| Computing | | Computing | | Computing | |
| Unit #0 | | Unit #1 | | Unit N | |
| | | | X X X | | |
+-----------+ +-----------+ +-----------+ |
^ ^ ^ ^ ^ ^ |
| | | | | / |
+---------------+---------+--+--+-------- |
| |
. . . | . . . . . . . . . . . . . . . . . . . . . . . . . .| .
| |
| |
+-----------------+ +----------------+ /
| Arbiter | | Arbiter |<---------
| (transmit part) |<-------->| (receive part) |
+-----------------+ +----------------+
^ |
| +------------+ | +------
| +---------+ | |<-+->| mode \
| | cmp. |-->| comparator | | cmp |--
| | config. | | |---->| ? / \
| +---------+ +------------+ +------ |
| ^ |
. . . | . . . . .|. . . . . . . . . . . . . . . . . . . . | .
| | Communication framework |
| | /
+-----------+ | /
| candidate | | /
| generator | | /
+-----------+---------+----------------------+ /
| input pkt. handling | output pkt. creation |<-----
+---------------------+----------------------+
| input FIFO | output FIFO |
+---------------------+----------------------+
| Prog. clocks | USB I/O | FPGA reset |
+--------------------------------------------+
fig.2. Overview, FPGA application
Each computing unit uses ~2,450 LUT. Other types of hardware resources are not limiting. Here's a breakdown of resource usage by individual components in a unit:
The remaining 440 LUT (17.9%) is used mostly by logic that transforms PROCESS_BYTES instructions into SHA256 data blocks. That includes:
+--------------+ +--------------+
| unit #0 | | unit #11 |
+--------------+ |--------------+
| unit #1 |---------+ unit #13 |
+--------------+ unit #12|--------------+
| unit #2 | | unit #15 |
+--------------+---------+--------------+
| unit #3 | unit #14| unit #17 |
+--------------+ |--------------+
| unit #4 |---------+ unit #18 |
+--------------+ unit #16|--------------+
| unit #5 | | unit #19 |
+--------------+---------+--------------+
| unit #6 | unit #20| unit #21 |
+--------------+ |--------------+
| unit #7 |---------+ unit #23 |
+--------------+ unit #22|--------------+
| unit #8 | | unit #24 |
+--------------+---------+--------------+
| unit #9 |--------+ communication |
+--------------+ \ framework |
| unit #10 | +-------------+
+--------------+
fig.3. Area allocation in the FPGA