Table 1 shows performance (at 175 MHz) measured in different modes and with different key lengths, compared to CPUs.
+--------------+--------+------------+------------+----------------+
|              |        | ZTEX board |  i5-4210M  | Intel Celeron  |
|              |        |4x XC6SLX150| OMP x4,AVX2|Stepping 6,SSSE3|
+--------------+--------+------------+------------+----------------+
| salt length = 8
+--------------+--------+------------+------------+----------------+
| key_len=7    | --mask | 133.2 Kc/s | 6.39 Kc/s  | 0.88 Kc/s      |
| rounds=5000  | --inc  | 127.2 Kc/s | 6.32 Kc/s  | 0.87 Kc/s      |
+--------------+--------+------------+------------+----------------+
| salt length = 16
+--------------+--------+------------+------------+----------------+
| key_len=5    | --mask | 85.3 Kc/s  | 4.22 Kc/s  | 0.58 Kc/s      |
| rounds=5000  | --inc  | 82.1 Kc/s  | 4.12 Kc/s  | 0.57 Kc/s      |
+--------------+--------+------------+------------+----------------+
| key_len=10   | --mask | 80.3 Kc/s  | 3.96 Kc/s  | 0.38 Kc/s      |
| rounds=5000  | --inc  | 77.4 Kc/s  | 3.70 Kc/s  | 0.37 Kc/s      |
+--------------+--------+------------+------------+----------------+
| key_len=20   | --mask | 68.7 Kc/s  | 3.37 Kc/s  | 0.31 Kc/s      |
| rounds=5000  | --inc  | 66.0 Kc/s  | 3.16 Kc/s  | 0.31 Kc/s      |
+--------------+--------+------------+------------+----------------+
| key_len=10   | --mask | 8.05 Kc/s  | 398 c/s    | 38 c/s         |
| rounds=50000 | --inc  | 8.03 Kc/s  | 390 c/s    | 38 c/s         |
+--------------+--------+------------+------------+----------------+
| key_len=10   | --mask | 804 c/s    | 39 c/s     | N/A (it does   |
| rounds=500000| --inc  | 802 c/s    | 39 c/s     | not respond)   |
+--------------+--------+------------+------------+----------------+
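The table is easier to compare as ratios. The following is a small sketch that copies three `--mask` rows from the table and divides the board's rate by each CPU's rate. The names `rows` and `speedups` are illustrative only and are not part of John the Ripper.

```python
# Rates in candidates per second, copied from the table (--mask mode).
# Each tuple is (ZTEX board, i5-4210M, Intel Celeron).
rows = {
    "salt=8,  key_len=7,  rounds=5000": (133_200, 6_390, 880),
    "salt=16, key_len=10, rounds=5000": (80_300, 3_960, 380),
    "salt=16, key_len=10, rounds=50000": (8_050, 398, 38),
}


def speedups(fpga, i5, celeron):
    """Return how many times faster the board is than each CPU."""
    return fpga / i5, fpga / celeron


for name, rates in rows.items():
    vs_i5, vs_celeron = speedups(*rates)
    print(f"{name}: {vs_i5:.1f}x vs i5-4210M, {vs_celeron:.0f}x vs Celeron")
```

For these rows the board comes out roughly 20 times faster than the i5-4210M.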
             --------------+-------------+--------------
            /             /             /               \
       +--------+    +--------+    +--------+            |
       |        |    |        |    |        |            |
       | SHA256 |    | SHA256 |    | SHA256 |            |
       |  core  |    |  core  |    |  core  |            |
       |   #0   |    |   #1   |    |   #2   |            |
       |        |    |        |    |        |            |
       +--------+    +--------+    +--------+            |
           ^            ^            ^                   |
            \           |           /                    |
             +----------+-----------                     |
             |                                           |
   . . . . . | . . . . . . . . . . . . . . . . . . . . . |. .
             |                                          /
             |                                         /
    +---------------+         -------------           /
    |               |        /             \         /
    | process_bytes |       /    "main"     \ <------
    |               |<--+--|     memory      |
    +---------------+   |   \ (12 x 128 B)  / <------
        |     ^         |    \             /         \
    +-------+ |        /      -------------           |
    | procb | |       /                         +------------+
    | saved | |      /   +-------------------+  | unit_input |
    | state | |     /    | thread_state(x12) |  +------------+
    +-------+ |    |     +-------------------+          ^
              |    |                                    |
    +-----------+   \   +-----------------------------+ |
    | procb_buf |    -->|           C.P.U.            | |
    +-----------+       |  +-----------------------+  | |
            ^           |  | integer ops. incl.    |  | |
             \          |  |12 x 16 registers(x16b)|  | |
              \         |  +-----------------------+  | |
               ---------|                             | |
                        |  +----------------------+   | |
                        |  | thread switch (x12)  |   | |
      +--------+        |  +----------------------+   | |
      |  unit  |        |  | instruction pointers |   | |
      | output |<-------|  +----------------------+   | |
      |  buf   |        |  | instruction memory   |   | |
      +--------+        |  +----------------------+   | |
          |             +-----------------------------+ |
          |                                            /
           \                                          /
            ---> to arbiter_rx         from arbiter_tx
SHA256 computations are performed by specialized circuits ("cores"). Each cycle, a core computes one of the 64 algorithm rounds; it then spends 8 more cycles on the additions at the end of the block. Computations are interleaved: on one cycle the core computes a round of computation #0, and on the next cycle a round of computation #1. Several cycles before 2 computations finish and their results are output, the next 2 computations start loading Initial Values (IVs) and pre-fetching data from the core's input buffer.
Cores store the SHA256 result internally and can use a previously stored result as the IVs for the subsequent block. That way a core can handle input of any length.
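The chaining of blocks through IVs can be illustrated in software. The sketch below is plain textbook SHA-256 (FIPS 180-4) in Python, not a transcription of the Verilog cores. The helper names (`compress`, `first_primes`, `icbrt`) are made up for this example, and the round constants are derived from primes rather than typed in. Mapping the final eight additions in `compress` to the "8 cycles" mentioned above is an assumption.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(x):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Fractional bits of cube roots (round constants) and square roots (IVs).
K = [icbrt(p << 96) & MASK for p in PRIMES]
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """Process one 64-byte block; `state` plays the role of the IVs."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):  # the 64 rounds
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # Eight additions at the end of the block.
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


# A 64-byte message needs two blocks: the data, then padding plus bit length.
message = bytes(range(64))
tail = b"\x80" + b"\x00" * 55 + struct.pack(">Q", 512)
state = compress(IV, message)   # first block starts from the standard IVs
state = compress(state, tail)   # second block uses the stored result as IVs
digest = struct.pack(">8I", *state)
```

The result of the first block is fed back as the starting state of the second, which is what lets input of any length be processed one block at a time.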
Each core performs 4 independent computations in parallel and processes 4 blocks in (64+8)x4 = 288 cycles.
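As a rough worked example, the cycle count converts into a per-core block rate. The sketch assumes the cores run at the 175 MHz quoted for the performance measurements and are never idle, so the figure is an upper bound rather than a measured value. All constant names are illustrative.

```python
ROUND_CYCLES = 64         # one round per cycle
ADD_CYCLES = 8            # additions at the end of the block
COMPUTATIONS = 4          # independent computations per core
CLOCK_HZ = 175_000_000    # assumption: core clock equals the quoted 175 MHz

cycles_per_4_blocks = (ROUND_CYCLES + ADD_CYCLES) * COMPUTATIONS
blocks_per_second = CLOCK_HZ * COMPUTATIONS / cycles_per_4_blocks

print(cycles_per_4_blocks)                    # 288
print(f"{blocks_per_second / 1e6:.2f} M blocks/s per core")
```

Under these assumptions a core averages one block per 72 cycles, or about 2.43 million blocks per second.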
The CPU runs the same program in 12 independent hardware threads. Each thread has 128 bytes of "main" memory (accessible by the SHA256 subsystem) and 16 x 16-bit registers. Data movement, integer and execution flow control operations are available. However, only a subset of the operations typically implemented in general-purpose CPUs is provided, enough for the task.
The CPU is tightly integrated with the SHA256 subsystem. It has INIT_CTX, PROCESS_BYTES and FINISH_CTX instructions that are almost equivalent to init_ctx(), process_bytes() and finish_ctx() from a software library. Each instruction takes 1 cycle to execute.
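For readers unfamiliar with that three-call pattern, here is a software analogy using Python's `hashlib`. The wrapper functions are hypothetical and exist only to mirror the instruction names; they are not the on-chip CPU's program or the C library's actual signatures.

```python
import hashlib


def init_ctx():
    """Start a new SHA256 computation (analogy for INIT_CTX)."""
    return hashlib.sha256()


def process_bytes(ctx, data):
    """Feed more input into the computation (analogy for PROCESS_BYTES)."""
    ctx.update(data)


def finish_ctx(ctx):
    """Apply padding and return the 32-byte result (analogy for FINISH_CTX)."""
    return ctx.digest()


ctx = init_ctx()
process_bytes(ctx, b"password")   # example inputs, chosen arbitrarily
process_bytes(ctx, b"saltsalt")
digest = finish_ctx(ctx)
```

Feeding the input in several pieces gives the same digest as hashing the concatenation in one call, which is why a sequence of PROCESS_BYTES-style steps can build up the message incrementally.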
SHA256 instructions store their data in an internal buffer (procb_buf). A dedicated unit (process_bytes) takes this intermediate data, fetches input data from the memory, creates 16 x 32-bit data blocks for the cores, and adds padding and the total length where necessary. It saves the state of an unfinished computation and switches to the next core after each block.
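The block format that unit produces can be sketched in software. The function below applies standard SHA-256 padding (a 0x80 byte, zeros, then the 64-bit total length in bits) and splits the result into blocks of 16 big-endian 32-bit words. `to_blocks` is a hypothetical name, and the sketch makes no claim about how the hardware sequences these steps.

```python
import struct


def to_blocks(message: bytes):
    """Pad a message and split it into blocks of 16 x 32-bit words."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    # Zero-fill so that 8 bytes remain for the length in the last block.
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", bit_len)
    return [struct.unpack(">16I", padded[i:i + 64])
            for i in range(0, len(padded), 64)]


blocks = to_blocks(b"abc")
print(len(blocks), hex(blocks[0][0]), blocks[0][15])
```

A 55-byte input still fits in one block, while a 56-byte input spills the length field into a second block.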
The program for the CPU is available <a href='https://github.com/openwall/john/blob/bleeding-jumbo/src/ztex/fpga-sha256crypt/sha256crypt/cpu/program.vh'>here</a>.
       ------------------+--------+--+--+--------+---------
      /                 /        /  /  /        /          \
  +-----------+   +-----------+             +-----------+   |
  |           |   |           |   X  X  X   |           |   |
  | Computing |   | Computing |             | Computing |   |
  |  Unit #0  |   |  Unit #1  |             |  Unit N   |   |
  |           |   |           |   X  X  X   |           |   |
  +-----------+   +-----------+             +-----------+   |
        ^               ^         ^  ^  ^         ^         |
        |               |         |  |  |        /          |
        +---------------+---------+--+--+--------           |
        |                                                   |
  . . . | . . . . . . . . . . . . . . . . . . . . . . . . . | .
        |                                                   |
        |                                                   |
  +-----------------+          +----------------+          /
  |     Arbiter     |          |    Arbiter     |<---------
  | (transmit part) |<-------->| (receive part) |
  +-----------------+          +----------------+
        ^                                    |
        |                    +------------+  |  +------
        |      +---------+   |            |<-+->| mode \
        |      | cmp.    |-->| comparator |     |  cmp  |--
        |      | config. |   |            |---->|   ?  /   \
        |      +---------+   +------------+     +------     |
        |          ^                                        |
  . . . | . . . . .|. . . . . . . . . . . . . . . . . . . . | .
        |          |        Communication framework         |
        |          |                                       /
   +-----------+   |                                      /
   | candidate |   |                                     /
   | generator |   |                                    /
   +-----------+---------+----------------------+      /
   | input pkt. handling | output pkt. creation |<-----
   +---------------------+----------------------+
   |     input FIFO      |     output FIFO      |
   +---------------------+----------------------+
   | Prog. clocks |   USB I/O    |  FPGA reset  |
   +--------------------------------------------+
fig.2. Overview, FPGA application
The FPGAs receive a clock input generated by the Cypress USB device controller (you can see the 24 MHz quartz resonator on the board). Each FPGA internally generates a number of clocks for various parts of the application:
Each computing unit uses ~2,650 LUTs. Other types of hardware resources are not limiting. Here's a breakdown of resource usage by individual components in a unit:
The remaining 520 LUTs (19.5%) are used mostly by logic that transforms PROCESS_BYTES instructions into SHA256 data blocks. That includes:
+--------------+         +--------------+
|   unit #0    |         |   unit #11   |
+--------------+         |--------------+
|   unit #1    |---------+   unit #13   |
+--------------+ unit #12|--------------+
|   unit #2    |         |   unit #15   |
+--------------+---------+--------------+
|   unit #3    | unit #14|   unit #17   |
+--------------+         |--------------+
|   unit #4    |---------+   unit #18   |
+--------------+ unit #16|--------------+
|   unit #5    |         |   unit #19   |
+--------------+---------+--------------+
|   unit #6    | unit #20|   unit #21   |
+--------------+         |--------------+
|   unit #7    |---------+   unit #22   |
+--------------+ unit #10|--------------+
|   unit #8    |         /              |
+--------------+--------/ communication |
|   unit #9    |--------+   framework   |
+--------------+        +---------------+
fig.3. Area allocation in the FPGA