src/ztex/fpga-sha512crypt/README.md

                --------------+-------------+-------------+-------
               /             /             /             /        \
         +--------+    +--------+    +--------+    +--------+      |
         |        |    |        |    |        |    |        |      |
         | SHA512 |    | SHA512 |    | SHA512 |    | SHA512 |      |
         |  core  |    |  core  |    |  core  |    |  core  |      |
         |   #0   |    |   #1   |    |   #2   |    |   #3   |      |
         |        |    |        |    |        |    |        |      |
         +--------+    +--------+    +--------+    +--------+      |
               ^            ^            ^             ^           |
                \           |            |            /            |
                 ---+-------+------------+------------             |
                    |                                              |
          . . . . . | . . . . . . . . . . . . . . . . . . . . . . .| .
                    |                           sha512_engine      |
                    |                                              |
           +---------------+            ------------               |
           |               |           /            \             /
           | process_bytes |          /    "main"    \ <----------
           |               |<--+-----|     memory     |
           +---------------+   |      \ (16 x 256B)  / <--------
             |       ^         |       \            /           \
         +-------+   |        /         ------------             |
         | procb |   |       /                              +---------------+
         | saved |   |      /      +-------------------+    |  unit_input   |
         | state |   |     /       | thread_state(x16) |    +---------+     |
         +-------+   |    |        +-------------------+    | program |     |
                     |    |                                 | selector|     |
                     |    |                                 +---------+-----+
                     |    |                                       |      ^
             +-----------+ \   +------------------------------+   |      |
             | procb_buf |  -->|            C.P.U.            |   |      |
             +-----------+     |  +------------------------+  |   |      |
                   ^           |  |   integer ops. incl.   |  |   |      |
                    \          |  | 16x16 registers(x16bit)|  |   |      |
                     \         |  +------------------------+  |   |      |
                      ---------|                              |   |      |
                               |  +----------------------+    |   |      |
                               |  | thread switch (x16)  |    |  /       |
             +--------+        |  +----------------------+    | /        |
             |  unit  |        |  | instruction pointers |<-- |          |
             | output |<-------|  +----------------------+    |          |
             |  buf   |        |  |  instruction memory  |    |          |
             +--------+        |  +----------------------+    |          |
                 |             +------------------------------+          |
                 |                                                      /
                  \                                                    /
                   ---> to arbiter_rx                  from arbiter_tx

SHA512 computations are performed by specialized circuits ("cores"). Each cycle, a core computes one of the 80 rounds of the algorithm; it then spends 8 more cycles on the additions at the end of the block. Each core interleaves 2 computations: on one cycle it computes a round of computation #0, and on the next cycle a round of computation #1. Several cycles before these 2 computations finish and output their results, the next 2 computations start loading Initial Values (IVs) and pre-fetching data from the core's input buffer.
Internally, a core stores the SHA512 result and can use the previously stored result as the IVs for a subsequent block. That way, a core can handle input of any length.
So each core runs 4 independent computations in parallel and processes 4 blocks in (80+8)*4 = 352 cycles.
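
The cycle count above can be reproduced with a short calculation. This is only a sketch: the function names and the `clock_hz` / `cores` inputs are hypothetical, and only the 80 rounds, the 8 extra cycles and the 4 computations per core come from the text.

```python
# Figures taken from the text: 80 rounds plus 8 cycles of final additions
# per block, and 4 computations in flight per core.
ROUNDS_PER_BLOCK = 80
FINAL_ADD_CYCLES = 8
COMPUTATIONS_PER_CORE = 4


def cycles_per_batch(computations: int = COMPUTATIONS_PER_CORE) -> int:
    """Cycles a core needs to finish one block for each of its computations."""
    return (ROUNDS_PER_BLOCK + FINAL_ADD_CYCLES) * computations


def blocks_per_second(clock_hz: float, cores: int) -> float:
    """Estimated block throughput; clock_hz and cores are hypothetical inputs."""
    return clock_hz * cores * COMPUTATIONS_PER_CORE / cycles_per_batch()
```

`cycles_per_batch()` gives the 352 cycles quoted above. `blocks_per_second` turns that into a rough throughput estimate once a clock frequency and a core count are supplied; neither value is stated in this section.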
The CPU runs the same program in 16 independent hardware threads. Each thread has 256 bytes of "main" memory (accessible by the SHA512 subsystem) and 16 x 16-bit registers. 16-bit data movement, integer and execution flow control operations are available. However, only a subset of the operations typically implemented in general-purpose CPUs is provided, enough for the task.
The reference implementation of the algorithm uses a modulo operation. Since a "generic" modulo turned out to be area-consuming, it was replaced with an 'if A equals X then A=0, else A=A+1' operation.
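
As an illustration of why the replacement is sufficient, the sketch below models the 'if A equals X then A=0, else A=A+1' operation in software. The function names are hypothetical and the code does not reproduce the actual CPU instruction encoding. A counter stepped this way tracks `i % (X + 1)` without ever computing a modulo.

```python
def wrap_increment(a: int, x: int) -> int:
    """Model of the operation: if A equals X then A=0, else A=A+1."""
    return 0 if a == x else a + 1


def counter_sequence(x: int, steps: int) -> list:
    """Values a counter takes when stepped `steps` times, starting from 0."""
    a = 0
    values = []
    for _ in range(steps):
        values.append(a)
        a = wrap_increment(a, x)
    return values
```

With `x = 2` the counter cycles through 0, 1, 2, which is the same as keeping a loop index modulo 3; `x = 6` gives the index modulo 7.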
The CPU is heavily integrated with the SHA512 subsystem. It has INIT_CTX, PROCESS_BYTES and FINISH_CTX instructions that are almost equivalent to init_ctx(), process_bytes() and finish_ctx() from the software library. Each instruction takes 1 cycle to execute.
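
For readers unfamiliar with the init / process / finish pattern, here is a rough software analogy using Python's standard `hashlib`. It is an assumption that the mapping is this direct; `digest_in_pieces` is a hypothetical helper, and the hardware instructions operate on thread memory rather than on Python byte strings.

```python
import hashlib


def digest_in_pieces(pieces) -> bytes:
    """Hash several byte strings as one message, in three distinct steps."""
    ctx = hashlib.sha512()      # roughly what INIT_CTX / init_ctx() does
    for piece in pieces:
        ctx.update(piece)       # roughly PROCESS_BYTES / process_bytes()
    return ctx.digest()         # roughly FINISH_CTX / finish_ctx()
```

Feeding the data in several pieces gives the same digest as hashing it in one go, which is what lets a program build the message from separate parts (for example a key and a salt) with a sequence of PROCESS_BYTES instructions.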
SHA512 instructions store instruction data in an internal buffer (procb_buf). A dedicated unit (process_bytes) takes this intermediate data, fetches input data from the memory, creates 16 x 64-bit data blocks for the cores, and adds padding and the total length where necessary. It saves the state of an unfinished computation and switches to the next core after each block.
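
The block format that process_bytes has to produce is the standard SHA-512 one. The sketch below shows it in software under simplifying assumptions: `make_blocks` is a hypothetical helper that receives the whole message at once, whereas the hardware unit works incrementally and keeps saved state between blocks.

```python
import struct

BLOCK_BYTES = 128  # one SHA-512 block: 16 words of 64 bits


def make_blocks(message: bytes) -> list:
    """Pad a message per SHA-512 and split it into blocks of 16 x 64-bit words."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                        # mandatory 1 bit
    # Zero-fill so that 16 bytes remain for the total at the end of a block.
    padded += b"\x00" * ((-len(padded) - 16) % BLOCK_BYTES)
    padded += bit_length.to_bytes(16, "big")          # 128-bit total, in bits
    blocks = []
    for offset in range(0, len(padded), BLOCK_BYTES):
        chunk = padded[offset:offset + BLOCK_BYTES]
        blocks.append(list(struct.unpack(">16Q", chunk)))  # big-endian words
    return blocks
```

A 3-byte message fits into one block, while a 112-byte message needs two, because the padding byte and the 16-byte total no longer fit into the first block.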
Several programs can be hardcoded in the CPU's instruction memory; the required program is selected at runtime. Currently there are 2 programs: sha512crypt and Drupal7 CMS hashes.
The program for the CPU is available <a href='https://github.com/openwall/john/blob/bleeding-jumbo/src/ztex/fpga-sha512crypt/sha512crypt/cpu/program.vh'>here</a>.

                ------------------+--------+--+--+--------+---------
               /                 /        /  /  /        /          \
          +-----------+   +-----------+            +-----------+     |
          |           |   |           |    X X X   |           |     |
          | Computing |   | Computing |            | Computing |     |
          |  Unit #0  |   |  Unit #1  |            |  Unit N   |     |
          |           |   |           |    X X X   |           |     |
          +-----------+   +-----------+            +-----------+     |
                ^               ^         ^  ^  ^         ^          |
                |               |         |  |  |        /           |
                +---------------+---------+--+--+--------            |
                |                                                    |
          . . . | . . . . . . . . . . . . . . . . . . . . . . . . . .| .
                |                                                    |
                |                                                    |
           +-----------------+          +----------------+          /
           |     Arbiter     |          |    Arbiter     |<---------
           | (transmit part) |<-------->| (receive part) |
           +-----------------+          +----------------+
                ^                                    |
                |                    +------------+  |  +------
                |      +---------+   |            |<-+->| mode \
                |      | cmp.    |-->| comparator |     | cmp   |--
                |      | config. |   |            |---->|  ?   /   \
                |      +---------+   +------------+     +------     |
                |          ^                                        |
          . . . | . . . . .|. . . . . . . . . . . . . . . . . . . . | .
                |          |         Communication framework        |
                |          |                                       /
           +-----------+   |                                      /
           | candidate |   |                                     /
           | generator |   |                                    /
           +-----------+---------+----------------------+      /
           | input pkt. handling | output pkt. creation |<-----
           +---------------------+----------------------+
           |     input FIFO      |     output FIFO      |
           +---------------------+----------------------+
           | Prog. clocks |   USB I/O   |  FPGA reset   |
           +--------------------------------------------+

fig.2. Overview, FPGA application
define SIMULATION in sha512.vh uncommented.

    +--------+        +--------+
    |        +--------+        |
    |        |        |        |
    | unit1  | unit2  | unit3  |
    |        |        |        |
    +--------+----+---+--------+
    |             |            |
    |    unit4    |   unit8    |
    |             |            |
    +-------------+------------+
    |             |            |
    |   unit 10   |  unit 11   |
    |             |            |
    +-------------+------------+
    |             |            |
    |    unit5    |   unit6    |
    |             |            |
    +--------+----+------------+
    |        |        |        |
    |        | unit9  |        |
    | unit0  |        | unit7  |
    |        +--------+        |
    |        |comm.f/w|        |
    +--------+        +--------+

fig.3. Area allocation in the FPGA