openhwgroup · davideschiavone · Oct 10, 2023 · Oct 4, 2023 · Oct 10, 2023
diff --git a/docs/source/instruction_set_extensions.rst b/docs/source/instruction_set_extensions.rst
@@ -1349,9 +1349,9 @@ SIMD ALU operations
   +------------------------------------------------------------+------------------------------------------------------------------+
   | **Mnemonic**                                               | **Description**                                                  |
   +============================================================+==================================================================+
-  | **cv.add[.sc,.sci]{.h,.b} rD, rs1, [rs2, Imm6]**           | rD[i] = (rs1[i] + op2[i]) & 0xFFFF                               |
+  | **cv.add[.sc,.sci]{.h,.b} rD, rs1, [rs2, Imm6]**           | rD[i] = (rs1[i] + op2[i]) & {0xFFFF, 0xFF}                       |
   +------------------------------------------------------------+------------------------------------------------------------------+
-  | **cv.sub[.sc,.sci]{.h,.b} rD, rs1, [rs2, Imm6]**           | rD[i] = (rs1[i] - op2[i]) & 0xFFFF                               |
+  | **cv.sub[.sc,.sci]{.h,.b} rD, rs1, [rs2, Imm6]**           | rD[i] = (rs1[i] - op2[i]) & {0xFFFF, 0xFF}                       |
   +------------------------------------------------------------+------------------------------------------------------------------+
   | **cv.avg[.sc,.sci]{.h,.b} rD, rs1, [rs2, Imm6]**           | rD[i] = ((rs1[i] + op2[i]) & {0xFFFF, 0xFF}) >> 1                |
   |                                                            |                                                                  |
@@ -2146,11 +2146,11 @@ No carry, overflow is generated. Instructions are rounded up as the mask & 0xFFF
   |                                       | Note: Arithmetic shift right.                                                         |
   +---------------------------------------+---------------------------------------------------------------------------------------+
 
-SIMD Complex-numbers Encoding
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+SIMD Complex-number Encoding
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. table:: SIMD ALU encoding
-  :name: SIMD ALU encoding
+.. table:: SIMD Complex-number encoding
+  :name: SIMD Complex-number encoding
   :widths: 11 4 4 9 7 8 8 13 36
   :class: no-scrollbar-table
 

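Review note: the corrected descriptions for `cv.add` and `cv.sub` apply the mask per element, `0xFFFF` for `.h` and `0xFF` for `.b`. A minimal behavioural sketch of that wrap-around is given below. It assumes a 32-bit register and assumes `op2` has already been expanded to one value per lane (for the `.sc`/`.sci` variants, the scalar or immediate replicated). The function name `simd_addsub` is hypothetical and not part of the core or its documentation.

```python
def simd_addsub(rs1: int, op2: int, elem_bits: int, subtract: bool = False) -> int:
    """Element-wise wrap-around add/sub over a 32-bit register (illustrative model)."""
    mask = (1 << elem_bits) - 1  # 0xFFFF for .h (16-bit), 0xFF for .b (8-bit)
    result = 0
    for shift in range(0, 32, elem_bits):
        a = (rs1 >> shift) & mask  # lane i of rs1
        b = (op2 >> shift) & mask  # lane i of op2 (assumed pre-expanded)
        lane = (a - b) if subtract else (a + b)
        # No carry or borrow crosses into the neighbouring lane.
        result |= (lane & mask) << shift
    return result
```

For example, `simd_addsub(0x0001FFFF, 0x00010001, 16)` wraps the low halfword to zero without carrying into the high halfword.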
diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
@@ -29,15 +29,25 @@ Pipeline Details
 CV32E40P has a 4-stage in-order completion pipeline, the 4 stages are:
 
 Instruction Fetch (IF)
-  Fetches instructions from memory via an aligning prefetch buffer, capable of fetching 1 instruction per cycle if the instruction side memory system allows. This prefetech buffer is able to store 2 32-b data. The IF stage also pre-decodes RVC instructions into RV32I base instructions. See :ref:`instruction-fetch` for details.
+  Fetches instructions from memory via an aligning prefetch buffer, capable of fetching 1 instruction per cycle if the instruction-side memory system allows. This prefetch buffer can store two 32-bit words.
+  The IF stage also pre-decodes RVC instructions into RV32I base instructions. See :ref:`instruction-fetch` for details.
 
 Instruction Decode (ID)
   Decodes fetched instruction and performs required register file reads. Jumps are taken from the ID stage.
 
 Execute (EX)
-  Executes the instructions. The EX stage contains the ALU, Multiplier and Divider. Branches (with their condition met) are taken from the EX stage. Multi-cycle instructions will stall this stage until they are complete. The ALU, Multiplier and Divider instructions write back their result to the register file from the EX stage. The address generation part of the load-store-unit (LSU) is contained in EX as well.
+  Executes the instructions. The EX stage contains the ALU, Multiplier and Divider. Branches (with their condition met) are taken from the EX stage. Multi-cycle instructions will stall this stage until they are complete.
+  The ALU, Multiplier and Divider instructions write back their result to the register file from the EX stage. The address generation part of the load-store-unit (LSU) is contained in EX as well.
 
-  The FPU writes back its result from EX stage as well when FPU_*_LAT is either 0 cycle or more than 1 cycle. It is reusing register file ALU/Mult/Div write port and it has the highest priority so it will stall EX stage if there is a conflict (when FPU_*_LAT > 1).
+  The FPU also writes back its result from the EX stage, through the same ALU/Mult/Div register file write port, when FPU_*_LAT is either 0 cycles or greater than 1 cycle.
+  When FPU_*_LAT > 1, FPU write-back has the highest priority, so it stalls the EX stage if there is a conflict. There are a few exceptions to this FPU priority over ALU/Mult/Div:
+
+  * There is a multi-cycle MULH in EX.
+  * There is a misaligned LOAD/STORE in EX.
+  * There is a post-increment LOAD/STORE in EX.
+
+  In these 3 cases, EX is not stalled.
+  The FPU result (and flags) are held and written back to the register file (and FPU CSR) as soon as the conflict has cleared.
 
 Writeback (WB)
   Writes the result of Load instructions back to the register file.
@@ -68,7 +78,8 @@ Those cycles penalty can be hidden if the compiler is able to add instructions b
 Single- and Multi-Cycle Instructions
 ------------------------------------
 
-:numref:`Cycle counts per instruction type` shows the cycle count per instruction type. Some instructions have a variable time, this is indicated as a range e.g. 1..32 means that the instruction takes a minimum of 1 cycle and a maximum of 32 cycles. The cycle counts assume zero stall on the instruction-side interface and zero stall on the data-side memory interface.
+:numref:`Cycle counts per instruction type` shows the cycle count per instruction type. Some instructions take a variable number of cycles; this is indicated as a range, e.g. 1..32 means that the instruction takes a minimum of 1 cycle and a maximum of 32 cycles.
+The cycle counts assume zero stall on the instruction-side interface and zero stall on the data-side memory interface.
 
 .. _instructions_latency_table:
 .. table:: Cycle counts per instruction type
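
Review note: the FPU write-back priority rule described for the EX stage (for FPU_*_LAT > 1) can be sketched as a small decision function. This is only a reading of the documentation text above, not a model of the RTL. The names `ExState` and `arbitrate` are hypothetical, as is the assumption that a "conflict" means a valid FPU result and an EX instruction both need the shared write port in the same cycle.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExState:
    fpu_result_valid: bool         # FPU has a result ready to write back
    ex_needs_write_port: bool      # instruction in EX wants the same write port
    multicycle_mulh: bool = False  # exception 1: multi-cycle MULH in EX
    misaligned_lsu: bool = False   # exception 2: misaligned LOAD/STORE in EX
    post_incr_lsu: bool = False    # exception 3: post-increment LOAD/STORE in EX


def arbitrate(s: ExState) -> Tuple[bool, bool]:
    """Return (stall_ex, hold_fpu_result) for one cycle, per the documented rule."""
    conflict = s.fpu_result_valid and s.ex_needs_write_port
    if not conflict:
        return (False, False)
    if s.multicycle_mulh or s.misaligned_lsu or s.post_incr_lsu:
        # Exceptions: EX keeps going; the FPU result (and flags) wait.
        return (False, True)
    # Default: the FPU write-back wins and EX is stalled.
    return (True, False)
```

In this sketch, `arbitrate(ExState(True, True))` stalls EX, while the same call with `multicycle_mulh=True` holds the FPU result instead.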

diff --git a/rtl/cv32e40p_ex_stage.sv b/rtl/cv32e40p_ex_stage.sv
@@ -413,6 +413,7 @@ module cv32e40p_ex_stage
       assign apu_read_dep_for_jalr_o = 1'b0;
       assign apu_write_dep_o         = 1'b0;
       assign fpu_fflags_o            = '0;
+      assign fpu_fflags_we_o         = '0;
     end
   endgenerate
 

diff --git a/scripts/lec/synopsys_formality/check_lec.tcl b/scripts/lec/synopsys_formality/check_lec.tcl
@@ -15,6 +15,8 @@ set_dont_verify_point -type port  i:WORK/cv32e40p_core/apu_flags_o*
 
 verify > ./reports/verify.rpt
 
+report_aborted_points > ./reports/aborted_points.rpt
+report_failing_points > ./reports/failing_points.rpt
 analyze_points -failing > ./reports/analyze.rpt
 
 exit