Initial JPEG-LS FPGA encoder baseline with tooling and timeout fix
435
docs/jls_module_interfaces.md
Normal file
@@ -0,0 +1,435 @@
# JPEG-LS RTL Module Interface Draft

This document freezes the first-pass RTL interface plan before implementation.
The requirement source is `fpga/srs/jpeg_ls.md`; this file is an execution
artifact and must be updated if the SRS changes an interface.

## Global Rules

- Single clock domain: `clk`, 250 MHz target.
- Synchronous active-high reset: `rst`.
- All RTL ports use SystemVerilog `logic`.
- Simple direct `assign` is allowed; multi-level combinational logic in `assign`
  is not allowed.
- Internal pipeline interfaces use `valid`; explicit stall/backpressure is added
  only where the receiving stage can block.
- Stage outputs should be registered unless a local timing review proves the path
  is trivial.
- Pixel coordinates are zero-based: `x = 0..active_pic_col-1`,
  `y = 0..active_pic_row-1`.
- Strip coordinates are zero-based: `strip_index = y / SCAN_ROWS` (integer
  division).

## Top-Level Module

Module: `jpeg_ls_encoder_top`

Parameters:

| Name | Default | Description |
| --- | ---: | --- |
| `PIX_WIDTH` | 16 | Compile-time grayscale sample precision: 8, 10, 12, 14, or 16 bits. |
| `DEFAULT_PIC_COL` | 6144 | Default image width used when runtime dimensions are invalid. |
| `DEFAULT_PIC_ROW` | 256 | Default image height used when runtime dimensions are invalid. |
| `MAX_PIC_COL` | 6144 | Maximum supported runtime image width. |
| `MAX_PIC_ROW` | 4096 | Maximum supported runtime image height. |
| `SCAN_ROWS` | 16 | Number of source rows in one standalone JPEG-LS strip frame. |
| `MAX_NEAR` | 31 | Maximum dynamic NEAR value. |
| `OUT_BUF_BYTES` | 8192 | Internal byte output buffer size. |
| `OUT_BUF_AFULL_MARGIN` | 256 | Input pause margin for the internal output buffer. |

Ports:

| Name | Direction | Width | Description |
| --- | --- | ---: | --- |
| `clk` | input | 1 | Main clock. |
| `rst` | input | 1 | Synchronous active-high reset. |
| `cfg_pic_col` | input | 13 | Runtime image width sampled at input SOF. |
| `cfg_pic_row` | input | 13 | Runtime image height sampled at input SOF. |
| `ratio` | input | 4 | Runtime compression target sampled at input SOF. |
| `ififo_rclk` | output | 1 | Input FIFO read clock, tied to `clk`. |
| `ififo_rd` | output | 1 | Input FIFO read request. Data is valid one cycle later. |
| `ififo_rdata` | input | `ceil(PIX_WIDTH/8)*9` | Packed SOF flag and pixel value. |
| `ififo_empty` | input | 1 | Input FIFO empty flag. |
| `ififo_alempty` | input | 1 | Input FIFO almost-empty flag for read optimization. |
| `ofifo_wclk` | output | 1 | Output FIFO write clock, tied to `clk`. |
| `ofifo_wr` | output | 1 | Output FIFO write enable. |
| `ofifo_wdata` | output | 9 | Output byte stream. `[8]` marks original-image start; `[7:0]` is the data byte. |
| `ofifo_full` | input | 1 | Reserved and ignored by RTL; simulation checks unsafe writes. |
| `ofifo_alfull` | input | 1 | Reserved and ignored by RTL. |

## Internal Data Types

These names are descriptive contracts; actual RTL may use packed `logic`
signals rather than `typedef` if that better matches the coding style.

### Pixel Event

Carries one accepted input sample after FIFO timing alignment. The first RTL
implementation emits coordinates and boundary flags directly from
`jls_input_ctrl` so later stages do not need to repeat input-frame bookkeeping.

| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Pixel event is valid. |
| `sof` | 1 | Original input image start marker from FIFO sideband. |
| `sample` | `PIX_WIDTH` | Original input sample value. |
| `x` | 13 | Column coordinate inside original image. |
| `y` | 13 | Row coordinate inside original image. |
| `strip_first_pixel` | 1 | First pixel of the current strip frame. |
| `strip_last_pixel` | 1 | Last pixel of the current strip frame. |
| `image_first_pixel` | 1 | First pixel of the original input image. |
| `image_last_pixel` | 1 | Last pixel of the original input image. |

### Strip Control Event

Controls header generation and per-strip context reset.

| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Control event is valid. |
| `start` | 1 | Start a standalone JPEG-LS strip frame. |
| `finish` | 1 | Finish the current standalone JPEG-LS strip frame. |
| `original_image_first_strip` | 1 | Set `ofifo_wdata[8]` on this strip frame's first SOI byte. |
| `original_image_last_strip` | 1 | Last strip of the current original image. Internal only. |
| `strip_width` | 13 | JPEG-LS frame width for this strip. |
| `strip_height` | 13 | JPEG-LS frame height, normally `SCAN_ROWS`. |
| `near` | 6 | NEAR value used in the strip `SOS`. |
| `pixel_width` | 5 | Sample precision copied from `PIX_WIDTH`. |

### Encoded Byte Event

Carries bytes into the internal output buffer.

| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Byte event is valid. |
| `byte` | 8 | JPEG-LS byte in marker-stream order. |
| `original_image_start` | 1 | Copied to `ofifo_wdata[8]` for exactly one byte. |

## Module Contracts

### `jls_input_ctrl`

Responsibilities:
- Drive `ififo_rd` based on FIFO state and internal pause requests.
- Align synchronous FIFO read latency.
- Wait for `SOF=1` before accepting an image.
- Sample `cfg_pic_col`, `cfg_pic_row`, and `ratio` at input SOF.
- Replace invalid dimensions with defaults.
- Generate zero-based original-image coordinates.
- Mark strip-frame first/last pixels and original-image first/last pixels.

Outputs:
- Pixel event stream to `jls_scan_ctrl`.
- Latched image configuration to downstream control.

Stall sources:
- `ififo_empty`.
- Conservative mode when `ififo_alempty=1`.
- Internal output buffer near-full pause request.
- Multi-cycle entropy or bit-packer stall propagated from downstream.
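
The coordinate and boundary-flag bookkeeping above can be sketched as a small Python reference model. It is only an illustration for testbench use, not the RTL: the function name `pixel_events` is hypothetical, and the treatment of a final strip shorter than `SCAN_ROWS` is an assumption, since this draft only says `strip_height` is "normally `SCAN_ROWS`".

```python
def pixel_events(pic_col, pic_row, scan_rows=16):
    """Yield one dict per pixel in raster order, using the Pixel Event field names."""
    for y in range(pic_row):
        strip_index = y // scan_rows
        strip_top = strip_index * scan_rows
        # Assumption: a final strip shorter than scan_rows ends at the last image row.
        strip_bottom = min(strip_top + scan_rows, pic_row) - 1
        for x in range(pic_col):
            yield {
                "x": x,
                "y": y,
                "strip_index": strip_index,
                "strip_first_pixel": x == 0 and y == strip_top,
                "strip_last_pixel": x == pic_col - 1 and y == strip_bottom,
                "image_first_pixel": x == 0 and y == 0,
                "image_last_pixel": x == pic_col - 1 and y == pic_row - 1,
            }
```

A testbench could compare these flags against the pixel events observed at the `jls_input_ctrl` output for small image sizes.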

### `jls_scan_ctrl`

Responsibilities:
- Convert input-controller boundary flags into strip control events.
- Emit one strip start event when `strip_first_pixel=1`.
- Emit one strip finish event after the strip's last pixel has entered the
  entropy/bit-pack pipeline.
- Request context and line-buffer reset at each strip start.
- Preserve strict original-image pixel order.
- Register and forward `enc_row_last_pixel` with each encoded pixel. This is
  the last-column flag for the current row, distinct from
  `enc_strip_last_pixel`, and is used to keep width comparison out of the
  neighbor-provider RAM-read path.

Key rule:
- A strip frame is a complete standalone JPEG-LS frame. It is not a second
  scan inside the previous JPEG-LS frame.

### `jls_header_writer`

Responsibilities:
- Emit `SOI`, `SOF55`, `LSE`, `SOS`, and `EOI`.
- Encode marker fields in big-endian byte order.
- Write `SOF55` width as `strip_width` and height as `strip_height`.
- Write `SOS` NEAR from the current strip control event.
- Write `LSE` preset coding parameters from explicit preset inputs.
- Assert `original_image_start` only on the first byte of the first strip frame
  of an original image.
- Accept strip start and strip finish commands only while idle.

Open implementation note:
- Keep the LSE output policy configurable in the module structure. The first
  implementation emits LSE before each strip `SOS`.
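
As a rough illustration of the byte sequence this module is expected to produce for a single-component strip frame, the sketch below builds `SOI`, `SOF55`, `LSE` (preset ID 1), and `SOS` with Python's `struct` module. The function name `strip_header` is hypothetical, and the component ID `1`, the sampling byte `0x11`, and `ILV = 0` are assumptions for a grayscale frame; the SRS remains the authority on field values.

```python
import struct

EOI = b"\xFF\xD9"  # emitted separately, after the payload flush


def strip_header(width, height, pix_width, near, maxval, t1, t2, t3, reset):
    """Return the header bytes of one standalone strip frame (big-endian fields)."""
    soi = b"\xFF\xD8"
    # SOF55: Lf=11, P, Y, X, Nf=1, then C1=1, H1/V1=0x11, Tq1=0 (assumed values).
    sof55 = b"\xFF\xF7" + struct.pack(
        ">HBHHBBBB", 11, pix_width, height, width, 1, 1, 0x11, 0
    )
    # LSE with ID 1: Ll=13, ID, MAXVAL, T1, T2, T3, RESET.
    lse = b"\xFF\xF8" + struct.pack(">HBHHHHH", 13, 1, maxval, t1, t2, t3, reset)
    # SOS: Ls=8, Ns=1, Cs1=1, Tm1=0, NEAR, ILV=0, Al/Ah=0.
    sos = b"\xFF\xDA" + struct.pack(">HBBBBBB", 8, 1, 1, 0, near, 0, 0)
    return soi + sof55 + lse + sos
```

With these assumptions the header is 40 bytes per strip frame, which gives a quick sanity check for the smoke-test output stream.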

### `jls_preset_defaults`

Responsibilities:
- Convert `PIX_WIDTH` and the current strip `NEAR` into JPEG-LS default
  preset coding parameters: `MAXVAL`, `T1`, `T2`, `T3`, and `RESET`.
- Defensively clamp an out-of-range `NEAR` input to 31.
- Use the simplified `MAXVAL >= 128` equations that cover all supported
  8/10/12/14/16-bit precisions.
- Provide the same preset values to `jls_header_writer` and the later
  predictor/context pipeline so header syntax and model thresholds stay aligned.
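
The default-threshold computation for the `MAXVAL >= 128` case can be sketched as follows. This is a reference-model sketch of the standard default equations (basic thresholds 3, 7, 21 and `RESET = 64`), with hypothetical function names; it is not the RTL structure.

```python
def clamp_threshold(value, lower, maxval):
    """Standard threshold clamp: fall back to `lower` when out of range."""
    return lower if (value > maxval or value < lower) else value


def default_presets(pix_width, near):
    """Default MAXVAL/T1/T2/T3/RESET for MAXVAL >= 128 (8..16-bit samples)."""
    near = min(near, 31)  # defensive clamp, as described above
    maxval = (1 << pix_width) - 1
    factor = (min(maxval, 4095) + 128) // 256
    t1 = clamp_threshold(factor * (3 - 2) + 2 + 3 * near, near + 1, maxval)
    t2 = clamp_threshold(factor * (7 - 3) + 3 + 5 * near, t1, maxval)
    t3 = clamp_threshold(factor * (21 - 4) + 4 + 7 * near, t2, maxval)
    return {"MAXVAL": maxval, "T1": t1, "T2": t2, "T3": t3, "RESET": 64}
```

For example, 8-bit lossless gives thresholds 3/7/21 and 12-bit lossless gives 18/67/276.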

### `jls_coding_params`

Responsibilities:
- Convert compile-time `PIX_WIDTH` and current strip `NEAR` to JPEG-LS `RANGE`.
- Output `qbpp = ceil(log2(RANGE))`.
- Output regular-mode `LIMIT = 2 * (PIX_WIDTH + max(8, PIX_WIDTH))`.
- Use a lookup table for `NEAR=0..31` instead of synthesized runtime division.
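
A lookup table for these values could be generated offline along the following lines. The sketch uses the standard `RANGE` formula and the `LIMIT` expression given above; the function name `coding_params` and the table layout are illustrative assumptions.

```python
def coding_params(pix_width, near):
    """Return (RANGE, qbpp, LIMIT) for one NEAR value."""
    maxval = (1 << pix_width) - 1
    rng = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = (rng - 1).bit_length()  # equals ceil(log2(RANGE)) for RANGE >= 2
    limit = 2 * (pix_width + max(8, pix_width))
    return rng, qbpp, limit


# One table entry per NEAR value, here for a 16-bit build.
LUT_16 = [coding_params(16, near) for near in range(32)]
```

Because `PIX_WIDTH` is a compile-time parameter, only the 32 `NEAR` entries for the selected precision would need to exist in a given build.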

### `jls_near_ctrl`

Responsibilities:
- Initialize NEAR to 0 at original image start.
- Keep NEAR at 0 for `ratio=0` and invalid ratio values.
- For `ratio=1/2/3`, update the next strip's NEAR after the current strip
  output byte count is known.
- Clamp NEAR to `0..31`.
- Report a sticky target-miss condition when the cumulative actual bits still
  exceed cumulative target bits while NEAR is already 31.

Counters:
- Target bits are computed from actual source pixel width, not from storage
  width.
- Actual bits count every byte generated into the internal output buffer.

### `jls_predictor`

Responsibilities:
- Accept local reconstructed neighbors `Ra`, `Rb`, `Rc`, and `Rd` from the
  line-buffer neighbor provider.
- Compute MED prediction `Px`.
- Register and forward the original sample, coordinates, strip boundary flags,
  and neighbors to the context/error stage.
- Keep top/left strip boundary handling in the line-buffer stage so this module
  remains a short compare/add pipeline stage.

Line-buffer companion:
- Provide `Ra/Rb/Rc/Rd` from reconstructed samples.
- Handle top, left, and right edge pixels as JPEG-LS frame boundaries.
- Delay or bank current-line writes so previous-row `Rc/Rb/Rd` reads are not
  corrupted by reconstructed samples from the current row.
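
For reference, the MED (median edge detector) prediction that this stage computes can be written in a few lines. The function name is illustrative; `Rd` is not used by the predictor itself.

```python
def med_predict(ra, rb, rc):
    """JPEG-LS MED prediction Px from the left, above, and above-left neighbors."""
    if rc >= max(ra, rb):
        return min(ra, rb)
    if rc <= min(ra, rb):
        return max(ra, rb)
    return ra + rb - rc
```

The three branches map onto two comparisons and one add/subtract, which is why the module can stay a short compare/add stage.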

### `jls_neighbor_provider`

Responsibilities:
- Provide reconstructed-neighbor samples `Ra`, `Rb`, `Rc`, and `Rd` to
  `jls_predictor`.
- Use two row banks so the current-row reconstructed write does not corrupt the
  previous-row reads needed by `Rb/Rc/Rd`.
- Apply strip-frame top-row and left/right-edge boundary rules:
  top-row previous samples are zero, `x=0` uses `Ra=Rb`, `Rc` from the
  previous line's left-edge extension sample, and the last column uses `Rd=Rb`.
- Accept reconstructed-sample writeback `Rx` from the later error stage.
- Consume the registered `pixel_row_last` flag from `jls_scan_ctrl` for
  right-edge and NEAR>0 row-transition handling; do not recompute row-last from
  strip width on the `Rd` read path.

Current implementation note:
- In `NEAR=0` lossless strips, `Rx == X`; the provider commits the accepted
  original sample to line history immediately and does not wait for the later
  reconstructed-sample path. In `NEAR>0` strips, it keeps one outstanding
  pixel and waits for true reconstructed writeback before accepting the next
  sample. For regular-mode pixels, the top-level returns that writeback from
  `jls_regular_error_quantizer` after `Errval/Rx` acceptance instead of waiting
  for downstream Golomb completion.
- A non-EOL `NEAR>0` writeback can overlap the next same-row pixel accept:
  the returned `Rx` is bypassed as the next pixel's `Ra`. Row-end transitions
  still wait a clock so the bank selector and left-edge extension state update
  before `x=0` of the next row.
- A later version must still address the remaining `NEAR>0` one-pixel feedback
  dependency, run-segment ordering, or input buffering to reach the 200
  MPixel/s goal without committing non-standard neighbor values.
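
The boundary rules listed above can be checked against a simple software model. The sketch below assumes the reconstructed rows of one strip frame are held in a list of lists (a stand-in for the two row banks) and reads the left-edge extension sample as the `Ra` value used at `x=0` of the previous row; the function name and data layout are hypothetical.

```python
def neighbors(recon_rows, x, y):
    """Return (Ra, Rb, Rc, Rd) for pixel (x, y) inside one strip frame.

    recon_rows[r][c] holds reconstructed samples; rows above y are complete
    and row y is complete up to column x - 1.
    """
    width = len(recon_rows[0])
    prev = recon_rows[y - 1] if y > 0 else [0] * width  # top row sees zeros
    rb = prev[x]
    rd = prev[x + 1] if x + 1 < width else rb  # last column: Rd = Rb
    if x > 0:
        ra = recon_rows[y][x - 1]
        rc = prev[x - 1]
    else:
        ra = rb  # left edge: Ra = Rb
        # Left-edge extension: the Ra used at x=0 of the previous row,
        # i.e. the first sample of row y-2 (zero on the first two rows).
        rc = recon_rows[y - 2][0] if y > 1 else 0
    return ra, rb, rc, rd
```

Since every strip frame is standalone, `y` here is the row index inside the strip, not the row index of the original image.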

### `jls_mode_router`

Responsibilities:
- Consume neighbor events and decide whether the local gradients select regular
  mode or run mode.
- Forward regular-mode events to `jls_predictor`.
- Accumulate run pixels while `|Ix - Ra| <= NEAR` and reconstruct those run
  pixels as `Ra` for line-buffer writeback.
- Use gradients only to enter run mode. Once `run_length_accum` is non-zero,
  remain in the Annex A.7 run loop and treat the first nonmatching sample as a
  run interruption, even if that sample's gradients would not independently
  select run mode.
- Emit a run segment to `jls_run_mode` when the run reaches EOL or an
  interruption sample.
- After issuing a run segment, continue accepting later non-EOL matching run
  pixels because they do not emit entropy immediately, but stall before any
  regular event, run interruption, or EOL run segment until `jls_run_mode`
  reports `segment_done`.

Current implementation note:
- This is still conservative around entropy ordering: non-EOL matching run
  pixels can overlap an outstanding run segment, but any event that would emit
  entropy remains blocked until the prior segment completes. That preserves
  entropy order while reducing run-only stalls, but remains a throughput risk
  for the full 200 MPixel/s target.
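
The run/regular routing rule can be illustrated with a deliberately simplified model. The sketch below handles only the lossless case (`NEAR=0`, so `Rx == X` for non-run pixels), processes a single row, and ignores the `segment_done` stall entirely; the function name, the event tuples, and the `left_ext` argument are all hypothetical.

```python
def route_row(row, prev_row, left_ext=0):
    """Split one row into regular events and run segments (NEAR=0 only).

    Events are ("regular", x, sample) or
    ("run", run_length, eol, interruption_sample).
    """
    width = len(row)
    events, recon, run_len = [], [], 0
    for x in range(width):
        rb = prev_row[x]
        rd = prev_row[x + 1] if x + 1 < width else rb
        ra = recon[x - 1] if x > 0 else rb
        rc = prev_row[x - 1] if x > 0 else left_ext
        flat = rd == rb and rb == rc and rc == ra  # all gradients are zero
        if run_len > 0 or flat:
            # Gradients only decide run entry; inside a run only Ix vs Ra matters.
            if row[x] == ra:
                run_len += 1
                recon.append(ra)  # run pixels are reconstructed as Ra
                continue
            events.append(("run", run_len, False, row[x]))  # interruption
            run_len = 0
        else:
            events.append(("regular", x, row[x]))
        recon.append(row[x])
    if run_len > 0:
        events.append(("run", run_len, True, None))  # run reached end of line
    return events
```

Note that a flat context with an immediately non-matching sample yields a run segment of length 0 followed by the interruption sample, which matches the "enter run mode by gradients" rule above.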

### `jls_context_model`

Responsibilities:
- Consume the quantized context event from `jls_context_quantizer`.
- Use `jls_context_memory` to store regular-mode variables `A`, `B`, `C`, and
  `N` for 365 contexts.
- Use `jls_context_update` as the regular-mode update arithmetic core after
  `Errval` is known.
- Bypass/forward updated context values when a later pipeline stage needs the
  same context before table writeback completes.
- Track in-flight context indices so a same-context read either waits for the
  matching writeback or uses the same-cycle write/read bypass values.

Stop-for-confirmation trigger:
- If bypass cannot maintain standard semantics without frequent stalls that
  threaten the 200 MPixel/s target, raise the issue before changing the target.

### `jls_context_update`

Responsibilities:
- Compute regular-mode Golomb parameter `k` from pre-update `A` and `N`.
- Update one regular-mode context's `A`, `B`, `C`, and `N` after `Errval`.
- Apply the `RESET` halving rule.
- Apply the JPEG-LS bias correction bounds for `B` and `C`.
- Forward `Errval`, context index, strip-last flag, `LIMIT`, `qbpp`, and
  mapping-inversion metadata to downstream writeback/error-mapping logic.
- Stay independent from the 365-entry context RAM so table hazards can be
  handled in a wrapper with explicit bypass rules.
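
The arithmetic described above follows the standard regular-mode update. A compact reference-model sketch is shown below; function names are illustrative, and `MIN_C`/`MAX_C` use the standard bounds of -128 and 127.

```python
MIN_C, MAX_C = -128, 127


def golomb_k(a, n):
    """Smallest k such that (N << k) >= A, using pre-update values."""
    k = 0
    while (n << k) < a:
        k += 1
    return k


def update_context(a, b, c, n, errval, near, reset=64):
    """Return (k, A, B, C, N) after one regular-mode sample."""
    k = golomb_k(a, n)
    b += errval * (2 * near + 1)
    a += abs(errval)
    if n == reset:  # RESET halving rule
        a >>= 1
        b = b >> 1 if b >= 0 else -((1 - b) >> 1)
        n >>= 1
    n += 1
    if b <= -n:  # bias correction, lower bound
        b += n
        if c > MIN_C:
            c -= 1
        if b <= -n:
            b = -n + 1
    elif b > 0:  # bias correction, upper bound
        b -= n
        if c < MAX_C:
            c += 1
        if b > 0:
            b = 0
    return k, a, b, c, n
```

In RTL the `while` loop for `k` would typically become a priority comparison over the possible shifts; the sketch only fixes the expected result.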

### `jls_context_memory`

Responsibilities:
- Apply lazy initialization for all 365 regular contexts at standalone
  strip-frame start by clearing a written-bit vector.
- Initialize untouched contexts as `A[Q] = max(2, (RANGE + 32) / 64)`,
  `B[Q] = 0`, `C[Q] = 0`, and `N[Q] = 1` when they are read.
- Latch the initialization A value when the init command is accepted.
- Provide a registered read result and a simple writeback port.
- Leave same-context read-after-write forwarding to the `jls_context_model`
  wrapper so the RAM stays a simple registered storage primitive.
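
The lazy-initialization idea can be modeled behaviorally as below. This is a sketch only: the class and method names are hypothetical, and the registered read latency and hazard handling of the real RAM are not represented.

```python
class ContextMemory:
    """Behavioral model of a context table with a written-bit vector."""

    def __init__(self, contexts=365):
        self.written = [False] * contexts
        self.table = [None] * contexts
        self.init_a = 2

    def strip_init(self, rng):
        # Latch the initial A value and clear only the written bits,
        # instead of rewriting every table entry at strip start.
        self.init_a = max(2, (rng + 32) // 64)
        self.written = [False] * len(self.written)

    def read(self, q):
        if not self.written[q]:
            return (self.init_a, 0, 0, 1)  # untouched context: A, B=0, C=0, N=1
        return self.table[q]

    def write(self, q, a, b, c, n):
        self.table[q] = (a, b, c, n)
        self.written[q] = True
```

The point of the written-bit vector is that strip start costs one clear operation rather than a 365-cycle table sweep.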

### `jls_context_quantizer`

Responsibilities:
- Compute standard local gradients `D1 = Rd - Rb`, `D2 = Rb - Rc`, and
  `D3 = Rc - Ra`.
- Quantize gradients to `Q1/Q2/Q3` using current strip `T1/T2/T3/NEAR`.
- Compute the signed context value `(Q1 * 9 + Q2) * 9 + Q3`.
- Emit `context_index = abs(context_value)`, `context_negative`, and
  `run_mode_context`.
- Register and forward sample, prediction, neighbors, and strip boundary flags.
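
A reference-model sketch of the quantization and context-index steps follows. The function names are illustrative, and treating `run_mode_context` as "all three quantized gradients are zero" is an assumption about that flag's meaning.

```python
def quantize_gradient(d, t1, t2, t3, near):
    """Quantize one local gradient to the range -4..4."""
    if d <= -t3:
        return -4
    if d <= -t2:
        return -3
    if d <= -t1:
        return -2
    if d < -near:
        return -1
    if d <= near:
        return 0
    if d < t1:
        return 1
    if d < t2:
        return 2
    if d < t3:
        return 3
    return 4


def quantize_context(ra, rb, rc, rd, t1, t2, t3, near):
    """Return (context_index, context_negative, run_mode_context)."""
    q1, q2, q3 = (
        quantize_gradient(d, t1, t2, t3, near) for d in (rd - rb, rb - rc, rc - ra)
    )
    value = (q1 * 9 + q2) * 9 + q3
    # The sign of `value` equals the sign of the first non-zero Qi, so abs()
    # merges each context with its mirrored counterpart: indices 0..364.
    return abs(value), value < 0, value == 0
```

The index range 0..364 is what gives the 365-entry context table used by `jls_context_memory`.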

### `jls_prediction_corrector`

Responsibilities:
- Accept `Px`, context variable `C[Q]`, and the quantized context sign.
- Apply the context sign to `C[Q]`.
- Clamp `Px +/- C[Q]` to `0..MAXVAL` using the JPEG-LS
  `correct_prediction` behavior.
- Forward sample, coordinates, context metadata, strip boundary flags, and the
  pre-update `A/B/C/N` variables needed by context update.
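
The correction step is small enough to state directly. In this sketch `context_negative` stands for the quantized context sign described above, and the function name is illustrative.

```python
def correct_prediction(px, c, context_negative, maxval):
    """Apply the signed bias C[Q] to Px and clamp the result to 0..MAXVAL."""
    px = px - c if context_negative else px + c
    return max(0, min(maxval, px))
```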

### `jls_regular_error_quantizer`

Responsibilities:
- Compute regular-mode `Errval` from the original sample and corrected
  prediction.
- Apply NEAR-dependent quantization and RANGE modulo normalization.
- Compute the reconstructed sample `Rx` used by the encoder-side line history.
- Forward `A/B/C/N`, context index, strip-last flag, `LIMIT`, and `qbpp` to the
  context update and entropy pipeline.

Current implementation note:
- `NEAR=0` takes the direct path. `NEAR>0` uses an exact reciprocal-LUT multiply
  plus quotient-correction pipeline, covering the supported `NEAR=1..31` range
  without a single-cycle combinational divider.
- In the integrated top level, regular-mode `Rx` is returned to line history
  when this module's result is accepted. Annex A.6 context update and Annex
  G.2 entropy coding consume the same `Errval` later, but they do not modify
  `Rx`.
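
The quantization, reconstruction, and modulo steps, together with one possible form of reciprocal multiply plus quotient correction, can be sketched as below. The 24-bit reciprocal scale `SHIFT`, the floor-rounded reciprocal, and the single upward correction step are assumptions chosen for illustration; the RTL's actual LUT width and correction scheme may differ.

```python
SHIFT = 24  # hypothetical fixed-point scale for the reciprocal table


def divide_by_lut(n, near):
    """Exact n // (2*NEAR+1) using a reciprocal multiply and one correction."""
    d = 2 * near + 1
    recip = (1 << SHIFT) // d  # would be a LUT entry indexed by NEAR
    q = (n * recip) >> SHIFT   # floor-rounded reciprocal never overshoots
    if (q + 1) * d <= n:       # quotient correction: estimate is at most 1 low
        q += 1
    return q


def quantize_error(x, px, context_negative, near, maxval, rng):
    """Return (Errval, Rx) for one regular-mode sample."""
    sign = -1 if context_negative else 1
    err = (x - px) * sign
    if near > 0:
        mag = divide_by_lut(abs(err) + near, near)
        err = mag if err > 0 else -mag
    # Reconstruction uses the quantized error before modulo reduction.
    rx = max(0, min(maxval, px + sign * err * (2 * near + 1)))
    if err < 0:
        err += rng
    if err >= (rng + 1) // 2:
        err -= rng
    return err, rx
```

The sketch also shows why `Rx` can be returned early: it depends only on `Px`, the sign, and the quantized error, none of which change in the later context-update or entropy stages.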

### `jls_run_mode`

Responsibilities:
- Accept one upstream-detected run segment: `run_length`, EOL flag, and optional
  run-interruption sample.
- Emit direct run-length code events before any run-interruption mapped-error
  event for the same segment.
- Compute run-interruption `RItype`, signed `Errval`, `MErrval`, `k`, and
  `LIMIT - J[RUNindex] - 1`.
- Maintain `RUNindex` and the two run-interruption contexts for `RItype=0` and
  `RItype=1`.
- Emit the reconstructed run-interruption sample `Rx` for line-buffer writeback.

Current integration note:
- The module is a run-segment entropy helper, not the upstream run scanner. The
  top-level integration uses `jls_mode_router` as that run scanner; the router
  consumes run pixels, emits reconstructed run pixels, and feeds this module
  with complete run segments.
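
The run-length part of a segment can be sketched with the standard `J` table. The function below returns the code bits and the `RUNindex` value after the increments only; the decrement that follows an interrupted run is intentionally left to the caller, because its ordering against the `LIMIT - J[RUNindex] - 1` computation is a pipeline detail this sketch does not model. The function name is hypothetical.

```python
J = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
     4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14, 15]


def encode_run_length(run_length, eol, run_index):
    """Return (bits, run_index) for one run segment's run-length code."""
    bits = []
    while run_length >= (1 << J[run_index]):
        bits.append(1)  # one full block of 2**J[RUNindex] run pixels
        run_length -= 1 << J[run_index]
        if run_index < 31:
            run_index += 1
    if eol:
        if run_length > 0:
            bits.append(1)  # partial block terminated by end of line
    else:
        bits.append(0)  # run was interrupted
        j = J[run_index]
        # Remaining length in J[RUNindex] bits, most significant bit first.
        bits.extend((run_length >> i) & 1 for i in reversed(range(j)))
    return bits, run_index
```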

### `jls_golomb_encoder`

Responsibilities:
- Accept already computed standard variables `MErrval`, `k`, `LIMIT`, and
  `qbpp` from the regular-mode or run-interruption pipeline.
- Generate left-aligned variable-length Golomb code events for the bit packer.
- Emit the regular Golomb path and the JPEG-LS LIMIT fallback path.
- Allow multi-cycle handling for extreme long codes while preserving standard
  ordering.
- Keep prediction-error mapping and `k` calculation in the upstream
  context/regular-mode stage so those timing paths can be split independently.
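
The two code paths can be illustrated with a bit-string model. Returning a Python string of `0`/`1` characters is purely for readability; the RTL emits left-aligned code words with a bit count instead. The function name is illustrative.

```python
def golomb_bits(merrval, k, limit, qbpp):
    """Return the limited-length Golomb code for MErrval as a bit string."""
    q = merrval >> k
    if q < limit - qbpp - 1:
        # Regular path: q zeros, a terminating 1, then the k low bits.
        low = format(merrval & ((1 << k) - 1), "b").zfill(k) if k else ""
        return "0" * q + "1" + low
    # LIMIT fallback: fixed unary escape, then MErrval - 1 in qbpp bits.
    return "0" * (limit - qbpp - 1) + "1" + format(merrval - 1, "b").zfill(qbpp)
```

The fallback path always produces exactly `LIMIT` bits, which bounds the longest code the bit packer has to accept.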

### `jls_error_mapper`

Responsibilities:
- Accept standard signed `Errval` after quantization and context-sign handling.
- Apply the context-correction inversion used before mapping when requested.
- Map `Errval` to the non-negative standard variable `MErrval`.
- Forward `k`, `LIMIT`, and `qbpp` to `jls_golomb_encoder`.
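
The mapping can be sketched as two small functions. Here `map_invert` stands for the inversion request mentioned above, and `needs_inversion` shows the standard condition under which the special mapping applies; both names are illustrative.

```python
def needs_inversion(near, k, b, n):
    """Standard condition for the special mapping (lossless, k == 0, B biased low)."""
    return near == 0 and k == 0 and 2 * b <= -n


def map_error(errval, map_invert):
    """Map signed Errval to non-negative MErrval."""
    if map_invert:
        return 2 * errval + 1 if errval >= 0 else -2 * (errval + 1)
    return 2 * errval if errval >= 0 else -2 * errval - 1
```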

### `jls_bit_packer`

Responsibilities:
- Pack variable-length code events into the JPEG-LS byte stream.
- Apply JPEG-LS marker/zero-bit stuffing rules.
- Flush to byte boundary before `EOI` for each strip frame.
- Accept left-aligned code events; the first bit is
  `code_bits[MAX_CODE_BITS-1]`.
- Emit at most one scan payload byte per cycle to the internal output buffer.
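
The zero-bit stuffing rule (after an `0xFF` byte, the next byte carries a leading 0 and only seven payload bits) can be modeled as below. The class is a behavioral sketch with hypothetical names; it takes bit strings rather than left-aligned code words, and end-of-scan corner cases such as a final `0xFF` byte are not modeled.

```python
class BitPacker:
    """Pack a bit stream into bytes with JPEG-LS zero-bit stuffing."""

    def __init__(self):
        self.out = bytearray()
        self.acc = 0    # bits collected for the byte in progress
        self.count = 0  # number of bits currently in acc
        self.room = 8   # 7 right after an 0xFF byte, otherwise 8

    def put(self, bits):
        for ch in bits:
            self.acc = (self.acc << 1) | int(ch)
            self.count += 1
            if self.count == self.room:
                self._emit()

    def _emit(self):
        # With room == 7 the top bit of the byte is the stuffed 0.
        self.out.append(self.acc)
        self.room = 7 if self.acc == 0xFF else 8
        self.acc = 0
        self.count = 0

    def flush(self):
        if self.count:
            self.acc <<= self.room - self.count  # pad with zero bits
            self.count = self.room
            self._emit()
```

A testbench can feed the same code events to this model and compare its `out` bytes with the scan payload captured from the RTL.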

### `jls_byte_arbiter`

Responsibilities:
- Merge marker/header bytes and scan-payload bytes into one encoded byte stream.
- Give header/EOI bytes priority over payload bytes so each strip frame remains
  `SOI/SOF55/LSE/SOS`, payload, then `EOI`.
- Forward `original_image_start` only from the header stream.

### `jls_output_buffer`

Responsibilities:
- Buffer generated bytes before the 9-bit output FIFO.
- Drain one byte per cycle to `ofifo`.
- Ignore `ofifo_full` and `ofifo_alfull` in RTL behavior.
- Raise a simulation error if `ofifo_full=1` and `ofifo_wr=1`.
- Raise an internal near-full pause request when free space is below
  `OUT_BUF_AFULL_MARGIN`.
- Provide `byte_accepted` and `buffer_level` for statistics, dynamic NEAR
  accounting, and verification reports.

## First RTL Smoke Target

The first RTL smoke target should not implement full entropy coding. It should
verify safe sequencing before algorithmic complexity is added:

- Input SOF detection and dimension latch.
- Strip frame boundary generation.
- Header writer emits one minimal strip frame per strip.
- Output buffer emits bytes with correct `ofifo_wdata[8]` placement.
- The testbench captures the output stream and runs the reference decode script
  once the entropy payload is valid.
115
docs/jls_pipeline_mermaid.md
Normal file
@@ -0,0 +1,115 @@
# JPEG-LS RTL Pipeline Mermaid Flow

This document describes the current RTL algorithm pipeline implemented around
`jpeg_ls_encoder_top`. The requirement source is `fpga/srs/jpeg_ls.md`.

The figure is an implementation trace, not a replacement for the standard. The
standard references identify the algorithmic step that each RTL stage must be
equivalent to after pipelining, lookup, speculation, or multi-cycle splitting.

```mermaid
flowchart TB
    %% JPEG-LS FPGA encoder pipeline, current RTL implementation trace.

    subgraph EXT["External interfaces"]
        IFIFO["Input FIFO\nIn: ififo_empty/alempty/rdata, cfg_pic_col/row, ratio\nOut: synchronous read data\nStd: image sample order before Annex A encoding"]
        OFIFO["Output FIFO\nIn: ofifo_wr, ofifo_wdata[8:0]\nOut: external byte stream\nStd: Annex C marker stream bytes"]
    end

    subgraph CTRL["Strip control, parameters, and headers"]
        IN["S00 jls_input_ctrl\nIn: FIFO word, SOF sideband, cfg, ratio, pause_req\nDo: SOF gate, 1-cycle FIFO align, dimension fallback, x/y and strip flags\nOut: pixel event, active_pic_col/row, active_ratio\nStd: Annex A.8 control procedure, Annex D scan order"]
        SCAN["S01 jls_scan_ctrl\nIn: pixel event, current_near\nDo: start/finish one standalone strip frame, choose first-strip NEAR=0\nOut: enc pixel event, strip_start/finish, strip width/height/near\nStd: Annex A.8, Annex D.1-D.3"]
        PRESET["S02 jls_preset_defaults + jls_coding_params\nIn: PIX_WIDTH, strip NEAR\nDo: default MAXVAL/T1/T2/T3/RESET, RANGE/qbpp/LIMIT lookup\nOut: header preset fields, active coding params\nStd: Annex A.2, Annex C.2.4.1.1, Annex G.2"]
        HDR["S03 jls_header_writer\nIn: strip_start/finish, strip size, NEAR, preset params\nDo: emit SOI/SOF55/LSE/SOS and EOI, big-endian marker fields\nOut: header/eoi byte stream, original_image_start sideband\nStd: Annex C.1-C.4, Annex D.3"]
        NEAR["S04 jls_near_ctrl\nIn: image_start, strip_done, strip pixels, output bytes, ratio\nDo: cumulative actual-vs-target NEAR update and clamp 0..31\nOut: current_near, target_miss_at_max_near\nStd: NEAR usage in Annex A/C/D; dynamic policy is project-specific"]
    end

    subgraph PIX["Pixel neighborhood and mode decision"]
        NBR["S10 jls_neighbor_provider\nIn: enc pixel event, reconstructed writeback Rx, NEAR\nDo: two-bank line history, strip edge handling, Ra/Rb/Rc/Rd selection; NEAR=0 commits X as Rx immediately; NEAR>0 same-row non-EOL Rx bypasses to next Ra\nOut: neighbor event with X,x,y,Ra,Rb,Rc,Rd\nStd: Annex A.3 local gradients, Annex A.4 prediction neighbors"]
        ROUTE["S11 jls_mode_router\nIn: neighbor event, strip_width, NEAR\nDo: run/regular decision, run_length accumulation, run pixel Rx=Ra\nOut: regular event OR run segment, direct run-pixel writeback\nStd: Annex A.3 context determination, Annex A.7 run mode"]
    end

    subgraph REG["Regular-mode pipeline"]
        PRED["S20 jls_predictor\nIn: X,Ra,Rb,Rc,Rd\nDo: MED predictor Px\nOut: predicted event with Px and neighbors\nStd: Annex A.4 MED prediction"]
        QTZ["S21 jls_context_quantizer\nIn: Ra,Rb,Rc,Rd,T1,T2,T3,NEAR,Px\nDo: D1=Rd-Rb, D2=Rb-Rc, D3=Rc-Ra; quantize Q1/Q2/Q3; context sign/index\nOut: context event, run_mode_context flag\nStd: Annex A.3, Annex G.1"]
        CMODEL["S22 jls_context_model + jls_context_memory\nIn: context_index, strip init, writeback A/B/C/N\nDo: lazy-init 365 contexts, read/bypass pre-update A/B/C/N, stall same-context hazards\nOut: vars event with A,B,C,N,C[Q]\nStd: Annex A.2 initialization, Annex A.6 variables"]
        CORR["S23 jls_prediction_corrector\nIn: Px,C[Q],context sign, MAXVAL\nDo: bias-correct prediction and clamp to 0..MAXVAL\nOut: corrected Px and pre-update context vars\nStd: Annex A.5 prediction correction, Annex A.6 bias variables"]
        ERRQ["S24 jls_regular_error_quantizer\nIn: X, corrected Px, RANGE, NEAR\nDo: Errval quantization/modulo normalization, reconstruct Rx, reciprocal-LUT NEAR division; regular Rx may return before Golomb done\nOut: Errval, Rx, context metadata, qbpp/LIMIT\nStd: Annex A.5 prediction error encoding, Annex A.2 RANGE"]
        CUPDATE["S25 jls_context_update\nIn: Errval,A,B,C,N,NEAR,RESET\nDo: compute k from pre-update vars; update A/B/C/N; map inversion flag\nOut: k, updated A/B/C/N, Errval for mapping\nStd: Annex A.5 Golomb parameter, Annex A.6 update variables"]
        EMAP["S26 jls_error_mapper\nIn: Errval, map_invert, k, LIMIT, qbpp\nDo: signed Errval to non-negative MErrval\nOut: regular MErrval,k,LIMIT,qbpp,last flag\nStd: Annex A.5 mapped error value, Annex G.2"]
    end

    subgraph RUN["Run-mode pipeline"]
        RUNCORE["S30 jls_run_mode\nIn: run_length, EOL flag, interruption X,x,y,Ra,Rb,RANGE,qbpp,LIMIT,NEAR,RESET\nDo: RUNindex/J run-length code, RItype, reciprocal-LUT interruption Errval/MErrval/k, RI context update, interruption Rx\nOut: direct run code bits, run MErrval,k,limit,qbpp, interruption Rx\nStd: Annex A.7 run mode, Annex A.5 mapped error, Annex G.3"]
    end

    subgraph ENT["Entropy, packing, and byte output"]
        MMERGE["S40 mapped-error arbiter in top\nIn: regular MErrval stream, run-interruption MErrval stream\nDo: prioritize pending run mapped event while preserving conservative order\nOut: selected MErrval,k,limit,qbpp\nStd: engineering ordering wrapper for Annex A.5/A.7 events"]
        GOL["S41 jls_golomb_encoder\nIn: MErrval,k,LIMIT,qbpp\nDo: Golomb-Rice prefix/suffix and LIMIT fallback code events\nOut: left-aligned variable-length code bits\nStd: Annex A.5, Annex G.2"]
        CMERGE["S42 code-event arbiter in top\nIn: direct run-length code, Golomb code events\nDo: run-length code before same-segment interruption code\nOut: ordered code_bits/code_bit_count\nStd: Annex A.7 run-length code order, Annex A.5 error code order"]
        PACK["S43 jls_bit_packer\nIn: ordered code events, flush request\nDo: bit-to-byte packing, JPEG-LS zero-bit stuffing after 0xFF, flush before EOI\nOut: scan payload bytes\nStd: Annex C.1-C.4, Annex H.2"]
        BARB["S44 jls_byte_arbiter\nIn: header/eoi bytes, payload bytes\nDo: header/EOI priority over payload, preserve original_image_start sideband\nOut: encoded byte event\nStd: Annex C marker and entropy-coded segment ordering"]
        OUTBUF["S45 jls_output_buffer\nIn: encoded byte event\nDo: byte FIFO buffering, one byte/cycle drain, internal near-full pause\nOut: ofifo_wr/ofifo_wdata, byte_accepted, buffer_level\nStd: Annex C byte stream delivery"]
    end

    IFIFO --> IN --> SCAN --> NBR --> ROUTE
    SCAN --> PRESET
    SCAN --> HDR
    PRESET --> HDR
    PRESET --> CMODEL
    PRESET --> ERRQ
    PRESET --> RUNCORE
    PRESET --> QTZ
    NEAR --> SCAN
    OUTBUF --> NEAR
    SCAN --> NEAR

    ROUTE -- "regular event: X,x,y,Ra/Rb/Rc/Rd" --> PRED --> QTZ --> CMODEL --> CORR --> ERRQ --> CUPDATE --> EMAP --> MMERGE
    CUPDATE -- "updated A/B/C/N writeback" --> CMODEL

    ROUTE -- "run segment: run_length,EOL,interruption X,Ra,Rb" --> RUNCORE
    ROUTE -- "run pixel Rx=Ra" --> NBR
    RUNCORE -- "interruption Rx" --> NBR
    RUNCORE -- "direct run code bits" --> CMERGE
    RUNCORE -- "run MErrval,k,limit,qbpp" --> MMERGE

    ERRQ -- "regular Rx after Errval/Rx accept" --> NBR

    MMERGE --> GOL --> CMERGE --> PACK --> BARB --> OUTBUF --> OFIFO
    HDR --> BARB
```

## Stage Detail

| Stage | RTL module | Main inputs | Main processing | Main outputs | Standard / pseudocode mapping |
| --- | --- | --- | --- | --- | --- |
| S00 | `jls_input_ctrl` | `ififo_rdata`, `ififo_empty`, `ififo_alempty`, `cfg_pic_col`, `cfg_pic_row`, `ratio`, `pause_req` | Wait for SOF, align synchronous FIFO read latency, latch runtime config, validate/fallback image size, generate `x/y` and strip/image flags. | Pixel event: `sample`, `x`, `y`, `strip_first_pixel`, `strip_last_pixel`, `image_first_pixel`, `image_last_pixel`, active config. | Annex A.8 control procedure; Annex D scan order. |
| S01 | `jls_scan_ctrl` | Pixel event, `current_near`, downstream readiness. | Split the original image into standalone strip frames, emit strip start/finish commands, force the first strip NEAR to 0. | Encode pixel event, strip start/finish event, `strip_width`, `strip_height`, `strip_near`, strip pixel count. | Annex A.8; Annex D.1-D.3 scan control. |
| S02 | `jls_preset_defaults`, `jls_coding_params` | `PIX_WIDTH`, strip `NEAR`. | Compute default `MAXVAL/T1/T2/T3/RESET`; lookup `RANGE/qbpp/LIMIT` for the active strip. | Preset fields for LSE/header and active coding parameters for regular/run mode. | Annex A.2; Annex C.2.4.1.1 preset parameters; Annex G.2 coding parameters. |
| S03 | `jls_header_writer` | Strip start/finish, strip size, `NEAR`, preset fields. | Emit `SOI/SOF55/LSE/SOS` at strip start and `EOI` after payload flush. | Header/EOI byte stream and `original_image_start` sideband. | Annex C.1-C.4 marker syntax; Annex D.3 scan syntax. |
| S04 | `jls_near_ctrl` | Image start ratio, strip output byte count, strip pixel count. | Project dynamic policy: cumulative actual-vs-target bits, step `NEAR` up/down, clamp to `0..31`. | `current_near`, cumulative bit counters, target-miss flag. | Standard NEAR usage is Annex A/C/D; dynamic ratio policy is project-specific. |
| S10 | `jls_neighbor_provider` | Encode pixel event, reconstructed writeback `Rx`, active strip width, `NEAR`. | Maintain reconstructed line history in two banks, apply top/left/right edge rules, produce JPEG-LS neighbors. For `NEAR=0`, commit `X` as `Rx` immediately. For `NEAR>0`, a non-EOL writeback can overlap the next same-row pixel accept by bypassing returned `Rx` as that pixel's `Ra`; row transitions still wait one clock. Regular-mode true `Rx` returns immediately after S24 accepts the `Errval/Rx` result rather than after Golomb completion. | Neighbor event with `X`, `x/y`, `Ra/Rb/Rc/Rd`, strip flags. | Annex A.3 local gradients; Annex A.4 prediction neighborhood. |
| S11 | `jls_mode_router` | Neighbor event, `strip_width`, `NEAR`. | Determine regular/run entry from gradients, then stay in the Annex A.7 run loop while `run_length_accum` is non-zero; accumulate matching run pixels, reconstruct them as `Ra`, and form run segments at EOL/interruption. Later non-EOL matching run pixels may overlap an outstanding run segment because they emit no entropy yet. | Regular event or run segment; direct run-pixel reconstruction. | Annex A.3 context determination; Annex A.7 run mode. |
| S20 | `jls_predictor` | Regular event `X,Ra,Rb,Rc,Rd`. | Compute MED prediction `Px`. | Predicted event with `Px` and neighbor metadata. | Annex A.4 MED predictor pseudocode. |
| S21 | `jls_context_quantizer` | `Ra/Rb/Rc/Rd`, `T1/T2/T3`, `NEAR`, `Px`. | Compute `D1/D2/D3`, quantize to `Q1/Q2/Q3`, derive context sign and index. | Context event, `context_index`, `context_negative`, `run_mode_context`. | Annex A.3; Annex G.1 context quantization. |
| S22 | `jls_context_model`, `jls_context_memory` | Context event, strip init, regular context writeback. | Lazy-initialize/read 365 regular contexts; track in-flight context indices; bypass same-cycle write/read values to prevent stale `A/B/C/N`. | Vars event with `A/B/C/N`, `C[Q]`, context metadata. | Annex A.2 initialization; Annex A.6 variables. |
| S23 | `jls_prediction_corrector` | `Px`, `C[Q]`, context sign, `MAXVAL`, pre-update vars. | Apply bias correction to prediction and clamp to sample range. | Corrected `Px`, forwarded context vars and metadata. | Annex A.5 prediction correction; Annex A.6 bias variables. |
| S24 | `jls_regular_error_quantizer` | `X`, corrected `Px`, `RANGE`, `NEAR`, `qbpp`, `LIMIT`. | Compute quantized/modulo `Errval`, reconstruct regular-mode `Rx`; `NEAR>0` uses reciprocal-LUT multiply plus quotient correction. The top-level may feed this `Rx` back to S10 before S25/S26/S41 finish because later stages do not alter `Rx`. | `Errval`, regular `Rx`, context index, `qbpp`, `LIMIT`, pre-update vars. | Annex A.5 prediction error encoding; Annex A.2 RANGE. |
|
||||
| S25 | `jls_context_update` | `Errval`, `A/B/C/N`, `NEAR`, `RESET`, context metadata. | Compute pre-update `k`, update `A/B/C/N`, apply RESET halving and bias bounds, compute map inversion flag. | Updated context writeback, `k`, `Errval`, map inversion metadata. | Annex A.5 Golomb parameter; Annex A.6 update variables. |
|
||||
| S26 | `jls_error_mapper` | `Errval`, map inversion flag, `k`, `LIMIT`, `qbpp`. | Convert signed regular `Errval` into non-negative `MErrval`. | Regular mapped event `MErrval/k/LIMIT/qbpp`. | Annex A.5 mapped error value; Annex G.2. |
|
||||
| S30 | `jls_run_mode` | Run segment `run_length`, EOL, interruption `X`, `Ra/Rb`, `RANGE`, `qbpp`, `LIMIT`, `NEAR`, `RESET`. | Emit run-length code via `RUNindex/J`; compute `RItype`, run-interruption `Errval/MErrval/k`, update RI contexts, reconstruct interruption `Rx`; `NEAR>0` uses the same reciprocal-LUT division pipeline. | Direct run code event, run mapped event, interruption `Rx`, segment done flags. | Annex A.7 run mode; Annex A.5 mapped error; Annex G.3 run interruption context. |
|
||||
| S40 | top mapped-error arbiter | Regular mapped event, run mapped event. | Select the next `MErrval/k/limit/qbpp` event while preserving conservative ordering; if one mapped event completes while a new run mapped event is accepted, the new run busy state wins. | Mapped event for Golomb encoder. | Engineering wrapper preserving Annex A.5/A.7 entropy order. |
|
||||
| S41 | `jls_golomb_encoder` | `MErrval`, `k`, `LIMIT`, `qbpp`. | Generate Golomb-Rice prefix/suffix code events and LIMIT fallback path. | Left-aligned variable-length code bits. | Annex A.5; Annex G.2. |
|
||||
| S42 | top code-event arbiter | Direct run-length code event, Golomb code event. | Emit run-length code before the same segment's interruption Golomb event; block reordering. | Ordered code event stream. | Annex A.7 run-length code order; Annex A.5 error code order. |
|
||||
| S43 | `jls_bit_packer` | Ordered code bits, flush request. | Pack bits into bytes, insert JPEG-LS zero bit after data byte `0xFF`, flush partial byte before `EOI`. | Scan payload bytes. | Annex C.1-C.4 entropy-coded segment syntax; Annex H.2 examples. |
|
||||
| S44 | `jls_byte_arbiter` | Header/EOI bytes, payload bytes. | Prioritize marker bytes and preserve `original_image_start` sideband only from header stream. | Encoded byte event. | Annex C marker and scan payload ordering. |
|
||||
| S45 | `jls_output_buffer` | Encoded byte event, output FIFO status ports. | Buffer generated bytes, drain one byte per cycle, ignore external full flags in RTL behavior, report internal pause. | `ofifo_wr`, `ofifo_wdata`, byte count/watermark signals. | Annex C byte stream delivery; project FIFO contract. |
|
||||
|
||||
## Speculation Rule

The implementation may add a future speculation path before S10/S11 to precompute
gradients, prediction candidates, or context read addresses from original or old
values. Such values are only hints. Final `Ra/Rb/Rc/Rd`, `Px`, context index,
run/regular selection, `Errval/MErrval/k`, context updates, run-state updates,
and reconstructed history must be recomputed or checked against the true
encoder-side reconstructed neighbor history before commit.
204
docs/jls_traceability.md
Normal file
@@ -0,0 +1,204 @@
# JPEG-LS RTL Standard Traceability Notes

This document records, during RTL implementation, the correspondence between JPEG-LS
standard clauses, pseudocode variables, RTL code fragments, and examples. Key
processing steps in the implementation code must cite the corresponding subsection
of this document.

## 1. Referenced Standard

- Standard: ITU-T T.87 (06/1998) / ISO/IEC 14495-1 JPEG-LS Baseline
- Official page: https://www.itu.int/rec/T-REC-T.87-199806-I
- Reference implementation: https://github.com/team-charls/charls

## 2. RTL Comment Template

```systemverilog
// Standard : ITU-T T.87 (06/1998) / ISO/IEC 14495-1 JPEG-LS Baseline
// Clause : Annex A.4 Prediction
// Figure : N/A
// Table : N/A
// Pseudocode : MED predictor / Px calculation
// Trace : docs/jls_traceability.md#med-predictor
// Notes : Pipelined implementation; equivalent to the standard step.
```

Rules:

- `Clause`, `Figure`, and `Table` must come from the official standard document or its official table of contents.
- Write `N/A` when there is no corresponding figure or table.
- Never fill in figure, table, or clause numbers from memory.
- Do not copy long passages of the standard into RTL comments; write only the reference location, the variable correspondence, and engineering notes.
- Pipelining, lookup tables, bypasses, and multi-cycle processing must state how they are equivalent to the standard pseudocode.

## 3. Processing Step Cross-Reference Table

| Processing step | RTL module | Standard clause | Figure | Table | RTL fragment ID | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Overall encoding flow | `jpeg_ls_encoder_top`, `jls_scan_ctrl` | Clause 4.4, Annex A.8, Annex D.1-D.3 | N/A | N/A | `JLS_TOP_PIPELINE`, `JLS_SCAN_CONTROL` | See `docs/jls_pipeline_mermaid.md` |
| Single-component coding parameters and compressed data | `jls_scan_ctrl`, `jls_header_writer` | Annex A.1 | N/A | N/A | `JLS_SINGLE_COMPONENT_PARAMS` | Grayscale single component, `Nf=1` |
| Initialization and conventions | `jls_scan_ctrl`, `jls_context_model` | Annex A.2 | N/A | N/A | `JLS_CONTEXT_INIT`, `JLS_CODING_PARAMS` | Re-initialized at strip-frame boundaries |
| Context determination | `jls_context_quantizer`, `jls_context_model` | Annex A.3, Annex G.1 | N/A | N/A | `JLS_CONTEXT_QUANTIZER` | See `fpga/verilog/jls_context_quantizer.sv` |
| MED prediction | `jls_predictor` | Annex A.4 | N/A | N/A | `MED_PREDICTOR` | See Section 4.1 |
| Prediction error encoding | `jls_error_mapper`, `jls_golomb_encoder` | Annex A.5, Annex G.2 | N/A | N/A | `JLS_ERROR_MAPPER`, `JLS_GOLOMB_ENCODER` | See `fpga/verilog/jls_error_mapper.sv` and `fpga/verilog/jls_golomb_encoder.sv` |
| Context variable update | `jls_context_memory`, `jls_context_update`, `jls_context_model` | Annex A.2, Annex A.6 | N/A | N/A | `JLS_CONTEXT_MEMORY`, `JLS_CONTEXT_UPDATE` | See `fpga/verilog/jls_context_memory.sv` and `fpga/verilog/jls_context_update.sv` |
| Prediction bias correction | `jls_prediction_corrector` | Annex A.5, Annex A.6 | N/A | N/A | `JLS_PREDICTION_CORRECTOR` | See `fpga/verilog/jls_prediction_corrector.sv` |
| Run mode encoding | `jls_run_mode` | Annex A.7, Annex G.3 | N/A | N/A | `JLS_RUN_MODE` | RUNindex/J uses the table entries from the standard pseudocode; see Section 4.4 |
| JPEG-LS bitstream format and markers | `jls_header_writer`, `jls_bit_packer` | Annex C.1-C.4 | N/A | N/A | `JLS_HEADER_MARKERS` | See `fpga/verilog/jls_header_writer.sv` |
| LSE default preset parameters | `jls_preset_defaults`, `jls_header_writer` | Annex C.2.4.1.1 | Figure C.3 | Table C.1, Table C.2, Table C.3 | `JLS_PRESET_DEFAULTS` | See `fpga/verilog/jls_preset_defaults.sv` |
| RANGE/qbpp/LIMIT parameters | `jls_coding_params` | Annex A.2, Annex G.2 | N/A | N/A | `JLS_CODING_PARAMS` | See `fpga/verilog/jls_coding_params.sv` |
| Output FIFO byte delivery | `jls_output_buffer` | Annex C.1-C.4 | N/A | N/A | `JLS_OUTPUT_BUFFER` | See `fpga/verilog/jls_output_buffer.sv` |
| Scan control flow | `jls_scan_ctrl`, `jls_header_writer` | Annex D.3 | N/A | N/A | `JLS_SCAN_CONTROL` | See `fpga/verilog/jls_scan_ctrl.sv` |
| Bitstream output example | `jls_bit_packer` | Annex H.2 | N/A | N/A | `JLS_BIT_PACKER` | See `fpga/verilog/jls_bit_packer.sv` |
| Detailed encoding example | Joint description across modules | Annex H.3 | N/A | N/A | `JLS_TRACE_EXAMPLES` | Section 4 of this document gives small-scale variable examples |
| Decoder conformance verification | Verification scripts, CharLS and libjpeg comparison | Annex F.1 | N/A | N/A | `JLS_REFERENCE_COMPARE` | See `tools/jls_compat/reference_decode_compare.py` |

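## 4. Worked Example Templates

### 4.1 MED Predictor

- Standard clause: Annex A.4
- RTL module: `jls_predictor`
- Standard variables: `Ra`, `Rb`, `Rc`, `Px`
- RTL fragment ID: `MED_PREDICTOR`
- Input example: `Ra=10`, `Rb=20`, `Rc=15`
- Intermediate example: `Rc` lies within the `Ra/Rb` interval, so `Ra+Rb-Rc` is selected
- Output example: `Px=15`
- Engineering note: `jls_predictor` implements only the MED compare/add/subtract and registers its output. Line-buffer reads and edge handling for `Ra/Rb/Rc/Rd` live in a separate pipeline stage, which keeps the result equivalent to the standard `Px` computation while reducing per-stage logic depth. With `NEAR=0` the lossless reconstructed value `Rx` equals the input sample `X`, so the RTL may commit `X` to line history immediately. With `NEAR>0` it must wait for the true reconstructed sample or use a replay mechanism verified to be equivalent. The current implementation allows a non-EOL writeback to be accepted in the same cycle as the next pixel of the same row and bypasses the just-returned `Rx` as that pixel's `Ra`; the state transition from end of row to `x=0` of the next row is not bypassed.
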
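
The MED selection in Section 4.1 can be cross-checked with a few lines of Python. This is a minimal reference sketch rather than project tooling: the function name `med_predict` is hypothetical, and the code models only the arithmetic, not the registered pipeline stage.

```python
def med_predict(ra: int, rb: int, rc: int) -> int:
    """Median edge detector (MED) prediction for one sample.

    ra = left neighbor, rb = upper neighbor, rc = upper-left neighbor.
    """
    if rc >= max(ra, rb):
        return min(ra, rb)
    if rc <= min(ra, rb):
        return max(ra, rb)
    # Rc lies strictly between Ra and Rb: planar prediction.
    return ra + rb - rc


# Section 4.1 example: Ra=10, Rb=20, Rc=15 selects Ra+Rb-Rc.
print(med_predict(10, 20, 15))
```

The three branches correspond to the three comparison cases that the predictor testbench is expected to cover.
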
### 4.2 Context Update

- Standard clause: Annex A.3, Annex A.6, Annex G.1
- RTL module: `jls_context_quantizer`, `jls_context_model`
- Standard variables: `D1`, `D2`, `D3`, `Q1`, `Q2`, `Q3`, `A`, `B`, `C`, `N`, `Nn`
- RTL fragment ID: `JLS_CONTEXT_QUANTIZER`, `JLS_CONTEXT_MEMORY`, `JLS_CONTEXT_UPDATE`
- Input example: `Rd=32`, `Rb=10`, `Rc=2`, `Ra=0`, `T1=3`, `T2=7`, `T3=21`, `NEAR=0`
- Intermediate example: `D1=22`, `D2=8`, `D3=2`, quantized to `Q1=4`, `Q2=3`, `Q3=1`
- Output example: `context_index=352`, `context_negative=0`, `run_mode_context=0`
- Engineering note: `jls_context_quantizer` performs only gradient quantization and context numbering. `jls_context_memory` stores the 365 regular contexts and uses a written-bit lazy initialization to return the strip default `A/B/C/N`, which avoids clearing the table entry by entry at strip start. `jls_context_update` performs only the arithmetic `A/B/C/N` update for a single context and splits the Annex A.6 `B[Q] += Errval*(2*NEAR+1)` into DSP input-operand, product, and accumulate pipeline stages. `jls_context_model` tracks read-pending-write state with in-flight busy bits: when consecutive pixels access the same context, it stalls if the writeback has not yet arrived, and bypasses `write_A/B/C/N` if the writeback and the new read occur in the same cycle, so a stale context is never read.

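
The quantization example in Section 4.2 can be reproduced with the sketch below. It follows the Annex A.3 threshold comparisons; the index formula `81*Q1 + 9*Q2 + Q3` is an assumption chosen because it reproduces `context_index=352` and spans `0..364`, and the RTL may number contexts differently. Function names are hypothetical.

```python
def quantize_gradient(d: int, t1: int, t2: int, t3: int, near: int) -> int:
    """Map one local gradient to a region in -4..4."""
    if d <= -t3:
        return -4
    if d <= -t2:
        return -3
    if d <= -t1:
        return -2
    if d < -near:
        return -1
    if d <= near:
        return 0
    if d < t1:
        return 1
    if d < t2:
        return 2
    if d < t3:
        return 3
    return 4


def context_of(ra, rb, rc, rd, t1, t2, t3, near):
    """Return (context_index, context_negative, run_mode_context)."""
    q = [quantize_gradient(d, t1, t2, t3, near)
         for d in (rd - rb, rb - rc, rc - ra)]
    run_mode = all(v == 0 for v in q)
    # Sign normalization: flip the vector when its first non-zero
    # element is negative, and remember the flip.
    first_nonzero = next((v for v in q if v != 0), 0)
    negative = first_nonzero < 0
    if negative:
        q = [-v for v in q]
    # Assumed index mapping (not confirmed by the source).
    index = 81 * q[0] + 9 * q[1] + q[2]
    return index, negative, run_mode


# Section 4.2 example: Rd=32, Rb=10, Rc=2, Ra=0.
print(context_of(ra=0, rb=10, rc=2, rd=32, t1=3, t2=7, t3=21, near=0))
```
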
### 4.3 Golomb-Rice Encoding

- Standard clause: Annex A.5, Annex G.2
- RTL module: `jls_error_mapper`, `jls_golomb_encoder`
- Standard variables: `Errval`, `MErrval`, `k`, `LIMIT`, `qbpp`
- RTL fragment ID: `JLS_ERROR_MAPPER`, `JLS_GOLOMB_ENCODER`
- Input example: `Errval=-3`, `map_invert=0`; then `MErrval=5`, `k=1`, `LIMIT=32`, `qbpp=8`
- Intermediate example: with `MErrval=5`, `high_bits=2`, and the normal-path prefix is `0,0,1`
- Output example: two left-aligned code events, first the prefix `001`, then the suffix `1`
- Engineering note: `jls_error_mapper` maps the signed `Errval` to the non-negative `MErrval`; `jls_golomb_encoder` starts from the already mapped `MErrval` and generates the codeword. `Errval` quantization and the `k` computation sit in upstream pipeline stages so that logic depth can be split. Extremely long codes may take multiple cycles, but must not sacrifice the main clock frequency or the target throughput.

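
The mapping and codeword in Section 4.3 can be modeled as below. This is a sketch under stated assumptions: `map_invert` is taken to select the alternate Annex A.5 mapping (the one used in the lossless `k=0` bias case), the bit string is returned as text rather than as left-aligned code events, and the function names are hypothetical.

```python
def map_error(errval: int, map_invert: bool) -> int:
    """Fold a signed Errval into a non-negative MErrval."""
    if map_invert:
        return 2 * errval + 1 if errval >= 0 else -2 * (errval + 1)
    return 2 * errval if errval >= 0 else -2 * errval - 1


def golomb_bits(merrval: int, k: int, limit: int, qbpp: int) -> str:
    """Limited-length Golomb code for MErrval as a string of '0'/'1'."""
    high = merrval >> k
    escape = limit - qbpp - 1
    if high < escape:
        # Normal path: unary prefix, terminating 1, then k low bits.
        suffix = format(merrval & ((1 << k) - 1), f"0{k}b") if k else ""
        return "0" * high + "1" + suffix
    # LIMIT path: escape prefix, then MErrval-1 in qbpp bits.
    return "0" * escape + "1" + format(merrval - 1, f"0{qbpp}b")


merrval = map_error(-3, map_invert=False)
print(merrval, golomb_bits(merrval, k=1, limit=32, qbpp=8))
```

The LIMIT path always produces exactly `LIMIT` bits, which is what bounds the worst-case codeword length the encoder has to handle.
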
### 4.3a Regular Prediction Correction

- Standard clause: Annex A.5, Annex A.6
- RTL module: `jls_prediction_corrector`
- Standard variables: `Px`, `C[Q]`
- RTL fragment ID: `JLS_PREDICTION_CORRECTOR`
- Input example: `Px=20`, `C=-3`, `context_negative=0`
- Intermediate example: the context sign is not inverted, so the correction is `-3`
- Output example: `corrected_Px=17`
- Engineering note: this module implements only prediction correction and clamping to `0..MAXVAL`; `Errval` quantization, sample reconstruction, and context variable updates are in later pipeline stages.

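
A minimal sketch of the correction step in Section 4.3a, assuming `context_negative` carries the context sign as in the example; the function name is hypothetical.

```python
def correct_prediction(px: int, c: int, context_negative: bool,
                       maxval: int) -> int:
    """Apply the bias correction C[Q] and clamp to 0..MAXVAL."""
    corrected = px - c if context_negative else px + c
    return min(max(corrected, 0), maxval)


# Section 4.3a example: Px=20, C=-3, positive context sign.
print(correct_prediction(20, -3, context_negative=False, maxval=255))
```
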
### 4.3b Regular Error Quantization And Reconstruction Feedback

- Standard clause: Annex A.5, Annex A.6, Annex G.2
- RTL module: `jls_regular_error_quantizer`, `jpeg_ls_encoder_top`
- Standard variables: `Errval`, `Rx`, `MErrval`, `k`, `A`, `B`, `C`, `N`
- RTL fragment ID: `JLS_REGULAR_ERROR_QUANTIZER`, `JLS_TOP_REGULAR_RX_FEEDBACK`
- Input example: `X=22`, `corrected_Px=17`, `NEAR=0`, `RANGE=256`
- Intermediate example: `Errval=5`; in the lossless case `Rx=X=22`
- Output example: `jls_neighbor_provider` receives `Rx=22` on the cycle after the regular error quantization result is accepted
- Engineering note: `Rx` is already determined after the Annex A.5 `Errval` quantization and modulo normalization; the Annex A.6 context update, the `MErrval` mapping, and the Annex G.2 Golomb codeword generation do not modify `Rx`. The top level therefore feeds the regular-mode `Rx` back to line history early while keeping the standard order of context writeback and entropy events.

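
The reason `Rx` can be fed back early is visible in a software model of the quantization step: once `Errval` is quantized, `Rx` depends on nothing that later stages compute. The sketch below uses plain integer division where the RTL is described as using a reciprocal-LUT multiply with quotient correction; the function name and argument order are hypothetical.

```python
def quantize_regular(x: int, px: int, near: int, range_: int,
                     maxval: int, context_negative: bool = False):
    """Return (Errval, Rx) for one regular-mode sample."""
    sign = -1 if context_negative else 1
    errval = sign * (x - px)
    if near > 0:
        step = 2 * near + 1
        # Near-lossless quantization of the prediction error.
        if errval > 0:
            errval = (near + errval) // step
        else:
            errval = -((near - errval) // step)
        rx = min(max(px + sign * errval * step, 0), maxval)
    else:
        rx = x  # lossless: reconstruction equals the input sample
    # Modulo reduction into the RANGE-sized signed interval.
    if errval < 0:
        errval += range_
    if errval >= (range_ + 1) // 2:
        errval -= range_
    return errval, rx


# Section 4.3b example: X=22, corrected Px=17, NEAR=0, RANGE=256.
print(quantize_regular(22, 17, near=0, range_=256, maxval=255))
```

With `NEAR=1` and `RANGE=86` the same inputs give a reconstructed value within one level of `X`, which is the near-lossless guarantee the decoder comparison later checks.
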
### 4.4 Run Mode

- Standard clause: Annex A.7, Annex G.3
- RTL module: `jls_run_mode`
- Standard variables: `RUNindex`, `RUNval`, `RItype`, `EMErrval`
- RTL fragment ID: `JLS_RUN_MODE`
- Input example: `NEAR=31`, `RANGE=6`, `run_length=0`, `Ra=0`, `Rb=0`, `X=200`
- Intermediate example: `RItype=1`, `Errval=floor((200+31)/63)=3`, then modulo-normalized with `RANGE=6` to `-3`; reconstructed `Rx=-3*63+6*63=189`
- Output example: the zero-length run with `RUNindex=0` emits a single `0` bit; `MErrval=4`, `k=1`
- Engineering note: `jls_mode_router` uses the gradients only to decide entry into run mode. Once `run_length_accum` is non-zero it stays in the Annex A.7 run loop, and a later non-matching sample is encoded as a run interruption instead of being re-classified by regular gradients. `jls_run_mode` outputs the run-length code and the run-interruption mapped event separately and maintains the RItype 0/1 contexts. For `NEAR>0`, interruption error quantization uses a reciprocal-LUT multiply and quotient-correction pipeline to avoid a long combinational divider.

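
The run-interruption numbers in Section 4.4 can be checked against a software model of the Annex A.7 interruption step. The sketch below reflects one reading of that pseudocode and omits the context update and the Golomb limit adjustment; the two-element lists `a`, `n`, `nn` stand for the RItype 0/1 contexts and, like the function name, are illustrative assumptions.

```python
def run_interruption(x, ra, rb, near, range_, maxval, a, n, nn):
    """Return (RItype, Errval, Rx, k, EMErrval) for one interruption sample."""
    ritype = 1 if abs(ra - rb) <= near else 0
    px = ra if ritype == 1 else rb
    errval = x - px
    sign = 1
    if ritype == 0 and ra > rb:
        errval, sign = -errval, -1
    if near > 0:
        step = 2 * near + 1
        if errval > 0:
            errval = (near + errval) // step
        else:
            errval = -((near - errval) // step)
        rx = min(max(px + sign * errval * step, 0), maxval)
    else:
        rx = x
    # Modulo reduction with RANGE.
    if errval < 0:
        errval += range_
    if errval >= (range_ + 1) // 2:
        errval -= range_
    # Golomb parameter from the run-interruption context.
    temp = a[0] if ritype == 0 else a[1] + (n[1] >> 1)
    k = 0
    while (n[ritype] << k) < temp:
        k += 1
    # Map flag selection.
    if k == 0 and errval > 0 and 2 * nn[ritype] < n[ritype]:
        map_flag = 1
    elif errval < 0 and 2 * nn[ritype] >= n[ritype]:
        map_flag = 1
    elif errval < 0 and k != 0:
        map_flag = 1
    else:
        map_flag = 0
    emerrval = 2 * abs(errval) - ritype - map_flag
    return ritype, errval, rx, k, emerrval


# Freshly initialized contexts for RANGE=6: A=max(2, (RANGE+32)//64)=2, N=1, Nn=0.
print(run_interruption(200, 0, 0, near=31, range_=6, maxval=255,
                       a=[2, 2], n=[1, 1], nn=[0, 0]))
```
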
### 4.5 Bit Packing And Stuffing

- Standard clause: Annex C.1-C.4, Annex H.2
- RTL module: `jls_bit_packer`
- Standard variables: entropy-coded segment bitstream, marker byte `0xFF`
- RTL fragment ID: `JLS_BIT_PACKER`
- Input example: `code_bits[63:56]=8'hFF`, `code_bit_count=8`, followed by 7 data bits `1111111`
- Intermediate example: once the first payload byte is `0xFF`, one zero stuffed bit is inserted per the JPEG-LS rule, and the following 7 data bits are then packed
- Output example: payload bytes `FF 7F`
- Engineering note: the JPEG-LS marker/zero-bit stuffing rule must be followed; simplifying it to ordinary
  JPEG `0xFF 0x00` byte stuffing is prohibited.

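
The `FF 7F` example in Section 4.5 follows from treating the byte after a `0xFF` data byte as having only seven usable bit positions. The sketch below models that with a bit string input; it is not the 64-bit left-aligned code-event interface of `jls_bit_packer`, the name `pack_bits` is hypothetical, and the case of a stream that ends exactly on a `0xFF` byte is deliberately left out.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a '0'/'1' string into bytes with JPEG-LS zero-bit stuffing."""
    out = bytearray()
    current = 0    # data bits collected for the byte being built
    capacity = 8   # data bits this byte can hold (7 right after 0xFF)
    count = 0      # data bits collected so far in this byte
    for bit in bits:
        current = (current << 1) | int(bit)
        count += 1
        if count == capacity:
            # With capacity 7 the value is at most 0x7F, so the MSB is
            # the stuffed zero bit.
            out.append(current)
            capacity = 7 if current == 0xFF else 8
            current, count = 0, 0
    if count:
        # Flush: left-align the partial byte and pad with zeros.
        out.append(current << (capacity - count))
    return bytes(out)


# Section 4.5 example: eight 1 bits, then seven more 1 bits.
print(pack_bits("1" * 15).hex())
```

Conventional JPEG byte stuffing would instead emit `FF 00` and keep eight data bits in the following byte, so the two schemes diverge on the very next bit.
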
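### 4.6 Dynamic NEAR Control

- Standard clause: for JPEG-LS `NEAR` usage see Annex A/C/D; the dynamic control references CN102088602A.
- RTL module: `jls_near_ctrl`
- Standard variables: `NEAR`
- RTL fragment ID: `JLS_NEAR_CONTROL`
- Input example: `ratio=2`, `current_near=3`, `actual_bits_cumulative=9000`, `target_bits_cumulative=8192`
- Intermediate example: the cumulative actual bit count exceeds the cumulative target bit count, so `NEAR` for the next strip is stepped up by 1 and clamped to `0..31`
- Output example: `next_near=4`, along with the cumulative bit statistics of the current strip for reporting
- Engineering note: the first version steps `NEAR` using a simple comparison of cumulative actual bits against cumulative target bits at the end of each strip frame;
  it may later be refined along the lines of the patent method.
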
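
The stepping policy in Section 4.6 can be sketched as follows. The step-up branch and the clamp come from the example; stepping down when under target follows the "step `NEAR` up/down" description of S04, and holding on exact equality is an assumption. The function name is hypothetical.

```python
def next_near(current_near: int, actual_bits: int, target_bits: int,
              ratio: int, max_near: int = 31) -> int:
    """Choose NEAR for the next strip from cumulative bit counts."""
    if ratio == 0:
        return 0  # lossless request: NEAR stays at 0
    if actual_bits > target_bits:
        step = 1      # over budget: allow more loss
    elif actual_bits < target_bits:
        step = -1     # under budget: tighten the error bound
    else:
        step = 0      # assumption: hold on exact equality
    return min(max(current_near + step, 0), max_near)


# Section 4.6 example: 9000 actual bits against an 8192-bit target.
print(next_near(3, actual_bits=9000, target_bits=8192, ratio=2))
```
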
### 4.7 JLS Header Markers

- Standard clause: Annex C.2.2, Annex C.2.3, Annex C.2.4.1
- RTL module: `jls_header_writer`
- Standard variables: `P`, `Y`, `X`, `Nf`, `Ci`, `NEAR`, `ILV`, `MAXVAL`, `T1`, `T2`, `T3`, `RESET`
- RTL fragment ID: `JLS_HEADER_MARKERS`
- Input example: `PIX_WIDTH=8`, `strip_width=32`, `strip_height=16`, `NEAR=0`
- Output example: the `SOI/SOF55/LSE/SOS` header, followed by `EOI` once the strip payload flush completes
- Engineering note: marker fields are output in big-endian byte order; `ofifo_wdata[8]` maps only to the first `0xFF` byte of the first strip's `SOI`.

### 4.8 JLS Preset Defaults

- Standard clause: Annex C.2.4.1.1
- RTL module: `jls_preset_defaults`
- Standard variables: `MAXVAL`, `T1`, `T2`, `T3`, `RESET`, `NEAR`
- RTL fragment ID: `JLS_PRESET_DEFAULTS`
- Input example: `PIX_WIDTH=8`, `NEAR=0`
- Output example: `MAXVAL=255`, `T1=3`, `T2=7`, `T3=21`, `RESET=64`
- Engineering note: this project supports only `PIX_WIDTH=8/10/12/14/16` with `NEAR<=31`, so the default threshold computation reduces to shallow shift-add logic; if the range is extended later, the clamp path must be re-reviewed.

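
The default thresholds in Section 4.8 come from the Annex C.2.4.1.1 equations for `MAXVAL >= 128`, which is the only branch the supported precisions reach. The sketch below models that branch in Python; the function name is hypothetical and smaller `MAXVAL` values are rejected rather than handled.

```python
def preset_defaults(maxval: int, near: int):
    """Return default (T1, T2, T3, RESET) for MAXVAL >= 128."""
    if maxval < 128:
        raise ValueError("sketch covers only MAXVAL >= 128")

    def clamp(value: int, low: int) -> int:
        # Out-of-range results fall back to the lower bound.
        return low if value > maxval or value < low else value

    factor = (min(maxval, 4095) + 128) >> 8
    t1 = clamp(factor * (3 - 2) + 2 + 3 * near, near + 1)
    t2 = clamp(factor * (7 - 3) + 3 + 5 * near, t1)
    t3 = clamp(factor * (21 - 4) + 4 + 7 * near, t2)
    return t1, t2, t3, 64


# Section 4.8 example: PIX_WIDTH=8 (MAXVAL=255), NEAR=0.
print(preset_defaults(255, 0))
```

Because `factor` saturates at 16 for `MAXVAL >= 4095`, the 12-, 14-, and 16-bit cases share the same thresholds at a given `NEAR`, which is what lets the hardware path stay shallow.
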
### 4.9 JLS Output Buffer

- Standard clause: Annex C.1-C.4
- RTL module: `jls_output_buffer`
- Standard variables: JPEG-LS marker stream byte order
- RTL fragment ID: `JLS_OUTPUT_BUFFER`
- Input example: `original_image_start=1`, `byte_data=8'hFF`
- Output example: `ofifo_wdata=9'h1FF`
- Engineering note: the external `ofifo_full/ofifo_alfull` flags do not take part in RTL flow control. If a write is still issued in simulation while `ofifo_full=1`, the module reports an error, which exposes insufficient external FIFO depth or a system-level flow-control problem.

### 4.9a JLS Coding Parameters

- Standard clause: Annex A.2, Annex G.2
- RTL module: `jls_coding_params`
- Standard variables: `RANGE`, `qbpp`, `LIMIT`, `NEAR`
- RTL fragment ID: `JLS_CODING_PARAMS`
- Input example: `PIX_WIDTH=8`, `NEAR=0`
- Output example: `RANGE=256`, `qbpp=8`, `LIMIT=32`
- Engineering note: this project limits `NEAR` to `0..31`, so a lookup table replaces run-time division. The path carries strip-level control parameters, but it is still handled under the high-clock-frequency design constraints.

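
The lookup values in Section 4.9a can be regenerated from the Annex A.2 definitions, for example to build or check the table contents. A minimal sketch, with a hypothetical function name, assuming `MAXVAL = 2^PIX_WIDTH - 1`:

```python
def coding_params(pix_width: int, near: int):
    """Return (RANGE, qbpp, LIMIT) for one precision and NEAR value."""
    maxval = (1 << pix_width) - 1
    range_ = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = (range_ - 1).bit_length()      # ceil(log2(RANGE))
    bpp = max(2, maxval.bit_length())     # ceil(log2(MAXVAL + 1))
    limit = 2 * (bpp + max(8, bpp))
    return range_, qbpp, limit


# Section 4.9a example: PIX_WIDTH=8, NEAR=0.
print(coding_params(8, 0))

# One table row per NEAR value for a given precision.
table_8bit = [coding_params(8, near) for near in range(32)]
```

The `NEAR=31` entry for 8-bit samples gives `RANGE=6`, matching the run-mode example in Section 4.4.
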
### 4.10 JLS Scan Control

- Standard clause: Annex A.8, Annex D.1-D.3
- RTL module: `jls_scan_ctrl`
- Standard variables: scan start/end control, `NEAR`
- RTL fragment ID: `JLS_SCAN_CONTROL`
- Input example: `strip_first_pixel=1`, `image_first_pixel=1`, `current_near=7`
- Output example: `strip_start_valid=1`, `original_image_first_strip=1`, `strip_near=0`
- Engineering note: the first strip of an image is forced to `NEAR=0`, which prevents the previous image's dynamic `NEAR` state from affecting the new image's header; later strips use the `jls_near_ctrl` output.

### 4.11 JLS Bit Packer

- Standard clause: Annex C.1-C.4, Annex H.2
- RTL module: `jls_bit_packer`
- Standard variables: JPEG-LS entropy-coded bitstream
- RTL fragment ID: `JLS_BIT_PACKER`
- Input example: `code_bits[63:56]=8'hFF`, `code_bit_count=8`
- Output example: if 7 more `1` data bits follow, the payload bytes are `FF 7F`
- Engineering note: only one zero bit is inserted after `0xFF`; this must not be simplified to conventional JPEG byte stuffing. On flush, the current byte is padded with zeros, and no unfinished bits may remain before the `EOI` marker.
310
docs/jls_verification_plan.md
Normal file
@@ -0,0 +1,310 @@
# JPEG-LS FPGA Verification Plan

This document defines the verification ladder and report schema for the RTL
encoder described by `fpga/srs/jpeg_ls.md`.

## Verification Ladder

### Smoke

Purpose: catch structural and integration failures quickly.

Required checks:
- Reference tool path can decode a known-good project RTL output (`.rtljls`) or a generic `.jls`.
- Concatenated standalone strip-frame stream can be split.
- Split strip frames decode independently with CharLS.
- Recombined decoded strips match the reference PGM.
- If jpeg.org/libjpeg `jpeg` executable is available, it decodes the same strip
  frames and matches CharLS.

Current command:

```powershell
$env:JLS_COMPAT_PYDEPS = (Resolve-Path tools/jls_compat/.deps).Path
python tools/jls_compat/make_strip_stream_smoke.py --width 32 --height 32 --strip-rows 16 --bit-depth 8 --name strip_smoke_8b
python tools/jls_compat/reference_decode_compare.py tools/jls_compat/out/strip_smoke_8b.jls --split-frames --expected-frames 2 --reference-pgm tools/jls_compat/out/strip_smoke_8b.pgm
```

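
Splitting a concatenated stream of standalone strip frames is safer when it walks the marker structure instead of searching for `FF D9`, because a length-prefixed segment may legitimately contain those two bytes. The sketch below illustrates the idea only; it is not the implementation behind `--split-frames`, and the function name is hypothetical. It relies on the JPEG-LS property that, inside entropy-coded data, a `0xFF` byte is always followed by a byte below `0x80`.

```python
def split_frames(stream: bytes) -> list:
    """Split concatenated SOI...EOI frames into a list of byte strings."""
    frames = []
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + 2] != b"\xff\xd8":
            raise ValueError(f"expected SOI at offset {pos}")
        start = pos
        pos += 2
        while True:
            if pos + 2 > len(stream) or stream[pos] != 0xFF:
                raise ValueError(f"expected marker at offset {pos}")
            marker = stream[pos + 1]
            pos += 2
            if marker == 0xD9:          # EOI closes the frame
                break
            # Every other marker here carries a 2-byte big-endian length
            # that counts itself.
            pos += int.from_bytes(stream[pos:pos + 2], "big")
            if marker == 0xDA:          # SOS: entropy-coded data follows
                while pos + 1 < len(stream) and not (
                        stream[pos] == 0xFF and stream[pos + 1] >= 0x80):
                    pos += 1
        frames.append(stream[start:pos])
    return frames
```

A caller would then compare `len(split_frames(data))` against the expected strip count before handing each frame to the reference decoders.
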
### RTL Smoke

Purpose: verify the first RTL integration before full entropy coding.

Required checks:
- `ififo_rd` obeys synchronous FIFO timing.
- SOF detection starts exactly one original image.
- Runtime dimensions and `ratio` are sampled at SOF.
- Invalid dimensions fall back to `6144 x 256`.
- Strip boundaries occur every `SCAN_ROWS` rows.
- Scan controller emits strip start/finish events and forwards pixels in order.
- Neighbor provider emits reconstructed `Ra/Rb/Rc/Rd` for top-row, left-edge,
  middle-column, and right-edge cases. It covers the immediate `Rx == X`
  commit path for `NEAR=0` lossless strips, including a no-recon consecutive
  input case, and retains the true reconstructed writeback wait path for
  `NEAR>0`. It also covers the `NEAR>0` same-row writeback-to-next-`Ra`
  bypass and verifies that row transitions do not bypass bank/edge-state
  updates.
- Mode router sends non-run contexts to the regular path, reconstructs matching
  run pixels as `Ra`, accumulates run length, and emits interruption/EOL run
  segments. It must remain in the Annex A.7 run loop while `run_length_accum`
  is non-zero, so a later nonmatching sample is encoded as a run interruption
  even if its gradients would not independently enter run mode.
- MED predictor computes `Px` for the three standard Ra/Rb/Rc comparison
  cases and stalls cleanly when the downstream stage is not ready.
- Context quantizer computes `Q1/Q2/Q3`, absolute context index, context sign,
  and run-mode flag for zero, positive, negative, and NEAR-zero gradients.
- Prediction corrector applies context variable `C[Q]` with context sign and
  clamps the corrected prediction to `0..MAXVAL`.
- Regular error quantizer covers `Errval` normalization, reconstructed `Rx`,
  and the `NEAR=31` reciprocal-LUT division path. Top-level compatibility
  smokes must keep passing after regular-mode `Rx` is fed back at quantizer
  acceptance instead of `mapped_done`.
- JPEG-LS default LSE preset parameters match the supported
  8/10/12/14/16-bit threshold equations and NEAR clamp rule.
- Coding parameter lookup returns `RANGE`, `qbpp`, and `LIMIT` for representative
  supported precisions, `NEAR=31`, and defensive NEAR clamp cases.
- Header writer emits exact `SOI/SOF55/LSE/SOS` bytes and trailing `EOI`.
- Dynamic NEAR controller updates by cumulative actual-vs-target bits, forces
  ratio=0/invalid to NEAR=0, and reports the MAX_NEAR miss condition.
- Context memory lazily initializes all 365 regular contexts, latches the
  A-init value, returns defaults for untouched contexts, supports registered
  readback, and overwrites old state on re-init.
- Context model stalls same-context hazards until writeback and bypasses
  same-cycle write/read values so a later event cannot read stale `A/B/C/N`.
- Context update arithmetic computes pre-update `k` and next `A/B/C/N` for
  positive, negative, RESET-halving, and C-saturation cases.
- Error mapper converts positive, negative, and context-inverted `Errval`
  values to `MErrval` and forwards `k/LIMIT/qbpp`.
- Run mode encodes zero-length run interruptions for `RItype=0/1`, emits
  EOL run chunks from `RUNindex/J`, updates run-interruption contexts, and
  preserves code-event ordering under its direct run-code interface. It also
  covers a `NEAR=31` run-interruption case through the reciprocal division
  pipeline.
- Golomb encoder emits the regular and LIMIT-path code events from `MErrval`,
  `k`, `LIMIT`, and `qbpp` with left-aligned bit order.
- Bit packer packs left-aligned variable-length code events, handles JPEG-LS
  0-bit stuffing after `0xFF`, and flushes partial bytes before EOI.
- Byte arbiter gives header/EOI bytes priority over payload bytes and preserves
  `original_image_start` sideband only for the header stream.
- Output buffer preserves byte order and places `original_image_start` on
  `ofifo_wdata[8]` for the corresponding byte event.
- Top-level idle smoke elaborates the integrated RTL and verifies that empty
  input produces no FIFO read/write activity.
- Top-level all-zero run-mode smoke consumes a small 8-bit image, emits one
  complete `SOI...EOI` strip frame, and checks that `ofifo_wdata[8]` appears
  exactly once.
- `ofifo_wdata[8]` is high only on the first byte of the first strip frame.
- Output byte stream preserves strip order.

Current standalone RTL smoke commands:

```powershell
fpga/sim/run_jls_smoke.ps1
```

Current top-level compatibility smoke command:

```powershell
fpga/sim/run_jls_top_compat_smoke.ps1
```

Current staged throughput command:

```powershell
fpga/sim/run_jls_throughput_regression.ps1
```

Notes:
- The throughput script uses `tb_jpeg_ls_encoder_top_run_smoke` with
  `+CHECK_THROUGHPUT=1` and writes `tools/jls_compat/out/rtl_throughput_stats.csv`.
- Its default is staged and narrow: `PIX_WIDTH=8`, `ratio=1/2/3`,
  `6144 x 256`, `IMAGE_COUNT=10`, and `PATTERN=9`.
- `PATTERN=9` rotates ten deterministic representative images across the
  10-image stream. It covers smooth, gradient, checker, edge, low-gradient,
  stripe, texture, and pseudo-noise style inputs.
- Full regression should pass `-BitsList 8,10,12,14,16` and then run the
  reference decoder comparison flow on the generated streams.
- Smoke/compatibility scripts scan simulator output for `** Fatal` and non-zero
  `Errors:` counts instead of relying only on process exit codes, because
  QuestaSim can finish with exit code 0 after a testbench `$fatal` when the
  command script ends with `quit`.

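
The log-scanning rule in the last note can be expressed compactly. The project scripts are PowerShell; the Python sketch below only illustrates the same check, with a hypothetical function name, and assumes the summary line contains `Errors:` followed by a decimal count.

```python
import re


def log_has_failure(log_text: str) -> bool:
    """Return True when simulator output shows a fatal or a non-zero error count."""
    if "** Fatal" in log_text:
        return True
    # Any 'Errors: N' occurrence with N != 0 counts as a failure,
    # regardless of the process exit code.
    return any(int(match.group(1)) != 0
               for match in re.finditer(r"Errors:\s*(\d+)", log_text))
```

A wrapper script would call this on the captured transcript and fail the run when it returns `True`, even if the simulator process itself exited with code 0.
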
Current top-level compatibility status:
- `tb_jpeg_ls_encoder_top_run_smoke` writes
  16x16 zero and row-major ramp outputs for `PIX_WIDTH=8/10/12/14/16`.
- `run_jls_top_compat_smoke.ps1` generates the matching reference PGMs and
  verifies all smoke outputs with CharLS through `reference_decode_compare.py`.
- The same script also runs an 8-bit 16x32 ramp image as two 16-row strip
  frames, splits the concatenated `SOI...EOI` stream, decodes both frames with
  CharLS, and compares the stitched image with the reference PGM.
- The script runs a small 8-bit two-image all-zero stream with two SOF sideband
  events, splits the two standalone JPEG-LS frames, and checks the stitched
  decoded result against a 16x32 zero reference PGM. This is a smoke precursor
  for the later 10-image throughput regression.
- The compatibility script also runs an 8-bit 16x32 `ratio=2` dynamic-NEAR ramp
  case, splits and decodes both strip frames with CharLS, and compares against
  the reference PGM with a bounded absolute-difference tolerance.
- The staged throughput script has been bring-up tested on a small
  8-bit 16x16x10 `PATTERN=9` stream for `ratio=1/2/3` with the hard throughput
  assertion disabled. This verifies the script, CSV report, and mixed-pattern
  run/regular control path; it is not a 200 MPixel/s result.
- This is still a narrow compatibility smoke. It covers all-zero run-heavy
  behavior, mixed regular/run ramp cases, one lossless two-strip ramp case, and
  one near-lossless dynamic-NEAR two-strip ramp case, but it does not replace
  the later larger-image and throughput regressions.

Equivalent manual commands:

```powershell
vlog -sv fpga/verilog/jls_preset_defaults.sv fpga/sim/tb_jls_preset_defaults.sv
vsim -c tb_jls_preset_defaults -do "run -all; quit"
vlog -sv fpga/verilog/jls_coding_params.sv fpga/sim/tb_jls_coding_params.sv
vsim -c tb_jls_coding_params -do "run -all; quit"
vlog -sv fpga/verilog/jls_common_pkg.sv fpga/verilog/jls_input_ctrl.sv fpga/sim/tb_jls_input_ctrl.sv
vsim -c tb_jls_input_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_scan_ctrl.sv fpga/sim/tb_jls_scan_ctrl.sv
vsim -c tb_jls_scan_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_neighbor_provider.sv fpga/sim/tb_jls_neighbor_provider.sv
vsim -c tb_jls_neighbor_provider -do "run -all; quit"
vlog -sv fpga/verilog/jls_mode_router.sv fpga/sim/tb_jls_mode_router.sv
vsim -c tb_jls_mode_router -do "run -all; quit"
vlog -sv fpga/verilog/jls_predictor.sv fpga/sim/tb_jls_predictor.sv
vsim -c tb_jls_predictor -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_quantizer.sv fpga/sim/tb_jls_context_quantizer.sv
vsim -c tb_jls_context_quantizer -do "run -all; quit"
vlog -sv fpga/verilog/jls_prediction_corrector.sv fpga/sim/tb_jls_prediction_corrector.sv
vsim -c tb_jls_prediction_corrector -do "run -all; quit"
vlog -sv fpga/verilog/jls_common_pkg.sv fpga/verilog/jls_header_writer.sv fpga/sim/tb_jls_header_writer.sv
vsim -c tb_jls_header_writer -do "run -all; quit"
vlog -sv fpga/verilog/jls_near_ctrl.sv fpga/sim/tb_jls_near_ctrl.sv
vsim -c tb_jls_near_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_memory.sv fpga/sim/tb_jls_context_memory.sv
vsim -c tb_jls_context_memory -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_update.sv fpga/sim/tb_jls_context_update.sv
vsim -c tb_jls_context_update -do "run -all; quit"
vlog -sv fpga/verilog/jls_error_mapper.sv fpga/sim/tb_jls_error_mapper.sv
vsim -c tb_jls_error_mapper -do "run -all; quit"
vlog -sv fpga/verilog/jls_run_mode.sv fpga/sim/tb_jls_run_mode.sv
vsim -c tb_jls_run_mode -do "run -all; quit"
vlog -sv fpga/verilog/jls_golomb_encoder.sv fpga/sim/tb_jls_golomb_encoder.sv
vsim -c tb_jls_golomb_encoder -do "run -all; quit"
vlog -sv fpga/verilog/jls_bit_packer.sv fpga/sim/tb_jls_bit_packer.sv
vsim -c tb_jls_bit_packer -do "run -all; quit"
vlog -sv fpga/verilog/jls_byte_arbiter.sv fpga/sim/tb_jls_byte_arbiter.sv
vsim -c tb_jls_byte_arbiter -do "run -all; quit"
vlog -sv fpga/verilog/jls_output_buffer.sv fpga/sim/tb_jls_output_buffer.sv
vsim -c tb_jls_output_buffer -do "run -all; quit"
vlog -sv -f fpga/verilog/jpeg_ls_rtl.f fpga/sim/tb_jpeg_ls_encoder_top_idle.sv
vsim -c tb_jpeg_ls_encoder_top_idle -do "run -all; quit"
vlog -sv -f fpga/verilog/jpeg_ls_rtl.f fpga/sim/tb_jpeg_ls_encoder_top_run_smoke.sv
vsim -c tb_jpeg_ls_encoder_top_run_smoke -do "run -all; quit"
```

### Small Regression

Purpose: verify basic JPEG-LS algorithm correctness.

Image matrix:

| Bit depth | Ratio | Pattern | Size |
| ---: | ---: | --- | --- |
| 8 | 0 | gradient | 16 x 16 |
| 8 | 1 | checker | 32 x 32 |
| 10 | 1 | gradient | 32 x 32 |
| 12 | 2 | edge | 32 x 32 |
| 14 | 2 | ramp | 32 x 32 |
| 16 | 0 | gradient | 16 x 16 |
| 16 | 3 | checker | 32 x 32 |

Pass/fail:
- `ratio=0`: decoded pixels exactly match input.
- `ratio=1/2/3`: per-pixel error is less than or equal to that strip frame's
  actual NEAR.
- All output strip frames decode with CharLS.
- If libjpeg executable is present, all output strip frames decode with libjpeg.

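
The per-pixel rule above has to be applied strip by strip, because each strip frame carries its own NEAR. The sketch below shows one way to express that check on images held as lists of rows; the function names and the in-memory layout are assumptions, not the project's comparison script.

```python
def strip_within_near(original, decoded, near: int) -> bool:
    """True when every decoded pixel is within `near` of the original."""
    if len(original) != len(decoded):
        return False
    for orig_row, dec_row in zip(original, decoded):
        if len(orig_row) != len(dec_row):
            return False
        if any(abs(o - d) > near for o, d in zip(orig_row, dec_row)):
            return False
    return True


def image_within_near(original, decoded, strip_rows: int, strip_nears) -> bool:
    """Check each strip against the NEAR value used in that strip's SOS."""
    if len(original) != len(decoded):
        return False
    for index, near in enumerate(strip_nears):
        y0 = index * strip_rows
        if not strip_within_near(original[y0:y0 + strip_rows],
                                 decoded[y0:y0 + strip_rows], near):
            return False
    return True
```

With every entry of `strip_nears` equal to 0 the same function enforces the exact-match rule for `ratio=0`.
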
### Full Regression

Purpose: enforce the hard SRS requirements.

Required cases:
- All supported `PIX_WIDTH`: 8, 10, 12, 14, 16.
- `ratio=0/1/2/3`.
- Default image size `6144 x 256`.
- Maximum row width `6144`.
- Maximum row count boundary `4096`.
- Minimum legal size `16 x 16`.
- At least 10 representative images covering smooth, gradient, noise, edge, and
  texture scenes.
- Continuous 10-image throughput test for `ratio=1/2/3`.
- The staged throughput script is the executable entry point for this test, but
  the full pass criterion also requires CharLS and jpeg.org/libjpeg reference
  decode after the long simulations are considered mature enough to run.

Pass/fail:
- Complete `.rtljls`/`.jls` stream is split into the expected number of strip frames.
- Every strip frame decodes with CharLS.
- Every strip frame decodes with jpeg.org/libjpeg; if the executable is missing
  in full regression, the run is FAIL.
- CharLS and libjpeg decoded pixels match each other.
- Decoded pixels meet the lossless or near-lossless error rule.
- Compression-ratio error is within the SRS threshold or reported FAIL when
  `NEAR=31` cannot satisfy the target.
- Average input throughput is at least 200 MPixel/s for `ratio=1/2/3`, excluding
  upstream `ififo_empty` waits and including internal stalls.

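
The throughput criterion can be read as pixels per counted clock cycle scaled by the clock frequency, where cycles spent waiting on an empty input FIFO are removed from the denominator and internal stall cycles are not. The sketch below assumes the 250 MHz target clock and hypothetical argument names; how the testbench actually counts cycles is not specified here.

```python
def throughput_mpix_s(pixels: int, total_cycles: int,
                      ififo_empty_wait_cycles: int,
                      clk_mhz: float = 250.0) -> float:
    """Average input throughput in MPixel/s.

    Internal stall cycles stay in total_cycles; only upstream-empty
    waits are excluded.
    """
    counted = total_cycles - ififo_empty_wait_cycles
    if counted <= 0:
        raise ValueError("no counted cycles")
    return pixels * clk_mhz / counted


# At 250 MHz the 200 MPixel/s floor corresponds to 0.8 pixel per counted cycle.
print(throughput_mpix_s(pixels=1000, total_cycles=1250,
                        ififo_empty_wait_cycles=0))
```
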
## Report Schema
|
||||
|
||||
The regression report should be JSON or CSV with equivalent fields.
|
||||
|
||||
Image-level fields:
|
||||
|
||||
| Field | Description |
|
||||
| --- | --- |
|
||||
| `case_id` | Stable test case name. |
|
||||
| `pix_width` | RTL `PIX_WIDTH`. |
|
||||
| `ratio` | Runtime ratio port value. |
|
||||
| `cfg_pic_col` | Runtime configured width. |
|
||||
| `cfg_pic_row` | Runtime configured height. |
|
||||
| `active_pic_col` | Actual effective width after fallback. |
|
||||
| `active_pic_row` | Actual effective height after fallback. |
|
||||
| `strip_rows` | `SCAN_ROWS`. |
|
||||
| `strip_count` | Number of standalone JPEG-LS strip frames. |
|
||||
| `output_bytes` | Total generated byte count across all strips. |
|
||||
| `raw_bits` | `active_pic_col * active_pic_row * PIX_WIDTH`. |
|
||||
| `actual_bits` | `output_bytes * 8`. |
|
||||
| `target_bits` | Target bit count from `ratio`. |
|
||||
| `compression_ratio` | `raw_bits / actual_bits`. |
|
||||
| `max_error` | Whole-image maximum absolute reconstruction error. |
|
||||
| `charls_status` | `PASS`, `FAIL`, or `SKIP`. |
|
||||
| `libjpeg_status` | `PASS`, `FAIL`, or `SKIP`. |
|
||||
| `compat_status` | `PASS` only if all required reference decoders agree. |
|
||||
| `throughput_mpix_s` | Average input throughput for the case. |
|
||||
| `pause_cycles_total` | Total internal pause cycles included in throughput. |
|
||||
| `ififo_empty_wait_cycles` | Upstream-empty cycles excluded from throughput. |
|
||||
| `outbuf_max_watermark` | Maximum internal output-buffer occupancy. |
|
||||
| `result` | Overall `PASS` or `FAIL`. |
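
The bit-count fields in the table above follow directly from the image geometry and the output byte count. A minimal sketch, assuming only the formulas listed in the table; `image_bit_fields` is a hypothetical helper and the `output_bytes` value is made up for illustration. `target_bits` is omitted because its mapping from `ratio` is defined by the SRS.

```python
def image_bit_fields(active_pic_col, active_pic_row, pix_width, output_bytes):
    """Derive the bit-count fields of one image-level report row."""
    raw_bits = active_pic_col * active_pic_row * pix_width
    actual_bits = output_bytes * 8
    return {
        "raw_bits": raw_bits,
        "actual_bits": actual_bits,
        "compression_ratio": raw_bits / actual_bits,
    }


# Default 6144 x 256 image at PIX_WIDTH=16; output_bytes is a made-up value.
fields = image_bit_fields(6144, 256, 16, output_bytes=1_572_864)
print(fields["raw_bits"])           # 25165824
print(fields["actual_bits"])        # 12582912
print(fields["compression_ratio"])  # 2.0
```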
|
||||
|
||||
Strip-level fields:
|
||||
|
||||
| Field | Description |
|
||||
| --- | --- |
|
||||
| `case_id` | Parent test case name. |
|
||||
| `strip_index` | Zero-based strip index. |
|
||||
| `strip_y0` | First original-image row in the strip. |
|
||||
| `strip_height` | Strip frame height. |
|
||||
| `near` | NEAR used in this strip frame's `SOS`. |
|
||||
| `output_bytes` | Byte count for this standalone JPEG-LS frame. |
|
||||
| `actual_bits_cumulative` | Cumulative actual bit count after this strip. |
|
||||
| `target_bits_cumulative` | Cumulative target bit count after this strip. |
|
||||
| `max_error` | Maximum reconstruction error in this strip. |
|
||||
| `pause_cycles` | Internal pause cycles attributed to this strip. |
|
||||
| `outbuf_max_watermark` | Maximum buffer occupancy during this strip. |
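
The strip geometry and cumulative fields above can be derived as in the following sketch. It assumes `strip_index = y / SCAN_ROWS` as stated in the interface rules; whether a final partial strip is legal is an SRS question, so the shorter last strip here is an assumption. `strip_layout` and `cumulative_bits` are hypothetical names.

```python
def strip_layout(active_pic_row, scan_rows=16):
    """Return (strip_index, strip_y0, strip_height) tuples for one image."""
    layout = []
    for y0 in range(0, active_pic_row, scan_rows):
        # Assumption: a final partial strip, if allowed, is simply shorter.
        layout.append((y0 // scan_rows, y0, min(scan_rows, active_pic_row - y0)))
    return layout


def cumulative_bits(strip_output_bytes):
    """Return actual_bits_cumulative after each strip."""
    total = 0
    result = []
    for nbytes in strip_output_bytes:
        total += nbytes * 8
        result.append(total)
    return result


print(strip_layout(32))             # [(0, 0, 16), (1, 16, 16)]
print(cumulative_bits([100, 120]))  # [800, 1760]
```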
|
||||
|
||||
## Compatibility Notes
|
||||
|
||||
- `tools/jls_compat/reference_decode_compare.py --split-frames` is the reference
|
||||
tool for concatenated strip-frame streams.
|
||||
- Smoke runs may skip libjpeg when no `jpeg` executable is available.
|
||||
- Full regression must use `--require-libjpeg`.
|
||||
- If CharLS and libjpeg disagree, mark `compat_status=FAIL` and preserve the
|
||||
`.rtljls`/`.jls`, decoded images, command log, and tool versions.
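Splitting a concatenated strip-frame stream can be sketched as a scan for `SOI`/`EOI` marker pairs. This is not the implementation behind `--split-frames`; it is a simplified illustration that assumes the byte pair `FF D9` never occurs inside a frame before its real `EOI`. JPEG-LS bit stuffing guarantees this for scan data, but a robust tool would also parse header segment lengths. `split_strip_frames` is a hypothetical name.

```python
SOI = b"\xff\xd8"
EOI = b"\xff\xd9"


def split_strip_frames(stream):
    """Split concatenated SOI...EOI frames into a list of byte strings."""
    frames = []
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + 2] != SOI:
            raise ValueError(f"expected SOI at offset {pos}")
        end = stream.find(EOI, pos + 2)
        if end < 0:
            raise ValueError(f"missing EOI for frame at offset {pos}")
        frames.append(stream[pos:end + 2])
        pos = end + 2
    return frames


# Two fake frames with placeholder payload bytes.
stream = SOI + b"\x01\x02" + EOI + SOI + b"\x03" + EOI
frames = split_strip_frames(stream)
print(len(frames))                 # 2
print([len(f) for f in frames])    # [6, 5]
```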
|
||||
357
docs/jls_work_plan.md
Normal file
@@ -0,0 +1,357 @@
|
||||
# JPEG-LS FPGA Long-Term Work Plan
|
||||
|
||||
This plan is the execution companion to `fpga/srs/jpeg_ls.md`. The SRS is the
|
||||
source of requirements. This file records the planned execution order, current
|
||||
status, and the points where user confirmation is required.
|
||||
|
||||
## Working Rule
|
||||
|
||||
- Raise user-confirmation questions as early as possible.
|
||||
- Do not stop when a reasonable engineering assumption is safe and reversible.
|
||||
- Stop for user confirmation only when the decision changes an external
|
||||
interface, JPEG-LS stream structure, hard performance target, verification
|
||||
pass/fail rule, licensing/dependency boundary, or resource parameter that the
|
||||
SRS says requires review.
|
||||
- Every new or changed user requirement must be added to `fpga/srs/jpeg_ls.md`.
|
||||
- `fpga/srs/jpeg_ls_design.drawio` is maintained only when the user explicitly
|
||||
asks for draw.io updates.
|
||||
|
||||
## Current Confirmed Direction
|
||||
|
||||
- First RTL path: one original image is split into horizontal strip frames.
|
||||
- Each strip frame is a complete standalone grayscale JPEG-LS frame:
|
||||
`SOI ... EOI`.
|
||||
- `ofifo_wdata[8]` is asserted only on the first byte of the first strip frame
|
||||
of the original input image.
|
||||
- Primary smoke decoder: CharLS via `tools/jls_compat/reference_decode_compare.py`.
|
||||
- Additional reference decoder: jpeg.org/libjpeg command-line tool when
|
||||
available.
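The `ofifo_wdata[8]` rule above lets a testbench or script recover image boundaries from the 9-bit output words. A minimal sketch, assuming bit 8 is the start-of-image flag and bits 7:0 carry the data byte; `split_images` is a hypothetical helper, not part of the existing tools.

```python
def split_images(ofifo_words):
    """Group 9-bit output words into per-image byte strings.

    Assumption: bit 8 marks the first byte of an image, bits 7:0 carry data.
    """
    images = []
    for word in ofifo_words:
        start_flag = (word >> 8) & 1
        byte = word & 0xFF
        if start_flag:
            images.append(bytearray())
        elif not images:
            raise ValueError("data byte before the first start flag")
        images[-1].append(byte)
    return [bytes(image) for image in images]


# First image: two empty placeholder strip frames; second image: one.
words = [0x1FF, 0x0D8, 0x0FF, 0x0D9, 0x0FF, 0x0D8, 0x0FF, 0x0D9,
         0x1FF, 0x0D8, 0x0FF, 0x0D9]
images = split_images(words)
print([len(image) for image in images])  # [8, 4]
```

The second strip frame of the first image starts with a plain `0x0FF` word, so it stays attached to the first image.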
|
||||
|
||||
## Phase 1: Requirements And Compatibility
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Deliverables:
|
||||
- `fpga/srs/jpeg_ls.md` requirement baseline.
|
||||
- `tools/jls_compat/duplicate_sos_probe.py`.
|
||||
- `tools/jls_compat/reference_decode_compare.py`.
|
||||
- `third_party/charls`.
|
||||
- `third_party/libjpeg`.
|
||||
|
||||
Remaining work:
|
||||
- Add a concatenated strip-frame smoke test once a simple local encoder or RTL
|
||||
bitstream generator exists.
|
||||
- Build or provide the jpeg.org/libjpeg `jpeg` executable for full reference
|
||||
comparison.
|
||||
|
||||
## Phase 2: Architecture And Interfaces
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Deliverables:
|
||||
- Module interface specification.
|
||||
- Mermaid algorithm/pipeline flow: `docs/jls_pipeline_mermaid.md`.
|
||||
- Top-level port list frozen against the SRS.
|
||||
- Internal valid/stall contract for the high-throughput pipeline.
|
||||
- Header/output-buffer contract for strip-frame sequencing.
|
||||
- Context table and line-buffer access contract.
|
||||
|
||||
Planned module order:
|
||||
- `jpeg_ls_encoder_top`
|
||||
- `jls_input_ctrl`
|
||||
- `jls_preset_defaults`
|
||||
- `jls_scan_ctrl`
|
||||
- `jls_header_writer`
|
||||
- `jls_near_ctrl`
|
||||
- `jls_predictor`
|
||||
- `jls_context_model`
|
||||
- `jls_neighbor_provider`
|
||||
- `jls_golomb_encoder`
|
||||
- `jls_bit_packer`
|
||||
- `jls_byte_arbiter`
|
||||
- `jls_output_buffer`
|
||||
- `jls_run_mode`
|
||||
|
||||
Stop-for-confirmation triggers:
|
||||
- Changing any external port.
|
||||
- Changing `OUT_BUF_BYTES` or `OUT_BUF_AFULL_MARGIN`.
|
||||
- Changing strip-frame output semantics.
|
||||
- Dropping a supported `PIX_WIDTH`.
|
||||
- Reducing the `ratio=1/2/3` 200 MPixel/s performance requirement.
|
||||
|
||||
## Phase 3: Verification Scaffold
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Deliverables:
|
||||
- Raw image generator for 8/10/12/14/16-bit grayscale tests.
|
||||
- Reference decode and compare runner.
|
||||
- Strip-frame splitter/combiner for validation reports.
|
||||
- Report schema for per-strip and whole-image metrics.
|
||||
- QuestaSim smoke test scripts and smoke testbenches.
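A raw image generator for the supported bit depths could look like the sketch below. It is an illustration only: the ramp definition (linear pixel index modulo `2**pix_width`) is an assumption, and `ramp_pixels` and `to_pgm` are hypothetical names. Binary PGM stores samples above 8 bits as two bytes, most significant byte first.

```python
def ramp_pixels(width, height, pix_width):
    """Row-major ramp: each pixel is its linear index modulo 2**pix_width."""
    modulus = 1 << pix_width
    return [(y * width + x) % modulus for y in range(height) for x in range(width)]


def to_pgm(pixels, width, height, pix_width):
    """Serialize pixels as binary PGM (P5)."""
    maxval = (1 << pix_width) - 1
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    sample_bytes = 1 if maxval < 256 else 2
    body = b"".join(p.to_bytes(sample_bytes, "big") for p in pixels)
    return header + body


pgm = to_pgm(ramp_pixels(16, 16, 12), 16, 16, 12)
print(pgm[:14])   # b'P5\n16 16\n4095\n'
print(len(pgm))   # 526
```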
|
||||
|
||||
Planned smoke set:
|
||||
- Tiny 8-bit lossless image.
|
||||
- Tiny 16-bit lossless image.
|
||||
- 16x16 near-lossless image.
|
||||
- Two-strip image to validate `ofifo_wdata[8]` and strip ordering.
|
||||
|
||||
Current progress:
|
||||
- `tools/jls_compat/make_strip_stream_smoke.py` generates concatenated
|
||||
standalone strip-frame streams for tool smoke tests.
|
||||
- `tools/jls_compat/reference_decode_compare.py` splits concatenated streams,
|
||||
decodes each strip with CharLS, optionally decodes with jpeg.org/libjpeg, and
|
||||
compares the vertically recombined image to a reference PGM.
|
||||
|
||||
## Phase 4: RTL Implementation
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Implementation order:
|
||||
- Package/constants and shared type definitions.
|
||||
- `jls_input_ctrl`.
|
||||
- `jls_header_writer`.
|
||||
- `jls_output_buffer`.
|
||||
- Minimal top-level path for header-only and strip-frame sequencing tests.
|
||||
- Predictor/context/run-mode/Golomb/bit-packer pipeline.
|
||||
- Dynamic `NEAR` update.
|
||||
|
||||
Coding rules:
|
||||
- Use SystemVerilog.
|
||||
- Use `always_ff` with nonblocking assignments only.
|
||||
- Use `always_comb` with blocking assignments.
|
||||
- Do not define variables inside procedural blocks.
|
||||
- Do not use `task`.
|
||||
- Avoid complex functions.
|
||||
- Split complex decisions across pipeline cycles.
|
||||
- Use meaningful English comments and standard traceability comments.
|
||||
- Keep RTL design files free of `ifdef/ifndef SYNTHESIS`,
|
||||
`translate_off/on`, and design-embedded pass/fail checks; verification-only
|
||||
checks belong in testbenches, monitors, scoreboards, or scripts.
|
||||
|
||||
Current progress:
|
||||
- `fpga/verilog/jls_common_pkg.sv` defines shared constants and simple enums.
|
||||
- `fpga/verilog/jpeg_ls_encoder_top.sv` now instantiates the input, scan,
|
||||
header, regular/run-mode entropy, bit-packer, byte-arbiter, output-buffer,
|
||||
and dynamic-NEAR modules as a functional top-level smoke integration.
|
||||
- `fpga/verilog/jls_preset_defaults.sv` computes JPEG-LS default LSE preset
|
||||
coding parameters for the supported grayscale bit depths and NEAR range.
|
||||
- `fpga/verilog/jls_coding_params.sv` looks up strip-level `RANGE`, `qbpp`, and
|
||||
`LIMIT` for supported `PIX_WIDTH` and `NEAR=0..31`, avoiding runtime division.
|
||||
- `fpga/verilog/jls_input_ctrl.sv` implements FIFO read alignment, SOF gating,
|
||||
runtime dimension fallback, coordinate generation, and strip/image boundary
|
||||
flags.
|
||||
- `fpga/verilog/jls_scan_ctrl.sv` converts input pixel boundary flags into
|
||||
strip start/finish commands and forwards pixels to the encode pipeline.
|
||||
- `fpga/verilog/jls_neighbor_provider.sv` provides reconstructed
|
||||
`Ra/Rb/Rc/Rd` samples using two line banks. For `NEAR=0`, it commits the
|
||||
original sample immediately because lossless `Rx == X`, removing the
|
||||
reconstruction feedback bubble. For `NEAR>0`, it keeps one pixel
|
||||
outstanding until the true reconstructed sample returns, but accepts the next
|
||||
same-row pixel on the same clock as a non-EOL writeback by bypassing that
|
||||
returned `Rx` as the next pixel's `Ra`.
|
||||
- `fpga/verilog/jls_mode_router.sv` performs the first regular/run decision,
|
||||
forwards regular pixels, accumulates run pixels, reconstructs run pixels as
|
||||
`Ra`, and emits complete run segments for `jls_run_mode`.
|
||||
- `fpga/verilog/jls_predictor.sv` implements a registered MED prediction stage
|
||||
from reconstructed neighbor inputs `Ra/Rb/Rc/Rd`; the separate
|
||||
`jls_neighbor_provider` supplies those neighbors from reconstructed history.
|
||||
- `fpga/verilog/jls_context_quantizer.sv` computes `D1/D2/D3`, quantizes them
|
||||
to `Q1/Q2/Q3`, and emits the absolute context index, sign, and run-mode flag.
|
||||
- `fpga/verilog/jls_prediction_corrector.sv` applies context variable `C[Q]`
|
||||
with context sign to `Px` and clamps the corrected prediction to `0..MAXVAL`.
|
||||
- `fpga/verilog/jls_context_memory.sv` stores the 365 regular-mode contexts,
|
||||
uses lazy strip initialization by clearing a written-bit vector and returning
|
||||
default `A/B/C/N` for untouched contexts, and provides registered read/write
|
||||
ports without a 365-cycle boundary sweep.
|
||||
- `fpga/verilog/jls_context_model.sv` wraps `jls_context_memory` and forwards
|
||||
quantized context events with the read `A/B/C/N` variables. It now tracks
|
||||
in-flight regular contexts, stalls same-context reads until writeback, and
|
||||
bypasses a same-cycle write/read pair so the next event cannot read stale
|
||||
Annex A.6 state.
|
||||
- `fpga/verilog/jls_context_update.sv` computes one regular context's
|
||||
pre-update `k`, next `A/B/C/N`, mapped-error inversion flag, context-index
|
||||
writeback metadata, and strip-last metadata.
|
||||
- `fpga/verilog/jls_regular_error_quantizer.sv` computes regular-mode
|
||||
`Errval`, reconstructed sample `Rx`, modulo error normalization, and forwards
|
||||
pre-update context variables. `NEAR>0` uses an exact reciprocal-LUT multiply
|
||||
and quotient-correction pipeline instead of a one-bit-per-cycle divider.
|
||||
- `fpga/verilog/jpeg_ls_encoder_top.sv` returns regular-mode `Rx` to
|
||||
`jls_neighbor_provider` as soon as the regular error-quantizer result is
|
||||
accepted. Context update, mapped-error generation, Golomb coding, and bit
|
||||
packing remain ordered by their own handshakes, but line history no longer
|
||||
waits for `mapped_done` on the regular path.
|
||||
- `fpga/verilog/jls_header_writer.sv` emits standalone strip-frame
|
||||
`SOI/SOF55/LSE/SOS` headers and trailing `EOI` markers.
|
||||
- `fpga/verilog/jls_near_ctrl.sv` applies the first-version cumulative
|
||||
actual-vs-target dynamic NEAR step and reports the MAX_NEAR miss condition.
|
||||
- `fpga/verilog/jls_error_mapper.sv` maps signed `Errval` into non-negative
|
||||
`MErrval` with the standard context-correction inversion and forwards
|
||||
`k/LIMIT/qbpp` to the Golomb encoder.
|
||||
- `fpga/verilog/jls_run_mode.sv` implements a run-segment entropy helper:
|
||||
direct run-length code events, standard `RUNindex/J` updates, RItype 0/1
|
||||
run-interruption context variables, `MErrval/k/limit` generation, and
|
||||
reconstructed interruption sample output. The run-interruption `NEAR>0`
|
||||
quantizer uses the same reciprocal-LUT multiply and quotient-correction
|
||||
pipeline as the regular path.
|
||||
- `fpga/verilog/jls_golomb_encoder.sv` generates left-aligned Golomb code
|
||||
events from standard `MErrval`, `k`, `LIMIT`, and `qbpp` inputs, including
|
||||
the LIMIT fallback path.
|
||||
- `fpga/verilog/jls_bit_packer.sv` packs left-aligned variable-length code
|
||||
events into JPEG-LS scan payload bytes, including 0-bit stuffing after
|
||||
`0xFF` data bytes and zero-padded flush before EOI.
|
||||
- `fpga/verilog/jls_byte_arbiter.sv` arbitrates header/EOI bytes ahead of
|
||||
payload bytes before the internal output buffer.
|
||||
- `fpga/verilog/jls_output_buffer.sv` buffers encoded byte events and drains
|
||||
them to the fixed 9-bit output FIFO interface while ignoring external full
|
||||
flags in RTL behavior.
|
||||
- `fpga/verilog/jpeg_ls_rtl.f` lists the current RTL compilation order.
|
||||
- RTL design files have had all previous `ifndef SYNTHESIS` diagnostic blocks
|
||||
removed so normal simulation and `+define+SYNTHESIS` compilation use the same
|
||||
design logic.
|
||||
- `fpga/sim/tb_jls_preset_defaults.sv`, `fpga/sim/tb_jls_coding_params.sv`,
|
||||
`fpga/sim/tb_jls_input_ctrl.sv`,
|
||||
`fpga/sim/tb_jls_scan_ctrl.sv`, `fpga/sim/tb_jls_neighbor_provider.sv`,
|
||||
`fpga/sim/tb_jls_neighbor_provider_near_bypass.sv`,
|
||||
`fpga/sim/tb_jls_mode_router.sv`, `fpga/sim/tb_jls_header_writer.sv`,
|
||||
`fpga/sim/tb_jls_predictor.sv`, `fpga/sim/tb_jls_context_quantizer.sv`,
|
||||
`fpga/sim/tb_jls_prediction_corrector.sv`, `fpga/sim/tb_jls_near_ctrl.sv`,
|
||||
`fpga/sim/tb_jls_context_memory.sv`, `fpga/sim/tb_jls_context_update.sv`,
|
||||
`fpga/sim/tb_jls_error_mapper.sv`, `fpga/sim/tb_jls_run_mode.sv`,
|
||||
`fpga/sim/tb_jls_golomb_encoder.sv`, `fpga/sim/tb_jls_bit_packer.sv`,
|
||||
`fpga/sim/tb_jls_byte_arbiter.sv`, `fpga/sim/tb_jls_output_buffer.sv`,
|
||||
`fpga/sim/tb_jpeg_ls_encoder_top_idle.sv`, and
|
||||
`fpga/sim/tb_jpeg_ls_encoder_top_run_smoke.sv` cover the current standalone,
|
||||
idle-integration, tiny run-mode top-level, and small back-to-back multi-image
|
||||
smoke checks.
|
||||
- `fpga/sim/run_jls_smoke.ps1` runs the current RTL smoke compile/sim sequence.
|
||||
- `fpga/sim/run_jls_top_compat_smoke.ps1` compiles the current top-level RTL,
|
||||
runs 16x16 all-zero and row-major ramp image smokes for
|
||||
`PIX_WIDTH=8/10/12/14/16`, also runs lossless and `ratio=2` dynamic-NEAR
|
||||
8-bit 16x32 two-strip ramp smokes plus a two-image zero stream smoke, and
|
||||
checks all results with CharLS against generated reference PGMs.
|
||||
- `fpga/sim/run_jls_throughput_regression.ps1` is the staged executable entry point
|
||||
for the SRS continuous 10-image throughput check. It drives
|
||||
`6144 x 256`, `IMAGE_COUNT=10`, `ratio=1/2/3`, `+CHECK_THROUGHPUT=1`, and
|
||||
appends CSV stats to `tools/jls_compat/out/rtl_throughput_stats.csv`.
|
||||
- The RTL smoke scripts now treat simulator `$fatal`/non-zero error summaries
|
||||
as failures even when the simulator process exits with code 0.
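The `RANGE`, `qbpp`, and `LIMIT` values held by the `jls_coding_params` lookup can be cross-checked offline. The sketch below assumes the module follows the standard JPEG-LS (ITU-T T.87) definitions with `MAXVAL = 2**PIX_WIDTH - 1`; the function names are hypothetical, and the integer division here is exactly what the RTL table avoids at run time.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for integer n >= 1."""
    return (n - 1).bit_length()


def coding_params(pix_width, near):
    """Return (RANGE, qbpp, LIMIT) for one PIX_WIDTH/NEAR pair."""
    maxval = (1 << pix_width) - 1
    range_ = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = ceil_log2(range_)
    bpp = max(2, ceil_log2(maxval + 1))
    limit = 2 * (bpp + max(8, bpp))
    return range_, qbpp, limit


print(coding_params(8, 0))    # (256, 8, 32)
print(coding_params(8, 31))   # (6, 3, 32)
print(coding_params(12, 3))   # (586, 10, 48)
print(coding_params(16, 0))   # (65536, 16, 64)
```

Looping this over `PIX_WIDTH` in 8/10/12/14/16 and `NEAR=0..31` would give expected values for a table-driven testbench.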
|
||||
|
||||
Known implementation gaps:
|
||||
- Top-level run-mode now has a conservative functional path through
|
||||
`jls_mode_router` and `jls_run_mode`. It is covered by all-zero and ramp
|
||||
tiny-image smokes across all supported bit depths plus a two-strip 8-bit ramp
|
||||
smoke, a small dynamic-NEAR near-lossless smoke, and CharLS reference decode
|
||||
for those cases, but larger images and high-throughput ordering optimization
|
||||
remain open.
|
||||
- The run scanner can now overlap non-EOL matching run pixels with an
|
||||
outstanding run segment, while still blocking regular/interruption/EOL entropy
|
||||
emission until the previous segment completes. The 8-bit all-zero top smoke
|
||||
improved from 2670 ns to 2266 ns after this change.
|
||||
- The run scanner also remains in the Annex A.7 run loop while
|
||||
`run_length_accum` is non-zero. A checkerboard top-level smoke exposed that
|
||||
the previous implementation could reclassify the next nonmatching pixel by
|
||||
gradients and stall forever behind a pending run segment; this is now covered
|
||||
in `tb_jls_mode_router`.
|
||||
- The top-level Golomb busy tracker now handles the same-cycle case where a
|
||||
previous mapped event completes while a new run mapped event is accepted, so
|
||||
the new run segment cannot lose its busy ownership bit.
|
||||
- The lossless `NEAR=0` neighbor feedback bubble has been reduced by immediate
|
||||
`Rx == X` commit, and the regular/run `NEAR>0` arithmetic divider bottleneck
|
||||
has been reduced to a short pipelined reciprocal-LUT multiply/correction
|
||||
path. Regular-mode `NEAR>0` reconstructed-sample feedback now returns after
|
||||
error-quantizer acceptance instead of waiting for downstream Golomb coding.
|
||||
Same-row `NEAR>0` line-history feedback can also accept the next pixel on
|
||||
the same clock as a non-EOL reconstructed writeback by bypassing `Rx` to
|
||||
`Ra`. The 8-bit 16x32 two-strip `ratio=2` dynamic-NEAR smoke improved from
|
||||
22718 ns before these feedback changes to 15582 ns.
|
||||
The fixed 365-cycle context table clear at strip start has also been removed
|
||||
by lazy written-bit initialization. Remaining 200 MPixel/s risks are the
|
||||
`NEAR>0` one-pixel line-history feedback dependency, run-segment entropy
|
||||
ordering stalls, and missing full near-lossless/larger-image throughput
|
||||
regression.
|
||||
- Full 10-image default-size throughput and full CharLS/libjpeg decode
|
||||
regressions remain to be run; the throughput script is now present but has
|
||||
not been executed in this iteration because it is intentionally long.
|
||||
- A small 8-bit 16x16x10 mixed-pattern staged throughput bring-up passed for
|
||||
`ratio=1/2/3` with the hard throughput assertion disabled. The report showed
|
||||
2560 input pixels, 10297 input cycles, and `throughput_mpix_x1000=62154` for
|
||||
each ratio; because the test is single-strip per image and intentionally
|
||||
tiny, it is only a script/control-path check, not a performance conclusion.
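The reported bring-up figure is consistent with the following calculation. It is a sketch that assumes the 250 MHz clock from the interface rules and truncating integer scaling by 1000; the function names are hypothetical.

```python
def throughput_mpix_x1000(input_pixels, input_cycles, clk_mhz=250):
    """Average input throughput in MPixel/s, scaled by 1000 and truncated."""
    return input_pixels * clk_mhz * 1000 // input_cycles


def max_cycles_for_target(input_pixels, target_mpix_s=200, clk_mhz=250):
    """Largest cycle count that still meets the throughput target."""
    return input_pixels * clk_mhz // target_mpix_s


print(throughput_mpix_x1000(2560, 10297))  # 62154
print(max_cycles_for_target(2560))         # 3200
```

At 250 MHz the 200 MPixel/s target allows at most 1.25 cycles per pixel on average, so the tiny bring-up run at roughly 4 cycles per pixel is far from the target, as expected for a single-strip control-path check.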
|
||||
|
||||
## Phase 5: Integration And Regression
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Deliverables:
|
||||
- End-to-end RTL simulation producing `.rtljls` output.
|
||||
- CharLS decode and pixel compare.
|
||||
- libjpeg decode and pixel compare when the executable is available.
|
||||
- Per-strip `NEAR`, bit count, max-error, compression-ratio report.
|
||||
- Throughput and stall report.
|
||||
|
||||
Pass/fail focus:
|
||||
- Standard decodability.
|
||||
- Lossless exactness for `ratio=0`.
|
||||
- Near-lossless error bound for `ratio=1/2/3`.
|
||||
- 200 MPixel/s average input throughput for the required multi-frame test.
|
||||
|
||||
## Phase 6: Synthesis And Timing
|
||||
|
||||
Status: in progress.
|
||||
|
||||
Deliverables:
|
||||
- Vivado project/scripts under `fpga/synthesis`.
|
||||
- Synthesis report.
|
||||
- Fmax report for the 250 MHz target.
|
||||
- Resource report and timing-risk notes.
|
||||
|
||||
Current target:
|
||||
- FPGA part: `xc7vx690tffg1761-2`.
|
||||
- Quick synthesis scripts are kept under `fpga/synthesis`, but synthesis is not
|
||||
run as an automatic step until all RTL modules are fully implemented, unless
|
||||
the user explicitly requests it.
|
||||
- `fpga/synthesis/quick_synth.tcl` reads the complete RTL compilation list from
|
||||
`fpga/verilog/jpeg_ls_rtl.f`, so future top-level quick-synthesis reports are
|
||||
not based on an early partial RTL subset.
|
||||
- DSP timing work split the NEAR-dependent multiplier paths with registered
|
||||
operands and product stages. `jls_run_mode` registers the run-interruption
|
||||
reconstruction multiplier operands, and `jls_context_update` now stages the
|
||||
Annex A.6 `B[Q] += Errval*(2*NEAR+1)` multiplier as input operands, product,
|
||||
and accumulation stages.
|
||||
- `jls_scan_ctrl` now has a one-entry registered slot between the input
|
||||
controller and downstream strip/encode pipeline. It breaks the direct
|
||||
`pixel_valid` to strip-start/context CE control path while still allowing the
|
||||
slot to drain and refill in the same cycle for steady one-pixel-per-cycle
|
||||
operation.
|
||||
- Timing-violation triage now exports every negative-slack path with
|
||||
`fpga/synthesis/report_timing_violations.tcl`. The flow first classifies
|
||||
paths by DSP48 usage and logic level, then optimizes non-DSP paths with logic
|
||||
depth greater than one before rechecking whether DSP paths still dominate.
|
||||
- Kept non-DSP timing fixes: `jls_context_model` decouples `result_next_*`
|
||||
write enables from downstream ready; `jls_scan_ctrl` registers
|
||||
`enc_row_last_pixel` so `jls_neighbor_provider` does not recompute row-last
|
||||
from `strip_width` on the `Rd` RAM-read path; `jls_regular_error_quantizer`
|
||||
accepts the next input in `STATE_IDLE` and waits for output space only at
|
||||
`STATE_FINISH`.
|
||||
- Latest quick synthesis result: target part `xc7vx690tffg1761-2`,
|
||||
4.000 ns constraint, WNS -0.615 ns, TNS -175.754 ns,
|
||||
537 failing endpoints. Resource use is 22895 LUT, 6308 registers,
|
||||
3.5 BRAM tiles, and 14 DSPs. The rough OOC synthesis frequency estimate is
|
||||
about 216.7 MHz and must not be treated as final implementation Fmax.
|
||||
- The current worst path is still DSP-related: `context_update_i/s1_near_scale_reg[6]`
|
||||
to `context_update_i/s2_B_delta_reg/PCIN[*]`, with 3.557 ns data path delay.
|
||||
The latest violation export has 260 DSP paths and 277 non-DSP logic-level>1
|
||||
paths; the worst non-DSP path is now the `context_model_i/context_busy_reg[*]`
|
||||
to `predictor_i/*/CE` ready/hazard chain at -0.274 ns.
|
||||
Manual high/low half-word partial-product splitting of this 33x8 multiply
|
||||
worsened WNS to -1.468 ns and was reverted.
|
||||
- Naive extra buffering on the top-level context/prediction boundary and broad
|
||||
`max_fanout` attributes worsened WNS in quick synthesis and were reverted.
|
||||
Future control-path optimizations should be validated by quick synthesis
|
||||
before being kept.
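The rough frequency estimate quoted above follows from the constraint period and the worst negative slack. A minimal sketch of that arithmetic; `estimated_fmax_mhz` is a hypothetical helper, and the result is only as meaningful as the out-of-context quick synthesis behind it.

```python
def estimated_fmax_mhz(period_ns, wns_ns):
    """Estimate Fmax from the constraint period and worst negative slack."""
    return 1000.0 / (period_ns - wns_ns)


print(round(estimated_fmax_mhz(4.000, -0.615), 1))  # 216.7

# The violation export categories should add up to the failing endpoint count.
print(260 + 277)  # 537
```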
|
||||
|
||||
Stop-for-confirmation triggers:
|
||||
- Requirement to increase output buffer default sizes.
|
||||
- Requirement to reduce throughput target.
|
||||
- Any architectural change that affects the external integration contract.
|
||||