Initial JPEG-LS FPGA encoder baseline with tooling and timeout fix

2026-04-16 18:55:08 +08:00
commit e4fdbdfeec
150 changed files with 25796 additions and 0 deletions


@@ -0,0 +1,435 @@
# JPEG-LS RTL Module Interface Draft
This document freezes the first-pass RTL interface plan before implementation.
The requirement source is `fpga/srs/jpeg_ls.md`; this file is an execution
artifact and must be updated if the SRS changes an interface.
## Global Rules
- Single clock domain: `clk`, 250 MHz target.
- Synchronous active-high reset: `rst`.
- All RTL ports use SystemVerilog `logic`.
- Simple direct `assign` is allowed; multi-level combinational logic in `assign`
is not allowed.
- Internal pipeline interfaces use `valid`; explicit stall/backpressure is added
only where the receiving stage can block.
- Stage outputs should be registered unless a local timing review proves the path
is trivial.
- Pixel coordinates are zero-based: `x = 0..active_pic_col-1`,
`y = 0..active_pic_row-1`.
- Strip coordinates are zero-based: `strip_index = y / SCAN_ROWS`.
## Top-Level Module
Module: `jpeg_ls_encoder_top`
Parameters:
| Name | Default | Description |
| --- | ---: | --- |
| `PIX_WIDTH` | 16 | Compile-time grayscale sample precision: 8, 10, 12, 14, or 16 bits. |
| `DEFAULT_PIC_COL` | 6144 | Default image width used when runtime dimensions are invalid. |
| `DEFAULT_PIC_ROW` | 256 | Default image height used when runtime dimensions are invalid. |
| `MAX_PIC_COL` | 6144 | Maximum supported runtime image width. |
| `MAX_PIC_ROW` | 4096 | Maximum supported runtime image height. |
| `SCAN_ROWS` | 16 | Number of source rows in one standalone JPEG-LS strip frame. |
| `MAX_NEAR` | 31 | Maximum dynamic NEAR value. |
| `OUT_BUF_BYTES` | 8192 | Internal byte output buffer size. |
| `OUT_BUF_AFULL_MARGIN` | 256 | Input pause margin for the internal output buffer. |
Ports:
| Name | Direction | Width | Description |
| --- | --- | ---: | --- |
| `clk` | input | 1 | Main clock. |
| `rst` | input | 1 | Synchronous active-high reset. |
| `cfg_pic_col` | input | 13 | Runtime image width sampled at input SOF. |
| `cfg_pic_row` | input | 13 | Runtime image height sampled at input SOF. |
| `ratio` | input | 4 | Runtime compression target sampled at input SOF. |
| `ififo_rclk` | output | 1 | Input FIFO read clock, tied to `clk`. |
| `ififo_rd` | output | 1 | Input FIFO read request. Data is valid one cycle later. |
| `ififo_rdata` | input | `ceil(PIX_WIDTH/8)*9` | Packed SOF flag and pixel value. |
| `ififo_empty` | input | 1 | Input FIFO empty flag. |
| `ififo_alempty` | input | 1 | Input FIFO almost-empty flag for read optimization. |
| `ofifo_wclk` | output | 1 | Output FIFO write clock, tied to `clk`. |
| `ofifo_wr` | output | 1 | Output FIFO write enable. |
| `ofifo_wdata` | output | 9 | Output byte stream. `[8]` marks original-image start; `[7:0]` is byte. |
| `ofifo_full` | input | 1 | Reserved and ignored by RTL; simulation reports an error on a write while `ofifo_full=1`. |
| `ofifo_alfull` | input | 1 | Reserved and ignored by RTL. |
## Internal Data Types
These names are descriptive contracts; actual RTL may use packed `logic`
signals rather than `typedef` if that better matches the coding style.
### Pixel Event
Carries one accepted input sample after FIFO timing alignment. The first RTL
implementation emits coordinates and boundary flags directly from
`jls_input_ctrl` so later stages do not need to repeat input-frame bookkeeping.
| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Pixel event is valid. |
| `sof` | 1 | Original input image start marker from FIFO sideband. |
| `sample` | `PIX_WIDTH` | Original input sample value. |
| `x` | 13 | Column coordinate inside original image. |
| `y` | 13 | Row coordinate inside original image. |
| `strip_first_pixel` | 1 | First pixel of the current strip frame. |
| `strip_last_pixel` | 1 | Last pixel of the current strip frame. |
| `image_first_pixel` | 1 | First pixel of the original input image. |
| `image_last_pixel` | 1 | Last pixel of the original input image. |
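
The coordinate and boundary-flag rules above can be sketched as a small software reference model. This is a Python sketch, not RTL; the `PixelEvent` tuple and `pixel_events` generator are hypothetical names, and the handling of a final partial strip (ending at the last image row) is an assumption that the SRS would need to confirm.

```python
from typing import Iterator, NamedTuple


class PixelEvent(NamedTuple):
    x: int
    y: int
    strip_index: int
    strip_first_pixel: bool
    strip_last_pixel: bool
    image_first_pixel: bool
    image_last_pixel: bool


def pixel_events(pic_col: int, pic_row: int, scan_rows: int = 16) -> Iterator[PixelEvent]:
    """Yield one event per pixel in raster order with zero-based coordinates."""
    for y in range(pic_row):
        strip_index = y // scan_rows            # strip_index = y / SCAN_ROWS
        strip_top = strip_index * scan_rows
        # Assumption: a final partial strip simply ends at the last image row.
        strip_bottom = min(strip_top + scan_rows, pic_row) - 1
        for x in range(pic_col):
            yield PixelEvent(
                x=x,
                y=y,
                strip_index=strip_index,
                strip_first_pixel=(x == 0 and y == strip_top),
                strip_last_pixel=(x == pic_col - 1 and y == strip_bottom),
                image_first_pixel=(x == 0 and y == 0),
                image_last_pixel=(x == pic_col - 1 and y == pic_row - 1),
            )
```

A testbench could compare the flags produced by `jls_input_ctrl` against this generator pixel by pixel.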
### Strip Control Event
Controls header generation and per-strip context reset.
| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Control event is valid. |
| `start` | 1 | Start a standalone JPEG-LS strip frame. |
| `finish` | 1 | Finish the current standalone JPEG-LS strip frame. |
| `original_image_first_strip` | 1 | Set `ofifo_wdata[8]` on this strip frame's first SOI byte. |
| `original_image_last_strip` | 1 | Last strip of the current original image. Internal only. |
| `strip_width` | 13 | JPEG-LS frame width for this strip. |
| `strip_height` | 13 | JPEG-LS frame height, normally `SCAN_ROWS`. |
| `near` | 6 | NEAR value used in the strip `SOS`. |
| `pixel_width` | 5 | Sample precision copied from `PIX_WIDTH`. |
### Encoded Byte Event
Carries bytes into the internal output buffer.
| Field | Width | Description |
| --- | ---: | --- |
| `valid` | 1 | Byte event is valid. |
| `byte` | 8 | JPEG-LS byte in marker-stream order. |
| `original_image_start` | 1 | Copied to `ofifo_wdata[8]` for exactly one byte. |
## Module Contracts
### `jls_input_ctrl`
Responsibilities:
- Drive `ififo_rd` based on FIFO state and internal pause requests.
- Align synchronous FIFO read latency.
- Wait for `SOF=1` before accepting an image.
- Sample `cfg_pic_col`, `cfg_pic_row`, and `ratio` at input SOF.
- Replace invalid dimensions with defaults.
- Generate zero-based original-image coordinates.
- Mark strip-frame first/last pixels and original-image first/last pixels.
Outputs:
- Pixel event stream to `jls_scan_ctrl`.
- Latched image configuration to downstream control.
Stall sources:
- `ififo_empty`.
- Conservative mode when `ififo_alempty=1`.
- Internal output buffer near-full pause request.
- Multi-cycle entropy or bit-packer stall propagated from downstream.
### `jls_scan_ctrl`
Responsibilities:
- Convert input-controller boundary flags into strip control events.
- Emit one strip start event when `strip_first_pixel=1`.
- Emit one strip finish event after the strip's last pixel has entered the
entropy/bit-pack pipeline.
- Request context and line-buffer reset at each strip start.
- Preserve strict original-image pixel order.
- Register and forward `enc_row_last_pixel` with each encoded pixel. This is
the last-column flag for the current row, distinct from
`enc_strip_last_pixel`, and is used to keep width comparison out of the
neighbor-provider RAM-read path.
Key rule:
- A strip frame is a complete standalone JPEG-LS frame. It is not a second
scan inside the previous JPEG-LS frame.
### `jls_header_writer`
Responsibilities:
- Emit `SOI`, `SOF55`, `LSE`, `SOS`, and `EOI`.
- Encode marker fields in big-endian byte order.
- Write `SOF55` width as `strip_width` and height as `strip_height`.
- Write `SOS` NEAR from the current strip control event.
- Write `LSE` preset coding parameters from explicit preset inputs.
- Assert `original_image_start` only on the first byte of the first strip frame
of an original image.
- Accept strip start and strip finish commands only while idle.
Open implementation note:
- Keep the LSE output policy configurable in the module structure. The first
implementation emits LSE before each strip `SOS`.
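
The marker sequence for one strip frame can be sketched in Python as a byte-level reference for the testbench. The field layout follows the T.87 marker syntax for a single-component scan; the function name `strip_header`, the component ID of 1, and the zero mapping-table, interleave, and point-transform fields are assumptions of this sketch rather than confirmed project choices.

```python
import struct

SOI = b"\xff\xd8"
EOI = b"\xff\xd9"


def strip_header(width, height, pix_width, near, maxval, t1, t2, t3, reset):
    """Return SOI + SOF55 + LSE + SOS for one standalone strip frame."""
    # SOF55: Lf=11, P, Y (height), X (width), Nf=1, C1=1, H1/V1=0x11, Tq1=0.
    sof55 = b"\xff\xf7" + struct.pack(
        ">HBHHBBBB", 11, pix_width, height, width, 1, 1, 0x11, 0)
    # LSE: Ll=13, ID=1 (preset coding parameters), MAXVAL, T1, T2, T3, RESET.
    lse = b"\xff\xf8" + struct.pack(">HBHHHHH", 13, 1, maxval, t1, t2, t3, reset)
    # SOS: Ls=8, Ns=1, Cs1=1, Tm1=0, NEAR, ILV=0, Al/Ah=0.
    sos = b"\xff\xda" + struct.pack(">HBBBBBB", 8, 1, 1, 0, near, 0, 0)
    return SOI + sof55 + lse + sos
```

All multi-byte fields use big-endian packing (`>` in the format string), matching the byte-order rule above. `EOI` is appended separately after the payload flush.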
### `jls_preset_defaults`
Responsibilities:
- Convert `PIX_WIDTH` and the current strip `NEAR` into JPEG-LS default
preset coding parameters: `MAXVAL`, `T1`, `T2`, `T3`, and `RESET`.
- Defensively clamp an out-of-range `NEAR` input to 31.
- Use the simplified `MAXVAL >= 128` equations that cover all supported
8/10/12/14/16-bit precisions.
- Provide the same preset values to `jls_header_writer` and the later
predictor/context pipeline so header syntax and model thresholds stay aligned.
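
As a cross-check for the RTL, the default preset computation can be sketched in Python using the T.87 default-threshold equations for `MAXVAL >= 128`. The function name `default_presets` and the dictionary return type are hypothetical; the basic thresholds 3, 7, 21 and `RESET = 64` are the standard defaults.

```python
def default_presets(pix_width, near):
    """Default MAXVAL/T1/T2/T3/RESET for MAXVAL >= 128 (8..16-bit samples)."""
    near = min(near, 31)                       # defensive clamp
    maxval = (1 << pix_width) - 1
    factor = (min(maxval, 4095) + 128) // 256

    def clamp(value, low):
        # Fall back to the lower bound when the value leaves [low, MAXVAL].
        return low if (value > maxval or value < low) else value

    t1 = clamp(factor * (3 - 2) + 2 + 3 * near, near + 1)
    t2 = clamp(factor * (7 - 3) + 3 + 5 * near, t1)
    t3 = clamp(factor * (21 - 4) + 4 + 7 * near, t2)
    return {"MAXVAL": maxval, "T1": t1, "T2": t2, "T3": t3, "RESET": 64}
```

For 8-bit lossless coding this yields the familiar thresholds 3, 7, and 21.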
### `jls_coding_params`
Responsibilities:
- Convert compile-time `PIX_WIDTH` and current strip `NEAR` to JPEG-LS `RANGE`.
- Output `qbpp = ceil(log2(RANGE))`.
- Output regular-mode `LIMIT = 2 * (PIX_WIDTH + max(8, PIX_WIDTH))`.
- Use a lookup table for `NEAR=0..31` instead of synthesized runtime division.
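
The lookup table contents can be generated offline. The following Python sketch computes `RANGE`, `qbpp`, and `LIMIT` per the formulas above, assuming `MAXVAL = 2^PIX_WIDTH - 1`; the names `coding_params` and `NEAR_TABLE_16` are hypothetical.

```python
def ceil_log2(n):
    """ceil(log2(n)) for n >= 1, using integer arithmetic only."""
    return (n - 1).bit_length()


def coding_params(pix_width, near):
    """Return (RANGE, qbpp, LIMIT) for one PIX_WIDTH/NEAR combination."""
    maxval = (1 << pix_width) - 1
    rng = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = ceil_log2(rng)
    limit = 2 * (pix_width + max(8, pix_width))
    return rng, qbpp, limit


# One table row per NEAR value, e.g. for a 16-bit build.
NEAR_TABLE_16 = [coding_params(16, near) for near in range(32)]
```

Because `PIX_WIDTH` is a compile-time parameter, only the 32 rows for the selected width need to exist in hardware.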
### `jls_near_ctrl`
Responsibilities:
- Initialize NEAR to 0 at original image start.
- Keep NEAR at 0 for `ratio=0` and invalid ratio values.
- For `ratio=1/2/3`, update the next strip's NEAR after the current strip
output byte count is known.
- Clamp NEAR to `0..31`.
- Report a sticky target-miss condition when the cumulative actual bits still
exceed cumulative target bits while NEAR is already 31.
Counters:
- Target bits are computed from actual source pixel width, not from storage
width.
- Actual bits count every byte generated into the internal output buffer.
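
The cumulative accounting can be sketched as follows. This is only an illustration of the counter and clamp behavior: the class name, the `target_bits_per_pixel` argument (standing in for whatever target the SRS derives from `ratio`), and the step size of one NEAR level per strip are all assumptions, since the actual ratio-to-target mapping is defined in the SRS and not in this document.

```python
class NearCtrl:
    """Per-strip NEAR update from cumulative actual vs. target bits."""

    def __init__(self, target_bits_per_pixel, max_near=31):
        # Assumption: None models ratio=0 or an invalid ratio (lossless).
        self.target_bits_per_pixel = target_bits_per_pixel
        self.max_near = max_near
        self.image_start()

    def image_start(self):
        self.near = 0
        self.actual_bits = 0
        self.target_bits = 0
        self.target_miss = False               # sticky until next image start

    def strip_done(self, strip_pixels, strip_bytes):
        """Account for one finished strip; return NEAR for the next strip."""
        if self.target_bits_per_pixel is None:
            return 0
        self.actual_bits += 8 * strip_bytes
        self.target_bits += strip_pixels * self.target_bits_per_pixel
        over = self.actual_bits > self.target_bits
        if over and self.near == self.max_near:
            self.target_miss = True
        if over:
            self.near = min(self.near + 1, self.max_near)
        elif self.near > 0:
            self.near -= 1                     # assumed step-down policy
        return self.near
```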
### `jls_predictor`
Responsibilities:
- Accept local reconstructed neighbors `Ra`, `Rb`, `Rc`, and `Rd` from the
line-buffer neighbor provider.
- Compute MED prediction `Px`.
- Register and forward the original sample, coordinates, strip boundary flags,
and neighbors to the context/error stage.
- Keep top/left strip boundary handling in the line-buffer stage so this module
remains a short compare/add pipeline stage.
Line-buffer companion:
- Provide `Ra/Rb/Rc/Rd` from reconstructed samples.
- Handle top, left, and right edge pixels as JPEG-LS frame boundaries.
- Delay or bank current-line writes so previous-row `Rc/Rb/Rd` reads are not
corrupted by reconstructed samples from the current row.
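
For reference, the MED prediction is the standard three-way selection on `Ra`, `Rb`, and `Rc`. A minimal Python sketch (the function name is hypothetical):

```python
def med_predict(ra, rb, rc):
    """Median edge detector prediction Px from reconstructed neighbors."""
    if rc >= max(ra, rb):
        return min(ra, rb)
    if rc <= min(ra, rb):
        return max(ra, rb)
    return ra + rb - rc
```

`Rd` does not enter the prediction; it is only forwarded for the gradient computation.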
### `jls_neighbor_provider`
Responsibilities:
- Provide reconstructed-neighbor samples `Ra`, `Rb`, `Rc`, and `Rd` to
`jls_predictor`.
- Use two row banks so the current-row reconstructed write does not corrupt the
previous-row reads needed by `Rb/Rc/Rd`.
- Apply strip-frame top-row and left/right-edge boundary rules:
top-row previous samples are zero, `x=0` uses `Ra=Rb`, `Rc` from the
previous line's left-edge extension sample, and the last column uses `Rd=Rb`.
- Accept reconstructed-sample writeback `Rx` from the later error stage.
- Consume the registered `pixel_row_last` flag from `jls_scan_ctrl` for
right-edge and NEAR>0 row-transition handling; do not recompute row-last from
strip width on the `Rd` read path.
Current implementation note:
- In `NEAR=0` lossless strips, `Rx == X`; the provider commits the accepted
original sample to line history immediately and does not wait for the later
reconstructed-sample path. In `NEAR>0` strips, it keeps one outstanding
pixel and waits for true reconstructed writeback before accepting the next
sample. For regular-mode pixels, the top-level returns that writeback from
`jls_regular_error_quantizer` after `Errval/Rx` acceptance instead of waiting
for downstream Golomb completion.
- A non-EOL `NEAR>0` writeback can overlap the next same-row pixel accept:
the returned `Rx` is bypassed as the next pixel's `Ra`. Row-end transitions
still wait a clock so the bank selector and left-edge extension state update
before `x=0` of the next row.
- A later version must still address the remaining `NEAR>0` one-pixel feedback
dependency, run-segment ordering, or input buffering to reach the 200
MPixel/s goal without committing non-standard neighbor values.
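
The boundary rules can be sketched as a software model for the lossless case, where `Rx == X` and the rows themselves serve as line history. The function names are hypothetical, and the model ignores banking and timing; it only shows which values the provider should return.

```python
def neighbors(prev_row, cur_recon, x, prev_left_ext):
    """Return (Ra, Rb, Rc, Rd) for column x of the current row."""
    width = len(prev_row)
    rb = prev_row[x]
    rd = prev_row[x + 1] if x + 1 < width else rb   # last column: Rd = Rb
    if x == 0:
        ra = rb                                     # left edge: Ra = Rb
        rc = prev_left_ext                          # previous line's left-edge extension
    else:
        ra = cur_recon[x - 1]
        rc = prev_row[x - 1]
    return ra, rb, rc, rd


def strip_neighbors(rows):
    """Neighbors for every pixel of one strip frame, assuming NEAR=0."""
    width = len(rows[0])
    prev_row = [0] * width                          # top row: previous samples are zero
    prev_left_ext = 0
    result = []
    for row in rows:
        ra_at_x0 = prev_row[0]                      # Ra this row uses at x=0
        result.append([neighbors(prev_row, row, x, prev_left_ext)
                       for x in range(width)])
        prev_left_ext = ra_at_x0                    # becomes Rc at x=0 of the next row
        prev_row = row
    return result
```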
### `jls_mode_router`
Responsibilities:
- Consume neighbor events and decide whether the local gradients select regular
mode or run mode.
- Forward regular-mode events to `jls_predictor`.
- Accumulate run pixels while `|Ix - Ra| <= NEAR` and reconstruct those run
pixels as `Ra` for line-buffer writeback.
- Use gradients only to enter run mode. Once `run_length_accum` is non-zero,
remain in the Annex A.7 run loop and treat the first nonmatching sample as a
run interruption, even if that sample's gradients would not independently
select run mode.
- Emit a run segment to `jls_run_mode` when the run reaches EOL or an
interruption sample.
- After issuing a run segment, continue accepting later non-EOL matching run
pixels because they do not emit entropy immediately, but stall before any
regular event, run interruption, or EOL run segment until `jls_run_mode`
reports `segment_done`.
Current implementation note:
- This is still conservative around entropy ordering: non-EOL matching run
pixels can overlap an outstanding run segment, but any event that would emit
entropy remains blocked until the prior segment completes. That preserves
entropy order while reducing run-only stalls, but remains a throughput risk
for the full 200 MPixel/s target.
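
The run-entry test and the run scan can be sketched as below. This Python model uses hypothetical function names and scans a whole row at once, whereas the RTL accumulates one pixel per accepted event; it is meant only to show the decision rules.

```python
def selects_run_mode(ra, rb, rc, rd, near):
    """Run mode is entered when all three local gradients are within NEAR."""
    return all(abs(d) <= near for d in (rd - rb, rb - rc, rc - ra))


def scan_run(row, x, ra, near):
    """Scan a run starting at column x with run value Ra.

    Returns (run_length, eol, interruption_sample). The interruption sample
    is None when the run reaches the end of the line.
    """
    run_length = 0
    while x + run_length < len(row) and abs(row[x + run_length] - ra) <= near:
        run_length += 1                 # matching pixel, reconstructed as Ra
    eol = (x + run_length == len(row))
    interruption = None if eol else row[x + run_length]
    return run_length, eol, interruption
```

Note that the loop never re-evaluates gradients once the run has started, which mirrors the rule that gradients are used only to enter run mode.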
### `jls_context_model`
Responsibilities:
- Consume the quantized context event from `jls_context_quantizer`.
- Use `jls_context_memory` to store regular-mode variables `A`, `B`, `C`, and
`N` for 365 contexts.
- Use `jls_context_update` as the regular-mode update arithmetic core after
`Errval` is known.
- Bypass/forward updated context values when a later pipeline stage needs the
same context before table writeback completes.
- Track in-flight context indices so a same-context read either waits for the
matching writeback or uses the same-cycle write/read bypass values.
Stop-for-confirmation trigger:
- If bypass cannot maintain standard semantics without frequent stalls that
threaten the 200 MPixel/s target, raise the issue before changing the target.
### `jls_context_update`
Responsibilities:
- Compute regular-mode Golomb parameter `k` from pre-update `A` and `N`.
- Update one regular-mode context's `A`, `B`, `C`, and `N` after `Errval`.
- Apply the `RESET` halving rule.
- Apply the JPEG-LS bias correction bounds for `B` and `C`.
- Forward `Errval`, context index, strip-last flag, `LIMIT`, `qbpp`, and
mapping-inversion metadata to downstream writeback/error-mapping logic.
- Stay independent from the 365-entry context RAM so table hazards can be
handled in a wrapper with explicit bypass rules.
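
A software model of the update arithmetic, following the T.87 Annex A regular-mode pseudocode, might look like this. The function names are hypothetical; the bounds `MIN_C = -128` and `MAX_C = 127` are the standard values.

```python
MIN_C, MAX_C = -128, 127


def golomb_k(a, n):
    """Smallest k with (N << k) >= A, computed from pre-update values."""
    k = 0
    while (n << k) < a:
        k += 1
    return k


def context_update(a, b, c, n, errval, near, reset=64):
    """Return (k, (A, B, C, N)) after coding one Errval in this context."""
    k = golomb_k(a, n)
    b += errval * (2 * near + 1)
    a += abs(errval)
    if n == reset:                       # RESET halving rule
        a >>= 1
        b = b >> 1 if b >= 0 else -((1 - b) >> 1)
        n >>= 1
    n += 1
    if b <= -n:                          # bias correction, lower bound
        b += n
        if c > MIN_C:
            c -= 1
        if b <= -n:
            b = -n + 1
    elif b > 0:                          # bias correction, upper bound
        b -= n
        if c < MAX_C:
            c += 1
        if b > 0:
            b = 0
    return k, (a, b, c, n)
```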
### `jls_context_memory`
Responsibilities:
- Apply lazy initialization for all 365 regular contexts at standalone
strip-frame start by clearing a written-bit vector.
- Initialize untouched contexts as `A[Q] = max(2, (RANGE + 32) / 64)`,
`B[Q] = 0`, `C[Q] = 0`, and `N[Q] = 1` when they are read.
- Latch the initialization A value when the init command is accepted.
- Provide a registered read result and a simple writeback port.
- Leave same-context read-after-write forwarding to the `jls_context_model`
wrapper so the RAM stays a simple registered storage primitive.
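
The lazy-initialization scheme can be illustrated with a small Python model. The class and method names are hypothetical; the point is that strip start only clears the written flags and latches the initial `A`, so the RAM contents never need a 365-cycle sweep.

```python
class ContextMemory:
    """365-entry context store with lazy per-strip initialization."""

    NUM_CONTEXTS = 365

    def __init__(self):
        self.ram = [(0, 0, 0, 0)] * self.NUM_CONTEXTS   # may hold stale data
        self.written = [False] * self.NUM_CONTEXTS
        self.init_a = 2

    def strip_init(self, rng):
        """Latch the initial A value and mark every context untouched."""
        self.init_a = max(2, (rng + 32) // 64)
        self.written = [False] * self.NUM_CONTEXTS

    def read(self, q):
        """Return (A, B, C, N); untouched contexts read as initial values."""
        if self.written[q]:
            return self.ram[q]
        return (self.init_a, 0, 0, 1)

    def write(self, q, variables):
        self.ram[q] = variables
        self.written[q] = True
```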
### `jls_context_quantizer`
Responsibilities:
- Compute standard local gradients `D1 = Rd - Rb`, `D2 = Rb - Rc`, and
`D3 = Rc - Ra`.
- Quantize gradients to `Q1/Q2/Q3` using current strip `T1/T2/T3/NEAR`.
- Compute the signed context value `(Q1 * 9 + Q2) * 9 + Q3`.
- Emit `context_index = abs(context_value)`, `context_negative`, and
`run_mode_context`.
- Register and forward sample, prediction, neighbors, and strip boundary flags.
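
The quantization and context-index steps can be sketched in Python as follows. Function names are hypothetical; the region boundaries follow the standard gradient quantization with `NEAR`.

```python
def quantize_gradient(d, t1, t2, t3, near):
    """Map one local gradient to a region index in -4..4."""
    if d <= -t3:
        return -4
    if d <= -t2:
        return -3
    if d <= -t1:
        return -2
    if d < -near:
        return -1
    if d <= near:
        return 0
    if d < t1:
        return 1
    if d < t2:
        return 2
    if d < t3:
        return 3
    return 4


def quantize_context(ra, rb, rc, rd, t1, t2, t3, near):
    """Return (context_index, context_negative, run_mode_context)."""
    q1 = quantize_gradient(rd - rb, t1, t2, t3, near)
    q2 = quantize_gradient(rb - rc, t1, t2, t3, near)
    q3 = quantize_gradient(rc - ra, t1, t2, t3, near)
    value = (q1 * 9 + q2) * 9 + q3          # signed, in -364..364
    return abs(value), value < 0, (q1, q2, q3) == (0, 0, 0)
```

Because `|9*Q2 + Q3| <= 40 < 81`, the sign of the combined value equals the sign of the first non-zero `Qi`, so taking `abs()` merges each context with its mirror image and leaves 365 distinct indices.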
### `jls_prediction_corrector`
Responsibilities:
- Accept `Px`, context variable `C[Q]`, and the quantized context sign.
- Apply the context sign to `C[Q]`.
- Clamp `Px +/- C[Q]` to `0..MAXVAL` using the JPEG-LS
`correct_prediction` behavior.
- Forward sample, coordinates, context metadata, strip boundary flags, and the
pre-update `A/B/C/N` variables needed by context update.
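
The correction itself is a signed add followed by a clamp. A minimal Python sketch, with a hypothetical function name:

```python
def correct_prediction(px, c, context_negative, maxval):
    """Apply the sign-adjusted bias C[Q] to Px and clamp to 0..MAXVAL."""
    px = px - c if context_negative else px + c
    return min(max(px, 0), maxval)
```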
### `jls_regular_error_quantizer`
Responsibilities:
- Compute regular-mode `Errval` from the original sample and corrected
prediction.
- Apply NEAR-dependent quantization and RANGE modulo normalization.
- Compute the reconstructed sample `Rx` used by the encoder-side line history.
- Forward `A/B/C/N`, context index, strip-last flag, `LIMIT`, and `qbpp` to the
context update and entropy pipeline.
Current implementation note:
- `NEAR=0` takes the direct path. `NEAR>0` uses an exact reciprocal-LUT multiply
plus quotient-correction pipeline, covering the supported `NEAR=1..31` range
without a single-cycle combinational divider.
- In the integrated top level, regular-mode `Rx` is returned to line history
when this module's result is accepted. Annex A.6 context update and Annex
G.2 entropy coding consume the same `Errval` later, but they do not modify
`Rx`.
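
The arithmetic that the reciprocal-LUT pipeline must reproduce can be written with plain integer division. The Python sketch below is a reference model under that substitution (it does not model the LUT or the quotient correction), and the function name is hypothetical.

```python
def regular_error(x, px, context_negative, near, maxval, rng):
    """Return (Errval, Rx) for one regular-mode sample."""
    sign = -1 if context_negative else 1
    errval = sign * (x - px)
    step = 2 * near + 1
    if near > 0:                                  # NEAR-dependent quantization
        if errval > 0:
            errval = (errval + near) // step
        else:
            errval = -((near - errval) // step)
    # Reconstructed sample used by the encoder-side line history.
    rx = min(max(px + sign * errval * step, 0), maxval)
    # Modulo RANGE normalization of the error.
    if errval < 0:
        errval += rng
    if errval >= (rng + 1) // 2:
        errval -= rng
    return errval, rx
```

`Rx` is fixed before the modulo step, which is why it can be returned to line history without waiting for context update or entropy coding.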
### `jls_run_mode`
Responsibilities:
- Accept one upstream-detected run segment: `run_length`, EOL flag, and optional
run-interruption sample.
- Emit direct run-length code events before any run-interruption mapped-error
event for the same segment.
- Compute run-interruption `RItype`, signed `Errval`, `MErrval`, `k`, and
`LIMIT - J[RUNindex] - 1`.
- Maintain `RUNindex` and the two run-interruption contexts for `RItype=0` and
`RItype=1`.
- Emit the reconstructed run-interruption sample `Rx` for line-buffer writeback.
Current integration note:
- The module is a run-segment entropy helper, not the upstream run scanner. The
top-level integration uses `jls_mode_router` as that run scanner; the router
consumes run pixels, emits reconstructed run pixels, and feeds this module
with complete run segments.
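
The direct run-length code can be sketched in Python using the standard `J` table. The function name is hypothetical, and the sketch deliberately leaves out the `RUNindex` decrement that follows an interrupted run, because its placement relative to the interruption-sample coding (which uses `LIMIT - J[RUNindex] - 1`) should be checked against T.87 rather than taken from this example.

```python
J = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
     4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14, 15]


def run_length_code(run_length, eol, run_index):
    """Return (bit string, RUNindex after the increments) for one segment."""
    bits = []
    while run_length >= (1 << J[run_index]):
        bits.append("1")                       # one full 2^J block of run pixels
        run_length -= 1 << J[run_index]
        if run_index < 31:
            run_index += 1
    if eol:
        if run_length > 0:
            bits.append("1")                   # partial block closed by end of line
    else:
        bits.append("0")                       # run was interrupted
        if J[run_index] > 0:
            # Remaining count in J[RUNindex] bits.
            bits.append(format(run_length, "0{}b".format(J[run_index])))
    return "".join(bits), run_index
```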
### `jls_golomb_encoder`
Responsibilities:
- Accept already computed standard variables `MErrval`, `k`, `LIMIT`, and
`qbpp` from the regular-mode or run-interruption pipeline.
- Generate left-aligned variable-length Golomb code events for the bit packer.
- Emit the regular Golomb path and the JPEG-LS LIMIT fallback path.
- Allow multi-cycle handling for extreme long codes while preserving standard
ordering.
- Keep prediction-error mapping and `k` calculation in the upstream
context/regular-mode stage so those timing paths can be split independently.
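
The two code paths can be sketched in Python as bit strings. The function name is hypothetical, and the sketch produces the whole code word at once rather than as left-aligned multi-cycle events.

```python
def golomb_code(merrval, k, limit, qbpp):
    """Limited-length Golomb code for MErrval as a string of '0'/'1'."""
    threshold = limit - qbpp - 1
    high = merrval >> k
    if high < threshold:
        # Regular path: unary prefix, terminating 1, then the k low bits.
        low = format(merrval & ((1 << k) - 1), "0{}b".format(k)) if k else ""
        return "0" * high + "1" + low
    # LIMIT fallback: fixed prefix, then MErrval - 1 in qbpp bits.
    return "0" * threshold + "1" + format(merrval - 1, "0{}b".format(qbpp))
```

In the fallback path the code word is exactly `LIMIT` bits long, which bounds the worst-case work per sample.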
### `jls_error_mapper`
Responsibilities:
- Accept standard signed `Errval` after quantization and context-sign handling.
- Apply the context-correction inversion used before mapping when requested.
- Map `Errval` to the non-negative standard variable `MErrval`.
- Forward `k`, `LIMIT`, and `qbpp` to `jls_golomb_encoder`.
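
The mapping can be sketched in Python as below. The function names are hypothetical; `needs_inversion` shows the standard condition under which the inverted mapping applies, which this design computes upstream and passes along as metadata.

```python
def needs_inversion(near, k, b, n):
    """Standard condition for the inverted mapping (lossless, k == 0)."""
    return near == 0 and k == 0 and 2 * b <= -n


def map_error(errval, map_invert):
    """Map signed Errval to non-negative MErrval."""
    if map_invert:
        return 2 * errval + 1 if errval >= 0 else -2 * (errval + 1)
    return 2 * errval if errval >= 0 else -2 * errval - 1
```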
### `jls_bit_packer`
Responsibilities:
- Pack variable-length code events into the JPEG-LS byte stream.
- Apply JPEG-LS marker/zero-bit stuffing rules.
- Flush to byte boundary before `EOI` for each strip frame.
- Accept left-aligned code events; the first bit is
`code_bits[MAX_CODE_BITS-1]`.
- Emit at most one scan payload byte per cycle to the internal output buffer.
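
The stuffing rule can be sketched in Python: after a `0xFF` data byte, the next byte carries a leading zero bit and only seven payload bits. The function name is hypothetical, the sketch packs a complete strip payload at once, and the extra `0x00` byte emitted when the flushed payload ends in `0xFF` is an assumption of this sketch that should be checked against the standard and the reference decoder.

```python
def pack_bits(bits):
    """Pack a '0'/'1' string into bytes with JPEG-LS zero-bit stuffing."""
    out = bytearray()
    pos = 0
    while pos < len(bits):
        # After 0xFF, the next byte starts with a stuffed zero bit.
        width = 7 if (out and out[-1] == 0xFF) else 8
        chunk = bits[pos:pos + width].ljust(width, "0")   # zero-pad on flush
        out.append(int(chunk, 2))
        pos += width
    if out and out[-1] == 0xFF:
        out.append(0x00)        # assumption: keep the following marker unambiguous
    return bytes(out)
```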
### `jls_byte_arbiter`
Responsibilities:
- Merge marker/header bytes and scan-payload bytes into one encoded byte stream.
- Give header/EOI bytes priority over payload bytes so each strip frame remains
`SOI/SOF55/LSE/SOS`, payload, then `EOI`.
- Forward `original_image_start` only from the header stream.
### `jls_output_buffer`
Responsibilities:
- Buffer generated bytes before the 9-bit output FIFO.
- Drain one byte per cycle to `ofifo`.
- Ignore `ofifo_full` and `ofifo_alfull` in RTL behavior.
- Raise a simulation error if `ofifo_wr=1` while `ofifo_full=1`.
- Raise the internal near-full pause request when free space is below
`OUT_BUF_AFULL_MARGIN`.
- Provide `byte_accepted` and `buffer_level` for statistics, dynamic NEAR
accounting, and verification reports.
## First RTL Smoke Target
The first RTL smoke target should not implement full entropy coding. It should
verify safe sequencing before algorithmic complexity is added:
- Input SOF detection and dimension latch.
- Strip frame boundary generation.
- Header writer emits one minimal strip frame per strip.
- Output buffer emits bytes with correct `ofifo_wdata[8]` placement.
- Testbench captures output stream and uses the reference decode script once
entropy payload is valid.


@@ -0,0 +1,115 @@
# JPEG-LS RTL Pipeline Mermaid Flow
This document describes the current RTL algorithm pipeline implemented around
`jpeg_ls_encoder_top`. The requirement source is `fpga/srs/jpeg_ls.md`.
The figure is an implementation trace, not a replacement for the standard. The
standard references identify the algorithmic step that each RTL stage must be
equivalent to after pipelining, lookup, speculation, or multi-cycle splitting.
```mermaid
flowchart TB
%% JPEG-LS FPGA encoder pipeline, current RTL implementation trace.
subgraph EXT["External interfaces"]
IFIFO["Input FIFO\nIn: ififo_empty/alempty/rdata, cfg_pic_col/row, ratio\nOut: synchronous read data\nStd: image sample order before Annex A encoding"]
OFIFO["Output FIFO\nIn: ofifo_wr, ofifo_wdata[8:0]\nOut: external byte stream\nStd: Annex C marker stream bytes"]
end
subgraph CTRL["Strip control, parameters, and headers"]
IN["S00 jls_input_ctrl\nIn: FIFO word, SOF sideband, cfg, ratio, pause_req\nDo: SOF gate, 1-cycle FIFO align, dimension fallback, x/y and strip flags\nOut: pixel event, active_pic_col/row, active_ratio\nStd: Annex A.8 control procedure, Annex D scan order"]
SCAN["S01 jls_scan_ctrl\nIn: pixel event, current_near\nDo: start/finish one standalone strip frame, choose first-strip NEAR=0\nOut: enc pixel event, strip_start/finish, strip width/height/near\nStd: Annex A.8, Annex D.1-D.3"]
PRESET["S02 jls_preset_defaults + jls_coding_params\nIn: PIX_WIDTH, strip NEAR\nDo: default MAXVAL/T1/T2/T3/RESET, RANGE/qbpp/LIMIT lookup\nOut: header preset fields, active coding params\nStd: Annex A.2, Annex C.2.4.1.1, Annex G.2"]
HDR["S03 jls_header_writer\nIn: strip_start/finish, strip size, NEAR, preset params\nDo: emit SOI/SOF55/LSE/SOS and EOI, big-endian marker fields\nOut: header/eoi byte stream, original_image_start sideband\nStd: Annex C.1-C.4, Annex D.3"]
NEAR["S04 jls_near_ctrl\nIn: image_start, strip_done, strip pixels, output bytes, ratio\nDo: cumulative actual-vs-target NEAR update and clamp 0..31\nOut: current_near, target_miss_at_max_near\nStd: NEAR usage in Annex A/C/D; dynamic policy is project-specific"]
end
subgraph PIX["Pixel neighborhood and mode decision"]
NBR["S10 jls_neighbor_provider\nIn: enc pixel event, reconstructed writeback Rx, NEAR\nDo: two-bank line history, strip edge handling, Ra/Rb/Rc/Rd selection; NEAR=0 commits X as Rx immediately; NEAR>0 same-row non-EOL Rx bypasses to next Ra\nOut: neighbor event with X,x,y,Ra,Rb,Rc,Rd\nStd: Annex A.3 local gradients, Annex A.4 prediction neighbors"]
ROUTE["S11 jls_mode_router\nIn: neighbor event, strip_width, NEAR\nDo: run/regular decision, run_length accumulation, run pixel Rx=Ra\nOut: regular event OR run segment, direct run-pixel writeback\nStd: Annex A.3 context determination, Annex A.7 run mode"]
end
subgraph REG["Regular-mode pipeline"]
PRED["S20 jls_predictor\nIn: X,Ra,Rb,Rc,Rd\nDo: MED predictor Px\nOut: predicted event with Px and neighbors\nStd: Annex A.4 MED prediction"]
QTZ["S21 jls_context_quantizer\nIn: Ra,Rb,Rc,Rd,T1,T2,T3,NEAR,Px\nDo: D1=Rd-Rb, D2=Rb-Rc, D3=Rc-Ra; quantize Q1/Q2/Q3; context sign/index\nOut: context event, run_mode_context flag\nStd: Annex A.3, Annex G.1"]
CMODEL["S22 jls_context_model + jls_context_memory\nIn: context_index, strip init, writeback A/B/C/N\nDo: lazy-init 365 contexts, read/bypass pre-update A/B/C/N, stall same-context hazards\nOut: vars event with A,B,C,N,C[Q]\nStd: Annex A.2 initialization, Annex A.6 variables"]
CORR["S23 jls_prediction_corrector\nIn: Px,C[Q],context sign, MAXVAL\nDo: bias-correct prediction and clamp to 0..MAXVAL\nOut: corrected Px and pre-update context vars\nStd: Annex A.5 prediction correction, Annex A.6 bias variables"]
ERRQ["S24 jls_regular_error_quantizer\nIn: X, corrected Px, RANGE, NEAR\nDo: Errval quantization/modulo normalization, reconstruct Rx, reciprocal-LUT NEAR division; regular Rx may return before Golomb done\nOut: Errval, Rx, context metadata, qbpp/LIMIT\nStd: Annex A.5 prediction error encoding, Annex A.2 RANGE"]
CUPDATE["S25 jls_context_update\nIn: Errval,A,B,C,N,NEAR,RESET\nDo: compute k from pre-update vars; update A/B/C/N; map inversion flag\nOut: k, updated A/B/C/N, Errval for mapping\nStd: Annex A.5 Golomb parameter, Annex A.6 update variables"]
EMAP["S26 jls_error_mapper\nIn: Errval, map_invert, k, LIMIT, qbpp\nDo: signed Errval to non-negative MErrval\nOut: regular MErrval,k,LIMIT,qbpp,last flag\nStd: Annex A.5 mapped error value, Annex G.2"]
end
subgraph RUN["Run-mode pipeline"]
RUNCORE["S30 jls_run_mode\nIn: run_length, EOL flag, interruption X,x,y,Ra,Rb,RANGE,qbpp,LIMIT,NEAR,RESET\nDo: RUNindex/J run-length code, RItype, reciprocal-LUT interruption Errval/MErrval/k, RI context update, interruption Rx\nOut: direct run code bits, run MErrval,k,limit,qbpp, interruption Rx\nStd: Annex A.7 run mode, Annex A.5 mapped error, Annex G.3"]
end
subgraph ENT["Entropy, packing, and byte output"]
MMERGE["S40 mapped-error arbiter in top\nIn: regular MErrval stream, run-interruption MErrval stream\nDo: prioritize pending run mapped event while preserving conservative order\nOut: selected MErrval,k,limit,qbpp\nStd: engineering ordering wrapper for Annex A.5/A.7 events"]
GOL["S41 jls_golomb_encoder\nIn: MErrval,k,LIMIT,qbpp\nDo: Golomb-Rice prefix/suffix and LIMIT fallback code events\nOut: left-aligned variable-length code bits\nStd: Annex A.5, Annex G.2"]
CMERGE["S42 code-event arbiter in top\nIn: direct run-length code, Golomb code events\nDo: run-length code before same-segment interruption code\nOut: ordered code_bits/code_bit_count\nStd: Annex A.7 run-length code order, Annex A.5 error code order"]
PACK["S43 jls_bit_packer\nIn: ordered code events, flush request\nDo: bit-to-byte packing, JPEG-LS zero-bit stuffing after 0xFF, flush before EOI\nOut: scan payload bytes\nStd: Annex C.1-C.4, Annex H.2"]
BARB["S44 jls_byte_arbiter\nIn: header/eoi bytes, payload bytes\nDo: header/EOI priority over payload, preserve original_image_start sideband\nOut: encoded byte event\nStd: Annex C marker and entropy-coded segment ordering"]
OUTBUF["S45 jls_output_buffer\nIn: encoded byte event\nDo: byte FIFO buffering, one byte/cycle drain, internal near-full pause\nOut: ofifo_wr/ofifo_wdata, byte_accepted, buffer_level\nStd: Annex C byte stream delivery"]
end
IFIFO --> IN --> SCAN --> NBR --> ROUTE
SCAN --> PRESET
SCAN --> HDR
PRESET --> HDR
PRESET --> CMODEL
PRESET --> ERRQ
PRESET --> RUNCORE
PRESET --> QTZ
NEAR --> SCAN
OUTBUF --> NEAR
SCAN --> NEAR
ROUTE -- "regular event: X,x,y,Ra/Rb/Rc/Rd" --> PRED --> QTZ --> CMODEL --> CORR --> ERRQ --> CUPDATE --> EMAP --> MMERGE
CUPDATE -- "updated A/B/C/N writeback" --> CMODEL
ROUTE -- "run segment: run_length,EOL,interruption X,Ra,Rb" --> RUNCORE
ROUTE -- "run pixel Rx=Ra" --> NBR
RUNCORE -- "interruption Rx" --> NBR
RUNCORE -- "direct run code bits" --> CMERGE
RUNCORE -- "run MErrval,k,limit,qbpp" --> MMERGE
ERRQ -- "regular Rx after Errval/Rx accept" --> NBR
MMERGE --> GOL --> CMERGE --> PACK --> BARB --> OUTBUF --> OFIFO
HDR --> BARB
```
## Stage Detail
| Stage | RTL module | Main inputs | Main processing | Main outputs | Standard / pseudocode mapping |
| --- | --- | --- | --- | --- | --- |
| S00 | `jls_input_ctrl` | `ififo_rdata`, `ififo_empty`, `ififo_alempty`, `cfg_pic_col`, `cfg_pic_row`, `ratio`, `pause_req` | Wait for SOF, align synchronous FIFO read latency, latch runtime config, validate/fallback image size, generate `x/y` and strip/image flags. | Pixel event: `sample`, `x`, `y`, `strip_first_pixel`, `strip_last_pixel`, `image_first_pixel`, `image_last_pixel`, active config. | Annex A.8 control procedure; Annex D scan order. |
| S01 | `jls_scan_ctrl` | Pixel event, `current_near`, downstream readiness. | Split the original image into standalone strip frames, emit strip start/finish commands, force the first strip NEAR to 0. | Encode pixel event, strip start/finish event, `strip_width`, `strip_height`, `strip_near`, strip pixel count. | Annex A.8; Annex D.1-D.3 scan control. |
| S02 | `jls_preset_defaults`, `jls_coding_params` | `PIX_WIDTH`, strip `NEAR`. | Compute default `MAXVAL/T1/T2/T3/RESET`; lookup `RANGE/qbpp/LIMIT` for the active strip. | Preset fields for LSE/header and active coding parameters for regular/run mode. | Annex A.2; Annex C.2.4.1.1 preset parameters; Annex G.2 coding parameters. |
| S03 | `jls_header_writer` | Strip start/finish, strip size, `NEAR`, preset fields. | Emit `SOI/SOF55/LSE/SOS` at strip start and `EOI` after payload flush. | Header/EOI byte stream and `original_image_start` sideband. | Annex C.1-C.4 marker syntax; Annex D.3 scan syntax. |
| S04 | `jls_near_ctrl` | Image start ratio, strip output byte count, strip pixel count. | Project dynamic policy: cumulative actual-vs-target bits, step `NEAR` up/down, clamp to `0..31`. | `current_near`, cumulative bit counters, target-miss flag. | Standard NEAR usage is Annex A/C/D; dynamic ratio policy is project-specific. |
| S10 | `jls_neighbor_provider` | Encode pixel event, reconstructed writeback `Rx`, active strip width, `NEAR`. | Maintain reconstructed line history in two banks, apply top/left/right edge rules, produce JPEG-LS neighbors. For `NEAR=0`, commit `X` as `Rx` immediately. For `NEAR>0`, a non-EOL writeback can overlap the next same-row pixel accept by bypassing returned `Rx` as that pixel's `Ra`; row transitions still wait one clock. Regular-mode true `Rx` returns immediately after S24 accepts the `Errval/Rx` result rather than after Golomb completion. | Neighbor event with `X`, `x/y`, `Ra/Rb/Rc/Rd`, strip flags. | Annex A.3 local gradients; Annex A.4 prediction neighborhood. |
| S11 | `jls_mode_router` | Neighbor event, `strip_width`, `NEAR`. | Determine regular/run entry from gradients, then stay in the Annex A.7 run loop while `run_length_accum` is non-zero; accumulate matching run pixels, reconstruct them as `Ra`, and form run segments at EOL/interruption. Later non-EOL matching run pixels may overlap an outstanding run segment because they emit no entropy yet. | Regular event or run segment; direct run-pixel reconstruction. | Annex A.3 context determination; Annex A.7 run mode. |
| S20 | `jls_predictor` | Regular event `X,Ra,Rb,Rc,Rd`. | Compute MED prediction `Px`. | Predicted event with `Px` and neighbor metadata. | Annex A.4 MED predictor pseudocode. |
| S21 | `jls_context_quantizer` | `Ra/Rb/Rc/Rd`, `T1/T2/T3`, `NEAR`, `Px`. | Compute `D1/D2/D3`, quantize to `Q1/Q2/Q3`, derive context sign and index. | Context event, `context_index`, `context_negative`, `run_mode_context`. | Annex A.3; Annex G.1 context quantization. |
| S22 | `jls_context_model`, `jls_context_memory` | Context event, strip init, regular context writeback. | Lazy-initialize/read 365 regular contexts; track in-flight context indices; bypass same-cycle write/read values to prevent stale `A/B/C/N`. | Vars event with `A/B/C/N`, `C[Q]`, context metadata. | Annex A.2 initialization; Annex A.6 variables. |
| S23 | `jls_prediction_corrector` | `Px`, `C[Q]`, context sign, `MAXVAL`, pre-update vars. | Apply bias correction to prediction and clamp to sample range. | Corrected `Px`, forwarded context vars and metadata. | Annex A.5 prediction correction; Annex A.6 bias variables. |
| S24 | `jls_regular_error_quantizer` | `X`, corrected `Px`, `RANGE`, `NEAR`, `qbpp`, `LIMIT`. | Compute quantized/modulo `Errval`, reconstruct regular-mode `Rx`; `NEAR>0` uses reciprocal-LUT multiply plus quotient correction. The top-level may feed this `Rx` back to S10 before S25/S26/S41 finish because later stages do not alter `Rx`. | `Errval`, regular `Rx`, context index, `qbpp`, `LIMIT`, pre-update vars. | Annex A.5 prediction error encoding; Annex A.2 RANGE. |
| S25 | `jls_context_update` | `Errval`, `A/B/C/N`, `NEAR`, `RESET`, context metadata. | Compute pre-update `k`, update `A/B/C/N`, apply RESET halving and bias bounds, compute map inversion flag. | Updated context writeback, `k`, `Errval`, map inversion metadata. | Annex A.5 Golomb parameter; Annex A.6 update variables. |
| S26 | `jls_error_mapper` | `Errval`, map inversion flag, `k`, `LIMIT`, `qbpp`. | Convert signed regular `Errval` into non-negative `MErrval`. | Regular mapped event `MErrval/k/LIMIT/qbpp`. | Annex A.5 mapped error value; Annex G.2. |
| S30 | `jls_run_mode` | Run segment `run_length`, EOL, interruption `X`, `Ra/Rb`, `RANGE`, `qbpp`, `LIMIT`, `NEAR`, `RESET`. | Emit run-length code via `RUNindex/J`; compute `RItype`, run-interruption `Errval/MErrval/k`, update RI contexts, reconstruct interruption `Rx`; `NEAR>0` uses the same reciprocal-LUT division pipeline. | Direct run code event, run mapped event, interruption `Rx`, segment done flags. | Annex A.7 run mode; Annex A.5 mapped error; Annex G.3 run interruption context. |
| S40 | top mapped-error arbiter | Regular mapped event, run mapped event. | Select the next `MErrval/k/limit/qbpp` event while preserving conservative ordering; if one mapped event completes while a new run mapped event is accepted, the new run busy state wins. | Mapped event for Golomb encoder. | Engineering wrapper preserving Annex A.5/A.7 entropy order. |
| S41 | `jls_golomb_encoder` | `MErrval`, `k`, `LIMIT`, `qbpp`. | Generate Golomb-Rice prefix/suffix code events and LIMIT fallback path. | Left-aligned variable-length code bits. | Annex A.5; Annex G.2. |
| S42 | top code-event arbiter | Direct run-length code event, Golomb code event. | Emit run-length code before the same segment's interruption Golomb event; block reordering. | Ordered code event stream. | Annex A.7 run-length code order; Annex A.5 error code order. |
| S43 | `jls_bit_packer` | Ordered code bits, flush request. | Pack bits into bytes, insert JPEG-LS zero bit after data byte `0xFF`, flush partial byte before `EOI`. | Scan payload bytes. | Annex C.1-C.4 entropy-coded segment syntax; Annex H.2 examples. |
| S44 | `jls_byte_arbiter` | Header/EOI bytes, payload bytes. | Prioritize marker bytes and preserve `original_image_start` sideband only from header stream. | Encoded byte event. | Annex C marker and scan payload ordering. |
| S45 | `jls_output_buffer` | Encoded byte event, output FIFO status ports. | Buffer generated bytes, drain one byte per cycle, ignore external full flags in RTL behavior, report internal pause. | `ofifo_wr`, `ofifo_wdata`, byte count/watermark signals. | Annex C byte stream delivery; project FIFO contract. |
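Row S25 of the table above can be sketched as a small Python reference model. This is a minimal sketch, assuming the update follows the usual JPEG-LS bias-update rules with `MIN_C=-128` and `MAX_C=127`; the function names are hypothetical and do not come from the RTL.

```python
MIN_C, MAX_C = -128, 127  # assumed bounds for the bias-correction value C[Q]


def golomb_k(a: int, n: int) -> int:
    """Smallest k with (N << k) >= A, taken from the pre-update variables."""
    k = 0
    while (n << k) < a:
        k += 1
    return k


def update_context(a, b, c, n, errval, near, reset):
    """Return (k, A, B, C, N) after coding one regular-mode sample."""
    k = golomb_k(a, n)
    b += errval * (2 * near + 1)
    a += abs(errval)
    if n == reset:
        # RESET halving; B is halved symmetrically around zero
        a >>= 1
        b = b >> 1 if b >= 0 else -((1 - b) >> 1)
        n >>= 1
    n += 1
    if b <= -n:
        b += n
        if c > MIN_C:
            c -= 1
        if b <= -n:
            b = -n + 1
    elif b > 0:
        b -= n
        if c < MAX_C:
            c += 1
        if b > 0:
            b = 0
    return k, a, b, c, n
```

A model like this can serve as a scoreboard reference for the positive, negative, RESET-halving, and C-saturation cases; it does not model the DSP pipelining.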
## Speculation Rule
The implementation may add a future speculation path before S10/S11 to precompute
gradients, prediction candidates, or context read addresses from original or old
values. Such values are only hints. Final `Ra/Rb/Rc/Rd`, `Px`, context index,
run/regular selection, `Errval/MErrval/k`, context updates, run-state updates,
and reconstructed history must be recomputed or checked against the true
encoder-side reconstructed neighbor history before commit.

docs/jls_traceability.md Normal file

@@ -0,0 +1,204 @@
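# JPEG-LS RTL Standard Traceability Notes
This document records, during RTL implementation, the correspondence between JPEG-LS standard clauses, pseudocode variables, RTL code snippets, and
examples. Key processing steps in the implementation code must cite the corresponding subsection of this document.
## 1. Referenced Standard
- Standard: ITU-T T.87 (06/1998) / ISO/IEC 14495-1 JPEG-LS Baseline
- Official page: https://www.itu.int/rec/T-REC-T.87-199806-I
- Reference implementation: https://github.com/team-charls/charls
## 2. RTL Comment Template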
```systemverilog
// Standard : ITU-T T.87 (06/1998) / ISO/IEC 14495-1 JPEG-LS Baseline
// Clause : Annex A.4 Prediction
// Figure : N/A
// Table : N/A
// Pseudocode : MED predictor / Px calculation
// Trace : docs/jls_traceability.md#med-predictor
// Notes : Pipelined implementation; equivalent to the standard step.
```
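Rules:
- `Clause`, `Figure`, and `Table` must come from the official standard document or its official table of contents.
- Write `N/A` when there is no corresponding figure or table.
- Do not fill in figure, table, or clause numbers from memory.
- Do not copy long passages of the standard into RTL comments; give only the reference location, the variable mapping, and engineering notes.
- Pipelining, table lookup, bypassing, and multi-cycle processing must each state how they are equivalent to the standard pseudocode.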
## 3. Process Cross-Reference Table
| Process | RTL module | Standard clause | Figure | Table | RTL snippet ID | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Overall encoding flow | `jpeg_ls_encoder_top`, `jls_scan_ctrl` | Clause 4.4, Annex A.8, Annex D.1-D.3 | N/A | N/A | `JLS_TOP_PIPELINE`, `JLS_SCAN_CONTROL` | See `docs/jls_pipeline_mermaid.md` |
| Single-component coding parameters and compressed data | `jls_scan_ctrl`, `jls_header_writer` | Annex A.1 | N/A | N/A | `JLS_SINGLE_COMPONENT_PARAMS` | Grayscale, single component, `Nf=1` |
| Initialization and conventions | `jls_scan_ctrl`, `jls_context_model` | Annex A.2 | N/A | N/A | `JLS_CONTEXT_INIT`, `JLS_CODING_PARAMS` | Re-initialized at every strip-frame boundary |
| Context determination | `jls_context_quantizer`, `jls_context_model` | Annex A.3, Annex G.1 | N/A | N/A | `JLS_CONTEXT_QUANTIZER` | See `fpga/verilog/jls_context_quantizer.sv` |
| MED prediction | `jls_predictor` | Annex A.4 | N/A | N/A | `MED_PREDICTOR` | See Section 4.1 |
| Prediction error encoding | `jls_error_mapper`, `jls_golomb_encoder` | Annex A.5, Annex G.2 | N/A | N/A | `JLS_ERROR_MAPPER`, `JLS_GOLOMB_ENCODER` | See `fpga/verilog/jls_error_mapper.sv` and `fpga/verilog/jls_golomb_encoder.sv` |
| Context variable update | `jls_context_memory`, `jls_context_update`, `jls_context_model` | Annex A.2, Annex A.6 | N/A | N/A | `JLS_CONTEXT_MEMORY`, `JLS_CONTEXT_UPDATE` | See `fpga/verilog/jls_context_memory.sv` and `fpga/verilog/jls_context_update.sv` |
| Prediction bias correction | `jls_prediction_corrector` | Annex A.5, Annex A.6 | N/A | N/A | `JLS_PREDICTION_CORRECTOR` | See `fpga/verilog/jls_prediction_corrector.sv` |
| Run-mode encoding | `jls_run_mode` | Annex A.7, Annex G.3 | N/A | N/A | `JLS_RUN_MODE` | `RUNindex`/`J` use the table entries from the standard pseudocode; see Section 4.4 |
| JPEG-LS bitstream format and markers | `jls_header_writer`, `jls_bit_packer` | Annex C.1-C.4 | N/A | N/A | `JLS_HEADER_MARKERS` | See `fpga/verilog/jls_header_writer.sv` |
| LSE default preset parameters | `jls_preset_defaults`, `jls_header_writer` | Annex C.2.4.1.1 | Figure C.3 | Table C.1, Table C.2, Table C.3 | `JLS_PRESET_DEFAULTS` | See `fpga/verilog/jls_preset_defaults.sv` |
| `RANGE`/`qbpp`/`LIMIT` parameters | `jls_coding_params` | Annex A.2, Annex G.2 | N/A | N/A | `JLS_CODING_PARAMS` | See `fpga/verilog/jls_coding_params.sv` |
| Output FIFO byte delivery | `jls_output_buffer` | Annex C.1-C.4 | N/A | N/A | `JLS_OUTPUT_BUFFER` | See `fpga/verilog/jls_output_buffer.sv` |
| Scan control flow | `jls_scan_ctrl`, `jls_header_writer` | Annex D.3 | N/A | N/A | `JLS_SCAN_CONTROL` | See `fpga/verilog/jls_scan_ctrl.sv` |
| Bitstream output example | `jls_bit_packer` | Annex H.2 | N/A | N/A | `JLS_BIT_PACKER` | See `fpga/verilog/jls_bit_packer.sv` |
| Detailed encoding example | Multiple modules, described jointly | Annex H.3 | N/A | N/A | `JLS_TRACE_EXAMPLES` | Section 4 of this document gives small-scale variable examples |
| Decode conformance check | Verification scripts; CharLS and libjpeg comparison | Annex F.1 | N/A | N/A | `JLS_REFERENCE_COMPARE` | See `tools/jls_compat/reference_decode_compare.py` |
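## 4. Example Description Templates
### 4.1 MED Predictor
- Standard clause: Annex A.4
- RTL module: `jls_predictor`
- Standard variables: `Ra`, `Rb`, `Rc`, `Px`
- RTL snippet ID: `MED_PREDICTOR`
- Example input: `Ra=10`, `Rb=20`, `Rc=15`
- Example intermediate values: `Rc` lies between `Ra` and `Rb`, so `Ra+Rb-Rc` is selected
- Example output: `Px=15`
- Engineering notes: `jls_predictor` implements only the MED compare/add/subtract and registers its output; the line-buffer reads and boundary handling for `Ra/Rb/Rc/Rd` sit in a separate pipeline stage, which keeps the result equivalent to the standard `Px` calculation while reducing per-stage logic depth. For `NEAR=0`, the lossless reconstructed value `Rx` equals the input sample `X`, so the RTL may commit `X` to the line history immediately; for `NEAR>0` it must wait for the true reconstructed sample or use a replay mechanism that has been verified as equivalent. The current implementation allows a non-end-of-line writeback to be accepted in the same cycle as the next pixel of the same row and bypasses the just-returned `Rx` as the next pixel's `Ra`; the state switch from end of line to `x=0` of the next row is not bypassed.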
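The three comparison cases of the MED predictor can be written as a short Python reference function. This is a sketch for checking expected `Px` values in a testbench or script; the name `med_predict` is hypothetical.

```python
def med_predict(ra: int, rb: int, rc: int) -> int:
    """Median edge detector: predict Px from the reconstructed neighbors."""
    if rc >= max(ra, rb):
        return min(ra, rb)
    if rc <= min(ra, rb):
        return max(ra, rb)
    # Rc lies strictly between Ra and Rb: planar prediction
    return ra + rb - rc
```

For the example in this section, `med_predict(10, 20, 15)` selects the `Ra+Rb-Rc` branch.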
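### 4.2 Context Update
- Standard clause: Annex A.3, Annex A.6, Annex G.1
- RTL module: `jls_context_quantizer`, `jls_context_model`
- Standard variables: `D1`, `D2`, `D3`, `Q1`, `Q2`, `Q3`, `A`, `B`, `C`, `N`, `Nn`
- RTL snippet ID: `JLS_CONTEXT_QUANTIZER`, `JLS_CONTEXT_MEMORY`, `JLS_CONTEXT_UPDATE`
- Example input: `Rd=32`, `Rb=10`, `Rc=2`, `Ra=0`, `T1=3`, `T2=7`, `T3=21`, `NEAR=0`
- Example intermediate values: `D1=22`, `D2=8`, `D3=2`, quantized to `Q1=4`, `Q2=3`, `Q3=1`
- Example output: `context_index=352`, `context_negative=0`, `run_mode_context=0`
- Engineering notes: `jls_context_quantizer` performs only gradient quantization and context numbering; `jls_context_memory` stores the 365 regular contexts and uses lazy initialization with a written-bit to return the strip-default `A/B/C/N`, avoiding an entry-by-entry table clear at strip start; `jls_context_update` performs only the arithmetic `A/B/C/N` update of a single context and splits the Annex A.6 `B[Q] += Errval*(2*NEAR+1)` into DSP input-operand, product, and accumulate pipeline stages. `jls_context_model` uses in-flight busy bits to track contexts that have been read and await writeback; when consecutive pixels access the same context, it stalls if the writeback has not yet arrived, and bypasses `write_A/B/C/N` if the writeback and the new read occur in the same cycle. Reading a stale context is forbidden.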
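The quantization example above can be reproduced with a small Python sketch. The threshold comparisons follow the standard gradient quantizer; the index formula `81*Q1 + 9*Q2 + Q3` is an assumption inferred from the `context_index=352` example, and the function names are hypothetical.

```python
def quantize_gradient(d, t1, t2, t3, near):
    """Map one local gradient to a region index in -4..4."""
    if d <= -t3:
        return -4
    if d <= -t2:
        return -3
    if d <= -t1:
        return -2
    if d < -near:
        return -1
    if d <= near:
        return 0
    if d < t1:
        return 1
    if d < t2:
        return 2
    if d < t3:
        return 3
    return 4


def context_of(ra, rb, rc, rd, t1=3, t2=7, t3=21, near=0):
    """Return (context_index, context_negative, run_mode_context)."""
    q = [quantize_gradient(d, t1, t2, t3, near)
         for d in (rd - rb, rb - rc, rc - ra)]
    run_mode = q == [0, 0, 0]
    first_nonzero = next((v for v in q if v != 0), 0)
    negative = first_nonzero < 0
    if negative:
        # merge mirrored contexts by flipping the sign of the whole vector
        q = [-v for v in q]
    return 81 * q[0] + 9 * q[1] + q[2], negative, run_mode
```

The mirrored input (`Ra=32`, `Rb=22`, `Rc=30`, `Rd=0`) lands on the same index with the sign flag set, which is the case the prediction corrector and error quantizer must handle.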
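### 4.3 Golomb-Rice Encoding
- Standard clause: Annex A.5, Annex G.2
- RTL module: `jls_error_mapper`, `jls_golomb_encoder`
- Standard variables: `Errval`, `MErrval`, `k`, `LIMIT`, `qbpp`
- RTL snippet ID: `JLS_ERROR_MAPPER`, `JLS_GOLOMB_ENCODER`
- Example input: `Errval=-3`, `map_invert=0`; subsequently `MErrval=5`, `k=1`, `LIMIT=32`, `qbpp=8`
- Example intermediate values: `MErrval=5` gives `high_bits=2`; the normal-path prefix is `0,0,1`
- Example output: two left-aligned code events, first the prefix `001`, then the suffix `1`
- Engineering notes: `jls_error_mapper` maps the signed `Errval` to the non-negative `MErrval`; `jls_golomb_encoder` generates the codeword starting from the already-mapped `MErrval`. `Errval` quantization and the `k` calculation sit in upstream pipeline stages so that logic depth can be split. Extremely long codes may take multiple cycles, but must not sacrifice the main clock frequency or the target throughput.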
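The mapping and codeword generation in this section can be sketched in Python as follows. It is a bit-string model for checking expected code events, not a description of the RTL event interface; the function names are hypothetical.

```python
def map_error(errval: int, map_invert: bool = False) -> int:
    """Fold a signed Errval into a non-negative MErrval."""
    if map_invert:
        return 2 * errval + 1 if errval >= 0 else -2 * (errval + 1)
    return 2 * errval if errval >= 0 else -2 * errval - 1


def golomb_code(merrval: int, k: int, limit: int, qbpp: int):
    """Return (prefix, suffix) as MSB-first bit strings."""
    high_bits = merrval >> k
    escape = limit - qbpp - 1
    if high_bits < escape:
        # normal path: unary prefix, then the k low-order bits
        prefix = "0" * high_bits + "1"
        suffix = format(merrval & ((1 << k) - 1), f"0{k}b") if k else ""
        return prefix, suffix
    # LIMIT path: fixed-length escape prefix, then MErrval-1 in qbpp bits
    return "0" * escape + "1", format(merrval - 1, f"0{qbpp}b")
```

With `LIMIT=32` and `qbpp=8`, a large value such as `MErrval=200` at `k=0` takes the escape path and is bounded to 32 code bits in total.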
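### 4.3a Regular Prediction Correction
- Standard clause: Annex A.5, Annex A.6
- RTL module: `jls_prediction_corrector`
- Standard variables: `Px`, `C[Q]`
- RTL snippet ID: `JLS_PREDICTION_CORRECTOR`
- Example input: `Px=20`, `C=-3`, `context_negative=0`
- Example intermediate values: with the context sign not inverted, the correction is `-3`
- Example output: `corrected_Px=17`
- Engineering notes: this module implements only prediction correction and clamping to `0..MAXVAL`; `Errval` quantization, sample reconstruction, and context variable updates are placed in later pipeline stages.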
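A one-function Python sketch of the correction and clamp, assuming `context_negative` means the context sign is `-1`; the name `correct_prediction` is hypothetical.

```python
def correct_prediction(px: int, c: int, context_negative: bool, maxval: int) -> int:
    """Apply the bias C[Q] with the context sign, then clamp to 0..MAXVAL."""
    px = px - c if context_negative else px + c
    return min(max(px, 0), maxval)
```

The clamp matters near the sample range limits, for example a small `Px` with a negative `C[Q]`.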
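### 4.3b Regular Error Quantization And Reconstruction Feedback
- Standard clause: Annex A.5, Annex A.6, Annex G.2
- RTL module: `jls_regular_error_quantizer`, `jpeg_ls_encoder_top`
- Standard variables: `Errval`, `Rx`, `MErrval`, `k`, `A`, `B`, `C`, `N`
- RTL snippet ID: `JLS_REGULAR_ERROR_QUANTIZER`, `JLS_TOP_REGULAR_RX_FEEDBACK`
- Example input: `X=22`, `corrected_Px=17`, `NEAR=0`, `RANGE=256`
- Example intermediate values: `Errval=5`; in the lossless case `Rx=X=22`
- Example output: `jls_neighbor_provider` receives `Rx=22` on the cycle after the regular error-quantization result is accepted
- Engineering notes: `Rx` is fully determined once the Annex A.5 `Errval` quantization and modulo normalization are done; the Annex A.6 context update, the `MErrval` mapping, and the Annex G.2 Golomb codeword generation do not modify `Rx`. The top level therefore feeds regular-mode `Rx` back to the line history early while preserving the standard order of context writeback and entropy events.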
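The quantization, reconstruction, and modulo steps can be modelled in Python with plain integer division. This is a sketch under the assumption that reconstruction is computed from the quantized error before modulo reduction; the RTL reciprocal-LUT path should produce the same quotient. The function name is hypothetical.

```python
def quantize_and_reconstruct(x, px, near, range_, maxval, context_negative=False):
    """Return (Errval after modulo reduction, reconstructed Rx)."""
    sign = -1 if context_negative else 1
    step = 2 * near + 1
    errval = sign * (x - px)
    if near > 0:
        # near-lossless quantization with rounding toward the nearest step
        if errval > 0:
            errval = (near + errval) // step
        else:
            errval = -((near - errval) // step)
    rx = min(max(px + sign * errval * step, 0), maxval)
    # modulo reduction into the RANGE-sized window around zero
    if errval < 0:
        errval += range_
    if errval >= (range_ + 1) // 2:
        errval -= range_
    return errval, rx
```

In every case the reconstructed sample should satisfy `abs(X - Rx) <= NEAR`, which is also the pass/fail rule used by the decode comparison.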
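### 4.4 Run Mode
- Standard clause: Annex A.7, Annex G.3
- RTL module: `jls_run_mode`
- Standard variables: `RUNindex`, `RUNval`, `RItype`, `EMErrval`
- RTL snippet ID: `JLS_RUN_MODE`
- Example input: `NEAR=31`, `RANGE=6`, `run_length=0`, `Ra=0`, `Rb=0`, `X=200`
- Example intermediate values: `RItype=1`, `Errval=floor((200+31)/63)=3`, then modulo-normalized with `RANGE=6` to `-3`; reconstructed `Rx=-3*63+6*63=189`
- Example output: a zero-length run with `RUNindex=0` emits one `0` bit; `MErrval=4`, `k=1`
- Engineering notes: `jls_mode_router` uses only the gradients to decide entry into run mode; once `run_length_accum` is non-zero it stays in the Annex A.7 run loop, and a later non-matching sample is encoded as a run interruption rather than being re-classified by regular gradients. `jls_run_mode` outputs the run-length code and the run-interruption mapped event separately and maintains the `RItype` 0/1 contexts. For `NEAR>0`, interruption-error quantization uses a reciprocal-LUT multiply with a quotient-correction pipeline, avoiding a long combinational divider.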
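The run-interruption numbers in this example can be checked with the Python sketch below. It assumes the freshly initialized run-interruption context for `RANGE=6` (`A=2`, `N=1`, `Nn=0`) and leaves out the context update and the run-length code itself; all names are hypothetical.

```python
def run_interruption(x, ra, rb, near, range_, maxval, a, n, nn):
    """Return (RItype, Errval, Rx, k, EMErrval) for one interruption sample.

    a, n, nn are A/N/Nn of the run-interruption context selected by RItype.
    """
    step = 2 * near + 1
    ritype = 1 if abs(ra - rb) <= near else 0
    px = ra if ritype == 1 else rb
    sign = -1 if (ritype == 0 and ra > rb) else 1
    errval = sign * (x - px)
    if near > 0:
        errval = (near + errval) // step if errval > 0 else -((near - errval) // step)
    rx = min(max(px + sign * errval * step, 0), maxval)
    if errval < 0:
        errval += range_
    if errval >= (range_ + 1) // 2:
        errval -= range_
    # Golomb parameter for the interruption context
    temp = a + (n >> 1) if ritype == 1 else a
    k = 0
    while (n << k) < temp:
        k += 1
    # mapping flag, then the mapped interruption error
    if k == 0 and errval > 0 and 2 * nn < n:
        map_flag = 1
    elif errval < 0 and 2 * nn >= n:
        map_flag = 1
    elif errval < 0 and k != 0:
        map_flag = 1
    else:
        map_flag = 0
    emerrval = 2 * abs(errval) - ritype - map_flag
    return ritype, errval, rx, k, emerrval
```

Calling it with the section's inputs reproduces `RItype=1`, `Errval=-3`, `Rx=189`, `k=1`, and the mapped value `4`.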
### 4.5 Bit Packing And Stuffing
- Standard clause: Annex C.1-C.4, Annex H.2
- RTL module: `jls_bit_packer`
- Standard variables: entropy-coded segment bitstream, marker byte `0xFF`
- RTL snippet ID: `JLS_BIT_PACKER`
- Example input: `code_bits[63:56]=8'hFF`, `code_bit_count=8`, followed by 7 data bits `1111111`
- Example intermediate values: after the first payload byte `0xFF`, one zero stuffed bit is inserted per the JPEG-LS rule, and the following 7 data bits are then packed
- Example output: payload bytes `FF 7F`
- Engineering notes: the JPEG-LS marker/zero-bit stuffing rule must be followed; simplifying it to ordinary
  JPEG `0xFF 0x00` byte stuffing is forbidden.
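The stuffing rule can be modelled bit by bit in Python. This is a minimal sketch that takes a flat list of bits rather than left-aligned 64-bit code events; it does not address any special handling of a scan whose last byte is `0xFF`. The name `pack_bits` is hypothetical.

```python
def pack_bits(bits):
    """Pack MSB-first bits into bytes with JPEG-LS zero-bit stuffing."""
    out = bytearray()
    acc, filled, capacity = 0, 0, 8
    for bit in bits:
        acc = (acc << 1) | (bit & 1)
        filled += 1
        if filled == capacity:
            out.append(acc)
            # after 0xFF the next byte carries a leading 0 bit, so only 7 data bits fit
            capacity = 7 if acc == 0xFF else 8
            acc, filled = 0, 0
    if filled:
        # flush: pad the partial byte with zero bits
        out.append(acc << (capacity - filled))
    return bytes(out)
```

Because a byte following `0xFF` always has its top bit clear, no marker can appear inside the payload, which is what lets a decoder find `EOI` unambiguously.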
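### 4.6 Dynamic NEAR Control
- Standard clause: for JPEG-LS use of `NEAR` see Annex A/C/D; the dynamic control follows CN102088602A.
- RTL module: `jls_near_ctrl`
- Standard variables: `NEAR`
- RTL snippet ID: `JLS_NEAR_CONTROL`
- Example input: `ratio=2`, `current_near=3`, `actual_bits_cumulative=9000`, `target_bits_cumulative=8192`
- Example intermediate values: the cumulative actual bit count exceeds the cumulative target bit count, so the next strip's `NEAR` attempts to increase by 1, clamped to `0..31`
- Example output: `next_near=4`; the cumulative bit statistics of the current strip are also output for reporting
- Engineering notes: the first version adjusts in simple steps based on cumulative actual versus cumulative target bits at the end of each strip frame;
  later versions may be optimized along the lines of the patent method.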
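A possible shape of the step rule, as a Python sketch. Only the step-up branch and the `0..31` clamp are described in this section; the step-down branch when the stream is under target is an assumption added here for illustration, as is the function name.

```python
def next_near(current_near, actual_bits_cumulative, target_bits_cumulative,
              ratio, max_near=31):
    """Pick NEAR for the next strip from the cumulative bit budget."""
    if ratio == 0:
        # lossless request: NEAR is forced to 0
        return 0
    if actual_bits_cumulative > target_bits_cumulative:
        step = 1
    elif actual_bits_cumulative < target_bits_cumulative:
        step = -1  # assumption: relax NEAR when under budget
    else:
        step = 0
    return min(max(current_near + step, 0), max_near)
```

With the example inputs the result is `4`; at `current_near=31` the clamp holds the value, which corresponds to the reported `MAX_NEAR` miss condition.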
### 4.7 JLS Header Markers
- Standard clause: Annex C.2.2, Annex C.2.3, Annex C.2.4.1
- RTL module: `jls_header_writer`
- Standard variables: `P`, `Y`, `X`, `Nf`, `Ci`, `NEAR`, `ILV`, `MAXVAL`, `T1`, `T2`, `T3`, `RESET`
- RTL snippet ID: `JLS_HEADER_MARKERS`
- Example input: `PIX_WIDTH=8`, `strip_width=32`, `strip_height=16`, `NEAR=0`
- Example output: after the `SOI/SOF55/LSE/SOS` header, `EOI` is output once the strip payload flush completes
- Engineering notes: marker fields are output in big-endian byte order; `ofifo_wdata[8]` maps only to the first `0xFF` byte of the first strip's `SOI`.
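The header byte layout can be sketched with Python's `struct` module. The field order follows the JPEG-LS marker syntax; the component ID `1`, the sampling byte `0x11`, and the mapping-table selector `0` are assumptions for a single grayscale component, and `strip_header` is a hypothetical helper, not the RTL interface.

```python
import struct


def strip_header(pix_width, width, height, near, maxval, t1, t2, t3, reset):
    """Build SOI + SOF55 + LSE(id=1) + SOS for one grayscale strip frame."""
    soi = b"\xff\xd8"
    # SOF55: Lf=11, P, Y, X, Nf=1, C1, H1/V1, Tq1
    sof55 = b"\xff\xf7" + struct.pack(">HBHHBBBB", 11, pix_width, height, width,
                                      1, 1, 0x11, 0)
    # LSE preset coding parameters: Ll=13, ID=1, MAXVAL, T1, T2, T3, RESET
    lse = b"\xff\xf8" + struct.pack(">HBHHHHH", 13, 1, maxval, t1, t2, t3, reset)
    # SOS: Ls=8, Ns=1, Cs1, Tm1, NEAR, ILV=0, Al/Ah=0
    sos = b"\xff\xda" + struct.pack(">HBBBBBB", 8, 1, 1, 0, near, 0, 0)
    return soi + sof55 + lse + sos
```

For the example input this yields a 40-byte header whose first bytes are `FF D8 FF F7 00 0B 08 00 10 00 20`; note that `Y` (height) precedes `X` (width).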
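### 4.8 JLS Preset Defaults
- Standard clause: Annex C.2.4.1.1
- RTL module: `jls_preset_defaults`
- Standard variables: `MAXVAL`, `T1`, `T2`, `T3`, `RESET`, `NEAR`
- RTL snippet ID: `JLS_PRESET_DEFAULTS`
- Example input: `PIX_WIDTH=8`, `NEAR=0`
- Example output: `MAXVAL=255`, `T1=3`, `T2=7`, `T3=21`, `RESET=64`
- Engineering notes: this project supports only `PIX_WIDTH=8/10/12/14/16` and `NEAR<=31`, so the default threshold calculation reduces to shallow shift-add logic; if the range is extended later, the clamp path must be re-reviewed.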
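The default thresholds can be computed with the Python sketch below. It covers only the `MAXVAL >= 128` branch of the default rule, which is sufficient for the supported bit depths; the function names are hypothetical.

```python
def clamp_default(i: int, j: int, maxval: int) -> int:
    """Default-threshold clamp: fall back to j when i is out of [j, MAXVAL]."""
    return j if (i > maxval or i < j) else i


def preset_defaults(pix_width: int, near: int):
    """Return (MAXVAL, T1, T2, T3, RESET) for PIX_WIDTH >= 8."""
    maxval = (1 << pix_width) - 1
    factor = (min(maxval, 4095) + 128) // 256
    t1 = clamp_default(factor * (3 - 2) + 2 + 3 * near, near + 1, maxval)
    t2 = clamp_default(factor * (7 - 3) + 3 + 5 * near, t1, maxval)
    t3 = clamp_default(factor * (21 - 4) + 4 + 7 * near, t2, maxval)
    return maxval, t1, t2, t3, 64
```

Because `factor` saturates at 16 for 12 bits and above, the 12, 14, and 16-bit thresholds are identical at a given `NEAR`, which is why the hardware path stays shallow.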
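### 4.9 JLS Output Buffer
- Standard clause: Annex C.1-C.4
- RTL module: `jls_output_buffer`
- Standard variables: JPEG-LS marker stream byte order
- RTL snippet ID: `JLS_OUTPUT_BUFFER`
- Example input: `original_image_start=1`, `byte_data=8'hFF`
- Example output: `ofifo_wdata=9'h1FF`
- Engineering notes: the external `ofifo_full/ofifo_alfull` signals take no part in RTL flow control; if a write still occurs in simulation while `ofifo_full=1`, the module reports an error to expose external FIFO depth or system-level flow-control problems.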
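### 4.9a JLS Coding Parameters
- Standard clause: Annex A.2, Annex G.2
- RTL module: `jls_coding_params`
- Standard variables: `RANGE`, `qbpp`, `LIMIT`, `NEAR`
- RTL snippet ID: `JLS_CODING_PARAMS`
- Example input: `PIX_WIDTH=8`, `NEAR=0`
- Example output: `RANGE=256`, `qbpp=8`, `LIMIT=32`
- Engineering notes: this project limits `NEAR` to `0..31`, so a lookup table replaces runtime division; although this path carries strip-level control parameters, it is still held to the high-clock-frequency design constraints.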
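The lookup table contents can be generated offline from the defining formulas. A minimal Python sketch, assuming `MAXVAL = 2^PIX_WIDTH - 1`; the name `coding_params` is hypothetical.

```python
def coding_params(pix_width: int, near: int):
    """Return (RANGE, qbpp, LIMIT) for one PIX_WIDTH/NEAR pair."""
    maxval = (1 << pix_width) - 1
    range_ = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = (range_ - 1).bit_length()  # ceil(log2(RANGE))
    bpp = max(2, pix_width)
    limit = 2 * (bpp + max(8, bpp))
    return range_, qbpp, limit


# Table for every supported precision and NEAR value (5 x 32 entries).
TABLE = {(w, n): coding_params(w, n) for w in (8, 10, 12, 14, 16) for n in range(32)}
```

`TABLE[(8, 31)]` gives the `RANGE=6` used by the run-mode example in Section 4.4.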
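### 4.10 JLS Scan Control
- Standard clause: Annex A.8, Annex D.1-D.3
- RTL module: `jls_scan_ctrl`
- Standard variables: scan start/end control, `NEAR`
- RTL snippet ID: `JLS_SCAN_CONTROL`
- Example input: `strip_first_pixel=1`, `image_first_pixel=1`, `current_near=7`
- Example output: `strip_start_valid=1`, `original_image_first_strip=1`, `strip_near=0`
- Engineering notes: the first strip of an image is forced to use `NEAR=0`, so that the dynamic `NEAR` state left over from the previous image cannot affect the new image's header; later strips use the `jls_near_ctrl` output.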
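### 4.11 JLS Bit Packer
- Standard clause: Annex C.1-C.4, Annex H.2
- RTL module: `jls_bit_packer`
- Standard variables: JPEG-LS entropy-coded bitstream
- RTL snippet ID: `JLS_BIT_PACKER`
- Example input: `code_bits[63:56]=8'hFF`, `code_bit_count=8`
- Example output: if 7 more `1` data bits follow, the output is payload bytes `FF 7F`
- Engineering notes: only one zero bit is inserted after `0xFF`; this must not be simplified to traditional JPEG byte stuffing. On flush, the current byte is padded with 0 bits, guaranteeing that no unfinished bits remain before the `EOI` marker.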


@@ -0,0 +1,310 @@
# JPEG-LS FPGA Verification Plan
This document defines the verification ladder and report schema for the RTL
encoder described by `fpga/srs/jpeg_ls.md`.
## Verification Ladder
### Smoke
Purpose: catch structural and integration failures quickly.
Required checks:
- Reference tool path can decode a known-good project RTL output (`.rtljls`) or a generic `.jls`.
- Concatenated standalone strip-frame stream can be split.
- Split strip frames decode independently with CharLS.
- Recombined decoded strips match the reference PGM.
- If jpeg.org/libjpeg `jpeg` executable is available, it decodes the same strip
frames and matches CharLS.
Current command:
```powershell
$env:JLS_COMPAT_PYDEPS = (Resolve-Path tools/jls_compat/.deps).Path
python tools/jls_compat/make_strip_stream_smoke.py --width 32 --height 32 --strip-rows 16 --bit-depth 8 --name strip_smoke_8b
python tools/jls_compat/reference_decode_compare.py tools/jls_compat/out/strip_smoke_8b.jls --split-frames --expected-frames 2 --reference-pgm tools/jls_compat/out/strip_smoke_8b.pgm
```
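The frame-splitting check relies on the fact that no marker can occur inside a JPEG-LS scan payload. The Python sketch below illustrates the idea under the assumption of one scan per strip frame; it is not the code of `reference_decode_compare.py`, and `split_frames` is a hypothetical name.

```python
def split_frames(stream: bytes):
    """Split concatenated SOI...EOI JPEG-LS frames (one scan per frame)."""
    frames, pos = [], 0
    while pos < len(stream):
        if stream[pos:pos + 2] != b"\xff\xd8":
            raise ValueError(f"expected SOI at offset {pos}")
        start, pos = pos, pos + 2
        # walk the length-prefixed marker segments up to and including SOS
        while True:
            marker = stream[pos:pos + 2]
            if len(marker) < 2 or marker[0] != 0xFF:
                raise ValueError(f"expected marker at offset {pos}")
            length = int.from_bytes(stream[pos + 2:pos + 4], "big")
            pos += 2 + length
            if marker[1] == 0xDA:
                break
        # in the payload, 0xFF is always followed by a byte with MSB = 0,
        # so 0xFF followed by a byte >= 0x80 must be a marker
        while pos + 1 < len(stream) and not (stream[pos] == 0xFF and stream[pos + 1] >= 0x80):
            pos += 1
        if stream[pos:pos + 2] != b"\xff\xd9":
            raise ValueError("scan did not end with EOI")
        pos += 2
        frames.append(stream[start:pos])
    return frames
```

Parsing segment lengths instead of searching for `FF D9` avoids false matches inside header fields such as image dimensions.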
### RTL Smoke
Purpose: verify the first RTL integration before full entropy coding.
Required checks:
- `ififo_rd` obeys synchronous FIFO timing.
- SOF detection starts exactly one original image.
- Runtime dimensions and `ratio` are sampled at SOF.
- Invalid dimensions fall back to `6144 x 256`.
- Strip boundaries occur every `SCAN_ROWS` rows.
- Scan controller emits strip start/finish events and forwards pixels in order.
- Neighbor provider emits reconstructed `Ra/Rb/Rc/Rd` for top-row, left-edge,
middle-column, and right-edge cases. It covers the immediate `Rx == X`
commit path for `NEAR=0` lossless strips, including a no-recon consecutive
input case, and retains the true reconstructed writeback wait path for
`NEAR>0`. It also covers the `NEAR>0` same-row writeback-to-next-`Ra`
bypass and verifies that row transitions do not bypass bank/edge-state
updates.
- Mode router sends non-run contexts to the regular path, reconstructs matching
run pixels as `Ra`, accumulates run length, and emits interruption/EOL run
segments. It must remain in the Annex A.7 run loop while `run_length_accum`
is non-zero, so a later nonmatching sample is encoded as a run interruption
even if its gradients would not independently enter run mode.
- MED predictor computes `Px` for the three standard Ra/Rb/Rc comparison
cases and stalls cleanly when the downstream stage is not ready.
- Context quantizer computes `Q1/Q2/Q3`, absolute context index, context sign,
and run-mode flag for zero, positive, negative, and NEAR-zero gradients.
- Prediction corrector applies context variable `C[Q]` with context sign and
clamps the corrected prediction to `0..MAXVAL`.
- Regular error quantizer covers `Errval` normalization, reconstructed `Rx`,
and the `NEAR=31` reciprocal-LUT division path. Top-level compatibility
smokes must keep passing after regular-mode `Rx` is fed back at quantizer
acceptance instead of `mapped_done`.
- JPEG-LS default LSE preset parameters match the supported
8/10/12/14/16-bit threshold equations and NEAR clamp rule.
- Coding parameter lookup returns `RANGE`, `qbpp`, and `LIMIT` for representative
supported precisions, `NEAR=31`, and defensive NEAR clamp cases.
- Header writer emits exact `SOI/SOF55/LSE/SOS` bytes and trailing `EOI`.
- Dynamic NEAR controller updates by cumulative actual-vs-target bits, forces
ratio=0/invalid to NEAR=0, and reports the MAX_NEAR miss condition.
- Context memory lazily initializes all 365 regular contexts, latches the
A-init value, returns defaults for untouched contexts, supports registered
readback, and overwrites old state on re-init.
- Context model stalls same-context hazards until writeback and bypasses
same-cycle write/read values so a later event cannot read stale `A/B/C/N`.
- Context update arithmetic computes pre-update `k` and next `A/B/C/N` for
positive, negative, RESET-halving, and C-saturation cases.
- Error mapper converts positive, negative, and context-inverted `Errval`
values to `MErrval` and forwards `k/LIMIT/qbpp`.
- Run mode encodes zero-length run interruptions for `RItype=0/1`, emits
EOL run chunks from `RUNindex/J`, updates run-interruption contexts, and
preserves code-event ordering under its direct run-code interface. It also
covers a `NEAR=31` run-interruption case through the reciprocal division
pipeline.
- Golomb encoder emits the regular and LIMIT-path code events from `MErrval`,
`k`, `LIMIT`, and `qbpp` with left-aligned bit order.
- Bit packer packs left-aligned variable-length code events, handles JPEG-LS
0-bit stuffing after `0xFF`, and flushes partial bytes before EOI.
- Byte arbiter gives header/EOI bytes priority over payload bytes and preserves
`original_image_start` sideband only for the header stream.
- Output buffer preserves byte order and places `original_image_start` on
`ofifo_wdata[8]` for the corresponding byte event.
- Top-level idle smoke elaborates the integrated RTL and verifies that empty
input produces no FIFO read/write activity.
- Top-level all-zero run-mode smoke consumes a small 8-bit image, emits one
complete `SOI...EOI` strip frame, and checks that `ofifo_wdata[8]` appears
exactly once.
- `ofifo_wdata[8]` is high only on the first byte of the first strip frame.
- Output byte stream preserves strip order.
Current standalone RTL smoke command:
```powershell
fpga/sim/run_jls_smoke.ps1
```
Current top-level compatibility smoke command:
```powershell
fpga/sim/run_jls_top_compat_smoke.ps1
```
Current staged throughput command:
```powershell
fpga/sim/run_jls_throughput_regression.ps1
```
Notes:
- The throughput script uses `tb_jpeg_ls_encoder_top_run_smoke` with
`+CHECK_THROUGHPUT=1` and writes `tools/jls_compat/out/rtl_throughput_stats.csv`.
- Its default is staged and narrow: `PIX_WIDTH=8`, `ratio=1/2/3`,
`6144 x 256`, `IMAGE_COUNT=10`, and `PATTERN=9`.
- `PATTERN=9` rotates ten deterministic representative images across the
10-image stream. It covers smooth, gradient, checker, edge, low-gradient,
stripe, texture, and pseudo-noise style inputs.
- Full regression should pass `-BitsList 8,10,12,14,16` and then run the
reference decoder comparison flow on the generated streams.
- Smoke/compatibility scripts scan simulator output for `** Fatal` and non-zero
`Errors:` counts instead of relying only on process exit codes, because
QuestaSim can finish with exit code 0 after a testbench `$fatal` when the
command script ends with `quit`.
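The log-scanning rule in the last note can be sketched in Python as follows. The regular expressions assume transcript lines of the form `** Fatal` and `Errors: <n>`; the actual PowerShell scripts may match differently, and `log_failed` is a hypothetical name.

```python
import re

FATAL_RE = re.compile(r"\*\* Fatal")
ERRORS_RE = re.compile(r"Errors:\s*(\d+)")


def log_failed(log_text: str) -> bool:
    """Return True if the simulator transcript shows a fatal or any error count."""
    if FATAL_RE.search(log_text):
        return True
    return any(int(count) != 0 for count in ERRORS_RE.findall(log_text))
```

A run is then treated as passing only when the process exit code is zero and `log_failed` returns `False`.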
Current top-level compatibility status:
- `tb_jpeg_ls_encoder_top_run_smoke` writes
16x16 zero and row-major ramp outputs for `PIX_WIDTH=8/10/12/14/16`.
- `run_jls_top_compat_smoke.ps1` generates the matching reference PGMs and
verifies all smoke outputs with CharLS through `reference_decode_compare.py`.
- The same script also runs an 8-bit 16x32 ramp image as two 16-row strip
frames, splits the concatenated `SOI...EOI` stream, decodes both frames with
CharLS, and compares the stitched image with the reference PGM.
- The script runs a small 8-bit two-image all-zero stream with two SOF sideband
events, splits the two standalone JPEG-LS frames, and checks the stitched
decoded result against a 16x32 zero reference PGM. This is a smoke precursor
for the later 10-image throughput regression.
- The compatibility script also runs an 8-bit 16x32 `ratio=2` dynamic-NEAR ramp
case, splits and decodes both strip frames with CharLS, and compares against
the reference PGM with a bounded absolute-difference tolerance.
- The staged throughput script has been bring-up tested on a small
8-bit 16x16x10 `PATTERN=9` stream for `ratio=1/2/3` with the hard throughput
assertion disabled. This verifies the script, CSV report, and mixed-pattern
run/regular control path; it is not a 200 MPixel/s result.
- This is still a narrow compatibility smoke. It covers all-zero run-heavy
behavior, mixed regular/run ramp cases, one lossless two-strip ramp case, and
one near-lossless dynamic-NEAR two-strip ramp case, but it does not replace
the later larger-image and throughput regressions.
Equivalent manual commands:
```powershell
vlog -sv fpga/verilog/jls_preset_defaults.sv fpga/sim/tb_jls_preset_defaults.sv
vsim -c tb_jls_preset_defaults -do "run -all; quit"
vlog -sv fpga/verilog/jls_coding_params.sv fpga/sim/tb_jls_coding_params.sv
vsim -c tb_jls_coding_params -do "run -all; quit"
vlog -sv fpga/verilog/jls_common_pkg.sv fpga/verilog/jls_input_ctrl.sv fpga/sim/tb_jls_input_ctrl.sv
vsim -c tb_jls_input_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_scan_ctrl.sv fpga/sim/tb_jls_scan_ctrl.sv
vsim -c tb_jls_scan_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_neighbor_provider.sv fpga/sim/tb_jls_neighbor_provider.sv
vsim -c tb_jls_neighbor_provider -do "run -all; quit"
vlog -sv fpga/verilog/jls_mode_router.sv fpga/sim/tb_jls_mode_router.sv
vsim -c tb_jls_mode_router -do "run -all; quit"
vlog -sv fpga/verilog/jls_predictor.sv fpga/sim/tb_jls_predictor.sv
vsim -c tb_jls_predictor -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_quantizer.sv fpga/sim/tb_jls_context_quantizer.sv
vsim -c tb_jls_context_quantizer -do "run -all; quit"
vlog -sv fpga/verilog/jls_prediction_corrector.sv fpga/sim/tb_jls_prediction_corrector.sv
vsim -c tb_jls_prediction_corrector -do "run -all; quit"
vlog -sv fpga/verilog/jls_common_pkg.sv fpga/verilog/jls_header_writer.sv fpga/sim/tb_jls_header_writer.sv
vsim -c tb_jls_header_writer -do "run -all; quit"
vlog -sv fpga/verilog/jls_near_ctrl.sv fpga/sim/tb_jls_near_ctrl.sv
vsim -c tb_jls_near_ctrl -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_memory.sv fpga/sim/tb_jls_context_memory.sv
vsim -c tb_jls_context_memory -do "run -all; quit"
vlog -sv fpga/verilog/jls_context_update.sv fpga/sim/tb_jls_context_update.sv
vsim -c tb_jls_context_update -do "run -all; quit"
vlog -sv fpga/verilog/jls_error_mapper.sv fpga/sim/tb_jls_error_mapper.sv
vsim -c tb_jls_error_mapper -do "run -all; quit"
vlog -sv fpga/verilog/jls_run_mode.sv fpga/sim/tb_jls_run_mode.sv
vsim -c tb_jls_run_mode -do "run -all; quit"
vlog -sv fpga/verilog/jls_golomb_encoder.sv fpga/sim/tb_jls_golomb_encoder.sv
vsim -c tb_jls_golomb_encoder -do "run -all; quit"
vlog -sv fpga/verilog/jls_bit_packer.sv fpga/sim/tb_jls_bit_packer.sv
vsim -c tb_jls_bit_packer -do "run -all; quit"
vlog -sv fpga/verilog/jls_byte_arbiter.sv fpga/sim/tb_jls_byte_arbiter.sv
vsim -c tb_jls_byte_arbiter -do "run -all; quit"
vlog -sv fpga/verilog/jls_output_buffer.sv fpga/sim/tb_jls_output_buffer.sv
vsim -c tb_jls_output_buffer -do "run -all; quit"
vlog -sv -f fpga/verilog/jpeg_ls_rtl.f fpga/sim/tb_jpeg_ls_encoder_top_idle.sv
vsim -c tb_jpeg_ls_encoder_top_idle -do "run -all; quit"
vlog -sv -f fpga/verilog/jpeg_ls_rtl.f fpga/sim/tb_jpeg_ls_encoder_top_run_smoke.sv
vsim -c tb_jpeg_ls_encoder_top_run_smoke -do "run -all; quit"
```
### Small Regression
Purpose: verify basic JPEG-LS algorithm correctness.
Image matrix:
| Bit depth | Ratio | Pattern | Size |
| ---: | ---: | --- | --- |
| 8 | 0 | gradient | 16 x 16 |
| 8 | 1 | checker | 32 x 32 |
| 10 | 1 | gradient | 32 x 32 |
| 12 | 2 | edge | 32 x 32 |
| 14 | 2 | ramp | 32 x 32 |
| 16 | 0 | gradient | 16 x 16 |
| 16 | 3 | checker | 32 x 32 |
Pass/fail:
- `ratio=0`: decoded pixels exactly match input.
- `ratio=1/2/3`: per-pixel error is less than or equal to that strip frame's
actual NEAR.
- All output strip frames decode with CharLS.
- If libjpeg executable is present, all output strip frames decode with libjpeg.
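The per-strip pixel rule can be checked with a few lines of Python. This sketch assumes the original and decoded strips are available as flat sequences of integer samples; the function names are hypothetical.

```python
def max_abs_error(original, decoded) -> int:
    """Largest absolute per-pixel difference between two strips."""
    if len(original) != len(decoded):
        raise ValueError("pixel count mismatch")
    return max((abs(o - d) for o, d in zip(original, decoded)), default=0)


def strip_meets_rule(original, decoded, near: int) -> bool:
    """Lossless when near == 0, otherwise bounded by the strip's actual NEAR."""
    return max_abs_error(original, decoded) <= near
```

Passing `near=0` covers the `ratio=0` case, since the bound then requires an exact match.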
### Full Regression
Purpose: enforce the hard SRS requirements.
Required cases:
- All supported `PIX_WIDTH`: 8, 10, 12, 14, 16.
- `ratio=0/1/2/3`.
- Default image size `6144 x 256`.
- Maximum row width `6144`.
- Maximum row count boundary `4096`.
- Minimum legal size `16 x 16`.
- At least 10 representative images covering smooth, gradient, noise, edge, and
texture scenes.
- Continuous 10-image throughput test for `ratio=1/2/3`.
- The staged throughput script is the executable entry point for this test. The
  full pass criterion additionally requires CharLS and jpeg.org/libjpeg reference
  decode, to be enabled once the long simulations are mature enough to run.
Pass/fail:
- Complete `.rtljls`/`.jls` stream is split into the expected number of strip frames.
- Every strip frame decodes with CharLS.
- Every strip frame decodes with jpeg.org/libjpeg; if the executable is missing
in full regression, the run is FAIL.
- CharLS and libjpeg decoded pixels match each other.
- Decoded pixels meet the lossless or near-lossless error rule.
- Compression-ratio error is within the SRS threshold or reported FAIL when
`NEAR=31` cannot satisfy the target.
- Average input throughput is at least 200 MPixel/s for `ratio=1/2/3`, excluding
upstream `ififo_empty` waits and including internal stalls.
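The throughput criterion can be expressed as a small Python calculation. This sketch assumes the 250 MHz target clock from the interface rules and cycle counters reported by the testbench; the function name is hypothetical.

```python
def throughput_mpix_s(pixels: int, total_cycles: int,
                      ififo_empty_wait_cycles: int, clk_mhz: float = 250.0) -> float:
    """Average input throughput, excluding upstream-empty cycles only."""
    active_cycles = total_cycles - ififo_empty_wait_cycles
    if active_cycles <= 0:
        raise ValueError("no active cycles")
    # internal stalls stay in active_cycles, so they lower the result
    return pixels * clk_mhz / active_cycles
```

At 250 MHz, meeting 200 MPixel/s means accepting at least 0.8 pixel per active cycle on average.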
## Report Schema
The regression report should be JSON or CSV with equivalent fields.
Image-level fields:
| Field | Description |
| --- | --- |
| `case_id` | Stable test case name. |
| `pix_width` | RTL `PIX_WIDTH`. |
| `ratio` | Runtime ratio port value. |
| `cfg_pic_col` | Runtime configured width. |
| `cfg_pic_row` | Runtime configured height. |
| `active_pic_col` | Actual effective width after fallback. |
| `active_pic_row` | Actual effective height after fallback. |
| `strip_rows` | `SCAN_ROWS`. |
| `strip_count` | Number of standalone JPEG-LS strip frames. |
| `output_bytes` | Total generated byte count across all strips. |
| `raw_bits` | `active_pic_col * active_pic_row * PIX_WIDTH`. |
| `actual_bits` | `output_bytes * 8`. |
| `target_bits` | Target bit count from `ratio`. |
| `compression_ratio` | `raw_bits / actual_bits`. |
| `max_error` | Whole-image maximum absolute reconstruction error. |
| `charls_status` | `PASS`, `FAIL`, or `SKIP`. |
| `libjpeg_status` | `PASS`, `FAIL`, or `SKIP`. |
| `compat_status` | `PASS` only if all required reference decoders agree. |
| `throughput_mpix_s` | Average input throughput for the case. |
| `pause_cycles_total` | Total internal pause cycles included in throughput. |
| `ififo_empty_wait_cycles` | Upstream-empty cycles excluded from throughput. |
| `outbuf_max_watermark` | Maximum internal output-buffer occupancy. |
| `result` | Overall `PASS` or `FAIL`. |
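The derived size fields in the image-level table follow directly from their definitions. A minimal Python sketch, with a hypothetical function name; `target_bits` is omitted because its mapping from `ratio` is defined by the SRS, not here.

```python
def image_size_fields(active_pic_col: int, active_pic_row: int,
                      pix_width: int, output_bytes: int) -> dict:
    """Compute raw_bits, actual_bits, and compression_ratio for one image."""
    raw_bits = active_pic_col * active_pic_row * pix_width
    actual_bits = output_bytes * 8
    return {
        "raw_bits": raw_bits,
        "actual_bits": actual_bits,
        "compression_ratio": raw_bits / actual_bits,
    }
```

For a default `6144 x 256` 8-bit image that compresses to 786432 bytes, the reported `compression_ratio` is 2.0.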
Strip-level fields:
| Field | Description |
| --- | --- |
| `case_id` | Parent test case name. |
| `strip_index` | Zero-based strip index. |
| `strip_y0` | First original-image row in the strip. |
| `strip_height` | Strip frame height. |
| `near` | NEAR used in this strip frame's `SOS`. |
| `output_bytes` | Byte count for this standalone JPEG-LS frame. |
| `actual_bits_cumulative` | Cumulative actual bit count after this strip. |
| `target_bits_cumulative` | Cumulative target bit count after this strip. |
| `max_error` | Maximum reconstruction error in this strip. |
| `pause_cycles` | Internal pause cycles attributed to this strip. |
| `outbuf_max_watermark` | Maximum buffer occupancy during this strip. |
## Compatibility Notes
- `tools/jls_compat/reference_decode_compare.py --split-frames` is the reference
tool for concatenated strip-frame streams.
- Smoke runs may skip libjpeg when no `jpeg` executable is available.
- Full regression must use `--require-libjpeg`.
- If CharLS and libjpeg disagree, mark `compat_status=FAIL` and preserve the
`.rtljls`/`.jls`, decoded images, command log, and tool versions.

docs/jls_work_plan.md Normal file

@@ -0,0 +1,357 @@
# JPEG-LS FPGA Long-Term Work Plan
This plan is the execution companion to `fpga/srs/jpeg_ls.md`. The SRS is the
source of requirements. This file records the planned execution order, current
status, and the points where user confirmation is required.
## Working Rule
- Raise user-confirmation questions as early as possible.
- Do not stop when a reasonable engineering assumption is safe and reversible.
- Stop for user confirmation only when the decision changes an external
interface, JPEG-LS stream structure, hard performance target, verification
pass/fail rule, licensing/dependency boundary, or resource parameter that the
SRS says requires review.
- Every new or changed user requirement must be added to `fpga/srs/jpeg_ls.md`.
- `fpga/srs/jpeg_ls_design.drawio` is maintained only when the user explicitly
asks for draw.io updates.
## Current Confirmed Direction
- First RTL path: one original image is split into horizontal strip frames.
- Each strip frame is a complete standalone grayscale JPEG-LS frame:
`SOI ... EOI`.
- `ofifo_wdata[8]` is asserted only on the first byte of the first strip frame
of the original input image.
- Primary smoke decoder: CharLS via `tools/jls_compat/reference_decode_compare.py`.
- Additional reference decoder: jpeg.org/libjpeg command line tool when
available.
## Phase 1: Requirements And Compatibility
Status: in progress.
Deliverables:
- `fpga/srs/jpeg_ls.md` requirement baseline.
- `tools/jls_compat/duplicate_sos_probe.py`.
- `tools/jls_compat/reference_decode_compare.py`.
- `third_party/charls`.
- `third_party/libjpeg`.
Remaining work:
- Add a concatenated strip-frame smoke test once a simple local encoder or RTL
bitstream generator exists.
- Build or provide jpeg.org/libjpeg `jpeg` executable for full reference
comparison.
## Phase 2: Architecture And Interfaces
Status: in progress.
Deliverables:
- Module interface specification.
- Mermaid algorithm/pipeline flow: `docs/jls_pipeline_mermaid.md`.
- Top-level port list frozen against the SRS.
- Internal valid/stall contract for the high-throughput pipeline.
- Header/output-buffer contract for strip-frame sequencing.
- Context table and line-buffer access contract.
Planned module order:
- `jpeg_ls_encoder_top`
- `jls_input_ctrl`
- `jls_preset_defaults`
- `jls_scan_ctrl`
- `jls_header_writer`
- `jls_near_ctrl`
- `jls_predictor`
- `jls_context_model`
- `jls_neighbor_provider`
- `jls_golomb_encoder`
- `jls_bit_packer`
- `jls_byte_arbiter`
- `jls_output_buffer`
- `jls_run_mode`
Stop-for-confirmation triggers:
- Changing any external port.
- Changing `OUT_BUF_BYTES` or `OUT_BUF_AFULL_MARGIN`.
- Changing strip-frame output semantics.
- Dropping a supported `PIX_WIDTH`.
- Reducing the `ratio=1/2/3` 200 MPixel/s performance requirement.
## Phase 3: Verification Scaffold
Status: in progress.
Deliverables:
- Raw image generator for 8/10/12/14/16-bit grayscale tests.
- Reference decode and compare runner.
- Strip-frame splitter/combiner for validation reports.
- Report schema for per-strip and whole-image metrics.
- QuestaSim smoke test scripts and smoke testbenches.
Planned smoke set:
- Tiny 8-bit lossless image.
- Tiny 16-bit lossless image.
- 16x16 near-lossless image.
- Two-strip image to validate `ofifo_wdata[8]` and strip ordering.
Current progress:
- `tools/jls_compat/make_strip_stream_smoke.py` generates concatenated
standalone strip-frame streams for tool smoke tests.
- `tools/jls_compat/reference_decode_compare.py` splits concatenated streams,
decodes each strip with CharLS, optionally decodes with jpeg.org/libjpeg, and
compares the vertically recombined image to a reference PGM.
## Phase 4: RTL Implementation
Status: in progress.
Implementation order:
- Package/constants and shared type definitions.
- `jls_input_ctrl`.
- `jls_header_writer`.
- `jls_output_buffer`.
- Minimal top-level path for header-only and strip-frame sequencing tests.
- Predictor/context/run-mode/Golomb/bit-packer pipeline.
- Dynamic `NEAR` update.
Coding rules:
- Use SystemVerilog.
- Use `always_ff` with nonblocking assignments only.
- Use `always_comb` with blocking assignments.
- Do not define variables inside procedural blocks.
- Do not use `task`.
- Avoid complex functions.
- Split complex decisions across pipeline cycles.
- Use meaningful English comments and standard traceability comments.
- Keep RTL design files free of `ifdef/ifndef SYNTHESIS`,
`translate_off/on`, and design-embedded pass/fail checks; verification-only
checks belong in testbenches, monitors, scoreboards, or scripts.
Current progress:
- `fpga/verilog/jls_common_pkg.sv` defines shared constants and simple enums.
- `fpga/verilog/jpeg_ls_encoder_top.sv` now instantiates the input, scan,
header, regular/run-mode entropy, bit-packer, byte-arbiter, output-buffer,
and dynamic-NEAR modules as a functional top-level smoke integration.
- `fpga/verilog/jls_preset_defaults.sv` computes JPEG-LS default LSE preset
coding parameters for the supported grayscale bit depths and NEAR range.
- `fpga/verilog/jls_coding_params.sv` looks up strip-level `RANGE`, `qbpp`, and
`LIMIT` for supported `PIX_WIDTH` and `NEAR=0..31`, avoiding runtime division.
- `fpga/verilog/jls_input_ctrl.sv` implements FIFO read alignment, SOF gating,
runtime dimension fallback, coordinate generation, and strip/image boundary
flags.
- `fpga/verilog/jls_scan_ctrl.sv` converts input pixel boundary flags into
strip start/finish commands and forwards pixels to the encode pipeline.
- `fpga/verilog/jls_neighbor_provider.sv` provides reconstructed
`Ra/Rb/Rc/Rd` samples using two line banks. For `NEAR=0`, it commits the
original sample immediately because lossless `Rx == X`, removing the
reconstruction feedback bubble. For `NEAR>0`, it keeps one pixel
outstanding until the true reconstructed sample returns, but accepts the next
same-row pixel on the same clock as a non-EOL writeback by bypassing that
returned `Rx` as the next pixel's `Ra`.
- `fpga/verilog/jls_mode_router.sv` performs the first regular/run decision,
forwards regular pixels, accumulates run pixels, reconstructs run pixels as
`Ra`, and emits complete run segments for `jls_run_mode`.
- `fpga/verilog/jls_predictor.sv` implements a registered MED prediction stage
from reconstructed neighbor inputs `Ra/Rb/Rc/Rd`; the separate
`jls_neighbor_provider` supplies those neighbors from reconstructed history.
- `fpga/verilog/jls_context_quantizer.sv` computes `D1/D2/D3`, quantizes them
to `Q1/Q2/Q3`, and emits the absolute context index, sign, and run-mode flag.
- `fpga/verilog/jls_prediction_corrector.sv` applies context variable `C[Q]`
with context sign to `Px` and clamps the corrected prediction to `0..MAXVAL`.
- `fpga/verilog/jls_context_memory.sv` stores the 365 regular-mode contexts,
uses lazy strip initialization by clearing a written-bit vector and returning
default `A/B/C/N` for untouched contexts, and provides registered read/write
ports without a 365-cycle boundary sweep.
- `fpga/verilog/jls_context_model.sv` wraps `jls_context_memory` and forwards
quantized context events with the read `A/B/C/N` variables. It now tracks
in-flight regular contexts, stalls same-context reads until writeback, and
bypasses a same-cycle write/read pair so the next event cannot read stale
Annex A.6 state.
- `fpga/verilog/jls_context_update.sv` computes one regular context's
pre-update `k`, next `A/B/C/N`, mapped-error inversion flag, context-index
writeback metadata, and strip-last metadata.
- `fpga/verilog/jls_regular_error_quantizer.sv` computes regular-mode
`Errval`, reconstructed sample `Rx`, modulo error normalization, and forwards
pre-update context variables. `NEAR>0` uses an exact reciprocal-LUT multiply
and quotient-correction pipeline instead of a one-bit-per-cycle divider.
- `fpga/verilog/jpeg_ls_encoder_top.sv` returns regular-mode `Rx` to
`jls_neighbor_provider` as soon as the regular error-quantizer result is
accepted. Context update, mapped-error generation, Golomb coding, and bit
packing remain ordered by their own handshakes, but line history no longer
waits for `mapped_done` on the regular path.
- `fpga/verilog/jls_header_writer.sv` emits standalone strip-frame
`SOI/SOF55/LSE/SOS` headers and trailing `EOI` markers.
- `fpga/verilog/jls_near_ctrl.sv` applies the first-version cumulative
actual-vs-target dynamic NEAR step and reports the MAX_NEAR miss condition.
- `fpga/verilog/jls_error_mapper.sv` maps signed `Errval` into non-negative
`MErrval` with the standard context-correction inversion and forwards
`k/LIMIT/qbpp` to the Golomb encoder.
- `fpga/verilog/jls_run_mode.sv` implements a run-segment entropy helper:
direct run-length code events, standard `RUNindex/J` updates, RItype 0/1
run-interruption context variables, `MErrval/k/limit` generation, and
reconstructed interruption sample output. The run-interruption `NEAR>0`
quantizer uses the same reciprocal-LUT multiply and quotient-correction
pipeline as the regular path.
- `fpga/verilog/jls_golomb_encoder.sv` generates left-aligned Golomb code
events from standard `MErrval`, `k`, `LIMIT`, and `qbpp` inputs, including
the LIMIT fallback path.
- `fpga/verilog/jls_bit_packer.sv` packs left-aligned variable-length code
events into JPEG-LS scan payload bytes, including 0-bit stuffing after
`0xFF` data bytes and zero-padded flush before EOI.
- `fpga/verilog/jls_byte_arbiter.sv` arbitrates header/EOI bytes ahead of
payload bytes before the internal output buffer.
- `fpga/verilog/jls_output_buffer.sv` buffers encoded byte events and drains
them to the fixed 9-bit output FIFO interface. The RTL does not react to the
external full flags.
- `fpga/verilog/jpeg_ls_rtl.f` lists the current RTL compilation order.
- RTL design files have had all previous `ifndef SYNTHESIS` diagnostic blocks
removed so normal simulation and `+define+SYNTHESIS` compilation use the same
design logic.
- `fpga/sim/tb_jls_preset_defaults.sv`, `fpga/sim/tb_jls_coding_params.sv`,
`fpga/sim/tb_jls_input_ctrl.sv`,
`fpga/sim/tb_jls_scan_ctrl.sv`, `fpga/sim/tb_jls_neighbor_provider.sv`,
`fpga/sim/tb_jls_neighbor_provider_near_bypass.sv`,
`fpga/sim/tb_jls_mode_router.sv`, `fpga/sim/tb_jls_header_writer.sv`,
`fpga/sim/tb_jls_predictor.sv`, `fpga/sim/tb_jls_context_quantizer.sv`,
`fpga/sim/tb_jls_prediction_corrector.sv`, `fpga/sim/tb_jls_near_ctrl.sv`,
`fpga/sim/tb_jls_context_memory.sv`, `fpga/sim/tb_jls_context_update.sv`,
`fpga/sim/tb_jls_error_mapper.sv`, `fpga/sim/tb_jls_run_mode.sv`,
`fpga/sim/tb_jls_golomb_encoder.sv`, `fpga/sim/tb_jls_bit_packer.sv`,
`fpga/sim/tb_jls_byte_arbiter.sv`, `fpga/sim/tb_jls_output_buffer.sv`,
`fpga/sim/tb_jpeg_ls_encoder_top_idle.sv`, and
`fpga/sim/tb_jpeg_ls_encoder_top_run_smoke.sv` cover the current standalone,
idle-integration, tiny run-mode top-level, and small back-to-back multi-image
smoke checks.
- `fpga/sim/run_jls_smoke.ps1` runs the current RTL smoke compile/sim sequence.
- `fpga/sim/run_jls_top_compat_smoke.ps1` compiles the current top-level RTL,
runs 16x16 all-zero and row-major ramp image smokes for
`PIX_WIDTH=8/10/12/14/16`, also runs lossless and `ratio=2` dynamic-NEAR
8-bit 16x32 two-strip ramp smokes plus a two-image zero stream smoke, and
checks all results with CharLS against generated reference PGMs.
- `fpga/sim/run_jls_throughput_regression.ps1` is the staged entry point for
the SRS continuous 10-image throughput check. It runs `6144 x 256` images with
`IMAGE_COUNT=10`, `ratio=1/2/3`, and `+CHECK_THROUGHPUT=1`, and
appends CSV stats to `tools/jls_compat/out/rtl_throughput_stats.csv`.
- The RTL smoke scripts now treat simulator `$fatal`/non-zero error summaries
as failures even when the simulator process exits with code 0.
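
As a cross-check for the `jls_coding_params` lookup and the `jls_golomb_encoder` LIMIT fallback listed above, a small Python reference model can be useful. The sketch below uses the `RANGE`, `qbpp`, and `LIMIT` formulas and the limited-length Golomb code as defined in ITU-T T.87, and assumes `MAXVAL = 2^PIX_WIDTH - 1`. The function names are illustrative, and the model returns a bit string rather than the left-aligned code events used by the RTL.

```python
def coding_params(pix_width: int, near: int) -> tuple[int, int, int]:
    """Return (RANGE, qbpp, LIMIT) for MAXVAL = 2**pix_width - 1."""
    maxval = (1 << pix_width) - 1
    rng = (maxval + 2 * near) // (2 * near + 1) + 1
    qbpp = (rng - 1).bit_length()      # ceil(log2(RANGE))
    bpp = max(2, pix_width)            # ceil(log2(MAXVAL + 1)), at least 2
    limit = 2 * (bpp + max(8, bpp))
    return rng, qbpp, limit


def golomb_limited(merrval: int, k: int, limit: int, qbpp: int) -> str:
    """Return the limited-length Golomb code of MErrval as a bit string."""
    escape_len = limit - qbpp - 1
    high = merrval >> k
    if high < escape_len:
        # Unary prefix, terminating 1, then the k low-order bits.
        low = format(merrval & ((1 << k) - 1), f"0{k}b") if k else ""
        return "0" * high + "1" + low
    # LIMIT fallback: escape prefix, then MErrval - 1 in qbpp bits.
    return "0" * escape_len + "1" + format(merrval - 1, f"0{qbpp}b")
```
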
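
The stuffing rule handled by `jls_bit_packer` can be modeled in a few lines. This is a hedged Python sketch, not a description of the RTL: it takes hypothetical right-aligned `(value, length)` code events, emits bits MSB-first, allows only 7 payload bits in the byte that follows a `0xFF` byte, and zero-pads the last partial byte. Any special handling when the final scan byte is itself `0xFF` is not modeled.

```python
def pack_bits(codes: list[tuple[int, int]]) -> bytes:
    """Pack (value, length) code events MSB-first with JPEG-LS bit stuffing."""
    out = bytearray()
    acc = 0      # bits collected for the byte being built
    count = 0    # number of bits currently held in acc
    room = 8     # payload bits in the current byte: 7 after a 0xFF byte
    for value, length in codes:
        for shift in range(length - 1, -1, -1):
            acc = (acc << 1) | ((value >> shift) & 1)
            count += 1
            if count == room:
                out.append(acc)
                room = 7 if acc == 0xFF else 8
                acc = 0
                count = 0
    if count:
        # Flush: pad the final partial byte with zero bits.
        out.append(acc << (room - count))
    return bytes(out)
```
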

Known implementation gaps:

- Top-level run mode now has a conservative functional path through
`jls_mode_router` and `jls_run_mode`. It is covered by all-zero and ramp
tiny-image smokes across all supported bit depths, a two-strip 8-bit ramp
smoke, and a small dynamic-NEAR near-lossless smoke, each checked by CharLS
reference decode. Larger images and high-throughput ordering optimization
remain open.
- The run scanner can now overlap non-EOL matching run pixels with an
outstanding run segment, while still blocking regular/interruption/EOL entropy
emission until the previous segment completes. The 8-bit all-zero top smoke
improved from 2670 ns to 2266 ns after this change.
- The run scanner also remains in the Annex A.7 run loop while
`run_length_accum` is non-zero. A checkerboard top-level smoke exposed that
the previous implementation could reclassify the next nonmatching pixel by
gradients and stall forever behind a pending run segment; this is now covered
in `tb_jls_mode_router`.
- The top-level Golomb busy tracker now handles the same-cycle case where a
previous mapped event completes while a new run mapped event is accepted, so
the new run segment cannot lose its busy ownership bit.
- The lossless `NEAR=0` neighbor feedback bubble has been reduced by immediate
`Rx == X` commit, and the regular/run `NEAR>0` arithmetic divider bottleneck
has been reduced to a short pipelined reciprocal-LUT multiply/correction
path. Regular-mode `NEAR>0` reconstructed-sample feedback now returns after
error-quantizer acceptance instead of waiting for downstream Golomb coding.
Same-row `NEAR>0` line-history feedback can also accept the next pixel on
the same clock as a non-EOL reconstructed writeback by bypassing `Rx` to
`Ra`. The 8-bit 16x32 two-strip `ratio=2` dynamic-NEAR smoke improved from
22718 ns before these feedback changes to 15582 ns.
The fixed 365-cycle context table clear at strip start has also been removed
by lazy written-bit initialization. Remaining 200 MPixel/s risks are the
`NEAR>0` one-pixel line-history feedback dependency, run-segment entropy
ordering stalls, and missing full near-lossless/larger-image throughput
regression.
- Full 10-image default-size throughput and full CharLS/libjpeg decode
regressions remain to be run. The throughput script is now present but was
not executed in this iteration because the run is intentionally long.
- A small 8-bit 16x16x10 mixed-pattern staged throughput bring-up passed for
`ratio=1/2/3` with the hard throughput assertion disabled. The report showed
2560 input pixels, 10297 input cycles, and `throughput_mpix_x1000=62154` for
each ratio; because the test is single-strip per image and intentionally
tiny, it is only a script/control-path check, not a performance conclusion.
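
The reported bring-up figure is consistent with throughput computed as input pixels per input cycle at the 250 MHz clock target. The sketch below assumes that derivation; the formula is inferred from the numbers above rather than taken from the script, and the function names are illustrative.

```python
CLK_HZ = 250_000_000  # assumed: the 250 MHz clock target


def throughput_mpix_x1000(input_pixels: int, input_cycles: int) -> int:
    """Average input throughput in MPixel/s, scaled by 1000 and truncated."""
    return input_pixels * (CLK_HZ // 1000) // input_cycles


def max_cycles_for_target(input_pixels: int, target_mpix: int = 200) -> int:
    """Largest input cycle count that still meets the target throughput."""
    return input_pixels * (CLK_HZ // 1_000_000) // target_mpix
```

With 2560 pixels in 10297 cycles this gives 62154, that is about 62.2 MPixel/s. Under the same assumption, the full `6144 x 256` 10-image run would need to finish its input phase within 19,660,800 cycles (1.25 cycles per pixel) to reach 200 MPixel/s.
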
## Phase 5: Integration And Regression

Status: in progress.

Deliverables:

- End-to-end RTL simulation producing `.rtljls` output.
- CharLS decode and pixel compare.
- libjpeg decode and pixel compare when executable is available.
- Per-strip `NEAR`, bit count, max-error, compression-ratio report.
- Throughput and stall report.

Pass/fail focus:

- Standard decodability.
- Lossless exactness for `ratio=0`.
- Near-lossless error bound for `ratio=1/2/3`.
- 200 MPixel/s average input throughput for the required multi-frame test.
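
The exactness and error-bound criteria reduce to one per-strip comparison. The following Python sketch is illustrative only: the function names are hypothetical, images are plain row lists, and `near` is assumed to be the `NEAR` value signaled for that strip, since dynamic `NEAR` can change between strips.

```python
def max_abs_error(reference: list[list[int]], decoded: list[list[int]]) -> int:
    """Largest per-sample absolute difference between two same-sized images."""
    if len(reference) != len(decoded) or any(
        len(ref_row) != len(dec_row) for ref_row, dec_row in zip(reference, decoded)
    ):
        raise ValueError("image dimensions differ")
    diffs = (
        abs(r - d)
        for ref_row, dec_row in zip(reference, decoded)
        for r, d in zip(ref_row, dec_row)
    )
    return max(diffs, default=0)


def strip_passes(reference: list[list[int]], decoded: list[list[int]], near: int) -> bool:
    """NEAR = 0 requires an exact match; otherwise every error must be <= NEAR."""
    return max_abs_error(reference, decoded) <= near
```
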
## Phase 6: Synthesis And Timing

Status: in progress.

Deliverables:

- Vivado project/scripts under `fpga/synthesis`.
- Synthesis report.
- Fmax report for 250 MHz target.
- Resource report and timing-risk notes.

Current target:

- FPGA part: `xc7vx690tffg1761-2`.
- Quick synthesis scripts are kept under `fpga/synthesis`, but synthesis is not
run as an automatic step until all RTL modules are fully implemented, unless
the user explicitly requests it.
- `fpga/synthesis/quick_synth.tcl` reads the complete RTL compilation list from
`fpga/verilog/jpeg_ls_rtl.f`, so future top-level quick-synthesis reports are
not based on an early partial RTL subset.
- DSP timing work split the NEAR-dependent multiplier paths into registered
operand and product stages. `jls_run_mode` registers the run-interruption
reconstruction multiplier operands, and `jls_context_update` now stages the
Annex A.6 `B[Q] += Errval*(2*NEAR+1)` multiplier as input operands, product,
and accumulation stages.
- `jls_scan_ctrl` now has a one-entry registered slot between the input
controller and downstream strip/encode pipeline. It breaks the direct
`pixel_valid` to strip-start/context CE control path while still allowing the
slot to drain and refill in the same cycle for steady one-pixel-per-cycle
operation.
- Timing-violation triage now exports every negative-slack path with
`fpga/synthesis/report_timing_violations.tcl`. The flow first classifies
paths by DSP48 usage and logic level, then optimizes non-DSP paths with logic
depth greater than one before rechecking whether DSP paths still dominate.
- Kept non-DSP timing fixes: `jls_context_model` decouples `result_next_*`
write enables from downstream ready; `jls_scan_ctrl` registers
`enc_row_last_pixel` so `jls_neighbor_provider` does not recompute row-last
from `strip_width` on the `Rd` RAM-read path; `jls_regular_error_quantizer`
accepts the next input in `STATE_IDLE` and waits for output space only at
`STATE_FINISH`.
- Latest quick synthesis result: target part `xc7vx690tffg1761-2`,
4.000 ns constraint, WNS -0.615 ns, TNS -175.754 ns,
537 failing endpoints. Resource use is 22895 LUT, 6308 registers,
3.5 BRAM tiles, and 14 DSPs. The rough OOC synthesis frequency estimate is
about 216.7 MHz and must not be treated as final implementation Fmax.
- The current worst path is still DSP-related: `context_update_i/s1_near_scale_reg[6]`
to `context_update_i/s2_B_delta_reg/PCIN[*]`, with 3.557 ns data path delay.
Manual high/low half-word partial-product splitting of this 33x8 multiply
worsened WNS to -1.468 ns and was reverted.
The latest violation export has 260 DSP paths and 277 non-DSP logic-level>1
paths; the worst non-DSP path is now the `context_model_i/context_busy_reg[*]`
to `predictor_i/*/CE` ready/hazard chain at -0.274 ns slack.
- Naive extra buffering on the top-level context/prediction boundary and broad
`max_fanout` attributes worsened WNS in quick synthesis and were reverted.
Future control-path optimizations should be validated by quick synthesis
before being kept.
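
The rough frequency estimate quoted above matches the usual relation between the clock constraint and worst negative slack, `Fmax = 1 / (T - WNS)`. A minimal Python sketch of that arithmetic, assuming this is how the figure was derived (the function name is illustrative):

```python
def fmax_estimate_mhz(period_ns: float, wns_ns: float) -> float:
    """Frequency at which the worst path would just meet timing."""
    return 1000.0 / (period_ns - wns_ns)


# 4.000 ns constraint with WNS -0.615 ns: about 216.7 MHz.
latest_mhz = fmax_estimate_mhz(4.000, -0.615)
```
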

Stop-for-confirmation triggers:

- Requirement to increase output buffer default sizes.
- Requirement to reduce throughput target.
- Any architectural change that affects the external integration contract.