Keep only the LLM extraction mode; revise the extraction logic
45
README.md
@@ -1,6 +1,6 @@
|
|||||||
# SRS Requirements Document Parsing Tool
|
# SRS Requirements Document Parsing Tool
|
||||||
|
|
||||||
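An intelligent parsing tool for SRS (Software Requirements Specification) documents. It supports PDF and Docx formats, extracts requirements automatically, and produces structured JSON output.
|
An LLM-based parsing tool for SRS (Software Requirements Specification) documents. It supports PDF and Docx formats, extracts requirements automatically, and produces structured JSON output.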
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
|
|
||||||
@@ -12,6 +12,8 @@
|
|||||||
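- **Table requirement recognition**: extracts functional, interface, and other requirements from tables
|
- **Table requirement recognition**: extracts functional, interface, and other requirements from tables
|
||||||
- **PDF table extraction**: extracts tables from PDFs and attaches them to the matching section automatically
|
- **PDF table extraction**: extracts tables from PDFs and attaches them to the matching section automatically
|
||||||
- **Atomic splitting of long sentences**: automatically splits a long sentence that contains several requirement points into multiple verifiable requirement items
|
- **Atomic splitting of long sentences**: automatically splits a long sentence that contains several requirement points into multiple verifiable requirement items
|
||||||
|
- **Chapter-filtered extraction**: extracts by chapter number (for example, `3` extracts chapter 3 and all of its subsections)
|
||||||
|
- **LLM-only**: the current version supports only the LLM extraction pipeline; the rule-based extraction mode is no longer provided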
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
@@ -27,7 +29,7 @@ pip install dashscope
|
|||||||
pip install pdfplumber
|
pip install pdfplumber
|
||||||
```
|
```
|
||||||
|
|
||||||
### Configure the API key (LLM mode)
|
### Configure the API key (required)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Option 1: environment variable (recommended)
|
# Option 1: environment variable (recommended)
|
||||||
@@ -45,11 +47,11 @@ llm:
|
|||||||
### Run
|
### Run
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# LLM-enhanced mode
|
# LLM-enhanced mode (the only mode)
|
||||||
python main.py -i ".\input\DC-SRS.pdf" -o ".\output\output.json"
|
python main.py -i ".\input\DC-SRS.pdf" -o ".\output\output.json"
|
||||||
|
|
||||||
# Rule-only mode (no LLM)
|
# Extract by chapter (3 selects chapter 3 and its 3.x subsections)
|
||||||
python main.py -i DC-SRS.pdf -o output.json --no-llm
|
python main.py -i ".\input\DC-SRS.pdf" -o ".\output\output_ch3.json" --chapters 3
|
||||||
```
|
```
|
||||||
|
|
||||||
<!-- ```bash
|
<!-- ```bash
|
||||||
@@ -73,16 +75,33 @@ python -c "from src.document_parser import DocxParser; parser = DocxParser('test
|
|||||||
|
|
||||||
| Field | Description |
|
| Field | Description |
|
||||||
|------|------|
|
|------|------|
|
||||||
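| **接口名称** | Name of the interface
|
| **接口名称** | Name of the interface |
|
||||||
| **接口类型** | Type of the interface
|
| **接口类型** | Type of the interface |
|
||||||
| **来源** | Source or sender of the data or signal |
|
| **数据来源** | Source or sender of the data or signal |
|
||||||
| **目的地** | Destination or receiver of the data or signal |
|
| **数据目的地** | Destination or receiver of the data or signal |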
|
||||||
|
|
||||||
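### Requirement description rules
|
### Requirement description strategy (LLM-driven)
|
||||||
|
|
||||||
- **Functional requirements**: keep the original wording; no rewriting or polishing
|
- **Functional requirements**: follow the original text, lightly completing the meaning where necessary
|
||||||
- **Interface requirements**: rewriting and polishing are allowed so that the description is clear and complete
|
- **Interface requirements**: moderate rewriting and polishing are allowed, and missing interface fields are filled in
|
||||||
- **Other requirements**: keep the original wording; no rewriting or polishing
|
- **Other requirements**: follow the original text and avoid rewrites that add nothing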
|
||||||
|
|
||||||
|
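### Table handling strategy
|
||||||
|
|
||||||
|
- **System functional requirement tables and performance requirement tables**: ignored by default; no requirements are extracted
|
||||||
|
- **Interface requirement tables**: interface requirements can be extracted, and interface fields are taken from the table columns first
|
||||||
|
- **Hardware, software, and runtime environment tables**: one requirement is generated per table rather than splitting it into several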
|
||||||
|
|
||||||
|
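### Polishing constraints
|
||||||
|
|
||||||
|
- Except for interface requirements, requirement descriptions stay as close to the original text as possible
|
||||||
|
- For non-interface requirements, polishing may change at most 20 characters (beyond that, the original description is restored)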
|
||||||
|
|
||||||
|
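## Runtime constraints
|
||||||
|
|
||||||
|
- A valid `DASHSCOPE_API_KEY` must be configured (or `llm.api_key` set in `config.yaml`)
|
||||||
|
- If LLM initialization or an LLM call fails, the program exits with an error and does not fall back to rule-based extraction
|
||||||
|
- When `--chapters` is empty, everything is extracted; when it is set to `3`, only chapter 3 and its subsections are extracted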
|
||||||
|
|
||||||
## Directory structure
|
## Directory structure
|
||||||
|
|
||||||
|
|||||||
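The `--chapters` rule described in the README (a value of `3` selects chapter 3 and every `3.x` subsection) amounts to a prefix match on dotted section numbers. The following is a minimal sketch of that rule, assuming section numbers are plain dotted strings; `chapter_matches` is a hypothetical helper name used only for illustration.

```python
from __future__ import annotations


def chapter_matches(number: str, chapters: list[str]) -> bool:
    """Return True if `number` equals a selected chapter or lies beneath it."""
    number = (number or "").strip()
    if not number:
        return False
    # Appending "." to the prefix keeps "3" from matching "30" or "31.2".
    return any(number == c or number.startswith(c + ".") for c in chapters)


if __name__ == "__main__":
    for candidate in ["3", "3.1.2", "30", "4.1", "4.2"]:
        print(candidate, chapter_matches(candidate, ["3", "4.1"]))
```

Comparing against `chapter + "."` rather than the bare chapter string is what prevents chapter 30 from being pulled in when only chapter 3 is requested.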
92
config.yaml
@@ -3,12 +3,12 @@
|
|||||||
|
|
||||||
# LLM configuration - Alibaba Cloud Qwen
|
# LLM configuration - Alibaba Cloud Qwen
|
||||||
llm:
|
llm:
|
||||||
# Whether to enable the LLM (false switches to rule-only extraction)
|
# Whether to enable the LLM (must be true in the current version)
|
||||||
enabled: true
|
enabled: true
|
||||||
# LLM provider: qwen (Alibaba Cloud Qwen)
|
# LLM provider: qwen (Alibaba Cloud Qwen)
|
||||||
provider: "qwen"
|
provider: "qwen"
|
||||||
# Model name
|
# Model name
|
||||||
model: "qwen3-max-2026-01-23"
|
model: "glm-5"
|
||||||
# API key (the DASHSCOPE_API_KEY environment variable is recommended)
|
# API key (the DASHSCOPE_API_KEY environment variable is recommended)
|
||||||
api_key: "sk-7097f7842f724f0c9e70c4bf3b16dacb"
|
api_key: "sk-7097f7842f724f0c9e70c4bf3b16dacb"
|
||||||
# Optional parameters
|
# Optional parameters
|
||||||
@@ -48,7 +48,7 @@ extraction:
|
|||||||
priority: 1
|
priority: 1
|
||||||
接口需求:
|
接口需求:
|
||||||
prefix: "IR"
|
prefix: "IR"
|
||||||
keywords: ["接口", "interface", "api", "外部接口", "内部接口", "CAN", "以太网", "通信"]
|
keywords: ["接口", "interface", "api", "外部接口", "内部接口", "输入输出"]
|
||||||
priority: 2
|
priority: 2
|
||||||
性能需求:
|
性能需求:
|
||||||
prefix: "PR"
|
prefix: "PR"
|
||||||
@@ -68,23 +68,105 @@ extraction:
|
|||||||
priority: 6
|
priority: 6
|
||||||
splitter:
|
splitter:
|
||||||
enabled: true
|
enabled: true
|
||||||
max_sentence_len: 120
|
max_sentence_len: 160
|
||||||
min_clause_len: 12
|
min_clause_len: 20
|
||||||
|
semantic_type_policy:
|
||||||
|
interface_section_hints:
|
||||||
|
- "接口描述"
|
||||||
|
- "接口需求"
|
||||||
|
- "接口要求"
|
||||||
|
- "外部接口"
|
||||||
|
- "内部接口"
|
||||||
|
- "I/O"
|
||||||
|
interface_title_excludes:
|
||||||
|
- "计算机通信需求"
|
||||||
|
- "通信需求"
|
||||||
|
- "通信要求"
|
||||||
|
functional_section_hints:
|
||||||
|
- "功能需求"
|
||||||
|
- "功能要求"
|
||||||
|
other_section_hints:
|
||||||
|
- "安全性需求"
|
||||||
|
- "保密性需求"
|
||||||
|
- "适应性需求"
|
||||||
|
- "环境需求"
|
||||||
|
- "资源需求"
|
||||||
|
- "质量"
|
||||||
|
- "设计约束"
|
||||||
|
- "培训需求"
|
||||||
|
- "软件保障"
|
||||||
|
- "验收"
|
||||||
|
- "交付"
|
||||||
|
- "包装"
|
||||||
|
- "通信需求"
|
||||||
|
- "计算机通信需求"
|
||||||
|
- "硬件环境"
|
||||||
|
- "软件环境"
|
||||||
|
- "运行环境"
|
||||||
semantic_guard:
|
semantic_guard:
|
||||||
enabled: true
|
enabled: true
|
||||||
preserve_condition_action_chain: true
|
preserve_condition_action_chain: true
|
||||||
preserve_alarm_chain: true
|
preserve_alarm_chain: true
|
||||||
|
system_description_hints:
|
||||||
|
- "系统描述"
|
||||||
|
- "功能描述"
|
||||||
|
- "概述"
|
||||||
|
- "示意图"
|
||||||
|
- "组成"
|
||||||
|
- "架构"
|
||||||
|
- "原理"
|
||||||
table_strategy:
|
table_strategy:
|
||||||
llm_semantic_enabled: true
|
llm_semantic_enabled: true
|
||||||
sequence_table_merge: "single_requirement"
|
sequence_table_merge: "single_requirement"
|
||||||
merge_time_series_rows_min: 3
|
merge_time_series_rows_min: 3
|
||||||
|
skip_keywords:
|
||||||
|
- "系统功能要求"
|
||||||
|
- "性能要求"
|
||||||
|
- "系统性能要求"
|
||||||
|
- "系统接口要求"
|
||||||
|
- "功能矩阵"
|
||||||
|
- "能力对照"
|
||||||
|
- "性能指标对照"
|
||||||
|
interface_keywords:
|
||||||
|
- "接口"
|
||||||
|
- "interface"
|
||||||
|
- "输入输出"
|
||||||
|
- "I/O"
|
||||||
|
- "数据来源"
|
||||||
|
- "数据目的地"
|
||||||
|
- "来源"
|
||||||
|
- "目的地"
|
||||||
|
single_requirement_keywords:
|
||||||
|
- "硬件要求"
|
||||||
|
- "软件要求"
|
||||||
|
- "运行环境"
|
||||||
|
- "硬件环境"
|
||||||
|
- "软件环境"
|
||||||
|
- "运行硬件环境"
|
||||||
|
- "运行软件环境"
|
||||||
|
- "环境需求"
|
||||||
|
- "资源需求"
|
||||||
|
- "计算机资源"
|
||||||
rewrite_policy:
|
rewrite_policy:
|
||||||
llm_light_rewrite_enabled: true
|
llm_light_rewrite_enabled: true
|
||||||
preserve_ratio_min: 0.65
|
preserve_ratio_min: 0.65
|
||||||
max_length_growth_ratio: 1.25
|
max_length_growth_ratio: 1.25
|
||||||
|
non_interface_max_edit_distance: 20
|
||||||
renumber_policy:
|
renumber_policy:
|
||||||
enabled: true
|
enabled: true
|
||||||
mode: "section_continuous"
|
mode: "section_continuous"
|
||||||
|
dedup_policy:
|
||||||
|
similarity_threshold: 0.88
|
||||||
|
enable_cross_section_dedup: true
|
||||||
|
prefer_text_over_table: true
|
||||||
|
interface_policy:
|
||||||
|
unknown_fallback: "未知"
|
||||||
|
normalization_policy:
|
||||||
|
ocr_spacing_normalize: true
|
||||||
|
fidelity_policy:
|
||||||
|
preserve_source_text_for_text_blocks: true
|
||||||
|
punctuation_policy:
|
||||||
|
ensure_terminal_period: true
|
||||||
|
|
||||||
# Output configuration
|
# Output configuration
|
||||||
output:
|
output:
|
||||||
|
|||||||
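The new `non_interface_max_edit_distance: 20` setting corresponds to the README rule that a polished non-interface description is discarded when it changes more than 20 characters. This diff does not show how the repository measures that change, so the sketch below is only one plausible reading: it assumes character-level Levenshtein distance, and both `levenshtein` and `guard_rewrite` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def guard_rewrite(original: str, rewritten: str, max_edits: int = 20) -> str:
    """Accept the rewrite only if it stays within the edit budget."""
    if levenshtein(original, rewritten) <= max_edits:
        return rewritten
    return original  # over budget: fall back to the source text
```

A guard of this shape keeps light corrections, such as a missing terminal period, while rejecting a rewrite that drifts far from the source sentence.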
@@ -45,8 +45,8 @@ def parse_requirements_from_json(json_data, parent_section=""):
|
|||||||
"需求描述": req.get("需求描述", ""),
|
"需求描述": req.get("需求描述", ""),
|
||||||
"接口名称": req.get("接口名称", ""),
|
"接口名称": req.get("接口名称", ""),
|
||||||
"接口类型": req.get("接口类型", ""),
|
"接口类型": req.get("接口类型", ""),
|
||||||
"来源": req.get("来源", ""),
|
"数据来源": req.get("数据来源", ""),
|
||||||
"目的地": req.get("目的地", "")
|
"数据目的地": req.get("数据目的地", "")
|
||||||
}
|
}
|
||||||
requirements.append(req_data)
|
requirements.append(req_data)
|
||||||
|
|
||||||
@@ -108,7 +108,7 @@ def create_excel(json_file, output_file):
|
|||||||
# 定义表头(按用户要求的顺序)
|
# 定义表头(按用户要求的顺序)
|
||||||
headers = [
|
headers = [
|
||||||
"章节编号", "章节标题", "需求类型", "需求编号", "需求描述",
|
"章节编号", "章节标题", "需求类型", "需求编号", "需求描述",
|
||||||
"接口名称", "接口类型", "来源", "目的地"
|
"接口名称", "接口类型", "数据来源", "数据目的地"
|
||||||
]
|
]
|
||||||
|
|
||||||
# 写入表头
|
# 写入表头
|
||||||
@@ -154,8 +154,8 @@ def create_excel(json_file, output_file):
|
|||||||
'E': 80, # 需求描述
|
'E': 80, # 需求描述
|
||||||
'F': 25, # 接口名称
|
'F': 25, # 接口名称
|
||||||
'G': 25, # 接口类型
|
'G': 25, # 接口类型
|
||||||
'H': 25, # 来源
|
'H': 25, # 数据来源
|
||||||
'I': 25 # 目的地
|
'I': 25 # 数据目的地
|
||||||
}
|
}
|
||||||
|
|
||||||
for col, width in column_widths.items():
|
for col, width in column_widths.items():
|
||||||
|
|||||||
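Because this commit renames the output fields `来源` and `目的地` to `数据来源` and `数据目的地`, JSON files produced before the change no longer line up with the reader above. A small migration sketch follows, assuming the old output is ordinary nested dicts and lists; `upgrade_keys` and `RENAMED_KEYS` are hypothetical names, not part of the repository.

```python
# Old field name -> new field name, as renamed in this commit.
RENAMED_KEYS = {"来源": "数据来源", "目的地": "数据目的地"}


def upgrade_keys(node):
    """Recursively rename legacy keys in a parsed JSON structure."""
    if isinstance(node, dict):
        # Keys not listed in RENAMED_KEYS pass through unchanged.
        return {RENAMED_KEYS.get(k, k): upgrade_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [upgrade_keys(item) for item in node]
    return node  # scalars are returned as-is
```

If a record happens to carry both the old and the new key, the later one in iteration order wins, so files that mix both spellings deserve a manual check.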
123
main.py
@@ -2,7 +2,6 @@
|
|||||||
# -*- coding: utf-8 -*-
|
# -*- coding: utf-8 -*-
|
||||||
"""
|
"""
|
||||||
SRS 解析工具 - 主程序入口
|
SRS 解析工具 - 主程序入口
|
||||||
LLM 增强版 - 默认阿里云千问大模型
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
@@ -16,6 +15,7 @@ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
|||||||
|
|
||||||
from src.utils import load_config, setup_logging, validate_file_path, ensure_directory_exists, get_env_or_config
|
from src.utils import load_config, setup_logging, validate_file_path, ensure_directory_exists, get_env_or_config
|
||||||
from src.document_parser import create_parser
|
from src.document_parser import create_parser
|
||||||
|
from src.document_parser import Section
|
||||||
from src.requirement_extractor import RequirementExtractor
|
from src.requirement_extractor import RequirementExtractor
|
||||||
from src.json_generator import JSONGenerator
|
from src.json_generator import JSONGenerator
|
||||||
|
|
||||||
@@ -34,10 +34,9 @@ def create_llm(config: dict):
|
|||||||
"""
|
"""
|
||||||
llm_config = config.get('llm', {})
|
llm_config = config.get('llm', {})
|
||||||
|
|
||||||
# 检查是否启用LLM
|
# 当前版本仅支持LLM模式
|
||||||
if not llm_config.get('enabled', True):
|
if not llm_config.get('enabled', True):
|
||||||
logger.info("LLM已禁用,使用纯规则提取模式")
|
raise ValueError("当前版本仅支持LLM模式,请将配置 llm.enabled 设为 true")
|
||||||
return None
|
|
||||||
|
|
||||||
provider = llm_config.get('provider', 'qwen')
|
provider = llm_config.get('provider', 'qwen')
|
||||||
|
|
||||||
@@ -45,9 +44,7 @@ def create_llm(config: dict):
|
|||||||
api_key = get_env_or_config('DASHSCOPE_API_KEY', llm_config.get('api_key'))
|
api_key = get_env_or_config('DASHSCOPE_API_KEY', llm_config.get('api_key'))
|
||||||
|
|
||||||
if not api_key:
|
if not api_key:
|
||||||
logger.warning("未配置API密钥,请使用纯规则提取模式")
|
raise ValueError("未配置API密钥:请设置环境变量 DASHSCOPE_API_KEY 或在 config.yaml 中配置 llm.api_key")
|
||||||
logger.warning("请设置环境变量 DASHSCOPE_API_KEY 或在 config.yaml 中配置 llm.api_key")
|
|
||||||
return None
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from src.llm_interface import QwenLLM
|
from src.llm_interface import QwenLLM
|
||||||
@@ -67,12 +64,80 @@ def create_llm(config: dict):
|
|||||||
return llm
|
return llm
|
||||||
|
|
||||||
except ImportError as e:
|
except ImportError as e:
|
||||||
logger.warning(f"无法导入LLM模块: {e}")
|
raise RuntimeError(f"无法导入LLM模块: {e}。请安装依赖:pip install dashscope") from e
|
||||||
logger.warning("请运行: pip install dashscope")
|
|
||||||
return None
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"创建LLM实例失败: {e}")
|
raise RuntimeError(f"创建LLM实例失败: {e}") from e
|
||||||
return None
|
|
||||||
|
|
||||||
|
def parse_chapter_selector(selector: str) -> list:
|
||||||
|
"""解析章节筛选参数。"""
|
||||||
|
if not selector:
|
||||||
|
return []
|
||||||
|
chapters = [x.strip() for x in selector.split(',') if x.strip()]
|
||||||
|
valid = []
|
||||||
|
for chapter in chapters:
|
||||||
|
if not chapter or not all(p.isdigit() for p in chapter.split('.')):
|
||||||
|
raise ValueError(f"无效章节编号: {chapter},仅支持如 3 或 3.1 的格式")
|
||||||
|
valid.append(chapter)
|
||||||
|
return valid
|
||||||
|
|
||||||
|
|
||||||
|
def _clone_section_with_children(section: Section) -> Section:
|
||||||
|
copied = Section(
|
||||||
|
level=section.level,
|
||||||
|
title=section.title,
|
||||||
|
number=section.number,
|
||||||
|
content=section.content,
|
||||||
|
uid=section.uid,
|
||||||
|
)
|
||||||
|
copied.tables = list(section.tables)
|
||||||
|
copied.blocks = list(section.blocks)
|
||||||
|
for child in section.children:
|
||||||
|
copied.add_child(_clone_section_with_children(child))
|
||||||
|
return copied
|
||||||
|
|
||||||
|
|
||||||
|
def filter_sections_by_chapters(sections: list, chapters: list) -> list:
|
||||||
|
"""按章节前缀过滤章节树(如3匹配3及3.x)。"""
|
||||||
|
if not chapters:
|
||||||
|
return sections
|
||||||
|
|
||||||
|
def matched(number: str) -> bool:
|
||||||
|
number = (number or "").strip()
|
||||||
|
if not number:
|
||||||
|
return False
|
||||||
|
for chapter in chapters:
|
||||||
|
if number == chapter or number.startswith(f"{chapter}."):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
def recurse(section: Section) -> Section:
|
||||||
|
if matched(section.number):
|
||||||
|
return _clone_section_with_children(section)
|
||||||
|
|
||||||
|
copied = Section(
|
||||||
|
level=section.level,
|
||||||
|
title=section.title,
|
||||||
|
number=section.number,
|
||||||
|
content=section.content,
|
||||||
|
uid=section.uid,
|
||||||
|
)
|
||||||
|
copied.tables = list(section.tables)
|
||||||
|
copied.blocks = list(section.blocks)
|
||||||
|
|
||||||
|
for child in section.children:
|
||||||
|
filtered_child = recurse(child)
|
||||||
|
if filtered_child:
|
||||||
|
copied.add_child(filtered_child)
|
||||||
|
|
||||||
|
return copied if copied.children else None
|
||||||
|
|
||||||
|
filtered = []
|
||||||
|
for s in sections:
|
||||||
|
fs = recurse(s)
|
||||||
|
if fs:
|
||||||
|
filtered.append(fs)
|
||||||
|
return filtered
|
||||||
|
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
@@ -86,7 +151,7 @@ def main():
|
|||||||
示例用法:
|
示例用法:
|
||||||
python main.py --input sample.pdf --output output.json
|
python main.py --input sample.pdf --output output.json
|
||||||
python main.py -i requirements.docx -o output.json --verbose
|
python main.py -i requirements.docx -o output.json --verbose
|
||||||
python main.py -i DC-SRS.pdf -o output.json --no-llm # 禁用LLM
|
python main.py -i DC-SRS.pdf -o output.json
|
||||||
"""
|
"""
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -116,11 +181,12 @@ def main():
|
|||||||
action='store_true',
|
action='store_true',
|
||||||
help='输出详细日志'
|
help='输出详细日志'
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
'--no-llm',
|
'--chapters',
|
||||||
action='store_true',
|
type=str,
|
||||||
help='禁用LLM,使用纯规则提取'
|
default=None,
|
||||||
|
help='按章节提取(如: 3 或 3,4.1);输入3表示提取第3章及其子章节'
|
||||||
)
|
)
|
||||||
|
|
||||||
# 解析命令行参数
|
# 解析命令行参数
|
||||||
@@ -129,10 +195,6 @@ def main():
|
|||||||
# 加载配置
|
# 加载配置
|
||||||
config = load_config(args.config)
|
config = load_config(args.config)
|
||||||
|
|
||||||
# 命令行参数覆盖配置
|
|
||||||
if args.no_llm:
|
|
||||||
config.setdefault('llm', {})['enabled'] = False
|
|
||||||
|
|
||||||
# 设置日志
|
# 设置日志
|
||||||
if args.verbose:
|
if args.verbose:
|
||||||
config.setdefault('logging', {})['level'] = 'DEBUG'
|
config.setdefault('logging', {})['level'] = 'DEBUG'
|
||||||
@@ -158,12 +220,9 @@ def main():
|
|||||||
|
|
||||||
logger.info(f"输出文件: {args.output}")
|
logger.info(f"输出文件: {args.output}")
|
||||||
|
|
||||||
# 创建LLM实例
|
# 创建LLM实例(必需)
|
||||||
llm = create_llm(config)
|
llm = create_llm(config)
|
||||||
if llm:
|
logger.info("LLM增强模式已启用")
|
||||||
logger.info("LLM增强模式已启用")
|
|
||||||
else:
|
|
||||||
logger.info("使用纯规则提取模式")
|
|
||||||
|
|
||||||
# 步骤1:解析文档
|
# 步骤1:解析文档
|
||||||
logger.info("\n" + "=" * 60)
|
logger.info("\n" + "=" * 60)
|
||||||
@@ -176,6 +235,13 @@ def main():
|
|||||||
|
|
||||||
sections = doc_parser.parse()
|
sections = doc_parser.parse()
|
||||||
document_title = doc_parser.get_document_title()
|
document_title = doc_parser.get_document_title()
|
||||||
|
|
||||||
|
selected_chapters = parse_chapter_selector(args.chapters) if args.chapters else []
|
||||||
|
if selected_chapters:
|
||||||
|
sections = filter_sections_by_chapters(sections, selected_chapters)
|
||||||
|
if not sections:
|
||||||
|
raise ValueError(f"未匹配到指定章节: {', '.join(selected_chapters)}")
|
||||||
|
logger.info(f"章节筛选已启用: {', '.join(selected_chapters)}")
|
||||||
|
|
||||||
logger.info(f"成功解析文档,提取{len(sections)}个顶级章节")
|
logger.info(f"成功解析文档,提取{len(sections)}个顶级章节")
|
||||||
|
|
||||||
@@ -192,10 +258,7 @@ def main():
|
|||||||
|
|
||||||
# 步骤2:提取需求
|
# 步骤2:提取需求
|
||||||
logger.info("\n" + "=" * 60)
|
logger.info("\n" + "=" * 60)
|
||||||
if llm:
|
logger.info("步骤2:提取需求(LLM增强模式)")
|
||||||
logger.info("步骤2:提取需求(LLM增强模式)")
|
|
||||||
else:
|
|
||||||
logger.info("步骤2:提取需求(规则匹配模式)")
|
|
||||||
logger.info("=" * 60)
|
logger.info("=" * 60)
|
||||||
|
|
||||||
extractor = RequirementExtractor(config, llm=llm)
|
extractor = RequirementExtractor(config, llm=llm)
|
||||||
|
|||||||
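With the rule-based fallback removed, `create_llm` now raises instead of returning `None` when no key is available. The lookup itself can be sketched as below. This is an illustration only: `resolve_api_key` is a hypothetical stand-in for the repository's `get_env_or_config`, whose implementation is not shown here, and the environment-first precedence is an assumption based on the README's recommendation.

```python
import os


def resolve_api_key(llm_config: dict, env_name: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key, preferring the environment over the config file."""
    # Assumption: the environment variable takes precedence over config.yaml.
    key = os.environ.get(env_name) or llm_config.get("api_key")
    if not key:
        # Fail fast: there is no rule-based mode left to fall back to.
        raise ValueError(
            f"No API key configured: set {env_name} or llm.api_key in config.yaml"
        )
    return key
```

Raising at startup surfaces a missing key before any document is parsed, rather than partway through a long extraction run.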
@@ -4,7 +4,6 @@
|
|||||||
支持PDF和Docx格式,针对GJB438B标准SRS文档优化
|
支持PDF和Docx格式,针对GJB438B标准SRS文档优化
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import os
|
|
||||||
import re
|
import re
|
||||||
import logging
|
import logging
|
||||||
import importlib
|
import importlib
|
||||||
@@ -119,43 +118,19 @@ class DocumentParser(ABC):
|
|||||||
sections: 章节列表
|
sections: 章节列表
|
||||||
parent_number: 父章节编号
|
parent_number: 父章节编号
|
||||||
"""
|
"""
|
||||||
# 仅在顶级章节重编号
|
if not sections:
|
||||||
if not parent_number:
|
return
|
||||||
# 前置章节关键词(需要跳过的)
|
|
||||||
skip_keywords = ['目录', '封面', '扉页', '未命名', '年', '月']
|
# Number only the sections that lack a number; original numbers already present in the document must be kept.
|
||||||
# 正文章节关键词(遇到这些说明正文开始)
|
sibling_index = 0
|
||||||
content_keywords = ['外部接口', '接口', '软件需求', '需求', '功能', '性能', '设计', '概述', '标识', '引言']
|
for section in sections:
|
||||||
|
has_number = bool((section.number or "").strip()) and not self._is_chinese_number(section.number)
|
||||||
start_index = 0
|
if not has_number:
|
||||||
for idx, section in enumerate(sections):
|
sibling_index += 1
|
||||||
# 优先检查是否是正文章节
|
section.generate_auto_number(parent_number, sibling_index)
|
||||||
is_content = any(kw in section.title for kw in content_keywords)
|
|
||||||
if is_content and section.level == 1:
|
if section.children:
|
||||||
start_index = idx
|
self._auto_number_sections(section.children, section.number)
|
||||||
break
|
|
||||||
|
|
||||||
# 重新编号所有章节
|
|
||||||
counter = 1
|
|
||||||
for i, section in enumerate(sections):
|
|
||||||
if i < start_index:
|
|
||||||
# 前置章节不编号
|
|
||||||
section.number = ""
|
|
||||||
else:
|
|
||||||
# 正文章节:顶级章节从1开始编号
|
|
||||||
if section.level == 1:
|
|
||||||
section.number = str(counter)
|
|
||||||
counter += 1
|
|
||||||
|
|
||||||
# 递归处理子章节
|
|
||||||
if section.children:
|
|
||||||
self._auto_number_sections(section.children, section.number)
|
|
||||||
else:
|
|
||||||
# 子章节编号
|
|
||||||
for i, section in enumerate(sections, 1):
|
|
||||||
if not section.number or self._is_chinese_number(section.number):
|
|
||||||
section.generate_auto_number(parent_number, i)
|
|
||||||
if section.children:
|
|
||||||
self._auto_number_sections(section.children, section.number)
|
|
||||||
|
|
||||||
def _is_chinese_number(self, text: str) -> bool:
|
def _is_chinese_number(self, text: str) -> bool:
|
||||||
"""检查是否是中文数字编号"""
|
"""检查是否是中文数字编号"""
|
||||||
@@ -327,8 +302,13 @@ class PDFParser(DocumentParser):
|
|||||||
'优先', '关键', '合格', '追踪', '注释',
|
'优先', '关键', '合格', '追踪', '注释',
|
||||||
'CSCI', '计算机', '软件', '硬件', '通信', '通讯',
|
'CSCI', '计算机', '软件', '硬件', '通信', '通讯',
|
||||||
'数据', '适应', '可靠', '内部', '外部',
|
'数据', '适应', '可靠', '内部', '外部',
|
||||||
'描述', '要求', '规定', '说明', '定义',
|
'描述', '要求', '规定', '说明', '定义'
|
||||||
'电场', '防护', '装置', '控制', '监控', '显控'
|
]
|
||||||
|
|
||||||
|
TOP_LEVEL_TITLE_KEYWORDS = [
|
||||||
|
'范围', '标识', '概述', '引用', '文档', '需求', '接口', '性能',
|
||||||
|
'安全', '保密', '环境', '资源', '质量', '设计', '约束', '验收',
|
||||||
|
'交付', '包装', '注释'
|
||||||
]
|
]
|
||||||
|
|
||||||
# 明显无效的章节标题模式(噪声)
|
# 明显无效的章节标题模式(噪声)
|
||||||
@@ -411,21 +391,41 @@ class PDFParser(DocumentParser):
|
|||||||
if page_idx < len(self._page_texts):
|
if page_idx < len(self._page_texts):
|
||||||
page_text = self._page_texts[page_idx]
|
page_text = self._page_texts[page_idx]
|
||||||
|
|
||||||
extracted_tables = page.extract_tables() or []
|
table_objs = page.find_tables() or []
|
||||||
for table_idx, table in enumerate(extracted_tables):
|
if table_objs:
|
||||||
|
extracted_tables = [(idx, t.extract(), t.bbox) for idx, t in enumerate(table_objs)]
|
||||||
|
else:
|
||||||
|
raw_tables = page.extract_tables() or []
|
||||||
|
extracted_tables = [(idx, t, None) for idx, t in enumerate(raw_tables)]
|
||||||
|
|
||||||
|
for table_idx, table, bbox in extracted_tables:
|
||||||
cleaned_table: List[List[str]] = []
|
cleaned_table: List[List[str]] = []
|
||||||
for row in table or []:
|
for row in table or []:
|
||||||
cells = [re.sub(r'\s+', ' ', str(cell or '')).strip() for cell in row]
|
cells = [re.sub(r'\s+', ' ', str(cell or '')).strip() for cell in row]
|
||||||
|
# Keep a row as long as it has any non-empty cell, so valid rows are not dropped by mistake.
|
||||||
if any(cells):
|
if any(cells):
|
||||||
cleaned_table.append(cells)
|
cleaned_table.append(cells)
|
||||||
|
|
||||||
if cleaned_table:
|
if cleaned_table:
|
||||||
|
section_hint = ""
|
||||||
|
if bbox:
|
||||||
|
try:
|
||||||
|
top = float(bbox[1])
|
||||||
|
text_above = page.crop((0, 0, page.width, top)).extract_text() or ""
|
||||||
|
section_hint = self._find_last_section_number(text_above)
|
||||||
|
except Exception:
|
||||||
|
section_hint = ""
|
||||||
|
|
||||||
|
table_ref = self._extract_table_reference(cleaned_table)
|
||||||
|
|
||||||
tables.append(
|
tables.append(
|
||||||
{
|
{
|
||||||
"page_idx": page_idx,
|
"page_idx": page_idx,
|
||||||
"table_idx": table_idx,
|
"table_idx": table_idx,
|
||||||
"page_text": page_text,
|
"page_text": page_text,
|
||||||
"data": cleaned_table,
|
"data": cleaned_table,
|
||||||
|
"section_hint": section_hint,
|
||||||
|
"table_ref": table_ref,
|
||||||
}
|
}
|
||||||
)
|
)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@@ -435,16 +435,86 @@ class PDFParser(DocumentParser):
|
|||||||
logger.info(f"PDF表格提取完成,共{len(tables)}个表格")
|
logger.info(f"PDF表格提取完成,共{len(tables)}个表格")
|
||||||
return tables
|
return tables
|
||||||
|
|
||||||
|
def _extract_table_reference(self, table: List[List[str]]) -> str:
|
||||||
|
"""从表格前几行中提取表号引用,如“表3-5”。"""
|
||||||
|
if not table:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
head_rows = table[:2]
|
||||||
|
merged = " ".join(" ".join(str(c or "") for c in row) for row in head_rows)
|
||||||
|
merged = re.sub(r"\s+", "", merged)
|
||||||
|
m = re.search(r"表\s*(\d+(?:[--]\d+){1,3})", merged)
|
||||||
|
if not m:
|
||||||
|
return ""
|
||||||
|
return m.group(1).replace("-", "-")
|
||||||
|
|
||||||
|
def _build_table_reference_index(self, sections: List[Section]) -> Dict[str, List[Section]]:
|
||||||
|
"""构建“表号 -> 章节”索引,用于优先精确挂接表格。"""
|
||||||
|
index: Dict[str, List[Section]] = {}
|
||||||
|
for section in sections:
|
||||||
|
content = re.sub(r"\s+", "", section.content or "")
|
||||||
|
for m in re.finditer(r"表\s*(\d+(?:[--]\d+){1,3})", content):
|
||||||
|
ref = m.group(1).replace("-", "-")
|
||||||
|
index.setdefault(ref, []).append(section)
|
||||||
|
return index
|
||||||
|
|
||||||
|
def _find_last_section_number(self, text: str) -> str:
|
||||||
|
"""从文本中提取最后出现的章节号。"""
|
||||||
|
if not text:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
found = ""
|
||||||
|
for line in text.split("\n"):
|
||||||
|
line = line.strip()
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
section_info = self._match_section_header(line, set())
|
||||||
|
if section_info:
|
||||||
|
found = section_info[0]
|
||||||
|
return found
|
||||||
|
|
||||||
def _attach_pdf_tables_to_sections(self, tables: List[Dict[str, Any]]) -> None:
|
def _attach_pdf_tables_to_sections(self, tables: List[Dict[str, Any]]) -> None:
|
||||||
"""将提取出的PDF表格挂接到最匹配的章节。"""
|
"""将提取出的PDF表格挂接到最匹配的章节。"""
|
||||||
flat_sections = self._flatten_sections(self.sections)
|
flat_sections = self._flatten_sections(self.sections)
|
||||||
if not flat_sections:
|
if not flat_sections:
|
||||||
return
|
return
|
||||||
|
|
||||||
|
section_by_number = {
|
||||||
|
(s.number or "").strip(): s
|
||||||
|
for s in flat_sections
|
||||||
|
if (s.number or "").strip()
|
||||||
|
}
|
||||||
|
table_ref_index = self._build_table_reference_index(flat_sections)
|
||||||
|
|
||||||
last_section: Optional[Section] = None
|
last_section: Optional[Section] = None
|
||||||
for table in tables:
|
for table in tables:
|
||||||
matched = self._match_table_section(table.get("page_text", ""), flat_sections)
|
target = None
|
||||||
target = matched or last_section or flat_sections[0]
|
|
||||||
|
table_ref = (table.get("table_ref") or "").strip()
|
||||||
|
if table_ref and table_ref in table_ref_index:
|
||||||
|
candidates = table_ref_index[table_ref]
|
||||||
|
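# When one table number matches several sections, prefer the deeper section so that a parent "summary section" does not claim the table.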
|
||||||
|
target = max(candidates, key=lambda s: (s.level, len(s.content or "")))
|
||||||
|
|
||||||
|
section_hint = (table.get("section_hint") or "").strip()
|
||||||
|
if not target and section_hint and section_hint in section_by_number:
|
||||||
|
target = section_by_number[section_hint]
|
||||||
|
|
||||||
|
if not target:
|
||||||
|
target = self._match_table_section(table.get("page_text", ""), flat_sections)
|
||||||
|
|
||||||
|
# As a fallback, prefer the previously matched section; attaching to the first section by mistake would contaminate other chapters.
|
||||||
|
if not target:
|
||||||
|
target = last_section
|
||||||
|
|
||||||
|
if not target:
|
||||||
|
logger.warning(
|
||||||
|
"未定位到表格归属章节,跳过: page=%s table=%s",
|
||||||
|
table.get("page_idx", -1),
|
||||||
|
table.get("table_idx", -1),
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
target.add_table(table["data"])
|
target.add_table(table["data"])
|
||||||
last_section = target
|
last_section = target
|
||||||
|
|
||||||
@@ -464,7 +534,7 @@ class PDFParser(DocumentParser):
|
|||||||
return None
|
return None
|
||||||
|
|
||||||
matched: Optional[Section] = None
|
matched: Optional[Section] = None
|
||||||
matched_score = -1
|
matched_score = (-1, -1)
|
||||||
for section in sections:
|
for section in sections:
|
||||||
title = (section.title or "").strip()
|
title = (section.title or "").strip()
|
||||||
if not title:
|
if not title:
|
||||||
@@ -479,7 +549,7 @@ class PDFParser(DocumentParser):
|
|||||||
for candidate in candidates:
|
for candidate in candidates:
|
||||||
normalized_candidate = re.sub(r"\s+", "", candidate).lower()
|
normalized_candidate = re.sub(r"\s+", "", candidate).lower()
|
||||||
if normalized_candidate and normalized_candidate in normalized_page:
|
if normalized_candidate and normalized_candidate in normalized_page:
|
||||||
score = len(normalized_candidate)
|
score = (len(normalized_candidate), section.level)
|
||||||
if score > matched_score:
|
if score > matched_score:
|
||||||
matched = section
|
matched = section
|
||||||
matched_score = score
|
matched_score = score
|
||||||
@@ -514,6 +584,7 @@ class PDFParser(DocumentParser):
|
|||||||
current_section = None
|
current_section = None
|
||||||
content_buffer = []
|
content_buffer = []
|
||||||
found_sections = set()
|
found_sections = set()
|
||||||
|
last_top_level_number = 0
|
||||||
|
|
||||||
for line in lines:
|
for line in lines:
|
||||||
line = line.strip()
|
line = line.strip()
|
||||||
@@ -526,6 +597,22 @@ class PDFParser(DocumentParser):
|
|||||||
if section_info:
|
if section_info:
|
||||||
number, title = section_info
|
number, title = section_info
|
||||||
level = len(number.split('.'))
|
level = len(number.split('.'))
|
||||||
|
top_level_number = int(number.split('.')[0])
|
||||||
|
|
||||||
|
# A large jump in the top-level chapter number is usually a misdetection (e.g. "8 表..." in body text).
|
||||||
|
if level == 1 and last_top_level_number and top_level_number > last_top_level_number + 1:
|
||||||
|
if line and not self._is_noise(line):
|
||||||
|
content_buffer.append(line)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# A top-level chapter number that goes backwards is usually a body-text list item misdetected as a heading (e.g. "1 综合监控...").
|
||||||
|
if level == 1 and last_top_level_number and top_level_number < last_top_level_number:
|
||||||
|
if line and not self._is_noise(line):
|
||||||
|
content_buffer.append(line)
|
||||||
|
continue
|
||||||
|
|
||||||
|
if level > 6:
|
||||||
|
continue
|
||||||
|
|
||||||
# 保存之前章节的内容
|
# 保存之前章节的内容
|
||||||
if current_section and content_buffer:
|
if current_section and content_buffer:
|
||||||
@@ -540,6 +627,7 @@ class PDFParser(DocumentParser):
|
|||||||
if level == 1:
|
if level == 1:
|
||||||
sections.append(section)
|
sections.append(section)
|
||||||
section_stack = {1: section}
|
section_stack = {1: section}
|
||||||
|
last_top_level_number = top_level_number
|
||||||
else:
|
else:
|
||||||
parent_level = level - 1
|
parent_level = level - 1
|
||||||
while parent_level >= 1 and parent_level not in section_stack:
|
while parent_level >= 1 and parent_level not in section_stack:
|
||||||
@@ -557,6 +645,10 @@ class PDFParser(DocumentParser):
|
|||||||
for l in list(section_stack.keys()):
|
for l in list(section_stack.keys()):
|
||||||
if l > level:
|
if l > level:
|
||||||
del section_stack[l]
|
del section_stack[l]
|
||||||
|
|
||||||
|
# If the level skips (e.g. 1->3), fall back automatically to parent level + 1.
|
||||||
|
if level > 1 and (level - 1) not in section_stack:
|
||||||
|
section.level = max(section_stack.keys()) if section_stack else 1
|
||||||
|
|
||||||
current_section = section
|
current_section = section
|
||||||
else:
|
else:
|
||||||
@@ -577,13 +669,14 @@ class PDFParser(DocumentParser):
|
|||||||
Returns:
|
Returns:
|
||||||
(章节编号, 章节标题) 或 None
|
(章节编号, 章节标题) 或 None
|
||||||
"""
|
"""
|
||||||
# 模式: "3.1功能需求" 或 "3.1 功能需求"
|
# 模式: "3.1 功能需求" / "3.1.2 电场..."
|
||||||
match = re.match(r'^(\d+(?:\.\d+)*)\s*(.+)$', line)
|
match = re.match(r'^(\d+(?:\.\d+)*)[\s、.))]*(.+)$', line)
|
||||||
if not match:
|
if not match:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
number = match.group(1)
|
number = match.group(1)
|
||||||
title = match.group(2).strip()
|
title = match.group(2).strip()
|
||||||
|
level = len(number.split('.'))
|
||||||
|
|
||||||
# 排除目录行
|
# 排除目录行
|
||||||
if '...' in title or title.count('.') > 5:
|
if '...' in title or title.count('.') > 5:
|
||||||
@@ -609,6 +702,18 @@ class PDFParser(DocumentParser):
|
|||||||
# 标题长度检查
|
# 标题长度检查
|
||||||
if len(title) > 60 or len(title) < 2:
|
if len(title) > 60 or len(title) < 2:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
# Filter out phrasing that reads more like body text than a heading.
|
||||||
|
if self._looks_like_statement(title):
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Filter out likely body sentences (long and containing a full stop or semicolon).
|
||||||
|
if len(title) > 24 and re.search(r'[。;;]', title):
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Filter out noisy titles made of concatenated instructions (many commas usually indicate a body-text fragment).
|
||||||
|
if title.count(',') >= 2 and len(title) > 20:
|
||||||
|
return None
|
||||||
|
|
||||||
# 放宽标题字符要求(兼容部分PDF字体导致中文抽取异常的情况)
|
# 放宽标题字符要求(兼容部分PDF字体导致中文抽取异常的情况)
|
||||||
if not re.search(r'[\u4e00-\u9fa5A-Za-z]', title):
|
if not re.search(r'[\u4e00-\u9fa5A-Za-z]', title):
|
||||||
@@ -631,8 +736,30 @@ class PDFParser(DocumentParser):
|
|||||||
# 检查标题是否包含反斜杠(通常是表格噪声)
|
# 检查标题是否包含反斜杠(通常是表格噪声)
|
||||||
if '\\' in title and '需求' not in title:
|
if '\\' in title and '需求' not in title:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
# Require a common valid-title keyword as a safety net, lowering the chance that body text is taken as a heading.
|
||||||
|
if not any(k in title for k in self.VALID_TITLE_KEYWORDS):
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Top-level titles must contain an SRS structural keyword, so body phrases such as "综合监控" or "电场" are not detected as chapters.
|
||||||
|
if level == 1 and not any(k in title for k in self.TOP_LEVEL_TITLE_KEYWORDS):
|
||||||
|
return None
|
||||||
|
|
||||||
return (number, title)
|
return (number, title)
|
||||||
|
|
||||||
|
def _looks_like_statement(self, title: str) -> bool:
|
||||||
|
"""判断标题是否更像正文语句而非章节名。"""
|
||||||
|
if not title:
|
||||||
|
return False
|
||||||
|
|
||||||
|
statement_hints = ["应", "能够", "可以", "进行", "通过", "并", "同时", "当", "如果", "则"]
|
||||||
|
if any(h in title for h in statement_hints):
|
||||||
|
return True
|
||||||
|
|
||||||
|
if len(title) > 24 and re.search(r'[,。;;::]', title):
|
||||||
|
return True
|
||||||
|
|
||||||
|
return False
|
||||||
|
|
||||||
def _is_noise(self, line: str) -> bool:
|
def _is_noise(self, line: str) -> bool:
|
||||||
"""检查是否是噪声行"""
|
"""检查是否是噪声行"""
|
||||||
|
|||||||
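`_extract_table_reference` and `_build_table_reference_index` both look for a table number such as "表3-5" and normalize its dash. In this rendering the character class and the `replace` call show only ASCII hyphens, so which dash variants the original handles cannot be recovered from the page. The sketch below shows one way such normalization could be written; the set of dash characters and the name `extract_table_ref` are assumptions for illustration.

```python
import re

# Assumed dash variants: fullwidth hyphen-minus, en dash, em dash, minus sign.
_DASHES = str.maketrans({"\uFF0D": "-", "\u2013": "-", "\u2014": "-", "\u2212": "-"})
# "表" followed by 2 to 4 hyphen-separated numbers, e.g. 表3-5 or 表3-5-1.
_TABLE_REF = re.compile(r"表(\d+(?:-\d+){1,3})")


def extract_table_ref(text: str) -> str:
    """Return a normalized table number such as "3-5", or "" if none is found."""
    # Drop whitespace first so "表 3 - 5" and "表3-5" are treated alike.
    compact = re.sub(r"\s+", "", text or "").translate(_DASHES)
    match = _TABLE_REF.search(compact)
    return match.group(1) if match else ""
```

Normalizing the dashes before matching means the regular expression needs only the ASCII hyphen, and the returned key is already in the form used for index lookups.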
@@ -146,8 +146,8 @@ class JSONGenerator:
|
|||||||
if req.type == 'interface':
|
if req.type == 'interface':
|
||||||
req_dict["接口名称"] = req.interface_name
|
req_dict["接口名称"] = req.interface_name
|
||||||
req_dict["接口类型"] = req.interface_type
|
req_dict["接口类型"] = req.interface_type
|
||||||
req_dict["来源"] = req.source
|
req_dict["数据来源"] = req.source
|
||||||
req_dict["目的地"] = req.destination
|
req_dict["数据目的地"] = req.destination
|
||||||
result["需求列表"].append(req_dict)
|
result["需求列表"].append(req_dict)
|
||||||
|
|
||||||
# If there are subsections, add them.
|
# If there are subsections, add them.
|
||||||
|
|||||||
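The effect of the key rename on the generated JSON can be illustrated with a small sketch. The `Requirement` dataclass and the `interface_fields` helper below are hypothetical stand-ins for the project's own classes; only the four dictionary keys are taken from the diff.

```python
import json
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical stand-in for the project's requirement object."""
    type: str
    interface_name: str = ""
    interface_type: str = ""
    source: str = ""
    destination: str = ""


def interface_fields(req: Requirement) -> dict:
    """Build the interface-only keys using the renamed field names."""
    if req.type != "interface":
        return {}
    return {
        "接口名称": req.interface_name,
        "接口类型": req.interface_type,
        "数据来源": req.source,        # key was "来源" before this commit
        "数据目的地": req.destination,  # key was "目的地" before this commit
    }


if __name__ == "__main__":
    sample = Requirement("interface", "CAN总线", "外部接口", "传感器", "控制器")
    print(json.dumps(interface_fields(sample), ensure_ascii=False, indent=2))
```

Any downstream consumer that still reads the old keys `来源` and `目的地` would need to be updated to match.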
File diff suppressed because it is too large
@@ -33,8 +33,10 @@ class RequirementSplitter:
|
|||||||
CONNECTOR_HINTS = ["并", "并且", "同时", "然后", "且", "以及", "及"]
|
CONNECTOR_HINTS = ["并", "并且", "同时", "然后", "且", "以及", "及"]
|
||||||
CONDITIONAL_HINTS = ["如果", "当", "若", "在", "其中", "此时", "满足"]
|
CONDITIONAL_HINTS = ["如果", "当", "若", "在", "其中", "此时", "满足"]
|
||||||
CONTEXT_PRONOUN_HINTS = ["该", "其", "上述", "此", "这些", "那些"]
|
CONTEXT_PRONOUN_HINTS = ["该", "其", "上述", "此", "这些", "那些"]
|
||||||
|
CHAIN_HINTS = ["从而", "以便", "用于", "以实现", "并据此", "进而", "从而实现"]
|
||||||
|
ENUMERATION_HINTS = ["具体包括", "包括但不限于", "主要包括", "其中包括", "如下"]
|
||||||
|
|
||||||
def __init__(self, max_sentence_len: int = 120, min_clause_len: int = 12):
|
def __init__(self, max_sentence_len: int = 160, min_clause_len: int = 20):
|
||||||
self.max_sentence_len = max_sentence_len
|
self.max_sentence_len = max_sentence_len
|
||||||
self.min_clause_len = min_clause_len
|
self.min_clause_len = min_clause_len
|
||||||
|
|
||||||
@@ -107,6 +109,14 @@ class RequirementSplitter:
|
|||||||
if len(current) < self.min_clause_len:
|
if len(current) < self.min_clause_len:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# Items listed after “具体包括/其中包括” usually extend the previous sentence and should not be split into separate requirements.
|
||||||
|
if any(h in current for h in self.ENUMERATION_HINTS):
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Chaining phrases are generally not standalone requirement actions; avoid cutting the semantic chain.
|
||||||
|
if any(fragment.startswith(h) for h in self.CHAIN_HINTS):
|
||||||
|
return False
|
||||||
|
|
||||||
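# Fragments that open with a referring word usually continue the previous meaning and should not be cut off.
|
# Fragments that open with a referring word usually continue the previous meaning and should not be cut off.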
|
||||||
if any(fragment.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
|
if any(fragment.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
|
||||||
return False
|
return False
|
||||||
@@ -123,6 +133,12 @@ class RequirementSplitter:
|
|||||||
has_action = any(h in fragment for h in self.ACTION_HINTS)
|
has_action = any(h in fragment for h in self.ACTION_HINTS)
|
||||||
current_has_action = any(h in current for h in self.ACTION_HINTS)
|
current_has_action = any(h in current for h in self.ACTION_HINTS)
|
||||||
|
|
||||||
|
# When a coordinating connector is followed by a qualifying phrase such as “控制/处理/显示”, prefer treating it as the same requirement.
|
||||||
|
if has_connector and len(fragment) < self.max_sentence_len // 3 and not any(
|
||||||
|
kw in fragment for kw in ["并输出", "并上传", "并记录", "并触发"]
|
||||||
|
):
|
||||||
|
return False
|
||||||
|
|
||||||
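# Connector + action word, and the current fragment already contains an action: prefer splitting.
|
# Connector + action word, and the current fragment already contains an action: prefer splitting.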
|
||||||
if has_connector and has_action and current_has_action:
|
if has_connector and has_action and current_has_action:
|
||||||
return True
|
return True
|
||||||
@@ -147,6 +163,9 @@ class RequirementSplitter:
|
|||||||
return merged
|
return merged
|
||||||
|
|
||||||
def _should_merge(self, prev: str, current: str) -> bool:
|
def _should_merge(self, prev: str, current: str) -> bool:
|
||||||
|
if any(h in prev for h in self.ENUMERATION_HINTS):
|
||||||
|
return True
|
||||||
|
|
||||||
# Opens with a referring word, e.g. “该报警信号...”.
|
# Opens with a referring word, e.g. “该报警信号...”.
|
||||||
if any(current.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
|
if any(current.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
|
||||||
return True
|
return True
|
||||||
|
|||||||
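The guards this commit adds to the splitter all veto a split before the connector and action checks run. A minimal sketch of that veto order follows; the hint lists are copied from the diff, while the standalone function `blocks_split` and its signature are assumptions made for illustration (the real logic lives inside a `RequirementSplitter` method that is only partly visible here).

```python
from typing import List

# Hint lists copied from the diff.
CONTEXT_PRONOUN_HINTS: List[str] = ["该", "其", "上述", "此", "这些", "那些"]
CHAIN_HINTS: List[str] = ["从而", "以便", "用于", "以实现", "并据此", "进而", "从而实现"]
ENUMERATION_HINTS: List[str] = ["具体包括", "包括但不限于", "主要包括", "其中包括", "如下"]


def blocks_split(current: str, fragment: str, min_clause_len: int = 20) -> bool:
    """Return True when one of the guards vetoes splitting `fragment` off `current`."""
    # The accumulated clause is too short to stand as its own requirement.
    if len(current) < min_clause_len:
        return True
    # An enumeration lead-in means the fragment is a list item of `current`.
    if any(h in current for h in ENUMERATION_HINTS):
        return True
    # Purpose or consequence phrases continue the same semantic chain.
    if any(fragment.startswith(h) for h in CHAIN_HINTS):
        return True
    # A leading referring word points back at `current`.
    if any(fragment.startswith(h) for h in CONTEXT_PRONOUN_HINTS):
        return True
    return False


if __name__ == "__main__":
    base = "系统应实时采集各类传感器数据并完成数据有效性校验"
    print(blocks_split(base, "从而实现联动控制"))
    print(blocks_split(base, "显示校验结果"))
```

A `False` result here only means no guard fired; in the diff the decision then continues with the connector and action-word checks.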
134
src/settings.py
@@ -60,6 +60,46 @@ class AppSettings:
|
|||||||
"other": "OR",
|
"other": "OR",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
DEFAULT_INTERFACE_SECTION_HINTS = [
|
||||||
|
"接口描述",
|
||||||
|
"接口需求",
|
||||||
|
"接口要求",
|
||||||
|
"外部接口",
|
||||||
|
"内部接口",
|
||||||
|
"i/o",
|
||||||
|
]
|
||||||
|
|
||||||
|
DEFAULT_INTERFACE_TITLE_EXCLUDES = [
|
||||||
|
"计算机通信需求",
|
||||||
|
"通信需求",
|
||||||
|
"通信要求",
|
||||||
|
]
|
||||||
|
|
||||||
|
DEFAULT_FUNCTIONAL_SECTION_HINTS = [
|
||||||
|
"功能需求",
|
||||||
|
"功能要求",
|
||||||
|
]
|
||||||
|
|
||||||
|
DEFAULT_OTHER_SECTION_HINTS = [
|
||||||
|
"安全性需求",
|
||||||
|
"保密性需求",
|
||||||
|
"适应性需求",
|
||||||
|
"环境需求",
|
||||||
|
"资源需求",
|
||||||
|
"质量",
|
||||||
|
"设计约束",
|
||||||
|
"培训需求",
|
||||||
|
"软件保障",
|
||||||
|
"验收",
|
||||||
|
"交付",
|
||||||
|
"包装",
|
||||||
|
"通信需求",
|
||||||
|
"计算机通信需求",
|
||||||
|
"硬件环境",
|
||||||
|
"软件环境",
|
||||||
|
"运行环境",
|
||||||
|
]
|
||||||
|
|
||||||
def __init__(self, config: Dict[str, Any] = None):
|
def __init__(self, config: Dict[str, Any] = None):
|
||||||
self.config = config or {}
|
self.config = config or {}
|
||||||
|
|
||||||
@@ -75,6 +115,20 @@ class AppSettings:
|
|||||||
self.type_prefix = self._build_type_prefix(req_types_cfg)
|
self.type_prefix = self._build_type_prefix(req_types_cfg)
|
||||||
self.type_chinese = self._build_type_chinese(req_types_cfg)
|
self.type_chinese = self._build_type_chinese(req_types_cfg)
|
||||||
|
|
||||||
|
semantic_type_cfg = extraction_cfg.get("semantic_type_policy", {})
|
||||||
|
self.interface_section_hints = [
|
||||||
|
str(x).lower() for x in semantic_type_cfg.get("interface_section_hints", self.DEFAULT_INTERFACE_SECTION_HINTS)
|
||||||
|
]
|
||||||
|
self.interface_title_excludes = [
|
||||||
|
str(x).lower() for x in semantic_type_cfg.get("interface_title_excludes", self.DEFAULT_INTERFACE_TITLE_EXCLUDES)
|
||||||
|
]
|
||||||
|
self.functional_section_hints = [
|
||||||
|
str(x).lower() for x in semantic_type_cfg.get("functional_section_hints", self.DEFAULT_FUNCTIONAL_SECTION_HINTS)
|
||||||
|
]
|
||||||
|
self.other_section_hints = [
|
||||||
|
str(x).lower() for x in semantic_type_cfg.get("other_section_hints", self.DEFAULT_OTHER_SECTION_HINTS)
|
||||||
|
]
|
||||||
|
|
||||||
splitter_cfg = extraction_cfg.get("splitter", {})
|
splitter_cfg = extraction_cfg.get("splitter", {})
|
||||||
self.splitter_max_sentence_len = int(splitter_cfg.get("max_sentence_len", 120))
|
self.splitter_max_sentence_len = int(splitter_cfg.get("max_sentence_len", 120))
|
||||||
self.splitter_min_clause_len = int(splitter_cfg.get("min_clause_len", 12))
|
self.splitter_min_clause_len = int(splitter_cfg.get("min_clause_len", 12))
|
||||||
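The new hint lists are read from an optional `semantic_type_policy` block and lowercased. The sketch below reproduces that lookup in isolation, assuming the extraction config is a plain dictionary; the helper name `load_hints` is hypothetical.

```python
from typing import Any, Dict, List

# Default copied from the diff.
DEFAULT_FUNCTIONAL_SECTION_HINTS: List[str] = ["功能需求", "功能要求"]


def load_hints(extraction_cfg: Dict[str, Any], key: str, default: List[str]) -> List[str]:
    """Read a hint list from `semantic_type_policy`, else use the default; lowercase every entry."""
    policy = extraction_cfg.get("semantic_type_policy", {})
    return [str(item).lower() for item in policy.get(key, default)]


if __name__ == "__main__":
    override = {"semantic_type_policy": {"functional_section_hints": ["Functional Requirements"]}}
    print(load_hints({}, "functional_section_hints", DEFAULT_FUNCTIONAL_SECTION_HINTS))
    print(load_hints(override, "functional_section_hints", DEFAULT_FUNCTIONAL_SECTION_HINTS))
```

With this lookup pattern a configured list replaces the defaults rather than extending them, so a custom configuration has to repeat any default hints it still wants.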
@@ -91,16 +145,61 @@ class AppSettings:
|
|||||||
self.table_llm_semantic_enabled = bool(table_cfg.get("llm_semantic_enabled", True))
|
self.table_llm_semantic_enabled = bool(table_cfg.get("llm_semantic_enabled", True))
|
||||||
self.sequence_table_merge = table_cfg.get("sequence_table_merge", "single_requirement")
|
self.sequence_table_merge = table_cfg.get("sequence_table_merge", "single_requirement")
|
||||||
self.merge_time_series_rows_min = int(table_cfg.get("merge_time_series_rows_min", 3))
|
self.merge_time_series_rows_min = int(table_cfg.get("merge_time_series_rows_min", 3))
|
||||||
|
self.table_skip_keywords = list(
|
||||||
|
table_cfg.get(
|
||||||
|
"skip_keywords",
|
||||||
|
["系统功能要求", "性能要求", "功能矩阵", "能力对照", "性能指标对照"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
self.table_interface_keywords = list(
|
||||||
|
table_cfg.get(
|
||||||
|
"interface_keywords",
|
||||||
|
["接口", "interface", "输入输出", "I/O", "数据来源", "数据目的地", "来源", "目的地"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
self.table_single_requirement_keywords = list(
|
||||||
|
table_cfg.get(
|
||||||
|
"single_requirement_keywords",
|
||||||
|
["硬件要求", "软件要求", "运行环境", "环境需求", "资源需求", "计算机资源"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
rewrite_cfg = extraction_cfg.get("rewrite_policy", {})
|
rewrite_cfg = extraction_cfg.get("rewrite_policy", {})
|
||||||
self.llm_light_rewrite_enabled = bool(rewrite_cfg.get("llm_light_rewrite_enabled", True))
|
self.llm_light_rewrite_enabled = bool(rewrite_cfg.get("llm_light_rewrite_enabled", True))
|
||||||
self.preserve_ratio_min = float(rewrite_cfg.get("preserve_ratio_min", 0.65))
|
self.preserve_ratio_min = float(rewrite_cfg.get("preserve_ratio_min", 0.65))
|
||||||
self.max_length_growth_ratio = float(rewrite_cfg.get("max_length_growth_ratio", 1.25))
|
self.max_length_growth_ratio = float(rewrite_cfg.get("max_length_growth_ratio", 1.25))
|
||||||
|
self.non_interface_max_edit_distance = int(rewrite_cfg.get("non_interface_max_edit_distance", 20))
|
||||||
|
|
||||||
|
self.system_description_hints = list(
|
||||||
|
extraction_cfg.get(
|
||||||
|
"system_description_hints",
|
||||||
|
["系统描述", "功能描述", "概述", "示意图", "组成", "架构", "原理"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
renumber_cfg = extraction_cfg.get("renumber_policy", {})
|
renumber_cfg = extraction_cfg.get("renumber_policy", {})
|
||||||
self.renumber_enabled = bool(renumber_cfg.get("enabled", True))
|
self.renumber_enabled = bool(renumber_cfg.get("enabled", True))
|
||||||
self.renumber_mode = renumber_cfg.get("mode", "section_continuous")
|
self.renumber_mode = renumber_cfg.get("mode", "section_continuous")
|
||||||
|
|
||||||
|
dedup_cfg = extraction_cfg.get("dedup_policy", {})
|
||||||
|
self.dedup_similarity_threshold = float(dedup_cfg.get("similarity_threshold", 0.88))
|
||||||
|
self.enable_cross_section_dedup = bool(dedup_cfg.get("enable_cross_section_dedup", True))
|
||||||
|
self.prefer_text_over_table = bool(dedup_cfg.get("prefer_text_over_table", True))
|
||||||
|
|
||||||
|
interface_cfg = extraction_cfg.get("interface_policy", {})
|
||||||
|
self.interface_unknown_fallback = str(interface_cfg.get("unknown_fallback", "未知"))
|
||||||
|
|
||||||
|
normalization_cfg = extraction_cfg.get("normalization_policy", {})
|
||||||
|
self.ocr_spacing_normalize = bool(normalization_cfg.get("ocr_spacing_normalize", True))
|
||||||
|
|
||||||
|
fidelity_cfg = extraction_cfg.get("fidelity_policy", {})
|
||||||
|
self.preserve_source_text_for_text_blocks = bool(
|
||||||
|
fidelity_cfg.get("preserve_source_text_for_text_blocks", True)
|
||||||
|
)
|
||||||
|
|
||||||
|
punctuation_cfg = extraction_cfg.get("punctuation_policy", {})
|
||||||
|
self.ensure_terminal_period = bool(punctuation_cfg.get("ensure_terminal_period", True))
|
||||||
|
|
||||||
def _build_rules(self, req_types_cfg: Dict[str, Dict[str, Any]]) -> List[RequirementTypeRule]:
|
def _build_rules(self, req_types_cfg: Dict[str, Dict[str, Any]]) -> List[RequirementTypeRule]:
|
||||||
rules: List[RequirementTypeRule] = []
|
rules: List[RequirementTypeRule] = []
|
||||||
if not req_types_cfg:
|
if not req_types_cfg:
|
||||||
@@ -153,10 +252,45 @@ class AppSettings:
|
|||||||
def is_non_requirement_section(self, title: str) -> bool:
|
def is_non_requirement_section(self, title: str) -> bool:
|
||||||
return any(keyword in title for keyword in self.non_requirement_sections)
|
return any(keyword in title for keyword in self.non_requirement_sections)
|
||||||
|
|
||||||
|
def is_interface_semantic_title(self, title: str) -> bool:
|
||||||
|
t = (title or "").strip().lower()
|
||||||
|
if not t:
|
||||||
|
return False
|
||||||
|
|
||||||
|
excluded = any(x in t for x in self.interface_title_excludes)
|
||||||
|
if excluded and "接口" not in t:
|
||||||
|
return False
|
||||||
|
|
||||||
|
return any(h in t for h in self.interface_section_hints)
|
||||||
|
|
||||||
|
def is_functional_semantic_title(self, title: str) -> bool:
|
||||||
|
t = (title or "").strip().lower()
|
||||||
|
if not t:
|
||||||
|
return False
|
||||||
|
return any(h in t for h in self.functional_section_hints)
|
||||||
|
|
||||||
|
def is_other_semantic_title(self, title: str) -> bool:
|
||||||
|
t = (title or "").strip().lower()
|
||||||
|
if not t:
|
||||||
|
return False
|
||||||
|
return any(h in t for h in self.other_section_hints)
|
||||||
|
|
||||||
def detect_requirement_type(self, title: str, content: str) -> str:
|
def detect_requirement_type(self, title: str, content: str) -> str:
|
||||||
|
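# Section semantics take priority: only interface-type sections trigger the interface type; security/confidentiality/adaptability and similar sections are all classed as other requirements.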
|
||||||
|
if self.is_interface_semantic_title(title):
|
||||||
|
return "interface"
|
||||||
|
if self.is_functional_semantic_title(title):
|
||||||
|
return "functional"
|
||||||
|
if self.is_other_semantic_title(title):
|
||||||
|
return "other"
|
||||||
|
|
||||||
combined_text = f"{title} {(content or '')[:500]}".lower()
|
combined_text = f"{title} {(content or '')[:500]}".lower()
|
||||||
for rule in self.requirement_rules:
|
for rule in self.requirement_rules:
|
||||||
|
if rule.key == "interface" and not self.is_interface_semantic_title(title):
|
||||||
|
continue
|
||||||
for keyword in rule.keywords:
|
for keyword in rule.keywords:
|
||||||
if keyword.lower() in combined_text:
|
if keyword.lower() in combined_text:
|
||||||
|
if rule.key in {"performance", "security", "reliability", "other"}:
|
||||||
|
return "other"
|
||||||
return rule.key
|
return rule.key
|
||||||
return "functional"
|
return "functional"
|
||||||
|
|||||||
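The title-first classification order introduced in `detect_requirement_type` can be sketched as a standalone function. The interface and functional hint lists are copied from the diff and the "other" list is abbreviated; `classify_by_title` is a hypothetical name, and returning `None` stands in for the fall-through to the keyword rules.

```python
from typing import List, Optional

INTERFACE_HINTS: List[str] = ["接口描述", "接口需求", "接口要求", "外部接口", "内部接口", "i/o"]
INTERFACE_EXCLUDES: List[str] = ["计算机通信需求", "通信需求", "通信要求"]
FUNCTIONAL_HINTS: List[str] = ["功能需求", "功能要求"]
# Abbreviated: the diff lists more entries than shown here.
OTHER_HINTS: List[str] = ["安全性需求", "保密性需求", "环境需求", "通信需求", "运行环境"]


def is_interface_title(title: str) -> bool:
    """Mirror the interface check: excluded titles need an explicit “接口” to qualify."""
    t = (title or "").strip().lower()
    if not t:
        return False
    excluded = any(x in t for x in INTERFACE_EXCLUDES)
    if excluded and "接口" not in t:
        return False
    return any(h in t for h in INTERFACE_HINTS)


def classify_by_title(title: str) -> Optional[str]:
    """Return the type implied by the section title, or None to defer to keyword rules."""
    t = (title or "").strip().lower()
    if is_interface_title(title):
        return "interface"
    if any(h in t for h in FUNCTIONAL_HINTS):
        return "functional"
    if any(h in t for h in OTHER_HINTS):
        return "other"
    return None


if __name__ == "__main__":
    for title in ["3.2 外部接口需求", "3.5 计算机通信需求", "3.1 功能需求", "1 引言"]:
        print(title, classify_by_title(title))
```

In the diff, the keyword-rule fallback additionally skips the interface rule unless the title check passes and folds performance, security, and reliability matches into `other`.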