Keep only the LLM extraction mode and revise the extraction logic

2026-04-18 20:33:58 +08:00
parent f01ddf045d
commit e274e7faa2
9 changed files with 1427 additions and 403 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
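 # SRS Requirements Document Parsing Tool

-An intelligent SRS (Software Requirements Specification) document parsing tool. It supports PDF and Docx formats and automatically extracts requirements into structured JSON output.
+An LLM-based SRS (Software Requirements Specification) document parsing tool. It supports PDF and Docx formats and automatically extracts requirements into structured JSON output.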

 ## Features

@@ -12,6 +12,8 @@
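 - **Table requirement recognition**: extracts functional, interface, and other requirements from tables
 - **PDF table extraction**: extracts tables from PDFs and attaches them to their sections automatically
 - **Atomic splitting of long sentences**: automatically splits a long sentence containing several requirement points into multiple verifiable requirement items
+- **Chapter-filtered extraction**: extracts by chapter number (for example, `3` extracts chapter 3 and all of its subsections)
+- **LLM-only**: the current version supports only the LLM extraction pipeline; rule-based extraction is no longer provided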

 ## Quick Start

@@ -27,7 +29,7 @@ pip install dashscope
 pip install pdfplumber
 ```

-### Configure the API key (LLM mode)
+### Configure the API key (required)

 ```bash
 # Option 1: environment variable (recommended)
@@ -45,11 +47,11 @@ llm:
 ### Run

 ```bash
-# LLM-enhanced mode
+# LLM-enhanced mode (the only mode)
 python main.py -i ".\input\DC-SRS.pdf" -o ".\output\output.json"

-# Pure rule-based mode (no LLM)
-python main.py -i DC-SRS.pdf -o output.json --no-llm
+# Extract by chapter (3 extracts chapter 3 and its 3.x subsections)
+python main.py -i ".\input\DC-SRS.pdf" -o ".\output\output_ch3.json" --chapters 3
 ```

 <!-- ```bash
@@ -73,16 +75,33 @@ python -c "from src.document_parser import DocxParser; parser = DocxParser('test

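 | Field | Description |
 |------|------|
-| **接口名称** | Name of the interface
-| **接口类型** | Type of the interface
-| **来源** | Source/sender of the data or signal |
-| **目的地** | Destination/receiver of the data or signal |
+| **接口名称** | Name of the interface |
+| **接口类型** | Type of the interface |
+| **数据来源** | Source/sender of the data or signal |
+| **数据目的地** | Destination/receiver of the data or signal |

-### Requirement description rules
+### Requirement description strategy (LLM-driven)

- **Functional requirements**: keep the original wording; no rewriting or polishing
- **Interface requirements**: rewriting and polishing allowed, so that the description is clear and complete
- **Other requirements**: keep the original wording; no rewriting or polishing
+- **Functional requirements**: follow the original text, lightly completing the meaning where necessary
+- **Interface requirements**: moderate rewriting and polishing allowed, with the interface fields filled in
+- **Other requirements**: follow the original text and avoid pointless rewriting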
+
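+### Table handling strategy
+
+- **System function requirement tables and performance requirement tables**: ignored by default; no requirements are extracted
+- **Interface requirement tables**: interface requirements may be extracted, with interface fields taken from the table columns first
+- **Hardware/software/runtime environment tables**: one requirement per table, rather than splitting into several
+
+### Polishing constraints
+
+- Except for interface requirements, requirement descriptions stay as close to the original text as possible
+- For non-interface requirements, polishing may change at most 20 characters (beyond that, the original description is restored)
+
+## Runtime constraints
+
+- A valid `DASHSCOPE_API_KEY` must be configured (or `llm.api_key` set in `config.yaml`)
+- If LLM initialization or an LLM call fails, the program exits with an error and does not fall back to rule-based extraction
+- With `--chapters` empty, everything is extracted; with `3`, only chapter 3 and its subsections are extracted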

 ## Directory Structure

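The table-handling rules in the README hunk above amount to a keyword lookup on a table's caption and header. The sketch below is one way to picture that lookup, not the code in `src/requirement_extractor.py` (its diff is not shown in this commit view). The function name `classify_table`, the returned labels, and the order of the checks (skip, then one-requirement-per-table, then interface) are assumptions; the keyword lists are abridged from `table_strategy` in `config.yaml`.

```python
from typing import List

# Abridged from extraction.table_strategy in config.yaml (this commit).
SKIP_KEYWORDS = ["系统功能要求", "性能要求", "系统接口要求", "功能矩阵"]
SINGLE_REQUIREMENT_KEYWORDS = ["硬件要求", "软件要求", "运行环境", "硬件环境", "软件环境"]
INTERFACE_KEYWORDS = ["接口", "interface", "输入输出", "I/O"]


def classify_table(caption: str, header: List[str]) -> str:
    """Return 'skip', 'single', 'interface' or 'default' (hypothetical labels)."""
    text = (caption + " " + " ".join(header)).lower()
    # Assumed order: skip wins over everything else.
    if any(k.lower() in text for k in SKIP_KEYWORDS):
        return "skip"
    # Environment/resource tables become one requirement per table.
    if any(k.lower() in text for k in SINGLE_REQUIREMENT_KEYWORDS):
        return "single"
    # Interface tables are extracted row by row, fields taken from columns.
    if any(k.lower() in text for k in INTERFACE_KEYWORDS):
        return "interface"
    return "default"


if __name__ == "__main__":
    print(classify_table("表3-7 外部接口要求", ["接口名称", "数据来源"]))
```

With this ordering, a caption such as 系统接口要求 is skipped even though it contains 接口. Whether the real extractor resolves that overlap the same way is not visible in this diff.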
--- a/config.yaml
+++ b/config.yaml
@@ -3,12 +3,12 @@

 # LLM configuration - Alibaba Cloud Qwen
 llm:
-  # Whether to enable the LLM (false switches to pure rule-based extraction)
+  # Whether to enable the LLM (must be true in the current version)
  enabled: true
  # LLM提供商：qwen（阿里云千问）
  provider: "qwen"
  # 模型名称
-  model: "qwen3-max-2026-01-23"
+  model: "glm-5"
  # API密钥（建议使用环境变量 DASHSCOPE_API_KEY）
  api_key: "sk-7097f7842f724f0c9e70c4bf3b16dacb"
  # 可选参数
@@ -48,7 +48,7 @@ extraction:
      priority: 1
    接口需求:
      prefix: "IR"
-      keywords: ["接口", "interface", "api", "外部接口", "内部接口", "CAN", "以太网", "通信"]
+      keywords: ["接口", "interface", "api", "外部接口", "内部接口", "输入输出"]
      priority: 2
    性能需求:
      prefix: "PR"
@@ -68,23 +68,105 @@ extraction:
      priority: 6
  splitter:
    enabled: true
-    max_sentence_len: 120
-    min_clause_len: 12
+    max_sentence_len: 160
+    min_clause_len: 20
+  semantic_type_policy:
+    interface_section_hints:
+      - "接口描述"
+      - "接口需求"
+      - "接口要求"
+      - "外部接口"
+      - "内部接口"
+      - "I/O"
+    interface_title_excludes:
+      - "计算机通信需求"
+      - "通信需求"
+      - "通信要求"
+    functional_section_hints:
+      - "功能需求"
+      - "功能要求"
+    other_section_hints:
+      - "安全性需求"
+      - "保密性需求"
+      - "适应性需求"
+      - "环境需求"
+      - "资源需求"
+      - "质量"
+      - "设计约束"
+      - "培训需求"
+      - "软件保障"
+      - "验收"
+      - "交付"
+      - "包装"
+      - "通信需求"
+      - "计算机通信需求"
+      - "硬件环境"
+      - "软件环境"
+      - "运行环境"
  semantic_guard:
    enabled: true
    preserve_condition_action_chain: true
    preserve_alarm_chain: true
+  system_description_hints:
+    - "系统描述"
+    - "功能描述"
+    - "概述"
+    - "示意图"
+    - "组成"
+    - "架构"
+    - "原理"
  table_strategy:
    llm_semantic_enabled: true
    sequence_table_merge: "single_requirement"
    merge_time_series_rows_min: 3
+    skip_keywords:
+      - "系统功能要求"
+      - "性能要求"
+      - "系统性能要求"
+      - "系统接口要求"
+      - "功能矩阵"
+      - "能力对照"
+      - "性能指标对照"
+    interface_keywords:
+      - "接口"
+      - "interface"
+      - "输入输出"
+      - "I/O"
+      - "数据来源"
+      - "数据目的地"
+      - "来源"
+      - "目的地"
+    single_requirement_keywords:
+      - "硬件要求"
+      - "软件要求"
+      - "运行环境"
+      - "硬件环境"
+      - "软件环境"
+      - "运行硬件环境"
+      - "运行软件环境"
+      - "环境需求"
+      - "资源需求"
+      - "计算机资源"
  rewrite_policy:
    llm_light_rewrite_enabled: true
    preserve_ratio_min: 0.65
    max_length_growth_ratio: 1.25
+    non_interface_max_edit_distance: 20
  renumber_policy:
    enabled: true
    mode: "section_continuous"
+  dedup_policy:
+    similarity_threshold: 0.88
+    enable_cross_section_dedup: true
+    prefer_text_over_table: true
+  interface_policy:
+    unknown_fallback: "未知"
+  normalization_policy:
+    ocr_spacing_normalize: true
+  fidelity_policy:
+    preserve_source_text_for_text_blocks: true
+  punctuation_policy:
+    ensure_terminal_period: true

 # Output configuration
 output:
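The new `rewrite_policy.non_interface_max_edit_distance: 20` setting pairs with the README rule that a non-interface rewrite changing too much is discarded in favor of the original text. The code that enforces it is not part of the visible diff, so the following is only a sketch under assumptions: that "edit distance" means character-level Levenshtein distance, and that the check looks roughly like the hypothetical `apply_rewrite_guard` below.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def apply_rewrite_guard(original: str, rewritten: str, req_type: str,
                        max_edit_distance: int = 20) -> str:
    """Keep the LLM rewrite only if it stays within the edit budget.

    Hypothetical helper: interface requirements are exempt, matching the
    README statement that only non-interface rewrites are capped.
    """
    if req_type == "interface":
        return rewritten
    if edit_distance(original, rewritten) > max_edit_distance:
        return original  # roll back to the source description
    return rewritten


if __name__ == "__main__":
    print(apply_rewrite_guard("系统应记录日志", "系统应记录运行日志。", "functional"))
```

The pure-Python distance is quadratic in the text length, which is fine for single requirement sentences; a real implementation might use a library instead.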
--- a/json_to_excel.py
+++ b/json_to_excel.py
@@ -45,8 +45,8 @@ def parse_requirements_from_json(json_data, parent_section=""):
                "需求描述": req.get("需求描述", ""),
                "接口名称": req.get("接口名称", ""),
                "接口类型": req.get("接口类型", ""),
-                "来源": req.get("来源", ""),
-                "目的地": req.get("目的地", "")
+                "数据来源": req.get("数据来源", ""),
+                "数据目的地": req.get("数据目的地", "")
            }
            requirements.append(req_data)
        
@@ -108,7 +108,7 @@ def create_excel(json_file, output_file):
    # 定义表头（按用户要求的顺序）
    headers = [
        "章节编号", "章节标题", "需求类型", "需求编号", "需求描述",
-        "接口名称", "接口类型", "来源", "目的地"
+        "接口名称", "接口类型", "数据来源", "数据目的地"
    ]
    
    # 写入表头
@@ -154,8 +154,8 @@ def create_excel(json_file, output_file):
        'E': 80,  # 需求描述
        'F': 25,  # 接口名称
        'G': 25,  # 接口类型
-        'H': 25,  # 来源
-        'I': 25   # 目的地
+        'H': 25,  # 数据来源
+        'I': 25   # 数据目的地
    }
    
    for col, width in column_widths.items():
--- a/main.py
+++ b/main.py
@@ -2,7 +2,6 @@
 # -*- coding: utf-8 -*-
 """
 SRS parsing tool - main program entry point
-LLM-enhanced edition - Alibaba Cloud Qwen LLM by default
 """

 import argparse
@@ -16,6 +15,7 @@ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

 from src.utils import load_config, setup_logging, validate_file_path, ensure_directory_exists, get_env_or_config
 from src.document_parser import create_parser
+from src.document_parser import Section
 from src.requirement_extractor import RequirementExtractor
 from src.json_generator import JSONGenerator

@@ -34,10 +34,9 @@ def create_llm(config: dict):
    """
    llm_config = config.get('llm', {})
    
-    # Check whether the LLM is enabled
+    # The current version supports LLM mode only
    if not llm_config.get('enabled', True):
-        logger.info("LLM已禁用，使用纯规则提取模式")
-        return None
+        raise ValueError("当前版本仅支持LLM模式，请将配置 llm.enabled 设为 true")
    
    provider = llm_config.get('provider', 'qwen')
    
@@ -45,9 +44,7 @@ def create_llm(config: dict):
    api_key = get_env_or_config('DASHSCOPE_API_KEY', llm_config.get('api_key'))
    
    if not api_key:
-        logger.warning("未配置API密钥，请使用纯规则提取模式")
-        logger.warning("请设置环境变量 DASHSCOPE_API_KEY 或在 config.yaml 中配置 llm.api_key")
-        return None
+        raise ValueError("未配置API密钥：请设置环境变量 DASHSCOPE_API_KEY 或在 config.yaml 中配置 llm.api_key")
    
    try:
        from src.llm_interface import QwenLLM
@@ -67,12 +64,80 @@ def create_llm(config: dict):
        return llm
        
    except ImportError as e:
-        logger.warning(f"无法导入LLM模块: {e}")
-        logger.warning("请运行: pip install dashscope")
-        return None
+        raise RuntimeError(f"无法导入LLM模块: {e}。请安装依赖：pip install dashscope") from e
    except Exception as e:
-        logger.warning(f"创建LLM实例失败: {e}")
-        return None
+        raise RuntimeError(f"创建LLM实例失败: {e}") from e
+
+
+def parse_chapter_selector(selector: str) -> list:
+    """Parse the chapter selector argument."""
+    if not selector:
+        return []
+    chapters = [x.strip() for x in selector.split(',') if x.strip()]
+    valid = []
+    for chapter in chapters:
+        if not chapter or not all(p.isdigit() for p in chapter.split('.')):
+            raise ValueError(f"无效章节编号: {chapter}，仅支持如 3 或 3.1 的格式")
+        valid.append(chapter)
+    return valid
+
+
+def _clone_section_with_children(section: Section) -> Section:
+    copied = Section(
+        level=section.level,
+        title=section.title,
+        number=section.number,
+        content=section.content,
+        uid=section.uid,
+    )
+    copied.tables = list(section.tables)
+    copied.blocks = list(section.blocks)
+    for child in section.children:
+        copied.add_child(_clone_section_with_children(child))
+    return copied
+
+
+def filter_sections_by_chapters(sections: list, chapters: list) -> list:
+    """Filter the section tree by chapter prefix (e.g. 3 matches 3 and 3.x)."""
+    if not chapters:
+        return sections
+
+    def matched(number: str) -> bool:
+        number = (number or "").strip()
+        if not number:
+            return False
+        for chapter in chapters:
+            if number == chapter or number.startswith(f"{chapter}."):
+                return True
+        return False
+
+    def recurse(section: Section) -> Section:
+        if matched(section.number):
+            return _clone_section_with_children(section)
+
+        copied = Section(
+            level=section.level,
+            title=section.title,
+            number=section.number,
+            content=section.content,
+            uid=section.uid,
+        )
+        copied.tables = list(section.tables)
+        copied.blocks = list(section.blocks)
+
+        for child in section.children:
+            filtered_child = recurse(child)
+            if filtered_child:
+                copied.add_child(filtered_child)
+
+        return copied if copied.children else None
+
+    filtered = []
+    for s in sections:
+        fs = recurse(s)
+        if fs:
+            filtered.append(fs)
+    return filtered


 def main():
@@ -86,7 +151,7 @@ def main():
 示例用法：
  python main.py --input sample.pdf --output output.json
  python main.py -i requirements.docx -o output.json --verbose
-  python main.py -i DC-SRS.pdf -o output.json --no-llm  # 禁用LLM
+    python main.py -i DC-SRS.pdf -o output.json
        """
    )
    
@@ -116,11 +181,12 @@ def main():
        action='store_true',
        help='输出详细日志'
    )
-    
+
    parser.add_argument(
-        '--no-llm',
-        action='store_true',
-        help='禁用LLM，使用纯规则提取'
+        '--chapters',
+        type=str,
+        default=None,
+        help='按章节提取（如: 3 或 3,4.1）；输入3表示提取第3章及其子章节'
    )
    
    # 解析命令行参数
@@ -129,10 +195,6 @@ def main():
    # 加载配置
    config = load_config(args.config)
    
-    # 命令行参数覆盖配置
-    if args.no_llm:
-        config.setdefault('llm', {})['enabled'] = False
-    
    # 设置日志
    if args.verbose:
        config.setdefault('logging', {})['level'] = 'DEBUG'
@@ -158,12 +220,9 @@ def main():
        
        logger.info(f"输出文件: {args.output}")
        
-        # Create the LLM instance
+        # Create the LLM instance (required)
        llm = create_llm(config)
-        if llm:
-            logger.info("LLM增强模式已启用")
-        else:
-            logger.info("使用纯规则提取模式")
+        logger.info("LLM增强模式已启用")
        
        # 步骤1：解析文档
        logger.info("\n" + "=" * 60)
@@ -176,6 +235,13 @@ def main():
        
        sections = doc_parser.parse()
        document_title = doc_parser.get_document_title()
+
+        selected_chapters = parse_chapter_selector(args.chapters) if args.chapters else []
+        if selected_chapters:
+            sections = filter_sections_by_chapters(sections, selected_chapters)
+            if not sections:
+                raise ValueError(f"未匹配到指定章节: {', '.join(selected_chapters)}")
+            logger.info(f"章节筛选已启用: {', '.join(selected_chapters)}")
        
        logger.info(f"成功解析文档，提取{len(sections)}个顶级章节")
        
@@ -192,10 +258,7 @@ def main():
        
        # 步骤2：提取需求
        logger.info("\n" + "=" * 60)
-        if llm:
-            logger.info("步骤2：提取需求（LLM增强模式）")
-        else:
-            logger.info("步骤2：提取需求（规则匹配模式）")
+        logger.info("步骤2：提取需求（LLM增强模式）")
        logger.info("=" * 60)
        
        extractor = RequirementExtractor(config, llm=llm)
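The `--chapters` handling added to `main.py` has two parts: validating the comma-separated selector and pruning the section tree so that a matched section keeps its whole subtree, while an unmatched ancestor survives only as a path to a matched descendant. The sketch below replays that behavior on a stand-in `Node` class; `Node`, `matches`, `filter_tree` and `numbers` are hypothetical names, and unlike the diff the sketch reuses matched nodes instead of deep-copying them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Stand-in for src.document_parser.Section (number and children only)."""
    number: str
    children: List["Node"] = field(default_factory=list)


def parse_chapter_selector(selector: str) -> List[str]:
    """Split '3,4.1' into ['3', '4.1'], rejecting non-numeric parts."""
    chapters = [x.strip() for x in (selector or "").split(",") if x.strip()]
    for chapter in chapters:
        if not all(p.isdigit() for p in chapter.split(".")):
            raise ValueError(f"invalid chapter number: {chapter}")
    return chapters


def matches(number: str, chapters: List[str]) -> bool:
    """'3' matches '3' and '3.x', but not '31'."""
    return any(number == c or number.startswith(c + ".") for c in chapters)


def filter_tree(node: Node, chapters: List[str]) -> Optional[Node]:
    if matches(node.number, chapters):
        return node  # keep the whole subtree
    kept = [k for k in (filter_tree(c, chapters) for c in node.children) if k]
    # An unmatched node is kept only when some descendant matched.
    return Node(node.number, kept) if kept else None


def numbers(node: Node) -> List[str]:
    """Flatten a tree into its section numbers, depth first."""
    return [node.number] + [n for c in node.children for n in numbers(c)]


if __name__ == "__main__":
    tree = Node("3", [Node("3.1", [Node("3.1.1")]), Node("3.2")])
    print(numbers(filter_tree(tree, parse_chapter_selector("3.1"))))
```

Appending a dot before the prefix test is what keeps chapter `3` from also selecting chapters `30` to `39`.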
--- a/src/document_parser.py
+++ b/src/document_parser.py
@@ -4,7 +4,6 @@
 Supports PDF and Docx formats; optimized for SRS documents that follow the GJB438B standard
 """

-import os
 import re
 import logging
 import importlib
@@ -119,43 +118,19 @@ class DocumentParser(ABC):
            sections: 章节列表
            parent_number: 父章节编号
        """
-        # Renumber only at the top level
-        if not parent_number:
-            # Front-matter keywords (sections to skip)
-            skip_keywords = ['目录', '封面', '扉页', '未命名', '年', '月']
-            # Body-section keywords (seeing one of these means the body has started)
-            content_keywords = ['外部接口', '接口', '软件需求', '需求', '功能', '性能', '设计', '概述', '标识', '引言']
-            
-            start_index = 0
-            for idx, section in enumerate(sections):
-                # Check first whether this is a body section
-                is_content = any(kw in section.title for kw in content_keywords)
-                if is_content and section.level == 1:
-                    start_index = idx
-                    break
-            
-            # Renumber all sections
-            counter = 1
-            for i, section in enumerate(sections):
-                if i < start_index:
-                    # Front-matter sections get no number
-                    section.number = ""
-                else:
-                    # Body sections: top-level sections are numbered from 1
-                    if section.level == 1:
-                        section.number = str(counter)
-                        counter += 1
-                
-                # Recurse into child sections
-                if section.children:
-                    self._auto_number_sections(section.children, section.number)
-        else:
-            # Child-section numbering
-            for i, section in enumerate(sections, 1):
-                if not section.number or self._is_chinese_number(section.number):
-                    section.generate_auto_number(parent_number, i)
-                if section.children:
-                    self._auto_number_sections(section.children, section.number)
+        if not sections:
+            return
+
+        # Only fill in numbers for sections that lack one; original numbers from the document must be preserved.
+        sibling_index = 0
+        for section in sections:
+            has_number = bool((section.number or "").strip()) and not self._is_chinese_number(section.number)
+            if not has_number:
+                sibling_index += 1
+                section.generate_auto_number(parent_number, sibling_index)
+
+            if section.children:
+                self._auto_number_sections(section.children, section.number)
    
    def _is_chinese_number(self, text: str) -> bool:
        """检查是否是中文数字编号"""
@@ -327,8 +302,13 @@ class PDFParser(DocumentParser):
        '优先', '关键', '合格', '追踪', '注释',
        'CSCI', '计算机', '软件', '硬件', '通信', '通讯',
        '数据', '适应', '可靠', '内部', '外部',
-        '描述', '要求', '规定', '说明', '定义',
-        '电场', '防护', '装置', '控制', '监控', '显控'
+        '描述', '要求', '规定', '说明', '定义'
+    ]
+
+    TOP_LEVEL_TITLE_KEYWORDS = [
+        '范围', '标识', '概述', '引用', '文档', '需求', '接口', '性能',
+        '安全', '保密', '环境', '资源', '质量', '设计', '约束', '验收',
+        '交付', '包装', '注释'
    ]
    
    # 明显无效的章节标题模式（噪声）
@@ -411,21 +391,41 @@ class PDFParser(DocumentParser):
                    if page_idx < len(self._page_texts):
                        page_text = self._page_texts[page_idx]

-                    extracted_tables = page.extract_tables() or []
-                    for table_idx, table in enumerate(extracted_tables):
+                    table_objs = page.find_tables() or []
+                    if table_objs:
+                        extracted_tables = [(idx, t.extract(), t.bbox) for idx, t in enumerate(table_objs)]
+                    else:
+                        raw_tables = page.extract_tables() or []
+                        extracted_tables = [(idx, t, None) for idx, t in enumerate(raw_tables)]
+
+                    for table_idx, table, bbox in extracted_tables:
                        cleaned_table: List[List[str]] = []
                        for row in table or []:
                            cells = [re.sub(r'\s+', ' ', str(cell or '')).strip() for cell in row]
+                            # Keep a row as long as it has any non-empty cell, so valid rows are not dropped by mistake.
                            if any(cells):
                                cleaned_table.append(cells)

                        if cleaned_table:
+                            section_hint = ""
+                            if bbox:
+                                try:
+                                    top = float(bbox[1])
+                                    text_above = page.crop((0, 0, page.width, top)).extract_text() or ""
+                                    section_hint = self._find_last_section_number(text_above)
+                                except Exception:
+                                    section_hint = ""
+
+                            table_ref = self._extract_table_reference(cleaned_table)
+
                            tables.append(
                                {
                                    "page_idx": page_idx,
                                    "table_idx": table_idx,
                                    "page_text": page_text,
                                    "data": cleaned_table,
+                                    "section_hint": section_hint,
+                                    "table_ref": table_ref,
                                }
                            )
        except Exception as e:
@@ -435,16 +435,86 @@ class PDFParser(DocumentParser):
        logger.info(f"PDF表格提取完成，共{len(tables)}个表格")
        return tables

+    def _extract_table_reference(self, table: List[List[str]]) -> str:
+        """Extract the table number reference (e.g. 表3-5) from the first rows of the table."""
+        if not table:
+            return ""
+
+        head_rows = table[:2]
+        merged = " ".join(" ".join(str(c or "") for c in row) for row in head_rows)
+        merged = re.sub(r"\s+", "", merged)
+        m = re.search(r"表\s*(\d+(?:[-－]\d+){1,3})", merged)
+        if not m:
+            return ""
+        return m.group(1).replace("－", "-")
+
+    def _build_table_reference_index(self, sections: List[Section]) -> Dict[str, List[Section]]:
+        """Build a table-number -> sections index, used to attach tables by exact reference first."""
+        index: Dict[str, List[Section]] = {}
+        for section in sections:
+            content = re.sub(r"\s+", "", section.content or "")
+            for m in re.finditer(r"表\s*(\d+(?:[-－]\d+){1,3})", content):
+                ref = m.group(1).replace("－", "-")
+                index.setdefault(ref, []).append(section)
+        return index
+
+    def _find_last_section_number(self, text: str) -> str:
+        """Extract the last section number that appears in the text."""
+        if not text:
+            return ""
+
+        found = ""
+        for line in text.split("\n"):
+            line = line.strip()
+            if not line:
+                continue
+            section_info = self._match_section_header(line, set())
+            if section_info:
+                found = section_info[0]
+        return found
+
    def _attach_pdf_tables_to_sections(self, tables: List[Dict[str, Any]]) -> None:
        """将提取出的PDF表格挂接到最匹配的章节。"""
        flat_sections = self._flatten_sections(self.sections)
        if not flat_sections:
            return

+        section_by_number = {
+            (s.number or "").strip(): s
+            for s in flat_sections
+            if (s.number or "").strip()
+        }
+        table_ref_index = self._build_table_reference_index(flat_sections)
+
        last_section: Optional[Section] = None
        for table in tables:
-            matched = self._match_table_section(table.get("page_text", ""), flat_sections)
-            target = matched or last_section or flat_sections[0]
+            target = None
+
+            table_ref = (table.get("table_ref") or "").strip()
+            if table_ref and table_ref in table_ref_index:
+                candidates = table_ref_index[table_ref]
+                # When one table number matches several sections, prefer the deeper section so a parent "summary" section does not take the table.
+                target = max(candidates, key=lambda s: (s.level, len(s.content or "")))
+
+            section_hint = (table.get("section_hint") or "").strip()
+            if not target and section_hint and section_hint in section_by_number:
+                target = section_by_number[section_hint]
+
+            if not target:
+                target = self._match_table_section(table.get("page_text", ""), flat_sections)
+
+            # As a fallback, prefer the previously matched section; attaching to the first section would pollute other chapters.
+            if not target:
+                target = last_section
+
+            if not target:
+                logger.warning(
+                    "未定位到表格归属章节，跳过: page=%s table=%s",
+                    table.get("page_idx", -1),
+                    table.get("table_idx", -1),
+                )
+                continue
+
            target.add_table(table["data"])
            last_section = target

@@ -464,7 +534,7 @@ class PDFParser(DocumentParser):
            return None

        matched: Optional[Section] = None
-        matched_score = -1
+        matched_score = (-1, -1)
        for section in sections:
            title = (section.title or "").strip()
            if not title:
@@ -479,7 +549,7 @@ class PDFParser(DocumentParser):
            for candidate in candidates:
                normalized_candidate = re.sub(r"\s+", "", candidate).lower()
                if normalized_candidate and normalized_candidate in normalized_page:
-                    score = len(normalized_candidate)
+                    score = (len(normalized_candidate), section.level)
                    if score > matched_score:
                        matched = section
                        matched_score = score
@@ -514,6 +584,7 @@ class PDFParser(DocumentParser):
        current_section = None
        content_buffer = []
        found_sections = set()
+        last_top_level_number = 0
        
        for line in lines:
            line = line.strip()
@@ -526,6 +597,22 @@ class PDFParser(DocumentParser):
            if section_info:
                number, title = section_info
                level = len(number.split('.'))
+                top_level_number = int(number.split('.')[0])
+
+                # A large jump in the top-level chapter number is usually a misdetection (e.g. "8 表..." inside body text).
+                if level == 1 and last_top_level_number and top_level_number > last_top_level_number + 1:
+                    if line and not self._is_noise(line):
+                        content_buffer.append(line)
+                    continue
+
+                # A top-level number going backwards usually means a body-text enumeration item was misdetected (e.g. "1 综合监控...").
+                if level == 1 and last_top_level_number and top_level_number < last_top_level_number:
+                    if line and not self._is_noise(line):
+                        content_buffer.append(line)
+                    continue
+
+                if level > 6:
+                    continue
                
                # 保存之前章节的内容
                if current_section and content_buffer:
@@ -540,6 +627,7 @@ class PDFParser(DocumentParser):
                if level == 1:
                    sections.append(section)
                    section_stack = {1: section}
+                    last_top_level_number = top_level_number
                else:
                    parent_level = level - 1
                    while parent_level >= 1 and parent_level not in section_stack:
@@ -557,6 +645,10 @@ class PDFParser(DocumentParser):
                for l in list(section_stack.keys()):
                    if l > level:
                        del section_stack[l]
+
+                # On a level jump (e.g. 1->3), fall back automatically to parent level + 1.
+                if level > 1 and (level - 1) not in section_stack:
+                    section.level = max(section_stack.keys()) if section_stack else 1
                
                current_section = section
            else:
@@ -577,13 +669,14 @@ class PDFParser(DocumentParser):
        Returns:
            (章节编号, 章节标题) 或 None
        """
-        # Pattern: "3.1功能需求" or "3.1 功能需求"
-        match = re.match(r'^(\d+(?:\.\d+)*)\s*(.+)$', line)
+        # Pattern: "3.1 功能需求" / "3.1.2 电场..."
+        match = re.match(r'^(\d+(?:\.\d+)*)[\s、.)）]*(.+)$', line)
        if not match:
            return None
        
        number = match.group(1)
        title = match.group(2).strip()
+        level = len(number.split('.'))
        
        # 排除目录行
        if '...' in title or title.count('.') > 5:
@@ -609,6 +702,18 @@ class PDFParser(DocumentParser):
        # 标题长度检查
        if len(title) > 60 or len(title) < 2:
            return None
+
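+        # Filter out phrasing that reads like body-text description.
+        if self._looks_like_statement(title):
+            return None
+
+        # Filter out likely body sentences (containing a full stop or semicolon and too long).
+        if len(title) > 24 and re.search(r'[。；;]', title):
+            return None
+
+        # Filter out noisy titles stitched together from instructions (many commas usually indicate a body-text fragment).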
+        if title.count('，') >= 2 and len(title) > 20:
+            return None
        
        # 放宽标题字符要求（兼容部分PDF字体导致中文抽取异常的情况）
        if not re.search(r'[\u4e00-\u9fa5A-Za-z]', title):
@@ -631,8 +736,30 @@ class PDFParser(DocumentParser):
        # 检查标题是否包含反斜杠（通常是表格噪声）
        if '\\' in title and '需求' not in title:
            return None
+
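+        # Fallback on common valid-title keywords, lowering the chance that body text is recognized as a title.
+        if not any(k in title for k in self.VALID_TITLE_KEYWORDS):
+            return None
+
+        # Top-level titles must contain SRS structural keywords, so that body phrases such as "综合监控" or "电场" are not recognized as chapters.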
+        if level == 1 and not any(k in title for k in self.TOP_LEVEL_TITLE_KEYWORDS):
+            return None
        
        return (number, title)
+
+    def _looks_like_statement(self, title: str) -> bool:
+        """Decide whether a title reads like a body-text statement rather than a section name."""
+        if not title:
+            return False
+
+        statement_hints = ["应", "能够", "可以", "进行", "通过", "并", "同时", "当", "如果", "则"]
+        if any(h in title for h in statement_hints):
+            return True
+
+        if len(title) > 24 and re.search(r'[，。；;:：]', title):
+            return True
+
+        return False
    
    def _is_noise(self, line: str) -> bool:
        """检查是否是噪声行"""
--- a/src/json_generator.py
+++ b/src/json_generator.py
@@ -146,8 +146,8 @@ class JSONGenerator:
                if req.type == 'interface':
                    req_dict["接口名称"] = req.interface_name
                    req_dict["接口类型"] = req.interface_type
-                    req_dict["来源"] = req.source
-                    req_dict["目的地"] = req.destination
+                    req_dict["数据来源"] = req.source
+                    req_dict["数据目的地"] = req.destination
                result["需求列表"].append(req_dict)
        
        # 如果有子章节，添加子章节
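Looking back at `src/document_parser.py` in this commit: tables are now attached to sections by table number first, and `_extract_table_reference` supplies that number. The standalone sketch below mirrors the regular expression from the diff on plain lists of strings, without pdfplumber. The function name `extract_table_reference` and the sample rows are illustrative only.

```python
import re
from typing import List

# Same pattern as in the diff: "表" followed by 2 to 4 dash-separated numbers,
# accepting both the ASCII hyphen and the full-width "－".
TABLE_REF = re.compile(r"表\s*(\d+(?:[-－]\d+){1,3})")


def extract_table_reference(table: List[List[str]]) -> str:
    """Return a reference such as '3-5' found in the first two rows, else ''."""
    if not table:
        return ""
    head_rows = table[:2]
    merged = " ".join(" ".join(str(c or "") for c in row) for row in head_rows)
    merged = re.sub(r"\s+", "", merged)  # drop all whitespace before matching
    m = TABLE_REF.search(merged)
    if not m:
        return ""
    return m.group(1).replace("－", "-")  # normalize the full-width dash


if __name__ == "__main__":
    rows = [["表 3－5 接口要求", ""], ["序号", "名称"]]
    print(extract_table_reference(rows))
```

Because whitespace is removed before matching, a cell that begins with a digit directly after the reference would be absorbed into it (for example `表3-5` followed by a cell `1` reads as `3-51`). Whether that occurs in real documents cannot be told from the diff.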
--- a/src/requirement_extractor.py
+++ b/src/requirement_extractor.py
--- a/src/requirement_splitter.py
+++ b/src/requirement_splitter.py
@@ -33,8 +33,10 @@ class RequirementSplitter:
    CONNECTOR_HINTS = ["并", "并且", "同时", "然后", "且", "以及", "及"]
    CONDITIONAL_HINTS = ["如果", "当", "若", "在", "其中", "此时", "满足"]
    CONTEXT_PRONOUN_HINTS = ["该", "其", "上述", "此", "这些", "那些"]
+    CHAIN_HINTS = ["从而", "以便", "用于", "以实现", "并据此", "进而", "从而实现"]
+    ENUMERATION_HINTS = ["具体包括", "包括但不限于", "主要包括", "其中包括", "如下"]

-    def __init__(self, max_sentence_len: int = 120, min_clause_len: int = 12):
+    def __init__(self, max_sentence_len: int = 160, min_clause_len: int = 20):
        self.max_sentence_len = max_sentence_len
        self.min_clause_len = min_clause_len

@@ -107,6 +109,14 @@ class RequirementSplitter:
        if len(current) < self.min_clause_len:
            return False

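+        # Items listed after "具体包括"/"其中包括" usually extend the previous sentence and should not be split into separate requirements.
+        if any(h in current for h in self.ENUMERATION_HINTS):
+            return False
+
+        # A chaining phrase is usually not an independent requirement action; avoid cutting the semantic chain.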
+        if any(fragment.startswith(h) for h in self.CHAIN_HINTS):
+            return False
+
        # 指代承接片段通常是语义延续，不应切断。
        if any(fragment.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
            return False
@@ -123,6 +133,12 @@ class RequirementSplitter:
        has_action = any(h in fragment for h in self.ACTION_HINTS)
        current_has_action = any(h in current for h in self.ACTION_HINTS)

+        # When a coordinating connector is followed by a qualifying phrase such as "控制/处理/显示", treat it as the same requirement.
+        if has_connector and len(fragment) < self.max_sentence_len // 3 and not any(
+            kw in fragment for kw in ["并输出", "并上传", "并记录", "并触发"]
+        ):
+            return False
+
        # 连接词 + 动作词，且当前片段已经包含动作，优先拆分。
        if has_connector and has_action and current_has_action:
            return True
@@ -147,6 +163,9 @@ class RequirementSplitter:
        return merged

    def _should_merge(self, prev: str, current: str) -> bool:
+        if any(h in prev for h in self.ENUMERATION_HINTS):
+            return True
+
        # 指代开头：如“该报警信号...”。
        if any(current.startswith(h) for h in self.CONTEXT_PRONOUN_HINTS):
            return True
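The splitter changes add veto rules that stop a sentence from being cut before a fragment. A minimal sketch of those vetoes, isolated from the rest of `RequirementSplitter`, is shown below. The function name `blocks_split` is hypothetical and the connector/action scoring from the surrounding class is left out; the hint lists are copied from the diff.

```python
ENUMERATION_HINTS = ["具体包括", "包括但不限于", "主要包括", "其中包括", "如下"]
CHAIN_HINTS = ["从而", "以便", "用于", "以实现", "并据此", "进而", "从而实现"]
CONTEXT_PRONOUN_HINTS = ["该", "其", "上述", "此", "这些", "那些"]


def blocks_split(current: str, fragment: str, min_clause_len: int = 20) -> bool:
    """Return True when one of the guards vetoes a split before `fragment`."""
    # The accumulated clause is too short to stand alone.
    if len(current) < min_clause_len:
        return True
    # An enumeration lead-in means what follows belongs to the same sentence.
    if any(h in current for h in ENUMERATION_HINTS):
        return True
    # Purpose/result phrases and pronoun openers continue the previous clause.
    if any(fragment.startswith(h) for h in CHAIN_HINTS + CONTEXT_PRONOUN_HINTS):
        return True
    return False


if __name__ == "__main__":
    current = "系统应周期性采集各传感器的电场强度数据并完成滤波处理"
    print(blocks_split(current, "从而实现超限报警"))
    print(blocks_split(current, "显示报警信息"))
```

Raising `min_clause_len` from 12 to 20, as this commit does, makes the first guard fire more often, so fewer short clauses become separate requirements.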
--- a/src/settings.py
+++ b/src/settings.py
@@ -60,6 +60,46 @@ class AppSettings:
        "other": "OR",
    }

+    DEFAULT_INTERFACE_SECTION_HINTS = [
+        "接口描述",
+        "接口需求",
+        "接口要求",
+        "外部接口",
+        "内部接口",
+        "i/o",
+    ]
+
+    DEFAULT_INTERFACE_TITLE_EXCLUDES = [
+        "计算机通信需求",
+        "通信需求",
+        "通信要求",
+    ]
+
+    DEFAULT_FUNCTIONAL_SECTION_HINTS = [
+        "功能需求",
+        "功能要求",
+    ]
+
+    DEFAULT_OTHER_SECTION_HINTS = [
+        "安全性需求",
+        "保密性需求",
+        "适应性需求",
+        "环境需求",
+        "资源需求",
+        "质量",
+        "设计约束",
+        "培训需求",
+        "软件保障",
+        "验收",
+        "交付",
+        "包装",
+        "通信需求",
+        "计算机通信需求",
+        "硬件环境",
+        "软件环境",
+        "运行环境",
+    ]
+
    def __init__(self, config: Dict[str, Any] = None):
        self.config = config or {}

@@ -75,6 +115,20 @@ class AppSettings:
        self.type_prefix = self._build_type_prefix(req_types_cfg)
        self.type_chinese = self._build_type_chinese(req_types_cfg)

+        semantic_type_cfg = extraction_cfg.get("semantic_type_policy", {})
+        self.interface_section_hints = [
+            str(x).lower() for x in semantic_type_cfg.get("interface_section_hints", self.DEFAULT_INTERFACE_SECTION_HINTS)
+        ]
+        self.interface_title_excludes = [
+            str(x).lower() for x in semantic_type_cfg.get("interface_title_excludes", self.DEFAULT_INTERFACE_TITLE_EXCLUDES)
+        ]
+        self.functional_section_hints = [
+            str(x).lower() for x in semantic_type_cfg.get("functional_section_hints", self.DEFAULT_FUNCTIONAL_SECTION_HINTS)
+        ]
+        self.other_section_hints = [
+            str(x).lower() for x in semantic_type_cfg.get("other_section_hints", self.DEFAULT_OTHER_SECTION_HINTS)
+        ]
+
        splitter_cfg = extraction_cfg.get("splitter", {})
        self.splitter_max_sentence_len = int(splitter_cfg.get("max_sentence_len", 120))
        self.splitter_min_clause_len = int(splitter_cfg.get("min_clause_len", 12))
@@ -91,16 +145,61 @@ class AppSettings:
        self.table_llm_semantic_enabled = bool(table_cfg.get("llm_semantic_enabled", True))
        self.sequence_table_merge = table_cfg.get("sequence_table_merge", "single_requirement")
        self.merge_time_series_rows_min = int(table_cfg.get("merge_time_series_rows_min", 3))
+        self.table_skip_keywords = list(
+            table_cfg.get(
+                "skip_keywords",
+                ["系统功能要求", "性能要求", "功能矩阵", "能力对照", "性能指标对照"],
+            )
+        )
+        self.table_interface_keywords = list(
+            table_cfg.get(
+                "interface_keywords",
+                ["接口", "interface", "输入输出", "I/O", "数据来源", "数据目的地", "来源", "目的地"],
+            )
+        )
+        self.table_single_requirement_keywords = list(
+            table_cfg.get(
+                "single_requirement_keywords",
+                ["硬件要求", "软件要求", "运行环境", "环境需求", "资源需求", "计算机资源"],
+            )
+        )

        rewrite_cfg = extraction_cfg.get("rewrite_policy", {})
        self.llm_light_rewrite_enabled = bool(rewrite_cfg.get("llm_light_rewrite_enabled", True))
        self.preserve_ratio_min = float(rewrite_cfg.get("preserve_ratio_min", 0.65))
        self.max_length_growth_ratio = float(rewrite_cfg.get("max_length_growth_ratio", 1.25))
+        self.non_interface_max_edit_distance = int(rewrite_cfg.get("non_interface_max_edit_distance", 20))
+
+        self.system_description_hints = list(
+            extraction_cfg.get(
+                "system_description_hints",
+                ["系统描述", "功能描述", "概述", "示意图", "组成", "架构", "原理"],
+            )
+        )

        renumber_cfg = extraction_cfg.get("renumber_policy", {})
        self.renumber_enabled = bool(renumber_cfg.get("enabled", True))
        self.renumber_mode = renumber_cfg.get("mode", "section_continuous")

+        dedup_cfg = extraction_cfg.get("dedup_policy", {})
+        self.dedup_similarity_threshold = float(dedup_cfg.get("similarity_threshold", 0.88))
+        self.enable_cross_section_dedup = bool(dedup_cfg.get("enable_cross_section_dedup", True))
+        self.prefer_text_over_table = bool(dedup_cfg.get("prefer_text_over_table", True))
+
+        interface_cfg = extraction_cfg.get("interface_policy", {})
+        self.interface_unknown_fallback = str(interface_cfg.get("unknown_fallback", "未知"))
+
+        normalization_cfg = extraction_cfg.get("normalization_policy", {})
+        self.ocr_spacing_normalize = bool(normalization_cfg.get("ocr_spacing_normalize", True))
+
+        fidelity_cfg = extraction_cfg.get("fidelity_policy", {})
+        self.preserve_source_text_for_text_blocks = bool(
+            fidelity_cfg.get("preserve_source_text_for_text_blocks", True)
+        )
+
+        punctuation_cfg = extraction_cfg.get("punctuation_policy", {})
+        self.ensure_terminal_period = bool(punctuation_cfg.get("ensure_terminal_period", True))
+
    def _build_rules(self, req_types_cfg: Dict[str, Dict[str, Any]]) -> List[RequirementTypeRule]:
        rules: List[RequirementTypeRule] = []
        if not req_types_cfg:
@@ -153,10 +252,45 @@ class AppSettings:
    def is_non_requirement_section(self, title: str) -> bool:
        return any(keyword in title for keyword in self.non_requirement_sections)

+    def is_interface_semantic_title(self, title: str) -> bool:
+        t = (title or "").strip().lower()
+        if not t:
+            return False
+
+        excluded = any(x in t for x in self.interface_title_excludes)
+        if excluded and "接口" not in t:
+            return False
+
+        return any(h in t for h in self.interface_section_hints)
+
+    def is_functional_semantic_title(self, title: str) -> bool:
+        t = (title or "").strip().lower()
+        if not t:
+            return False
+        return any(h in t for h in self.functional_section_hints)
+
+    def is_other_semantic_title(self, title: str) -> bool:
+        t = (title or "").strip().lower()
+        if not t:
+            return False
+        return any(h in t for h in self.other_section_hints)
+
    def detect_requirement_type(self, title: str, content: str) -> str:
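+        # Section semantics take priority: interface is triggered only by interface-type sections; security, confidentiality, adaptability and the like are all classified as other requirements.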
+        if self.is_interface_semantic_title(title):
+            return "interface"
+        if self.is_functional_semantic_title(title):
+            return "functional"
+        if self.is_other_semantic_title(title):
+            return "other"
+
        combined_text = f"{title} {(content or '')[:500]}".lower()
        for rule in self.requirement_rules:
+            if rule.key == "interface" and not self.is_interface_semantic_title(title):
+                continue
            for keyword in rule.keywords:
                if keyword.lower() in combined_text:
+                    if rule.key in {"performance", "security", "reliability", "other"}:
+                        return "other"
                    return rule.key
        return "functional"
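The new title-first step in `detect_requirement_type` can be read as a small decision function: interface hints win unless the title is a communication heading without 接口, then functional hints, then the "other" hints. The sketch below reproduces only that first stage with abridged hint lists. `type_from_title` is a hypothetical name, and returning `None` stands for "fall through to the keyword rules", which are omitted here.

```python
from typing import Optional

# Abridged from the DEFAULT_* lists in src/settings.py (this commit).
INTERFACE_HINTS = ["接口描述", "接口需求", "接口要求", "外部接口", "内部接口", "i/o"]
INTERFACE_EXCLUDES = ["计算机通信需求", "通信需求", "通信要求"]
FUNCTIONAL_HINTS = ["功能需求", "功能要求"]
OTHER_HINTS = ["安全性需求", "保密性需求", "环境需求", "通信需求", "设计约束"]


def is_interface_title(title: str) -> bool:
    t = (title or "").strip().lower()
    if not t:
        return False
    # Communication headings count as interface only if they also say 接口.
    if any(x in t for x in INTERFACE_EXCLUDES) and "接口" not in t:
        return False
    return any(h in t for h in INTERFACE_HINTS)


def type_from_title(title: str) -> Optional[str]:
    """First stage only: classify by section title, or return None."""
    t = (title or "").strip().lower()
    if is_interface_title(title):
        return "interface"
    if any(h in t for h in FUNCTIONAL_HINTS):
        return "functional"
    if any(h in t for h in OTHER_HINTS):
        return "other"
    return None  # the diff continues with keyword rules, then "functional"


if __name__ == "__main__":
    for title in ["3.3 外部接口需求", "3.6 计算机通信需求", "3.2 功能需求", "1 范围"]:
        print(title, "->", type_from_title(title))
```

Hints are compared against the lowercased title, which is why the default list in `settings.py` stores `i/o` in lowercase while `config.yaml` may spell it `I/O`.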