[BUG] Web knowledge base fails to fully sync certain sites #1563

saurlax · 2024-11-06T09:23:21Z

Contact information

[email protected]

MaxKB version

v1.7.0 (build at 2024-10-31T12:49, commit: 44b3aed)

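Problem description

When syncing certain websites with a Web knowledge base, only one or two pages are synced. For example, the following link: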

https://starlight.astro.build/

The following site, however, syncs correctly:

https://tauri.app/

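I suspect a link-following issue. On the tauri docs, the page header contains hyperlinks, and every page whose path starts with /start, /concept, /blog, or /release gets synced. starlight has no such pattern, so only two documents were synced.

Steps to reproduce

Parameter configuration for syncing starlight:

The result contains only the home page and the first page:

Parameter configuration for syncing tauri:

The result contains all documents whose paths start with /start, /concept, /blog, or /release:

Expected result

All pages under the root directory defined in the settings should be synced.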

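Additional information

I also suggest adding an option to import from a sitemap, which would give better results for frameworks that already generate one. The knowledge base name could also be taken from the title of the selected page rather than from the hyperlink text.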
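
To make the sitemap suggestion concrete, here is a minimal sketch of how page URLs could be read from a sitemap and limited to the configured root. It is not MaxKB code: the function name `parse_sitemap_urls`, the `root_url` parameter, and the sample URLs are assumptions for illustration. It handles a plain `<urlset>` sitemap only, not sitemap index files.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_urls(xml_text: str, root_url: str) -> List[str]:
    """Return the <loc> URLs of a sitemap that start with root_url.

    Hypothetical helper: the name and signature are illustrative only.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(root_url):
            urls.append(url)
    return urls


# Made-up sample data standing in for a downloaded sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/guide/</loc></url>
  <url><loc>https://other.example.org/page/</loc></url>
</urlset>"""

print(parse_sitemap_urls(SAMPLE, "https://example.com/"))
```

Because every page is listed explicitly, this approach does not depend on which links happen to appear on each crawled page.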

saurlax · 2024-11-06T09:53:04Z

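The cause is that the fork_child function passes its arguments incorrectly on recursion, so sibling pages discovered inside deeper pages are wrongly discarded. For example:

Set the root directory to /
Start traversing the root directory and find /a
Traverse /a and find /a/child and /b. Because base_url is now set to /a, the page /b, which should have been crawled, is wrongly discarded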

    @staticmethod
    def fork_child(child_link: ChildLink, selector_list: List[str], level: int, exclude_link_url: Set[str],
                   fork_handler):
        if level < 0:
            return
        else:
            child_link.url = remove_fragment(child_link.url)
            child_url = child_link.url[:-1] if child_link.url.endswith('/') else child_link.url
        if not exclude_link_url.__contains__(child_url):
            exclude_link_url.add(child_url)
            response = Fork(child_link.url, selector_list).fork()
            fork_handler(child_link, response)
            for child_link in response.child_link_list:
                child_url = child_link.url[:-1] if child_link.url.endswith('/') else child_link.url   # this list has already discarded every link that does not start with `url`
                if not exclude_link_url.__contains__(child_url):
                    ForkManage.fork_child(child_link, selector_list, level - 1, exclude_link_url, fork_handler)  # but the argument passed here goes one level deeper each time, so some sibling pages are never crawled

Suggested fix

Add a root_url parameter so that the base_url information is passed down correctly.
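
The difference can be shown with a toy model. This is a sketch only: it replaces HTTP requests with a hypothetical in-memory `SITE` map, and `links_under`, `crawl_buggy`, and `crawl_fixed` are illustrative names, not MaxKB functions. It assumes, as described above, that the link list returned for a page keeps only links starting with the base URL it was given.

```python
from typing import Dict, List, Set

# Hypothetical in-memory site: page path -> links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/a"],
    "/a": ["/a/child", "/b"],
    "/a/child": [],
    "/b": [],
}


def links_under(page: str, base_url: str) -> List[str]:
    """Mimic the link filter: keep only links that start with base_url."""
    return [link for link in SITE.get(page, []) if link.startswith(base_url)]


def crawl_buggy(url: str, level: int, seen: Set[str]) -> None:
    """Filter links against the current page URL (the reported behaviour)."""
    if level < 0 or url in seen:
        return
    seen.add(url)
    for link in links_under(url, base_url=url):
        crawl_buggy(link, level - 1, seen)


def crawl_fixed(url: str, root_url: str, level: int, seen: Set[str]) -> None:
    """Filter links against root_url, which is passed down unchanged."""
    if level < 0 or url in seen:
        return
    seen.add(url)
    for link in links_under(url, base_url=root_url):
        crawl_fixed(link, root_url, level - 1, seen)


buggy_pages: Set[str] = set()
crawl_buggy("/", 3, buggy_pages)
print(sorted(buggy_pages))  # ['/', '/a', '/a/child']

fixed_pages: Set[str] = set()
crawl_fixed("/", "/", 3, fixed_pages)
print(sorted(fixed_pages))  # ['/', '/a', '/a/child', '/b']
```

In the buggy variant, /b is filtered out while crawling /a because it does not start with /a. Keeping the root fixed across recursion levels lets it through.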

zyyfit · 2024-11-12T06:49:04Z

Thanks for the feedback. We will investigate the issue first.

GuoDapeng · 2024-12-12T05:10:43Z

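I have not read the code, but my testing confirmed that site sync has this problem. I am currently using a very ugly workaround to sync the whole site. My site is SSR.

Use a shell script to list all the paths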

find . -type f -name "*.html" | sed 's|^\.\(.*\)/[^/]*$|\1|'
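
For readers without a POSIX shell, roughly the same listing can be sketched in Python. The function name `html_dirs_as_url_paths` is hypothetical. Unlike the command above, this version removes duplicates, sorts the result, and reports the site root as `/` instead of an empty line.

```python
from pathlib import Path
from typing import List


def html_dirs_as_url_paths(site_root: Path) -> List[str]:
    """List the directories containing .html files as URL-style paths.

    Hypothetical helper mirroring the find/sed pipeline: for every
    *.html file, the file name is dropped and its directory is kept.
    """
    paths = set()
    for html_file in site_root.rglob("*.html"):
        if not html_file.is_file():
            continue
        rel_dir = html_file.parent.relative_to(site_root).as_posix()
        # relative_to() yields "." for files directly under site_root.
        paths.add("/" if rel_dir == "." else "/" + rel_dir)
    return sorted(paths)


if __name__ == "__main__":
    # Scan the current directory, like `find .` in the shell version.
    for url_path in html_dirs_as_url_paths(Path(".")):
        print(url_path)
```

The printed paths can then be pasted into the link list used in the next step.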

Insert a hidden node into the root page that contains all the links to be synced.

/**
 * inline: true
 */
import React from 'react';
import Link from 'antd/es/typography/Link';

const SEOLinks = [
  { url: '/home' },
  { url: '/home/bilibili' },
  { url: '/ffmpeg/macos' },
  { url: '/ffmpeg' },
];

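// Hidden list of links so that the crawler can discover every page from the root.
const SEOLinkList: React.FC = () => (
  <div style={{ display: 'none' }}>
    {SEOLinks.map((it) => (
      <Link key={it.url} href={it.url} target="_blank">
        {it.url}
      </Link>
    ))}
  </div>
);

export default SEOLinkList;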

It is ugly, but it meets my needs.

saurlax assigned zyyfit Nov 6, 2024

saurlax added a commit to saurlax/MaxKB that referenced this issue Nov 6, 2024

fix: fix incomplete fork crawling 1Panel-dev#1563

da63756

saurlax mentioned this issue Nov 6, 2024

fix: fix incomplete fork crawling #1563 #1565

Closed

3 tasks

zyyfit added the "type: to be verified" label Nov 12, 2024

baixin513 added "status: completed" and removed "status: completed" labels Dec 4, 2024
