Content of Some links cannot be crawled #65

Open
simmonn opened this issue Apr 26, 2024 · 2 comments

simmonn commented Apr 26, 2024

Description

Hi, I've run into a problem. After running the scraper, I found that the content of some links is not crawled. The logs show 0 records. I have tried many approaches, but those links still are not crawled.

Here is a snapshot of the logs:
image

Steps to reproduce

Here is part of my config:

{
  "index_name": "docs",
  "sitemap_urls": [
    "https://mydomain/sitemap.xml"
  ],
  "start_urls": [
    {
      "url": "https://mydomain/guides",
      "tags": [
        "guides"
      ],
      "selectors_key": "guides"
    }
  ],
  "stop_urls": [],
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "文档"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    },
    "guides": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "开发指南"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag",
      "tags"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "nb_hits": 2227
}

Expected Behavior

I expect the content of every link in the configuration to be crawled and indexed into Typesense.

Actual Behavior

The content of the affected links cannot be found in search.

image

Metadata

Typesense Version: possibly 0.24; I don't know how to check the version.

OS: x86_64 GNU/Linux
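
On the version question above: a Typesense server reports its version through the `/debug` endpoint. The sketch below queries it with the Python standard library. The host `http://localhost:8108` and the API key `xyz` in the usage note are placeholders, not values from this issue, and the helper names are made up for illustration.

```python
import json
import urllib.request


def build_debug_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the Typesense /debug endpoint."""
    url = base_url.rstrip("/") + "/debug"
    return urllib.request.Request(url, headers={"X-TYPESENSE-API-KEY": api_key})


def fetch_version(base_url: str, api_key: str) -> str:
    """Return the "version" field reported by /debug (makes a network call)."""
    request = build_debug_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["version"]
```

Calling `fetch_version("http://localhost:8108", "xyz")` against a reachable server should return the version string, assuming the key is valid for that server.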

jasonbosco (Member) commented

Could you make sure the HTML selectors exist on that page?
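
One way to check this outside the browser is to count selector matches in the HTML as it is served, since the Chrome console works on the DOM after JavaScript has run and the two can differ. This is a minimal sketch using only the standard library. It understands only the plain `article <tag>` form used in the config above (a part with a pseudo-class such as `td:first-child` counts as zero), and the names `ArticleTagCounter` and `count_matches` are hypothetical helpers, not part of the scraper.

```python
from collections import Counter
from html.parser import HTMLParser


class ArticleTagCounter(HTMLParser):
    """Count start tags that appear inside an <article> element."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Count the tag only if we are already inside an <article>.
        if self.article_depth > 0:
            self.counts[tag] += 1
        if tag == "article":
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def count_matches(html: str, selectors: dict) -> dict:
    """Map each selector key to the number of 'article <tag>' matches."""
    parser = ArticleTagCounter()
    parser.feed(html)
    parser.close()
    result = {}
    for level, selector in selectors.items():
        total = 0
        for part in selector.split(","):
            tokens = part.split()
            # Anything other than 'article <tag>' is ignored in this sketch.
            if len(tokens) == 2 and tokens[0] == "article":
                total += parser.counts[tokens[1]]
        result[level] = total
    return result
```

A level that reports 0 matches on the served HTML while matching in the browser would suggest that the page content is produced or altered after load.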

Also, could you make sure that the base URL of those links is specified in the start_urls section?
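
A quick way to check this is to compare the failing links against the `start_urls` entries from the config. The sketch below assumes a simple prefix comparison; the scraper's real matching rules may differ. The function names and the example links in the usage note are illustrative only.

```python
def is_covered(url: str, start_urls: list) -> bool:
    """Return True if url begins with one of the configured start URLs."""
    # Entries may be plain strings or objects with a "url" key, as in the config.
    prefixes = [
        entry["url"] if isinstance(entry, dict) else entry
        for entry in start_urls
    ]
    return any(url.startswith(prefix) for prefix in prefixes)


def uncovered(urls: list, start_urls: list) -> list:
    """List the URLs whose base is not present in start_urls."""
    return [url for url in urls if not is_covered(url, start_urls)]
```

With the config above, `uncovered(["https://mydomain/guides/intro", "https://mydomain/api/intro"], config["start_urls"])` would flag only the second link, because only `https://mydomain/guides` is listed.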

simmonn commented May 8, 2024

> Could you make sure the HTML selectors exist on that page?
>
> Also, could you make sure that the base URL of those links is specified in the start_urls section?

Yes, I have configured both. The selectors match when I test them as XPath expressions in the Chrome console. I also tried using BeautifulSoup to compress the HTML source code, which solves the problem, but I'm not sure what the root cause is.
Here is the code:
image
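
The code in the screenshot is not reproduced here. As a rough illustration of what compressing HTML can mean, the sketch below collapses whitespace with regular expressions from the standard library. It is a swapped-in technique, not BeautifulSoup and not the author's code, and `collapse_whitespace` is a hypothetical name.

```python
import re


def collapse_whitespace(html: str) -> str:
    """Drop whitespace between tags and squeeze other runs to one space."""
    # Remove whitespace that sits only between a closing '>' and an opening '<'.
    html = re.sub(r">\s+<", "><", html)
    # Squeeze any remaining whitespace runs (including newlines) to one space.
    return re.sub(r"\s+", " ", html).strip()
```

This is lossy for whitespace-sensitive content such as `<pre>` blocks and for spaces between inline elements, so it is only suitable for experiments like the one described above.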
