Web parser refactoring #1587

Open
nsarrazin opened this issue Nov 25, 2024 · 0 comments
Labels
  • back: This issue is related to the Svelte backend or the DB
  • enhancement: New feature or request
  • websearch

Comments

@nsarrazin
Collaborator

Describe your feature request

The goal is to move HuggingChat web parsing to an auto-scaling Space for increased security.

We should make the parser configurable with a PARSER_CONFIG env var, with at least three options for now:

  • SIMPLE (just an HTML fetch)
  • PLAYWRIGHT (current parser)
  • EXTERNAL (see below)
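
A minimal sketch of how the option could be read and validated. Only `PARSER_CONFIG` and the three option names come from this issue; `ParserKind`, `resolveParserKind`, and the fallback to `PLAYWRIGHT` when the variable is unset are assumptions made for illustration.

```typescript
// Option names from the issue; everything else in this block is hypothetical.
const PARSER_KINDS = ["SIMPLE", "PLAYWRIGHT", "EXTERNAL"] as const;
type ParserKind = (typeof PARSER_KINDS)[number];

// Resolve the raw env value to a known parser kind.
// Assumption: an unset or empty value falls back to PLAYWRIGHT (the current parser).
function resolveParserKind(
  raw: string | undefined,
  fallback: ParserKind = "PLAYWRIGHT"
): ParserKind {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = raw.trim().toUpperCase();
  for (const kind of PARSER_KINDS) {
    if (kind === value) {
      return kind;
    }
  }
  // Fail loudly on typos rather than silently picking a parser.
  throw new Error(
    `Invalid PARSER_CONFIG "${raw}", expected one of: ${PARSER_KINDS.join(", ")}`
  );
}
```

On the server this would presumably be called once at startup, for example as `resolveParserKind(process.env.PARSER_CONFIG)`, so that a misconfigured value fails fast instead of at the first web search.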

The external parser should be a standalone web server that we could host on Spaces for HuggingChat. This would bring multiple benefits:

  • We can rely on the security of the existing Spaces infrastructure, which is more battle-tested
  • Decouples parsing load from chat load
  • Makes the parsing servers easier to scale and more fault-tolerant: if (when) Playwright starts leaking memory, we can refresh the parser pods without constantly killing the Node.js ones
  • We can make use of existing Python RAG libraries for the parser servers instead of relying on a custom implementation
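
To make the decoupling concrete, here is a rough sketch of what the chat side of the EXTERNAL option might look like. The issue does not define the protocol, so the `/parse` route, the JSON request and response shapes, and the names `ParsedPage`, `HttpPost`, and `parseWithExternal` are all hypothetical. The HTTP call is injected as a function so the sketch stays independent of any particular client.

```typescript
// Hypothetical response shape; the real parser server may return something else.
interface ParsedPage {
  title: string;
  text: string;
}

// Minimal HTTP abstraction (an assumption), so any client can be plugged in.
type HttpPost = (
  url: string,
  body: string
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Ask the external parser server to fetch and parse a page.
// The chat server never loads the page itself; it only forwards the URL.
async function parseWithExternal(
  pageUrl: string,
  endpoint: string,
  post: HttpPost
): Promise<ParsedPage> {
  const res = await post(`${endpoint}/parse`, JSON.stringify({ url: pageUrl }));
  if (!res.ok) {
    throw new Error(`External parser returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as Partial<ParsedPage>;
  if (typeof data.text !== "string") {
    throw new Error("External parser response is missing `text`");
  }
  return {
    title: typeof data.title === "string" ? data.title : "",
    text: data.text,
  };
}
```

With this shape, a crashed or restarted parser pod only surfaces as a failed request that the chat server can retry or skip, which is the fault-isolation benefit described above.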

We should also add an INSTALL_PLAYWRIGHT build arg to the Docker image, since the simple and external parsers do not need Playwright bundled in the final image.
