Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catastrofic backtracking fix #426

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DemonTPx
Copy link

@DemonTPx DemonTPx commented Apr 9, 2024

Fixes #425

Update setHostInContent regexes to reduce steps and hopefully prevent a PREG_BACKTRACK_LIMIT_ERROR and check result of preg_replace to be sure we do not set content or description to null.

Also refactor setHostInContent and getHostFromLink a bit for readability.

cc @IgorA100

… a PREG_BACKTRACK_LIMIT_ERROR and check result of preg_replace to be sure we do not set content or description to null
@DemonTPx DemonTPx requested a review from alexdebril as a code owner April 9, 2024 13:17
@DemonTPx
Copy link
Author

DemonTPx commented Apr 9, 2024

I also tested the updated regex again with a modified version of the feed item, see:

https://regex101.com/r/oxU67Z/1

@IgorA100
Copy link
Contributor

IgorA100 commented Apr 9, 2024

I see this problem with your link.
I'll think about a solution.

$this->description = preg_replace('!(<*\s*[^>]*)(src=)(.?)(\/[^\/])!','\1 src=\3'.$host.'\4', $this->description );
}
if (property_exists($this, 'content') && !is_null($this->content)){
$this->content = preg_replace('!(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])!','\1\2\3'.$host.'\4', $this->content) ?? $this->content;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a lot of nested tags.
Change to:
$this->content = preg_replace('!(href=|src=)(.?)(\/[^\/])!','\1\2'.$host.'\3', $this->content) ?? $this->content;
Similarly for "$this->description"

This will fix error for preg_replace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write the most correct regular expression a little later.
Wait a bit. I have a lot to do now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already fixed the regex in my PR. The cycles went down from over 5.8 million to 10.000 by removing the * after the first <

Copy link
Contributor

@IgorA100 IgorA100 Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also necessary to exclude the processing of "href" in tags <code>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, maybe the regex should only search inside <a> and <img> tags or something?

Copy link
Contributor

@IgorA100 IgorA100 Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is most correct regular expression:
preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->content)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the regex should only search inside <a> and <img> tags

I'm not 100% sure about this, but I'll think about it. In addition, it is still necessary to replace relative links like href="xxx..."

Copy link
Contributor

@IgorA100 IgorA100 Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if analyze only <a> and <img>, don’t need to change them in <code>.

Copy link
Contributor

@IgorA100 IgorA100 Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the function to include all the other possible replacements for relative links.
Can you test my changes and make them in this PR?
All replacements can be moved to a separate function since there are more of them.

All links in https://go.dev/blog/feed.atom are replaced correctly.

    protected function setHostInContent(string $host = null): void
    {
        if (is_null($host)) {
            return;
        }
        // Replaced links like href="/aaa/bbb.xxx"
        if (property_exists($this, 'content') && !is_null($this->content)) {
            $this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->content) ?? $this->content;
        }
        if (property_exists($this, 'description') && !is_null($this->description)) {
            $this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->description) ?? $this->description;
        }

        $itemFullLink = $this->getLink();
        if (property_exists($this, 'link') && !is_null($itemFullLink)) {
            $itemLink = implode("/", array_slice(explode("/", $itemFullLink), 0, -1))."/";
            if (property_exists($this, 'content') && !is_null($this->content)){
                // Replaced links like href="#aaa/bbb.xxx"
                $this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemFullLink.'\4', $this->content) ?? $this->content;

                // Replaced links like href="aaa/bbb.xxx"
                $this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemLink.'\4', $this->content) ?? $this->content;
            }
            if (property_exists($this, 'description') && !is_null($this->description)) {
                // Replaced links like href="#aaa/bbb.xxx"
                $this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemFullLink.'\4', $this->description) ?? $this->description;

                // Replaced links like href="aaa/bbb.xxx"
                $this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemLink.'\4', $this->description) ?? $this->description;
            }
        }
    }


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanest code:

    protected function setHostInContent(string $host = null): void
    {
        if (is_null($host)) {
            return;
        }
        // Replaced links like href="/aaa/bbb.xxx"
        $pattern = '(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)';
        $this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$host.'\4');
        $this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$host.'\4');

        $itemFullLink = $this->getLink();
        $itemLink = implode("/", array_slice(explode("/", $itemFullLink), 0, -1))."/";

        // Replaced links like href="#aaa/bbb.xxx"
        $pattern = '(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)';
        $this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$itemFullLink.'\4');
        $this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$itemFullLink.'\4');

        // Replaced links like href="aaa/bbb.xxx"
        $pattern = '(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)';
        $this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$itemLink.'\4');
        $this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$itemLink.'\4');
    }

    public function pregReplaceInProperty(string $property, string $pattern, string $replacement): void
    {
        if (property_exists($this, $property) && !is_null($this->{$property})) {
            $this->{$property} = preg_replace('~'.$pattern.'~', $replacement, $this->{$property}) ?? $this->{$property};
        }
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catastrofic backtracking in regex crashes feed read
2 participants