
[Feature] a transform to perform file level de-dupe (exact) #870

Open
2 tasks done
sujee opened this issue Dec 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@sujee
Contributor

sujee commented Dec 12, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/ededup

Feature

The current process for dedupe is:

  • run pdf2pq
  • it processes document content and computes document hashes
  • then we can run ededup on the computed hashes

This has some drawbacks:

  • we need to process the documents (extract contents, calculate hashes, etc.) before a document-level hash is available. This can be resource-intensive for large PDFs, zips, etc. There is no need to spend resources processing duplicate content if we are going to eliminate it soon after.

What I propose is:

  • a new transform (say, file-dedupe)
  • it computes file hashes (no content processing)
  • it eliminates duplicate files and keeps only the unique ones
  • it can work with any file type

Functionality

  • write a Python version first
  • parameters: input_dir and output_dir
  • the transform creates unique files in output_dir. We don't need to copy the files; just symlink or hard-link when available.
  • the output dir can then be fed to other stages (see the sketch after this list)
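
A minimal sketch of what this could look like, assuming a standalone script that only scans the top level of input_dir; the names compute_file_hash and dedupe_files are hypothetical, not part of any existing transform API:

import hashlib
import os
import shutil

def compute_file_hash(path: str) -> str:
    # Hash the raw bytes of the file in fixed-size chunks to bound memory use.
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            sha1.update(chunk)
    return sha1.hexdigest()

def dedupe_files(input_dir: str, output_dir: str) -> None:
    # Keep the first file seen for each content hash and link it into output_dir.
    os.makedirs(output_dir, exist_ok=True)
    seen = set()
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue
        digest = compute_file_hash(src)
        if digest in seen:
            continue  # duplicate content, skip it
        seen.add(digest)
        dst = os.path.join(output_dir, name)
        try:
            os.symlink(os.path.abspath(src), dst)
        except OSError:
            try:
                os.link(src, dst)  # hard link if symlinks are unavailable
            except OSError:
                shutil.copy2(src, dst)  # last resort: copy

Symlinks keep the output directory cheap to build; whether downstream transforms can follow symlinks would need checking.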

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the enhancement New feature or request label Dec 12, 2024
@sujee sujee changed the title [Feature] perform file level de-dupe (exact) before processing files [Feature] a transform to perform file level de-dupe (exact) Dec 12, 2024
@Bytes-Explorer
Collaborator

@touma-I @daw3rd please add your comments and suggestions.

@touma-I
Collaborator

touma-I commented Dec 13, 2024

@sujee What is a file hash? How is it calculated? Is it based on file name, size, and other metadata, without consideration for the content? Or does it treat the file content as just a stream of bytes and calculate the hash over that? It is still not clear to me how you get around reading the file to calculate the hash. Can you please elaborate on what the file hash formula is?

@sujee
Contributor Author

sujee commented Dec 13, 2024

The hash is based on the file content, treating it as a stream of bytes. So yes, we do need to read the files, but there is no need to process them (no pdf2pq, etc.).

Something like this:

import hashlib

def file_sha1(file_path: str) -> str:
    # Compute the SHA-1 of the file's raw bytes, read in chunks.
    sha1_hash = hashlib.sha1()
    chunk_size = 4096  # adjust this value based on your needs
    with open(file_path, 'rb') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            sha1_hash.update(chunk)
    return sha1_hash.hexdigest()

@touma-I
Collaborator

touma-I commented Dec 13, 2024

@sujee How would you treat zip and tar files?

@sujee
Contributor Author

sujee commented Dec 13, 2024

@sujee How would you treat zip and tar files?

For the first version, I plan to treat zip/tar files as ONE file. So if there are duplicate zip/tar files, the dupes will be eliminated; I won't 'look inside' the archives.

In the next version, I can add functionality to extract the archive content and perform de-dupe on everything inside (a rough sketch is below).
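
A minimal sketch of that follow-up idea for zip archives, assuming the standard zipfile module and hashing each member's raw bytes; what to do with the per-member hashes (which copy to keep, where to write it) is left open:

import hashlib
import zipfile

def zip_member_hashes(archive_path: str) -> dict:
    # Map each file inside the zip to the SHA-1 of its content,
    # reading members as streams so large archives are not fully loaded.
    hashes = {}
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            sha1 = hashlib.sha1()
            with zf.open(info) as member:
                for chunk in iter(lambda: member.read(4096), b''):
                    sha1.update(chunk)
            hashes[info.filename] = sha1.hexdigest()
    return hashes

tarfile offers an analogous streaming interface (TarFile.extractfile), so tar and tar.gz archives could be handled the same way.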
