
[Feature] a transform to perform file level de-dupe (exact) #870

Open
2 tasks done
sujee opened this issue Dec 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@sujee
Contributor

sujee commented Dec 12, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/ededup

Feature

The current process for dedupe is:

  • run pdf2pq
  • it processes document content and computes document hashes
  • then we can run ededup on the computed hashes

This has some drawbacks:

  • we need to process the documents (extract contents, calculate hashes, etc.) before a document-level hash is available. This can be resource-intensive for large PDFs, zips, etc. There is no need to spend resources processing duplicate content if we are going to eliminate it soon after.

What I propose is:

  • a new transform (say, file-dedupe)
  • it computes file hashes (no content processing)
  • it eliminates duplicate files and keeps only the unique ones
  • it can work with any file type

Functionality

  • write a Python version first
  • parameters: input_dir and output_dir
  • the transform creates unique files in output_dir. We don't need to copy the files; just symlink or hard-link when available.
  • the output dir can then be fed to other stages (see the sketch after this list)
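
A minimal sketch of what this could look like, assuming a standalone script that only scans the top level of input_dir; the names compute_file_hash and dedupe_files are hypothetical, not part of any existing transform API:

import hashlib
import os
import shutil

def compute_file_hash(path: str) -> str:
    # Hash the raw bytes of the file in fixed-size chunks to bound memory use.
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            sha1.update(chunk)
    return sha1.hexdigest()

def dedupe_files(input_dir: str, output_dir: str) -> None:
    # Keep the first file seen for each content hash and link it into output_dir.
    os.makedirs(output_dir, exist_ok=True)
    seen = set()
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue
        digest = compute_file_hash(src)
        if digest in seen:
            continue  # duplicate content, skip it
        seen.add(digest)
        dst = os.path.join(output_dir, name)
        try:
            os.symlink(os.path.abspath(src), dst)
        except OSError:
            try:
                os.link(src, dst)  # hard link if symlinks are unavailable
            except OSError:
                shutil.copy2(src, dst)  # last resort: copy

Symlinks keep the output directory cheap to build; whether downstream transforms can follow symlinks would need checking.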

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the enhancement New feature or request label Dec 12, 2024
@sujee sujee changed the title [Feature] perform file level de-dupe (exact) before processing files [Feature] a transform to perform file level de-dupe (exact) Dec 12, 2024
@Bytes-Explorer
Collaborator

@touma-I @daw3rd please add your comments and suggestions.

@touma-I
Collaborator

touma-I commented Dec 13, 2024

@sujee What is a file hash? How is it calculated? Is it based on file name, size, and other metadata, without consideration for the content? Or does it treat the file content as just a stream of bytes and calculate the hash over that? It is still not clear to me how you get around reading the file to calculate the hash. Can you please elaborate on what the file hash formula is?

@sujee
Contributor Author

sujee commented Dec 13, 2024

The hash is based on the file content, treating it as a stream of bytes. So yes, we do need to read the files, but there is no need to process them (no pdf2pq, etc.).

Something like this:

import hashlib

def file_sha1(file_path: str) -> str:
    # Compute the SHA-1 of the file's raw bytes, read in chunks.
    sha1_hash = hashlib.sha1()
    chunk_size = 4096  # adjust this value based on your needs
    with open(file_path, 'rb') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            sha1_hash.update(chunk)
    return sha1_hash.hexdigest()

@touma-I
Collaborator

touma-I commented Dec 13, 2024

@sujee How would you treat zip and tar files?

@sujee
Contributor Author

sujee commented Dec 13, 2024

@sujee How would you treat zip and tar files?

For the first version, I plan to treat zip/tar files as ONE file. So if there are duplicate zip/tar files, the dupes will be eliminated; I won't 'look inside' the archives.

In the next version, I can add functionality to extract the archive content and perform de-dupe on everything inside (a rough sketch is below).
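
A minimal sketch of that follow-up idea for zip archives, assuming the standard zipfile module and hashing each member's raw bytes; what to do with the per-member hashes (which copy to keep, where to write it) is left open:

import hashlib
import zipfile

def zip_member_hashes(archive_path: str) -> dict:
    # Map each file inside the zip to the SHA-1 of its content,
    # reading members as streams so large archives are not fully loaded.
    hashes = {}
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            sha1 = hashlib.sha1()
            with zf.open(info) as member:
                for chunk in iter(lambda: member.read(4096), b''):
                    sha1.update(chunk)
            hashes[info.filename] = sha1.hexdigest()
    return hashes

tarfile offers an analogous streaming interface (TarFile.extractfile), so tar and tar.gz archives could be handled the same way.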
