DCLM Style Deduplications #214

revbucket · 2024-09-30T20:57:14Z

General updates to the dedupe command to do deduplication using a joint paragraph/document flow in the same way that DCLM does.

Nuanced update list:
Bloom Filter updates:

Used a better binary search to find the optimal Bloom filter (BF) size
Initialize Bloom Filters with multicore parallelism in mind
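
For readers unfamiliar with the sizing step, here is a minimal Python sketch of binary-searching for the smallest Bloom filter size that meets a target false-positive rate. It is not the Rust code from this PR: the function names, the fixed hash count, and the textbook approximation `(1 - exp(-k*n/m))^k` are all assumptions made for illustration.

```python
import math


def false_positive_rate(num_bits: int, num_items: int, num_hashes: int) -> float:
    """Textbook approximation of a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def smallest_bloom_size(num_items: int, target_fp: float, num_hashes: int) -> int:
    """Smallest number of bits whose approximate FP rate is <= target_fp.

    Hypothetical helper: for a fixed hash count the FP rate decreases as the
    filter grows, so a binary search over the bit count is valid.
    """
    lo, hi = 1, 1
    # Grow the upper bound geometrically until it satisfies the target.
    while false_positive_rate(hi, num_items, num_hashes) > target_fp:
        lo = hi
        hi *= 2
    # Standard binary search for the first size that satisfies the target.
    while lo < hi:
        mid = (lo + hi) // 2
        if false_positive_rate(mid, num_items, num_hashes) <= target_fp:
            hi = mid
        else:
            lo = mid + 1
    return hi
```

The doubling phase keeps the search range small even when the answer is in the billions of bits, which is the regime a corpus-scale filter would likely be in.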

Deduper updates:

Switched out threadpool for rayon (cleaner, but equivalently performant)
Added optional read/write of bloom filter file (usually it's not necessary to save this, right?)
Made the main Rust function more modular, so it's easier to add different types of dedupe methods
Logged some after-dedupe stats: {sparsity, removal rate}
Added DCLM style deduplication
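
As a rough illustration of what a joint paragraph/document flow can look like, here is a Python sketch. It is one plausible reading, not the implementation in this PR: a plain `set` stands in for the Bloom filter, and the whitespace tokenization, n-gram size, and the single shared threshold are all assumptions.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Whitespace-token n-grams; short texts yield one shorter n-gram."""
    tokens = text.split()
    if not tokens:
        return []
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def joint_dedupe(text: str, seen: Set[NGram], n: int = 3, threshold: float = 0.8) -> str:
    """Drop the whole document if most n-grams were seen, else drop duplicate paragraphs.

    `seen` is a stand-in for the Bloom filter (assumption for this sketch).
    """
    paragraphs = text.split("\n")
    grams_per_par = [ngrams(p, n) for p in paragraphs]
    # Count duplicates before inserting, so a document is not compared to itself.
    dup_per_par = [sum(g in seen for g in grams) for grams in grams_per_par]
    total = sum(len(grams) for grams in grams_per_par)
    total_dup = sum(dup_per_par)
    for grams in grams_per_par:
        seen.update(grams)
    # Document-level check: remove everything if enough n-grams are duplicates.
    if total > 0 and total_dup / total >= threshold:
        return ""
    # Paragraph-level check: keep only paragraphs below the threshold.
    kept = [
        p
        for p, grams, dup in zip(paragraphs, grams_per_par, dup_per_par)
        if not grams or dup / len(grams) < threshold
    ]
    return "\n".join(kept)
```

A removal rate like the one logged after dedupe could then be computed by comparing the input and output lengths across documents.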

Other stuff:

Modified the dedupe config to expect a "dedupe.dedupe_method" attribute specifying which type of deduplication to run: {documents, paragraphs, dclm}
Updated the tutorial, etc. to include this modified config
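
To make the new attribute concrete, here is a small Python sketch that reads and validates `dedupe.dedupe_method` from a plain dictionary. The dictionary layout and the helper name are assumptions for illustration; the real config classes in `python/dolma/cli/deduper.py` may differ.

```python
from typing import Any, Dict

# The three values named in the PR description.
VALID_DEDUPE_METHODS = ("documents", "paragraphs", "dclm")


def get_dedupe_method(config: Dict[str, Any]) -> str:
    """Return config["dedupe"]["dedupe_method"], rejecting unknown values.

    Hypothetical helper; a plain dict stands in for the actual config object.
    """
    method = config.get("dedupe", {}).get("dedupe_method")
    if method not in VALID_DEDUPE_METHODS:
        raise ValueError(
            f"dedupe.dedupe_method must be one of {VALID_DEDUPE_METHODS}, got {method!r}"
        )
    return method
```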


Whattabatt

Run make style to pass the linter and style check. Also, please add tests in https://github.com/allenai/dolma/blob/main/tests/python/test_deduper.py

Whattabatt · 2024-10-15T20:55:47Z

python/dolma/cli/deduper.py

@@ -108,7 +127,7 @@ class DeduperConfig:
    dedupe: DedupeConfig = field(help="Deduplication configuration. Required.")
    bloom_filter: BloomFilterConfig = field(help="Bloom filter configuration. Required.")
    processes: int = field(
-        default=1, help="Number of processes to use for deduplication. If 1, no multiprocessing will be used."
+        default=0, help="Number of processes to use for deduplication. If 1, no multiprocessing will be used."


Why this change?

0 means maximum parallelism: processes becomes the number of available cores. I assumed we want this behavior almost all of the time.

This might not actually play nice with Beaker nodes and how CPUs get allocated there. I'll fall back on ai2-best-practices here.

Update the help string to reflect this, since it's non-obvious.
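
A minimal Python sketch of the "0 means all cores" behavior discussed in this thread, with a guard for the CPU-allocation concern. The helper name is hypothetical and this is not the code in the PR; it assumes that on Linux the scheduler affinity mask reflects the CPUs actually assigned to the process, which is often tighter than the machine's total core count on shared nodes.

```python
import os


def resolve_processes(requested: int) -> int:
    """Map the `processes` setting to a concrete worker count.

    0 is treated as "use every CPU available to this process" (assumption
    based on the discussion above); any positive value is used as-is.
    """
    if requested < 0:
        raise ValueError("processes must be >= 0")
    if requested > 0:
        return requested
    # On Linux, prefer the affinity mask over the raw core count.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```

A help string along the lines of "If 0, use all available cores; if 1, no multiprocessing will be used" would match this behavior.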


Matt Jordan added 7 commits September 19, 2024 16:17

Untested commit for many DCLM-dedup, plus many more changes

cb57a09

Fixed some CLI stuff -- paragraph seems to work, but still need to test dclm

69839ed

Pushing local changes to s3 to verify against bff --hopefully working

e0992c4

bugfixy for some edge cases

bc28fc9

Maybe some slightly better logging?

316283e

Updated docs/examples wth new dedupe method

da10949

Merge branch 'main' into mattj/bff-0924

0af2db6

Whattabatt reviewed Oct 15, 2024


Matt Jordan added 3 commits October 16, 2024 15:34

Tests failing, need to chack back with main

0903484

oops

e858388

Oops x2

67a4a9a


Whattabatt Oct 15, 2024

revbucket Oct 16, 2024

Whattabatt Oct 22, 2024
