apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207

alex-hh · 2024-10-08T15:44:53Z

I got to this by hacking around a bit but it seems to solve #7206

I have no idea if this approach makes sense or would break something else?

Could maybe work on a full pr if this looks reasonable @lhoestq ? I imagine the same issue might affect other iterable dataset methods?

lhoestq

Interesting ! Let's validate if we get the same speed on those 2 ?

Dataset.to_iterable_dataset -> numpy format
Dataset.to_iterable_dataset -> numpy format -> batched map identity (batched could be quite important)

lhoestq · 2024-10-08T15:48:29Z

src/datasets/iterable_dataset.py

+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
+            formatter = get_formatter(self.formatting.format_type)


oh this might me a good idea

lhoestq · 2024-10-08T15:52:27Z

src/datasets/iterable_dataset.py

@@ -926,7 +926,7 @@ def __init__(

    @property
    def iter_arrow(self):
-        if self.formatting and self.formatting.format_type == "arrow":
+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):


not sure about this change though, since it will do

Original data in Arrow -> Map function in NumPy -> Arrow (since it has iter_arrow and Arrow is preferred) -> Python objects

While it could be

Original data in Arrow -> Map function in NumPy -> Python objects -> Python objects

agreed, I wasn't really understanding the flow - have updated with a more direct attempt

alex-hh · 2024-10-08T21:37:29Z

I think the problem is that the underlying ex_iterable will not use iter_arrow unless the formatting type is arrow, which leads to conversion from arrow -> python -> numpy in this case rather than arrow -> numpy.

Idea of updated fix is to use the ex_iterable's iter_arrow in any case where it's available and any formatting is specified. The formatter then works directly on arrow tables; the outputs of the formatter get passed to the function to be mapped.

With updated version:

import numpy as np
import time
from datasets import Dataset, Features, Array3D

features=Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None,10,10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,10,10), dtype=np.float32) for x in [2000,1000]*25] for i in range(2)}, features=features)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=True, batch_size=10)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Total time: < 0.01s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=False)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02 s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02s

alex-hh · 2024-10-08T22:32:28Z

also now working for filter with similar performance improvements:

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: [arr.shape[0]==2000 for arr in x["array0"]], batch_size=10, batched=True)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.01s vs 50s on main

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: x["array0"].shape[0]==2000, batched=False)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.04s vs 50s on main

HuggingFaceDocBuilderDev · 2024-10-09T15:21:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq · 2024-10-09T15:23:36Z

src/datasets/iterable_dataset.py

@@ -2211,6 +2250,7 @@ def map(
        remove_columns: Optional[Union[str, List[str]]] = None,
        features: Optional[Features] = None,
        fn_kwargs: Optional[dict] = None,
+        format_outputs: bool = True,


is this useful ?

was useful for optimising performance in my case -

idea is it allows the user to determine the formatting of the examples returned by map (within the mapped function) rather than via self.formatting, which determines the formatting of the inputs to the map function

if map returns lists / arbitrary types then re-applying the formatter to the outputs can be expensive

However, haven't thought about how to handle preserving a consistent self._formatting FormattingConfig for downstream dataset transformations...might require something better than this kwarg

You can still do ds = ds.map(...).with_format(None) and no formatting will be applied to the output

perfect - will remove the kwarg

lhoestq · 2024-10-09T15:28:28Z

(the distributed tests failing in the CI are unrelated)

alex-hh · 2024-10-09T15:37:04Z

There also appears to be a separate? issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)

I don't have a good minimal example atm

lhoestq · 2024-10-09T15:38:06Z

src/datasets/iterable_dataset.py

            )
+            batched_examples_iterator = formatted_arrow_examples_iterator(ex_iterable, formatter, batched=self.batched)


maybe name it input_iterator or something like that since it's not necessarily batched ?

src/datasets/iterable_dataset.py

lhoestq · 2024-10-09T15:52:15Z

issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)

Maybe related to this issue ?

ds = Dataset.from_dict({"a": range(10)}).to_iterable_dataset()
ds = ds.with_format("arrow").map(lambda x: x, features=Features({"a": Value("string")})).with_format(None)
print(list(ds))  # yields integers instead of strings

lhoestq · 2024-10-09T15:56:50Z

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

btw you can pass features= in get_formatter() to get a formatter that does the feature conversion at the same time as formatting

(edit:

except maybe the arrow formatter doesn't use features yet, we can fix it like this if it's really needed

class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]):
    def format_row(self, pa_table: pa.Table) -> pa.Table:
-         return self.simple_arrow_extractor().extract_row(pa_table)
+         pa_table = self.simple_arrow_extractor().extract_row(pa_table)
+.        return cast_table_to_features(pa_table, self.features) if self.features else pa_table

)

alex-hh · 2024-10-09T16:39:39Z

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

… arrow

lhoestq · 2024-10-09T17:14:52Z

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

Yes this is surely the way to go actually !

alex-hh · 2024-10-09T17:52:09Z

ok i've fixed the chaining issue with my last two commits.

Will see if I can refactor into a FormattedExampleIterable

The other issue you posted seems to be unrelated (maybe something to do with feature decoding?)

lhoestq · 2024-10-11T12:51:44Z

src/datasets/iterable_dataset.py

+        # if self.formatting is not None, we format, filter, then return arrow iterator
+        # if formatting is unset, we don't, as that would require filter fn to accept arrow format
+        if self.formatting and self.ex_iterable.iter_arrow:


I think it's ok to not return iter_arrow if the formatting is not "arrow", this way we decode the data only once in the pipeline (e.g. if it contains image/audio data)

alex-hh · 2024-10-12T16:43:18Z

Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

lhoestq · 2024-10-14T16:56:34Z

Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

For ArrayND they already implement to_pylist to decode arrow data and it can be updated to return a numpy array (see the ArrayExtensionArray class for more details)

lhoestq

I left a few more comments, this is so close to be ready !

lhoestq · 2024-10-14T17:13:42Z

src/datasets/iterable_dataset.py

            info=self._info.copy(),
            split=self._split,
-            formatting=FormattingConfig(format_type=type),
+            formatting=None,  # formatting is applied in ex_iterable


Originally we tried to keep formatting as an attribute of the IterableDataset so that people can change the formatting without having to create nested examples iterables that would apply a different formatting every time.

What is the rationale for removing it from here ?

Note that applying a formatting on data already in the right format is cheap as you also noted (but no need to do is_formatted indeed)

c.f. comment #7207 (comment)

my thinking was that this way the formatting is never `lost'; it only gets overridden if the user overrides it.

there could definitely be some nested cases that i didn't consider though, can you think of some?

I guess also that hopefully having an explicit FormattedExamplesIterable makes some of the concern about nesting at least handled in a more transparent way: I guess now formatting is now really the responsibilyt of the FormattedExamplesIterable, and only happens when it gets iterated over (unless formatting is never set in which case it is still currently handled in IterableDataset._iter_ . I guess that formatting could probably also be handled by FormattedExamplesIterable with a bit more refactoring (although would somewhat prefer not to add to this pr unless seems essential)?)

By nested I actually meant consecutive, I had in mind that this code would be a no-op:

format = ds.format ds = ds.with_format("np") ds = ds.with_format(**format)

(actually just noticed IterableDataset.format is not implemented yet, though it should be and work like Dataset.format, but anyway the point is successive with_format would override the previous ones since it's an operation that is done "on top" of the dataset rather than a processing function)

take the point on the consecutive formatting

For me it does kind of seem that a with_format().map() should not be assumed to preserve format state especially in the absence of any solution to disable output formatting; this is my main motivation I think.

I suppose you could imagine a solution where _formatting is preserved as an attribute by all transformations until map is applied, and only map removes it? Then the wrapping with FormattedExamplesIterable could happen in iter for non map transforms but in map directly?

Or map could just accept an output_formatting kwarg? Maybe with False to set self._formatting to None? Or with_format(False) could set self._formatting to None?

reverted this case so with_format.with_format can still no op

lhoestq · 2024-10-14T17:14:16Z

src/datasets/iterable_dataset.py

        )
        info = self.info.copy()
        info.features = features
        return IterableDataset(
            ex_iterable=ex_iterable,
            info=info,
            split=self._split,
-            formatting=self._formatting,
+            formatting=None,  # formatting is applied in mapped ex_iterable; no need to re-apply it in IterableDataset


you mean formatting is already applied to the input, but not to the output afaik ?

yes - i guess what i was thinking is that if a user calls

ds.with_format(format_type).map(function)

then they are responsible for choosing the format applied to the mapped function outputs:

if the mapped function does nothing, then the output formatting will still be be format_type.

it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).

e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter

it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).
e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter

You can keep in self._formatting because np.asarray(obj) is cheap on an object that is already a numpy array (zero copy). And if someone does this

ds.with_format(format_type).map(function).with_format(another_type)

then np.asarray would not be called on the output since we can just replace the IterableDataset._formatting with the new one (similarly to the logic in my comment #7207 (comment))

I think my question around this was that there is no way to just prevent any formatting happening on the outputs of map, because the user can't set _formatting to None (not what with_format(None) does).

So if map already takes care of formatting (including formatting to some object type not compatible with the standard formatting options provided by datasets) then keeping self._formatting risks introducing an expensive re-formatting step, even if user calls a new with_format.

Maybe with_format(None) should imply no formatting (contrary to the current logic which is python formatting) ? would that work ?

@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?

Realise I haven't been clear about this in my head so far, and know your comment above was somewhat related but wanted to be sure I understand!

No problem, this is not as easy as it seems though you've been doing great ! Anyway, to answer your questions:

is it right that formatting doesn't get re-applied to output of Dataset.map?

MappedExamplesIterable doesn't apply formatting, but the IterableDataset does apply formatting if the users did set one.

e.g. would it potentially be possible to completely remove format_dict in IterableDataset.iter in the case that ex_iterable.iter_arrow evaluates to False, since in this case extraction+formatting has presumably already occurred?

the output of map can be anything in MappedExamplesIterable, it's the IterableDataset's job to format it if the user asks for a specific format. If there is no format set by the user, we yield the example as is (the data can even contain arbitrary python objects). E.g. if a user wants to do some processing from numpy arrays and return arbitrary python objects they should do ds = ds.with_format("np").map(fn).with_format(None)

So no I don't think we can remove the format_dict in IterableDataset (or we would have to add FormattedExamplesIterable after each MappedExamplesIterable but this doesn't allow disabling formatting after a map)

@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?

the second with_format sets the format of the data yielded when iterating on the IterableDataset - it doens't change the input to map

so, just to be sure, the default behaviour of with_format().map() should be to apply the specified formatting to both the inputs and the outputs of map?

I know this goes back to previous discussions but asking because it looked like for the non-iterable arrow_dataset.Dataset.map formatting only applies to the inputs of the mapped function and not the outputs of the mapped function in which case I wondered if we could do the same here - but I wasn't sure I was understanding that correctly.

the output of Dataset.map is written as Arrow data on disk, but when the data is accessed it's formatted using the Dataset format type (map doesn't change the format of the dataset)

that's why a code likes this still returns numpy arrays:

>>> ds = with_format("np").map(...) >>> ds[0] # dict of numpy arrays

and it can be reset to python objects with

>>> ds = with_format("np").map(...).with_format("python") >>> ds[0] # dict of python objects

ok got it thanks!

src/datasets/iterable_dataset.py

alex-hh · 2024-10-25T00:19:23Z

@lhoestq im no longer sure my specific concern about with_format(None) was well-founded - I didn't appreciate that the python formatter tries to do nothing to python objects including numpy arrays, so the existing with_format(None) should I think do what I want. Do you think with_format(None) is ok as is after all? If so think this is hopefully ready for final review!

alex-hh · 2024-10-31T11:41:19Z

@lhoestq I've updated to make compatible with latest changes on main, and think the current with_format None behaviour is probably fine - please let me know if there's anything else I can do!

lhoestq · 2024-11-05T15:19:09Z

Hi Alex, I will be less available from today and for a week. I'll review your PR and play with it once I come back if you don't mind !

apply formatting after iter_arrow

a73bb02

lhoestq reviewed Oct 8, 2024

View reviewed changes

add support for formatting to map iteration

4a761a9

formatted iterator for filter

3b65d99

alex-hh marked this pull request as ready for review October 8, 2024 22:04

fix filtered formatting

d906b9f

alex-hh force-pushed the iterable-map-with-format branch from 7119cd2 to d906b9f Compare October 8, 2024 22:36

alex-hh changed the title ~~apply formatting after iter_arrow to speed up format -> map for iterable datasets~~ apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets Oct 8, 2024

option to disable formatting for outputs of map

421917d

alex-hh requested a review from lhoestq October 9, 2024 07:45

alex-hh mentioned this pull request Oct 9, 2024

Preserve features in iterable dataset.filter #7209

Merged

lhoestq reviewed Oct 9, 2024

View reviewed changes

src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved

remove format_outputs kwarg

e7b67c3

rename batched_examples_iterator -> inputs_iterator

a4f9700

support arbitrary input formatting in filtered examples iterable iter…

a465abd

… arrow

preserve formatting on filtered shuffle

1863f8c

alex-hh added 2 commits October 9, 2024 20:51

pass token_per_repo_id to python_feature_decoder in formatters

205e0d6

implement FormattedExamplesIterator

42dc44f

alex-hh added 10 commits October 10, 2024 19:35

only iter arrow on filter if formatting is set

76e09a1

add features property to support feature inference

ee45f7f

fix features property

b828575

dont re-encode featuers

f76701b

avoid re-encoding outputs of map

15a8cfe

map should not preserve formatting

884bba1

update comment

d979672

update map features property

190d062

return bool for mapped ex iterable is typed

85b7d4d

pass return features to mapped exampels iterable constructor

3129274

lhoestq reviewed Oct 11, 2024

View reviewed changes

don't iter arrow with formatted filter to avoid re formatting

45f55b4

avoid re-formatting data

5e31fe0

lhoestq reviewed Oct 14, 2024

View reviewed changes

alex-hh added 3 commits October 14, 2024 21:24

rename return features -> features

49a84fe

update refs to return_features

002f5b4

decode features in batched map

2479264

alex-hh mentioned this pull request Oct 14, 2024

fast array extraction #7227

Open

alex-hh added 2 commits October 15, 2024 22:08

preserve formatting in with_format

68bfa39

fix features (mapped ex iterable

38f78d2

alex-hh added 2 commits October 31, 2024 11:28

Merge branch 'main' into iterable-map-with-format

f59a8e6

update shard

ca2deb4

alex-hh added 2 commits November 2, 2024 09:05

remove formatted examples iterable from with_format

4efcf11

avoid reapplying features when chaining filter, map

f997f8c

preserve formatting in map

bd8bbd3

		if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
		formatter = get_formatter(self.formatting.format_type)

		)
		batched_examples_iterator = formatted_arrow_examples_iterator(ex_iterable, formatter, batched=self.batched)

apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207

Are you sure you want to change the base?

apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207

Conversation

alex-hh commented Oct 8, 2024 • edited Loading

lhoestq left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hh commented Oct 8, 2024 • edited Loading

alex-hh commented Oct 8, 2024 • edited Loading

HuggingFaceDocBuilderDev commented Oct 9, 2024

Choose a reason for hiding this comment

alex-hh Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

lhoestq Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lhoestq commented Oct 9, 2024

alex-hh commented Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lhoestq commented Oct 9, 2024

lhoestq commented Oct 9, 2024 • edited Loading

alex-hh commented Oct 9, 2024

lhoestq commented Oct 9, 2024

alex-hh commented Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

alex-hh commented Oct 12, 2024 • edited Loading

lhoestq commented Oct 14, 2024

lhoestq left a comment

Choose a reason for hiding this comment

lhoestq Oct 14, 2024 • edited Loading

Choose a reason for hiding this comment

alex-hh Oct 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hh Oct 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hh Oct 14, 2024 • edited Loading

Choose a reason for hiding this comment

lhoestq Oct 15, 2024 • edited Loading

Choose a reason for hiding this comment

alex-hh Oct 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hh Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hh commented Oct 25, 2024 • edited Loading

alex-hh commented Oct 31, 2024 • edited Loading

lhoestq commented Nov 5, 2024

alex-hh commented Oct 8, 2024 •

edited

Loading

lhoestq left a comment •

edited

Loading

alex-hh commented Oct 8, 2024 •

edited

Loading

alex-hh commented Oct 8, 2024 •

edited

Loading

alex-hh Oct 9, 2024 •

edited

Loading

lhoestq Oct 9, 2024 •

edited

Loading

alex-hh commented Oct 9, 2024 •

edited

Loading

lhoestq commented Oct 9, 2024 •

edited

Loading

alex-hh commented Oct 9, 2024 •

edited

Loading

alex-hh commented Oct 12, 2024 •

edited

Loading

lhoestq Oct 14, 2024 •

edited

Loading

alex-hh Oct 14, 2024 •

edited

Loading

alex-hh Oct 15, 2024 •

edited

Loading

alex-hh Oct 14, 2024 •

edited

Loading

lhoestq Oct 15, 2024 •

edited

Loading

alex-hh Oct 15, 2024 •

edited

Loading

alex-hh Oct 21, 2024 •

edited

Loading

alex-hh commented Oct 25, 2024 •

edited

Loading

alex-hh commented Oct 31, 2024 •

edited

Loading