Z-ordering tables doesn’t sort data within partitions (files) and consequently data skipping on the Parquet level, based on row group metadata, is inefficient.
Motivation
The goal is to increase read efficiency by leveraging multi-dimensional clustering (mdc) at the row group level. The absence of ordering within Parquet files is already noted as a drawback in the design details. A global sort is considered there, but deemed too slow. Sorting within partitions, on the other hand, is relatively fast because it does not introduce a shuffle. It can be optionally applied after the current repartitionByRange step. To the best of my knowledge, this approach has not been considered.
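To make the row group argument concrete, here is a minimal pure-Python sketch (not Delta or Parquet code). It models one file as a list of key values split into fixed-size row groups, records each group's min/max, and counts how many groups a point lookup would have to read with and without sorting the file. The row group size, the key range, and the helper names are illustrative assumptions.

```python
import random

def row_group_stats(values, group_size):
    """Split values into consecutive row groups and return (min, max) per group."""
    return [
        (min(values[i:i + group_size]), max(values[i:i + group_size]))
        for i in range(0, len(values), group_size)
    ]

def groups_to_read(stats, target):
    """Count row groups whose [min, max] range could contain target."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

random.seed(0)
# One "file": 10,000 keys in arbitrary order, as left by range repartitioning alone.
file_values = [random.randrange(1_000_000) for _ in range(10_000)]
target = file_values[1234]

unsorted_stats = row_group_stats(file_values, group_size=1_000)
sorted_stats = row_group_stats(sorted(file_values), group_size=1_000)

unsorted_reads = groups_to_read(unsorted_stats, target)
sorted_reads = groups_to_read(sorted_stats, target)
print(f"row groups read without sorting: {unsorted_reads} of {len(unsorted_stats)}")
print(f"row groups read after sorting:   {sorted_reads} of {len(sorted_stats)}")
```

Without sorting, nearly every row group spans the full key range of the file, so min/max statistics rule out almost nothing. After sorting, the ranges are disjoint (up to duplicate keys at a boundary) and a point lookup touches one or two groups.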
Further details
I originally discussed this problem in the Slack channel with @Kimahriman, who suggested I raise an issue here.
I've implemented the feature by adding a configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions, which defaults to false. When the property is enabled, the partitions are sorted on repartitionKeyColName after repartitionByRange.
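The step described above could look roughly like the following PySpark-style sketch. It is an illustration only, not the code from the actual change: the function name `layout_for_zorder` and its parameters are hypothetical, and the `sort_within_partitions` flag stands in for the proposed configuration property. Only `repartitionByRange` and `sortWithinPartitions` are existing DataFrame methods.

```python
def layout_for_zorder(df, key_col, num_partitions, sort_within_partitions=False):
    """Range-partition df on key_col and optionally sort each partition on it.

    df is expected to be a pyspark.sql.DataFrame, and key_col plays the role
    of repartitionKeyColName. The function and its parameters are hypothetical.
    """
    # Existing step: rows with nearby keys end up in the same partition (file).
    out = df.repartitionByRange(num_partitions, key_col)
    if sort_within_partitions:
        # Proposed step: order rows inside each partition. This is a local
        # sort per partition, so it adds no further shuffle.
        out = out.sortWithinPartitions(key_col)
    return out
```

With the flag left at its default, the function reduces to the current behaviour, which mirrors the opt-in nature of the property.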
I ran a comparison based on the Delta Lake Z Order blog post and notebook by @MrPowers. I don't have enough local disk space for the large data set (G1_1e9_1e2_0_0.csv), so I used a medium-sized one instead (G1_1e8_1e8_100_0.csv) and timed query_c on four table versions:
version 0: unoptimized
version 1: compacted
version 2: z-ordered on id1 and id2
version 3: z-ordered on id1 and id2, and sorted within partitions
The comparison was run on a 2021 MBP with 16 GB RAM. The results were:
version 0
[id052,id45689,1.0]
Time taken: 4524 ms
version 1
[id052,id45689,1.0]
Time taken: 3137 ms
version 2
[id052,id45689,1.0]
Time taken: 1280 ms
version 3
[id052,id45689,1.0]
Time taken: 112 ms
The id column values queried are different because the original combination did not exist in my data set.
Update: I ran the experiment on the larger data set (G1_1e9_1e9_100_0.csv) using cloud storage and the results are:
version 0
[id038,id8508161,4.0]
Time taken: 667717 ms
version 1
[id038,id8508161,4.0]
Time taken: 589716 ms
version 2
[id038,id8508161,4.0]
Time taken: 48994 ms
version 3
[id038,id8508161,4.0]
Time taken: 6386 ms
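For a quick read of these numbers, the short sketch below computes speedup ratios from the timings reported above. The figures are copied from the two result listings. The dictionary layout and the `speedup` helper are only illustrative, and each figure is a single run, so the ratios are indicative rather than a benchmark.

```python
# Timings in milliseconds, keyed by table version, copied from the results above.
local_ms = {0: 4524, 1: 3137, 2: 1280, 3: 112}
cloud_ms = {0: 667717, 1: 589716, 2: 48994, 3: 6386}

def speedup(timings, baseline, candidate):
    """Return how many times faster the candidate version is than the baseline."""
    return timings[baseline] / timings[candidate]

for name, timings in (("medium, local", local_ms), ("large, cloud", cloud_ms)):
    print(
        f"{name}: v3 vs v2 = {speedup(timings, 2, 3):.1f}x, "
        f"v3 vs v0 = {speedup(timings, 0, 3):.1f}x"
    )
```

Relative to plain Z-ordering (version 2), sorting within partitions (version 3) comes out at roughly 11x faster on the medium data set and roughly 7.7x faster on the larger one.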
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.
maltevelin changed the title from "[Feature Request] [Spark] Optionally sort within partitions when Z-ordering." to "[Feature Request] [Spark] Optionally sort within partitions when Z-ordering" on Dec 29, 2024
I have opened PR #4006.