Releases: delta-io/delta
Delta Lake 3.1.0
We are excited to announce the release of Delta Lake 3.1.0. This release includes several exciting new features.
Highlights
- Delta-Spark: Support for merge with deletion vectors to reduce the write overhead for merge operations. This feature improves merge performance severalfold.
- Delta-Spark: Support for optimizing min/max aggregation queries using the table metadata, which improves the performance of simple aggregation queries (e.g. SELECT min(x) FROM deltaTable) by up to 100x.
- Delta-Spark: Support for querying tables shared through Delta Sharing protocol.
- Kernel: Support for data skipping for given query predicates to reduce the number of files read during the table scan.
- Uniform: Enhanced Iceberg support for Delta tables, enabling MAP and LIST types, plus ease-of-use improvements for enabling UniForm on a Delta table.
- Delta-Flink: Flink write job startup time latency improvement using Kernel.
Details for each component are below.
Delta Spark
Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.1.0/index.html
- API documentation: https://docs.delta.io/3.1.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.1.0/
The key features of this release are:
- Support for merge with deletion vectors to reduce the write overhead for merge operations. This feature improves merge performance severalfold (see the sketch after this list). Refer to the documentation on deletion vectors for more information.
- Support for optimizing min/max aggregation queries using the table metadata, which improves the performance of simple aggregation queries (e.g. SELECT min(x) FROM deltaTable) by up to 100x.
- (Preview) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature; see the documentation and example for how to try it out.
- Support for DEFAULT value columns. Delta supports defining default expressions for columns on Delta tables. Delta will generate default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for any such column. See the documentation on how to enable and try out this feature.
- Support for Hive Metastore schema sync. Adds a mechanism for syncing the table schema to HMS. External tools can now directly consume the schema from HMS instead of accessing it from the Delta table directory. See the documentation on how to enable this feature.
- Auto compaction to address the small files problem during table writes. Auto compaction, which runs at the end of the write query, combines small files within partitions into larger files to reduce the metadata size and improve query performance. See the documentation for details on how to enable this feature.
- Optimized write is an optimization that repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file sizes, reducing the small file problem as data is written, and benefit subsequent reads on the table. See the documentation for details on how to enable this feature.
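To make the deletion-vector item above concrete, here is a minimal, hedged Scala sketch. It assumes an existing Delta-enabled SparkSession named spark and hypothetical tables events and updates; whether a given MERGE actually writes deletion vectors can also depend on configuration, so treat this as a starting point and consult the deletion vectors documentation.

// Minimal sketch; `spark` is assumed to be a Delta-enabled SparkSession and
// `events` / `updates` are hypothetical tables.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, value STRING) USING DELTA")
spark.sql("CREATE TABLE IF NOT EXISTS updates (id INT, value STRING) USING DELTA")

// Enable deletion vectors so DML can mark rows as removed instead of rewriting whole files.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

// With deletion vectors enabled, MERGE can avoid rewriting every touched file.
spark.sql("""
  MERGE INTO events t
  USING updates s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")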
Other notable changes include:
- Performance improvement by removing redundant jobs when performing DML operations with deletion vectors.
- The UPDATE command now writes deletion vectors by default when the table has deletion vectors enabled.
- Support for writing partition columns to data files.
- Support for phasing out the v2 checkpoint table feature.
- Fix an issue with case-sensitive column names in Merge.
- Make the VACUUM command Delta protocol-aware so that it only vacuums tables whose protocol it supports.
Delta Sharing Spark
- Documentation: https://docs.delta.io/3.1.0/delta-sharing.html
- Maven artifacts: delta-sharing-spark_2.12, delta-sharing-spark_2.13
This release of Delta adds a new module called delta-sharing-spark which enables reading Delta tables shared using the Delta Sharing protocol in Apache Spark™. It has been migrated from the https://github.com/delta-io/delta-sharing/tree/main/spark repository to https://github.com/delta-io/delta/tree/master/sharing. The last release of delta-sharing-spark from the previous location is 1.0.4; starting with this release, delta-sharing-spark is versioned together with Delta, i.e. 3.1.0.
Supported read types are: reading a snapshot of the table, incrementally reading the table using streaming, and reading the changes (Change Data Feed) between two versions of the table. A sketch of a streaming read follows the batch example below.
“Delta Format Sharing” is newly introduced in delta-sharing-spark 3.1; it supports reading shared Delta tables with advanced Delta features such as deletion vectors and column mapping.
Below is an example of reading a Delta table shared using the Delta Sharing protocol in a Spark environment. For more examples refer to the documentation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("...")
  .master("...")
  .config(
    "spark.sql.extensions",
    "io.delta.sql.DeltaSparkSessionExtension")
  .config(
    "spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"

// Batch query
spark.read
  .format("deltaSharing")
  .option("responseFormat", "delta")
  .load(tablePath)
  .show(10)
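The shared table can also be read incrementally. Below is a hedged sketch of a streaming read using the same spark session and tablePath as the batch example above; it assumes the provider has enabled streaming for the shared table, and the console sink is used purely for illustration.

// Streaming (incremental) read of the shared table; reuses `spark` and
// `tablePath` from the batch example above.
val shared = spark.readStream
  .format("deltaSharing")
  .option("responseFormat", "delta") // Delta Format Sharing
  .load(tablePath)

shared.writeStream
  .format("console") // illustration only; replace with a real sink
  .start()
  .awaitTermination()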
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.1.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13
Delta Universal Format (UniForm) allows you to read Delta tables from Iceberg and Hudi (coming soon) clients. Delta 3.1.0 provided the following improvements:
- Enhanced Iceberg support through IcebergCompatV2. IcebergCompatV2 adds support for LIST and MAP data types and improves compatibility with popular Iceberg reader clients.
- Easier retrieval of the Iceberg metadata file location via the familiar SQL syntax DESCRIBE EXTENDED TABLE.
- A new SQL command, REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)), to enable UniForm on existing Delta tables (see the sketch after this list). See the documentation for details.
- Delta file statistics conversion to Iceberg including max/min/rowCount/nullCount which enables efficient data skipping when the tables are read as Iceberg in queries containing predicates.
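To make the workflow above concrete, here is a hedged Scala sketch that upgrades a hypothetical existing table sales to UniForm with IcebergCompatV2 (using the REORG command from the notes) and creates a new UniForm table. The exact table property names are assumptions and should be verified against the UniForm documentation.

// Assumes a Delta-enabled SparkSession `spark`; `sales` and `sales_uniform` are hypothetical tables.

// Enable UniForm with IcebergCompatV2 on an existing Delta table (command from the notes above).
spark.sql("REORG TABLE sales APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2))")

// Create a new table with UniForm enabled from the start; MAP/LIST columns are
// supported under IcebergCompatV2. Property names are assumptions; verify in the docs.
spark.sql("""
  CREATE TABLE sales_uniform (id INT, tags MAP<STRING, STRING>, items ARRAY<STRING>) USING DELTA
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")

// The Iceberg metadata file location is surfaced via DESCRIBE EXTENDED.
spark.sql("DESCRIBE EXTENDED sales_uniform").show(truncate = false)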
Delta Kernel
- API documentation: https://docs.delta.io/3.1.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.
Delta Lake 3.0.0
We are excited to announce the final release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.
Highlights
Here are the most important aspects of 3.0.0:
Spark 3.5 Support
Unlike the initial preview release, Delta Spark is now built on top of Apache Spark™ 3.5. See the Delta Spark section below for more details.
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.0.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13
Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this release. UniForm takes advantage of the fact that all table storage formats, such as Delta, Iceberg, and Hudi, actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata and commits to Hive metastore, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:
CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg');
Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details, and the key implementations here and here.
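As a quick, hedged illustration of that behavior (assuming a Delta-enabled SparkSession named spark and the table T created above), each commit to the table also refreshes the Iceberg metadata, so an Iceberg client pointed at the Hive metastore sees the new rows.

// Assumes the table T from the CREATE TABLE statement above.
spark.sql("INSERT INTO T VALUES (1), (2), (3)") // the commit also updates the Iceberg metadata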
Delta Kernel
- API documentation: https://docs.delta.io/3.0.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.
You can use this library to do the following:
- Read data from Delta tables in a single thread in a single process.
- Read data from Delta tables using multiple threads in a single process.
- Build a complex connector for a distributed processing engine and read very large Delta tables.
- [soon!] Write to Delta tables from multiple threads / processes / distributed engines.
Reading a Delta table with Kernel APIs is as follows.
TableClient myTableClient = DefaultTableClient.create();           // define a client
Table myTable = Table.forPath(myTableClient, "/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient);    // define which version of table to scan
Predicate scanFilter = ...                                         // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient)             // specify the scan details
    .withFilters(scanFilter)
    .build();
Scan.readData(...)                                                  // returns the table data
Full example code can be found here.
For more information, refer to:
- User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default TableClient API Java documentation
This release of Delta contains the Kernel Table API and default TableClient API definitions and implementation which allow:
- Reading Delta tables with optional Deletion Vectors enabled or column mapping (name mode only) enabled.
- Partition pruning optimization to reduce the number of data files to read.
Welcome Delta Connectors to the Delta repository!
All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.
Delta Spark
Delta Spark 3.0.0 is built on top of Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.
- Documentation: https://docs.delta.io/3.0.0/index.html
- API documentation: https://docs.delta.io/3.0.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.0.0/
The key features of this release are:
- Support for Apache Spark 3.5
- Delta Universal Format - Write as Delta, read as Iceberg! See the highlighted section above.
- Up to 10x performance improvement of UPDATE using Deletion Vectors - Delta UPDATE operations now support writing Deletion Vectors. When enabled, the performance of UPDATEs will receive a significant boost.
- More than 2x performance improvement of DELETE using Deletion Vectors - This fix improves the file path canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).
- Up to 2x faster MERGE operation - MERGE now better leverages data skipping, the ability to use the insert-only code path in more cases, and an overall improved execution to achieve up to 2x better performance in various scenarios.
- Support streaming reads from column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. This includes streaming support for Change Data Feed. See the documentation here for more details.
- Support specifying the columns for which Delta will collect file-skipping statistics via the table property delta.dataSkippingStatsColumns. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (default to 32). Now, users can easily customize this (see the sketch after this list).
- Support zero-copy convert to Delta from Iceberg tables on Apache Spark 3.5 using CONVERT TO DELTA. This feature was excluded from the Delta Lake 2.4 release since Iceberg did not yet support Apache Spark 3.4 (or 3.5). This command generates a Delta table in the same location and does not rewrite any parquet files.
- Checkpoint V2 - Introduced a new Checkpoint V2 format in Delta Protocol Specification and implemented read/write support in Delta Spark. The new checkpoint v2 format provides more reliability over the existing v1 checkpoint format.
- Log Compactions - Introduced new log compaction files in Delta Protocol Specification which could be useful in reducing the frequency of Delta checkpoints. Added read support for log compaction files in Delta Spark.
- Safe casts enabled by default for UPDATE and MERGE operations - Delta UPDATE and MERGE operations now result in an error when values cannot be safely ca...
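As referenced in the delta.dataSkippingStatsColumns item above, here is a minimal, hedged Scala sketch; it assumes a Delta-enabled SparkSession named spark and a hypothetical table events with the listed columns.

// Collect file-skipping statistics only for the listed columns rather than the first 32 columns.
spark.sql("""
  ALTER TABLE events
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_time,country')
""")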
Delta Lake 2.4.0
We are excited to announce the release of Delta Lake 2.4.0 on Apache Spark 3.4. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.4.0/
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.4.0/
The key features in this release are as follows
- Support for Apache Spark 3.4.
- Support writing Deletion Vectors for the DELETE command. Previously, when deleting rows from a Delta table, any file with at least one matching row would be rewritten. With Deletion Vectors these expensive rewrites can be avoided. See What are deletion vectors? for more details.
- Support for all write operations on tables with Deletion Vectors enabled.
- Support PURGE to remove Deletion Vectors from the current version of a Delta table by rewriting any data files with deletion vectors. See the documentation for more details, and the sketch after this list.
- Support reading Change Data Feed for tables with Deletion Vectors enabled.
- Support REPLACE WHERE expressions in SQL to selectively overwrite data. Previously “replaceWhere” options were only supported in the DataFrameWriter APIs.
- Support WHEN NOT MATCHED BY SOURCE clauses in SQL for the Merge command.
- Support omitting generated columns from the column list for SQL INSERT INTO queries. Delta will automatically generate the values for any unspecified generated columns.
- Support the TimestampNTZ data type added in Spark 3.3. Using TimestampNTZ requires a Delta protocol upgrade; see the documentation for more information.
- Other notable changes
- Increased resiliency for S3 multi-cluster reads and writes.
- Allow changing the column type of a char or varchar column to a compatible type in the ALTER TABLE command. The new behavior is the same as in Apache Spark and allows upcasting from char or varchar to varchar or string.
- Block using overwriteSchema with dynamic partition overwrite. This can corrupt the table as not all the data may be removed, and the schema of the newly written partitions may not match the schema of the unchanged partitions.
- Return an empty DataFrame for Change Data Feed reads when there are no commits within the timestamp range provided. Previously an error would be thrown.
- Fix a bug in Change Data Feed reads for records created during the ambiguous hour when daylight savings occurs.
- Fix a bug where querying an external Delta table at the root of an S3 bucket would throw an error.
- Remove leaked internal Spark metadata from the Delta log to make any affected tables readable again.
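As referenced in the PURGE item above, here is a hedged end-to-end Scala sketch. It assumes a Delta-enabled SparkSession named spark and a hypothetical table events; the REORG ... APPLY (PURGE) spelling is how the documentation exposes the purge, so verify it against the linked docs.

// Enable deletion vectors so DELETE marks rows instead of rewriting files.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

// This DELETE writes deletion vectors for the matching rows.
spark.sql("DELETE FROM events WHERE id < 100")

// Rewrite the files that carry deletion vectors, physically removing the deleted rows.
spark.sql("REORG TABLE events APPLY (PURGE)")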
Note: the Delta Lake 2.4.0 release does not include the Iceberg to Delta converter because iceberg-spark-runtime does not support Spark 3.4 yet. The Iceberg to Delta converter is still supported when using Delta 2.3 with Spark 3.3.
Credits
Alkis Evlogimenos, Allison Portis, Andreas Chatzistergiou, Anton Okolnychyi, Bart Samwel, Bo Gao, Carl Fu, Chaoqin Li, Christos Stavrakakis, David Lewis, Desmond Cheong, Dhruv Shah, Eric Maynard, Fred Liu, Fredrik Klauss, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jintian Liang, Johan Lasperas, Lars Kroll, Lukas Rupprecht, Matthew Powers, Ming DAI, Ming Dai, Naga Raju Bhanoori, Paddy Xu, Prakhar Jain, Rahul Shivu Mahadev, Rui Wang, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wenchen Fan, Xi Liang, Yaohua Zhao, Yuming Wang
Delta Lake 2.3.0
We are excited to announce the release of Delta Lake 2.3.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.3.0/
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/2.3.0/
The key features in this release are as follows
- Zero-copy convert to Delta from Iceberg tables using CONVERT TO DELTA. This generates a Delta table in the same location and does not rewrite any parquet files. See the documentation for details.
- Support SHALLOW CLONE for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. SHALLOW CLONE creates a copy of the source table’s definition but refers to the source table’s data files.
- Support idempotent writes for DML operations. This feature adds idempotency to INSERT/DELETE/UPDATE/MERGE etc. operations using the SQL configurations spark.databricks.delta.write.txnAppId and spark.databricks.delta.write.txnVersion (see the sketch after this list).
- Support “when not matched by source” clauses for the Merge command to update or delete rows in the chosen table that don’t have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java DeltaTable APIs. SQL support will be added in Spark 3.4.
- Support CREATE TABLE LIKE to create empty Delta tables using the definition and metadata of an existing table or view.
- Support reading Change Data Feed (CDF) in SQL queries using the table_changes table-valued function.
- Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. See the documentation for more details.
- Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot. This is most impactful for tables with many commits. Set the Hadoop configuration delta.enableFastS3AListFrom to true to enable it.
- Record VACUUM operations in the transaction log. With this feature, VACUUM operations and their associated metrics (e.g. numDeletedFiles) will now show up in table history.
- Support reading Delta tables with deletion vectors.
- Other notable changes
- Support schema evolution in MERGE for UPDATE SET <assignments> and INSERT (...) VALUES (...) actions. Previously, schema evolution was only supported for UPDATE SET * and INSERT * actions.
- Add .show() support for COUNT(*) aggregate pushdown.
- Enforce idempotent writes for df.saveAsTable for overwrite and append mode.
- Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to only add active features and will facilitate connectivity as downstream Delta connectors can selectively implement feature support.
- Automatically generate partition filters for additional generation expressions.
- Support the trunc and date_trunc functions.
- Support for the date_format function with format yyyy-MM-dd.
- Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
- Fix replaceWhere with the DataFrame V2 overwrite API to correctly evaluate less than conditions.
- Fix dynamic partition overwrite for tables with more than one partition data type.
- Fix schema evolution for INSERT OVERWRITE with complex data types when the source schema is read incompatible.
- Fix Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
- Fix a bug in VACUUM where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
- Include the table name in the DataFrame returned by the deltaTable.details() Python/Scala/Java API.
- Improve the log message for VACUUM table_name DRY RUN.
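As referenced in the idempotent writes item above, a minimal, hedged Scala sketch (assuming a Delta-enabled SparkSession named spark and a hypothetical table events): setting the two configurations tags each DML commit with an application id and version, and replaying the same pair is skipped.

// Configuration names come from the release notes above; values are illustrative.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")

// If this statement is retried with the same (txnAppId, txnVersion) pair,
// the duplicate write is skipped rather than applied twice.
spark.sql("INSERT INTO events VALUES (1, 'a')")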
Credits
Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu
Delta Lake 2.0.2
We are excited to announce the release of Delta Lake 2.0.2 on Apache Spark 3.2. This release contains important bug fixes and a few high-demand usability improvements over 2.0.1 and it is recommended that users update to 2.0.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.0.2/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.0.2/
This release includes the following bug fixes and improvements:
- Record VACUUM operations in the transaction log. With this feature, VACUUM operations and their associated metrics (e.g. numDeletedFiles) will now show up in table history.
- Support idempotent writes for DML operations. This feature adds idempotency to INSERT/DELETE/UPDATE/MERGE etc. operations using the SQL configurations spark.databricks.delta.write.txnAppId and spark.databricks.delta.write.txnVersion.
- Support passing Hadoop configurations via the DeltaTable API:
  from delta.tables import DeltaTable
  hadoop_config = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "...",
    "fs.azure.account.oauth2.client.id": "...",
    "fs.azure.account.oauth2.client.secret": "...",
    "fs.azure.account.oauth2.client.endpoint": "..."
  }
  delta_table = DeltaTable.forPath(spark, <table-path>, hadoop_config)
- Minor convenience improvement to the DeltaTableBuilder:executeZOrderBy Java API which allows users to pass in varargs instead of a List.
- Fail fast on malformed delta log JSON entries. Previously, Delta queries could return inaccurate results whenever JSON commits in the _delta_log were malformed. For example, an add action with a missing } would be skipped. Now, queries will fail fast, preventing inaccurate results.
- Fix “Could not find active SparkSession” bug by passing in the SparkSession when resolving tables in the DeltaTableBuilder.
Credits:
Helge Brügner, Jiaheng Tang, Mitchell Riley, Ryan Johnson, Scott Sandre, Venki Korukanti, Jintao Shen, Yann Byron
Delta Lake 2.2.0
We are excited to announce the release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.2.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.2.0/
The key features in this release are as follows:
- LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT clauses by pushing down the LIMIT into Delta scan during query planning. Delta scan uses the LIMIT and the file-level row counts to reduce the number of files scanned, which helps the queries read far fewer files and could make LIMIT queries faster by 10-100x depending upon the table size.
- Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*) on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.
- Support for collecting file level statistics as part of the CONVERT TO DELTA command. These statistics potentially help speed up queries on the Delta table. By default the statistics are now collected as part of the CONVERT TO DELTA command. To disable statistics collection specify the NO STATISTICS clause in the command, for example: CONVERT TO DELTA table_name NO STATISTICS (see the sketch after this list).
- Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.
- Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
- Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Earlier we used to block Streaming+CDF if the Delta table has column mapping enabled even though it doesn’t contain any RENAME or DROP columns.
- Other notable changes
- Improve the monitoring of the Delta state construction queries (additional queries run as part of planning) by making them visible in the Spark UI.
- Support for multiple where() calls in the Optimize Scala/Python API.
- Support for passing Hadoop configurations via the DeltaTable API.
- Support partition column names starting with . or _ in the CONVERT TO DELTA command.
- Improvements to metrics in table history
- Fix a metric in MERGE command
- Source type metric for CONVERT TO DELTA
- Metrics for DELETE on partitions
- Additional vacuum stats
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Fix a bug in MERGE INTO when there are multiple UPDATE clauses and one of the UPDATEs is with a schema evolution.
- Fix a bug where sometimes the active SparkSession object is not found when using Delta APIs.
- Fix an issue where partition schema couldn’t be set during the initial commit.
- Catch exceptions when writing the last_checkpoint file fails.
- Fix an issue when restarting a streaming query with the AvailableNow trigger on a Delta table.
- Fix an issue with CDF and Streaming where the offset is not correctly updated when there are no data changes.
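As referenced in the CONVERT TO DELTA statistics item above, here is a hedged Scala sketch (assuming a Delta-enabled SparkSession named spark and hypothetical Parquet directory paths):

// Default behaviour: file-level statistics are collected during conversion.
spark.sql("CONVERT TO DELTA parquet.`/data/raw/events`")

// Skip statistics collection to make the conversion itself faster.
spark.sql("CONVERT TO DELTA parquet.`/data/raw/events_nostats` NO STATISTICS")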
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña ,Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
Delta Lake 2.2.0
We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/latest/index.html
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1102
- Python artifacts: https://test.pypi.org/project/delta-spark/2.2.0rc1/
The key features in this release are as follows:
- LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT clauses by pushing down the LIMIT into Delta scan during query planning. Delta scan uses the LIMIT and the file-level row counts to reduce the number of files scanned, which helps the queries read far fewer files and could make LIMIT queries faster by 10-100x depending upon the table size.
- Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*) on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.
- Support for collecting file level statistics as part of the CONVERT TO DELTA command. These statistics potentially help speed up queries on the Delta table. By default the statistics are now collected as part of the CONVERT TO DELTA command. To disable statistics collection specify the NO STATISTICS clause in the command, for example: CONVERT TO DELTA table_name NO STATISTICS
- Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.
- Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
- Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Earlier we used to block Streaming+CDF if the Delta table has column mapping enabled even though it doesn’t contain any RENAME or DROP columns.
- Other notable changes
- Improve the monitoring of the Delta state construction queries (additional queries run as part of planning) by making them visible in the Spark UI.
- Support for multiple where() calls in the Optimize Scala/Python API.
- Support for passing Hadoop configurations via the DeltaTable API.
- Support partition column names starting with . or _ in the CONVERT TO DELTA command.
- Improvements to metrics in table history
- Fix a metric in MERGE command
- Source type metric for CONVERT TO DELTA
- Metrics for DELETE on partitions
- Additional vacuum stats
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Fix a bug in MERGE INTO when there are multiple UPDATE clauses and one of the UPDATEs is with a schema evolution.
- Fix a bug where sometimes the active SparkSession object is not found when using Delta APIs.
- Fix an issue where partition schema couldn’t be set during the initial commit.
- Catch exceptions when writing the last_checkpoint file fails.
- Fix an issue when restarting a streaming query with the AvailableNow trigger on a Delta table.
- Fix an issue with CDF and Streaming where the offset is not correctly updated when there are no data changes.
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ to the command line arguments. For example:
spark-submit --packages io.delta:delta-core_2.12:2.2.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ examples/examples.py
- Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 2.2.0rc1 by just providing the --packages io.delta:delta-core_2.12:2.2.0rc1 argument.
- Maven project:
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1102/</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.2.0rc1</version>
</dependency>
- SBT project:
libraryDependencies += "io.delta" %% "delta-core" % "2.2.0rc1"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1102/
- Delta-spark:
pip install -i https://test.pypi.org/simple/ delta-spark==2.2.0rc1
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña ,Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
Delta Lake 2.0.1
We are excited to announce the release of Delta Lake 2.0.1 on Apache Spark 3.2. This release contains important bug fixes to 2.0.0 and it is recommended that users update to 2.0.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.0.1/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.0.1/
This release includes the following bug fixes and improvements:
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.
- Fix a duplicate CDF rows issue in some cases in MERGE operation.
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Improve performance of the DELETE command by optimizing the step to search touched files to trigger column pruning.
- Fix for NotSerializableException when running RESTORE command in Spark SQL with Hadoop2.
- Fix incorrect stats collection issue in data skipping stats tracker.
Credits
Adam Binford, Allison Portis, Chen Shuai, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
Delta Lake 2.1.1
We are excited to announce the release of Delta Lake 2.1.1 on Apache Spark 3.3. This release contains important bug fixes to 2.1.0 and it is recommended that users update to 2.1.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.1.1/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.1.1/
This release includes the following bug fixes and improvements:
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.
- Fix for incorrect MERGE behavior when the Delta statistics are disabled.
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Improve performance of the DELETE command by optimizing the step to search affected files to trigger column pruning.
- Fix for NotSerializableException when running RESTORE command in Spark SQL with Hadoop2.
Credits
Adam Binford, Allison Portis, Chen Shuai, Felipe Pessoto, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
Delta Lake 2.1.0
We are excited to announce the release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.1.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.1.0/
The key features in this release are as follows
- Support for Apache Spark 3.3.
- Support for [TIMESTAMP | VERSION] AS OF in SQL. With Spark 3.3, Delta now supports time travel in SQL to query older data easily. With this update, time travel is now available both in SQL and through the DataFrame API (see the sketch after this list).
- Support for Trigger.AvailableNow when streaming from a Delta table. Spark 3.3 introduces Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches. This is now supported when using Delta tables as a streaming source.
- Support for SHOW COLUMNS to return the list of columns in a table.
- Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. Retrieve detailed information about a Delta table using the DeltaTable API and in SQL.
- Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame, now they return a DataFrame with useful metrics about the operation performed.
- Optimize performance improvements
- Added a config to use repartition(1) instead of coalesce(1) in Optimize for better performance when compacting many small files.
- Improve Optimize performance by using a queue-based approach to parallelize the compaction jobs.
- Other notable changes
- Support for using variables in the VACUUM and OPTIMIZE SQL commands.
- Improvements for CONVERT TO DELTA with catalog tables.
- Autofill the partition schema from the catalog when it’s not provided.
- Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
- Support for Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. See the documentation for more details.
- Improve Update performance by enabling schema pruning in the first pass.
- Fix for DeltaTableBuilder to preserve table property case of non-delta properties when setting properties.
- Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
- Fix for consistent timestamps in a MERGE command.
- Fix for incorrect operation metrics for DataFrame writes with a replaceWhere option.
- Fix for a bug in Merge that sometimes caused empty files to be committed to the table.
- Change in log4j properties file format. Apache Spark upgraded the log4j version from 1.x to 2.x which has a different format for the log4j file. Refer to the Spark upgrade notes.
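As referenced in the time travel item above, a minimal, hedged Scala sketch of the new SQL syntax alongside the existing DataFrame option (assuming a Delta-enabled SparkSession named spark, a hypothetical table events, and a hypothetical path):

// SQL time travel, new with this release.
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2022-08-01'").show()

// DataFrame-based time travel, which was already available.
spark.read.format("delta").option("versionAsOf", 0).load("/data/events").show()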
Benchmark framework update
Improvements to the benchmark framework (initial version added in version 1.2.0) including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.
Credits
Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Carlos Peña, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf