RumbleDB 1.21.0 "Hawthorn blossom" beta

@ghislainfourny released this 16 May 13:09
53f4df0

NEW! The jar for Spark 3.5 was added and is available for download.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.21, as they are no longer officially supported by the Spark team. Spark 3.4 is newly supported.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.21.0-standalone.jar already contains Spark and can be run "out of the box" with java -jar rumbledb-1.21.0-standalone.jar on Java 8 or 11.
  • rumbledb-1.21.0-for-spark-3.X.jar (one jar each for Spark 3.2, 3.3, and 3.4) is smaller, does not contain Spark, and runs in an existing Spark environment of the matching version, either locally (you need to download and install Spark first) or on a cluster (for example EMR, with just a few clicks), with spark-submit rumbledb-1.21.0-for-spark-3.X.jar.
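
The choice between the jars can be sketched as a small shell helper. This is only an illustration: the function names rumble_jar and rumble_command are hypothetical, the script prints the launch command instead of running it, and Spark 3.5 is included on the assumption that its jar follows the same naming pattern as the others.

```shell
#!/bin/sh
# Sketch only: prints the launch command for a given target, does not run it.
RUMBLE_VERSION="1.21.0"

# Hypothetical helper: map "standalone" or a Spark minor version to a jar name.
rumble_jar() {
  case "$1" in
    standalone)
      echo "rumbledb-${RUMBLE_VERSION}-standalone.jar" ;;
    3.2|3.3|3.4|3.5)
      # 3.5 is assumed to follow the same naming pattern as 3.2 to 3.4.
      echo "rumbledb-${RUMBLE_VERSION}-for-spark-$1.jar" ;;
    *)
      echo "unsupported target: $1" >&2
      return 1 ;;
  esac
}

# Hypothetical helper: the standalone jar is launched with java -jar,
# the Spark-specific jars with spark-submit.
rumble_command() {
  jar=$(rumble_jar "$1") || return 1
  if [ "$1" = "standalone" ]; then
    echo "java -jar $jar"
  else
    echo "spark-submit $jar"
  fi
}

rumble_command standalone   # prints: java -jar rumbledb-1.21.0-standalone.jar
rumble_command 3.4          # prints: spark-submit rumbledb-1.21.0-for-spark-3.4.jar
```

Any further arguments (the query to run, for instance) are left out here; the Getting started instructions linked above cover them.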

Improvements

  • Range expressions with more than a million items are now parallelized automatically, with no need to call parallelize() any more.
  • Some simple map expressions on homogeneous input are now faster (native SQL behind the scenes).
  • General comparisons on equality are now considerably faster.
  • reverse() is now more efficient and faster on homogeneous sequences.
  • Fixed a bug in equijoins involving homogeneous sequences.
  • Added two functions, jn:cosh and jn:sinh.
  • General comparisons are automatically optimized to value comparisons when it is detected that the sequences have at most one item (can be deactivated with --optimize-general-comparison-to-value-comparison no).
  • Better static type detection
  • It is now possible to force a sequential execution (without Spark) with --parallel-execution no. This also works with queries containing calls to parallelize() (which will have no effect), json-doc(), and json-file() (which will simply stream-read from the disk). Other I/O functions (such as csv-file()) will still involve Spark for reading, but their results are materialized immediately for the rest of the execution.
  • It is now possible to deactivate Native Spark SQL execution (forcing a fallback to the use of UDFs by RumbleDB) with --native-execution no.
  • The new annotate expression (similar in syntax to the validate expression) allows annotating an item directly without checking it for validity.
  • More static types are detected
  • Non-recursive functions are now automatically inlined for faster execution. This can be deactivated with --function-inlining no (reverting to the behavior of previous versions).
  • TypeSwitch expressions now support DataFrame execution
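
The switches mentioned in this list (--parallel-execution no, --native-execution no, --function-inlining no) can be combined on one command line. The sketch below only assembles and prints such a command; the helper names disable_flag and build_command and the feature keywords are hypothetical, and the query arguments are deliberately omitted.

```shell
#!/bin/sh
# Dry run: prints the command line rather than executing it.
JAR="rumbledb-1.21.0-standalone.jar"

# Hypothetical helper: map a feature keyword to the flag from the notes above.
disable_flag() {
  case "$1" in
    parallel)   echo "--parallel-execution no" ;;
    native-sql) echo "--native-execution no" ;;
    inlining)   echo "--function-inlining no" ;;
    *)
      echo "unknown feature: $1" >&2
      return 1 ;;
  esac
}

# Hypothetical helper: build "java -jar <jar> <flags...>" with each
# requested feature switched off.
build_command() {
  cmd="java -jar $JAR"
  for feature in "$@"; do
    flag=$(disable_flag "$feature") || return 1
    cmd="$cmd $flag"
  done
  echo "$cmd"
}

# Sequential execution without native Spark SQL:
build_command parallel native-sql
# prints: java -jar rumbledb-1.21.0-standalone.jar --parallel-execution no --native-execution no
```

Switching the optimizations off one at a time like this can help narrow down whether a behavior change comes from parallel execution, native SQL, or function inlining.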

Bugfixes

  • Fixed a bug when reading longs from DataFrames.
  • Fixed an issue with projection pushdowns in join queries.
  • Fixed a few bugs with queries that navigate JSON in for clauses; they are compiled to native SQL whenever possible, but some chains were throwing errors (e.g., an array unboxing followed by an object lookup).
  • Fixed a bug in which calling count() on a grouping variable did not return 1 when native SQL execution was activated.
  • hexBinary and base64Binary values can now be used in order by clauses with parallel execution.