
Releases: RumbleDB/rumble

RumbleDB 1.14.0 "Acacia" beta

05 Jul 12:36
1dfacb8
Pre-release
  • Rumble now outputs error messages displaying the faulty line of code and pointing to the place of error.
  • Machine Learning estimators and models can now run at scale (in parallel) on very large amounts of data. This is automatically detected.
  • Many stability improvements in the Machine Learning library
  • Machine Learning Pipelines are now supported with stages given as function items
  • Static typing is now always done and used to optimize even more
  • Initial (experimental) support for user-defined types with the JSound Compact syntax. Types can be used everywhere builtin types can be used (instance of, treat as, type annotations for variables...).
  • New validate type expression to validate against user-defined types and (if the type is DF-compatible) to create object* instances as optimized dataframes.
  • Features must be assembled with the VectorAssembler transformer prior to being used with an estimator or transformer (for example, at the start of a pipeline). featuresCol and inputCol must specify the name (as a string) of the assembled feature vector field. This is now fully consistent with the Spark ML framework.

Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.14.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.14.0.jar package.

Rumble 1.12.0 "Ashoka Tree" beta

04 May 12:44
e0c216f
Pre-release
  • Fixed performance issue when a big for clause follows other small clauses
  • Fixed grouping and ordering of floats
  • Fixed a bug that prevented grouping with keys of incompatible types when hashcodes collided.
  • Experimental (and incomplete) support for XQuery 3.1 syntax (prefix queries with xquery version "3.1"; to activate)
  • project() calls are pushed down if the argument is structured (e.g., coming from parquet-file(), etc).
  • Performance improvements for round() and abs()
  • Variable references ($x) are resolved quicker
  • Support for general function types (including their signature) and type checking (including statically)
  • When iterating on schema-based data (Parquet, Avro, structured-json-file()...) in a FLWOR expression, some let, for, where, group-by and order-by clauses will be automatically faster if they only involve literals, variable references, object/array lookups, and value comparison (native mapping to Spark SQL)
  • Fixed several bugs in switch expressions
  • Switch expressions and conditional expressions can handle/forward structured data faster (underlying DataFrames)

Rumble 1.11.0 "Banyan Tree" beta

03 Mar 12:32
Pre-release
  • experimental support for static typing (--static-typing yes) following the W3C standard.
  • performance improvements in arithmetic, logic, and comparison
  • spaces are now supported in paths to json-file()
  • HTTP URLs are now supported by unparsed-text() and unparsed-text-lines()
  • yearMonthDuration, dayTimeDuration, hexBinary, and base64Binary values can now be compared for inequality in addition to equality
  • performance improvements for comparison
  • the effective boolean value is now correctly taken in quantified expressions
  • quantified expressions now work in parallel as well (they leverage the FLWOR iterators)
  • support for floats
  • sum(), avg() are now pushed down and work on large homogeneous as well as heterogeneous sequences
  • stability improvements and improved conformance for comparison, arithmetic, and casts
  • dayTimeDuration and yearMonthDuration can now be compared
  • all constructors are now available (semantics identical to cast as)
  • switch and index-of no longer throw an error for incompatible types, which now follows the standard
  • empty function bodies are now allowed (in which case it is considered to return the empty sequence)
  • variable names $null, $array, $object are now allowed
  • annotate() can now automatically cast whenever it makes sense, and is thus more flexible
  • the Item hierarchy is now flat, with a public Item interface available in the Rumble Java API, and individual classes providing the implementation, which should lead to a small performance boost with lighter method calls.
  • fixed an issue (null pointer exception) when an ordering key is always the empty sequence
  • constant predicate lookups with small numbers (<= materialization cap) are pushed down, e.g., json-file("...")[1]
  • general support at the parser level of any type QName. prefixes like xs: and js: are now accepted but remain optional (e.g., xs:integer, js:null).
  • an error is appropriately thrown if an order by expression evaluates to more than one item or to a non-atomic item
  • builtin functions can now be called with fn:, jn: and math: prefixes as well (depending on their namespace). It is still, however, possible to refer to them without prefix, i.e., this is backward compatible.

The main jar is for Spark 3, but there is another jar for Spark 2.

Rumble 1.10.0 "Buttonwood" beta

04 Jan 11:16
Pre-release
  • Fixed navigation issue with structured datasets when objects are nested in arrays.
  • Fixed a bug that prevented calling a user-defined function repeatedly in a FLWOR expression in some cases
  • Verbose messages are now printed to stderr rather than stdout, so that the query output can be piped cleanly in bash
  • Bugfixes in unary expressions (an error is now thrown for more than one item, and multiple unary signs, which the spec allows, are handled correctly)
  • Big integers can now be cast from strings
  • string() now returns serialized numbers consistent with JSON output
  • typeswitch now correctly matches the empty sequence type
  • improved stability for user-defined function calls consuming a dataframe parameter. Seamless materialization for ? and 1 arities.
  • max() and min() are now pushed down to Spark and work on big sequences
  • +INF and -INF (doubles) are now serialized to strings correctly
  • Fixed the division by 0 on doubles, to correctly produce +INF and -INF, and mod by 0 to produce NaN. idiv raises an error as per the spec.
  • It is now possible to build INF, -INF, and NaN doubles by casting from a string literal.
  • Fixed bug in the object lookup expression leading to a crash when the field to lookup depends on a variable, and the sequence of objects being looked up is partitioned on Spark. Same fix for array lookup expressions.
  • Fixed a crash happening in a FLWOR expression in a group-by clause executed in parallel, when none of the variables before and including this group clause is used anywhere in the remainder of the FLWOR expression.
  • Performance improvements in the processing of items.
  • Performance improvement for distinct-values call on heterogeneous sequences.
  • support for W3C-standard functions unparsed-text, unparsed-text-lines (in parallel) and parse-json (all with arity 1 for now)
  • Fixed a bug occasionally happening with JsonIter streaming by switching to another JSON parser (gson).
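
The division-by-zero behavior listed above can be illustrated with a small sketch. This is not Rumble's implementation; it is a Python approximation of the semantics described in the release note (div on doubles yields +INF or -INF, mod by zero yields NaN, idiv by zero raises an error). The function names are hypothetical.

```python
import math


def double_div(a: float, b: float) -> float:
    """Divide two doubles; division by zero yields +INF, -INF, or NaN."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        # The sign of the result combines the signs of both operands,
        # so that a negative zero divisor flips the sign as well.
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
    return a / b


def double_mod(a: float, b: float) -> float:
    """Remainder on doubles; a zero divisor or infinite dividend yields NaN."""
    if b == 0.0 or math.isinf(a):
        return math.nan
    return math.fmod(a, b)


def double_idiv(a: float, b: float) -> int:
    """Integer division truncating toward zero; a zero divisor is an error."""
    if b == 0.0:
        raise ZeroDivisionError("integer division by zero")
    return math.trunc(a / b)
```

In this sketch, only idiv signals an error, because an integer result has no way to represent infinity or NaN.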

Rumble 1.9.1 "Ficus Bonsai" beta

18 Nov 11:00
Pre-release

Interim release with the following fixes and improvements:

  • There is a new CLI parameter, --deactivate-jsoniter-streaming, that can be set to yes if an error occurs with the JsonIter dependency, the library we use to parse JSON (the error in question being "com.jsoniter.spi.JsonException: javassist.CannotCompileException: by java.lang.ClassFormatError: class com.jsoniter.IterImpl cannot access its superclass com.jsoniter.IterImplForStreaming"). This flag deactivates streaming (i.e., avoids dynamic code generation by JsonIter) and avoids the error. This is a known issue with the Rumble docker, but it never happened on our own machines. We are actively investigating why the Rumble docker has this issue. Note that deactivating JsonIter streaming makes json-doc() unavailable after using json-file() in the same Rumble application, which is why JsonIter streaming is activated by default.

  • The public Rumble API (also accessible via the Rumble Maven dependency) now allows passing any list of items as an external variable. You can thus gather the results of a query as a list of items, and feed it back as the input of another query, with Java as the host language.

Rumble 1.9.0 "Ficus Bonsai" beta

28 Oct 15:41
Pre-release
  • Left-outer equi-joins with let clauses: if you have two large tabular datasets, Rumble can nest one into the other with just a few lines of code, and fast.
  • Inner equi-joins and generic joins with where clauses are detected.
  • Renamed --result-size to --materialization-size to avoid confusion, and added more hints about --output-path for getting the complete output from a parallel query.
  • New CLI options --output-format and --output-format-option:* for writing structured output to formats other than JSON (Parquet, CSV...).
  • New CLI option --number-of-output-partitions to repartition the output as desired
  • New function local-text-file() to read a file as a sequence of string items, but without Spark parallelism (streaming instead). This makes Rumble faster for smaller files
  • Performance improvements for FLWOR queries on structured data (Avro, Parquet, structured JSON, CSV...)
  • Performance improvement for when parallelism is not used at all
  • Stability improvement for json-doc(), which will now also work after json-file() has been used.
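
The idea of nesting one dataset into the other with a left-outer equi-join can be sketched as follows. This is a plain Python illustration of the result shape only, not of how Rumble executes the join on Spark; the function name, the parameter names, and the sample field names are all hypothetical.

```python
from collections import defaultdict


def nest_join(left, right, left_key, right_key, nested_field):
    """Attach to each left record the list of right records with an equal key.

    Left records without any match are kept, with an empty list,
    which is what makes the join a left-outer one.
    """
    # Index the right-hand side by its join key in a single pass.
    index = defaultdict(list)
    for record in right:
        index[record[right_key]].append(record)

    # Copy each left record and add the matching records as a nested field.
    return [
        {**record, nested_field: index.get(record[left_key], [])}
        for record in left
    ]


# Hypothetical sample data: orders and their line items.
orders = [{"id": 1, "customer": "a"}, {"id": 2, "customer": "b"}]
items = [
    {"order": 1, "sku": "x"},
    {"order": 1, "sku": "y"},
    {"order": 3, "sku": "z"},
]
nested = nest_join(orders, items, "id", "order", "items")
```

Here the first order receives both of its items, while the second order is kept with an empty list of items.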

Rumble 1.8.1 "Scots Pine" beta

21 Sep 08:55
Pre-release

Interim release with small fixes

  • Improve performance of joins whenever possible (quadratic -> linear)
  • fixed a bug with non-exact averages with avg()
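
The difference between a quadratic and a linear join can be sketched in a few lines. The code below is a generic Python illustration of a nested-loop join versus a hash join on an equality key; it is an assumption about the general technique, not a description of Rumble's internals, and all names are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Compare every pair of records: cost grows with len(left) * len(right)."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Index one side by key, then probe: cost grows with the input and output sizes."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


# Hypothetical sample data sharing the field "k".
left_rows = [{"k": 1, "a": "p"}, {"k": 2, "a": "q"}, {"k": 4, "a": "r"}]
right_rows = [{"k": 2, "b": "u"}, {"k": 1, "b": "v"}, {"k": 2, "b": "w"}]
```

Both functions return the same pairs in the same order; the hash variant only applies when the joining criterion is an equality on a hashable key.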

Note that Rumble is in beta. Use at your own risk.

Rumble 1.8.0 "Scots Pine"

04 Sep 13:02
Pre-release

New features

  • Support for joining two large datasets; joins are detected automatically if the expression of a for clause is a predicate expression whose left-hand side can be evaluated independently of the preceding clauses. The right-hand side is the joining criterion. Left outer joins are also supported in parallel (allowing empty).
  • outer joins ("allowing empty" in a for clause) are now supported both locally and in parallel.
  • support for empty sequence order least/greatest prolog setter (for order by clauses)
  • positional variables in for clauses are now supported both locally and in parallel (except for large-scale joins).
  • arbitrarily large integer literals are now supported (an error was thrown before beyond 32 bits)
  • json-file() and json-doc() can both read over HTTP
  • you can store your JSONiq modules on the Web and import them with an HTTP URL
  • you can store your queries on the Web and execute them via the Rumble command line with their URL
  • an error with the appropriate code is now thrown if a collation is specified that is not supported (the W3C standard requires support for at least the Unicode codepoint collation, which Rumble recognizes and supports).
  • It is now possible to specify a hostname in the server mode (--host), and to filter for specific URI prefixes for security reasons (--allowed-uri-prefixes)

Bugfixes

  • big integers are now seamlessly supported: no more overflows, and arbitrarily large integer literals are accepted in JSONiq code
  • fixed display bugs in debug mode (--print-iterator-tree yes)
  • fixed an error with local group-by queries nested inside local FLWORs
  • fixed an error when counting items in a variable that was not a post-grouping variable, in parallelized FLWORs.
  • fixed a bug encountered when a local iteration followed by a parallel for clause produced, and unioned, several Spark jobs internally.

Important: The jar for Spark 3.0.0 does not have Laurelin (ROOT parser) support. We are waiting for a 3.0.0-compatible Laurelin release. If you need to query ROOT files, please use Spark 2.4.6.

Rumble 1.7.0 "Phoenix Atlantica"

07 Jul 12:03
06979d2
Pre-release

New milestone in our feature coverage with the following changes prioritized based on user requests.

New features

  • Rumble is available for Spark 2.4.x as well as for Spark 3.0.0 (pick the right jar). The version for Spark 3.0.0 cannot read ROOT files yet, as we are waiting for the corresponding Laurelin release.
  • library modules are now supported, in order to share and import functions and global variables. Like main modules, library modules can be stored on any file system including S3 or HDFS, which also enables sharing code within the institution (local HDFS system) or even worldwide (S3 or even HTTP).
  • support for the W3C-standard trace function, for outputting intermediate values to the log.
  • support for try-catch expressions to catch and handle dynamic errors
  • support (read-only) for HTTP scheme for reading query files, data, importing modules, etc.

Bugfixes

  • fixed a bug in position semantics in predicate expressions, so that it also works if the position is not a constant.
  • query files are now tested for EOF, and errors will now be thrown if there are extra characters after the complete JSONiq query.
  • it is now possible to define functions and variables in the local namespace, following the W3C standard
  • [BREAKING CHANGE] relative paths passed to input functions are now resolved correctly in a query if it is read from a file, i.e., according to the absolute query file location. In previous releases, relative paths were resolved against the working directory. If you pass paths via external variables on the command line and (rightfully) expect them to be resolved against the working directory, declare the external variable with an "as anyURI" type annotation so Rumble knows your intent.
  • improvements in error messages when reading from and writing to file systems. Path resolution was also consolidated to provide the same experience everywhere.
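
The effect of the breaking change in path resolution can be illustrated with standard URI resolution. The sketch below uses Python's urllib only to show how the same relative path lands in different places depending on the base; it is not Rumble code, and the locations and the helper name are hypothetical.

```python
from urllib.parse import urljoin


def resolve(base_uri: str, path: str) -> str:
    """Resolve a possibly relative path against a base URI."""
    return urljoin(base_uri, path)


# Hypothetical locations of the query file and of the working directory.
QUERY_FILE = "file:///home/alice/queries/q.jq"
WORKING_DIR = "file:///home/alice/work/"  # trailing slash marks a directory

# New behavior: relative to the location of the query file.
new_style = resolve(QUERY_FILE, "data/input.json")

# Previous behavior: relative to the working directory.
old_style = resolve(WORKING_DIR, "data/input.json")

# Absolute paths are unaffected by the choice of base.
absolute = resolve(QUERY_FILE, "/tmp/input.json")
```

With these assumed locations, the new behavior looks for the data next to the query file, under /home/alice/queries/, while the previous behavior looked under /home/alice/work/.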

Rumble 1.6.4 "Yucca"

02 Jun 15:07
Pre-release

Interim release with bugfixes.

  • Support for DivisionByZero error code (div, mod).
  • Fixed a bug that sometimes led the Rumble shell to keep throwing the same error for subsequent queries
  • More informative error message when a range expression is not supplied with integers
  • Fixed a bug that prevented conditional expressions from being executed in parallel
  • New functions normalize-unicode and encode-for-uri
  • Support for running typeswitch in parallel