How DataFusion could support other compute engines (libcudf, velox) #8498
👋 I'm wondering whether anyone has contemplated how compute functions other than arrow::compute could be used with DataFusion. I haven't studied their APIs in depth yet, but two libraries that might be considered compute kernels are libcudf and Velox.
I can think of two broad approaches to using these libraries from DataFusion:

- Granular: a new set of ExecutionPlans for each engine. Create a PhysicalPlanner impl that builds a tree of ExecutionPlans backed by the other engine. For example, there'd be 3 different… And if one uses "the libcudf PhysicalPlanner" (for example), you'd get a tree of the libcudf-based ExecutionPlans.
- Coarse: create a single-node…
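The two approaches above can be sketched in plain Rust. This is a minimal illustration of the dispatch pattern only, not DataFusion's actual API: the trait and struct names (`ExecutionPlan`, `CudfFilterExec`, `VeloxSubplanExec`) are hypothetical stand-ins, and the "engine" calls are CPU placeholders.

```rust
// Hypothetical sketch of the two integration styles, assuming a
// simplified stand-in for DataFusion's ExecutionPlan trait.
trait ExecutionPlan {
    // Stand-in for executing a plan node and producing record batches.
    fn execute(&self) -> Vec<i64>;
    fn name(&self) -> &'static str;
}

// Granular: one ExecutionPlan impl per (operator, engine) pair.
struct CudfFilterExec {
    input: Vec<i64>,
    threshold: i64,
}

impl ExecutionPlan for CudfFilterExec {
    fn execute(&self) -> Vec<i64> {
        // A real integration would call into libcudf here; this
        // placeholder filters on the CPU instead.
        self.input.iter().copied().filter(|v| *v > self.threshold).collect()
    }
    fn name(&self) -> &'static str {
        "CudfFilterExec"
    }
}

// Coarse: a single node that hands an entire plan subtree to the
// external engine rather than one operator at a time.
struct VeloxSubplanExec {
    input: Vec<i64>,
    threshold: i64,
}

impl ExecutionPlan for VeloxSubplanExec {
    fn execute(&self) -> Vec<i64> {
        // A real implementation would serialize the whole subtree and
        // run it inside the external engine end to end.
        self.input.iter().copied().filter(|v| *v > self.threshold).collect()
    }
    fn name(&self) -> &'static str {
        "VeloxSubplanExec"
    }
}

fn main() {
    // Both styles plug into the same trait, so the rest of the engine
    // can treat them uniformly.
    let plans: Vec<Box<dyn ExecutionPlan>> = vec![
        Box::new(CudfFilterExec { input: vec![1, 5, 9], threshold: 4 }),
        Box::new(VeloxSubplanExec { input: vec![1, 5, 9], threshold: 4 }),
    ];
    for plan in &plans {
        println!("{} -> {:?}", plan.name(), plan.execute());
    }
}
```

The point of the sketch is that either style ends up behind the same trait; the difference is only how much of the plan tree each node owns.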
Replies: 1 comment
I think in general, many people do this today using the various extension APIs that come with DataFusion (docs):
I personally think the idea of implementing
I agree these are two categories, and they probably make sense in different circumstances -- for example, keeping all the data on the GPU might argue for a more coarse-grained approach when using libcudf.

As for using Velox, it would be interesting insofar as its implementations of some particular operators are faster / more feature complete than the ones in DataFusion. I would personally be surprised if Velox's implementation was significantly faster, given it uses the same basic columnar architecture and techniques as DataFusion; however, a performance comparison would definitely be interesting. Since Velox can't be used standalone from what I understand (it needs a SQL frontend / planner / dataframe API), I don't know of any benchmarks comparing it to DataFusion.