How DataFusion could support other compute engines (libcudf, velox) #8498
👋 I'm wondering whether anyone has contemplated how compute functions other than arrow::compute could be used with DataFusion. I haven't studied their APIs in depth yet, but two libraries that might be considered compute kernels are libcudf and Velox.
I can think of two broad approaches to using these libraries from DataFusion:

- Granular: a new set of ExecutionPlans for each engine. Create a PhysicalPlanner impl that builds a tree of ExecutionPlans backed by the other engine. For example, there'd be 3 different… And if one uses "the libcudf PhysicalPlanner" (for example), you'd get a tree of the libcudf-based ExecutionPlans.
- Coarse: create a single-node…
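The two approaches above can be sketched in plain Rust. This is a minimal illustration of the dispatch pattern only, not DataFusion's actual API: the trait and struct names (`ExecutionPlan`, `CudfFilterExec`, `VeloxSubplanExec`) are hypothetical stand-ins, and the "engine" calls are CPU placeholders.

```rust
// Hypothetical sketch of the two integration styles, assuming a
// simplified stand-in for DataFusion's ExecutionPlan trait.
trait ExecutionPlan {
    // Stand-in for executing a plan node and producing record batches.
    fn execute(&self) -> Vec<i64>;
    fn name(&self) -> &'static str;
}

// Granular: one ExecutionPlan impl per (operator, engine) pair.
struct CudfFilterExec {
    input: Vec<i64>,
    threshold: i64,
}

impl ExecutionPlan for CudfFilterExec {
    fn execute(&self) -> Vec<i64> {
        // A real integration would call into libcudf here; this
        // placeholder filters on the CPU instead.
        self.input.iter().copied().filter(|v| *v > self.threshold).collect()
    }
    fn name(&self) -> &'static str {
        "CudfFilterExec"
    }
}

// Coarse: a single node that hands an entire plan subtree to the
// external engine rather than one operator at a time.
struct VeloxSubplanExec {
    input: Vec<i64>,
    threshold: i64,
}

impl ExecutionPlan for VeloxSubplanExec {
    fn execute(&self) -> Vec<i64> {
        // A real implementation would serialize the whole subtree and
        // run it inside the external engine end to end.
        self.input.iter().copied().filter(|v| *v > self.threshold).collect()
    }
    fn name(&self) -> &'static str {
        "VeloxSubplanExec"
    }
}

fn main() {
    // Both styles plug into the same trait, so the rest of the engine
    // can treat them uniformly.
    let plans: Vec<Box<dyn ExecutionPlan>> = vec![
        Box::new(CudfFilterExec { input: vec![1, 5, 9], threshold: 4 }),
        Box::new(VeloxSubplanExec { input: vec![1, 5, 9], threshold: 4 }),
    ];
    for plan in &plans {
        println!("{} -> {:?}", plan.name(), plan.execute());
    }
}
```

The point of the sketch is that either style ends up behind the same trait; the difference is only how much of the plan tree each node owns.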
Replies: 1 comment
I think in general, many people do this today using the various extension APIs that come with DataFusion (docs):
I personally think the idea of implementing
I agree these are two categories, and they probably make sense in different circumstances -- for example, keeping all the data on the GPU might argue for a more coarse-grained approach when using libcudf.

As for using Velox, it would be interesting insofar as its implementations of some particular operators are faster / more feature complete than the ones in DataFusion. I would personally be surprised if Velox's implementation was significantly faster, given it uses the same basic columnar architecture and techniques as DataFusion; however, a performance comparison would definitely be interesting. Since Velox can't be used standalone from what I understand (it needs a SQL frontend / planner / dataframe API), I don't know of any benchmarks comparing it to DataFusion.