December 2024 ASF Board Report #10157

Closed
alamb opened this issue Apr 20, 2024 · 5 comments

alamb commented Apr 20, 2024

Is your feature request related to a problem or challenge?

Per https://whimsy.apache.org/roster/committee/datafusion the DataFusion ASF board report schedule is

March, June, September, December

Describe the solution you'd like

I would like to draft a board report for the ASF board meeting, ideally with community help.

The meetings are typically in the second or third week of the month.

Describe alternatives you've considered

I plan to do this in the same style that worked well in Arrow (see an example from @andygrove here: https://lists.apache.org/thread/7w4mgy98qomc6drvj2fo81gvhq6p0boc): make a Google Doc (or issue) that people can add relevant content to, and then the chair (me for the time being) submits it to the board.

Additional context

No response

@alamb alamb added the enhancement New feature or request label Apr 20, 2024
@alamb alamb self-assigned this Apr 20, 2024
@alamb alamb closed this as completed Jun 18, 2024
@alamb alamb reopened this Jun 18, 2024

alamb commented Dec 3, 2024

Here is a draft report: https://docs.google.com/document/d/1b_C8uwMJVSrw9N1Oc8_fzFdpT0YExaRiuXJ8ulAXaYs/edit?tab=t.0

@andygrove is there any chance you can help with the Comet section?
@timsaucer perhaps you can write a few notes about the Python subproject?

milenkovicm commented

@alamb, @andygrove here is a quick draft summary for Ballista; feel free to modify as necessary:

As described in apache/datafusion-ballista#1066 and announced by @andygrove in
https://lists.apache.org/thread/bkbxx9rbo8dbfolybxw9v0z1638do725, the work has focused on three directions:

  1. a lighter codebase that is easier to maintain
  2. changing the focus from "Apache DataFusion Ballista Distributed Query Engine" to "Making Apache DataFusion Applications Distributed"
  3. making it easier to customize each Ballista component

40+ commits later, we have an API that can make DataFusion applications distributed with a single-line change:

use ballista::prelude::*;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a DataFusion SessionContext backed by a standalone Ballista cluster.
    // `standalone()` is async and fallible, so it is awaited here.
    let ctx = SessionContext::standalone().await?;

    // Register a CSV file as the table `example`.
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;

    // Run a query and print the result.
    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
    df.show().await?;
    Ok(())
}

Planning for the next release is ongoing in apache/datafusion-ballista#974.

Also, the benchmark results have been updated, showing a huge benefit from keeping up with the latest DataFusion.

[Image: query compare]

Short term focus would be:


alamb commented Dec 3, 2024

40+ commits later, we have an API that can make DataFusion applications distributed with a single-line change:

Wow, that is pretty amazing. Thank you @milenkovicm

I have added a link to your update in the report. Very cool.


alamb commented Dec 10, 2024

I made a ticket to track the next release and took a pass over the document today.

I plan to submit the doc tomorrow, Dec 11, per the plan.


alamb commented Dec 11, 2024

Here is the final report that I submitted. Thanks to @phillipleblanc @andygrove @timsaucer @milenkovicm for the help writing it 🙏

## Description:
The mission of Apache DataFusion is the creation and maintenance of software
related to an extensible query engine.

## Project Status:
Current project status: New + Ongoing (high activity)
Issues for the board: None

## Membership Data:
Apache DataFusion was founded 2024-04-16 (8 months ago)
There are currently 42 committers and 14 PMC members in this project.
The Committer-to-PMC ratio is 3:1.

Community changes, past quarter:
- No new PMC members. Last addition was Jay Zhan on 2024-08-11.
- Piotr Findeisen was added as committer on 2024-12-03
- Jax Liu was added as committer on 2024-10-18
- Ifeanyi Ubah was added as committer on 2024-11-04
- Ruiqiu Cao was added as committer on 2024-12-10
- Michael Ward was added as committer on 2024-09-13

## Project Activity:
### Overall

We have completed adopting [sqlparser crate] into the project and made our
first release as part of the Apache Software Foundation.

[sqlparser crate]: https://github.com/apache/datafusion-sqlparser-rs

### DataFusion core

https://github.com/apache/datafusion

We continue the monthly release cadence. The [42.0.0 release] and
[43.0.0 release] had 73 and 96 unique contributors, respectively. We continue
to [discuss the roadmap] in the open, and gathered a collection of
[DataFusion related articles] on our website.

We recently finished [significant performance improvements] as well as long
standing projects to migrate documentation to code and use the same API for
all user defined window functions. We also added FFI bindings to make it
easier to use multiple versions of DataFusion.
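
As a rough illustration of what "the same API for all user defined window functions" can mean, the sketch below puts every window function behind one trait and drives them with a single function. The names `WindowFn`, `MovingAvg`, `RowNumber`, and `evaluate_partition` are hypothetical and chosen for this example; they are not DataFusion's actual interfaces.

```rust
/// Hypothetical trait: one entry point shared by every window function.
trait WindowFn {
    /// Compute the output value for row `idx` of a single partition.
    fn evaluate(&self, partition: &[f64], idx: usize) -> f64;
}

/// Trailing moving average over the current row and up to
/// `width - 1` preceding rows (a width of 0 is treated as 1).
struct MovingAvg {
    width: usize,
}

impl WindowFn for MovingAvg {
    fn evaluate(&self, partition: &[f64], idx: usize) -> f64 {
        let width = self.width.max(1);
        let start = (idx + 1).saturating_sub(width);
        let frame = &partition[start..=idx];
        frame.iter().sum::<f64>() / frame.len() as f64
    }
}

/// 1-based row number within the partition.
struct RowNumber;

impl WindowFn for RowNumber {
    fn evaluate(&self, _partition: &[f64], idx: usize) -> f64 {
        (idx + 1) as f64
    }
}

/// Driver that works with any implementation of the trait.
fn evaluate_partition(f: &dyn WindowFn, partition: &[f64]) -> Vec<f64> {
    (0..partition.len()).map(|i| f.evaluate(partition, i)).collect()
}
```

Calling `evaluate_partition(&MovingAvg { width: 2 }, &[1.0, 3.0, 5.0, 7.0])` goes through the same code path as `evaluate_partition(&RowNumber, ...)`, so the caller does not need to know which kind of function it is running.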

As more people build systems using DataFusion we are beginning to focus more
on keeping the core more stable, as it is [sometimes painful] to update to new
DataFusion versions.

[42.0.0 release]: https://github.com/apache/datafusion/blob/main/dev/changelog/42.0.0.md
[43.0.0 release]: https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md
[roadmap ticket]: https://github.com/apache/datafusion/issues/11442
[discuss the roadmap]: https://github.com/apache/datafusion/issues/13274
[DataFusion related articles]: https://datafusion.apache.org/user-guide/concepts-readings-events.html
[significant performance improvements]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
[sometimes painful]: https://github.com/apache/datafusion/issues/13525

### Sub project: DataFusion Python

https://github.com/apache/datafusion-python

We continue the monthly release cadence. The [datafusion-python 41.0.0] and
[datafusion-python 42.0.0] releases had 5 and 6 unique contributors,
respectively. The release of version 43.0.0 is underway at the time of this
writing.

We recently added support for [user defined window functions], including
significant updates to the user documentation on how to author user-defined
functions. Additionally, we released a [blog post on UDFs] demonstrating how
users can incorporate custom UDFs that can lead to 10x speed improvements by
writing Rust-backed Python functions.

We added support for foreign table providers via the FFI bindings in the core
project. This enables external parties to provide Python interfaced table
providers that support features such as push down filtering, including across
different versions of DataFusion.
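
To illustrate why FFI bindings help across versions, here is a minimal sketch that assumes the provider is exposed as a `#[repr(C)]` struct of function pointers plus an opaque data pointer, so the consumer depends only on the C ABI rather than on Rust type layouts from one particular build. `FfiTableProvider`, `InMemoryTable`, `export_table`, and `describe` are hypothetical names for this example and are not the definitions used by the project.

```rust
use std::ffi::c_void;

/// Hypothetical C-ABI view of a table provider.
#[repr(C)]
struct FfiTableProvider {
    /// Opaque pointer to the producer's own state.
    private_data: *mut c_void,
    /// Number of rows the provider can return.
    row_count: extern "C" fn(private_data: *const c_void) -> u64,
    /// Whether a filter on column `column_index` can be pushed down.
    supports_filter_pushdown:
        extern "C" fn(private_data: *const c_void, column_index: u32) -> bool,
    /// Frees `private_data`; the consumer calls it exactly once.
    release: extern "C" fn(private_data: *mut c_void),
}

/// Producer-side state, invisible to the consumer.
struct InMemoryTable {
    rows: Vec<(i64, i64)>,
    indexed_column: u32,
}

extern "C" fn row_count(data: *const c_void) -> u64 {
    let table = unsafe { &*(data as *const InMemoryTable) };
    table.rows.len() as u64
}

extern "C" fn supports_filter_pushdown(data: *const c_void, column_index: u32) -> bool {
    let table = unsafe { &*(data as *const InMemoryTable) };
    column_index == table.indexed_column
}

extern "C" fn release(data: *mut c_void) {
    // Reclaim ownership so the boxed table is dropped.
    unsafe { drop(Box::from_raw(data as *mut InMemoryTable)) };
}

/// Producer side: hand the table over as a C-compatible struct.
fn export_table(table: InMemoryTable) -> FfiTableProvider {
    FfiTableProvider {
        private_data: Box::into_raw(Box::new(table)) as *mut c_void,
        row_count,
        supports_filter_pushdown,
        release,
    }
}

/// Consumer side: uses only the function pointers, never the Rust type.
fn describe(provider: &FfiTableProvider) -> (u64, bool) {
    let rows = (provider.row_count)(provider.private_data);
    let pushdown = (provider.supports_filter_pushdown)(provider.private_data, 0);
    (rows, pushdown)
}
```

A consumer would call `export_table(...)` on the producer side, pass the struct across the boundary, query it with `describe`, and finally call `(provider.release)(provider.private_data)` once to free the producer's state.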

[datafusion-python 41.0.0]: https://github.com/apache/datafusion-python/pull/866
[datafusion-python 42.0.0]: https://github.com/apache/datafusion-python/pull/901
[blog post on UDFs]: https://datafusion.apache.org/blog/2024/11/19/datafusion-python-udf-comparisons/


### Sub project: DataFusion Comet

https://github.com/apache/datafusion-comet

The Comet project recently released version 0.4.0 with a focus on performance
and stability. See the [Blog post].

[Blog post]: https://datafusion.apache.org/blog/2024/11/20/datafusion-comet-0.4.0/

Much of the current development focus is on improving complex type support,
particularly the ability to read complex types from Parquet and Iceberg
sources.

### Sub project: DataFusion Ballista

https://github.com/apache/datafusion-ballista

Since the last board report, the Ballista subproject has become much more
active and added new active maintainers.

The focus has changed from "Apache DataFusion Ballista Distributed Query
Engine" to "Making Apache DataFusion Applications Distributed"

The community has simplified the project by removing unfinished features and
refocusing as a way to scale out existing DataFusion applications by providing
a tighter integration with the core DataFusion project.

See more [details here].

[details here]: https://github.com/apache/datafusion/issues/10157#issuecomment-2514694231

### Sub project: Sqlparser

https://github.com/apache/datafusion-sqlparser-rs

The sqlparser project became part of the DataFusion project this quarter.

In addition to ongoing additions to SQL dialect support, we made our first
release as part of the Apache DataFusion project, and have started introducing
spans (source locations), a long requested feature.
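
For readers unfamiliar with spans, the sketch below shows the general idea under simplifying assumptions: a whitespace-only tokenizer that records the line and column where each token starts and ends, so a later error message can point back into the SQL text. The `Location` and `Span` types and `tokenize_with_spans` are defined here purely for illustration and should not be read as the crate's own definitions.

```rust
/// 1-based line and column of a character in the SQL text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Location {
    line: u32,
    column: u32,
}

/// Range covered by a token: `start` is its first character, `end` its last.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: Location,
    end: Location,
}

/// Split on whitespace, recording where each token sits in the source.
/// Columns count characters, not bytes.
fn tokenize_with_spans(sql: &str) -> Vec<(String, Span)> {
    let mut tokens = Vec::new();
    let mut current: Option<(String, Location, Location)> = None;
    let (mut line, mut column) = (1u32, 1u32);

    for ch in sql.chars() {
        let here = Location { line, column };
        if ch.is_whitespace() {
            // Whitespace ends the token being built, if any.
            if let Some((text, start, end)) = current.take() {
                tokens.push((text, Span { start, end }));
            }
        } else if let Some((text, _, end)) = current.as_mut() {
            // Extend the current token and move its end location.
            text.push(ch);
            *end = here;
        } else {
            // First character of a new token.
            current = Some((ch.to_string(), here, here));
        }
        // Advance the cursor; a newline starts a new line at column 1.
        if ch == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    // Flush a token that runs to the end of the input.
    if let Some((text, start, end)) = current {
        tokens.push((text, Span { start, end }));
    }
    tokens
}
```

With the input `"SELECT a\nFROM t"`, the `FROM` token is reported on line 2, columns 1 through 4, which is the kind of information a parser can attach to AST nodes and surface in diagnostics.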


## Community Health:
It is still hard to keep track of everything going on, which is a good thing.
While it is always a struggle to get enough code review capacity, we have many
active committers, and the community in general helps each other out with
reviews. We continue to actively grow our committer and PMC ranks.

We have upcoming in person meetups scheduled for Chicago, Boston, and
Amsterdam.

@alamb alamb closed this as completed Dec 11, 2024