feat(influx_tools): Add export to parquet files #25297
base: master-1.x
I did a quick review, but I'm not familiar with arrow and certainly missed some things. I can do a more thorough review if we paired to walk through the algorithm once.
@davidby-influx thanks for the thorough review! I tried to address all issues and commented on the three unresolved ones. I will schedule a meeting to walk through the code. Thanks again!
LGTM
The same query returns a different number of rows in the exported Parquet "db":
Tested with a database simulating one month of monitoring data from a small data center (9 measurements such as cpu and disk, 10 tags). Database size on disk: 4.1 GB across 5 shards. Exported Parquet size on disk: 11 GB. The export took 1h6m on a somewhat obsolete laptop (8-core Core i7 CPU, 16 GB RAM, SSD). Memory usage during the export was stable (RSS peak ~2 GB). InfluxDB measurement structure example:
Parquet:
Measurement without tags:
I will repeat the test to verify whether the row counts match.
My apologies, it was a mistake on my side. The row counts match. InfluxDB:
Parquet:
It's GTG by me 👍
To run the exporter in this PR, do the following (assuming you are using a Bash-compatible shell):
# git clone https://github.com/influxdata/influxdb.git
# cd influxdb/
# git fetch origin pull/25297/head:v1-bulk-exporter-parquet
# git checkout v1-bulk-exporter-parquet
# export PKG_CONFIG=${PWD}/pkg-config.sh
# go build ./...
# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet --help
# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config <path to influxdb config dir>/influxdb.conf -database <database to export>
Do we have a compiled version to test with, or do I still need to clone the repo and build the Go binary?
I converted all of the Bash commands into a Python script and ran it. It generates an error during the build.
Here is the Python script in Zip format for GitHub.
@dburton-influxdata using
I'll be taking this over from Darren. I do have a couple of questions regarding this tool:
I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.
Thanks for your investigations @jwei-influx! Let me answer your questions:
Yes. There are the
This is beyond the scope of this tool. The tool outputs the data as-is (plus potential type and name changes as discussed above), without the ability to split, merge, or modify data.
As I mentioned above, you can modify field names (and types) using the
This is a question you should ask the authors of the import tool that takes the Parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer it...
Happy to answer them. ;-)
I haven't looked closely at this PR, and I'm no longer managing the team that is working on Parquet import. @helenosheaa might direct the right person to answer thoughtfully and accurately.
Closes #
Supersedes #25253
This PR adds a command to export data into per-shard Parquet files. To do so, the command iterates over the shards, creates a cumulative schema over the series of a measurement (i.e. a superset of all tags and fields), and exports the data to one Parquet file per measurement and shard.
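The idea of a cumulative schema can be illustrated with a minimal Go sketch. This is not the code from this PR: the types `Series`, `Schema` and `FieldType` and the function `MergeSchema` are hypothetical names chosen for illustration, and recording type conflicts while keeping the first type seen is an assumption, not necessarily what the exporter does.

```go
package main

import "sort"

// FieldType is a simplified stand-in for a field's data type (hypothetical).
type FieldType string

// Series lists the tag keys and typed fields of one series of a measurement.
type Series struct {
	Tags   []string
	Fields map[string]FieldType
}

// Schema is the cumulative schema of a measurement: the superset of all
// tag keys and fields seen across its series.
type Schema struct {
	Tags      []string             // sorted union of tag keys
	Fields    map[string]FieldType // union of fields; first type seen wins
	Conflicts []string             // sorted names of fields seen with more than one type
}

// sortedKeys returns the keys of m in ascending order.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// MergeSchema builds the cumulative schema over all given series.
// Assumption: on a type conflict the first type is kept and the field
// name is recorded in Conflicts.
func MergeSchema(series []Series) Schema {
	tagSet := make(map[string]struct{})
	fields := make(map[string]FieldType)
	conflicts := make(map[string]struct{})

	for _, s := range series {
		for _, t := range s.Tags {
			tagSet[t] = struct{}{}
		}
		for name, typ := range s.Fields {
			if prev, ok := fields[name]; ok && prev != typ {
				conflicts[name] = struct{}{}
				continue
			}
			fields[name] = typ
		}
	}

	return Schema{
		Tags:      sortedKeys(tagSet),
		Fields:    fields,
		Conflicts: sortedKeys(conflicts),
	}
}
```

For example, merging one series with tag `host` and float field `usage` with another series that has tags `host` and `region` and fields `usage` (float) and `count` (integer) yields the tags `host`, `region` and the fields `count`, `usage`. Rows from the first series would then leave the `region` and `count` columns empty in the exported file.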
To test the tool run