feat(influx_tools): Add export to parquet files #25297

Open
wants to merge 14 commits into
base: master-1.x

srebhan
Member

@srebhan srebhan commented Sep 9, 2024

Closes #
Supersedes #25253

  • I've read the contributing section of the project README.
  • Signed CLA (if not already signed).

This PR adds a command to export data into per-shard parquet files. To do so, the command iterates over the shards, creates a cumulative schema over the series of a measurement (i.e. a super-set of tags and fields) and exports the data to a parquet file per measurement and shard.
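As an illustration of the "cumulative schema" idea described above, here is a minimal sketch, not the PR's actual implementation. The `Series` and `Schema` types and the `CumulativeSchema` function are hypothetical names introduced only for this example, and the "first seen type wins" rule is an assumption made to keep the sketch short.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is a hypothetical stand-in for one series of a measurement:
// its tag keys plus its field keys mapped to a type name.
type Series struct {
	Tags   []string
	Fields map[string]string
}

// Schema is the cumulative column set for one measurement in one shard.
type Schema struct {
	Tags   []string          // sorted union of all tag keys
	Fields map[string]string // union of all field keys (assumption: first seen type wins)
}

// CumulativeSchema builds the super-set of tags and fields over all series.
func CumulativeSchema(series []Series) Schema {
	tagSet := make(map[string]struct{})
	fields := make(map[string]string)
	for _, s := range series {
		for _, t := range s.Tags {
			tagSet[t] = struct{}{}
		}
		for name, typ := range s.Fields {
			if _, seen := fields[name]; !seen {
				fields[name] = typ
			}
		}
	}
	// Sort the tag keys so the column order is deterministic.
	tags := make([]string, 0, len(tagSet))
	for t := range tagSet {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return Schema{Tags: tags, Fields: fields}
}

func main() {
	schema := CumulativeSchema([]Series{
		{Tags: []string{"hostname"}, Fields: map[string]string{"usage_user": "float"}},
		{Tags: []string{"region", "hostname"}, Fields: map[string]string{"usage_idle": "float"}},
	})
	fmt.Println(schema.Tags, len(schema.Fields))
}
```

A series that lacks one of the columns in the resulting schema would simply have a null in that column, which is consistent with every exported column being nullable.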

To test the tool run

go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config influxdb.conf -database telegraf

@srebhan srebhan force-pushed the v1-bulk-exporter-parquet branch 2 times, most recently from 6869ba3 to bd44db9 Compare September 9, 2024 14:12
@srebhan srebhan force-pushed the v1-bulk-exporter-parquet branch from 7c930bb to 2bb73ce Compare September 17, 2024 19:39
Contributor

@davidby-influx davidby-influx left a comment

I did a quick review, but I'm not familiar with arrow and certainly missed some things. I can do a more thorough review if we paired to walk through the algorithm once.

@srebhan srebhan force-pushed the v1-bulk-exporter-parquet branch from 2bb73ce to 46aef0b Compare September 18, 2024 10:41
Member Author

srebhan commented Sep 18, 2024

@davidby-influx thanks for the thorough review! I tried to address all issues and commented on the three unresolved ones. Will schedule a meeting for walking through the code. Thanks again!

@srebhan srebhan force-pushed the v1-bulk-exporter-parquet branch from 23e7a05 to d7216ca Compare September 19, 2024 20:01
@srebhan srebhan force-pushed the v1-bulk-exporter-parquet branch from d7216ca to a7d0f1b Compare September 19, 2024 20:03
Contributor

@davidby-influx davidby-influx left a comment

LGTM

Contributor

alespour commented Oct 2, 2024

I'm not sure what to make of this: I have a v1 db with several measurements (cpu, disk, etc.), each with ~8M rows.

> select count(usage_user) from cpu

name: cpu
time count
---- -----
0    8631360

The same query returns a different number of rows in the exported Parquet "db":

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'all/cpu-*.parquet'"

count(usage_user)
-----------------
28771200 

Log attached.
cpu-export.log

Contributor

alespour commented Oct 2, 2024

  • tested measurement without tags - OK
  • tested single & all measurements export - OK, except for the discrepancy in the number of rows

Tested with a db simulating 1 month of monitoring data of a small data center (9 measurements like cpu, disk, etc., 10 tags). DB file size on disk: 4.1 GB, 5 shards.

Exported Parquet size on disk: 11 GB. The export took 1h6m on a somewhat obsolete laptop (Core i7 CPU, 8-core, 16 GB RAM, SSD). Memory usage during the export was stable (RSS peak ~2 GB).

InfluxDB measurement structure example:

> show tag keys from cpu
name: cpu
tagKey
------
arch
datacenter
hostname
os
rack
region
service
service_environment
service_version
team

> show field keys from cpu
name: cpu
fieldKey         fieldType
--------         ---------
usage_guest      float
usage_guest_nice float
usage_idle       float
usage_iowait     float
usage_irq        float
usage_nice       float
usage_softirq    float
usage_steal      float
usage_system     float
usage_user       float

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "describe select * from 'all/cpu-*.parquet'"

column_name          column_type  null  key  default  extra
-------------------  -----------  ----  ---  -------  -----
time                 TIMESTAMP    YES                      
arch                 VARCHAR      YES                      
datacenter           VARCHAR      YES                      
hostname             VARCHAR      YES                      
os                   VARCHAR      YES                      
rack                 VARCHAR      YES                      
region               VARCHAR      YES                      
service              VARCHAR      YES                      
service_environment  VARCHAR      YES                      
service_version      VARCHAR      YES                      
team                 VARCHAR      YES                      
usage_guest          DOUBLE       YES                      
usage_guest_nice     DOUBLE       YES                      
usage_idle           DOUBLE       YES                      
usage_iowait         DOUBLE       YES                      
usage_irq            DOUBLE       YES                      
usage_nice           DOUBLE       YES                      
usage_softirq        DOUBLE       YES                      
usage_steal          DOUBLE       YES                      
usage_system         DOUBLE       YES                      
usage_user           DOUBLE       YES                      

Measurement without tags:

alespour@master-node:/bigdata/x$ duckdb -column -s "select * from 'notags/*.parquet'"
time                        lat    lon  
--------------------------  -----  -----
2024-10-02 13:03:55.643371  49.95  14.47
2024-10-02 13:04:04.423014  49.91  14.49
2024-10-02 13:04:12.726653  49.94  14.53

Contributor

alespour commented Oct 2, 2024

I will repeat the test to verify the number of rows (mis)match.

Contributor

alespour commented Oct 2, 2024

My apologies, it was a mistake on my side. Row count matches.

InfluxDB:

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    28771200

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'cpu/*.parquet'"
count(usage_user)
-----------------
28771200  

@srebhan srebhan mentioned this pull request Oct 2, 2024
2 tasks
Contributor

alespour commented Oct 3, 2024

  • tested other types - OK
Creating the following schemata for 1 measurement(s):
  Measurement "types" with 0 tag(s) and  5 field(s):
    Column	Kind		Datatype
    ------	----		--------
    time	timestamp	timestamp (nanosecond)
    label	field		string
    lat		field		float
    lon		field		float
    match	field		boolean
    scale	field		integer
alespour@master-node:/bigdata/x$ sudo duckdb -column -s "describe from 'types/*.parquet'"
column_name  column_type  null  key  default  extra
-----------  -----------  ----  ---  -------  -----
time         TIMESTAMP    YES                      
label        VARCHAR      YES                      
lat          DOUBLE       YES                      
lon          DOUBLE       YES                      
match        BOOLEAN      YES                      
scale        BIGINT       YES
alespour@master-node:/bigdata/x$ sudo duckdb -column -s "select * from 'types/*.parquet' limit 1"
time                        label  lat    lon    match  scale
--------------------------  -----  -----  -----  -----  -----
2024-10-03 07:58:33.419431  a1     49.94  14.53  true   4    

Contributor

alespour commented Oct 3, 2024

It's GTG by me 👍

Member Author

srebhan commented Nov 18, 2024

To run the exporter in this PR, do the following (assuming you are using a Bash-compatible shell):

  1. Clone the repo and check out the PR
# git clone https://github.com/influxdata/influxdb.git
# cd influxdb/
# git fetch origin pull/25297/head:v1-bulk-exporter-parquet
# git checkout v1-bulk-exporter-parquet
  2. Build InfluxDB v1
# export PKG_CONFIG=${PWD}/pkg-config.sh
# go build ./...
  3. Run the exporter (with the help flag)
# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet --help
  4. Run the exporter with the config of an existing server instance
# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config <path to influxdb config dir>/influxdb.conf -database <database to export>

@dburton-influxdata

Do we have a compiled version to test with or do I still need to clone the repo and build the go binary?

@dburton-influxdata

I converted all of the Bash commands into a Python script and ran it. The build fails with an error.
exporter_build_errors_python exporter_script.txt

@dburton-influxdata

Here is the Python script in ZIP format for GitHub.
exporter_script.zip

Member Author

srebhan commented Nov 21, 2024

@dburton-influxdata using os.environ does NOT export the variable to subprocesses like the go command! You would need to use os.putenv but I don't understand why you need to use python for the whole thing...

@jwei-influx

I'll be taking this over from Darren. I have a few questions regarding this tool:

  1. Has there been any consideration about how the tool is intended to handle mixed field type shards?
  2. If we need to do any sort of custom partitioning on the eventual 3.0 system, are we able to do that with this tool? Or conversely, is the resulting parquet file from this tool able to be slotted in behind a custom partitioning scheme that is pre-applied to the 3.0 instance?
  3. Are we able to do any sort of manipulation of the tags and fields using this tool, or potentially by editing the resulting parquet files?
  4. Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system

I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.

Member Author

srebhan commented Dec 20, 2024

Thanks for your investigations @jwei-influx! Let me answer your questions:

  1. Has there been any consideration about how the tool is intended to handle mixed field type shards?

Yes. There are the --resolve-types and --resolve-names command-line options to fix type conflicts and name conflicts (between tags and fields) respectively.
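To make the two kinds of conflict concrete, here is a minimal sketch that only detects them; it is not the tool's code and says nothing about how `--resolve-types` or `--resolve-names` actually resolve anything. `ShardSchema` and `FindConflicts` are hypothetical names, and representing field types as plain strings is an assumption made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// ShardSchema is a hypothetical per-shard view of one measurement.
type ShardSchema struct {
	Tags   []string          // tag keys seen in the shard
	Fields map[string]string // field key -> type name seen in the shard
}

// FindConflicts reports fields that carry more than one type across shards
// (type conflicts) and keys used both as a tag and as a field (name conflicts).
func FindConflicts(shards []ShardSchema) (typeConflicts, nameConflicts []string) {
	fieldTypes := make(map[string]map[string]struct{})
	tags := make(map[string]struct{})
	for _, sh := range shards {
		for _, t := range sh.Tags {
			tags[t] = struct{}{}
		}
		for name, typ := range sh.Fields {
			if fieldTypes[name] == nil {
				fieldTypes[name] = make(map[string]struct{})
			}
			fieldTypes[name][typ] = struct{}{}
		}
	}
	for name, seen := range fieldTypes {
		if len(seen) > 1 {
			typeConflicts = append(typeConflicts, name)
		}
		if _, isTag := tags[name]; isTag {
			nameConflicts = append(nameConflicts, name)
		}
	}
	// Sort both results so the output is deterministic.
	sort.Strings(typeConflicts)
	sort.Strings(nameConflicts)
	return typeConflicts, nameConflicts
}

func main() {
	typeConflicts, nameConflicts := FindConflicts([]ShardSchema{
		{Tags: []string{"host"}, Fields: map[string]string{"value": "float"}},
		{Tags: []string{"region"}, Fields: map[string]string{"value": "integer", "host": "string"}},
	})
	fmt.Println("type conflicts:", typeConflicts)
	fmt.Println("name conflicts:", nameConflicts)
}
```

In this example `value` is a float in one shard and an integer in another, and `host` is a tag in one shard and a field in another, so each list contains one entry.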

  2. If we need to do any sort of custom partitioning on the eventual 3.0 system, are we able to do that with this tool? Or conversely, is the resulting parquet file from this tool able to be slotted in behind a custom partitioning scheme that is pre-applied to the 3.0 instance?

This is beyond the scope of this tool. This tool outputs the data as-is (plus potential type and name changes as discussed above) without the ability to split, merge or modify data.

  3. Are we able to do any sort of manipulation of the tags and fields using this tool, or potentially by editing the resulting parquet files?

As I mentioned above, you can modify field names (and types) using the --resolve-types and --resolve-names command-line options. Beyond this, the tool is not intended for manipulating or editing data or schemata, only for exporting existing data!

  4. Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system

This is a question you should ask the authors of the import tool that takes the parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer this...

I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.

Happy to answer them. ;-)

@jacobmarble
Member

  4. Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system

This is a question you should ask the authors of the import tool that takes the parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer this...

I haven't looked closely at this PR, and I'm no longer managing the team that is working on Parquet import. @helenosheaa might direct the right person to answer thoughtfully and accurately.

6 participants