[Bug] cannot import name 'FdedupRayTransformConfiguration' from 'fdedup_transform_ray' #898
Comments
@MFahadShahid: Which version of data-prep-kit are you using? We recently changed the implementation of fuzzy dedup: it is now a pipeline of 4 transforms (signature calculation, cluster analysis, get duplicate list, and data cleaning). As such, there is no more
I'm using data-prep-kit version 0.2.3 and following the documentation on the main page (https://ibm.github.io/data-prep-kit/). I'm currently at the "Run your first data prep pipeline" section, as shown in the attached image. This section includes two notebooks (sample-notebook.ipynb and demo_with_launcher.ipynb), both of which use FdedupRayTransformConfiguration. Could these notebooks be outdated?
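For anyone answering the version question from inside their own environment, here is a minimal sketch using only the standard library. The distribution names in the loop are assumptions (the names under which the kit is believed to be published on PyPI); substitute whatever `pip list` reports in your environment.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution names; adjust them to match the output of `pip list`.
for name in ("data-prep-toolkit", "data-prep-toolkit-transforms"):
    print(name, installed_version(name))
```

A result of `None` only means that no distribution is installed under that exact name, not that the code itself is missing.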
@MFahadShahid fdedup has undergone a number of improvements and we have not yet updated the documentation. Sorry about the confusion. For an example of how to use the 0.2.3 release, please see this notebook:
@cmadam Is there an example similar to this local_python ededup example where "data_local_config" is used? Thanks
Search before asking
Component
Transforms/universal/fdedup
What happened + What you expected to happen
I have set up a virtual environment and followed the documented steps for installing data-prep-kit. I'm testing the end-to-end pipeline examples (sample notebook and demo-with-launcher) and getting the following error:
cannot import name 'FdedupRayTransformConfiguration' from 'fdedup_transform_ray' (/opt/conda/envs/data-prep-kit/lib/python3.11/site-packages/fdedup_transform_ray.py)
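An import error of this kind usually means the module was found but no longer defines the requested name. As a rough diagnostic sketch (the helper `names_matching` is hypothetical and not part of data-prep-kit), the snippet below lists the names an installed module exposes that contain a given substring:

```python
import importlib


def names_matching(module_name, substring):
    """List attribute names of a module that contain `substring` (case-insensitive).

    Returns None if the module itself cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    needle = substring.lower()
    return sorted(name for name in dir(module) if needle in name.lower())


# Show which configuration-like names the installed module actually provides.
print(names_matching("fdedup_transform_ray", "Configuration"))
```

If the printed list does not include `FdedupRayTransformConfiguration`, the installed release exposes a different API than the one the notebook expects.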
Reproduction script
input_folder = "sample_data/docid_out"
output_folder = "sample_data/fdedup_out"

import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
fdedup_params = {
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
}
params = common_config_params | fdedup_params

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch
fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())
fdedup_launcher.launch()
Anything else
No response
OS
Red Hat Enterprise Linux (RHEL)
Python
3.11.x
Are you willing to submit a PR?