[BUG] ExpectColumnPairValuesAToBeGreaterThanB Spark Databricks #10559
Comments
@victorgrcp Hi there, thank you for the detailed report, and I appreciate the workaround you shared. I'm glad you were able to unblock yourself. I'm looking into this, but I'm not able to reproduce it. What version of Spark are you on? I can get this expectation to work in both success and failure cases.
result: |
Hi @adeola-ak, I updated to GX 1.2.0 and I'm on Spark version 3.5.0. Still the same problem. My Context setup:
|
Hi, I had exactly the same issue as @victorgrcp when upgrading my processes to GX v1.3.0. I've applied the same workaround using |
Describe the bug
I'm using a Spark Data Source with Spark DataFrames as Data Assets. When I try to validate the ExpectColumnPairValuesAToBeGreaterThanB expectation, it raises an error. Here is a small part of the exception raised:
{
"success": false,
"expectation_config": {
"type": "expect_column_pair_values_a_to_be_greater_than_b",
"kwargs": {
"column_A": "tpep_dropoff_datetime",
"column_B": "tpep_pickup_datetime",
"batch_id": "ds_samples_nyctaxi-da_df_trips"
},
"meta": {
"columns": [
"tpep_pickup_datetime",
"tpep_dropoff_datetime"
]
},
"id": "7310fd00-2153-43e9-8673-e8d7c4688abd"
},
"result": {},
"meta": {},
"exception_info": {
"('column_pair_values.a_greater_than_b.unexpected_count', '452c8f1abbd4f1d85e1503a16beb23ec', 'or_equal=None')": {
"exception_traceback": "Traceback (most recent call last):\n File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/great_expectations/execution_engine/execution_engine.py", line 533, in _process_direct_and_bundled_metric_computation_configurations\n metric_computation_configuration.metric_fn( # type: ignore[misc] # F not callable\n File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py", line 625, in _spark_map_condition_unexpected_count_value\n return filtered.count()\n ^^^^^^^^^^^^^^^^\n File "/databricks/spark/python/pyspark/sql/connect/dataframe.py", line 300, in count\n table, _ = self.agg(F._invoke_function("count", F.lit(1)))._to_table()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/databricks/spark/python/pyspark/sql/connect/dataframe.py", line 1971, in _to_table\n table, schema, self._execution_info = self._session.client.to_table(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1014, in to_table\n table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(req, observations)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1755, in _execute_and_fetch\n for response in self._execute_and_fetch_as_iterator(\n File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1731, in _execute_and_fetch_as_iterator\n
...
"exception_message": "[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "tpep_dropoff_datetime". It's probably because of illegal references like
df1.select(df2.col(\"a\")). SQLSTATE: 42704\n\nJVM stacktrace:\norg...}"
To Reproduce
Expected behavior
The expectation should validate without raising an exception.
Environment:
Additional context
I tried with a Pandas DataFrame and it worked, but I need to use UnexpectedRowsExpectation for other, more complex validations. To work around this datetime validation, I replaced ExpectColumnPairValuesAToBeGreaterThanB with UnexpectedRowsExpectation:
unexpected = gx.expectations.UnexpectedRowsExpectation(
    unexpected_rows_query=(
        "SELECT * FROM {batch} WHERE tpep_dropoff_datetime < tpep_pickup_datetime"
    )
)
Seems to work for now, but I wanted to raise this bug.
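For anyone applying the same workaround to other column pairs, here is a minimal sketch of how the query string could be generated. The helper `build_unexpected_rows_query` is hypothetical and not part of Great Expectations; it only builds the SQL text. It assumes that "unexpected" means the negation of the comparison, so a strict `A > B` check flags rows where `A <= B`, while the `<` used in the query above corresponds to allowing equal values (`or_equal=True`).

```python
def build_unexpected_rows_query(column_a: str, column_b: str, or_equal: bool = False) -> str:
    """Return SQL selecting rows that violate `column_a > column_b`.

    Hypothetical helper, not a Great Expectations API. The `{batch}`
    placeholder is kept literally in the returned string.
    """
    # Unexpected rows are those where the comparison fails:
    #   strict   (A >  B) -> unexpected when A <= B
    #   or_equal (A >= B) -> unexpected when A <  B
    operator = "<" if or_equal else "<="
    # Doubled braces keep `{batch}` as literal text in the f-string.
    return f"SELECT * FROM {{batch}} WHERE {column_a} {operator} {column_b}"


# Reproduces the query used in the workaround above (equal timestamps allowed).
query = build_unexpected_rows_query(
    "tpep_dropoff_datetime", "tpep_pickup_datetime", or_equal=True
)
print(query)
```

The resulting string could then be passed as `unexpected_rows_query` to `gx.expectations.UnexpectedRowsExpectation`, as in the snippet above. Rows with a NULL in either column are not selected by this query, since SQL comparisons against NULL do not evaluate to true.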
Thank you :)