Describe the bug
When using Pandera with nullable fields, there's a difference in behavior between Polars and Pandas validation. The Polars validation appears to drop rows with null values even when fields are explicitly marked as nullable, while Pandas validation correctly preserves these rows.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the main branch of pandera.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
import polars as pl
import pandera as pa
import pandera.polars as papl
from functools import partial

# Create a simple dataframe with null values and an invalid value
df = pl.DataFrame({
    "col1": ['1', '2', None, 'x'],
    "col2": ['valid', None, None, 'valid']
})

# Define a simple schema with nullable fields and an invalid-values check
invalids = ['x']

schema_field = partial(
    pa.Field,
    nullable=True,
    notin=invalids
)

class PolarsSchema(papl.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

class PandasSchema(pa.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

# Test Polars validation
print("Original DataFrame:")
print(df)

print("\nUsing Polars validation:")
print(df.pipe(PolarsSchema.validate, lazy=True))

print("\nUsing Pandas validation:")
print(
    df.to_pandas()
    .pipe(PandasSchema.validate, lazy=True)
    .pipe(pl.from_pandas)
)
Expected behavior
Both Polars and Pandas validation should handle null values the same way. Since the fields are marked as nullable=True, rows containing null values should be preserved. Only the row containing the invalid value 'x' should be dropped.
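For concreteness, this is the frame both backends should return under that expectation, written out by hand from the example input above (only the 'x' row removed):

import polars as pl

# Hand-built expected output: nullable rows kept, only the invalid 'x' row dropped
expected = pl.DataFrame({
    "col1": ["1", "2", None],
    "col2": ["valid", None, None],
})
print(expected)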
The problem boils down to the differences between how Pandas and Polars filter out rows with null values:
import polars as pl
import pandas as pd

df_polars = pl.DataFrame({"col1": ["1", "2", None, "x"]})
df_pandas = pd.DataFrame({"col1": ["1", "2", None, "x"]})

# This will filter out null values
print("Polars filtering")
print(df_polars.filter(~pl.col("col1").is_in(["x"])))

# This will not filter out null values
print("Pandas filtering")
print(df_pandas.query('~col1.isin(["x"])'))
After Pandera runs validation on a Polars DataFrame with drop_invalid_rows = True, it removes the invalid rows with Polars' filter method. The check expression evaluates to null for null entries, and filter drops rows where the predicate is null, so the nullable rows are discarded along with the genuinely invalid ones.
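A minimal sketch of that null propagation and one way to keep the null rows (the mask variable and the fill_null workaround below are illustrative, not pandera's internal code):

import polars as pl

df = pl.DataFrame({"col1": ["1", "2", None, "x"]})

# The check expression evaluates to null (not False) for the null entry
mask = ~pl.col("col1").is_in(["x"])

# filter() drops rows where the predicate is null, so the nullable row disappears
print(df.filter(mask))

# Treating a null predicate as "keep" reproduces the pandas-backed behavior
print(df.filter(mask.fill_null(True)))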
I found that the test for dropping invalid rows in Polars was not functioning as intended; there may be another bug with dropping invalid rows for columns of the wrong type, although I am not entirely sure what the intended behavior is in that case.
Hoping to dig in a little more in the next couple days.
Additional context
The behavior is consistent: Polars validation always drops the null rows, while Pandas validation preserves them.