Describe the bug
When using Pandera with nullable fields, there's a difference in behavior between Polars and Pandas validation. The Polars validation appears to drop rows with null values even when fields are explicitly marked as nullable, while Pandas validation correctly preserves these rows.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the main branch of pandera.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
import polars as pl
import pandera as pa
import pandera.polars as papl
from functools import partial

# Create a simple dataframe with null values and an invalid value
df = pl.DataFrame({
    "col1": ['1', '2', None, 'x'],
    "col2": ['valid', None, None, 'valid']
})

# Define a simple schema with nullable fields and an invalid-values check
invalids = ['x']

schema_field = partial(
    pa.Field,
    nullable=True,
    notin=invalids
)

class PolarsSchema(papl.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

class PandasSchema(pa.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

# Test Polars validation
print("Original DataFrame:")
print(df)

print("\nUsing Polars validation:")
print(df.pipe(PolarsSchema.validate, lazy=True))

print("\nUsing Pandas validation:")
print(
    df.to_pandas()
    .pipe(PandasSchema.validate, lazy=True)
    .pipe(pl.from_pandas)
)
Expected behavior
Both Polars and Pandas validation should handle null values the same way. Since the fields are marked as nullable=True, rows containing null values should be preserved. Only the row containing the invalid value 'x' should be dropped.
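For concreteness, this is the frame both backends should return under that expectation, written out by hand from the example input above (only the 'x' row removed):

import polars as pl

# Hand-built expected output: nullable rows kept, only the invalid 'x' row dropped
expected = pl.DataFrame({
    "col1": ["1", "2", None],
    "col2": ["valid", None, None],
})
print(expected)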
The problem boils down to the differences between how Pandas and Polars filter out rows with null values:
import polars as pl
import pandas as pd

df_polars = pl.DataFrame({"col1": ["1", "2", None, "x"]})
df_pandas = pd.DataFrame({"col1": ["1", "2", None, "x"]})

# This will filter out null values
print("Polars filtering")
print(df_polars.filter(~pl.col("col1").is_in(["x"])))

# This will not filter out null values
print("Pandas filtering")
print(df_pandas.query('~col1.isin(["x"])'))
After Pandera runs validation on a Polars DataFrame with drop_invalid_rows = True, it removes the invalid rows with Polars' filter method. The check expression evaluates to null for null entries, and filter drops rows where the predicate is null, so the nullable rows are discarded along with the genuinely invalid ones.
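A minimal sketch of that null propagation and one way to keep the null rows (the mask variable and the fill_null workaround below are illustrative, not pandera's internal code):

import polars as pl

df = pl.DataFrame({"col1": ["1", "2", None, "x"]})

# The check expression evaluates to null (not False) for the null entry
mask = ~pl.col("col1").is_in(["x"])

# filter() drops rows where the predicate is null, so the nullable row disappears
print(df.filter(mask))

# Treating a null predicate as "keep" reproduces the pandas-backed behavior
print(df.filter(mask.fill_null(True)))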
I found that the test for dropping invalid rows in Polars was not functioning as intended; there may be another bug with dropping invalid rows for columns of the wrong type, although I am not entirely sure what the intended behavior is in that case.
Hoping to dig in a little more in the next couple days.
Additional context
The behavior is consistent: Polars validation always drops the null rows, while Pandas validation preserves them.