Different null handling behavior between Polars and Pandas validation #1835

Open
2 of 3 tasks
alexismanuel opened this issue Oct 24, 2024 · 2 comments
Labels
bug Something isn't working

@alexismanuel
Contributor

Describe the bug
When using Pandera with nullable fields, there's a difference in behavior between Polars and Pandas validation. The Polars validation appears to drop rows with null values even when fields are explicitly marked as nullable, while Pandas validation correctly preserves these rows.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import polars as pl
import pandera as pa
import pandera.polars as papl
from functools import partial

# Create a simple dataframe with null values and an invalid value
df = pl.DataFrame({
    "col1": ['1', '2', None, 'x'],
    "col2": ['valid', None, None, 'valid']
})

# Define a simple schema with nullable fields and invalid values check
invalids = ['x']
schema_field = partial(
    pa.Field,
    nullable=True,
    notin=invalids
)

class PolarsSchema(papl.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

class PandasSchema(pa.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

# Test Polars validation
print("Original DataFrame:")
print(df)

print("\nUsing Polars validation:")
print(df.pipe(PolarsSchema.validate, lazy=True))

print("\nUsing Pandas validation:")
print(
    df.to_pandas()
    .pipe(PandasSchema.validate, lazy=True)
    .pipe(pl.from_pandas)
)

Expected behavior

Both Polars and Pandas validation should handle null values the same way. Since the fields are marked as nullable=True, rows containing null values should be preserved. Only the row containing the invalid value 'x' should be dropped.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.12
  • Pandera version: 0.20.4
  • Polars version: 1.11.0
  • Pandas version: 2.2.3

Screenshots

Console Outputs:

Original DataFrame:
shape: (4, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
│ x    ┆ valid │
└──────┴───────┘

Using Polars validation:
shape: (2, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
└──────┴───────┘

Using Pandas validation:
shape: (3, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
└──────┴───────┘

Additional context

The behavior is consistent: Polars validation always drops the null rows, while Pandas validation preserves them.

@ksolarski
Contributor

The problem boils down to the differences between how Pandas and Polars filter out rows with null values:

import polars as pl
import pandas as pd

df_polars = pl.DataFrame({"col1": ["1", "2", None, "x"]})
df_pandas = pd.DataFrame({"col1": ["1", "2", None, "x"]})

# Polars: the predicate evaluates to null for the null row, so filter() drops it
print("Polars filtering")
print(df_polars.filter(~pl.col("col1").is_in(["x"])))

# Pandas: isin() returns False for None, so the negated mask is True and the row is kept
print("Pandas filtering")
print(df_pandas.query('~col1.isin(["x"])'))

Console Output:

Polars filtering
shape: (2, 1)
┌──────┐
│ col1 │
│ ---  │
│ str  │
╞══════╡
│ 1    │
│ 2    │
└──────┘
Pandas filtering
   col1
0     1
1     2
2  None

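After Pandera runs validation on a Polars DataFrame with drop_invalid_rows = True, it filters out invalid rows using the filter method. In Polars the check expression evaluates to null for a null input, and filter keeps only rows where the predicate is true, so rows with null values are dropped along with the invalid ones.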

@baldwinj30
Contributor

I started digging into this a little here: main...baldwinj30:pandera:investigate_test_error

I found the test for dropping invalid rows in polars was not functioning as intended; there may be another bug with dropping invalid rows for columns of the wrong type, although I am not entirely sure what the intended behavior is in that case.

Hoping to dig in a little more in the next couple days.
