Add PydanticModel to pandera.engines.polars_engine (polars engine support for pydantic models) #1874

Open
mitches-got-glitches opened this issue Dec 12, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@mitches-got-glitches

mitches-got-glitches commented Dec 12, 2024

I would like to use Pydantic models in Pandera schemas but for polars, not pandas.

The current example shows the following:

import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel


class Record(BaseModel):
    """Pydantic record model."""

    name: str
    xcoord: int
    ycoord: int


class PydanticSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

Here PydanticModel is imported from pandas_engine. This doesn't work with polars, and if I switch the import to the polars_engine module, the object does not exist there.

I have been considering this note:

Since the PydanticModel datatype applies the BaseModel constructor to each row of the dataframe, using PydanticModel might not scale well with larger datasets.

This suggests that it may not be the best way of doing things, and that I may lose out on polars vectorisation and speed benefits. Maybe that is why the feature has not been developed yet.

To add further context, what I would like is to be able to define my data model and constraints once and only once. I am currently defining my models by inheriting from SQLModel, since I want the functionality that brings. I have also been generating mock data with polyfactory, which needs models that inherit from pydantic's BaseModel. I then want to use my data model to validate data files on ingest, and ideally I don't want to loop through applying the validation row by row; I want to validate in a vectorised way.

Alternatives I've considered:

  1. Rewriting the data model as a class that inherits from pa.DataFrameModel, in a manner consistent with the docs on use with polars.
  2. Writing something that can convert a pydantic.BaseModel into a pa.DataFrameModel dynamically - I don't know whether this is possible or desired, or whether something exists already for this.

While these alternatives allow me to take advantage of the full vectorisation, option 1 requires maintaining the data model in two different places, which could result in sync issues. Option 2 requires building something new (roughly along the lines of the sketch below), which may have some use in the package API.
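To make option 2 concrete, here is a rough sketch of the kind of helper I have in mind, assuming pydantic v2 (model_fields / FieldInfo) and the pandera.polars namespace; the schema_from_pydantic name is made up, and only plain scalar annotations are handled:

from typing import Type

import pandera.polars as pa
from pydantic import BaseModel


def schema_from_pydantic(model: Type[BaseModel]) -> pa.DataFrameSchema:
    """Build a pandera polars DataFrameSchema from a pydantic model's fields.

    Sketch only: handles plain scalar annotations and required/optional fields;
    pydantic constraints (gt, max_length, ...) are not translated here.
    """
    columns = {
        name: pa.Column(field.annotation, nullable=not field.is_required())
        for name, field in model.model_fields.items()
    }
    return pa.DataFrameSchema(columns)


class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int


schema = schema_from_pydantic(Record)
# polars frames could then be validated in a vectorised way via schema.validate(df)

Something along these lines would let the pydantic model remain the single source of truth, with the pandera schema derived from it at import time.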

Is this feature on the roadmap (polars support for pydantic models)? Or does anyone have any advice which may help me achieve my goals? Thanks!

@mitches-got-glitches mitches-got-glitches added the enhancement New feature or request label Dec 12, 2024
@mitches-got-glitches
Author

mitches-got-glitches commented Dec 12, 2024

It would also be great to have some kind of portability between pydantic.BaseModel and the pyspark DataFrame model in pandera. I could open a separate ticket for this if needed.

@mitches-got-glitches
Author

I've just noticed the supported feature matrix across engines, which is very useful (it's on the front page, so my bad).

Consider this issue a +1 in support of Pydantic integration with polars. It might be useful to add an additional emoji or two to that matrix indicating what is on the roadmap or actively under development. 🛠📅📈

@cosmicBboy
Collaborator

It would also be great to have some kind of portability between pydantic.BaseModel and the pyspark DataFrame model in pandera. I could open a separate ticket for this if needed.

Yes! mind making a ticket for this?

It might be useful to add an additional emoji or two into here indicating what might be on the roadmap or is actively under development. 🛠📅📈

Great idea, let me look into this early next year 💡

Which suggests that maybe this is not the best way of doing things and I may lose out on polars vectorisation and speed benefits by doing this. Maybe that is why the feature has not been developed yet.

In a way, this is the reason... the pydantic + pandas-pandera integration is our first attempt at unlocking pydantic models on pandas, but it's not the most performant implementation.

Writing something that can convert a pydantic.BaseModel into a pa.DataFrameModel dynamically - I don't know whether this is possible or desired, or whether something exists already for this.

A more principled approach would be to have some sort of translation layer from pydantic-native types into pandera-native types (which themselves translate to some underlying framework like polars). I think some constrained set of mappings from pydantic types to pandera dtypes would be a good start to unlocking this functionality.
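To sketch what that starter mapping might look like (nothing here is an existing pandera API; the mapping table and the check_from_constraint helper are illustrative, and this assumes pydantic v2, which exposes field constraints as annotated_types metadata):

import datetime

import annotated_types as at
import pandera.polars as pa
import polars as pl


# illustrative starter mapping: python/pydantic scalar annotations -> polars dtypes
PYDANTIC_TO_POLARS = {
    str: pl.Utf8,
    int: pl.Int64,
    float: pl.Float64,
    bool: pl.Boolean,
    datetime.date: pl.Date,
    datetime.datetime: pl.Datetime,
}


def check_from_constraint(constraint):
    """Translate a single pydantic (annotated_types) constraint into a pandera Check."""
    if isinstance(constraint, at.Gt):
        return pa.Check.gt(constraint.gt)
    if isinstance(constraint, at.Ge):
        return pa.Check.ge(constraint.ge)
    if isinstance(constraint, at.Lt):
        return pa.Check.lt(constraint.lt)
    if isinstance(constraint, at.Le):
        return pa.Check.le(constraint.le)
    return None  # unsupported constraint: skip (or raise) in a real translation layer

A column-level translation would then combine the dtype lookup with whatever checks can be collected from each field's metadata.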

To start, can you give example pydantic models that you would like to "just work" with pandera polars schemas?
