check_types argument inplace ignored #1851

Open
2 of 3 tasks
lukepeck opened this issue Nov 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@lukepeck

Describe the bug
Possible edge case concerning the inplace argument of the pa.check_types decorator: _check_arg only calls the schema's validate method if the argument has no 'pandera' attribute, if arg_value.pandera.schema is None, or if arg_value.pandera.schema != the schema model determined earlier.

As a result, when the argument already carries the matching schema, validate is never executed. So when pa.check_types(inplace=False) is used, the expected copy of the input data is not made, and modifications inside the decorated function leak into outer scopes as side effects (see example below).

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandera as pa
from pandera.typing import DataFrame


class ExampleSchema(pa.DataFrameModel):
    column1: int


class MyClass:

    @pa.check_types(inplace=False)  # should copy the input dataframe argument
    def my_method(self, input_dataframe: DataFrame[ExampleSchema]) -> None:
        # inplace = False should copy input_dataframe meaning any modifications here
        # should not persist in the outer scope.
        input_dataframe["column2"] = 0.0
        return


if __name__ == "__main__":
    c = MyClass()
    example_df = DataFrame[ExampleSchema]({"column1": [1]})
    print(example_df.head())  # only column1 exists
    c.my_method(example_df)
    print(example_df.head())  # column2 exists as a side effect of my_method

Expected behavior

After calling my_method in the above example with pa.check_types(inplace=False), I expected the method to receive a copy of input_dataframe, so that operations inside it do not persist to example_df in the __main__ scope (i.e. column2 should not be present).

Desktop (please complete the following information):

  • OS: Ubuntu 20.04.6 LTS
  • Version: 0.20.4

Output

   column1
0        1
   column1  column2
0        1      0.0

Fix?

For the above example, adding the condition below to _check_arg seems to work (but I haven't done any wider checks):

                if (
                    not hasattr(arg_value, "pandera")
                    or arg_value.pandera.schema is None
                    # don't re-validate a dataframe that contains the same
                    # exact schema
                    or arg_value.pandera.schema != schema
                    or inplace is False    # This is new
                ):
                    try:
                        arg_value = schema.validate(
                            arg_value,
                            head,
                            tail,
                            sample,
                            random_state,
                            lazy,
                            inplace,
                        )
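
To make the effect of the extra condition easier to see without installing pandera, here is a minimal pure-Python sketch of the guard. Everything in it is a hypothetical stand-in: FakeSchema, FakeFrame, check_arg and the honour_inplace flag are illustration names, not pandera's API, and the copy-on-validate behaviour is assumed from the expectation described in this issue.

```python
import copy


class FakeSchema:
    """Hypothetical stand-in for a schema; not pandera's API."""

    def validate(self, data, inplace=False):
        # Assumed behaviour: copy the data unless inplace=True.
        out = data if inplace else copy.deepcopy(data)
        out.schema = self
        return out


class FakeFrame:
    """Hypothetical stand-in for a dataframe that remembers its schema."""

    def __init__(self, columns, schema=None):
        self.columns = dict(columns)
        self.schema = schema


def check_arg(arg_value, schema, inplace, honour_inplace=False):
    """Simplified model of the guard in _check_arg.

    honour_inplace=True models the extra `inplace is False` condition.
    """
    if (
        arg_value.schema is None
        or arg_value.schema != schema
        or (honour_inplace and inplace is False)
    ):
        return schema.validate(arg_value, inplace=inplace)
    # Guard skipped: the caller's own object is passed through uncopied.
    return arg_value


schema = FakeSchema()

# Current guard: an already-typed frame skips validate, so no copy is made.
typed = FakeFrame({"column1": [1]}, schema=schema)
current = check_arg(typed, schema, inplace=False)
current.columns["column2"] = [0.0]
print("column2" in typed.columns)  # True: the caller's frame was mutated

# Guard with the extra condition: validate runs and returns a copy.
typed2 = FakeFrame({"column1": [1]}, schema=schema)
patched = check_arg(typed2, schema, inplace=False, honour_inplace=True)
patched.columns["column2"] = [0.0]
print("column2" in typed2.columns)  # False: the caller's frame is untouched
```

One trade-off worth noting: with the extra condition, every call made with inplace=False re-validates the argument even when it already carries the same schema, which gives up the "don't re-validate" shortcut the existing comment describes.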

lukepeck added the bug label on Nov 11, 2024