Better error message for DataFrame.apply #153

Open
datapythonista opened this issue Aug 23, 2019 · 8 comments

@datapythonista
Member

The following example seems reasonable to me:

import pandas

df = pandas.DataFrame({'col1': ['1', '2', '3'],
                       'col2': ['9', '9', '9']})
df.apply(int)

It looks like it should convert the data in the DataFrame to integers by calling the int() function on every element.

That would be true for Series.apply, but the function passed to DataFrame.apply receives a whole Series at a time, not individual (scalar) values. The method that calls a function on one value at a time is DataFrame.applymap.

This is how pandas is designed, and while it is probably a bit confusing, it is reasonable. So the df.apply(int) call above actually fails. The error is:

TypeError: ("cannot convert the series to <class 'int'>", 'occurred at index col1')
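For comparison, here is a minimal sketch of the calls that do work on the same data. It assumes a pandas version where the element-wise method is available either as `applymap` (the name used in this thread) or as `DataFrame.map` (the name in newer releases); the `elementwise` variable and the lambda are only for illustration.

```python
import pandas

df = pandas.DataFrame({'col1': ['1', '2', '3'],
                       'col2': ['9', '9', '9']})

# Series.apply calls the function once per element, so int() works here.
col1_as_int = df['col1'].apply(int)

# Element-wise over the whole DataFrame. The method is `applymap` in the
# pandas versions discussed here; newer releases name it `DataFrame.map`.
elementwise = getattr(df, 'map', None) or df.applymap
df_as_int = elementwise(int)

# DataFrame.apply works when the function accepts a whole Series (a column).
df_via_apply = df.apply(lambda column: column.astype(int))

print(df_as_int)
```

In all three cases the function receives the kind of argument it expects: a scalar for `int`, a Series for the lambda.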

Feel free to disagree, but personally I think the error message doesn't do a great job of telling the user what's wrong, or of giving hints on how to fix it. I think something like the following would be more useful:

TypeError: The function `int` passed to `DataFrame.apply` should expect a `Series` as the argument. To apply a function that receives a single item at a time use `DataFrame.applymap`.

While this may look straightforward, it is not, and it is certainly not as easy as replacing the error message. The current message is raised by the Series itself when something tries to convert it to an integer, as in int(pandas.Series()), so it has nothing to do with apply.
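This can be checked without calling `apply` at all. A minimal sketch, assuming a Series with more than one element (the variable names are illustrative):

```python
import pandas

series = pandas.Series(['1', '2', '3'])

error = None
try:
    # No DataFrame.apply involved: the conversion itself is what fails.
    int(series)
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

Because the exception is raised inside the Series conversion, `apply` only sees a plain `TypeError` and cannot easily tell it apart from a `TypeError` raised for any other reason inside the user's function.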

I think it's doable to have an appropriate error message, but I'm not sure about the implications.

Feel free to discuss your proposals on how to fix it here, or to try your approach and open a PR, and have the discussion there.

@WuraolaOyewusi
Contributor

I agree with you. I've been noticing better error messages, with suggestions on how to fix the problem, in some libraries like sklearn.
It's better for an error message to be useful than succinct. If we can come up with one that is both very useful and succinct, great.
I will experiment with this and get back to you.

@WuraolaOyewusi
Contributor

I don't know if there is a pandas convention on how to add suggestions to error messages.

TypeError: ("cannot convert series to <class 'int'>", 'occurred at index col1', "to convert individual value to <class 'int'>, try DataFrame.applymap")

@datapythonista
Member Author

Proposals are made by opening issues in the pandas repository. In this case, the problem is not so much the exact wording of the error message, but rather how to identify the problem in the code. As explained at the end of the description, the error being raised does not come from the apply function; it is a "generic" Series error.

@WuraolaOyewusi
Contributor

Hmmmmm

@martinagvilas
Member

How is it going @WuraolaOyewusi? I'm interested in collaborating on this.

@datapythonista would it be useful to add a check at the beginning of df.apply that the function to apply is not int or float (and otherwise raise the error you described)? Or are you thinking of an even more abstract solution, like raising one type of error if the error comes from the Series class, and another type if it comes from apply?

@datapythonista
Member Author

int and float are just examples; you could also pass math.log, which expects a scalar and should fail with the same new error. The solution is not trivial. One idea would be that when a Series is cast to a type it cannot be converted to, it raises a subclass of TypeError (e.g. InvalidCastError). apply could then catch this specific exception, and I guess it would be safe to tell the user that the function should accept a Series as its parameter but does not.
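The idea above could be sketched roughly as follows. This is a pure-Python illustration of the control flow only, not pandas code: `InvalidCastError` does not exist in pandas, and `ToySeries` and `toy_apply` are hypothetical stand-ins for `Series` and `DataFrame.apply`.

```python
import math


class InvalidCastError(TypeError):
    """Hypothetical: raised when a whole Series is cast to a scalar type."""


class ToySeries:
    """Stand-in for pandas.Series, only to illustrate the control flow."""

    def __init__(self, values):
        self.values = list(values)

    def __int__(self):
        raise InvalidCastError("cannot convert the series to <class 'int'>")

    def __float__(self):
        raise InvalidCastError("cannot convert the series to <class 'float'>")


def toy_apply(columns, func):
    """Stand-in for DataFrame.apply: call func once per column."""
    results = {}
    for name, values in columns.items():
        try:
            results[name] = func(ToySeries(values))
        except InvalidCastError as exc:
            # Only the specific cast failure is translated; any other
            # TypeError raised by func propagates unchanged.
            func_name = getattr(func, '__name__', repr(func))
            raise TypeError(
                f"The function `{func_name}` passed to `DataFrame.apply` "
                "should expect a `Series` as the argument. To apply a "
                "function that receives a single item at a time use "
                "`DataFrame.applymap`."
            ) from exc
    return results


messages = {}
for scalar_func in (int, math.log):
    try:
        toy_apply({'col1': ['1', '2', '3']}, scalar_func)
    except TypeError as exc:
        messages[scalar_func.__name__] = str(exc)

# A function that accepts a whole column is unaffected.
lengths = toy_apply({'col1': ['1', '2', '3']}, lambda s: len(s.values))
```

Because `math.log` goes through `__float__`, it hits the same exception as `int`, so no list of "known scalar functions" is needed. Whether raising a new exception type from Series casts is acceptable for pandas is exactly the open question about implications mentioned in the description.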

@martinagvilas
Member

@datapythonista makes sense! I will try to implement something similar and tell you how it goes.

@WuraolaOyewusi
Contributor

@martinagvilas, I haven't figured out a way to go about this yet.


3 participants