Change default string storage from "python" to "pyarrow" (if installed) for NA-variant of StringDtype #60287
Labels
API Design
NA - MaskedArrays
Related to pd.NA and nullable extension arrays
Strings
String extension data type and string data
Historically, the default value for the string storage (globally configurable through `pd.options.mode.string_storage`) of `StringDtype` was `"python"`, and users needed to explicitly ask for `"pyarrow"`. This is still the behaviour on `main`.

For the new NaN-variant of `StringDtype`, however, we implemented the default string storage option `"auto"`, meaning "use pyarrow if installed, otherwise use python". So on a system with pyarrow installed, the NaN-variant uses pyarrow storage by default.

Essentially, we interpret the default `string_storage` option setting of `"auto"` differently for the NaN vs NA variant of the string dtype, which you can see in the code here: pandas/pandas/core/arrays/string_.py, lines 152 to 163 in 5f23ace.
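The referenced lines are not reproduced here. As a rough illustration of the behaviour described above, the following sketch resolves the `"auto"` setting for both variants. It is not the pandas implementation: `resolve_storage`, its parameters, and the `"nan"`/`"na"` labels are hypothetical names chosen for this example.

```python
from importlib.util import find_spec


def resolve_storage(option, na_variant, pyarrow_installed, proposed=False):
    """Hypothetical helper (not pandas API) mirroring the description above.

    option: value of the string_storage setting ("auto", "python", "pyarrow")
    na_variant: "nan" for the NaN-variant, "na" for the NA-variant (labels assumed)
    proposed: if True, apply the behaviour proposed in this issue to the NA-variant
    """
    if option != "auto":
        # An explicit setting is used as-is.
        return option
    if na_variant == "nan" or proposed:
        # "use pyarrow if installed, otherwise use python"
        return "pyarrow" if pyarrow_installed else "python"
    # NA-variant today: "auto" still means "python".
    return "python"


# Detect pyarrow without importing it.
HAS_PYARROW = find_spec("pyarrow") is not None

print(resolve_storage("auto", "nan", HAS_PYARROW))
print(resolve_storage("auto", "na", HAS_PYARROW))
print(resolve_storage("auto", "na", HAS_PYARROW, proposed=True))
```

With pyarrow installed, the first and third calls resolve to pyarrow storage while the second stays on python storage, which is the inconsistency this issue is about.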
Proposal: I think it makes sense to also switch to "pyarrow" as the default string storage (if installed) for the nullable StringDtype. This is somewhat of a breaking change, although mostly for the dtype object itself: behaviour-wise, there should be hardly any difference between the two backends for string operations. So I would keep this for 3.0 and properly document it in the whatsnew notes.
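For code that depends on the storage of the dtype object itself, one way to stay independent of the default is to request the storage explicitly. A minimal sketch, assuming pandas is installed; it only uses the `storage` argument of `StringDtype` and the `mode.string_storage` option mentioned above, and the variable names are illustrative:

```python
import pandas as pd

# Request the Python-backed storage explicitly, regardless of the global default.
pinned = pd.StringDtype(storage="python")
ser = pd.Series(["a", "b", None], dtype=pinned)
print(ser.dtype.storage)

# Alternatively, set the global option for a limited scope.
with pd.option_context("mode.string_storage", "python"):
    scoped = pd.StringDtype()
print(scoped.storage)

# String operations work on the pinned dtype as usual.
upper = ser.str.upper()
print(upper.tolist())
```

Pinning the storage like this should only matter where the dtype object is inspected or compared; ordinary string operations are expected to behave the same on both backends.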