Closed
Labels
Enhancement, Performance (memory or execution speed performance), Strings (string extension data type and string data)
Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
pandas.Series.str.get_dummies currently always returns data of type numpy.int64. It would be nice if other data types could be specified.
Feature Description
Add a new parameter to str.get_dummies
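To make the request concrete, here is a minimal sketch of the behaviour such a parameter could provide. The helper name str_get_dummies_with_dtype and its dtype argument are hypothetical, chosen for illustration only; they are not part of the pandas API, and the issue itself does not name the parameter. The sketch builds the indicator matrix directly in the requested dtype rather than in int64.

```python
import numpy as np
import pandas as pd


def str_get_dummies_with_dtype(series, sep="|", dtype=np.uint8):
    """Hypothetical helper (not a pandas API): build indicator columns
    from separator-delimited strings, stored in a caller-chosen dtype."""
    # Split each string into its tags; non-string (missing) values get no tags.
    tag_lists = [
        [tag for tag in value.split(sep) if tag] if isinstance(value, str) else []
        for value in series
    ]
    # The sorted unique tags become the output columns.
    tags = sorted({tag for row_tags in tag_lists for tag in row_tags})
    column_of = {tag: j for j, tag in enumerate(tags)}

    # Allocate the indicator matrix once, in the requested dtype.
    dummies = np.zeros((len(series), len(tags)), dtype=dtype)
    for i, row_tags in enumerate(tag_lists):
        for tag in row_tags:
            dummies[i, column_of[tag]] = 1
    return pd.DataFrame(dummies, index=series.index, columns=tags)


s = pd.Series(["a|b", "a", None, "b|c"])
small = str_get_dummies_with_dtype(s, dtype=np.uint8)
print(small.dtypes.unique())  # one dtype for every column: uint8
```

The Python-level loop is slower than the built-in method, so this is only meant to show the intended semantics, not a drop-in replacement.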
Alternative Solutions
N/A
Additional Context
Since pandas.Series.str.get_dummies is the easiest way in pandas to implement multi-encoding, it would be great if more data types were supported. The int64 output used now can easily cause out-of-memory (OOM) errors. In fact, I ran into exactly this problem, which is what prompted this request:
Traceback (most recent call last):
  File "D:\CodeSpace\comp9727-assn2\preprocessing.py", line 13, in <module>
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 1919, in get_dummies
    result, name = self._data.array._str_get_dummies(sep)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\object_array.py", line 369, in _str_get_dummies
    dummies = np.empty((len(arr), len(tags2)), dtype=np.int64)
numpy.core._exceptions.MemoryError: Unable to allocate 25.8 GiB for an array with shape (231637, 14942) and data type int64
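The size in the error message can be checked with simple arithmetic. The sketch below takes the shape from the traceback and compares the dense footprint at 8 bytes per element (int64) with 1 byte per element (as a uint8 or bool result would need). The helper name dense_gib is only illustrative.

```python
# Shape reported in the MemoryError above.
rows, cols = 231637, 14942


def dense_gib(n_rows, n_cols, itemsize):
    """Size in GiB of a dense n_rows x n_cols array with itemsize bytes per element."""
    return n_rows * n_cols * itemsize / 2**30


int64_gib = dense_gib(rows, cols, 8)  # int64: 8 bytes per element
uint8_gib = dense_gib(rows, cols, 1)  # uint8 or bool: 1 byte per element

print(f"int64: {int64_gib:.1f} GiB")       # int64: 25.8 GiB
print(f"uint8/bool: {uint8_gib:.1f} GiB")  # uint8/bool: 3.2 GiB
```

A one-byte dtype would cut the allocation by a factor of eight for the same shape.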