:orphan:
.. currentmodule:: dirty_cat

Release 0.4.1

Major changes

Minor changes

  • Improved date column detection and date format inference in :class:`TableVectorizer`. Format inference now looks for a single format that works for all non-missing values of the column, instead of relying on pandas behavior. If no such format exists, the column is not cast to a date column. :pr:`543` by :user:`Leo Grinsztajn <LeoGrin>`
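
The idea of picking a format that parses every non-missing value can be sketched as follows. This is an illustration only, not the :class:`TableVectorizer` implementation: the function name ``infer_common_date_format`` and the candidate format list are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical candidate formats; the real inference may consider others.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def infer_common_date_format(values, candidates=CANDIDATE_FORMATS):
    """Return the first format that parses every non-missing value, else None."""
    non_missing = [v for v in values if v is not None and v != ""]
    if not non_missing:
        return None
    for fmt in candidates:
        try:
            for value in non_missing:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not match this format; try the next one.
            continue
        return fmt
    # No single format fits the whole column: leave it as a non-date column.
    return None
```

With these candidates, a column mixing ``"13/01/2021"`` and ``"01/13/2021"`` has no common format, so the sketch returns ``None`` and the column would be left uncast.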

Release 0.4.0

Major changes

Minor changes

Bug fixes

Release 0.3.0

Major changes

Notes

Release 0.2.2

Bug fixes

Release 0.2.1

Major changes

Bug fixes

Notes

Release 0.2.0

Also see pre-release 0.2.0a1 below for additional changes.

Major changes

Notes

Release 0.2.0a1

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes

Bug fixes

Release 0.1.1

Major changes

Bug fixes

Release 0.1.0

Major changes

Bug fixes

Release 0.0.7

  • MinHashEncoder: Added the minhash_encoder.py and fast_hash.py files, which implement minhash encoding through the :class:`MinHashEncoder` class. This method allows fast and scalable encoding of string categorical variables.
  • datasets.fetch_employee_salaries: Changed the download source for employee_salaries.
    • The function now returns a bunch with a dataframe under the field "data", and not the path to the csv file.
    • The field "description" has been renamed to "DESCR".
  • SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
  • SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
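
To give a rough idea of what minhash encoding of strings means, here is a minimal pure-Python sketch. It is not the code in minhash_encoder.py: the function names, the choice of ``hashlib.blake2b`` as the seeded hash, and the handling of missing values as an all-zero vector are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(string, n=3):
    """Set of character n-grams of the space-padded string."""
    padded = f" {string} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def seeded_hash(text, seed):
    """64-bit hash of `text`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        text.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_encode(string, n_components=8, n=3):
    """Encode a string as the minimum hash of its n-grams, one per seed."""
    if string is None:
        # Assumed missing-value handling for this sketch: an all-zero vector.
        return [0] * n_components
    grams = char_ngrams(string, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]
```

Two strings that share many n-grams are likely to agree on several components, and the output size depends only on ``n_components``, not on the number of distinct categories.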

Release 0.0.6

  • SimilarityEncoder: Accelerated SimilarityEncoder.transform by:
    • computing the vocabulary count vectors in fit instead of transform
    • computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the :class:`SimilarityEncoder`.
  • SimilarityEncoder: Fixed a bug that prevented a :class:`SimilarityEncoder` from being created when categories was a list.
  • SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
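
The benefit of moving the vocabulary count vectors from transform to fit can be sketched with a toy encoder. The class ``TinySimilarityEncoder`` and the similarity formula below are assumptions for illustration and do not reproduce the library's code; parallelism via joblib and the float32 dtype are left out to keep the example dependency-free.

```python
from collections import Counter


def ngram_counts(string, n=3):
    """Count character n-grams of the space-padded string."""
    padded = f" {string} "
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def ngram_similarity(a, b):
    """Shared n-gram count divided by the combined n-gram count."""
    shared = sum((a & b).values())    # element-wise minimum of the counts
    combined = sum((a | b).values())  # element-wise maximum of the counts
    return shared / combined


class TinySimilarityEncoder:
    def fit(self, categories):
        self.categories_ = sorted(set(categories))
        # Computed once here, so transform does not redo this work per call.
        self.vocabulary_counts_ = [ngram_counts(c) for c in self.categories_]
        return self

    def transform(self, values):
        rows = []
        for value in values:
            counts = ngram_counts(value)
            rows.append([ngram_similarity(counts, ref) for ref in self.vocabulary_counts_])
        return rows
```

Each call to ``transform`` then only counts n-grams for the incoming values and reuses the stored vectors for the known categories.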

Release 0.0.5

  • SimilarityEncoder: Changed the default ngram range to (2, 4), which performs better empirically.
  • SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
  • SimilarityEncoder: Performance improvements in the ngram similarity.
  • SimilarityEncoder: Expose a get_feature_names method.
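
As a rough illustration of the most_frequent strategy, prototype categories can be chosen by keeping only the most common values seen during fitting. The helper below is a hypothetical sketch, not the library's implementation, and its name and signature are assumptions.

```python
from collections import Counter


def most_frequent_prototypes(values, n_prototypes):
    """Return the n_prototypes most common non-missing values."""
    counts = Counter(v for v in values if v is not None)
    return [category for category, _ in counts.most_common(n_prototypes)]
```

Similarities are then computed against this reduced set only, which bounds the encoded dimension regardless of how many distinct categories the column contains.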