DatumBox v0.8.0 Release Notes

Release Date: 2017-01-14
    • Storage/Persistence:
      • The old database connectors are replaced with a new storage mechanism where every engine is a separate module.
      • The InMemory engine now stores objects independently and keeps track of their references in a catalog.
      • The MapDB engine stores objects in directories and makes the asynchronous writes configurable.
      • The models are no longer persisted automatically at the end of fit(). Instead, the caller has full control over how and when they are stored by calling the save method.
      • The Dataframes can now be persisted using the save and load methods.
      • All the file-based engines persist models using a hierarchical directory structure, making the sharing of models easier.
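The explicit-save behaviour described above can be illustrated with a minimal, library-independent sketch. Everything in it (the `ExplicitSaveSketch` class, the toy `MeanModel`, the `model.ser` file name and the use of Java serialization) is an assumption made for illustration and is not the Datumbox storage API; it only shows that fitting leaves nothing on disk until save is called, and that each stored model lives in its own directory.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of explicit, directory-based persistence;
// this is NOT the Datumbox storage API.
final class ExplicitSaveSketch {

    /** Toy model: "training" just memorises the mean of the inputs. */
    static final class MeanModel implements Serializable {
        private static final long serialVersionUID = 1L;
        private double mean;

        /** Fits in memory only; nothing is written to disk here. */
        void fit(double[] values) {
            double sum = 0.0;
            for (double v : values) {
                sum += v;
            }
            mean = values.length == 0 ? 0.0 : sum / values.length;
        }

        double predict() {
            return mean;
        }
    }

    /** Writes the model to root/storageName/model.ser and returns that path. */
    static Path save(Path root, String storageName, MeanModel model) {
        try {
            Path dir = Files.createDirectories(root.resolve(storageName));
            Path file = dir.resolve("model.ser");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(model);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back the model stored under root/storageName. */
    static MeanModel load(Path root, String storageName) {
        Path file = root.resolve(storageName).resolve("model.ser");
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (MeanModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates a temporary root directory for the sketch. */
    static Path newTempRoot() {
        try {
            return Files.createTempDirectory("models");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a model is shared by copying its directory under the root, which mirrors the idea of a hierarchical directory structure without claiming to reproduce the actual layout.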
    • Speed & Memory:
      • All the algorithms that use matrices now support disk-based training.
      • Improved the serialization speed by passing Key/Value information to BigMaps.
      • Faster and more memory-efficient methods in the Combinations and Arithmetics classes.
      • Addition of the FixedBatchSpliterator class in the concurrency package.
      • Speed optimizations on DPMM algorithms.
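The notes only state that a FixedBatchSpliterator class was added, so the following is a generic sketch of the fixed-batch splitting technique rather than the Datumbox source. The class name `FixedBatchSketch` and the `parallel` helper are hypothetical. The idea is that each split hands a parallel worker a batch of a fixed number of elements, which helps when the underlying source is sequential and would otherwise split poorly.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Generic sketch of fixed-batch splitting; NOT the Datumbox FixedBatchSpliterator source.
final class FixedBatchSketch<T> implements Spliterator<T> {
    private final Spliterator<T> source;
    private final int batchSize;

    FixedBatchSketch(Spliterator<T> source, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    /** Wraps a stream so that parallel workers receive batches of up to batchSize elements. */
    static <T> Stream<T> parallel(Stream<T> in, int batchSize) {
        return StreamSupport.stream(new FixedBatchSketch<>(in.spliterator(), batchSize), true);
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        return source.tryAdvance(action);
    }

    @Override
    public Spliterator<T> trySplit() {
        // Copy up to batchSize elements into an array and hand them off
        // as an independent, exactly sized spliterator.
        Object[] batch = new Object[batchSize];
        int[] filled = {0};
        while (filled[0] < batchSize && source.tryAdvance(e -> batch[filled[0]++] = e)) {
            // keep filling the batch
        }
        if (filled[0] == 0) {
            return null; // source exhausted, nothing left to split off
        }
        return Spliterators.spliterator(batch, 0, filled[0], characteristics());
    }

    @Override
    public long estimateSize() {
        return source.estimateSize();
    }

    @Override
    public int characteristics() {
        // Drop size- and sort-related flags; the wrapper does not guarantee them.
        return source.characteristics() & ~(SIZED | SUBSIZED | SORTED);
    }
}
```

A call such as `FixedBatchSketch.parallel(lines, 64)` would then process a stream in parallel in batches of 64; the batch size is a tuning parameter chosen here only for illustration.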
    • Code Improvements & New Algorithms:
      • New Model Selection package:
        • Added a new metrics submodule with the most important validation metrics for: classification, clustering, regression and recommendation.
        • Added a new splitter submodule that partitions the data into chunks: K-fold & Shuffle splits are supported.
        • Model selection is now performed by combining the splitters and metrics packages with the Validator class.
        • The ValidationMetrics are no longer stored inside the model and k-Fold validation is no longer part of the model's methods.
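How a splitter and a metric combine can be shown with a small, library-independent sketch. The names below (`KFoldSketch`, `Split`, `split`, `accuracy`) are hypothetical and do not reproduce the Datumbox splitter, metrics or Validator classes; the sketch only illustrates k-fold partitioning of record indices and a classification accuracy metric computed outside the model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; NOT the Datumbox splitter/metrics/Validator API.
final class KFoldSketch {

    /** One train/test partition expressed as record indices. */
    static final class Split {
        final List<Integer> train;
        final List<Integer> test;

        Split(List<Integer> train, List<Integer> test) {
            this.train = train;
            this.test = test;
        }
    }

    /** Shuffles the indices 0..n-1 with a fixed seed and cuts them into k folds. */
    static List<Split> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be in [2, n]");
        }
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ids.add(i);
        }
        Collections.shuffle(ids, new Random(seed));

        List<Split> splits = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            // Integer boundaries spread any remainder across the folds.
            int from = (int) ((long) fold * n / k);
            int to = (int) ((long) (fold + 1) * n / k);
            List<Integer> test = new ArrayList<>(ids.subList(from, to));
            List<Integer> train = new ArrayList<>(ids.subList(0, from));
            train.addAll(ids.subList(to, n));
            splits.add(new Split(train, test));
        }
        return splits;
    }

    /** Classification accuracy: fraction of predictions equal to the true label. */
    static double accuracy(List<String> actual, List<String> predicted) {
        if (actual.isEmpty() || actual.size() != predicted.size()) {
            throw new IllegalArgumentException("lists must be non-empty and of equal size");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }
}
```

A validation loop under these assumptions would train on `split.train`, predict on `split.test`, compute the metric for each fold and then average the results, keeping the metrics outside the model itself.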
      • New Preprocessing package:
        • The old data transformation package is replaced with a new one that decouples numerical scaling from categorical encoding.
        • The following numeric scalers are now supported: MinMaxScaler, StandardScaler, MaxAbsScaler and BinaryScaler.
        • The following categorical encoders are now supported: OneHotEncoder and CornerConstraintsEncoder.
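The separation of numerical scaling from categorical encoding can be sketched as two independent steps. The code below is a hypothetical, self-contained illustration of min-max scaling and one-hot encoding in general; `PreprocessingSketch`, `minMaxScale` and `oneHot` are made-up names and do not reflect how the Datumbox MinMaxScaler or OneHotEncoder are implemented or called.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration; NOT the Datumbox preprocessing API.
final class PreprocessingSketch {

    /** Min-max scaling: maps values linearly so the minimum becomes 0 and the maximum 1. */
    static double[] minMaxScale(double[] values) {
        if (values.length == 0) {
            return new double[0];
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has no range; map it to 0 by convention in this sketch.
            scaled[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    /** One-hot encoding: one 0/1 indicator per distinct level of a categorical column. */
    static List<Map<String, Double>> oneHot(String column, List<String> values) {
        TreeSet<String> levels = new TreeSet<>(values); // sorted distinct levels
        List<Map<String, Double>> rows = new ArrayList<>();
        for (String value : values) {
            Map<String, Double> row = new LinkedHashMap<>();
            for (String level : levels) {
                row.put(column + "=" + level, level.equals(value) ? 1.0 : 0.0);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Because the two steps share no state, either can be swapped for a different scaler or encoder without touching the other, which is the practical benefit of the decoupling.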
      • Improved Feature Selection package:
        • All feature selection algorithms focus on specific datatypes, making it possible to chain different methods together.
        • Simplified the inheritance of feature selection algorithms.
      • Improved Code Architecture:
        • The common module is now lighter and more reusable. Some of its classes & interfaces were moved to the core module.
        • The Storage mechanism is split into separate modules, enabling the support of new storage engines.
        • The new Tests module allows reusing Test Base classes and configuration files. Moreover, using CI, we now test the code with different configurations (one for each storage engine) across all the popular operating systems.
      • Usability:
        • All the algorithms are initialized/loaded using the MLBuilder.
        • The Training Parameters are now provided during the initialization of the algorithms rather than using a setter.
        • The recommendersystem package is renamed to recommendation.
        • The Modeler now receives a list of feature selector parameters, which allows feature selection methods to be chained together.
        • TextClassifier inherits directly from Modeler.
        • The Configuration objects use separate properties files.
        • Refactored and removed duplicate code, improved naming of packages, classes and configuration entries.
    • Dependencies:
      • Reduced the number of dependencies:
        • lp_solve is removed in favour of a pure Java simplex solver. As a result, Datumbox does not depend on any system libraries.
        • commons-lang, which was used for HTML parsing, is replaced with a faster custom HTMLParser implementation.
      • libsvm, commons-math, commons-csv, slf4j and logback-classic are updated to the latest stable versions.