Lingua v0.5.0 Release Notes

Release Date: 2019-08-12
  • Languages

    • Added 12 new languages: Bengali, Chinese (traditional and simplified are not yet differentiated), Gujarati, Hebrew, Hindi, Japanese, Korean, Punjabi, Tamil, Telugu, Thai, Urdu

    Features

    • The LanguageDetectorBuilder now supports the additional method withMinimumRelativeDistance(), which lets you specify the minimum distance between the logarithmized and summed-up probabilities of the possible languages. If two or more languages yield nearly the same probability for a given input text, the wrong language may well be returned. With a higher minimum relative distance, Language.UNKNOWN is returned instead of risking a false positive.

    • Test report generation can now use multiple CPU cores, running as many reports in parallel as there are CPU cores available. This has been implemented as an additional attribute for the respective Gradle task: ./gradlew writeAccuracyReports -PcpuCores=...

    • The REPL now lets you freely specify the languages you want to try out by entering the desired ISO 639-1 codes. Previously, it was only possible to choose between certain language combinations.
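The idea behind withMinimumRelativeDistance() can be illustrated with a small, language-agnostic sketch. This is not Lingua's actual implementation: the function name pick_language, the UNKNOWN constant, and in particular the formula used for the relative distance (the gap between the two best scores divided by the magnitude of the second-best score) are assumptions made purely for illustration.

```python
UNKNOWN = "UNKNOWN"  # stand-in for Language.UNKNOWN (hypothetical constant)


def pick_language(summed_log_probs, minimum_relative_distance=0.0):
    """Return the best-scoring language, or UNKNOWN if the top two are too close.

    summed_log_probs maps a language name to its summed log probability
    (values are <= 0; closer to 0 means more likely).
    """
    if not summed_log_probs:
        return UNKNOWN

    # Rank languages from most to least likely.
    ranked = sorted(summed_log_probs.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]

    best_language, best_score = ranked[0]
    _, second_score = ranked[1]

    # ASSUMPTION: relative distance = gap between the two best scores,
    # normalized by the magnitude of the second-best score.
    if second_score == 0:
        distance = 0.0
    else:
        distance = (best_score - second_score) / abs(second_score)

    # Too close to call: report UNKNOWN rather than risk a false positive.
    if distance < minimum_relative_distance:
        return UNKNOWN
    return best_language


scores = {"ENGLISH": -10.0, "GERMAN": -12.0, "FRENCH": -30.0}
print(pick_language(scores))        # no threshold: the best candidate wins
print(pick_language(scores, 0.25))  # strict threshold: candidates are too close
```

Raising the threshold trades recall for precision: fewer texts receive a concrete language, but those that do are less likely to be misclassified.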

    Improvements

    • The overall detection algorithm has been improved, yielding slightly more accurate results for those languages that are based on the Latin alphabet.

    Bug Fixes

    • Thanks to the great work of contributor Bernhard Geisberger, two bugs have been fixed.

    The fix in pull request #8 solves a problem where the MapDB cache files could not be recreated automatically after the data had been corrupted.

    The fix in pull request #9 makes the class LanguageDetector completely thread-safe. Previously, in rare cases two threads could mutate one of the internal variables at the same time, yielding inaccurate language detection results.

    Thank you, Bernhard.