CoreNLP v4.2.0 Release Notes
Release Date: 2020-11-17
Overview
This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.
Enhancements
- Upgrade libraries (EJML, JUnit, JFlex)
- Add character offsets to Tregex responses from server
- Improve cleaning of treebanks for English models
- Speed up loading of Wikidict annotator
- New utility for tagging CoNLL-U files in place
- Command line tool for processing TokensRegex
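
For context on the CoNLL-U item above: each token line in a CoNLL-U file holds ten tab-separated fields, and tagging a file in place amounts to filling in a tag column while leaving the rest of the line untouched. The sketch below illustrates only that idea. It is not CoreNLP's utility; the class `ConlluSketch`, its methods, and the choice of the UPOS column are hypothetical, and which column the real utility writes is not stated in these notes.

```java
class ConlluSketch {
    // CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    static final int UPOS = 3;
    static final int FIELD_COUNT = 10;

    /** True for ordinary word lines: not blank, not a "#" comment,
     *  and with a plain integer ID (so ranges like "1-2" are skipped). */
    static boolean isWordLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return false;
        }
        String id = line.split("\t", -1)[0];
        return id.matches("\\d+");
    }

    /** Returns the line with one column replaced; every other column is kept as-is. */
    static String setField(String line, int column, String value) {
        String[] fields = line.split("\t", -1);
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException("expected 10 tab-separated fields: " + line);
        }
        fields[column] = value;
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        // Hypothetical input line with the tag columns still empty ("_").
        String line = "1\tdogs\tdog\t_\t_\t_\t2\tnsubj\t_\t_";
        if (isWordLine(line)) {
            System.out.println(setField(line, UPOS, "NOUN"));
        }
    }
}
```

Taking the column index as a parameter keeps the sketch neutral about whether a tagger fills UPOS or XPOS.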
Fixes
- Output single token NER entities in inline XML output format
- Add currency symbol part of speech training data
- Fix issues with tree binarizing
Previous changes from v4.0.0
Overview
The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.
Enhancements
- UD v2.0 tokenization standard for English, French, German, and Spanish
- New mwt annotator for handling multiword tokens in French, German, and Spanish
- New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish
- French NER
- New Chinese segmentation based off CTB9
- Improved handling of double codepoint characters
- Easier syntax for specifying language specific pipelines and NER pipeline properties
- Improved CoNLL-U processing
- Improved speed and memory performance for CRF training
- Tregex support in CoreSentence
- Updated library dependencies
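
As background for the "double codepoint characters" item above: reading that phrase as characters that Java stores as two UTF-16 code units (a surrogate pair) is an assumption, since the notes do not define it. Under that reading, the sketch below shows why such characters complicate character offsets: `String.length()` counts code units, not code points. It uses only the Java standard library and is not CoreNLP code; the class `CodePointSketch` and its methods are hypothetical.

```java
class CodePointSketch {
    /** Number of Unicode code points, which can be smaller than String.length(). */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Converts an index counted in code points into an index counted in UTF-16 code units. */
    static int charOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println(s.length());          // 4 UTF-16 code units
        System.out.println(codePointLength(s));  // 3 code points
        System.out.println(charOffset(s, 2));    // 3: "b" is the third code point but sits at char index 3
    }
}
```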
Fixes
- NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
- NPE in EntityMentionsAnnotator during language check
- NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
- NPE in NERCombinerAnnotator in certain configurations of models on/off
- Incorrect handling of eolonly option in ArabicSegmenterAnnotator
- Apply named entity granularity change prior to coref mention detection
- Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
- Incorrect handling of reading in German treebank files
- SR parser crashes when given bad training input