# LanguageTool Change Log
* The add-on for LibreOffice/OpenOffice is not part of this repository anymore. Find it at https://github.com/languagetool-org/languagetool-for-libreoffice.
* There were also minor rule improvements for Galician, Belarusian, Esperanto, Arabic, and Russian.
* The `/languages` endpoint now lists language codes like `fr-FR` and `es-ES` for languages that don't actually have a variant (e.g. there is no `fr-CA`). These codes can also be used for the `language` parameter when sending a request. `fr-FR` will internally be mapped to `fr`, etc. (https://github.com/languagetool-org/languagetool/issues/7421)
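  For illustration, a minimal Java sketch that fetches this list; the server URL is an assumption, any LT server works:

  ```java
  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Minimal sketch: fetch /v2/languages and print the JSON array, which
  // includes variant-style codes like fr-FR as described above.
  public class ListLanguages {
    public static void main(String[] args) throws Exception {
      URL url = new URL("https://api.languagetool.org/v2/languages");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // e.g. [{"name":"French","code":"fr","longCode":"fr-FR"}, ...]
        }
      }
    }
  }
  ```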
* The `--api` parameter for the command-line version has been removed. It had long been deprecated and was replaced by `--json`.
* The `warmup` setting for the config file, which had no effect anymore, has been removed.
* The `--word2vecmodel` and `--neuralnetworkmodel` options have been removed, as these features were not maintained and had never been used on languagetool.org.
* A file `grammar_custom.xml` can be put into the same directory that contains the `grammar.xml` file for your language. This file will be loaded in addition to `grammar.xml`. It can contain custom rules that you want to use now and with future versions of LanguageTool, without modifying existing files. `grammar_custom.xml` needs to use the same XML syntax as `grammar.xml` and it must not introduce rule IDs that are already in use by rules in other files.
* The `--word2vecModel` option has been deprecated.
* The `--allow-origin` option doesn't require a parameter anymore, to avoid confusion about whether `*` needs to be quoted on Windows. Using `--allow-origin` without a parameter now implies `*`.
* New value `firstupper` for the `case_conversion` attribute in `grammar.xml` (see issue #3241).
* Changed English part-of-speech taggings, e.g. `us[we/PRP,we/PRP_O1P]; mine[mine/PRP$,I/PRP$_P1S]` and `does[do/VBZ]n't[not/RB]; Harper[Harper/NNP,harper/NN]'s['s/POS]`.
* Support for Belgian Dutch (`nl-BE`) has been added. "Dutch" (`nl`) is still the default. `nl-BE`-specific rules can be added to `nl-BE/grammar.xml`.
* New `RegexAntiPatternFilter` which can be used to have antipatterns for `<regexp>` rules. Use it like this:

  ```
  <regexp>my regex</regexp>
  <filter class="org.languagetool.rules.patterns.RegexAntiPatternFilter" args="antipatterns:regex1|regex2"/>
  ```

  The antipatterns cannot contain spaces.
* Some rules are only active in picky mode (`--level PICKY` on the command line, `level=picky` with the HTTP API).
* `ProhibitedCompoundRule` has its own ID now, so it can be separately turned on/off.
* `ConfusionProbabilityRule` has its own ID now, so it can be separately turned on/off.
* New attribute `chunk_re` for `<token>`, which specifies a chunk as a regular expression.
* German rules specific to `de-DE` and `de-AT` can be added to `de/de-DE-AT/grammar.xml`.
* `_` and `/` can now be escaped in `spelling.txt` and `spelling_custom.txt` using a backslash. For example, `foo\/s` will add `foo/s` to the dictionary.
* New filter arguments `prefix` and `suffix` for matching the part-of-speech of parts of words with a prefix and suffix added to the original token, e.g.:

  ```
  <filter class="org.languagetool.rules.ru.RussianPartialPosTagFilter"
          args="no:2 regexp:(.*) postag_regexp:(ADV) prefix:не suffix: "/>
  ```
* `replace_custom.txt` has been added for several languages so users can have their own very simple replace rules without worrying about updates (they still need to copy the file to the new LT version, though).
* Updated `com.gitlab.dumonts:hunspell` to 1.1.1 to make spell checking work on older Linux distributions like RHEL 7.
* New POS tag `ORD` for ordinal numbers (e.g. first, second, twenty-third).
* `compounds.txt` now automatically expands `ß` to `ss` when using German (Switzerland).
* `spelling.txt` now supports a `prefix_verb` syntax like `vorüber_eilen` so the speller will accept all forms of "eilen" prefixed by "vorüber".
* The package `org.languagetool.dev.wikipedia.atom` has been removed. It hadn't been maintained for years and didn't work properly anymore.
* `spelling_global.txt` has been added. Words or phrases added here will be accepted for all languages.
* `prohibit_custom.txt` and `spelling_custom.txt` can be used to make your own additions to `spelling.txt` and `prohibit.txt` without having to edit those files after a LanguageTool update (you will still need to manually copy those files). Paths to these files (`xx` = language code):

  ```
  ./org/languagetool/resource/xx/hunspell/prohibit_custom.txt
  ./org/languagetool/resource/xx/hunspell/spelling_custom.txt
  ```

  Note that you can simply create these files if they don't exist for your language yet.
* Languages added at runtime (`lang-xx=...` and `lang-xx-dictPath=...`) now also support hunspell dictionaries. Just let `lang-xx-dictPath` point to the absolute path of the `.dic` file. Note that hunspell is quite slow when it comes to offering suggestions for misspelled words.
* `AbstractSimpleReplaceRule2` has been fixed so that it's now case-insensitive. If you implement a subclass of it and you want the old behavior, implement `isCaseSensitive()` and have it return `true`. (Issue #2051)
* The new `default="temp_off"` attribute in grammar.xml files will turn off a rule/rulegroup but keep it activated for our nightly regression tests.
* `added.txt` and `removed.txt` can be used to add or remove dictionary entries (except for Catalan and Polish; for German, removing compounds in `removed.txt` might not work) (#884).
* New POS tag `PCT` for punctuation marks (`.,;:…!?`).
* `FRENCH_WHITESPACE` has been split into `FRENCH_WHITESPACE` (on by default) and `FRENCH_WHITESPACE_STRICT` (off by default). `FRENCH_WHITESPACE` only complains if there's no space at all before `?`, `!`, `;`, `:`, or `»`. `FRENCH_WHITESPACE_STRICT` complains if there's no space, or a regular space instead of a non-breaking space, before these characters.
* Some entries from `false-friends.xml` are not supported anymore because their precision isn't good enough. See `confusion_sets_l2_de.txt` for active DE/EN pairs. Use "My handy is broken." to test the rule. As before, this will only create an error if `motherTongue` is set to a German language code.
* `prohibit.txt`: lines starting with `.*` will prohibit all words ending with the subsequent string (e.g. `.*artigel` will prohibit "Versandartigel").
* `altLanguages` will only be considered for words with at least 3 characters.
* The new file `resource/en/en-US-GB.txt` contains a mapping from US to British English and vice versa. It's not used to detect correct or incorrect spellings, but only to improve error messages so that they explicitly explain that the incorrect word is actually a different variant (like "colour" in an en-US text).
* Strings like `mydomain.org/` are now detected as domains and not considered spelling errors anymore. Note that the slash is still needed to avoid missing real errors.
* The `replacements` list now has an optional new item `shortDescription` for each value. It can contain a short definition/hint about the word. Currently, the only words that have a short description are ones that have a description in `confusion_sets.txt` (i.e. a text after the `|` symbol).
* `interpretAs` part of `getTextWithMarkup()` (#1393)
* New attribute `raw_pos` for the `<pattern>` element in grammar.xml. If set to `yes`, the postag will refer to the part-of-speech tags before disambiguation.
* `<antipattern>` is now supported in `disambiguation.xml`.
* New `preferredLanguages` parameter: up to a certain limit (currently 50 characters), only these languages will be considered for language detection. This has to be a comma-delimited list of language codes without variants (e.g. use `en`, not `en-US`). This only works with fasttext configured as the language detector.
* Spell-check-only languages can be added with `lang-xx=languagename` and `lang-xx-dictPath=/path/to/morfologik.dict`; `xx` needs to be the language code. The JSON result will contain `spellCheckOnly: true` for these languages.
* New `altLanguages` parameter: takes a list of language codes. Unknown words of the main language (as specified by the `language` parameter) will cause errors of type "Hint" if they are accepted by one of these languages. We expect clients to interpret this like style issues, e.g. these words should be underlined with a light blue instead of red. Support for this is experimental, i.e. it might be removed again or implemented in a different way.
* New `noopLanguages` parameter: takes a list of language codes of languages that are not supported by LT but that will be detected and mapped to a no-op language without rules. Useful for clients that rely on language auto-detection and whose users might use languages not supported by LT. NOTE 1: this only works with fastText configured. NOTE 2: setting languages here will worsen language detection quality on average.
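  A hedged sketch of passing the `altLanguages` parameter described above (`noopLanguages` is passed the same way); the server URL and sample text are made up:

  ```java
  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.net.URLEncoder;

  // Minimal sketch: check English text but let German words pass as
  // "Hint" matches via altLanguages.
  public class CheckWithAltLanguages {
    public static void main(String[] args) throws Exception {
      String body = "language=en-US"
          + "&altLanguages=de-DE"
          + "&text=" + URLEncoder.encode("This is a Versuch.", "UTF-8");
      HttpURLConnection conn = (HttpURLConnection)
          new URL("https://api.languagetool.org/v2/check").openConnection();
      conn.setRequestMethod("POST");
      conn.setDoOutput(true);
      try (OutputStream out = conn.getOutputStream()) {
        out.write(body.getBytes("UTF-8"));
      }
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // the match for "Versuch" should be of type "Hint"
        }
      }
    }
  }
  ```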
* Added `confidence` to the `detectedLanguage` object in the JSON response; it contains the probability score for the detected language as computed by the detection algorithm.
* Text after an email signature separator (`\n-- \n`) is now ignored.
* New `data` parameter that describes markup. For example:

  ```
  {"annotation":[
    {"text": "A "},
    {"markup": "<b>"},
    {"text": "test"},
    {"markup": "</b>"}
  ]}
  ```

  LanguageTool will ignore the markup parts and run the check only on the text parts. The error offset positions will still refer to the original input including the markup, so that suggestions can easily be applied. You can optionally use `interpretAs` to have markup interpreted as whitespace, like this: `{"markup": "<p>", "interpretAs": "\n\n"}`. HTML entities still need to be converted to Unicode characters before feeding them into LT. (Issue: https://github.com/languagetool-org/languagetool/issues/757)
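  A sketch of sending the `data` parameter; the server URL is an assumption, the annotation JSON is the example from above:

  ```java
  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.net.URLEncoder;

  // Minimal sketch: check text with markup via the data parameter
  // instead of the text parameter.
  public class CheckWithMarkup {
    public static void main(String[] args) throws Exception {
      String json = "{\"annotation\":["
          + "{\"text\": \"A \"},"
          + "{\"markup\": \"<b>\"},"
          + "{\"text\": \"test\"},"
          + "{\"markup\": \"</b>\"}]}";
      String body = "language=en-US&data=" + URLEncoder.encode(json, "UTF-8");
      HttpURLConnection conn = (HttpURLConnection)
          new URL("https://api.languagetool.org/v2/check").openConnection();
      conn.setRequestMethod("POST");
      conn.setDoOutput(true);
      try (OutputStream out = conn.getOutputStream()) {
        out.write(body.getBytes("UTF-8"));
      }
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // offsets refer to the original input incl. markup
        }
      }
    }
  }
  ```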
* The `blockedReferrers` setting now also considers the `Origin` header.
* A `blockedReferrers` setting of `foobar.org` will now automatically match http://foobar.org, http://www.foobar.org, https://foobar.org, and https://www.foobar.org.
* New config options `fasttextModel` (see https://fasttext.cc/docs/en/language-identification.html) and `fasttextBinary` (see https://fasttext.cc/docs/en/support.html). With these options set, automatic language detection is much better than the built-in one.
* New `mode` parameter with `all`, `textLevelOnly`, or `allButTextLevelOnly` as value: will check only text-level rules or all other rules. As there are fewer text-level rules, this is usually much faster, and the access limit for characters per minute that can be checked is more generous for this mode.
* New `type` property in the JSON output. This is supposed to help clients choose the color with which they underline/mark errors. Please do not rely on this yet; it might change or even be removed.
* `prohibit.txt`: lines ending with `.*` will prohibit all words starting with the previous string.
* The JSON response now contains `detectedLanguage` (under `language`) with information about the automatically detected language. This way clients can suggest switching to that language, e.g. in cases where the user had selected the wrong language.
* New config option `blockedReferrers`: a comma-separated list of HTTP referrers that are blocked and will not be served.
* New config options `dbDriver`, `dbUrl`, `dbUsername`, and `dbPassword` to allow user-specific dictionaries.
* The constructors of the `*SpellerRule` classes (e.g. `MorfologikRussianSpellerRule`) have changed.
* `LanguageIdentifier` will now only consider the first 1000 characters when identifying the language of a text. This improves performance for long texts.
* New rule `NL_PREFERRED_WORD_RULE` that suggests preferred words (e.g. "fiets" for "rijwiel").
* Added `<url>` to rules.
* `-*` in `ignore.txt`: entries ending with `-*` are ignored only if they are part of a hyphenated compound (e.g. `Fair-Trade-*` allows "Fair-Trade-Kakao").
* Wrong compounds like "Lehrzeile" instead of "Leerzeile" are now detected; requires ngram data (rule id `DE_PROHIBITED_COMPOUNDS`).
* Constructors taking a `ResultCache` have been removed from `MultiThreadedJLanguageTool`, as using them caused incorrect results. (https://github.com/languagetool-org/languagetool/issues/897)
* Removed the category `MISC` and moved its rules to more specific categories.
* Suggestions are now also offered for badly misspelled words like "algortherm" (algorithm) or "theromator" (thermometer). In the worst case (every single word of a text misspelled), this has a performance penalty of about 30%.
* `0xFFFF`
* A `RuleMatch` can now have a URL, too. The URL usually points to a page that describes the error or grammar rule in more detail. Before, only the `Rule` could have a URL. A `RuleMatch` URL will overwrite the `Rule` URL in the JSON output.
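  The same precedence can be applied when consuming matches from the Java API; a small sketch, assuming the getter is `getUrl()` on both `RuleMatch` and `Rule`:

  ```java
  import java.net.URL;
  import org.languagetool.JLanguageTool;
  import org.languagetool.language.AmericanEnglish;
  import org.languagetool.rules.RuleMatch;

  // Minimal sketch: prefer the per-match URL and fall back to the rule's
  // URL, mirroring the precedence in the JSON output described above.
  public class MatchUrls {
    public static void main(String[] args) throws Exception {
      JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
      for (RuleMatch match : lt.check("This is an test.")) {
        URL url = match.getUrl() != null ? match.getUrl() : match.getRule().getUrl();
        System.out.println(match.getMessage() + (url != null ? " -> " + url : ""));
      }
    }
  }
  ```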
* `RuleMatch` now also has information about the sentence the error occurred in (it used to have only position information, and the caller was expected to find the error context and/or sentence position in the original text).
* `requestLimit` and `requestLimitPeriodInSeconds` now both need to be set for the limit to work.
* New config option `timeoutRequestLimit`: similar to `requestLimit`, but this one does not limit all requests; instead it blocks once this many timeouts have been caused by the IP in the time span set by `requestLimitPeriodInSeconds`.
* New config option `requestLimitInBytes`: similar to `requestLimit`, but this one limits the aggregated size of requests caused by an IP in the time span set by `requestLimitPeriodInSeconds`.
* New config option `maxErrorsPerWordRate`: sets the maximum allowed errors per word, e.g. `0.3` if the maximum is about one error per three words. More errors will stop the check with an exception. This is useful so no processing time gets wasted on texts with a huge number of errors that are only caused by the wrong language being selected (leading to most words being detected as spelling errors).
* The output now contains a `sentence` property with the text of the sentence the error occurred in.
* New files `spelling-de-AT.txt` and `spelling-de-CH.txt` for de-AT and de-CH that will be considered in addition to `spelling.txt`.
* New endings are supported in `compounds.txt`; these endings indicate that the mid-word parts of the compound need to be lower-cased (e.g. "Geräte Wahl" -> "Gerätewahl").
* `AnnotatedText` (built via `AnnotatedTextBuilder`) can now contain document-level metadata. This might be used by rules in the future.
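  A minimal sketch of building and checking an annotated text with the Java API:

  ```java
  import org.languagetool.JLanguageTool;
  import org.languagetool.language.AmericanEnglish;
  import org.languagetool.markup.AnnotatedText;
  import org.languagetool.markup.AnnotatedTextBuilder;

  // Minimal sketch: build an AnnotatedText and check it; error positions
  // refer to the original input including the markup.
  public class AnnotatedCheck {
    public static void main(String[] args) throws Exception {
      AnnotatedText text = new AnnotatedTextBuilder()
          .addText("A ")
          .addMarkup("<b>")
          .addText("test")
          .addMarkup("</b>")
          .build();
      JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
      System.out.println(lt.check(text));
    }
  }
  ```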
* "Kuß" now suggests only "Kuss" and also has a message explaining to the user that "Kuß" is an old spelling.
* The `apiVersion` property of the JSON output is now a number instead of a string (issue #712).
* `spelling.txt` allows multi-word entries: the words/tokens (separated by " ") of one line are converted to a `DisambiguationPatternRule` in which each word is a case-sensitive and non-inflected `PatternToken` (result: the entire multi-word entry is ignored by the spell checker).
* Added the `--languageModel` option to the embedded server, thanks to Michał Janik (issue #404).
* A `ResultCache` has been added to speed up the LT server.
* `EnglishRule`, `GermanRule`, `CatalanRule`, and `FrenchRule` are now deprecated. These are empty abstract classes that never had any real use. Rules that extend these classes will directly extend `Rule` or `TextLevelRule` in a future release.
* Some rules now extend `TextLevelRule` instead of `Rule`.
* The `Lithuanian` class has been deprecated. Lithuanian in LT hasn't been maintained for years and there's no new maintainer in sight. It also has very low usage on languagetool.org and very few error detection rules anyway, so we'll remove its support from LT in the next release.
* The `Malayalam` class has been deprecated. Malayalam in LT hasn't been maintained for years and there's no new maintainer in sight. It also has very low usage on languagetool.org and very few error detection rules anyway, so we'll remove its support from LT in the next release.
* Portuguese now has a `confusion_sets.txt` file where word pairs can be added. See http://wiki.languagetool.org/finding-errors-using-n-gram-data for more information, but note that we cannot offer the required ngram data for Portuguese yet, as we rely on the Google ngram data and Portuguese isn't part of that.
* Added `RussianWordCoherencyRule`.
* Added `removed.txt` for words that need to be removed from the dictionary.
* A new method for removing overlapping errors has been implemented. By default, it is enabled for the HTTP API and LibreOffice outputs, and disabled for the command-line output. If necessary, priorities for rules and categories can be set in `Language.getPriorityForId(String id)`. The default value is 0; positive integers have higher priority and negative integers have lower priority.
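A sketch of setting such a priority in a custom language class; the subclassed language and the rule id are made up, and this assumes `getPriorityForId()` is overridable as described above:

```java
import org.languagetool.language.AmericanEnglish;

// Minimal sketch: give one rule a positive priority so its matches win
// when they overlap with matches of default-priority (0) rules.
public class MyEnglish extends AmericanEnglish {
  @Override
  protected int getPriorityForId(String id) {
    if ("MY_IMPORTANT_RULE".equals(id)) { // hypothetical rule id
      return 10; // > 0: higher priority than the default of 0
    }
    return super.getPriorityForId(id);
  }
}
```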
* `Language.getShortName()` has been deprecated, use `Language.getShortCode()` instead.
* `Language.getShortNameWithCountryAndVariant()` has been deprecated, use `Language.getShortCodeWithCountryAndVariant()` instead.
* `Languages.getLanguageForShortName()` has been deprecated, use `Languages.getLanguageForShortCode()` instead.
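The migration is mechanical; a small before/after sketch:

```java
import org.languagetool.Language;
import org.languagetool.Languages;

// Minimal sketch of the renamed methods listed above.
public class ShortCodeMigration {
  public static void main(String[] args) {
    // before: Languages.getLanguageForShortName("en-US")
    Language lang = Languages.getLanguageForShortCode("en-US");
    // before: lang.getShortName()
    System.out.println(lang.getShortCode()); // "en"
    // before: lang.getShortNameWithCountryAndVariant()
    System.out.println(lang.getShortCodeWithCountryAndVariant()); // "en-US"
  }
}
```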
The following languages have been unmaintained for a long time. A warning has been shown for some time on languagetool.org and in the stand-alone GUI for these languages. This warning has now been extended to Java in the form of a deprecation, i.e. the constructors of the following languages have been deprecated. That does not mean they are going to be removed in the next version, but it's a warning that we cannot offer support for them or guarantee they will be included in the future:
If you're interested in contributing to one of these languages, please post to our forum at http://forum.languagetool.org.
* The uppercase sentence start rule (id `UPPERCASE_SENTENCE_START`) now ignores immunized tokens - this way users can add lowercase words to `disambiguation.xml` so the rule won't complain about these lowercase words at the beginning of a sentence.
* New `--json` option as an alternative to `--api` (deprecated XML output). See https://languagetool.org/http-api/swagger-ui/#/default for documentation of the new API.
* New rules for misused terms in EU publications (`MISUSED_TERMS_EU_PUBLICATIONS`).
* `Rule.getCorrectExamples()` now returns a list of `CorrectExample` objects instead of a list of Strings.
* The `--api` option has been deprecated - we recommend using LanguageTool in server mode (JSON API), which is faster as it has no start-up overhead for each call. See https://languagetool.org/http-api/swagger-ui/#/default for documentation of the new API.
* A new module `languagetool-http-client` has been added with a class `RemoteLanguageTool` that you can use to query a remote LanguageTool server via HTTP or HTTPS.
* `LanguageComboBox`
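A sketch of how `RemoteLanguageTool` might be used; the method name and signature are assumptions, so check the class's javadoc:

```java
import java.net.URL;
import org.languagetool.remote.RemoteLanguageTool;

// Minimal sketch with assumed method names: query a remote server
// and print what comes back.
public class RemoteCheck {
  public static void main(String[] args) throws Exception {
    RemoteLanguageTool lt = new RemoteLanguageTool(new URL("https://api.languagetool.org"));
    System.out.println(lt.check("This is an test.", "en-US")); // assumed signature
  }
}
```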
* The new HTTP API is available at `/v2/`. It is documented at https://languagetool.org/http-api/swagger-ui/#/default. Please do not use the old XML-based HTTP API anymore. Information about migrating from the old to the new API can be found at https://languagetool.org/http-api/migration.php
* Missing parameters (e.g. `text`) now cause a 400 Bad Request response (it used to produce 500 Internal Server Error).
* New `preferredVariants` parameter to specify which variant is preferred when the language is auto-detected. Example: `language=auto&preferredVariants=en-GB,de-AT` - if English text is detected, British English will be used; if German text is detected, German (Austria) will be used.
* `LanguageToolHttpHandler`
* `encoding` parameter
* Added `acceptPhrases(List<String> phrases)` to `SpellingCheckRule` so you can avoid false alarms on names and technical terms that consist of more than one word.
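For example, a minimal sketch; the phrase is made up, and `getAllActiveRules()` is used to find the speller rule:

```java
import java.util.Arrays;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.spelling.SpellingCheckRule;

// Minimal sketch: accept a multi-word product name so the speller
// stops flagging it.
public class AcceptPhrases {
  public static void main(String[] args) throws Exception {
    JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
    for (Rule rule : lt.getAllActiveRules()) {
      if (rule instanceof SpellingCheckRule) {
        ((SpellingCheckRule) rule).acceptPhrases(Arrays.asList("Foo Barizer"));
      }
    }
    System.out.println(lt.check("We use the Foo Barizer."));
  }
}
```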
* New parameters `enabledCategories` and `disabledCategories` that take a comma-separated list of categories to enable/disable. Fixes https://github.com/languagetool-org/languagetool/pull/326.
* The XML output now contains a `shortmsg` attribute if available, which is a short version of the `msg` attribute.
* The XML output now contains a `categoryid` attribute if available. It's not supposed to change in future versions (while `category` might change).
* New options `--enablecategories` and `--disablecategories` to activate/deactivate all rules in a category (https://github.com/languagetool-org/languagetool/issues/66).
* For large texts, the error position information (`fromx` and `tox`) could be wrong. Also, rules that work across paragraphs, like the German word coherency rule, wouldn't work. Both bugs have been fixed, but with the side effect that large files will now be loaded into memory completely. If you're using LanguageTool on large files (several MB), you might need to split these files now before you check them. If you need the old behavior, use the `--line-by-line` switch. (https://github.com/languagetool-org/languagetool/issues/254)
* Fixed an `IllegalArgumentException` for long sentences (https://github.com/languagetool-org/languagetool/issues/364).
* Fixed a `NullPointerException` for tokens containing soft hyphens that might be disambiguated.
* `vintrenes+F+sub:bes:plu:utr:gen/115,70,85,976,941,947`, `vinåndstermometrenes+F+sub:bes:plu:neu:gen/70,118,85,976`
* Added `en/removed.txt` so incorrect readings of the POS tagger can be avoided without rebuilding the binary dictionary (https://github.com/languagetool-org/languagetool/issues/306).
* Errors like "Ich gebe dir ein kleine Kaninchen." are now detected, where the determiner is indefinite but the adjective fits only a definite determiner.
* Added `de/removed.txt` so incorrect readings of the POS tagger can be avoided without rebuilding the binary dictionary.
* Rules can now use `<regexp>...</regexp>` as a simple alternative to `<pattern><token>...</token></pattern>`. Note that this is limited: e.g. it's not possible to address POS tags, and the `<suggestion>` cannot change the case of the match. Available attributes: `type` with value `smart` (treats a space in the regular expression as `\s+` or a non-breaking space) or `exact` (`smart` is the default), and `mark` to specify which part of the match gets underlined (everything by default; use `1` to only underline the first group, etc.).
* Non-breaking spaces (`\u00A0`) are now treated like regular spaces. Before, using a non-breaking space could cause a rule not to match.
* `<filter>` can now also be used in `disambiguation.xml`.
* `GeneralCatalan` has been removed, use `Catalan` instead.
* `SuggestionExtractorTool` and `SuggestionExtractor` have been removed.
* `ConfusionProbabilityRule` has been moved to package `org.languagetool.rules.ngrams`.
* `ConfusionProbabilityRule.getWordTokenizer()` is now called `ConfusionProbabilityRule.getGoogleStyleWordTokenizer()`.
* `RuleAsXmlSerializer` has been renamed to `RuleMatchAsXmlSerializer`.
* `StringTools.isWhitespace()` now returns `true` for a token that is a non-breaking space or a narrow non-breaking space.
* `RuleFilter` is not an interface anymore but an abstract class.
* The `LanguageModel` interface has been redesigned; see `BaseLanguageModel` for a class similar to the previous implementation.
* `BerkeleyLanguageModel` was added to support BerkeleyLM language models. See https://github.com/adampauls/berkeleylm for the software and e.g. http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/ for pre-built models. To use the new models, your language class needs to overwrite the `getLanguageModel(File)` method. For now, we recommend you continue using the Lucene-based models at http://languagetool.org/download/ngram-data/.
* New `FrenchPartialPosTagFilter`, used like this:

  ```
  <filter class="org.languagetool.rules.fr.FrenchPartialPosTagFilter"
          args="no:1 regexp:(.*)-tu postag_regexp:V.*(ind|con|sub).*2\ss negate_pos:yes"/>
  ```
* A `confusion_sets.txt` file has been added where word pairs can be added. See http://wiki.languagetool.org/finding-errors-using-n-gram-data for information on where to download the ngram data.
* The language model directory is expected to contain a subdirectory `en` or `de` with the `1grams`, `2grams`, and `3grams` directories (also see http://wiki.languagetool.org/finding-errors-using-n-gram-data).
* New config option `rulesFile` to use a `.languagetool.cfg` file to configure which options should be enabled/disabled in a server (https://github.com/languagetool-org/languagetool/pull/281).
* Rules can now implement `getAntiPatterns()` with patterns to be ignored. See the javadoc for details of what needs to be considered to make this work. See `org.languagetool.rules.de.CaseRule` for an example.
* The ngram-based confusion detection (see the `--languagemodel` option) has been rewritten, and `homophones.txt` has been renamed to `confusion_sets.txt`; it now only has a few items enabled by default, the rest is commented out to improve quality (fewer false alarms). Also see http://wiki.languagetool.org/finding-errors-using-big-data
* Fixed: `UppercaseSentenceStartRule` didn't properly reset its state, so that different errors could be found when e.g. `JLanguageTool.check()` got called twice with the same text.
* `Authenticator.setDefault()` is now only called if it's allowed by the Java security manager. In rare cases, this might affect using external XML rule files as documented at http://wiki.languagetool.org/tips-and-tricks#toc9 (Github issue #255).
* Initialization has been sped up for use cases that create a new `JLanguageTool` object for every check, as done by the embedded server (or by multithreaded LT users in general).
* Fixed a bug with the `--api` option that printed invalid XML for large documents or when the input was STDIN (Github issue #251).
* the `--api` option makes more sense
* Added `MultiThreadedJLanguageTool.shutdown()` to clean up the thread pool.
* `Language.REAL_LANGUAGES` is now `Languages.get()`.
* `Language.LANGUAGES` is now `Languages.getWithDemoLanguage()` - but you will probably want to use `Languages.get()`.
* Other static members of `Language` have also been moved to `Languages`.
* `Language.addExternalRuleFile()` and `Language.getExternalRuleFiles()` have been removed. To add rules, load them with `PatternRuleLoader` and call `JLanguageTool.addRule()`.
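A sketch of the replacement approach; the rule file name is made up, and `PatternRuleLoader.getRules()` is assumed to take a stream plus a name used in error messages:

```java
import java.io.InputStream;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.patterns.PatternRuleLoader;

// Minimal sketch: load external pattern rules and register them,
// replacing the removed Language.addExternalRuleFile().
public class LoadExternalRules {
  public static void main(String[] args) throws Exception {
    JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
    PatternRuleLoader loader = new PatternRuleLoader();
    try (InputStream is = LoadExternalRules.class.getResourceAsStream("/my-rules.xml")) {
      for (Rule rule : loader.getRules(is, "/my-rules.xml")) { // assumed signature
        lt.addRule(rule);
      }
    }
    System.out.println(lt.check("Some example text."));
  }
}
```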
* `getAllRules()`, `getAllActiveRules()`, and `getPatternRulesByIdAndSubId()` in class `JLanguageTool` used to call `reset()` for all rules. This is not the case anymore. `reset()` is now called when one of the `check()` methods is called. This shouldn't make a difference for common use cases.
* `Language.setName()` has been removed. If you need to set the name, override the `getName()` method instead.
* `Rule.getCorrectExamples()`/`getIncorrectExamples()`, `PatternToken.getOrGroup()`/`getAndGroup()`, and `RuleMatch.getSuggestedReplacements()` now return an unmodifiable list.
* `AbstractSimpleReplaceRule.getFileName()` and `AbstractWordCoherencyRule.getFileName()` have been removed; the subclasses are now themselves responsible for loading their data.
* Subclasses of `AbstractCompoundRule` are now responsible for loading the compound data themselves using `CompoundRuleData`.
* `AbstractCompoundRule.setShort(String)` has been removed and added as a constructor parameter instead.
* Fixed the `osl::Thread::Create failed` error message, see https://bugs.documentfoundation.org/show_bug.cgi?id=90740

See CHANGES.txt for changes before 2.9.1.