Using large amounts of data for machine learning requires careful preprocessing, which is fundamental to the accuracy of the results. For word-level text classification, every text must be preprocessed, i.e. split into sentences and then into individual words. The OpenNLP wrapper for Node.js is well suited for this preprocessing: it is configurable in detail, and it is released under the MIT License, so both private and commercial use are free.
Installation
npm install opennlp
OpenNLP offers the following methods to process / analyze text:
- Chunker
- Name Finder
- Part of Speech Tagging
- Sentence Detector
- Tokenizer
For this purpose, Apache provides a variety of trained models at http://opennlp.sourceforge.net/models-1.5/, which can be downloaded for different languages (for example English and German). Additionally, you need the opennlp-tools-1.6.0.jar, which can be downloaded here.
Configuration
In the following configuration, the OpenNLP wrapper is initialized with the German models for the Sentence Detector and the Tokenizer.
var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
The big advantage here is that only the models have to be exchanged for processing another language and no other changes have to be made.
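To illustrate, switching the setup to English only means pointing the two model entries at the English model files from the download page above (the variable name and the placeholder path below are made up for this example):

```javascript
// Same configuration as above, but with the English models swapped in.
// nlpPath is assumed to point at the directory holding the models and the jar.
var nlpPath = '/path/to/models';

var openNLPOptionsEnglish = {
    "models": {
        "tokenizer": nlpPath + '/en-token.bin',
        "sentenceDetector": nlpPath + '/en-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
```

Everything else, including all the calls shown below, stays exactly the same.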
Sentence Detector
The Sentence Detector splits the input text at every period and detects whether the period marks the end of a sentence or is part of a date or an abbreviation.
var openNLP = require("opennlp");

var sentence = 'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft ' +
    'Weltmeister. Das war der 4. Weltmeister Titel!';
var sentenceDetector = new openNLP().sentenceDetector;

sentenceDetector.sentDetect(sentence, function(err, results) {
    console.log(results);
    /*
    Output:
    [
        'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.',
        'Das war der 4. Weltmeister Titel!'
    ]
    */
});
Tokenizer
The Tokenizer splits a sentence into individual words and punctuation marks. Punctuation marks are not discarded, but are recognized as tokens of their own.
var openNLP = require("opennlp");

var sentence = "Am 12. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.";
var tokenizer = new openNLP().tokenizer;

tokenizer.tokenize(sentence, function(err, results) {
    console.log(results);
    /*
    Output:
    [
        'Am',
        '12',
        '.',
        'Juni',
        '2014',
        'wurde',
        'die',
        'deutsche',
        'Fußball',
        'Nationalmannschaft',
        'Weltmeister',
        '.'
    ]
    */
});
Optimization
For large amounts of text data, the OpenNLP wrapper processes text well, but not efficiently, because tokenization runs sequentially rather than in parallel. Since the duration of tokenization therefore grows linearly with the number of sentences, we needed a more efficient solution. We wrote our own minimalistic OpenNLP wrapper, which contains the Sentence Detector and the Tokenizer. It can be found at https://github.com/flore2003/opennlp-wrapper and is also released under the MIT License. Note that this wrapper is not fully compatible with the wrapper described above.
Installation
npm install opennlp-wrapper
The configuration, sentence splitting, and tokenization are slightly different from the original project, so a simple exchange of the library is not sufficient.
Configuration
The following configuration describes the OpenNLP wrapper we developed, initialized with the German models for the Sentence Detector and the Tokenizer. Note that the configuration has been slightly modified compared to the original library.
var OpenNLP = require('opennlp-wrapper');

var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
var openNLP = new OpenNLP(openNLPOptions);
Sentence Detector
In the original library, sentenceDetector.sentDetect() was called on a separate sentence detector object. Here you simply call detectSentences() on the OpenNLP instance.
var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft ' +
    'Fußballweltmeister. Das war der 4. Weltmeister Titel!';

openNLP.detectSentences(input, function(err, sentences) {
    console.log(sentences);
});
Tokenizer
In the original library, tokenize() was called on a separate tokenizer object. Here, calling tokenize() directly on the OpenNLP instance is sufficient.
var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft Fußballweltmeister.';

openNLP.tokenize(input, function(err, tokens) {
    console.log(tokens);
});
Remarks and tips
For the data to be suitable for a machine learning procedure such as Naive Bayes, further preprocessing steps are required. For processing German text we used a stemmer, which solves the following problem: consider the words "Kauf, Käufe, kaufen, kaufe", which all refer to the purchase of an item but are inflected differently while expressing the same thing. The stemmer reduces each word to its stem, which in this case is "kauf". If this procedure is applied to all words produced by tokenization, the number of distinct words shrinks considerably.
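The idea can be sketched with a deliberately simplified suffix-stripping function. This is a toy illustration only, not the Snowball algorithm; the function name and the suffix list are made up for this example:

```javascript
// Toy stemmer for illustration: lowercases, normalizes umlauts, and
// strips at most one common German inflection suffix (longest first).
function toyStem(word) {
    var stem = word.toLowerCase()
        .replace(/ä/g, 'a')
        .replace(/ö/g, 'o')
        .replace(/ü/g, 'u');
    var suffixes = ['en', 'er', 'e', 's'];
    for (var i = 0; i < suffixes.length; i++) {
        var suffix = suffixes[i];
        // only strip if at least three characters remain
        if (stem.length - suffix.length >= 3 && stem.endsWith(suffix)) {
            stem = stem.slice(0, stem.length - suffix.length);
            break;
        }
    }
    return stem;
}

console.log(['Kauf', 'Käufe', 'kaufen', 'kaufe'].map(toyStem));
// → [ 'kauf', 'kauf', 'kauf', 'kauf' ]
```

All four inflected forms collapse onto the single stem "kauf", which is exactly the vocabulary reduction described above.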
The Snowball Stemmer is highly recommended for this purpose: it is very easy to use and supports more than 13 languages.
After stemming all words, you should remove the stop words. In German these include, among other things, definite articles, indefinite articles, and conjunctions; in English, words such as 'a', 'of', 'the', 'I', 'it', 'you', and 'and' are stop words. A good way to find additional stop words is to sort all stemmed words by frequency in descending order, look at the first 100 to 200 entries of this list, and analyze which words are really meaningless and can be discarded.
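This frequency-based inspection can be sketched as follows; the stop-word list and the function name are illustrative, not part of any library:

```javascript
// Count tokens, drop known stop words, and return the most frequent
// remaining tokens so they can be inspected manually.
var STOP_WORDS = ['der', 'die', 'das', 'ein', 'eine', 'und'];

function mostFrequentTokens(tokens, n) {
    var counts = {};
    tokens.forEach(function (token) {
        var key = token.toLowerCase();
        if (STOP_WORDS.indexOf(key) === -1) {
            counts[key] = (counts[key] || 0) + 1;
        }
    });
    // sort distinct tokens by descending frequency and keep the top n
    return Object.keys(counts)
        .sort(function (a, b) { return counts[b] - counts[a]; })
        .slice(0, n);
}

var tokens = ['die', 'Mannschaft', 'wurde', 'Weltmeister', 'die',
              'Mannschaft', 'Weltmeister', 'Weltmeister'];
console.log(mostFrequentTokens(tokens, 2));
// → [ 'weltmeister', 'mannschaft' ]
```

Scanning the top of such a list makes it easy to spot further candidates that carry no meaning for classification and can be added to the stop-word list.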
In the next article, we will build on this knowledge and develop a Naive Bayes classifier, which is used for example in spam detection.