Preprocessing of Text for Machine Learning in node.js


Machine learning on large amounts of text data requires careful preprocessing, which is fundamental for the accuracy of the results. For text classification at word level, every text must first be preprocessed, i.e. split into sentences and then into individual words. The OpenNLP wrapper for Node.js is well suited for this preprocessing: it is configurable in detail and released under the MIT License, so both private and commercial use are free.

Installation

npm install opennlp

OpenNLP offers the following methods to process / analyze text:

  • Chunker
  • Name Finder
  • Part-of-Speech Tagging
  • Sentence Detector
  • Tokenizer

For this purpose, Apache provides a variety of pretrained models for different languages (for example English and German) at http://opennlp.sourceforge.net/models-1.5/, which can be downloaded and integrated. Additionally, you need the opennlp-tools-1.6.0.jar, which can be downloaded from the Apache OpenNLP website.

Configuration

In the following configuration, the OpenNLP wrapper is initialized with the German models for the Sentence Detector and the Tokenizer.

// nlpPath points to the directory where you stored the models
// and the jar (adjust this path to your setup)
var nlpPath = __dirname + '/nlp';

var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};

The big advantage here is that, to process another language, only the models have to be exchanged; no other changes are required, as the following sketch shows.
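For example, to process English text instead, you only point the configuration at the corresponding English models from the page linked above (the file names below follow its naming scheme):

var openNLPOptions = {
    "models": {
        // English models instead of the German ones
        "tokenizer": nlpPath + '/en-token.bin',
        "sentenceDetector": nlpPath + '/en-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};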

Sentence Detector

The Sentence Detector examines each period in the input text and decides whether it marks the end of a sentence or belongs to a date or an abbreviation.

var openNLP = require("opennlp");
var sentence = 'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft ' + 
               'Weltmeister. Das war der 4. Weltmeister Titel!';
var sentenceDetector = new openNLP().sentenceDetector;
sentenceDetector.sentDetect(sentence, function(err, results) {
    console.log(results)
    /*
    Output:
    [
        'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.',
        'Das war der 4. Weltmeister Titel!'
    ]
    */
});

Tokenizer

The Tokenizer splits the sentence into individual words and punctuation marks. Punctuation marks are not discarded but are recognized as tokens of their own.

var openNLP = require("opennlp");
var sentence = "Am 12. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.";
var tokenizer = new openNLP().tokenizer;
tokenizer.tokenize(sentence, function(err, results) {
    console.log(results);
    /*
    Output: [ 
                'Am',
                '12',
                '.',
                'Juni',
                '2014',
                'wurde',
                'die',
                'deutsche',
                'Fußball',
                'Nationalmannschaft',
                'Weltmeister',
                '.' 
            ]
    */
});

Optimization

For large amounts of text data, the OpenNLP wrapper processes the text very well, but it does not work efficiently, because the tokenization is sequential rather than parallel. Since the duration of the tokenization therefore increases linearly with the number of sentences, we had to find a more efficient solution. We wrote our own minimalistic OpenNLP wrapper, which contains only the sentence detector and the tokenizer. It can be found at https://github.com/flore2003/opennlp-wrapper and is also released under the MIT License. Note that this wrapper is not fully compatible with the one described above.

Installation

npm install opennlp-wrapper

The configuration, sentence splitting, and tokenization are slightly different from the original project, so a simple exchange of the library is not sufficient.

Configuration

The following configuration describes the OpenNLP wrapper we developed, initialized with the German models for the Sentence Detector and the Tokenizer. Note that the configuration has been slightly modified compared to the original library.

// load the minimalistic wrapper (installed via npm install opennlp-wrapper)
var OpenNLP = require('opennlp-wrapper');

var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
var openNLP = new OpenNLP(openNLPOptions);

Sentence Detector

In the original library, sentDetect() was called on a separate sentenceDetector object. Here you simply call detectSentences() directly on the OpenNLP instance.

var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft ' + 
            'Fußballweltmeister. Das war der 4. Weltmeister Titel!';

openNLP.detectSentences(input, function(err, sentences) {
    console.log(sentences);
});

Tokenizer

In the original library, tokenize() was called on a separate tokenizer object. Here a single call to tokenize() on the OpenNLP instance is sufficient.

var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft Fußballweltmeister.';

openNLP.tokenize(input, function(err, tokens) {
    console.log(tokens);
});
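Both calls can be combined into a small preprocessing pipeline. The following is a minimal sketch using only the two methods shown above; error handling is kept deliberately simple:

var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft ' +
            'Fußballweltmeister. Das war der 4. Weltmeister Titel!';

// First split the text into sentences, then tokenize each sentence.
openNLP.detectSentences(input, function(err, sentences) {
    if (err) throw err;
    sentences.forEach(function(sentence) {
        openNLP.tokenize(sentence, function(err, tokens) {
            if (err) throw err;
            console.log(tokens);
        });
    });
});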

Remarks and tips

In order for the data to be suitable for a machine learning procedure such as Naive Bayes, further preprocessing steps are required. For processing German text we used a stemmer, which solves the following problem: consider the words “Kauf, Käufe, kaufen, kaufe”, which all refer to the purchase of an item but are inflected differently while saying essentially the same thing. The stemmer reduces each word to its stem, which in this case is “kauf”. If this procedure is applied to all words produced by the tokenizer, the number of distinct words shrinks considerably.

The Snowball Stemmer is highly recommended for this purpose: it is not only very easy to use but also supports more than a dozen languages.
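As a sketch, this is how the snowball package on npm can be used; the API shown here (setCurrent/stem/getCurrent) is our assumption about that package, so check its documentation before relying on it:

var Snowball = require('snowball'); // npm install snowball

// Reduce each word to its stem with the German Snowball stemmer.
var stemmer = new Snowball('German');
['kauf', 'käufe', 'kaufen', 'kaufe'].forEach(function(word) {
    stemmer.setCurrent(word);
    stemmer.stem();
    console.log(word + ' -> ' + stemmer.getCurrent());
});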

After you have stemmed all words, you should remove the stop words. In German these include, among other things, definite articles, indefinite articles, and conjunctions; in English, words such as ‘a’, ‘of’, ‘the’, ‘I’, ‘it’, ‘you’, and ‘and’ belong to the stop words. A very good way to find suitable stop words is to sort all words after stemming in descending order of frequency, look at the first 100 to 200 entries of this list, and decide which of them carry no real meaning and can be discarded.
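A minimal sketch of this frequency analysis; stemmedTokens is assumed to be the flat list of stemmed tokens produced by the steps above:

// Count how often each stem occurs.
var counts = {};
stemmedTokens.forEach(function(token) {
    counts[token] = (counts[token] || 0) + 1;
});

// Print the 200 most frequent stems as stop-word candidates.
Object.keys(counts)
    .sort(function(a, b) { return counts[b] - counts[a]; })
    .slice(0, 200)
    .forEach(function(token) {
        console.log(token + ': ' + counts[token]);
    });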

In the next article, we will build on this knowledge and develop a Naive Bayes classifier, which is used for example in spam detection.
