Using large amounts of data for machine learning requires careful preprocessing, which is fundamental to the accuracy of the results. For word-level text classification, every text must be preprocessed, i.e. split into sentences and then into individual words. The OpenNLP wrapper for Node.js is well suited for this preprocessing: it is configurable in detail, and it is released under the MIT License, so both private and commercial use are free.
Installation
npm install opennlp
OpenNLP offers the following methods to process / analyze text:
- Chunker
- Name Finder
- Part of Speech Tagging
- Sentence Detector
- Tokenizer
For this purpose, Apache provides a variety of trained models at http://opennlp.sourceforge.net/models-1.5/, which can be downloaded for different languages (for example English and German). Additionally, you need the opennlp-tools-1.6.0.jar, which can be downloaded here.
Configuration
In the following configuration, the OpenNLP wrapper is initialized with the German models for the Sentence Detector and the Tokenizer.
var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
The big advantage here is that only the models have to be exchanged for processing another language and no other changes have to be made.
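To illustrate, switching the setup to English only means pointing the two model entries at the English model files from the download page above (the variable name and the placeholder path below are made up for this example):

```javascript
// Same configuration as above, but with the English models swapped in.
// nlpPath is assumed to point at the directory holding the models and the jar.
var nlpPath = '/path/to/models';

var openNLPOptionsEnglish = {
    "models": {
        "tokenizer": nlpPath + '/en-token.bin',
        "sentenceDetector": nlpPath + '/en-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
```

Everything else, including all the calls shown below, stays exactly the same.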
Sentence Detector
The Sentence Detector splits the input text at every period and detects whether the period marks the end of a sentence or is part of a date or an abbreviation.
var openNLP = require("opennlp");

var sentence = 'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft ' +
    'Weltmeister. Das war der 4. Weltmeister Titel!';
var sentenceDetector = new openNLP().sentenceDetector;

sentenceDetector.sentDetect(sentence, function(err, results) {
    console.log(results);
    /*
    Output:
    [
        'Am 13. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.',
        'Das war der 4. Weltmeister Titel!'
    ]
    */
});
Tokenizer
The Tokenizer splits a sentence into individual words and punctuation marks. Punctuation marks are not discarded, but are recognized as tokens of their own.
var openNLP = require("opennlp");

var sentence = "Am 12. Juni 2014 wurde die deutsche Fußball Nationalmannschaft Weltmeister.";
var tokenizer = new openNLP().tokenizer;

tokenizer.tokenize(sentence, function(err, results) {
    console.log(results);
    /*
    Output:
    [
        'Am',
        '12',
        '.',
        'Juni',
        '2014',
        'wurde',
        'die',
        'deutsche',
        'Fußball',
        'Nationalmannschaft',
        'Weltmeister',
        '.'
    ]
    */
});
Optimization
For large amounts of text data, the OpenNLP wrapper processes text well, but not efficiently, because tokenization runs sequentially rather than in parallel. Since the duration of tokenization therefore grows linearly with the number of sentences, we needed a more efficient solution. We wrote our own minimalistic OpenNLP wrapper, which contains the Sentence Detector and the Tokenizer. It can be found at https://github.com/flore2003/opennlp-wrapper and is also released under the MIT License. Note that this wrapper is not fully compatible with the wrapper described above.
Installation
npm install opennlp-wrapper
The configuration, sentence splitting, and tokenization are slightly different from the original project, so a simple exchange of the library is not sufficient.
Configuration
The following configuration describes the OpenNLP wrapper we developed, initialized with the German models for the Sentence Detector and the Tokenizer. Note that the configuration has been slightly modified compared to the original library.
var OpenNLP = require('opennlp-wrapper');

var openNLPOptions = {
    "models": {
        "tokenizer": nlpPath + '/de-token.bin',
        "sentenceDetector": nlpPath + '/de-sent.bin'
    },
    "jarPath": nlpPath + "/opennlp-tools-1.6.0.jar"
};
var openNLP = new OpenNLP(openNLPOptions);
Sentence Detector
In the original library, sentenceDetector.sentDetect() was called on a separate sentence detector object. Here you simply call detectSentences() on the OpenNLP instance.
var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft ' +
    'Fußballweltmeister. Das war der 4. Weltmeister Titel!';

openNLP.detectSentences(input, function(err, sentences) {
    console.log(sentences);
});
Tokenizer
In the original library, tokenize() was called on a separate tokenizer object. Here, calling tokenize() directly on the OpenNLP instance is sufficient.
var input = 'Am 12. Juni 2014 wurde die Fußball Nationalmannschaft Fußballweltmeister.';

openNLP.tokenize(input, function(err, tokens) {
    console.log(tokens);
});
Remarks and tips
For the data to be suitable for a machine learning procedure such as Naive Bayes, further preprocessing steps are required. For processing German text we used a stemmer, which solves the following problem: consider the words "Kauf, Käufe, kaufen, kaufe", which all refer to the purchase of an item but are inflected differently while expressing the same thing. The stemmer reduces each word to its stem, which in this case is "kauf". If this procedure is applied to all words produced by tokenization, the number of distinct words shrinks considerably.
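The idea can be sketched with a deliberately simplified suffix-stripping function. This is a toy illustration only, not the Snowball algorithm; the function name and the suffix list are made up for this example:

```javascript
// Toy stemmer for illustration: lowercases, normalizes umlauts, and
// strips at most one common German inflection suffix (longest first).
function toyStem(word) {
    var stem = word.toLowerCase()
        .replace(/ä/g, 'a')
        .replace(/ö/g, 'o')
        .replace(/ü/g, 'u');
    var suffixes = ['en', 'er', 'e', 's'];
    for (var i = 0; i < suffixes.length; i++) {
        var suffix = suffixes[i];
        // only strip if at least three characters remain
        if (stem.length - suffix.length >= 3 && stem.endsWith(suffix)) {
            stem = stem.slice(0, stem.length - suffix.length);
            break;
        }
    }
    return stem;
}

console.log(['Kauf', 'Käufe', 'kaufen', 'kaufe'].map(toyStem));
// → [ 'kauf', 'kauf', 'kauf', 'kauf' ]
```

All four inflected forms collapse onto the single stem "kauf", which is exactly the vocabulary reduction described above.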
The Snowball Stemmer is highly recommended for this purpose: it is very easy to use and supports more than 13 languages.
After stemming all words, you should remove the stop words. In German these include, among other things, definite articles, indefinite articles, and conjunctions; in English, words such as 'a', 'of', 'the', 'I', 'it', 'you', and 'and' are stop words. A good way to find additional stop words is to sort all stemmed words by frequency in descending order, look at the first 100 to 200 entries of this list, and analyze which words are really meaningless and can be discarded.
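This frequency-based inspection can be sketched as follows; the stop-word list and the function name are illustrative, not part of any library:

```javascript
// Count tokens, drop known stop words, and return the most frequent
// remaining tokens so they can be inspected manually.
var STOP_WORDS = ['der', 'die', 'das', 'ein', 'eine', 'und'];

function mostFrequentTokens(tokens, n) {
    var counts = {};
    tokens.forEach(function (token) {
        var key = token.toLowerCase();
        if (STOP_WORDS.indexOf(key) === -1) {
            counts[key] = (counts[key] || 0) + 1;
        }
    });
    // sort distinct tokens by descending frequency and keep the top n
    return Object.keys(counts)
        .sort(function (a, b) { return counts[b] - counts[a]; })
        .slice(0, n);
}

var tokens = ['die', 'Mannschaft', 'wurde', 'Weltmeister', 'die',
              'Mannschaft', 'Weltmeister', 'Weltmeister'];
console.log(mostFrequentTokens(tokens, 2));
// → [ 'weltmeister', 'mannschaft' ]
```

Scanning the top of such a list makes it easy to spot further candidates that carry no meaning for classification and can be added to the stop-word list.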
In the next article, we will build on this knowledge and develop a Naive Bayes classifier, which is used for example in spam detection.