Natural language processing

Ing. Daniel Hládek PhD.

daniel.hladek@tuke.sk

Natural language is highly ambiguous

  • We can say the same thing in different ways
  • One statement can have many different meanings
  • We often transmit non-verbal information during communication:
    • Feelings
    • Gestures
    • Accent and style of speech

Homonyms:

    I'm sitting at school right now. I am not familiar with civil law.
    That car costs 10,000 euros. The car is standing on the side of the road.

Synonyms:

    I went to Bratislava. I went to Blava.

Free word order in a sentence:

    Today is a nice day. It's a nice day today. The day is nice today.

Neologisms and slang terms:

Google it and then post it on fb.

Emotions and social conventions:

    Sir! You did a great job!

Typos:

    See teh lecture.

Computer language is unambiguous.

We need methods for working with uncertainty.

There is a growing need to process large amounts of human-generated text or speech.

Natural Language Processing (NLP)

A combination of several techniques from the field of:

  • Machine learning
  • Linguistics
  • Theory of formal languages
  • Statistics
  • Psychology

Natural language processing helps with everyday activities by acquiring knowledge

data => information => knowledge

text => features => findings

Knowledge is useful information

(can be converted into money).

Typical NLP tasks

Your everyday Google, Facebook, Apple

Some Google NLP services:

  • Question answering
  • Full text search
  • Advertising targeting
  • Machine translation

Some Facebook NLP services:

  • Sentiment evaluation (for ad targeting)
  • Hate speech detection
  • Spam detection

Some Apple NLP services:

  • Siri assistant

Working with uncertainty in NLP

  • Classification of contexts or their sequences
  • Rewriting the sequence of symbols

Classification of contexts

Mapping:

    C => S

  • C: context: sentence, document
  • S: symbol: some knowledge about the context: morphological tag, lemma, clause...

Tokenization

The process of identifying atomic units of meaning:

  • punctuation
  • words
  • subword units
  • letters, phones
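A minimal sketch of a word-level tokenizer, using a regular expression that separates words from punctuation (real toolkits handle many more cases, such as abbreviations and numbers):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (a simple regex sketch)."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(tokenize("Today is a nice day."))
# ['Today', 'is', 'a', 'nice', 'day', '.']
```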

Feature function

Classification is easier if we know which parts of the context are important for the decision.

Feature function

A binary function of the context that is true only if a given feature occurs in the context. A suitable set of feature functions helps us solve the problem.

  • Word
  • Ending, Root of the word
  • Previous word, Next word
  • First letter type
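The feature types above can be sketched as a function that extracts them for one word in a tokenized sentence; the feature names and the three-letter suffix are illustrative choices, not a fixed standard:

```python
def features(words, i):
    """Extract features for the word at position i in a tokenized sentence."""
    w = words[i]
    return {
        "word": w.lower(),
        "suffix": w[-3:],                               # crude stand-in for the word ending
        "prev": words[i - 1].lower() if i > 0 else "<s>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "first_upper": w[0].isupper(),                  # first letter type
    }

print(features(["I", "went", "to", "Bratislava"], 3))
```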

Feature function

Mapping

    Symbol => unit vector

    today => 0000100000
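This mapping is one-hot encoding: each symbol gets a unit vector with a single 1 at its vocabulary index. A minimal sketch over a toy vocabulary:

```python
vocab = ["a", "day", "is", "it", "nice", "today"]  # tiny illustrative vocabulary

def one_hot(word, vocab):
    """Map a symbol to a unit (one-hot) vector over the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("today", vocab))  # [0, 0, 0, 0, 0, 1]
```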

Classifier of contexts

Feature extraction, classification

    context => feature vector => class
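The pipeline can be sketched with hand-written rules standing in for a trained classifier; the feature choices and class labels here are invented for illustration:

```python
def extract(word):
    # feature vector: [ends with "a"?, first letter upper-case?, all digits?]
    return [word.endswith("a"), word[0].isupper(), word.isdigit()]

def classify(vec):
    # toy decision rules standing in for a trained statistical model
    if vec[2]:
        return "NUMBER"
    if vec[1]:
        return "PROPER_NOUN"
    return "OTHER"

for w in ["Bratislava", "10000", "car"]:
    print(w, classify(extract(w)))
```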

Classifier of contexts

  • Human knowledge in the form of rules
  • Statistical information from training corpora
  • A combination of both approaches

Rules

  • Dictionaries
  • Formal grammar
  • Regular expressions
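Two of these rule types can be sketched directly: a toy lemma dictionary and a regular expression that recognizes prices (both invented for illustration):

```python
import re

# tiny dictionary mapping word forms to lemmas
LEMMAS = {"went": "go", "cars": "car"}

# regular-expression rule matching prices such as "10,000 euros"
PRICE = re.compile(r"\d[\d,.]*\s*euros?")

print(LEMMAS.get("went", "went"))                          # go
print(PRICE.search("That car costs 10,000 euros.").group())
```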

Statistical approaches

  • Hidden Markov Models
  • N-gram model
  • Support Vector Machine
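Of these, the n-gram model is the simplest to sketch: it estimates the probability of a word from counts of short sequences in a training corpus. A toy bigram example:

```python
from collections import Counter

corpus = "today is a nice day . it is a nice day today .".split()

# bigram and unigram counts: the core of an n-gram model
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# maximum-likelihood estimate of P(nice | a)
p = bigrams[("a", "nice")] / unigrams["a"]
print(p)  # 1.0 -- in this toy corpus, "a" is always followed by "nice"
```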

Deep neural networks

  • LSTM, Convolutional networks, Transformers

Computationally demanding

Rewriting the sequence of symbols

Mapping:

    sequence => another sequence

Rewriting the sequence of symbols

  • machine translation
  • correction of typos and grammar
  • dialogue systems

Encoder-Decoder

Encoder:

symbols => features => meaning vector

Decoder:

model and meaning vector => output symbols
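A conceptual sketch only: real encoder-decoder models use deep networks (LSTMs, Transformers). Here the "meaning vector" is just an average of one-hot vectors and the "decoder" is a nearest-neighbour lookup over a one-entry table, all invented for illustration:

```python
VOCAB = ["i", "went", "to", "bratislava", "blava"]

def encode(tokens):
    """Encoder: average the one-hot vectors of the input symbols."""
    vec = [0.0] * len(VOCAB)
    for t in tokens:
        vec[VOCAB.index(t)] += 1.0 / len(tokens)
    return vec

# toy "model": known input sequences mapped to output sequences
KNOWN = {("i", "went", "to", "blava"): ["i", "went", "to", "bratislava"]}

def decode(vec):
    """Decoder: emit the stored output whose input encoding is closest to vec."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(KNOWN, key=lambda k: dist(encode(list(k)), vec))
    return KNOWN[best]

print(decode(encode(["i", "went", "to", "blava"])))
```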

Encoder-Decoder

Deep neural networks

You too can do NLP

General programming language

Python

General libraries for machine learning

  • Keras
  • PyTorch

General libraries for NLP

  • spaCy
  • Flair

Machine translation

  • fairseq

Extraction of semantic features

  • heads
  • fasttext
  • word2vec

Information retrieval

Indexing and log processing

Elasticsearch

Dialogue systems and language comprehension

Rasa

Bibliography

  • Jurafsky, Martin: Speech and Language Processing
  • Christopher Manning: Natural Language Processing, Stanford University online video lectures
