CS-EJ3311 - Deep Learning with Python: Introduction to Natural Language Processing

Academic year: 2021

(1)CS-EJ3311 - Deep Learning with Python, 09.09.2021-18.12.2021 Introduction to Natural Language Processing. Aalto University (Espoo, Finland) fitech.io (Finland).

(2) Overview. What is Natural Language Processing? NLP tasks. Python libraries for NLP. Text representation: text preprocessing, text representation. Deep learning: word vectors (embeddings), Word2Vec, Keras embedding layer.

(3) Natural Language Processing. Definition: Natural Language Processing (NLP) is an area of Computer Science dealing with methods to analyze, model, and understand human language.

(4) Speech recognition (or speech-to-text). Speech recognition is a capability that enables a program to process human speech into a written format. While it’s commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one, whereas voice recognition just seeks to identify an individual user’s voice (Speech Recognition, IBM). Figure source: Nvidia Developer Website.

(5) Part-of-speech tagging. The process of determining the part of speech of a particular word or piece of text based on its use and context.

(6) Named entity recognition. Named entities are sets of elements that are relevant to understanding a text. Named Entity Recognition (NER) is the process of finding entities that can be put under categories like names, organizations, locations, quantities, monetary values, percentages, etc.

(7) Sentiment analysis. Figure source: 5 Things You Need to Know about Sentiment Analysis and Classification.

(8) Natural language generation. This is the task of turning structured information into human language. Chatbots are good examples of conversational systems whose final goal is to generate text that sounds human-produced. Image designed by pch.vector / Freepik.

(9) Natural Language Toolkit (NLTK). A suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

(10) spaCy. It is described as a production-ready training system with support for 64+ languages and integration with deep learning frameworks like PyTorch and TensorFlow.

(11) Gensim. It is oriented toward topic modeling. It offers implementations of Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel), etc.

(12) Hugging Face. Hugging Face is "the AI community building the future". Its mission is to democratize NLP and make models accessible. It provides resources like datasets, tokenizers, and transformers to perform NLP tasks such as sentiment analysis, coreference resolution, question answering, and chatbots.

(13) Keras. It provides methods for designing and training an artificial neural network (ANN) in a few lines of Python code. It was originally implemented as a wrapper around popular deep learning frameworks such as TensorFlow, Theano, and CNTK.

(14) Text Preprocessing.
1. Remove HTML tags
2. Convert accented characters to ASCII characters
3. Expand contractions
4. Remove special characters
5. Lowercase all texts
6. Convert number words to numeric form
7. Tokenization
8. Remove numbers
9. Remove stopwords
10. Lemmatization or Stemming
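The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's implementation: the lookup tables `CONTRACTIONS`, `NUMBER_WORDS`, and `STOPWORDS` are hypothetical and deliberately tiny, lowercasing is done before the lookups so that they match, and step 10 (lemmatization or stemming) is left out.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables; real projects use much larger ones.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}
STOPWORDS = {"the", "a", "is", "it", "do", "not", "and"}


def preprocess(text):
    """Apply steps 1-9 of the list above and return a list of tokens."""
    text = re.sub(r"<[^>]+>", " ", text)                    # 1. remove HTML tags
    text = html.unescape(text)                              #    decode entities such as &amp;
    text = unicodedata.normalize("NFKD", text)              # 2. split accents from letters...
    text = text.encode("ascii", "ignore").decode("ascii")   #    ...and drop non-ASCII characters
    text = text.lower()                                     # 5. lowercase (early, so lookups match)
    for short, full in CONTRACTIONS.items():                # 3. expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # 4. remove special characters
    tokens = text.split()                                   # 7. whitespace tokenization
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]       # 6. number words -> digits
    tokens = [t for t in tokens if not t.isdigit()]         # 8. remove numbers
    tokens = [t for t in tokens if t not in STOPWORDS]      # 9. remove stopwords
    return tokens


tokens = preprocess("<p>It's a Café, don't pay two euros!</p>")
```

The order of the steps matters in practice (for example, contractions must be expanded before apostrophes are stripped as special characters), so the numbering on the slide is best read as a checklist rather than a fixed sequence.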

(15) Text Preprocessing. Tokenization. “Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. These tokens are often loosely referred to as terms or words, but they could be words, numbers, acronyms, word roots, or fixed-length character strings.” Tokenization (Stanford NLP).

(16) Text Preprocessing. [Figure not recoverable from the extracted text.]

(17) Text Preprocessing. Stopwords. “... (set of) extremely common words which would appear to be of little value in helping select documents matching a user need. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.” Dropping common terms: stop words (Stanford NLP).
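The frequency-based strategy in the quotation can be sketched as follows. This is an illustrative example under simple assumptions: the function name `build_stop_list` and the three toy documents are made up, tokens are split on whitespace, and the hand-filtering step mentioned in the quotation is omitted.

```python
from collections import Counter


def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(size)]


# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a cat and a dog in a house",
]
stop_list = build_stop_list(docs, 2)

# Discard stop-list members from every document, as done during indexing.
filtered = [[t for t in d.split() if t not in stop_list] for d in docs]
```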

(18) Text Preprocessing. Stopwords. [Figure panels: Stopwords Finnish (FI), Stopwords Swedish (SV), Stopwords Spanish (ES), Stopwords English (EN), Stopwords Norwegian (NO).]

(19) Text Preprocessing. Inflected Language. “In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change.” Inflection (Wikipedia).

(20) Text Preprocessing. Inflected Language. Figure source: Stemming and Lemmatization in Python (DataCamp).

(21) Text Preprocessing. Stemming. “Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.” Stemming and Lemmatization in Python (DataCamp). Examples: cats → cat; misunderstanding → misunderstand; troubling → troubl; destabilize → destabil.
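The idea of stemming can be illustrated with a toy suffix stripper. This is not the Porter algorithm or any library's stemmer: the suffix list `SUFFIXES` and the minimum stem length are assumptions chosen only so that the four slide examples come out as shown. For real work, NLTK ships proper stemmers such as `PorterStemmer`.

```python
# Hypothetical suffix list, far from complete; checked in this order.
SUFFIXES = ("ing", "ize", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["cats", "misunderstanding", "troubling", "destabilize"]
stems = [toy_stem(w) for w in words]
```

Note that "troubl" and "destabil" are not English words, which is exactly the property of stemming that the quotation points out.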

(22) Text Preprocessing. Lemmatization. “Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.” Stemming and Lemmatization in Python (DataCamp). Examples: runs, running, ran → run; habit → habit; eating → eat.
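A lemmatizer needs dictionary knowledge rather than suffix rules, which a small lookup table can illustrate. The table `LEMMAS` below is a hypothetical stand-in for a real lexical resource and only covers the slide's examples; unknown words are returned unchanged.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"runs": "run", "running": "run", "ran": "run", "eating": "eat"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


words = ["runs", "running", "ran", "habit", "eating"]
lemmas = [toy_lemmatize(w) for w in words]
```

The irregular form "ran" shows the difference from stemming: no suffix rule maps it to "run", but a dictionary lookup does.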

(23) Text Representation. Bag-of-Words. This model represents a document as a vector over a vocabulary $V = [w_1, w_2, \ldots, w_{|V|}]$, where the $i$-th component of the vector is the frequency of the $i$-th vocabulary word in the document.
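A minimal sketch of this representation, assuming whitespace tokenization and an alphabetically sorted vocabulary (both are choices made for the example; the helper names and the two toy documents are invented):

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the distinct tokens of all documents in sorted order."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["aalto is a bold university", "art is art"]
vocab = build_vocabulary(docs)
X = [bag_of_words(d, vocab) for d in docs]
```

Each row of `X` has length `len(vocab)`, and the entry for "art" in the second row is 2 because the component stores a frequency, not just presence.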

(24) Text Representation. Bag-of-Words. $|V| = 20$.

(25) Text Representation. Bag-of-Words. With $|V| = 20$, each document is represented by a vector of 20 components, which is the length of the vocabulary. Such vectors are typically sparse, which means that most of their components are zero.

X = [
    [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
]

(26) Text Representation. Bag-of-Words.

(27) Text Representation. Problems of Bag-of-Words. The meaning and the structure of documents cannot be expressed. Each word is treated as independent of the others, so word sequences and any other type of relationship between words cannot be expressed. If two documents have similar meanings but different vocabularies, their vectors share few non-zero components, so calculating the similarity between them is difficult.
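Two of these problems can be made concrete with cosine similarity on count vectors. The sketch below uses invented example sentences and plain Python lists; `doc_a` and `doc_b` mean roughly the same thing but share no words, while `doc_c` is a scrambled copy of `doc_a`.

```python
import math
from collections import Counter


def bow(document, vocabulary):
    """Bag-of-words count vector of a document over the given vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc_a = "the film was great"
doc_b = "a superb movie"       # similar meaning, different vocabulary
doc_c = "great film the was"   # same words as doc_a, scrambled order

vocab = sorted(set(" ".join([doc_a, doc_b, doc_c]).split()))
sim_ab = cosine_similarity(bow(doc_a, vocab), bow(doc_b, vocab))
sim_ac = cosine_similarity(bow(doc_a, vocab), bow(doc_c, vocab))
```

Here `sim_ab` is zero although the two sentences are near-paraphrases, and `sim_ac` is one although the second sentence is not grammatical, because the representation ignores both word meaning and word order.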

(28) Text Representation. Problems of Bag-of-Words.

words = {
    'aalto':      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'art':        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'bold':       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'is':         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ...
    'university': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
}

(29) Word vectors (embeddings). Word vectors are numerical vector representations of word semantics or meaning, including literal and implied meaning. So word vectors can capture the connotation of words, like "peopleness", "animalness", "placeness", "thingness", and even "conceptness". And they combine all that into a dense vector (no zeros) of floating-point values. This dense vector enables queries and logical reasoning. Hapke, H., 2019. Natural Language Processing in Action: Understanding, analyzing, and generating text with Python.

(30) Word vectors (embeddings). Word similarity: Embedding Projector. Word analogies:

$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$
$\vec{\text{Helsinki}} - \vec{\text{Finland}} + \vec{\text{Sweden}} \approx \vec{\text{Stockholm}}$
$\vec{\text{biggest}} - \vec{\text{big}} + \vec{\text{short}} \approx \vec{\text{shortest}}$
$\vec{\text{Asia}} - \vec{\text{China}} + \vec{\text{Spain}} \approx \vec{\text{Europe}}$

You can read these relations as: "Finland is to Helsinki as Sweden is to Stockholm". In each case the offset between the first pair of words is added to a third word, and the result lies close to the fourth.
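The arithmetic behind such analogies can be sketched with hand-made vectors. The two-dimensional vectors below are invented for illustration (the axes can be read as "royalty" and "gender"); they are not learned embeddings, and real systems usually rank candidates by cosine similarity rather than the squared Euclidean distance used here.

```python
# Hand-made 2-D vectors (royalty, gender); NOT learned embeddings.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.5, 1.0),
}


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(
        candidates,
        key=lambda w: sum((p - q) ** 2 for p, q in zip(vectors[w], target)),
    )


result = analogy("king", "man", "woman")
```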

(31) Word2Vec. Word2Vec was developed by Tomas Mikolov and colleagues at Google in 2013. The goal is to create a model that can learn high-quality word vectors from huge data sets, typically containing billions of words and a vocabulary of millions of unique words. This is achieved with the basic task of predicting which words occur in the context of other words.

(32) Word2Vec.
1. Continuous Bag-of-Words: Given its context, the goal is to predict the center word. It is faster than Skip-gram and has better representations for more frequent words.
2. Continuous Skip-gram: Given the center word, the goal is to predict the context. This method works well with a small amount of data and is found to represent rare words well.
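The difference between the two variants shows up in how training examples are cut from a sentence. The sketch below only generates the examples and does not train a model; the function names, the window size, and the sample sentence are assumptions made for the illustration.

```python
def cbow_examples(tokens, window):
    """One (context words, center word) example per position in the sentence."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, center))
    return examples


def skipgram_examples(tokens, window):
    """One (center word, context word) example per word in each context."""
    return [
        (center, context_word)
        for context, center in cbow_examples(tokens, window)
        for context_word in context
    ]


tokens = "aalto is a bold university".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

CBOW yields one example per position, with the whole context as input, while Skip-gram yields one example per (center, context) pair, which is why it produces more training examples from the same text.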

(33) Word2Vec.

(34) Text vectorization.

(35) Embedding matrix. The embedding matrix is a simple NumPy matrix where the entry at index i is the pre-trained vector for the word with index i in our vectorizer’s vocabulary. Figure source: RNN W2L04: Embedding matrix.
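Building such a matrix amounts to copying each known word's pre-trained vector into the row given by its vocabulary index. The sketch below uses plain Python lists instead of NumPy to stay dependency-free, and both `pretrained` and `vocabulary` are hypothetical stand-ins (a real setup would load vectors from a file and take the vocabulary from the vectorizer). Words without a pre-trained vector keep an all-zero row, which is one common choice rather than the only one.

```python
# Hypothetical stand-ins for a pre-trained vector file and a vectorizer vocabulary.
pretrained = {"aalto": [0.1, 0.2, 0.3], "university": [0.4, 0.5, 0.6]}
vocabulary = ["", "[UNK]", "aalto", "bold", "university"]  # list position = word index


def build_embedding_matrix(vocabulary, pretrained, dim):
    """Row i holds the pre-trained vector of word i, or zeros if none exists."""
    matrix = [[0.0] * dim for _ in vocabulary]
    for index, word in enumerate(vocabulary):
        vector = pretrained.get(word)
        if vector is not None:
            matrix[index] = list(vector)
    return matrix


embedding_matrix = build_embedding_matrix(vocabulary, pretrained, dim=3)
```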

(36) Embedding layer. The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

(37) Embedding layer.

(38) Embedding layer. Input shape: 2D tensor with shape (batch_size, input_length). Output shape: 3D tensor with shape (batch_size, input_length, output_dim).
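The lookup-table view and the shapes above can be checked with a plain-Python sketch. This is not the Keras layer itself, only an illustration of what it computes in the forward pass; the matrix values and the batch are invented, and nested lists stand in for tensors.

```python
def embed(batch, embedding_matrix):
    """Replace every integer index in the batch by its row of the matrix."""
    return [[embedding_matrix[index] for index in sequence] for sequence in batch]


# Invented example: 4 words in the vocabulary, output_dim = 2.
embedding_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# batch_size = 2, input_length = 3
batch = [[1, 2, 0], [3, 3, 1]]

output = embed(batch, embedding_matrix)
input_shape = (len(batch), len(batch[0]))
output_shape = (len(output), len(output[0]), len(output[0][0]))
```

The output gains one trailing axis of size `output_dim` because each integer is replaced by a whole vector; in a trainable layer the rows of the matrix are the parameters that get updated.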

(39) Conclusion. The NLP field is vast, and it’s not possible to cover everything in one session. There is an exciting new area of study using Transformers, where models like BERT and GPT are giving impressive results in many NLP tasks. We have covered the basics: preprocessing, document representation, word embeddings, and the different libraries available for creating NLP applications.
