Analyze Text Data Using Multiword Phrases - MATLAB & Simulink - MathWorks 中国 (2024)

This example shows how to analyze text using n-gram frequency counts.

An n-gram is a tuple of n consecutive words. For example, a bigram (the case when n=2) is a pair of consecutive words such as "heavy rainfall". A unigram (the case when n=1) is a single word. A bag-of-n-grams model records the number of times that different n-grams appear in document collections.
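As a minimal sketch of this definition, the following base MATLAB code lists the bigrams of a short sentence that is already split into words. The sentence and the variable names are illustrative assumptions and are not part of the factory reports data.

```matlab
% Hypothetical tokenized sentence used only for illustration.
words = ["heavy" "rainfall" "caused" "local" "flooding"];
n = 2;  % n-gram length (2 = bigrams)

% A sequence of W words contains W-n+1 n-grams.
numNgrams = numel(words) - n + 1;
ngrams = strings(numNgrams,n);
for k = 1:numNgrams
    % Each n-gram is a window of n consecutive words.
    ngrams(k,:) = words(k:k+n-1);
end
disp(ngrams)
```

Setting n to 1 in this sketch yields the unigrams, that is, the individual words.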

Using a bag-of-n-grams model, you can retain more information on word ordering in the original text data. For example, a bag-of-n-grams model is better suited than a bag-of-words model for capturing short phrases that appear in the text, such as "heavy rainfall" and "thunderstorm winds".

To create a bag-of-n-grams model, use bagOfNgrams. You can input bagOfNgrams objects into other Text Analytics Toolbox functions such as wordcloud and fitlda.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Remove the rows with empty reports.

filename = "factoryReports.csv";
data = readtable(filename,TextType="string");
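The loading code reads the file but does not itself drop any rows. One possible way to remove rows with empty reports is sketched below on a small stand-in table. The table contents and the ReportID column are hypothetical, and the sketch assumes that an empty report is stored as a zero-length string.

```matlab
% Hypothetical miniature table standing in for the factory reports data.
Description = ["Mixer tripped the fuses."; ""; "Fried capacitors in the assembler."];
ReportID = [1; 2; 3];  % illustrative column, not taken from the original file
data = table(ReportID,Description);

% Find rows whose description has zero characters and delete them.
idxEmpty = strlength(data.Description) == 0;
data(idxEmpty,:) = [];
disp(data)
```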

Extract the text data from the table and view the first few reports.

textData = data.Description;
textData(1:5)

ans = 5×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."

Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords.

Use the example preprocessing function preprocessText to prepare the text data.

documents = preprocessText(textData);
documents(1:5)

ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse

Create Word Cloud of Bigrams

Create a word cloud of bigrams by first creating a bag-of-n-grams model using bagOfNgrams, and then inputting the model to wordcloud.

To count the n-grams of length 2 (bigrams), use bagOfNgrams with the default options.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [480×921 double]
      Vocabulary: ["item" "occasionally" "get" "stuck" "scanner" "loud" "rattling" "bang" "sound" "come" "assembler" "cut" "power" "start" "fry" "capacitor" "mixer" "trip" "burst" "pipe" … ]
          Ngrams: [921×2 string]
    NgramLengths: 2
       NumNgrams: 921
    NumDocuments: 480
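To illustrate how the Counts and Ngrams properties relate, here is a minimal sketch that looks up the per-document count of one bigram. It requires Text Analytics Toolbox. The two-sentence corpus and the variable names are assumptions made for illustration.

```matlab
% Hypothetical two-document corpus (not the factory reports data).
str = ["heavy rainfall today"; "heavy rainfall and thunderstorm winds"];
documents = tokenizedDocument(str);
bag = bagOfNgrams(documents);  % default n-gram length is 2

% Each row of Ngrams is one bigram; find the row that matches the target.
target = ["heavy" "rainfall"];
idx = all(bag.Ngrams == target,2);

% Counts has one row per document and one column per n-gram.
% full converts the selected column to an ordinary vector if it is sparse.
countsPerDocument = full(bag.Counts(:,idx))
```

The same indexing pattern applies to the factory reports model, where Counts has 480 rows (documents) and 921 columns (bigrams).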

Visualize the bag-of-n-grams model using a word cloud.

figure
wordcloud(bag);
title("Text Data: Preprocessed Bigrams")

Fit Topic Model to Bag-of-N-Grams

A Latent Dirichlet Allocation (LDA) model is a topic model which discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Create an LDA topic model with 10 topics using fitlda. The function fits an LDA model by treating the n-grams as single words.

mdl = fitlda(bag,10,Verbose=0);
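Besides word clouds, the fitted topics can be inspected numerically. The following sketch, which requires Text Analytics Toolbox, fits a two-topic model to a tiny corpus and lists the highest-scoring n-grams of the first topic using topkwords. The corpus, the number of topics, and the fixed random seed are assumptions chosen only to keep the example small and repeatable.

```matlab
% Hypothetical miniature corpus (not the factory reports data).
str = [
    "mixer tripped the fuses"
    "mixer tripped the breaker"
    "coolant is leaking from the pump"
    "coolant is leaking onto the floor"];
documents = tokenizedDocument(str);
bag = bagOfNgrams(documents);

rng("default")  % fix the random seed so that repeated runs give the same fit
numTopics = 2;
mdl = fitlda(bag,numTopics,Verbose=0);

% List the three highest-scoring n-grams of the first topic.
tbl = topkwords(mdl,3,1)
```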

Visualize the first four topics as word clouds.

figure
tiledlayout("flow");
for i = 1:4
    nexttile
    wordcloud(mdl,i);
    title("LDA Topic " + i)
end

The word clouds highlight commonly co-occurring bigrams in the LDA topics. The function plots the bigrams with sizes according to their probabilities for the specified LDA topics.

Analyze Text Using Longer Phrases

To analyze text using longer phrases, specify the NGramLengths option in bagOfNgrams to be a larger value.

When working with longer phrases, it can be useful to keep stop words in the model. For example, to detect the phrase "is not happy", keep the stop words "is" and "not" in the model.
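The effect of stop word removal on a phrase can be sketched in base MATLAB. The sentence and the three-word stop list below are hypothetical and much shorter than the list that removeStopWords uses.

```matlab
% Hypothetical sentence and a small hand-written stop word list.
words = ["the" "customer" "is" "not" "happy"];
smallStopList = ["the" "is" "not"];

% Dropping stop words also drops the words that carry the negation.
kept = words(~ismember(words,smallStopList));

% Check whether the phrase survives in each version of the text.
phraseInOriginal = contains(join(words),"is not happy")
phraseInFiltered = contains(join(kept),"is not happy")
```

In this sketch only "customer" and "happy" remain after filtering, so the phrase can no longer be detected.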

Preprocess the text. Erase the punctuation using erasePunctuation, and tokenize using tokenizedDocument.

cleanTextData = erasePunctuation(textData);
documents = tokenizedDocument(cleanTextData);

To count the n-grams of length 3 (trigrams), use bagOfNgrams and specify NGramLengths to be 3.

bag = bagOfNgrams(documents,NGramLengths=3);
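The n-gram lengths option also accepts a vector, so a single model can count phrases of several lengths. The sketch below, which requires Text Analytics Toolbox, counts bigrams and trigrams together for a one-sentence corpus. The sentence is a hypothetical example.

```matlab
% Hypothetical single-document corpus (not the factory reports data).
documents = tokenizedDocument("the robot arm is stuck");

% Count n-grams of length 2 and length 3 in the same model.
bag = bagOfNgrams(documents,NgramLengths=[2 3]);

bag.NgramLengths   % lengths that the model counts
bag.NumNgrams      % 4 bigrams + 3 trigrams in this five-word document
bag.Ngrams         % one row per n-gram
```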

Visualize the bag-of-n-grams model using a word cloud. The word cloud of trigrams better shows the context of the individual words.

figure
wordcloud(bag);
title("Text Data: Trigrams")

View the top 10 trigrams and their frequency counts using topkngrams.

tbl = topkngrams(bag,10)

tbl=10×3 table
                  Ngram                   Count    NgramLength
    __________________________________    _____    ___________

    "in"       "the"         "mixer"        14          3
    "in"       "the"         "scanner"      13          3
    "blown"    "in"          "the"           9          3
    "the"      "robot"       "arm"           7          3
    "stuck"    "in"          "the"           6          3
    "is"       "spraying"    "coolant"       6          3
    "from"     "time"        "to"            6          3
    "time"     "to"          "time"          6          3
    "heard"    "in"          "the"           6          3
    "on"       "the"         "floor"         6          3
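The Ngram variable of the returned table stores each word in its own column. As a minimal sketch, the words can be joined back into a readable phrase with join. The example requires Text Analytics Toolbox and uses a hypothetical three-sentence corpus in which one trigram occurs twice.

```matlab
% Hypothetical corpus: "in the mixer" is the only trigram that appears twice.
str = ["stuck in the mixer"; "blown in the mixer"; "noise in the scanner"];
documents = tokenizedDocument(str);
bag = bagOfNgrams(documents,NgramLengths=3);

% Retrieve the single most frequent trigram and its count.
tbl = topkngrams(bag,1);

% Join the three word columns into one space-separated phrase.
phrase = join(tbl.Ngram," ")
count = tbl.Count
```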

Example Preprocessing Function

The function preprocessText performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords.

function documents = preprocessText(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,Style="lemma");

end

See Also

tokenizedDocument | bagOfWords | removeStopWords | erasePunctuation | removeLongWords | removeShortWords | bagOfNgrams | normalizeWords | topkngrams | fitlda | ldaModel | wordcloud | addPartOfSpeechDetails

Related Topics

  • Create Simple Text Model for Classification
  • Analyze Text Data Containing Emojis
  • Analyze Text Data Using Topic Models
  • Train a Sentiment Classifier
  • Classify Text Data Using Deep Learning
  • Generate Text Using Deep Learning (Deep Learning Toolbox)