This tutorial covers the basics and fundamentals of text mining. It includes detailed explanations of the most common text mining terms and terminologies. It is designed for beginners who are new to text analytics and should help them get started with text mining.
Text Mining Terminologies
1. Document - A sentence. For example, "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
2. Tokens - Individual words. For example: "nation", "Liberty", "men".
3. Terms - Terms may represent single words or multiword units, such as "civil war".
4. Corpus - A collection of documents (a database). For example, a corpus containing 16 documents (16 txt files).
5. Stopwords - A set of commonly used words that you want to exclude while analyzing text. Examples of stopwords: 'a', 'an', 'the', 'to', 'of', or custom entries such as 'ABC Company'.
6. Document Term Matrix - A matrix with documents in rows and terms in columns.
Example of a document-term matrix (see the short Python sketch below):
7. Sparse terms - Terms occurring in only a very few documents (sentences).
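A minimal sketch of building a document-term matrix in Python with scikit-learn, which also shows the stopword removal described in item 5. The three sample sentences and all variable names are illustrative, not from the tutorial's corpus:

```python
# A sketch of building a document-term matrix with scikit-learn.
# The three example sentences below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cow jumps over the moon",
    "the moon is bright tonight",
    "a cow eats grass in the field",
]

# stop_words="english" drops common stopwords such as 'a', 'the', 'is'
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)        # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())   # the terms (one per column)
print(dtm.toarray())                        # term counts, one row per document
```

Terms that end up with non-zero counts in only one or two rows are the sparse terms described in item 7.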
8. Tokenization - The process of splitting unstructured data into tokens such as words, phrases, keywords etc.
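A small tokenization sketch using NLTK (an assumption; any tokenizer would do). Depending on your NLTK version, the tokenizer models are packaged as 'punkt' or 'punkt_tab', so both are downloaded below; the sample sentence reuses the document example from above:

```python
# Tokenizing raw text into word tokens with NLTK.
import nltk
nltk.download("punkt", quiet=True)       # tokenizer models (older NLTK versions)
nltk.download("punkt_tab", quiet=True)   # tokenizer models (newer NLTK versions)

text = "Four score and seven years ago our fathers brought forth a new nation."
tokens = nltk.word_tokenize(text)
print(tokens)   # ['Four', 'score', 'and', 'seven', 'years', 'ago', ...]
```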
9. Stemming - Reducing words to their root (stem) form. For example, "interesting", "interest" and "interested" are all stemmed to "interest". Afterwards, the stems can be completed back to their original forms (stem completion) so that the words look "normal" again.
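A minimal stemming sketch using NLTK's Porter stemmer (one of several possible stemmers):

```python
# Stemming with the Porter stemmer: all three variants reduce to "interest".
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["interesting", "interest", "interested"]:
    print(word, "->", stemmer.stem(word))
```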
10. Polarity - Whether a document or sentence is positive, negative or neutral. This term is commonly used in sentiment (opinion) analysis.
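As an illustration, a polarity score can be obtained with a sentiment library such as TextBlob (an assumption; the tutorial does not prescribe a specific library). Its polarity score ranges from -1 (negative) through 0 (neutral) to +1 (positive):

```python
# Polarity scoring with TextBlob; the two example sentences are made up.
from textblob import TextBlob

print(TextBlob("I love this product, it works great.").sentiment.polarity)         # > 0, positive
print(TextBlob("This is the worst service I have ever used.").sentiment.polarity)  # < 0, negative
```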
11. Bag-of-words - Each sentence (or document) is treated as a bag of words, ignoring grammar and even word order. The terms 'make India' and 'India make' get the same score, as the short sketch below shows.
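A tiny sketch showing that a bag-of-words representation cannot distinguish the two phrases, because only the word counts are kept:

```python
# Bag-of-words ignores word order: both phrases produce identical word counts.
from collections import Counter

print(Counter("make India".split()))   # Counter({'make': 1, 'India': 1})
print(Counter("India make".split()))   # Counter({'India': 1, 'make': 1})
print(Counter("make India".split()) == Counter("India make".split()))   # True
```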
12. Part of Speech Tagging - Tagging every word in a document with its part of speech - noun, verb, adjective, pronoun, singular noun, plural noun, etc.
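A short part-of-speech tagging sketch with NLTK (an assumption; the tagger resource names differ slightly across NLTK versions, so both are downloaded):

```python
# Part-of-speech tagging with NLTK; the sentence reuses the n-gram example below.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)       # older NLTK versions
nltk.download("averaged_perceptron_tagger_eng", quiet=True)   # newer NLTK versions

tokens = nltk.word_tokenize("The cow jumps over the moon")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cow', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN')]
```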
13. Term Frequency - Inverse Document Frequency (tf-idf) -
It measures how important a word is.
It consists of two terms:
- Term Frequency (tf)
- Inverse Document Frequency (idf)
Term Frequency measures how often a word (term) occurs in a document.
TF(t) = (Number of times term t appears) / (Total number of terms)
Inverse Document Frequency measures how important a word is: if a word appears frequently in a document, it should be important and we should give it a high score; but if it also appears in too many other documents, it is probably not a unique identifier, so we should assign it a lower score.
IDF(t) = log to base e (Total number of documents / Number of documents containing term t)
Putting the two together:
tf-idf = tf × idf
Example: Suppose the word 'good' appears 373 times across a collection of six documents containing 122,204 words (terms) in total. Term Frequency (TF) = 373 / 122204 = 0.00305. The word appears in only one document, so IDF = ln(6/1) = 1.791759. Hence, tf-idf = TF × IDF ≈ 0.0055.
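The arithmetic of the example above can be checked in a couple of lines:

```python
# Verifying the worked tf-idf example for the word 'good'.
import math

tf = 373 / 122204        # term frequency across the whole collection
idf = math.log(6 / 1)    # natural log: 6 documents, 'good' appears in only 1
print(round(tf, 5), round(idf, 6), round(tf * idf, 4))   # 0.00305 1.791759 0.0055
```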
Uses of TF-IDF
1. Building Stopwords
Terms with a tf-idf value of zero or close to zero can be added to the stopword list. These are words that appear in all of the documents, so their idf term is zero.
2. Important Words
Sort the tf-idf values in descending order; the terms that appear at the top after sorting are the most important words (as sketched below).
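A possible sketch of ranking terms by tf-idf with scikit-learn; the three sample documents and the choice to average scores across documents are illustrative assumptions:

```python
# Ranking terms by their average tf-idf score across a small made-up collection.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was good and the acting was good",
    "the plot was predictable but the acting was fine",
    "good direction but a weak script",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

scores = np.asarray(tfidf.mean(axis=0)).ravel()   # average tf-idf per term
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda pair: -pair[1]):
    print(f"{term:12s} {score:.3f}")              # most important words first
```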
3. Text Clustering
- Calculate the tf-idf scores for the collection of documents
- Calculate the pairwise distance matrix using cosine distance
- Perform hierarchical clustering and visualize the result with a dendrogram (see the sketch after this list)
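A minimal sketch of these three steps with scikit-learn and SciPy; the five documents are made up and the 'average' linkage method is just one reasonable choice:

```python
# Text clustering: tf-idf -> pairwise cosine distances -> hierarchical clustering.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "the cow jumps over the moon",
    "the moon is bright tonight",
    "a cow eats grass in the field",
    "stock prices fell sharply today",
    "the market rallied after the announcement",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)  # step 1: tf-idf
dist = cosine_distances(tfidf)                                     # step 2: cosine distance matrix
links = linkage(squareform(dist, checks=False), method="average")  # step 3: hierarchical clustering
dendrogram(links, labels=[f"doc{i}" for i in range(len(docs))])    # visualize as a dendrogram
plt.show()
```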
14. N-grams -
They are basically a set of co-occurring words within a given window.
- N-gram of size 1 - unigram
- N-gram of size 2 - bigram
- N-gram of size 3 - trigram
For example, take the sentence "The cow jumps over the moon".
I. If N=2 (bigram), the n-grams would be:
the cow, cow jumps, jumps over, over the, the moon
In this case, we have 5 bigrams.
II. If N=3 (trigram), the n-grams would be:
the cow jumps, cow jumps over, jumps over the, over the moon
In this case, we have 4 trigrams.
How many N-grams are in a sentence?
If X = the number of words in a given sentence K, the number of n-grams for sentence K would be: N-grams = X - (N - 1)
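A short sketch that generates the n-grams above with a sliding window and confirms the count formula (the helper function name is illustrative):

```python
# Generate n-grams with a sliding window over the words of a sentence.
def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"      # X = 6 words
print(ngrams(sentence, 2))   # 5 bigrams:  X - (N - 1) = 6 - 1
print(ngrams(sentence, 3))   # 4 trigrams: X - (N - 1) = 6 - 2
```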
Check out the detailed documentation: Trigrams and Bigrams Explained