Branches of mechanical engineering: Text Mining Terminologies




This tutorial covers the basics and fundamentals of text mining. It includes a detailed explanation of various text mining terms and terminologies. This tutorial is designed for beginners who are new to text analytics. It will help them get started with text mining.

Text Mining Terminologies
  1. Document is a sentence. For example, "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
  2. Tokens represent words. For example: "nation", "Liberty", "men".
  3. Terms may represent single words or multiword units, such as "civil war".
  4. Corpus is a collection of documents (database). For example, a corpus contains 16 documents (16 txt files).
  5. Stopwords are basically a set of commonly used words which you want to exclude while analyzing text. Examples of stopwords - 'a', 'an', 'the', 'to', 'of', 'ABC Company' etc.
  6. Document Term Matrix is a matrix consisting of documents in rows and terms in columns.
Example of document term matrix:

[Figure: Document Term Matrix]
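
A document term matrix can be sketched in a few lines of Python. The toy corpus and the whitespace tokenization below are assumptions made for illustration; they are not part of the original tutorial.

```python
from collections import Counter

# Hypothetical toy corpus: each string is one document.
documents = [
    "the cow jumps over the moon",
    "the moon is bright",
    "a new nation conceived in liberty",
]

# Simplifying assumption: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# Columns: the sorted set of all terms that occur in the corpus.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Rows: one list of term counts per document.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is a document and each column is a term, so most cells are zero even in this tiny corpus.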

7. Sparse terms - Terms occurring only in very few documents (sentences).

8. Tokenization - It is the process of splitting unstructured data into tokens such as words, phrases, keywords etc.
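
As a rough illustration, word-level tokenization can be approximated with a regular expression. The `tokenize` helper and its pattern are assumptions chosen for this sketch; real tokenizers handle punctuation, hyphens and languages other than English more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "Four score and seven years ago our fathers brought forth a new nation."
tokens = tokenize(sentence)
print(tokens)
```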

9. Stemming - Reducing words to their root (stem) form. For example, "interesting", "interest" and "interested" are all stemmed to "interest". After that, we can complete the stems back to their original forms, so that the words look "normal".
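
The idea can be shown with a deliberately crude suffix stripper. This `toy_stem` function is a hypothetical stand-in written for this example, not a real stemming algorithm such as Porter's, and it will give poor results on many words.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["interesting", "interest", "interested"]
stems = [toy_stem(w) for w in words]
print(stems)
```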

10. Polarity - Whether a document or sentence is positive, negative or neutral. This term is commonly used in sentiment analysis.

11. Bag-of-words - Each sentence (or document) is treated as a bag of words, ignoring grammar and even word order. The terms 'make India' and 'India make' have the same probability score.
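
A minimal sketch of why word order does not matter: if a bag of words is represented as a mapping from word to count, the two phrases produce identical bags. The `bag_of_words` helper is an assumption for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding order and case."""
    return Counter(text.lower().split())

same_bag = bag_of_words("make India") == bag_of_words("India make")
print(same_bag)
```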

12. Part of Speech Tagging - It involves tagging every word in the document and assigning it a part of speech - noun, verb, adjective, pronoun, singular noun, plural noun, etc.

13. Term Frequency - Inverse Document Frequency (tf-idf) - 

It measures how important a word is.

It consists of 2 terms -
  1. Term Frequency (tf)
  2. Inverse Document Frequency (idf)
Term Frequency measures how often a word (term) occurs in a document.
TF(t) = (Number of times term t appears) / (Total number of terms).
Inverse Document Frequency measures how important a word is. If a word appears often in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it's probably not a unique identifier, therefore we should assign a lower score to that word.
IDF(t) = log_e(Total number of documents / Number of documents containing term t)
Term Frequency Inverse Document Frequency
tf-idf = tf × idf
Example: Suppose the word 'good' appears 373 times in a total of 6 documents, which together contain 122204 words (terms). Term Frequency (TF) would be 373/122204 = 0.00305. But this word appears in only 1 document, so IDF would be ln(6/1) = 1.791759. Hence, tf-idf = TF * IDF = 0.00547.
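
The arithmetic in this example can be checked with a short script. The variable names are chosen for this sketch; the numbers are the ones given in the example.

```python
import math

term_count = 373          # occurrences of 'good'
total_terms = 122204      # total words across the documents
total_documents = 6
documents_with_term = 1   # 'good' appears in only one document

tf = term_count / total_terms
idf = math.log(total_documents / documents_with_term)  # natural log (base e)
tf_idf = tf * idf

print(round(tf, 5), round(idf, 6), round(tf_idf, 5))
```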

Uses of TF-IDF

1. Building Stopwords

Terms having a tf-idf value of zero or close to zero can be added to a stop-words list. These are words that appear in all of the documents, so the idf term is zero.
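
A small sketch of this idea, assuming a hypothetical three-document corpus: any term that occurs in every document has idf = ln(1) = 0, so its tf-idf is zero as well.

```python
import math

# Hypothetical toy corpus for illustration only.
documents = [
    "the cow jumps over the moon",
    "the moon is bright tonight",
    "the nation was conceived in liberty",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

def idf(term):
    """Natural-log inverse document frequency of a term in the toy corpus."""
    containing = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / containing)

vocabulary = sorted({t for tokens in tokenized for t in tokens})

# Terms present in every document get idf == 0 and become stopword candidates.
stopword_candidates = [t for t in vocabulary if idf(t) == 0.0]
print(stopword_candidates)
```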

2. Important Words

Sort TF-IDF values in descending order. The term that appears at the top after sorting is the most important word.

3. Text Clustering
  • Calculate the tf-idf score for the collection of documents
  • Calculate the pairwise distance matrix using cosine distance
  • Perform hierarchical clustering and visualize the clustering result with a dendrogram.
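
The first two steps above can be sketched with the standard library alone. The toy corpus and helper names are assumptions for this example; the hierarchical clustering and dendrogram step is left out here because it would normally rely on a dedicated library.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration only.
documents = [
    "the cow jumps over the moon",
    "the cow looks at the moon",
    "a new nation conceived in liberty",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(tokenized)

def tfidf_vector(tokens):
    """Step 1: tf-idf weight of every vocabulary term for one document."""
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)
        df = sum(1 for other in tokenized if term in other)
        vector.append(tf * math.log(n_docs / df))
    return vector

def cosine_distance(u, v):
    """Step 2: one minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

vectors = [tfidf_vector(tokens) for tokens in tokenized]
distances = [[cosine_distance(u, v) for v in vectors] for u in vectors]

for row in distances:
    print([round(d, 3) for d in row])
```

The first two documents share several terms and end up closer to each other than to the third, which shares none. Note that this sketch assumes every document has at least one term that is not present in all documents; otherwise its vector is all zeros and the cosine distance is undefined.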


14. N-grams - 

They are basically a set of co-occurring words within a given window.
    • N-gram of size 1 - unigram
    • N-gram of size 2 - bigram
    • N-gram of size 3 - trigram
For example, take the sentence "The cow jumps over the moon".
I. If N=2 (known as bigrams), then the n-grams would be:
the cow, cow jumps, jumps over, over the, the moon
In this case, we have 5 bigrams.

II. If N=3 (trigrams), the n-grams would be:
the cow jumps, cow jumps over, jumps over the, over the moon

How many N-grams in a sentence?
If X = number of words in a given sentence K, the number of n-grams for sentence K would be: N-grams = X - (N-1). For the six-word sentence above, this gives 6 - 1 = 5 bigrams and 6 - 2 = 4 trigrams.
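
A minimal sketch of n-gram extraction, which also confirms the count X - (N-1) for the example sentence. The `ngrams` helper is an assumption written for this illustration and splits on whitespace only.

```python
def ngrams(sentence, n):
    """Return all runs of n consecutive words in the sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
print(len(bigrams), len(trigrams))
```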
N-grams are used to include tokens such as bigrams in the feature space instead of just unigrams (one word). But various research papers warn that using bigrams and trigrams in your feature space may not necessarily yield any significant improvement.

Trigrams vs. Bigrams
Trigrams do have an advantage over bigrams, but it is small.

Check out the detailed documentation: Trigrams and Bigrams Explained

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.

While I love having friends who agree, I only learn from those who don't.

Source: http://engdashboard.blogspot.com/
