Before any operation is performed on text data, the data must be pre-processed, because text data often contains special formats such as number and date formats, and the most common words, which are unlikely to help text mining, such as prepositions, articles, and pronouns, can be eliminated. Stemming or lemmatisation is a technique for reducing words to their roots: many words in the English language occur in multiple forms, so these words should be reduced to their roots. After stop-word elimination and stemming, the data is converted into a vector space model, which makes it possible to handle the data in terms of numeric values. Dimension reduction is an important step in text mining: it improves the performance of clustering techniques by reducing the number of terms that text mining procedures must process. Singular value decomposition is a technique used to reduce the dimension of a vector. Finally, clustering is applied to make data retrieval easy; k-means is an efficient clustering technique that is applied to cluster text documents.
Preprocessing for Text Mining
The first step of morphological analysis is tokenisation, whose aim is to identify the words in a sentence. Textual data is initially only a block of characters, yet all subsequent processes in information retrieval require the words of the data set; hence the need for a parser that performs the tokenisation of the documents. This may sound trivial, since the text is already stored in machine-readable form. Nevertheless, some problems remain, such as the removal of punctuation marks; other characters such as brackets and hyphens also require processing. Furthermore, the tokeniser can enforce consistency in the documents: inconsistencies include differing number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form. The main use of tokenisation is identifying meaningful keywords.
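The tokenisation step described above can be sketched as follows. This is a minimal illustration using a regular expression; the separator pattern and the lowercasing step are assumptions of this sketch, and a production tokeniser would also normalise dates, numbers, abbreviations, and acronyms, which is omitted here.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens, dropping punctuation.

    Punctuation marks, brackets, and hyphens act as separators and
    are discarded; only runs of letters and digits survive.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Text mining, at its core, needs clean (tokenised) input!")
print(tokens)
```

Running this yields the word list with punctuation and brackets removed, ready for the stop-word and stemming steps that follow.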
The most common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, can be treated as stop words. Because every text document contains these words and they are not necessary for text mining applications, they are eliminated. Any group of words can be chosen as the stop words for a given purpose. This process also reduces the size of the text data and improves system performance.
(e.g. “the”, “a”, “an”)
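Stop-word elimination amounts to filtering the token list against a chosen word set. The stop list below is a small illustrative sample, not a standard list; real systems use larger, often domain-specific lists.

```python
# Illustrative stop list; any group of words can serve for a given purpose.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop list, keeping token order."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "sat", "in", "a", "box"]))
# -> ['cat', 'sat', 'box']
```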
Stemming or lemmatisation is a technique for reducing words to their roots. Many words in the English language can be reduced to a base form or stem, e.g. agreed, agreeing, disagree, agreement, and disagreement all belong to agree. Furthermore, names are transformed into the stem by removing the possessive ”’s”: the variant ”Peter’s” in a sentence is reduced to ”Peter” during the stemming process. The removal may lead to a root that is not a correct word. However, such stems are not a problem for the stemming process if the words are not used for human interaction: the stem is still useful, because all other inflections of the root are transformed into the same stem. Case-sensitive systems could have problems when comparing a word in capital letters with another word that has the same meaning in lower case.
Advances in data collection and storage capabilities during the past decades have led to an information overload in most sciences. Researchers working in domains as diverse as engineering, astronomy, biology, remote sensing, economics, and consumer transactions face ever larger observations and simulations on a daily basis. Such datasets, in contrast with the smaller, more traditional datasets that have been studied extensively in the past, present new challenges in data analysis. Traditional statistical methods break down partly because of the increase in the number of observations, but mostly because of the increase in the number of variables associated with each observation.
The dimension of the data is the number of variables measured on each observation. High-dimensional datasets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are important for understanding the underlying phenomena of interest. While certain computationally expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling of the data.
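The singular value decomposition mentioned above can be applied to a term-document matrix to project documents into a lower-dimensional space. The toy matrix below is an assumption of this sketch (raw term frequencies; a real system would typically use tf-idf weights), and keeping k = 2 singular values is an arbitrary illustrative choice.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 2, 0, 2],
    [1, 0, 1, 1],
], dtype=float)

# SVD: A = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (a rank-k approximation),
# representing each document by k coordinates instead of one per term.
k = 2
docs_reduced = np.diag(s[:k]) @ Vt[:k, :]   # shape (k, n_docs)

print(docs_reduced.shape)  # (2, 4): 4 documents, now 2 dimensions each
```

Clustering then operates on these reduced document vectors, so the procedure processes far fewer terms per document.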
Document clustering (also referred to as text clustering) is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering.
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.
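The grouping step can be sketched with plain k-means (Lloyd's algorithm). The 2-D points below stand in for reduced document vectors and are an assumption of this example; library implementations add smarter initialisation (e.g. k-means++) and convergence checks omitted here.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Compact k-means on the row vectors of X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated document groups in a toy 2-D "concept" space.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
labels, _ = kmeans(X, k=2)
print(labels)  # the first two points share one label, the last two the other
```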
Text mining refers to the discovery of non-trivial, previously unknown, and potentially useful knowledge from a collection of texts. Since its origin, text mining has been considered an analogue of data mining (interpreted as Knowledge Discovery in Databases, or KDD) applied to text repositories. Text mining is very important since nowadays around 80% of the information stored in computers (not considering audio, video, and images) consists of text.