Document Clustering using Linear Partitioning and Reallocation using EM Algorithm
MS P J Gayathri, S C Punitha, Dr M Punithavalli
Abstract
Document clustering is a subfield of data clustering that borrows concepts from information retrieval (IR), natural language processing (NLP), and machine learning (ML), and a wide variety of unsupervised clustering algorithms exist. This paper presents a novel document clustering algorithm that enhances features of existing algorithms. It reviews the Principal Direction Divisive Partitioning (PDDP) algorithm and its drawbacks, introduces a combinatorial framework built on PDDP, and then describes a simplified version of the EM algorithm called the spherical Gaussian EM (sGEM) algorithm. The PDDP algorithm recursively splits the data samples into two sub-clusters using the hyperplane normal to the principal direction derived from the covariance matrix; this split is the central logic of the algorithm. However, PDDP can yield poor results, especially when clusters are not well separated from one another. To improve the quality of the clustering results, cluster memberships are reallocated using the sGEM algorithm with different settings. Furthermore, given the theoretical background of the sGEM algorithm, the framework extends naturally to estimating the number of clusters using the Bayesian Information Criterion. Experimental results demonstrate the effectiveness of the proposed algorithm in comparison with the existing algorithm.
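The two stages named in the abstract, a PDDP split followed by sGEM reallocation, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pddp_split` and `sgem_reallocate`, the fixed iteration count, the small numerical floors, and the synthetic data are all assumptions made for illustration, and only a single two-way split is shown rather than the recursive procedure. The estimate of the number of clusters via the Bayesian Information Criterion is not covered.

```python
import numpy as np

def pddp_split(X):
    """One PDDP-style split: partition rows of X by the sign of their
    projection onto the principal direction of the centered data."""
    centered = X - X.mean(axis=0)
    # The leading right singular vector of the centered data equals the
    # leading eigenvector of the sample covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # The splitting hyperplane passes through the mean, normal to `direction`.
    return (centered @ direction > 0).astype(int)

def sgem_reallocate(X, labels, n_iter=20):
    """Refine a hard two-way partition with EM for a mixture of spherical
    Gaussians (one scalar variance per cluster). n_iter is an assumed setting."""
    n, d = X.shape
    resp = np.eye(2)[labels]  # hard labels as one-hot responsibilities
    for _ in range(n_iter):
        # M-step: mixing weights, means, and per-cluster scalar variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq_dist).sum(axis=0) / (d * nk) + 1e-9
        # E-step: posterior responsibilities, computed in log space.
        log_p = (np.log(weights)
                 - 0.5 * d * np.log(2 * np.pi * var)
                 - 0.5 * sq_dist / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    # Reallocate each sample to its most probable cluster.
    return resp.argmax(axis=1)

# Synthetic stand-in for document vectors: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (50, 5)), rng.normal(5, 1, (50, 5))])
initial = pddp_split(X)
refined = sgem_reallocate(X, initial)
```

In this sketch the PDDP labels only initialize the responsibilities; the EM iterations are then free to move samples across the original hyperplane, which is the kind of reallocation the abstract describes for clusters that are not well separated.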
This work is licensed under a
Creative Commons Attribution 3.0 License.
Copyright © 2001-2010 by Global Journals Inc. (US) – All Rights Reserved