Document Clustering using Self-Organizing Maps
A Multi-Features Layered Approach
Abstract
Cluster analysis of textual documents is a common technique for better filtering, navigation,
understanding and comprehension of large document collections. Document clustering is an autonomous method
that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called
clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can be used to perform
autonomous self-organization of a high-dimensional feature space into low-dimensional projections called maps. It
is considered a good method for clustering because both require unsupervised processing. In this paper, we
propose a multi-layer, multi-feature SOM to cluster documents. The paper implements a SOM using
four layers: the bottom layers contain lexical terms, phrases and sequences, respectively, and the top layer
combines all of them. The documents are processed to extract these features, which are fed to the SOM. The internal
weights and interconnections between the features (neurons) of these layers settle automatically through iterations
with a small learning rate to discover the actual clusters. We have performed an extensive set of experiments on
standard text mining datasets such as NEWS20, Reuters and WebKB, with F-Measure and Purity as evaluation measures.
The evaluation gives encouraging results, and the approach outperforms some of the existing approaches. We conclude
that a SOM with multiple features (lexical terms, phrases and sequences) and multiple layers can be very effective in
producing high-quality clusters on large document collections.
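
To make the basic mechanism concrete, the following is a minimal sketch of a plain single-layer SOM applied to toy document vectors, together with the Purity measure. It is not the four-layer, multi-feature architecture of the paper. The function names (train_som, best_matching_unit, purity), the toy term-weight vectors, the map size and the decay schedules are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # assumed linear learning-rate decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # assumed shrinking neighbourhood
        order = list(range(len(vectors)))
        rng.shuffle(order)
        for i in order:
            x = vectors[i]
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Gaussian neighbourhood around the winner on the map grid.
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def purity(assignments, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for cluster, label in zip(assignments, labels):
        clusters.setdefault(cluster, []).append(label)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(labels)


# Hypothetical term-weight vectors for six documents on two topics.
docs = [
    [1.0, 0.9, 0.0, 0.1], [0.9, 1.0, 0.1, 0.0], [0.8, 0.9, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9], [0.1, 0.0, 0.9, 1.0], [0.0, 0.0, 0.8, 0.9],
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

weights = train_som(docs, rows=1, cols=2)
assignments = [best_matching_unit(weights, d) for d in docs]
print("purity:", purity(assignments, labels))
```

Each document is assigned to the map neuron that wins it, so the neurons act as cluster centres. A layered variant in the spirit of the abstract would train one such map per feature type and feed their outputs to a further map, but that design is not specified here and is left out of the sketch.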
MENDEL open access articles are normally published under a Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/ . Under the CC BY-NC-SA 4.0 license, permitted third-party reuse is only applicable for non-commercial purposes. Articles posted under the CC BY-NC-SA 4.0 license allow users to share, copy, and redistribute the material in any medium or format, and adapt, remix, transform, and build upon the material for any purpose. Reusing under the CC BY-NC-SA 4.0 license requires that appropriate attribution to the source of the material be included along with a link to the license, with any changes made to the original material indicated.