Clustering module

The clustering module uses unsupervised learning techniques to compute groups of related documents. The relatedness of any pair of documents is expressed as the distance or similarity between their feature vectors; several metric types are supported.
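
As an illustration of what "distance or similarity between feature vectors" can mean, the following minimal sketch computes two common metrics, cosine similarity and Euclidean distance, on plain Python lists. The function names are hypothetical and chosen for this example only; the text above does not say which metric types the module actually provides.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy term vectors over the same three-term vocabulary.
doc_a = [2.0, 0.0, 1.0]
doc_b = [1.0, 0.0, 0.5]
similarity = cosine_similarity(doc_a, doc_b)   # same direction, so close to 1
distance = euclidean_distance(doc_a, doc_b)    # non-zero, lengths differ
```

Cosine similarity ignores vector length, which is often convenient for term vectors of documents with different sizes, whereas Euclidean distance is sensitive to it.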


Typically, clustering is performed on term vectors, which organizes the documents into a topical cluster hierarchy. Because a description of each cluster can be generated using summarization techniques, the hierarchy is suitable for exploring large data sets, acting as a “virtual table of contents”. Thanks to the incremental clustering capability, the algorithms can also be used for monitoring changes in dynamic repositories or for evolving knowledge structures.
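
To make the idea of incremental clustering concrete, here is a minimal sketch of a simple "leader"-style scheme: each incoming document vector joins the most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. This is an illustrative assumption, not the module's actual algorithm; the class name `IncrementalClusterer` and the `threshold` parameter are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


class IncrementalClusterer:
    """Hypothetical leader-style clusterer that processes one vector at a time."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # minimum similarity needed to join a cluster
        self.centroids = []         # running mean vector of each cluster
        self.sizes = []             # number of documents in each cluster

    def add(self, vector):
        """Assign the vector to a cluster and return that cluster's index."""
        best_index, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = _cosine(vector, centroid)
            if sim > best_sim:
                best_index, best_sim = i, sim

        if best_index >= 0 and best_sim >= self.threshold:
            # Update the running mean of the matched cluster.
            n = self.sizes[best_index]
            old = self.centroids[best_index]
            self.centroids[best_index] = [
                (c * n + v) / (n + 1) for c, v in zip(old, vector)
            ]
            self.sizes[best_index] = n + 1
            return best_index

        # No cluster is similar enough: the document opens a new cluster.
        self.centroids.append(list(vector))
        self.sizes.append(1)
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=0.8)
labels = [clusterer.add(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
```

In a scheme like this, a new cluster appearing, or an existing one growing quickly, is the kind of signal that could be used to monitor changes in a dynamic repository.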


A variety of clustering algorithms are supported, providing a fast, scalable and versatile clustering solution. The algorithm is chosen and parameterized automatically, depending on the size and characteristics of the data set and on the specification of the clustering task. That specification can include guessing the number of clusters (given constraints on the minimum and maximum), hierarchical clustering that produces several layers of sub-clusters (optionally with hierarchy balancing), performance vs. quality considerations, etc.