Data Model

Knowledge discovery is a data-driven process: data must be transformed and processed by algorithms to extract knowledge that is meaningful to humans.

In our model the basic objects are information entities called Knowlets. A Knowlet can be seen as an information container with various properties assigned to it. Algorithms manipulate these properties during the execution of workflows, transforming Knowlets into so-called Artifacts.
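The idea of a workflow turning a Knowlet into an Artifact can be sketched as follows. This is a minimal illustration, not the KnowMiner implementation; the class and function names (`Knowlet`, `run_workflow`, `add_length_metadata`) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Knowlet:
    """Hypothetical information container with assigned properties."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    annotations: List[Dict[str, Any]] = field(default_factory=list)
    features: Dict[str, float] = field(default_factory=dict)

def run_workflow(entity: Knowlet,
                 steps: List[Callable[[Knowlet], Knowlet]]) -> Knowlet:
    """Apply a sequence of algorithms to an entity; the result is the Artifact."""
    for step in steps:
        entity = step(entity)
    return entity

# Example algorithm: a metadata-generating step.
def add_length_metadata(k: Knowlet) -> Knowlet:
    k.metadata["length"] = len(k.content)
    return k

artifact = run_workflow(Knowlet("some raw text"), [add_length_metadata])
```

Each algorithm only reads and writes properties of the entity it receives, so workflows compose freely from independent steps.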

Typical properties of an information entity are:

  • Metadata: Additional data assigned to an information entity, either converted from the data source (file name, creation date, etc.) or produced by a metadata-generating algorithm.
  • Annotations: Extra information, created by information extraction, that can be attached to an entity. An example is an NLP pipeline adding word classes to words.
  • Features: Statistical information consumed by most knowledge discovery algorithms, generated either by computing statistics over selected content parts carrying particular annotations (such as counting the frequency of nouns) or by converting metadata.
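The step from annotations to features mentioned above can be illustrated with the noun-frequency example. The token structure and function name below are assumptions made for this sketch, not part of the framework:

```python
# Annotations as produced by a hypothetical NLP pipeline: each token
# carries a word-class (part-of-speech) tag.
tokens = [
    {"word": "algorithms", "pos": "NOUN"},
    {"word": "extract",    "pos": "VERB"},
    {"word": "knowledge",  "pos": "NOUN"},
]

def noun_frequency(annotated_tokens):
    """Derive a numeric feature from annotations: share of nouns."""
    nouns = sum(1 for t in annotated_tokens if t["pos"] == "NOUN")
    return nouns / len(annotated_tokens)

feature = noun_frequency(tokens)
```

The resulting number can be stored as a feature of the entity and fed directly to statistical knowledge discovery algorithms.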

[Figure: Transformations within information entities. Solid lines mark information-reducing transformations, plain lines information-preserving transformations, and dashed lines information-enriching transformations.]

Information entities are embedded in structures, just as files are organized in file systems. RDF is used to provide these structural elements, which distinguishes the KnowMiner framework from existing frameworks such as UIMA.
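Such RDF-style structure boils down to (subject, predicate, object) triples linking entities. A minimal sketch without an RDF library follows; the predicate name `partOf` and the entity identifiers are illustrative assumptions, not KnowMiner's vocabulary:

```python
# Structural relations between entities, stored as RDF-style triples.
triples = set()

def add_child(parent: str, child: str) -> None:
    # Assumed predicate name; real RDF would use a full URI here.
    triples.add((child, "partOf", parent))

def children_of(parent: str) -> set:
    """Query the triple store for all entities contained in `parent`."""
    return {s for (s, p, o) in triples if p == "partOf" and o == parent}

add_child("corpus", "doc1")
add_child("corpus", "doc2")
add_child("doc1", "sentence1")
```

Representing structure as triples keeps containment queries uniform for corpora, documents, and sentences alike.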

A common data-access layer keeps access to the properties and structural elements of information entities simple. It also allows different application scenarios to be defined as views on the data.
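One way to read "application scenarios as views on the data" is an accessor that exposes only a selected subset of an entity's properties. The sketch below is an assumption about how such a layer could look; the class name and property keys are made up for illustration:

```python
class CommonAccess:
    """Hypothetical uniform accessor over an entity's properties."""

    def __init__(self, entity: dict):
        self.entity = entity

    def view(self, keys):
        """Return a restricted view containing only the requested properties."""
        return {k: self.entity[k] for k in keys if k in self.entity}

entity = {
    "metadata": {"name": "doc1"},
    "annotations": [],
    "features": {"length": 3},
}

# A search scenario might only need metadata and features.
search_view = CommonAccess(entity).view(["metadata", "features"])
```

Different scenarios then share one access path while seeing only the properties they need.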