Technical Information

Dreamcatcher provides access to a customisable Data Science toolkit through a minimalistic GUI that suits both advanced and novice users.

Core Components Overview

Information Extractor

A high-level interface that extracts and formats data from given files. Currently only Text Extraction is supported. The extractor uses an appropriate decoder to extract raw data from the given file; this data is then analysed and formatted for Classification, Information Retrieval, and other analysis purposes.

What the information extractor handles:

  • Extracting raw data from a variety of file formats
  • Tokenization
  • Cleaning raw data (stripping noisy characters etc.)
  • POS Tagging and parse tree generation
  • Entity recognition
  • n-gram generation

The information extractor automatically adjusts which components it uses based on the language of the document (e.g. English, French) and the format of the file passed to it (e.g. .docx, .pdf).
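As an illustration of the cleaning, tokenization and n-gram steps listed above, here is a minimal sketch in Python. The function names (`clean`, `tokenize`, `ngrams`) and the cleaning rule are assumptions made for this example; they are not Dreamcatcher's actual API.

```python
import re


def clean(text):
    """Replace noisy characters with spaces (assumed rule: keep word
    characters, whitespace and apostrophes), then collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lower-case the cleaned text and split it on whitespace."""
    return clean(text).lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Hello, world!  Hello again.")
bigrams = ngrams(tokens, 2)
```

A real extractor would swap in language-specific tokenizers; whitespace splitting is only a stand-in here.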

Collection Interface

Stores representations of the files in the user's Collection. Each object contains the data necessary for classification and retrieval tasks. The object is also responsible for managing the features associated with files, including customisable feature selection, which supports Natural Language Processing methods (e.g. filtering word types) as well as statistical methods (information gain, gain ratio etc.).
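To make the statistical feature selection concrete, the following sketch computes information gain for a single term over a small labelled set of documents. It assumes each document is represented as a set of tokens and that the feature is simply whether the term is present; the names `entropy` and `information_gain` are hypothetical and not taken from Dreamcatcher.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, labels, term):
    """Reduction in label entropy from splitting on presence of `term`.

    docs is a list of token sets, labels the matching class labels.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    remainder = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


# Made-up example data, for illustration only.
docs = [{"invoice", "total"}, {"invoice", "due"},
        {"meeting", "agenda"}, {"meeting", "total"}]
labels = ["finance", "finance", "admin", "admin"]
```

In this toy data "invoice" separates the two classes perfectly, while "total" appears equally in both and so carries no information. Terms could then be ranked by gain and the lowest-scoring ones dropped.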

What the Collection Interface handles:

  • Representing the user's Collection (both logically and in vector space)
  • Dynamically updating feature weightings as new information is added
  • Item Retrieval (with queries or other documents)
  • Customisable feature selection
  • Managing Collection for Structured Multiclass Classification
  • Splitting/Merging Collections
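One common way to represent documents in vector space and retrieve items with a query is TF-IDF weighting with cosine similarity. The sketch below assumes that approach; the source does not state which weighting scheme Dreamcatcher uses, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a file name to its token list.

    Returns (vectors, idf), where each vector maps term -> weight.
    """
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n / count) + 1.0 for t, count in df.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) \
        * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query_tokens, vectors, idf):
    """Rank file names by similarity to the query, best match first."""
    query = {t: tf * idf[t]
             for t, tf in Counter(query_tokens).items() if t in idf}
    return sorted(vectors, key=lambda name: cosine(query, vectors[name]),
                  reverse=True)


# Made-up collection, for illustration only.
collection = {
    "a.txt": ["invoice", "payment", "due"],
    "b.txt": ["meeting", "agenda", "notes"],
    "c.txt": ["payment", "meeting"],
}
vectors, idf = tfidf_vectors(collection)
ranking = retrieve(["invoice", "payment"], vectors, idf)
```

Because the IDF values depend on the whole collection, calling `tfidf_vectors` again after adding a file is the simplest way to refresh the weightings. Passing another document's tokens as the query gives document-to-document retrieval.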

Classifier Interface

This acts as a bridge between a Dreamcatcher Collection and one or more classifiers. It provides control over these classifiers through a common interface, displaying results and providing options for both manual and automatic optimisation. Currently Dreamcatcher has four classifiers to choose from.

The interface has a test function that runs the currently selected classifier on the current collection and generates a confusion matrix with an F1 score. The results can be used to tweak classifier parameters manually. They also feed the automatic optimisation function, whose objective is to maximise the F1 score.
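The sketch below shows how a confusion matrix and an F1 score can be derived from a list of actual and predicted labels. It assumes macro-averaging across classes, which the source does not specify, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def f1_for_label(matrix, label):
    """F1 score for one class, from its precision and recall."""
    tp = matrix[(label, label)]
    fp = sum(c for (a, p), c in matrix.items() if p == label and a != label)
    fn = sum(c for (a, p), c in matrix.items() if a == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(actual, predicted):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    matrix = confusion_matrix(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    return sum(f1_for_label(matrix, lab) for lab in labels) / len(labels)


# Made-up test results, for illustration only.
actual = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
score = macro_f1(actual, predicted)
```

An automatic optimiser could evaluate `macro_f1` for each candidate parameter setting and keep the setting with the highest score.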

When the Classifier is active and receives unknown items, each prediction has an associated confidence value (a percentage). To avoid false positives, a confidence threshold can be set, which prevents predictions from being acted on if their confidence falls below it.
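A confidence threshold of this kind amounts to a simple filter over predictions. The following sketch assumes each prediction is a hypothetical `(item, label, confidence)` tuple with confidence given as a percentage; the function name is illustrative only.

```python
def split_by_confidence(predictions, threshold):
    """Separate predictions to act on from those to hold back.

    predictions is an iterable of (item, label, confidence) tuples,
    with confidence as a percentage. Predictions whose confidence is
    below the threshold are held back rather than acted on.
    """
    accepted, held_back = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item, label))
        else:
            held_back.append((item, label))
    return accepted, held_back


# Made-up predictions, for illustration only.
predictions = [
    ("a.pdf", "Invoices", 92.5),
    ("b.docx", "Reports", 41.0),
    ("c.txt", "Invoices", 75.0),
]
accepted, held_back = split_by_confidence(predictions, threshold=75.0)
```

Raising the threshold trades fewer false positives for more items left unfiled.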

Classifier Interface Overview:

  • Manages the use of one or more Classifiers on a Collection (currently 4)
  • Trains and tests Classifiers (generating confusion matrix and F1 score)
  • Makes reader-friendly translations/suggestions from test results
  • Automatic and/or Manual optimisation of Classifiers from test results
  • Automatic filing based on Classifier results and user given criteria

The majority of these interfaces load their components at runtime from a list, making Dreamcatcher's processes easy to modify or extend. All of the above interfaces can be utilised via the Dreamcatcher GUI.
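Loading components at runtime from a list can be done in several ways. One minimal sketch, assuming the list holds dotted import paths, uses Python's standard `importlib`. The `COMPONENTS` list and `load_component` helper are hypothetical, and standard-library objects stand in for real Dreamcatcher components.

```python
import importlib


def load_component(dotted_path):
    """Import and return the object named by 'package.module.attribute'."""
    module_name, _, attribute = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Hypothetical component list; stdlib objects are used as stand-ins so
# the example runs without any Dreamcatcher code installed.
COMPONENTS = ["collections.Counter", "math.sqrt"]
loaded = [load_component(path) for path in COMPONENTS]
```

With this design, extending the toolkit means adding one entry to the list rather than editing the code that uses the components.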

Dreamcatcher GUI

Dreamcatcher's GUI can be used to graphically build and manage collections of files (in a similar fashion to file explorers). Folders or Categories are created, which represent the possible types/classes that the classifier can allocate incoming files to.
The GUI is split into two sections: the Collection View, and the Training and Auto-filing interface.

The Collection View is an interface for importing, viewing, searching for, and managing files. The Collection Browser supports multiple view widgets (that can be used concurrently) which let you view a Collection in different ways (e.g. Tree view, icon view).

The Training and Auto-filing interface can be used to review and manage the information drawn from your files that will be used for Classification.

Training & Auto-filing interface picture

A more in-depth view of what you can do with the GUI can be found here.