Revision 1 as of 2008-04-18 09:07:27

The MEDITE Project

MEDITE is a powerful text comparison program that grew out of a collaboration between literary scholars and AI researchers.

Textual genetic criticism is a discipline that studies the drafts authors produce during the writing process. MEDITE’s first aim was to align two linearized transcriptions of such drafts (two texts) in order to expose the invariants and differences between them. We found that, for texts with many repetitions, existing aligners (version comparison tools) fail to produce correct alignments. This is due to masking phenomena, which appear when the pairing of two text blocks masks, and therefore prevents, the pairing of other identical blocks. MEDITE addresses this problem.

MEDITE is built on an original sequence alignment algorithm based on the conceptual framework of edit distance with moves. It detects deleted, inserted, replaced, moved and invariant character blocks, and aligns the last three block types pairwise. The first step of the algorithm detects maximal exact matches (MEMs): homologies between the two texts that cannot be extended to the left or to the right without losing identity. MEMs are either invariant or moved blocks; they are identified as one or the other by exploring the space of possible alignments with an A* procedure that minimizes the size of the moved blocks. The whole process is then applied recursively between each pair of aligned invariant blocks in order to detect smaller blocks and to avoid masking phenomena. Finally, since deleted, inserted and replaced blocks are non-repeated blocks, they are deduced from the alignment.

To visualize the results, the two texts are presented side by side in a two-panel GUI. Deletions, insertions and replacements are each highlighted in a specific color. Moves are underlined, which makes it possible to see, for instance, moves inside insertions. Invariants stay black on white. Pairwise aligned blocks are linked together, and a simple mouse click aligns them side by side.

MEDITE has been compared with other version comparison tools, the best known being the one built into Microsoft Word. None of them except MEDITE was able to align difficult texts correctly and to overcome masking phenomena. Furthermore, their visualization interfaces are often poorly suited to the task, which makes the results harder to understand.

Because our algorithm is sequence-based, it is language-independent and can process any language without specific resources. For instance, it can process Arabic texts. Furthermore, it is character-based, so it detects modifications inside words, which is very useful for inflected languages. Move detection is a kind of knowledge discovery, as it exposes block correlations between the texts.
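To illustrate what character-level comparison buys for inflected languages, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher` as a stand-in. It is not MEDITE's algorithm (it has no notion of moves), and the helper name `char_ops` is hypothetical. It only shows that comparing character sequences, rather than word tokens, isolates the unchanged stem of a word from its modified ending.

```python
from difflib import SequenceMatcher


def char_ops(a: str, b: str):
    """List (tag, text_in_a, text_in_b) edit operations at character level.

    Hypothetical helper; difflib stands in for a real aligner here.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    # Two inflected forms of the French verb "chanter" share the stem "chant".
    for tag, old, new in char_ops("chantait", "chantera"):
        print(tag, repr(old), repr(new))
```

A word-based comparison would report the whole word as replaced. The character-based view keeps the shared stem as an invariant block and confines the change to the ending.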

MEDITE is now used by philologists in textual genetic criticism and by epistemologists in the history of ideas. It enables them to study longer texts and to discover, more systematically, the transformations between an author’s drafts. They can then establish a diachronic corpus of an author’s oeuvre. We now plan to embed MEDITE in digital library platforms.

Anyone interested can get a free version of the software from the following web site: [http://www-poleia.lip6.fr/~acasa/MEDITE MEDITE]