
The MEDITE Project

MEDITE is text comparison software that grew out of a collaboration between literary scholars and AI researchers.

More precisely, the EDITE project, funded by the CNRS information society program, brought together the Institut des Textes et Manuscrits Modernes ([http://www.item.ens.fr/ ITEM]) and the [http://www-acasa.lip6.fr ACASA] team, a member of the [http://www.lip6.fr LIP6] laboratory. MEDITE, i.e. the EDITE Machine, was originally built to meet the needs of textual genetic criticism. At its core it is simply an efficient monolingual aligner, but it has since proved useful in many other applications (scholarly publishing, machine translation, computerized epistemology, etc.).

Textual genetic criticism is a discipline that studies the drafts authors produce during the writing process. MEDITE's first aim was to align two linearized transcriptions of such drafts (two texts) in order to expose the invariants and differences between them. We discovered that, for texts with many repetitions, existing aligners (version comparison tools) failed to produce correct alignments. This is due to masking phenomena, which appear when the pairing of two text blocks masks, and therefore prevents, the pairing of other identical blocks. MEDITE addresses this problem.

MEDITE is built on an original sequence alignment algorithm based on the conceptual framework of edit distance with moves. It detects deleted, inserted, replaced, moved and invariant character blocks, and aligns the last three block types pairwise. The first step of the algorithm detects maximal exact matches (MEMs): homologies between the two texts that cannot be extended to the left or to the right without losing identity. MEMs are either invariant or moved blocks, and are identified by exploring the space of possible alignments with an A* procedure that minimizes the size of the moved blocks. The whole process is then applied recursively between each pair of aligned invariant blocks in order to detect smaller blocks and to avoid masking phenomena. Finally, since deleted, inserted and replaced blocks are non-repeated blocks, they are deduced from the alignment.
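The first two ideas in this description can be illustrated with a minimal Python sketch. It is not MEDITE's code: the function names `maximal_exact_matches` and `split_invariant_moved` are hypothetical, MEM detection uses a plain quadratic dynamic program, and the A* search is replaced by a much simpler stand-in that keeps the heaviest order-preserving chain of blocks as invariants and treats the rest as moved. The recursive refinement step is left out.

```python
def maximal_exact_matches(a, b, min_len=2):
    """Return (start_a, start_b, length) for every maximal exact match.

    Illustrative O(len(a) * len(b)) scan, not MEDITE's implementation.
    """
    matches = []
    # prev[j] = length of the longest common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # already maximal to the left
                at_end = i == len(a) or j == len(b)
                # keep the match only if it cannot be extended to the right
                if (at_end or a[i] != b[j]) and cur[j] >= min_len:
                    n = cur[j]
                    matches.append((i - n, j - n, n))
        prev = cur
    return matches


def split_invariant_moved(blocks):
    """Split non-overlapping (start_a, start_b, length) blocks.

    Stand-in for the A* step (an assumption of this sketch): the chain of
    blocks that keeps the same order in both texts and has the largest
    total length is called invariant; every other block is called moved.
    """
    blocks = sorted(blocks)
    n = len(blocks)
    best = [blk[2] for blk in blocks]  # best chain weight ending at k
    back = [-1] * n                    # predecessor of k in that chain
    for k in range(n):
        ka, kb, klen = blocks[k]
        for p in range(k):
            pa, pb, plen = blocks[p]
            # p may precede k only if it ends before k starts in both texts
            if pa + plen <= ka and pb + plen <= kb and best[p] + klen > best[k]:
                best[k] = best[p] + klen
                back[k] = p
    invariant = []
    k = max(range(n), key=best.__getitem__) if n else -1
    while k != -1:
        invariant.append(blocks[k])
        k = back[k]
    invariant.reverse()
    moved = [blk for blk in blocks if blk not in invariant]
    return invariant, moved


old, new = "the cat sat", "sat the cat"
mems = maximal_exact_matches(old, new, min_len=4)   # [(0, 4, 7)], i.e. "the cat"
invariant, moved = split_invariant_moved([(0, 4, 7), (8, 0, 3)])
# invariant == [(0, 4, 7)]  ("the cat"), moved == [(8, 0, 3)]  ("sat")
```

In this toy example "the cat" is kept as the invariant block and "sat" is reported as moved, because that choice leaves the smallest amount of text classified as moved.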

To visualize the results, the two texts are presented side by side in a two-panel GUI. Deletions, insertions and replacements are highlighted in a specific color. Moves are underlined, which makes it possible to see, for instance, moves inside insertions. Invariants stay black on white. Pairwise aligned blocks are linked together, and a simple mouse click aligns them side by side.

MEDITE has been compared with other version comparison tools, the best known being the one built into Microsoft Word. None of them except MEDITE was able to align difficult texts correctly and to overcome masking phenomena. Furthermore, their visualization interfaces are often poorly suited to the task, which makes the results harder to interpret.

Because our algorithm is sequence based, it is language independent and can process any language without specific resources. For instance, it can process Arabic texts. Furthermore, it is character based, so it detects modifications inside words, which is very useful for inflected languages. The detection of moves is a kind of knowledge discovery, as it exposes block correlations between the texts.
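The benefit of working at the character level can be shown with a small sketch. It uses Python's standard `difflib.SequenceMatcher` rather than MEDITE, and the helper name `changed_spans` is hypothetical; the point is only to contrast what a word-based and a character-based comparison report for the same inflectional change.

```python
from difflib import SequenceMatcher


def changed_spans(old, new):
    """Return (tag, old_part, new_part) for every non-equal opcode."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]


old, new = "she walked home", "she walking home"

# Word level: the whole word is reported as replaced.
word_level = changed_spans(old.split(), new.split())
# [('replace', ['walked'], ['walking'])]

# Character level: only the changed ending is reported.
char_level = changed_spans(old, new)
# [('replace', 'ed', 'ing')]
```

The character-level comparison isolates the suffix change, while the word-level comparison can only say that one word replaced another.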

MEDITE is now used by philologists working in textual genetic criticism and by epistemologists working on the history of ideas. It enables them to study longer texts and to discover, more systematically, the transformations between an author's drafts. They can then establish a diachronic corpus of an author's oeuvre. We now plan to embed MEDITE in digital library platforms.

Interested people can visit the MEDITE web site (in French): [http://www-poleia.lip6.fr/~acasa/MEDITE MEDITE]

References

  1. Fenoglio I., Ganascia J.-G.: "MEDITE: un logiciel pour l'approche comparative de documents de genèse", Revue Genesis, pp. 166-168, 2007 (in French) ([attachment:Genesis pdf])
  2. Ganascia J.-G., Bourdaillet J.: "Alignements unilingues avec MEDITE", Actes des Huitièmes Journées Internationales d'Analyse Statistique des Données Textuelles, 2006 (in French)
  3. Ganascia J.-G., Fenoglio I., Lebrave J.-L.: "Manuscrits, genèse et documents numérisés. EDITE : une étude informatisée du travail de l'écrivain", revue Document numérique, special issue on « temps et document », 2005 (in French) ([attachment:Document%20Numerique%202004 pdf])
  4. Ganascia J.-G.: "On the Supposed Neo-Structuralism of Hypertext", Diogenes N°196, September 2002, Issue 4, Blackwell Publishing Ltd.
  5. Ganascia J.-G.: "EDITE-MEDITE, un passage des versions aux variantes", actes du XIVième congrès International de Linguistique et de Philologie Romanes, August 2004, Aberystwyth, Wales, United Kingdom, Max Niemeyer Verlag, September 2007 ([attachment:CILPR%202005 pdf])
  6. Bourdaillet J., Ganascia J.-G.: "Alignements monolingues avec déplacements", 14e Conférence sur le Traitement Automatique des Langues Naturelles (in French)
  7. Bourdaillet J., Ganascia J.-G., Fenoglio I.: "Machine Assisted Study of Writers' Rewriting Processes", 4th International Workshop on Natural Language Processing and Cognitive Science (NLPCS), Madeira, Portugal
  8. Bourdaillet J., Ganascia J.-G.: "Practical block sequence alignment with moves", LATA 2007, International Conference on Language and Automata Theory and Applications, 30 March – April 2007 ([attachment:LATA%202007 pdf])
  9. Bourdaillet J., Ganascia J.-G.: "Alignment of Noisy Unstructured Text Data", IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, January 8, 2007

Software

The MEDITE software can be freely downloaded here:

  • [attachment:MEDITE_version3.4 Version 3.4]
  • [attachment:MEDITE_version3.7 Version 3.7]

For more information about MEDITE, please contact [mailto:Julien.Bourdaillet@lip6.fr Julien Bourdaillet] or [mailto:Jean-Gabriel.Ganascia@lip6.fr Jean-Gabriel Ganascia].

Medite_Project (last edited 2015-11-12 07:24:52 by GustaveGanascia)