OEDP Automation Strategy

From Encyclosphere Project Wiki
Revision as of 16:16, 19 February 2024 by Lsanger (talk | contribs) (Draft save)

There are five steps to go from old paper encyclopedias to encyclopedia articles on Oldpedia (or other Encyclosphere sites).

  1. Scan the books: go from paper to image-only PDF. This has already been done for well over a hundred volumes, more than enough to keep us busy.
  2. OCR the PDFs: go from image-only PDF to text-recognized PDF. An ABBYY FineReader OCR Editor (AFR) step.
  3. Output text or HTML: go from text-recognized PDF to TXT and HTML. Another AFR step.
  4. Mark up: go from either TXT or HTML to a file marked up with <article> tags, as well as markup for each article's title. This step is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, entirely by code (in the new AutoZWI system).
  5. Make ZWI (and upload): go from the marked-up file to ZWI files. This step is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, entirely by code (in the new AutoZWI system).
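
To make step 4 concrete, here is a minimal Python sketch of what automated markup might look like for an encyclopedia with short articles. It is not the AutoZWI code. It assumes that the TXT output separates paragraphs with blank lines and that every article begins with an all-capitals headword followed by a comma or period; both are assumptions about the OCR output, not documented behavior. The function name mark_up, the HEADWORD pattern, and the use of <h1> for the title are hypothetical choices made for illustration.

```python
import re
from html import escape

# Assumption: an article starts with an all-capitals headword (letters,
# spaces, hyphens, apostrophes) followed by a comma or period.
HEADWORD = re.compile(r"^([A-Z][A-Z' -]+)[,.]")


def mark_up(text: str) -> str:
    """Wrap each headword-led group of paragraphs in <article> tags."""
    articles = []  # list of (title, [paragraphs])
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # undo hard line wraps from the OCR output
        if not para:
            continue
        match = HEADWORD.match(para)
        if match:
            # A new headword opens a new article.
            articles.append((match.group(1).strip(), [para]))
        elif articles:
            # A paragraph without a headword continues the current article.
            articles[-1][1].append(para)
        # Paragraphs before the first headword (front matter) are skipped.

    chunks = []
    for title, paras in articles:
        body = "\n".join(f"<p>{escape(p)}</p>" for p in paras)
        # <h1> as the title markup is a hypothetical choice for this sketch.
        chunks.append(f"<article>\n<h1>{escape(title)}</h1>\n{body}\n</article>")
    return "\n".join(chunks)


sample = "ABACUS, a counting frame.\n\nIt was widely used.\n\nABBEY, a monastery."
print(mark_up(sample))
```

A heuristic like this only has a chance of working on encyclopedias with short, uniformly formatted entries. Headwords in mixed case, footnotes, tables, or plates would each need their own handling or hand correction.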

While a fully automated system certainly sounds better, steps 1-3 cannot be automated yet. In addition, encyclopedias with longer articles, and especially those with complex features such as elaborate styling, footnotes, tables, and pictures, cannot be handled fully automatically.

The only type of encyclopedia that is a candidate for full automation is the sort with relatively short articles and relatively few images (although it seems that fewer than 50 or 100 plates could be processed by hand). It is these encyclopedias that the rest of this page will deal with.