OEDP Automation Strategy

From Encyclosphere Project Wiki
Revision as of 16:08, 19 February 2024 by Lsanger (talk | contribs) (Finished describing five steps)

There are five steps to going from old paper encyclopedias to encyclopedia articles on Oldpedia (or other Encyclosphere sites).

  1. Scan the books: go from paper to image-only PDF. This has already been done for well over a hundred volumes, more than enough to keep us busy.
  2. OCR the PDFs: go from image-only PDF to text-recognized PDF. This is done in ABBYY FineReader OCR Editor (AFR).
  3. Output text or HTML: go from text-recognized PDF to TXT and HTML. This is also done in AFR.
  4. Markup: go from either TXT or HTML to a file in which each article is wrapped in <article> tags and its title is marked up. This step is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
  5. Make ZWI: go from marked-up file to ZWI file. This step is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
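
To make step 4 concrete, here is a minimal Python sketch of what code-based markup could look like for TXT input. It is not the ZWIFormat or AutoZWI code, and it rests on several assumptions that are not from this page: that each article begins with an all-caps headword followed by a comma or period, that paragraphs are separated by blank lines, and that the title is marked with a <title> element (the actual title markup is not specified here). The function name mark_up is likewise hypothetical.

```python
import html
import re

# Assumption: a headword is a run of capital letters (plus spaces, hyphens,
# or apostrophes) at the start of a paragraph, followed by a comma or period.
HEADWORD = re.compile(r"^([A-Z][A-Z' -]+)[,.]")


def mark_up(text: str) -> str:
    """Wrap each headword-led group of paragraphs in <article> tags."""
    articles = []  # list of (title, [paragraphs])
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # undo hard line wraps in the TXT output
        if not para:
            continue
        match = HEADWORD.match(para)
        if match:
            # A new headword starts a new article.
            articles.append((match.group(1).strip(), [para]))
        elif articles:
            # Otherwise the paragraph continues the current article.
            articles[-1][1].append(para)
        # Paragraphs before the first headword (front matter) are skipped.

    out = []
    for title, paras in articles:
        body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paras)
        # <title> is a hypothetical choice of title markup.
        out.append(
            f"<article>\n<title>{html.escape(title)}</title>\n{body}\n</article>"
        )
    return "\n".join(out)


if __name__ == "__main__":
    sample = "ABACUS, a counting\nframe.\n\nIt uses beads.\n\nABBEY, a monastery."
    print(mark_up(sample))
```

A headword rule this simple will misfire on real OCR output (running heads, all-caps cross-references, misrecognized letters), which is one reason a step like this may still need hand correction for most encyclopedias.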