OEDP Automation Strategy

The steps to turn old encyclopedias into ZWI-driven web pages

There are five steps to go from old paper encyclopedias to encyclopedia articles on Oldpedia (or other Encyclosphere sites).

  1. Scan the books: go from paper to image-only PDF. This has already been done for well over a hundred volumes, more than enough to keep us busy.
  2. OCR the PDFs: go from image-only PDF to text-recognized PDF. An ABBYY FineReader OCR Editor (AFR) step.
  3. Output text or HTML: go from text-recognized PDF to TXT and HTML. Another AFR step.
  4. Markup: go from either TXT or HTML to a file marked up with <article> tags as well as markup for the title (an illustrative example of this markup follows the list). This step is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, entirely by code (in the new AutoZWI system).
  5. Make ZWI (and upload): go from the marked-up file to a ZWI file. Like step 4, this is performed partly by hand and partly by code (in the old ZWIFormat system) or, for one encyclopedia so far, entirely by code (in the new AutoZWI system).
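
For concreteness, here is a hypothetical sketch of what the marked-up file from step 4 might look like for a single short article. The tag names (<article>, <span class="title">, <p>) are the ones the AutoZWI scripts described below rely on; the article text itself is only a placeholder.

  <article>
  <span class="title">Aachen</span>
  <p>First paragraph of the article text …</p>
  <p>Second paragraph …</p>
  </article>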

While a fully automated system certainly sounds better, steps 1-3 cannot be automated yet, and encyclopedias with longer articles, especially those with complex features such as elaborate styling, footnotes, tables, and pictures, cannot be handled fully automatically.

The only type of encyclopedia that is a candidate for full automation is the sort with relatively short articles and relatively few images (although it seems that fewer than 50 or 100 plates could be processed by hand). It is these encyclopedias that the rest of this page deals with.

How to automatically ZWI-ify simpler encyclopedias

With AutoZWI, several tasks are complete, several remain to do, and several questions are still open.

Completed and partially completed tasks:

  • text_cleaner.rb inputs foo.txt and outputs foo-cu.txt (for "cleaned-up") in the foo/ directory (where 'foo' is the metadata's Publisher code). This takes a TXT file (not HTML) and fixes some common problems with OCR output, such as spaces between quotation marks and the words they govern; a minimal sketch of this sort of cleanup appears after this list. It also creates a new foo/ directory if one does not yet exist. Since the input is plain text rather than HTML, the output does not have any italics, bold, or small caps.
  • firstline_finder.rb inputs a foo-cu.txt file and outputs foo-mu.txt (for "marked-up"), also in the foo/ directory. This takes the cleaned-up TXT file and adds <article>, <span class="title">, and <p> tags.
  • htmlizer.rb takes the foo-mu.txt (marked-up) file and splits it into individual HTML files, placing them in a new foo/html/ directory (see the splitting sketch after this list). For example, one of the 3000+ HTML files autogenerated this way was saved at ccrk/html/Aachen.html.
  • zwify.rb iterates over the contents of foo/html/, prepares article.html, article.txt, media.json, metadata.json, and signature.json for each individual article, putting the results in a subdirectory of foo/zwicontent/; for example, ccrk/zwicontent/Aachen/. It then iterates over the zwicontent/ subdirectories and produces ZWI files from them, placing the files in foo/zwi/; for example, ccrk/zwi/Aachen.zwi (a sketch of this packaging step follows the list).
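
The following is a minimal, hypothetical sketch of the kind of cleanup text_cleaner.rb performs; it is not the actual script, only an illustration of one heuristic (removing stray spaces inside curly quotation marks) plus the directory and file-naming conventions described above.

  # clean_sketch.rb -- illustrative only, not the real text_cleaner.rb
  require 'fileutils'

  pub = 'foo'                                  # the metadata's Publisher code
  FileUtils.mkdir_p(pub)                       # create foo/ if it does not yet exist

  text = File.read("#{pub}.txt", encoding: 'UTF-8')

  # Remove stray spaces that OCR inserts between quotation marks and the
  # words they govern, e.g. “ word ” becomes “word”.
  text = text.gsub(/“\s+/, '“').gsub(/\s+”/, '”')

  File.write(File.join(pub, "#{pub}-cu.txt"), text)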
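
Likewise, a sketch of the splitting step, assuming each article in foo-mu.txt is wrapped in <article>…</article> and begins with a <span class="title"> element. The real htmlizer.rb presumably does more, but the core idea is:

  # split_sketch.rb -- illustrative only, not the real htmlizer.rb
  require 'fileutils'

  pub    = 'foo'
  marked = File.read(File.join(pub, "#{pub}-mu.txt"), encoding: 'UTF-8')
  outdir = File.join(pub, 'html')
  FileUtils.mkdir_p(outdir)

  marked.scan(%r{<article>.*?</article>}m).each do |article|
    # Use the title span as the file name, e.g. foo/html/Aachen.html.
    title = article[%r{<span class="title">(.*?)</span>}m, 1].to_s.strip
    next if title.empty?
    safe  = title.gsub(%r{[/\\]}, '-')         # keep the name filesystem-safe
    File.write(File.join(outdir, "#{safe}.html"), article)
  end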
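
Finally, a sketch of the packaging pass, assuming a ZWI file is essentially a zip archive of the files in one zwicontent/ subdirectory and using the rubyzip gem. Generating article.html, article.txt, media.json, metadata.json, and signature.json (including signing) is omitted; this only shows turning each prepared subdirectory into a .zwi file.

  # pack_sketch.rb -- illustrative only, not the real zwify.rb (requires the rubyzip gem)
  require 'zip'
  require 'fileutils'

  pub    = 'foo'
  zwidir = File.join(pub, 'zwi')
  FileUtils.mkdir_p(zwidir)

  Dir.glob(File.join(pub, 'zwicontent', '*')).select { |d| File.directory?(d) }.each do |dir|
    title    = File.basename(dir)                      # e.g. "Aachen"
    zwi_path = File.join(zwidir, "#{title}.zwi")       # e.g. foo/zwi/Aachen.zwi
    FileUtils.rm_f(zwi_path)                           # rebuild from scratch each run
    Zip::File.open(zwi_path, create: true) do |zwi|
      Dir.glob(File.join(dir, '*')).each do |file|
        zwi.add(File.basename(file), file)             # article.html, metadata.json, ...
      end
    end
  end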

Tasks to do:

  • Improve the accuracy of the heuristics used to individuate articles, and make other small fixes.
  • Actually push to Oldpedia.