OEDP Automation Strategy: Difference between revisions
(Where I'm at with AutoZWI)
Revision as of 17:23, 19 February 2024
The steps to turn old encyclopedias into ZWI-driven web pages
There are five steps to go from old paper encyclopedias to encyclopedia articles on Oldpedia (or other Encyclosphere sites).
- Scan the books: go from paper to image-only PDF. This has already been done for well over a hundred volumes, more than enough to keep us busy.
- OCR the PDFs: go from image-only PDF to text-recognized PDF. An ABBYY FineReader OCR Editor (AFR) step.
- Output text or HTML: go from text-recognized PDF to TXT and HTML. Another AFR step.
- Markup: go from either TXT or HTML to a file marked up with <article> as well as markup for title. This step is performed by a combination of hand and code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
- Make ZWI (and upload): go from marked up file to ZWI file. This step is performed by a combination of hand and code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
While a fully automated system certainly sounds better, the first three steps cannot be automated yet. Encyclopedias with longer articles, especially those with complex features such as elaborate styling, footnotes, tables, and pictures, cannot be handled fully automatically either.
The only type of encyclopedia that is a candidate for full automation is the sort with relatively short articles and relatively few images (although it seems that fewer than 50 or 100 plates could be processed by hand). It is these encyclopedias that the rest of this page will deal with.
How to automatically ZWI-ify simpler encyclopedias
With AutoZWI, there are several tasks complete, several to do, and several open questions.
Completed and partially completed tasks:
- text_cleaner.rb inputs foo.txt and outputs foo-cu.txt (for "cleaned-up"). This takes a TXT file (not HTML) and fixes some common problems with OCR output, such as spaces between quotation marks and the words they govern. It also creates a new /foo directory if one does not yet exist. Since the file is plain text only, it does not have any italics, bold, or small caps.
- firstline_finder.rb inputs a foo-cu.txt file and outputs foo-mu.txt (for "marked-up"). This takes the cleaned-up TXT file and adds both <article> and (soon) <span class="title"> tags.
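To make the clean-up step concrete, here is a minimal sketch of the kind of thing text_cleaner.rb is described as doing. It is not the actual script. The helper names (tighten_quotes, cleaned_path, clean_file) are hypothetical, and so is the regular expression. The sketch also assumes that the directory is created next to the input file, that quotation marks are straight double quotes, and that they are balanced within each line.

```ruby
require "fileutils"

# Hypothetical helper: remove stray spaces just inside straight double
# quotes, so that `" word "` becomes `"word"`. Assumes the quotes on a
# line are balanced; an unpaired quote would cause mispairing.
def tighten_quotes(line)
  line.gsub(/"\s*([^"]*?)\s*"/) { "\"#{$1}\"" }
end

# Hypothetical helper: derive the "-cu" (cleaned-up) file name that sits
# beside the input, e.g. books/foo.txt -> books/foo-cu.txt.
def cleaned_path(input_path)
  base = File.basename(input_path, ".txt")
  File.join(File.dirname(input_path), "#{base}-cu.txt")
end

# Hypothetical driver: make the per-book directory if it is missing,
# clean every line, and write the cleaned-up file. Returns the output path.
def clean_file(input_path)
  base = File.basename(input_path, ".txt")
  # Assumption: the book directory lives next to the input file.
  FileUtils.mkdir_p(File.join(File.dirname(input_path), base))

  cleaned = File.readlines(input_path, chomp: true).map { |l| tighten_quotes(l) }
  out = cleaned_path(input_path)
  File.write(out, cleaned.join("\n") + "\n")
  out
end
```

Calling clean_file("foo.txt") under these assumptions would create a foo directory beside the input if it is missing and write the cleaned text to foo-cu.txt in the same place. Real OCR output also contains curly quotes and quotations that span lines, which this sketch does not attempt to handle.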
Tasks to do:
- Actually do the <span class="title"> markup. I seem to have made an excellent set of rules for this purpose, so I'm all but ready to add this to firstline_finder.rb.
- Do a test run through the entire ccrk-mu.txt output, make necessary changes to ccrk.txt, and see if the changes made by hand have the preferred effect.
- Examine ZWIFormat again and decide whether