OEDP Automation Strategy
Revision as of 16:08, 19 February 2024
There are five steps in going from old paper encyclopedias to encyclopedia articles on Oldpedia (or other Encyclosphere sites).
- Scan the books: go from paper to image-only PDF. This has already been done for well over a hundred volumes, more than enough to keep us busy.
- OCR the PDFs: go from image-only PDF to text-recognized PDF. An ABBYY FineReader OCR Editor (AFR) step.
- Output text or HTML: go from text-recognized PDF to TXT and HTML. Another AFR step.
- Markup: go from either TXT or HTML to a file marked up with <article> as well as markup for title. This step is performed by a combination of hand and code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
- Make ZWI: go from marked up file to ZWI file. This step is performed by a combination of hand and code (in the old ZWIFormat system) or, for one encyclopedia so far, by code (in the new AutoZWI system).
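As a rough illustration of what the code half of the Markup step might look like, here is a minimal Python sketch. It is not the ZWIFormat or AutoZWI code. It assumes (hypothetically) that the TXT output separates paragraphs with blank lines and that each article opens with an all-caps headword followed by a comma or period. The `<title>` element stands in for the title markup, whose actual form is not specified above, and the `<p>` wrappers are likewise an assumption.

```python
import re
from html import escape

# Assumed heuristic (not from AutoZWI): an article begins at a paragraph
# that opens with an all-caps headword followed by a comma or period.
HEADWORD = re.compile(r"^([A-Z][A-Z' -]+)[,.]")


def mark_up(text: str) -> str:
    """Wrap each detected article in <article> tags with a <title> element."""
    articles = []  # list of (title, [paragraphs])
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse line breaks left by OCR
        if not para:
            continue
        match = HEADWORD.match(para)
        if match:
            # A new headword starts a new article.
            articles.append((match.group(1).title(), [para]))
        elif articles:
            # Continuation paragraph of the current article.
            articles[-1][1].append(para)
        # Paragraphs before the first headword (front matter) are skipped.
    out = []
    for title, paras in articles:
        body = "\n".join(f"<p>{escape(p)}</p>" for p in paras)
        # <title> is a hypothetical stand-in for the real title markup.
        out.append(f"<article>\n<title>{escape(title)}</title>\n{body}\n</article>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "ABACUS, a counting frame.\n\nIt was used widely.\n\nABBEY, a monastery."
    print(mark_up(sample))
```

A headword heuristic like this will differ from one encyclopedia to the next, which may be part of why the step has so far needed a combination of hand and code.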