Database

From Encyclosphere Project Wiki

A database is a collection of ZWI files, usually hosted by an aggregator. More documentation can be found at this link.

Structure

Here's what the file tree of a typical ZWI database looks like:

database/
└── en <————
    ├── examplepedia
    │   ├── en.examplepedia.org
    │   │   └── wiki#Example_article.zwi
    │   └── en.examplepedia.org.csv.gz
    ├── trash
    │   └── examplepedia
    │       ├── en.examplepedia.org
    │       │   └── wiki#Deleted_article.zwi
    │       └── en.examplepedia.org.csv.gz
    └── index.csv.gz

The top level of this database contains an en folder; en is the two-letter ISO 639-1 language code for English. (Here's a handy list of language codes.) The top level has one such folder for each language, named by its two-letter code and containing all of the articles in that language.

database/
└── en
    ├── examplepedia <————
    │   ├── en.examplepedia.org
    │   │   └── wiki#Example_article.zwi
    │   └── en.examplepedia.org.csv.gz
    ├── trash
    │   └── examplepedia
    │       ├── en.examplepedia.org
    │       │   └── wiki#Deleted_article.zwi
    │       └── en.examplepedia.org.csv.gz
    └── index.csv.gz

In each language folder (e.g. en), there is a folder for each publisher, containing the articles from that publisher.

database/
└── en
    ├── examplepedia
    │   ├── en.examplepedia.org
    │   │   └── wiki#Example_article.zwi
    │   └── en.examplepedia.org.csv.gz
    ├── trash <————
    │   └── examplepedia
    │       ├── en.examplepedia.org
    │       │   └── wiki#Deleted_article.zwi
    │       └── en.examplepedia.org.csv.gz
    └── index.csv.gz

Also in the language folder, there's a special trash folder, containing ZWI files marked for deletion. The trash folder has the same structure as the parent language folder.
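As a rough illustration of how the trash folder mirrors the language folder, the sketch below maps an article's location to the matching location under trash. The function name trash_path is hypothetical, and the sketch assumes paths are given relative to the database root, in the form language/publisher/domain/file.

```python
from pathlib import PurePosixPath


def trash_path(article_path: str) -> PurePosixPath:
    """Map <lang>/<publisher>/... to <lang>/trash/<publisher>/...

    Assumes article_path is relative to the database root.
    """
    parts = PurePosixPath(article_path).parts
    if len(parts) < 2:
        raise ValueError("expected a path that starts with a language folder")
    if parts[1] == "trash":
        raise ValueError("path is already inside the trash folder")
    language, rest = parts[0], parts[1:]
    # Insert "trash" directly under the language folder; the rest of the
    # path (publisher, domain, file name) is kept unchanged.
    return PurePosixPath(language, "trash", *rest)


print(trash_path("en/examplepedia/en.examplepedia.org/wiki#Deleted_article.zwi"))
```

This only computes where a trashed file would sit; how and when an aggregator actually moves files is not covered here.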

database/
└── en
    ├── examplepedia
    │   ├── en.examplepedia.org <————
    │   │   └── wiki#Example_article.zwi
    │   └── en.examplepedia.org.csv.gz
    ├── trash
    │   └── examplepedia
    │       ├── en.examplepedia.org
    │       │   └── wiki#Deleted_article.zwi
    │       └── en.examplepedia.org.csv.gz
    └── index.csv.gz

In each publisher folder (e.g. examplepedia), there is a folder for each domain used by the publisher, containing the articles hosted on that domain. In this example, the only domain is en.examplepedia.org.

database/
└── en
    ├── examplepedia
    │   ├── en.examplepedia.org
    │   │   └── wiki#Example_article.zwi <————
    │   └── en.examplepedia.org.csv.gz
    ├── trash
    │   └── examplepedia
    │       ├── en.examplepedia.org
    │       │   └── wiki#Deleted_article.zwi
    │       └── en.examplepedia.org.csv.gz
    └── index.csv.gz

The domain folders (e.g. en.examplepedia.org) contain the ZWI files themselves.
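To make the language/publisher/domain layout concrete, here is a minimal sketch that walks a database directory and reports where each live article sits. The names ArticleLocation and iter_articles are hypothetical, and the demo rebuilds the example tree in a temporary directory rather than reading a real database.

```python
import tempfile
from pathlib import Path
from typing import Iterator, NamedTuple


class ArticleLocation(NamedTuple):
    language: str
    publisher: str
    domain: str
    filename: str


def iter_articles(database_root: str) -> Iterator[ArticleLocation]:
    """Yield the location of every live (non-trashed) ZWI file."""
    root = Path(database_root)
    # Live articles sit exactly four levels down:
    # <language>/<publisher>/<domain>/<name>.zwi
    # Trashed files sit one level deeper, under <language>/trash/,
    # so this pattern does not reach them.
    for zwi in sorted(root.glob("*/*/*/*.zwi")):
        language, publisher, domain, filename = zwi.relative_to(root).parts
        if publisher == "trash":
            # Defensive check: trash is a special folder, not a publisher.
            continue
        yield ArticleLocation(language, publisher, domain, filename)


# Demo: rebuild the example tree in a temporary directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp, "en", "examplepedia", "en.examplepedia.org")
    trashed = Path(tmp, "en", "trash", "examplepedia", "en.examplepedia.org")
    for folder, name in [(live, "wiki#Example_article.zwi"),
                         (trashed, "wiki#Deleted_article.zwi")]:
        folder.mkdir(parents=True)
        (folder / name).touch()
    found = list(iter_articles(tmp))

print(found)
```

In the demo, only the file under en/examplepedia is reported; the file under en/trash is left out.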

The names of the ZWI files start with the part of the SourceURL after the domain, with the slashes replaced by pound signs (#). (SourceURL is a field in the ZWI file's metadata.json.)

Any pound signs already present in the URL are escaped by doubling them, so # becomes ##.

Finally, .zwi is added.
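The naming rules above can be sketched in a few lines. This is an illustration, not the reference implementation. The function name zwi_filename and the https:// example URLs are hypothetical. The sketch also makes three assumptions: the slash right after the domain is dropped (as in wiki#Example_article.zwi in the tree), existing pound signs are escaped before slashes are replaced, and nothing else is inserted before the .zwi extension.

```python
def zwi_filename(source_url: str) -> str:
    """Derive a ZWI file name from a SourceURL (sketch under stated assumptions)."""
    # Drop the scheme ("https://") if there is one.
    _scheme, sep, rest = source_url.partition("://")
    if not sep:
        rest = source_url
    # Drop the domain and the slash that follows it.
    _domain, _, path = rest.partition("/")
    # Escape existing pound signs first, so they stay distinguishable
    # from the pound signs that replace slashes in the next step.
    escaped = path.replace("#", "##")
    return escaped.replace("/", "#") + ".zwi"


print(zwi_filename("https://en.examplepedia.org/wiki/Example_article"))
# wiki#Example_article.zwi
print(zwi_filename("https://en.examplepedia.org/wiki/Example_article#History"))
# wiki#Example_article##History.zwi
```

The URL is split by hand rather than with a URL parser, because a parser would treat everything after # as a separate fragment, while the rule above applies to any pound sign in the URL.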