# EncycloCrawler

## Overview
EncycloCrawler is a program that crawls encyclopedias and generates a database of ZWI files. It is written in Java and uses Crawler4j to crawl websites. More documentation can be found at this link.
## Adding encyclopedias to EncycloCrawler
Adding an encyclopedia to EncycloCrawler requires basic knowledge of JSON and CSS selectors. Depending on the complexity of the pages you're crawling, you may also need to know Java.
The file describing a crawler is known as a definition. Crawler definitions are written in JSON.
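The selectors in a definition are ordinary CSS selectors. As an illustration, the selectors from the `standard` example further down this page read as follows (annotations added here for explanation):

```css
#main > h1.my-4                        /* the <h1> with class "my-4" that is a direct child of the element with id "main" */
#main p                                /* every <p> anywhere inside the element with id "main" */
#aside > *:not([role="complementary"]) /* direct children of #aside, except those with role="complementary" */
```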
### Base crawlers
Every crawler relies on a base crawler, which does all of the heavy lifting. There are three types of base crawlers: `standard`, `mediaWiki`, and `metadataOnly`.
The `standard` crawler should be used for most non-MediaWiki encyclopedias. It retrieves all article content, including images, audio, and other media. For encyclopedias with restrictive licenses, see `metadataOnly`.
The `mediaWiki` crawler should be used for most MediaWiki encyclopedias. It automatically uses the correct CSS selectors, although custom selectors can be supplied if necessary.
The `metadataOnly` crawler is for encyclopedias whose more restrictive licenses don't allow redistribution of their content. It doesn't retrieve any article content; it stores only the title, the description, and a list of significant phrases, which are phrases that appear more often in the article than in typical English text.
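EncycloCrawler's exact algorithm for significant phrases is not described here, but the underlying idea can be sketched as a frequency-ratio ranking: count each term in the article and compare against its frequency in everyday English. Everything in this sketch (the `significant` method, the toy reference map) is illustrative and not part of EncycloCrawler's API.

```java
import java.util.*;
import java.util.stream.Collectors;

public class SignificantPhrases {
    // Illustrative sketch only: rank single words by how much more frequent they are
    // in the article than in everyday English. The real crawler's algorithm
    // (and its reference corpus) may differ.
    static List<String> significant(String article, Map<String, Double> englishFreq, int topN) {
        Map<String, Integer> counts = new HashMap<>();
        String[] words = article.toLowerCase(Locale.ROOT).split("[^a-z]+");
        for (String w : words) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        double total = words.length;
        return counts.entrySet().stream()
                // Score = article frequency / English frequency; words missing from the
                // reference map get a small floor, so rare words rank highly.
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, Integer> e) ->
                                -(e.getValue() / total) / englishFreq.getOrDefault(e.getKey(), 1e-6)))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy reference frequencies; a real implementation would use a large corpus.
        Map<String, Double> english = Map.of("the", 0.05, "of", 0.03, "a", 0.02);
        System.out.println(significant("the thunder god Thor wields the hammer Mjolnir", english, 3));
    }
}
```

Common words like "the" score low because they are frequent in the reference map, while topic words like "Mjolnir" rise to the top.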
### Writing a crawler definition
Start by cloning the EncycloCrawler repository. You will need to have Git installed. Open a terminal or command prompt and type the following command:
```shell
git clone https://gitlab.com/ks_found/encyclocrawler.git
```
In the repo folder, go to `src/main/resources/crawlers`. This is where the crawler definitions are located.
Next, you will need to determine the ID of the encyclopedia you want to add. Make the name of the encyclopedia lowercase, and remove any spaces and special characters. For example, Encyclopedia Mythica becomes `encyclopediamythica`.
If the name is long, you may want to abbreviate it. For example, International Standard Bible Encyclopedia becomes `isbe`, and Encyclopedia of Math becomes `eom`.
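The lowercase-and-strip step can be expressed in a few lines of Java. The helper name `idFor` is illustrative and not part of EncycloCrawler itself:

```java
import java.util.Locale;

public class CrawlerId {
    // Illustrative helper: lowercase the name and drop everything that isn't a-z or 0-9.
    // (Abbreviations like "isbe" are still a judgment call you make by hand.)
    static String idFor(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(idFor("Encyclopedia Mythica")); // encyclopediamythica
    }
}
```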
Add `.json` to this ID, and you have the filename of the crawler definition. If your encyclopedia's ID is `encyclopediamythica`, create a file named `encyclopediamythica.json` in `src/main/resources/crawlers`.
### Using the `standard` base crawler

#### Example definition
```json
{
  "id": "encyclopediamythica",
  "type": "standard",
  "license": "CC BY-NC-ND 4.0",
  "baseURL": "https://pantheon.org",
  "titleSelector": "#main > h1.my-4",
  "contentSelector": ".row",
  "paragraphSelector": "#main p",
  "toRemoveSelector": "#aside > *:not([role=\"complementary\"])",
  "isArticleCheckSelector": "#main .eoa",
  "requiredInURL": "/articles/"
}
```
#### Supported fields
| Field | Explanation | Required? |
|---|---|---|
| `id` | The encyclopedia's ID: the name of the encyclopedia, lowercase, with no spaces or special characters. | Yes |
| `type` | The type of crawler (`standard` in this case). | Yes |
| `license` | The encyclopedia's license. Can be a string (e.g. `CC BY-SA 3.0`) or a URL. | Yes |
| `baseURL` | The URL the crawl will start from. Try to look for a contents page or another page with lots of links to articles. | Yes |