EncycloCrawler

From Encyclosphere Project Wiki

Overview

EncycloCrawler is a program that crawls encyclopedias and generates a database of ZWI files. It is written in Java and uses Crawler4j to crawl websites. More documentation can be found at this link.

Adding encyclopedias to EncycloCrawler

Adding an encyclopedia to EncycloCrawler requires basic knowledge of JSON and CSS selectors. Depending on the complexity of the pages you're crawling, you may also need to know Java.

The file describing a crawler is known as a definition. Crawler definitions are written in JSON.

Base crawlers

Every crawler relies on a base crawler, which does all of the heavy lifting. There are three types of base crawler: standard, mediaWiki, and metadataOnly.

The standard crawler should be used for most non-MediaWiki encyclopedias. It retrieves all article content, including images, audio, and other media. For encyclopedias with restrictive licenses, see metadataOnly.

The mediaWiki crawler should be used for most MediaWiki encyclopedias. It automatically uses the correct CSS selectors, although custom selectors can be supplied if necessary.

The metadataOnly crawler is for encyclopedias whose licenses are more restrictive and don't allow their content to be redistributed. It doesn't retrieve any of the article content. It retrieves only the title, the description, and a list of significant phrases, which are phrases that appear more often in the article than in normal English.
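To make the idea of a "significant phrase" concrete, here is a minimal Python sketch. It is not EncycloCrawler's actual algorithm, which this page does not describe. It assumes single words rather than multi-word phrases, and the BACKGROUND table, DEFAULT_FREQ value, and significant_words function are all hypothetical names and numbers chosen for illustration. The sketch scores each word by how much more often it appears in the article than in an assumed table of ordinary English frequencies.

```python
import re
from collections import Counter

# Hypothetical background table: assumed share of ordinary English text
# taken up by each word. Real values would come from a large corpus.
BACKGROUND = {"the": 0.05, "of": 0.03, "and": 0.028, "in": 0.02, "a": 0.02, "was": 0.01}
DEFAULT_FREQ = 1e-5  # assumed share for words missing from the table


def significant_words(text, top_n=3, min_count=2):
    """Rank words by (share of the article) / (share of ordinary English)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        if count < min_count:
            continue  # ignore words seen too rarely to judge
        scores[word] = (count / total) / BACKGROUND.get(word, DEFAULT_FREQ)
    # Highest ratio first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


article = (
    "Odin was the chief of the gods. Odin rode Sleipnir, "
    "and the ravens of Odin watched the world."
)
print(significant_words(article, top_n=1))  # prints ['odin']
```

Common words such as "the" occur often in the article but also in ordinary English, so their ratio stays low, while a name like "Odin" stands out.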

Writing a crawler definition

Start by cloning the EncycloCrawler repository. You will need to have Git installed. Open a terminal or command prompt and type the following command:

git clone https://gitlab.com/ks_found/encyclocrawler.git

In the repo folder, go to src/main/resources/crawlers. This is where the crawler definitions are located.

Next, you will need to determine the ID of the encyclopedia you want to add. Make the name of the encyclopedia lowercase, and remove any spaces and special characters. For example, Encyclopedia Mythica becomes encyclopediamythica.

If the name is long, you may want to abbreviate it. For example, International Standard Bible Encyclopedia becomes isbe, and Encyclopedia of Math becomes eom.

Add .json to this ID, and you have the filename of the crawler definition. If your encyclopedia's ID is encyclopediamythica, create a file named encyclopediamythica.json in src/main/resources/crawlers.
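The naming steps above can be sketched in a few lines of Python. This is only an illustration, not part of EncycloCrawler: the function names encyclopedia_id and definition_path are hypothetical, and keeping digits while dropping every non-ASCII character is an assumption about what counts as a "special character". Abbreviations such as isbe still have to be chosen by hand.

```python
import re
from pathlib import PurePosixPath

# Folder named in this guide, relative to the repository root.
CRAWLER_DIR = PurePosixPath("src/main/resources/crawlers")


def encyclopedia_id(name: str) -> str:
    """Lowercase the name and drop everything except ASCII letters and digits.

    Assumption: digits are allowed in an ID; the guide does not say.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())


def definition_path(crawler_id: str) -> PurePosixPath:
    """Append .json to the ID and place the file in the crawlers folder."""
    return CRAWLER_DIR / f"{crawler_id}.json"


print(definition_path(encyclopedia_id("Encyclopedia Mythica")))
# prints src/main/resources/crawlers/encyclopediamythica.json
```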

Using the standard base crawler

Example definition

{
  "id": "encyclopediamythica",
  "type": "standard",
  "license": "CC BY-NC-ND 4.0",
  "baseURL": "https://pantheon.org",
  "titleSelector": "#main > h1.my-4",
  "contentSelector": ".row",
  "paragraphSelector": "#main p",
  "toRemoveSelector": "#aside > *:not([role=\"complementary\"])",
  "isArticleCheckSelector": "#main .eoa",
  "requiredInURL": "/articles/"
}

Supported fields

Field | Explanation | Required?
--- | --- | ---
id | The encyclopedia's ID. The name of the encyclopedia, lowercase, with no spaces or special characters. | Yes
type | The type of crawler (standard in this case) | Yes
license | The encyclopedia's license. Can be a string (e.g. CC BY-SA 3.0), or a URL. | Yes
baseURL | The URL the crawl will start from. Try to look for a contents page or other page with lots of links to articles. | Yes
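Before running a crawl, it can help to check a definition against the required fields listed above. The following Python sketch is a hypothetical helper, not something shipped with EncycloCrawler: check_definition is an invented name, and it checks only the four fields this table marks as required plus the three base crawler names mentioned earlier. EncycloCrawler itself may validate definitions differently.

```python
import json

# Fields the table above marks as required for the standard base crawler.
REQUIRED_FIELDS = ("id", "type", "license", "baseURL")
# Base crawler names listed earlier on this page.
BASE_CRAWLERS = {"standard", "mediaWiki", "metadataOnly"}


def check_definition(text):
    """Return a list of problems found in a definition (empty if none)."""
    definition = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in definition
    ]
    crawler_type = definition.get("type")
    if crawler_type is not None and crawler_type not in BASE_CRAWLERS:
        problems.append(f"unknown base crawler type: {crawler_type}")
    return problems


incomplete = '{"id": "encyclopediamythica", "type": "standard"}'
for problem in check_definition(incomplete):
    print(problem)
# prints:
# missing required field: license
# missing required field: baseURL
```

To check a real file, read its contents first, for example with pathlib.Path("encyclopediamythica.json").read_text(), and pass the result to check_definition.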

Troubleshooter

Links