EncycloCrawler

From Encyclosphere Project Wiki
== Overview ==
[https://gitlab.com/ks_found/encyclocrawler EncycloCrawler] is a program that crawls encyclopedias, generating a database of [[ZWI]] files. It is written in Java and uses Crawler4j to crawl websites. More documentation is available in the [https://docs.encyclosphere.org/#/encyclocrawler Encyclosphere docs].
== Adding encyclopedias to EncycloCrawler ==
Adding an encyclopedia to EncycloCrawler requires basic knowledge of JSON and CSS selectors. Depending on the complexity of the pages you're crawling, you may also need to know Java.
 
The file describing a crawler is known as a definition. Crawler definitions are written in JSON.
 
=== Base crawlers ===
Every crawler relies on a base crawler, which does all of the heavy lifting. There are three types of base crawlers: <code>standard</code>, <code>mediaWiki</code>, and <code>metadataOnly</code>.
 
The <code>standard</code> crawler should be used for most non-MediaWiki encyclopedias. It retrieves all article content, including images, audio, and other media. For encyclopedias with restrictive licenses, see <code>metadataOnly</code>.
 
The <code>mediaWiki</code> crawler should be used for most MediaWiki encyclopedias. It automatically uses the correct CSS selectors, although custom selectors can be supplied if necessary.
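Because selectors are chosen automatically, a <code>mediaWiki</code> definition can be very short. The following is a hypothetical example for a made-up wiki, using only the required fields documented later on this page:
<code>{
  "id": "examplewiki",
  "type": "mediaWiki",
  "license": "CC BY-SA 4.0",
  "baseURL": "<nowiki>https://examplewiki.org/wiki/Main_Page</nowiki>"
}</code>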
 
The <code>metadataOnly</code> crawler is for encyclopedias with more restrictive licenses that don't allow redistribution of their content. It doesn't retrieve any article content; only the title, description, and a list of significant phrases, i.e. phrases that appear more often in the article than in ordinary English.
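The significant-phrase idea can be sketched in a few lines. This is a simplified illustration, not EncycloCrawler's actual implementation, and the reference frequencies below are made-up assumptions:

```python
from collections import Counter

# Hypothetical per-million word frequencies in ordinary English (assumed values).
REFERENCE_FREQ = {"the": 50000, "of": 30000, "zeus": 5, "olympus": 3}

def significant_words(text, threshold=10.0):
    """Return words whose per-million rate in the article exceeds their
    rate in ordinary English by at least `threshold` times."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    significant = []
    for word, count in counts.items():
        article_rate = count / total * 1_000_000   # per-million rate in article
        reference = REFERENCE_FREQ.get(word, 1)    # fallback for rare words
        if article_rate / reference >= threshold:
            significant.append(word)
    return significant

print(significant_words("the myths of zeus place zeus atop olympus"))
```

Common words like "the" occur so often in ordinary English that their ratio stays low, while article-specific terms like "zeus" stand out.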
 
=== Writing a crawler definition ===
Start by cloning the EncycloCrawler repository. You will need to have Git installed. Open a terminal or command prompt and type the following command:
<code>git clone <nowiki>https://gitlab.com/ks_found/encyclocrawler.git</nowiki></code>
In the repo folder, go to <code>src/main/resources/crawlers</code>. This is where the crawler definitions are located.
 
Next, you will need to determine the ID of the encyclopedia you want to add. Make the name of the encyclopedia lowercase, and remove any spaces and special characters. For example, Encyclopedia Mythica becomes <code>encyclopediamythica</code>.
 
If the name is long, you may want to abbreviate it. For example, International Standard Bible Encyclopedia becomes <code>isbe</code>, and Encyclopedia of Math becomes <code>eom</code>.
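The ID rule above (excluding manual abbreviation) is mechanical enough to express as a small sketch; EncycloCrawler itself does not require this script:

```python
import re

def encyclopedia_id(name: str) -> str:
    """Lowercase the name and strip spaces and special characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

print(encyclopedia_id("Encyclopedia Mythica"))  # encyclopediamythica
```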
 
Add <code>.json</code> to this ID, and you have the filename of the crawler definition. If your encyclopedia's ID is <code>encyclopediamythica</code>, create a file named <code>encyclopediamythica.json</code> in <code>src/main/resources/crawlers</code>.
 
=== Using the <code>standard</code> base crawler ===
 
==== Example definition ====
<code>{
  "id": "encyclopediamythica",
  "type": "standard",
  "license": "CC BY-NC-ND 4.0",
  "baseURL": "<nowiki>https://pantheon.org</nowiki>",
  "titleSelector": "#main > h1.my-4",
  "contentSelector": ".row",
  "paragraphSelector": "#main p",
  "toRemoveSelector": "#aside > *:not([role=\"complementary\"])",
  "isArticleCheckSelector": "#main .eoa",
  "requiredInURL": "/articles/"
}</code>
 
==== Supported fields ====
{| class="wikitable"
!Field
!Explanation
!Required?
|-
|<code>id</code>
|The encyclopedia's ID. The name of the encyclopedia, lowercase, with no spaces or special characters.
|Yes
|-
|<code>type</code>
|The type of crawler (<code>standard</code> in this case).
|Yes
|-
|<code>license</code>
|The encyclopedia's license. Can be a string (e.g. <code>CC BY-SA 3.0</code>), or a URL.
|Yes
|-
|<code>baseURL</code>
|The URL the crawl will start from. Try to look for a contents page or other page with lots of links to articles.
|Yes
|}
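Before running a crawl, it can help to sanity-check a new definition file against the required fields in the table above. This is a standalone sketch, not part of EncycloCrawler:

```python
import json

# Required fields per the table above.
REQUIRED_FIELDS = {"id", "type", "license", "baseURL"}

def check_definition(path: str) -> list:
    """Return a sorted list of required fields missing from a crawler definition."""
    with open(path, encoding="utf-8") as f:
        definition = json.load(f)
    return sorted(REQUIRED_FIELDS - definition.keys())

# Example: check_definition("src/main/resources/crawlers/encyclopediamythica.json")
```

An empty list means all required fields are present; anything else lists what still needs to be filled in.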
 
== Troubleshooter ==
[[Image: encyclocrawler_install.png|767x767px]]



Latest revision as of 17:23, 21 February 2024
