(RFC) ZWI Metadata — Article Read Time or Word Count

From Encyclosphere Project Wiki
Revision as of 09:28, 28 February 2024 by Chris W (talk | contribs) (→‎Voting: Added quorum requirement)

Introduction

Displaying the reading time of an article is becoming more common across many kinds of websites.

From a user perspective, this feature allows the reader to see the estimated reading time of an article and decide whether they want to read it now or later. It also provides a general idea of the level of detail any given article has (the difference is immediately obvious between a ~3 minute article and a ~24 minute article).

From an application and user perspective, this feature allows interrogating, searching, filtering, and ordering of articles based on their length.

The Metadata or Application Level?

While the read-time calculation could be done at the application level without being included in the metadata, this comes with some drawbacks.

Every time the "article summary" (Title, Description, Read Time) is displayed, the application would have to parse the entire text of the article to produce this information. Extrapolated over thousands of users (or more), with each search displaying ten or more articles, this adds up to a lot of redundant processing.

Searching, filtering or ordering by read time is also impractical to do on the fly. For example, to sort all articles (or a subset of articles, such as all articles about a given topic or from a given source), the application would have to calculate the read time from the text of every single article in the set at run time.

An application that wanted to provide this functionality could always maintain its own internal database of metadata about the contained ZWI files, including the read-time. However, this adds an extra degree of complexity at the application/database level, whereas calculating the value at the ZWI generation stage and including it in the metadata is simpler.

Ultimately, inclusion in the metadata is not required, but it may be beneficial.

Proposal - Two Options (or Both)

Add a new field to the metadata.json file for ZWI articles, to contain either:

1) A 'fast' and 'slow' read-time in minutes. This could look like the following, or similar:

"read-minutes": { "fast": 18, "slow": 27 }

2) Or, providing largely equivalent functionality, the article word-count, such as:

"word-count": 1348

The Word Count option would allow the same ability to interrogate, search, filter, and order articles based on their length, and the application could still display the read-time by performing a simple conversion from word-count to read-time, without having to parse the article text itself. (read_minutes = word_count / words_per_minute)

In some ways, the word-count option may be a better choice: the read-time can be generated from the word-count, but not vice versa; the read-time can be tailored to a user or application preference by letting them choose the words-per-minute rate; and the data itself is smaller.
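As a rough illustration of that conversion, the sketch below turns a stored word-count into whole minutes for a chosen reading speed. The function name readMinutes and the default of 150 words per minute are assumptions made for this example only; they are not part of the ZWI format or of this proposal.

```php
<?php
// Convert a stored word count to a read time in whole minutes.
// $words_per_minute is a user or application preference. The default of
// 150 is only an illustrative value, not something defined by ZWI.
function readMinutes(int $word_count, int $words_per_minute = 150) : int
{
	// Guard against empty articles and a zero or negative reading speed
	if ($word_count <= 0 || $words_per_minute <= 0) { return 0; }

	// Round to the nearest whole minute
	return (int) round($word_count / $words_per_minute);
}

// Using the example value "word-count": 1348 with the 180 and 120 WPM
// defaults from the example code on this page:
$fast = readMinutes(1348, 180);	// 7
$slow = readMinutes(1348, 120);	// 11
```

Because only an integer division is involved, the application can recompute the value per user at display time without touching the article text.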

Of course, both fields could be included, eliminating the need to perform any conversions or calculations.
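To make the filtering and ordering use case concrete, here is a minimal sketch that works purely on decoded metadata, assuming the proposed "word-count" field is present. The sample records, the 'Title' key as used here, and the helper names filterByMaxWords and sortByWordCount are all hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical, already-decoded metadata.json contents for three ZWI files.
// "word-count" is the field proposed in this RFC, not an existing ZWI field.
$articles = array(
	array('Title' => 'Article A', 'word-count' => 4320),
	array('Title' => 'Article B', 'word-count' => 540),
	array('Title' => 'Article C', 'word-count' => 1348),
);

// Filter: keep only articles of at most $max_words words.
// Articles without the field are treated as "too long" and dropped.
function filterByMaxWords(array $articles, int $max_words) : array
{
	return array_values(array_filter(
		$articles,
		function (array $meta) use ($max_words) : bool {
			return ($meta['word-count'] ?? PHP_INT_MAX) <= $max_words;
		}
	));
}

// Order: shortest article first. Articles without the field sort as 0.
function sortByWordCount(array $articles) : array
{
	usort($articles, function (array $a, array $b) : int {
		return ($a['word-count'] ?? 0) <=> ($b['word-count'] ?? 0);
	});
	return $articles;
}

$short  = filterByMaxWords($articles, 2000);	// Article B and Article C
$sorted = sortByWordCount($articles);		// B, C, A
```

Neither helper reads any article text, which is the saving the metadata field is meant to provide. How missing fields should be handled is a design choice left open here.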

Example Code

Below is an example PHP function to generate read-time values for any given/parsed string (i.e., article text).

function readTime(string $text, int $fast=180, int $slow=120) : array
{
	// Returns a 'fast' and 'slow' reading time, in whole minutes.
	//
	// Can be used on text in any format, however running over HTML is
	// recommended as all HTML tags are stripped, leaving only real readable
	// text for the read-time calculation.
	//
	// Defaults:
	//  - Fast reading time is calculated on 180 WPM (3 WPS)
	//  - Slow reading time is calculated on 120 WPM (2 WPS)

	if (!$text) { return array(0, 0); }

	$text = strip_tags($text);

	$word_count = str_word_count($text);

	// Note: The above line will only work 'fairly well' on Latin-based text.
	// For Unicode support, a regex, or a delimiter split (like the following
	// code snippet uses via explode) would be more accurate.

	// Calculate read time (minutes with a decimal place value)
	$time_F = $word_count / $fast;
	$time_S = $word_count / $slow;

	// Strip decimal place value, leaving only minutes
	$minutes_F = floor($time_F);
	$minutes_S = floor($time_S);

	// Calculate the number of seconds based on the decimal place value
	$seconds_F = ($time_F - $minutes_F) * 60;
	$seconds_S = ($time_S - $minutes_S) * 60;

	// Round up (increase the minute value) if more than 30 seconds
	if ($seconds_F > 30) { $minutes_F++; }
	if ($seconds_S > 30) { $minutes_S++; }

	// Return fast and slow read time in minutes
	return array($minutes_F, $minutes_S);
}

And below is an example PHP function to generate the word-count of a string (i.e. article text):

function wordCount(string $text, bool $strip_tags=true) : int
{
	// Returns the word count for a string of text.
	//
	// Can be used on text in any format, however running over HTML is
	// recommended as all HTML tags are stripped, leaving only real readable
	// text for the word-count calculation.
	//
	// If the $strip_tags flag is set to false, HTML tags will not be stripped.

	if (!$text) { return 0; }

	$marks = array(
		',', '.', '?', '!', ':', ';', '"', '“', '”', '─', '—', '+',
		'*', '/', '|', '(', ')', '[', ']'
	);

	// Clean text for word count: strip HTML tags, convert newlines and tabs
	// to spaces, strip out grammatical marks
	if ($strip_tags) { $text = strip_tags($text); }
	$text = str_replace(array("\r\n", "\n", "\r", "\t"), ' ', $text);
	$text = str_replace($marks, '', $text);

	// Split on spaces, dropping the empty strings that repeated spaces
	// would otherwise add to the count
	$words = array_filter(explode(' ', $text), 'strlen');

	$word_count = count($words);

	return $word_count;
}

Both code extracts are released under the CC0 - Public Domain license.

NOTE: This code uses scalar type declarations and return types, which require PHP 7.0 or later. The type information can be removed to make the functions compatible with older versions of PHP, in which case you may want to include variable type checks.
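The note inside readTime mentions that a regex would handle Unicode text more accurately than str_word_count. One possible shape for that, offered only as a sketch, is shown below. The function name wordCountUnicode and the exact pattern are assumptions for this example. It expects UTF-8 input and does not segment scripts that are written without spaces between words (such as Chinese, Japanese or Thai).

```php
<?php
// Unicode-aware word count (illustrative sketch, assumes UTF-8 input).
// A "word" is a run of letters or digits, optionally joined by an
// apostrophe or hyphen, so "don't" and "read-time" each count as one word.
function wordCountUnicode(string $text, bool $strip_tags = true) : int
{
	if ($strip_tags) { $text = strip_tags($text); }

	// \p{L} matches any letter, \p{N} any number; /u enables UTF-8 mode
	$count = preg_match_all(
		'/[\p{L}\p{N}]+(?:[\'’\-][\p{L}\p{N}]+)*/u',
		$text
	);

	// preg_match_all returns false on error (e.g. invalid UTF-8 input)
	return ($count === false) ? 0 : $count;
}

$n = wordCountUnicode("<p>Don't panic: naïve café-goers read 2 books.</p>");	// 7
```

Counting matches rather than splitting on delimiters avoids having to list every punctuation mark up front, at the cost of a slightly more opaque pattern.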

Impacted Project Details

Project: ZWI Format

Project Owner(s): Larry Sanger (?), Sergei Chekanov (?), Henry Sanger (?)

RFC Owner(s): Chris W

Comments

Please add your questions and comments here. (Or alternatively, engage in a conversation on Mattermost and post the outcome here.)

Voting

This RFC is in the discussion and comment phase, so voting has not opened.

Voting will conclude either when a quorum of members has voted or on the closing date of ##-#####-2024, whichever comes first.

To be accepted, a quorum of members must have voted, a majority of voters must vote yes, and the project owner(s) must choose not to veto the proposal.

ZWI Metadata - Read Time or Word Count
Name Yes No Unsure
Chris W (@chris_w)

Symbols: ✔ ✘