Standardized reader stat features
I. Purpose
Publishers greatly desire access to statistical data collected by Encyclosphere readers, and there is no reason not to make such data available to them, as long as it can be collected and shared without exposing any data that should not be exposed on privacy grounds. By giving publishers access to this data, we incentivize them to participate more actively; if we do not give them this data, our aggregators appear to be competing with them, which is an awful impression to leave, because it is simply not true.
Such data really needs to be standardized for ease of access and aggregation. Thus, preliminarily, we need to answer three questions:
- What data should not be collected, because it represents a privacy violation of end users?
- Of remaining data, which should be collected, because it would be useful?
- How should the data be collected, so as to avoid the enumerated privacy violations and so as to ensure the useful data is collected?
II. Data not to collect
Reflecting on the sort of data that is typically logged by web analytics software, there are two types of data that represent a privacy incursion. First, IP addresses: they expose a great deal of information, because they are sometimes (indeed often) unique to a single user. Second, personally identifiable data from which an individual's identity can be inferred, especially behavioral data, must be avoided. In the latter category, debatably, would be referrers. The problem with referrers is that, because they are so fine-grained (even when generalized), they can frequently be used to infer an individual's identity.
Generalized or anonymized data is acceptable, but one must not publish anonymized data that can be tied to particular individuals due to small sample size. For example, suppose a spy wants to find out whether an individual from Wyoming has been looking at a certain biography of a government-identified terrorist. Suppose only one person viewed that article, and the (anonymized) data showed that that person was from Wyoming. That would be a failure to respect privacy.
III. What data should be collected
The most important numbers publishers want are, of course, page views and unique views for each article. Page views for article components, such as images, are less important, but might be "nice to have." It is also extremely interesting to publishers to know where their traffic is coming from, down at least to the state and national level. Search engine and social media referrals make good sense to include, if generalized. Data should be recorded and made available at a variety of levels of time granularity, down to daily, except when the numbers are so small as to permit identifiability. Beyond this, there really is nothing terribly important.
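As an illustration of what "generalized" referrals might look like, here is a minimal sketch that reduces a full referrer URL to a coarse category and throws away the path and query string. The category names and the two domain lists are assumptions made for the example, not an agreed list.

```python
from urllib.parse import urlsplit

# Hypothetical domain lists, for illustration only.
SEARCH_DOMAINS = {"google.com", "bing.com", "duckduckgo.com"}
SOCIAL_DOMAINS = {"facebook.com", "reddit.com", "x.com"}


def _matches(host: str, domains: set) -> bool:
    # True for the domain itself or any subdomain (e.g. www.google.com).
    return any(host == d or host.endswith("." + d) for d in domains)


def generalize_referrer(referrer: str) -> str:
    """Map a referrer URL to 'direct', 'search', 'social', or 'other'."""
    if not referrer:
        return "direct"
    host = (urlsplit(referrer).hostname or "").lower()
    if _matches(host, SEARCH_DOMAINS):
        return "search"
    if _matches(host, SOCIAL_DOMAINS):
        return "social"
    return "other"
```

Only the returned category would be stored; the original URL is never kept.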
IV. How to collect the data
IP addresses should never be made available. It should also be impossible to supply an IP address, apply a common hashing algorithm, and determine a match with some data that is supplied.
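One way to meet the second requirement, sketched here under assumptions, is to use a keyed hash (HMAC) with a secret key instead of a plain hash. With a plain SHA-256, anyone can hash a candidate IP address and look for a match; with HMAC, the same check requires the key. The function names and the sample address (taken from a documentation-only range) are illustrative.

```python
import hashlib
import hmac
import secrets


def plain_hash(ip: str) -> str:
    # What an outsider can compute for any IP address they want to test.
    return hashlib.sha256(ip.encode("utf-8")).hexdigest()


def keyed_hash(ip: str, key: bytes) -> str:
    # HMAC-SHA256: cannot be reproduced without knowing `key`.
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the key is generated by the server and never published.
secret_key = secrets.token_bytes(32)

stored_token = keyed_hash("203.0.113.7", secret_key)
outsider_guess = plain_hash("203.0.113.7")
match_found = hmac.compare_digest(stored_token, outsider_guess)
```

Here `match_found` is false for any realistic key, so supplying an IP address and applying a common hashing algorithm does not reveal whether that address is in the data.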
Generally, unless the number of unique visitors for any metric (whether based on geography, time, or whatever) is at least 10, it should not be reported. For example, if only 9 visitors from Ohio visited the "George Washington" article in December 2023, then no page should reveal that fact; but if 10 visitors from New York visited the article in December 2023, then it is acceptable to reveal that fact. Of course, the 9 December visitors should be counted in the summary 2023 data, assuming more than 9 visitors from Ohio visited the article in 2023. Etc.
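The threshold rule can be sketched roughly as follows. The tuple layout, the `"YYYY-MM"` month format, and the function name are assumptions made for the example; the threshold of 10 comes from the text above.

```python
from collections import defaultdict

MIN_UNIQUES = 10  # threshold from the rule above


def publishable_counts(visits):
    """Count unique visitors per (article, region, period) and drop small cells.

    `visits` is an iterable of (article, region, month, visitor_token) tuples,
    where month looks like "2023-12" and visitor_token is an opaque
    identifier (never a raw IP address).
    """
    uniques = defaultdict(set)
    for article, region, month, token in visits:
        year = month[:4]
        # Each visit counts toward its month and toward the yearly summary.
        uniques[(article, region, month)].add(token)
        uniques[(article, region, year)].add(token)
    # Only cells with at least MIN_UNIQUES distinct visitors are kept.
    return {
        cell: len(tokens)
        for cell, tokens in uniques.items()
        if len(tokens) >= MIN_UNIQUES
    }
```

With 9 Ohio visitors in December 2023 and a few more earlier in the year, the December cell is dropped while the 2023 cell is kept. A sketch like this does not by itself stop someone from recovering a suppressed figure by subtracting published monthly figures from a published yearly total; that would need a separate check.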
Data should be stored in a hashed/encrypted form until a generation routine determines what is to be published (each day, I suppose). Such granular source data must never be made available via API. IP addresses should not merely be hashed immediately, but hashed in such a way as to be undiscoverable by the developer.
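A rough sketch of such a daily routine, under stated assumptions: IP addresses are hashed on arrival with a random key that lives only in memory, and the key and all granular tokens are thrown away once the day's report is generated. The class and method names are hypothetical, and this covers only a single day and a single metric.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


class DailyCollector:
    """Holds one day's granular data; nothing here is exposed via API."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # kept in memory, never written out
        self._uniques = defaultdict(set)
        self._views = defaultdict(int)

    def record(self, article: str, ip: str) -> None:
        # The raw IP is hashed immediately and not stored anywhere.
        token = hmac.new(self._key, ip.encode("utf-8"), hashlib.sha256).digest()
        self._uniques[article].add(token)
        self._views[article] += 1

    def generate_report(self, min_uniques: int = 10) -> dict:
        report = {
            article: {
                "page_views": self._views[article],
                "unique_views": len(tokens),
            }
            for article, tokens in self._uniques.items()
            if len(tokens) >= min_uniques
        }
        # Discard the key and the granular tokens; only the report survives.
        self._key = secrets.token_bytes(32)
        self._uniques.clear()
        self._views.clear()
        return report
```

Discarding the key matters because the IPv4 address space is small enough that anyone holding the key could hash every address and recover the originals. Someone with access to the running process could still read the key during the day, so this is a starting point, not a guarantee.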
Obviously, we will want to make sure we follow relevant laws and regulations, but the above plans should satisfy even the most stringent regulations.
V. Standardizing analytics data storage and API
We should also stay in communication whenever anyone starts working on an analytics system, so that we adopt (or create) the same standard for storing analytics data and making it available via an API. Basically, someone should be able to download the analytics data from 2024 for EncycloReader and EncycloSearch using the same code, and use the same software to view the output files.
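To make the goal concrete, here is a sketch of what "the same code" could look like if both aggregators published one JSON object per line. No such format has been agreed; every field name below is a placeholder for whatever standard is eventually adopted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StatRecord:
    # Hypothetical fields, for illustration only.
    source: str        # e.g. "EncycloReader" or "EncycloSearch"
    article: str
    period: str        # e.g. "2024" or "2024-03"
    region: str
    page_views: int
    unique_views: int


def parse_stats(lines):
    """Parse JSON Lines text (one record per line) into StatRecord objects."""
    return [StatRecord(**json.loads(line)) for line in lines if line.strip()]


def total_page_views(records, period):
    """Sum page views over records for exactly the given period."""
    return sum(r.page_views for r in records if r.period == period)
```

Because the loader does not care which aggregator produced a line, the 2024 files from EncycloReader and EncycloSearch could be read, combined, and viewed with the same tools.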