On the Rivalry of Data

Data competes with itself and its peers, and its value is derived from its uniqueness.

Economic goods are commonly separated into two categories, rivalrous and non-rivalrous. Consuming one unit of a rivalrous good leaves one less unit for others to consume (think oil, steel, appointment slots), while consuming a non-rivalrous good does not reduce the utility remaining for anyone else (think street lights, downloading a song). Congestible goods describe partial rivalry, the category between these two extremes: consumption is non-rivalrous at low levels of use, but the level of rivalry increases with the number of consumers (think fisheries, freeways). Data, here defined as the informational inputs to a machine learning model, is often described as a non-rivalrous good, since copying it is essentially free, but this is a mistake. Consuming data is non-rivalrous in that consumers aren’t competing for the same resource. However, data does exhibit a diminishing marginal product in production: when one unit of data is consumed, it diminishes the utility of informationally similar data, for that consumer only. Think of watching a movie on Netflix: my watching the movie for the first time doesn’t affect other viewers’ experience at all, but once I have watched it, a second viewing will be a different experience, and only for me.

Diminishing marginal product in production is quite a mouthful, so I will call data a “self-rivalrous” good instead. The mistake made in economics venues, including the IMF and the American Economic Review, is to treat data as entirely non-rivalrous by defining its utility only through comparisons between prospective buyers of that data. This probably comes from a technical unfamiliarity with how data is used, which is understandable. However, it has implications for public policy, because treating data as entirely non-rivalrous disempowers the individuals from whom the data is derived.

Specifically, treating data as non-rivalrous is both born of and reinforces the notion that the platform or aggregator is the first-class party through which data gains value via its transaction, because data is non-rivalrous only between aggregators. The self-rivalry of data, by contrast, acknowledges that data has a finite and limited utility, and recognizes that data fundamentally describes something about a real, living person. In fact, the self-rivalry and the value of data are both derived from the fact that it is created by and is about a person, and data is only economically valuable today insofar as it is personal. Overlooking these facts has created a rhetorical environment in which data is treated as an infinite resource to be extracted from the aether, one that is simultaneously valuable to its consumer and valueless to its source. In reality, data is derived from human behavior and experience, and its value reflects the underlying individual that it describes. This misaligned worldview has allowed broad violence to be done to individual autonomy over the last two decades, because economists, policymakers, and media communicators lacked the technical understanding necessary to recognize that the value of data comes from its source, and thus could not consider that the individual who created the data could have rights over it.

What is Self-Rivalry?

To understand data’s economic utility structure, we first have to look at how data is consumed. Machine learning models, the technology underpinning artificial intelligence, are in simplified terms systems of equations that transform an input of dimension M into an output of dimension N. Think of the input as a row in an Excel sheet, and the output as a probability for each potential outcome. If the model is trying to classify a 16-pixel by 16-pixel image as either a cat or a dog, then the input has size M=256 (the total number of pixels in the image) and the output has size N=2. If the model is using your age, gender, graduate status, and income to predict the probability that you vote for each of five candidates, then M=4 and N=5. Roughly, this system of equations operates by applying a coefficient (also called a weight) to each input value and summing the weighted values to get each output. The weights are optimized when the equations give the correct outcome for as many inputs as possible, judged against the examples in the training data set.
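The voter example can be sketched in a few lines of Python. This is only an illustration of the "weights times inputs, then sum" idea, not how any particular production model works: the input values and weights below are made up, the function name `predict` is hypothetical, and the final softmax step (which turns the raw sums into probabilities) is an assumption added so that the output matches the "probability for each outcome" description.

```python
import math

def predict(inputs, weights):
    """Return one probability per outcome for a single input row.

    inputs  : list of M numbers (one row of the spreadsheet)
    weights : N lists of M coefficients, one list per possible outcome
    """
    # Weighted sum of the M inputs, computed once for each of the N outcomes.
    scores = [sum(w * x for w, x in zip(row, inputs)) for row in weights]

    # Softmax (an assumption, see the lead-in): rescale the N raw scores
    # into positive numbers that sum to 1. Subtracting the max keeps
    # math.exp from overflowing.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical voter: age, gender, graduate status, income (M = 4),
# each rescaled to a comparable range. All numbers are invented.
voter = [0.34, 1.0, 1.0, 0.52]

# Invented weights for five candidates (N = 5). Training a model means
# searching for the weights that are right for as many rows as possible.
weights = [
    [0.8, -0.1, 0.3, 0.5],
    [-0.4, 0.2, 0.9, -0.3],
    [0.1, 0.6, -0.5, 0.2],
    [0.3, -0.7, 0.1, 0.4],
    [-0.2, 0.4, 0.2, -0.6],
]

probabilities = predict(voter, weights)
print(probabilities)  # five probabilities, one per candidate
```

Here M and N are simply the lengths of `voter` and `weights`; changing either changes the shape of the problem but not the mechanics.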

As many inputs as possible. This phrase is where the diminishing value of similar pieces of data comes from. Let’s say I am making an AI vending machine that identifies stray animals and dispenses the appropriate food based on that classification. If I want to distinguish cats from dogs, it makes sense that I would want to see many, many types of dogs and cats. There are over 400 breeds of dogs (depending on who you ask), and they are very phenotypically diverse. If I see only a few types of cats and dogs, my machine learning model may learn to associate pointy ears and short snouts with cats, and long snouts with dogs, and it would then mistake the Hmong Bobtail for a cat.

The Hmong Bobtail is not a well-known dog, and there are far more pictures of Labradors and Chihuahuas than of Hmong Bobtails. Moreover, the Hmong Bobtail looks far more like a cat than a Chihuahua does; only a very bad model would mistake a Chihuahua for a cat. So, a photo of the Bobtail helps us optimize our model weights to distinguish cats from dogs more than a photo of a Chihuahua would. Once I have added a few photos of the Hmong Bobtail to my data set, though, the next one doesn’t improve my model much. The tenth Chihuahua photo would hardly improve my model at all, and might even get removed as redundant. Yet when I download each of these photos for my machine learning model, it doesn’t diminish the value of those photos for your cat/dog classification model, because you have not yet used them.

Looking at human-centered data, consider a social media platform with forty thousand records of college-aged male users clicking on Ray-Ban ads. The forty-thousand-and-first such record adds almost nothing, as the model already knows this demographic buys sunglasses. But if those same users start clicking on bucket hat ads, those records are valuable; the model has something new to learn. A second platform that already models bucket hat purchases faces the inverse: sunglasses records are valuable, bucket hat records are not. Neither platform’s use of a record diminishes its value to the other, and in fact, the diversity of records makes the aggregated data between the two even more valuable.
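The two-platform example can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the `Consumer` class is hypothetical, the string label stands in for "informationally similar" records, and the 1 / (1 + n) curve is just one arbitrary way to express a value that shrinks as a consumer sees more of the same thing. The only point is that each consumer keeps its own tally, so one platform's use of a record never changes what that record is worth to the other.

```python
from collections import Counter

class Consumer:
    """A model builder that remembers how many records of each kind it has used."""

    def __init__(self, name):
        self.name = name
        self.seen = Counter()  # kind of record -> how many already consumed

    def marginal_value(self, kind):
        # Toy assumption: the value of the next record of a given kind
        # falls as 1 / (1 + number of similar records already consumed).
        return 1.0 / (1 + self.seen[kind])

    def consume(self, kind):
        # Using a record only updates THIS consumer's tally.
        value = self.marginal_value(kind)
        self.seen[kind] += 1
        return value

platform_a = Consumer("platform A")  # already models sunglasses buyers
platform_b = Consumer("platform B")  # already models bucket hat buyers

for _ in range(40_000):
    platform_a.consume("sunglasses click")
    platform_b.consume("bucket hat click")

# Value of the NEXT record of each kind, for each platform.
values = {
    (p.name, kind): p.marginal_value(kind)
    for p in (platform_a, platform_b)
    for kind in ("sunglasses click", "bucket hat click")
}
for key, value in values.items():
    print(key, value)
```

For platform A the next sunglasses record is worth a tiny fraction of the first one while a bucket hat record is worth the full amount, and platform B sees the mirror image. Rivalry appears inside each consumer's own data set, not between the consumers.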

So, while data does not create competition between two parties seeking to use it, since it can be copied essentially for free and is not consumed exclusively by one party, it does create competition with itself, or more importantly, with other pieces of data similar to itself. In economic terms, we can call the good self-rivalrous because it exhibits diminishing marginal value in production for each consumer individually. This is distinct from classical rivalry, where consumption by one party reduces availability to another; my use of a record does not prevent yours. But data is not non-rivalrous either, because the marginal value of any piece of data to a given consumer is finite and contingent on what that consumer has already seen. A piece of data is in competition with itself and its close neighbors, which is to say that it is self-rivalrous.

Uniqueness as Value

This definition of self-rivalry, particularly the notion that data competes with its neighbors, sheds light on the aspect of data from which its value is largely derived: its uniqueness. Looking back to the cat/dog vending machine, the photos of the Hmong Bobtail are valuable because they depict an unusual animal, distinct from most breeds of dog. In human data, the uniqueness is a property of the individual whose behavior and preferences the data describes. Therefore, the value of human data is generated by the human, not by the platform that collected it. Recognizing this, the individual must have a residual claim over the value that their personal data provides.

This lens provides a proper justifying framework with which we can address past failures related to data. Particularly, it provides concrete reasoning as to why scandals like the Cambridge Analytica case felt so wrong, and why the resolution feels so lackluster.

Regulatory Errors

Cambridge Analytica was a British political consulting firm that obtained information about Facebook users without their consent and used it for political ad targeting. A personality survey, delivered through a Facebook app, was taken by roughly a quarter million people who consented for their responses to be used for academic purposes only. Personal information was then pulled from each participant’s Facebook profile to build a psychological profile, and the same was done for profiles with a friend connection to the initial participant, reaching up to 87 million people as reported by Facebook itself.

In response, Cambridge Analytica shut down in 2018, and in 2019 Facebook agreed to pay $5 billion in a settlement with the FTC for violating a privacy order that the FTC imposed in 2012, which was itself issued due to Facebook’s failure to protect the privacy of its users’ data. The penalty was paid to the government; the FTC action gave the individuals whose data was harvested nothing, and gave them no independent legal standing. The harm was framed as Facebook violating its obligations under the order, rather than violating individual rights.

This lack of recognition of data’s relationship to individuals continues to propagate through regulatory architecture even today. Legal frameworks like the GDPR and CCPA, while steps in the right direction, rely mainly on fines that platforms pay to the government, instead of to the individuals whom they have harmed. Although the aim of those regulations is to protect the consumer, individuals are largely left powerless to bring property claims or other such arguments, because they have no formally recognized property interest to enforce. A fully developed legal framework, properly informed by the individual’s rights to their data, should center these rights and provide tools to enforce them, much in the way an individual can enforce their property or religious rights.

Establishing individual legal standing over data is a necessary step, but recognition alone does not return value to its source. The question is how individuals, once recognized as generators of data’s value, can empower themselves to participate in (or abstain from) its exchange. The project of this publication is to establish and promote the architecture of collective mechanisms that will make such participation practical.

## References

1. IMF, "The Economics and Implications of Data: An Integrated Perspective." <https://www.imf.org/en/publications/departmental-papers-policy-papers/issues/2019/09/20/the-economics-and-implications-of-data-an-integrated-perspective-48596>
2. Jones and Tonetti 2018, "Nonrivalry and the Economics of Data." <https://christophertonetti.com/files/papers/JonesTonetti_DataNonrivalry.pdf>