<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://takebackyourdata.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://takebackyourdata.org/" rel="alternate" type="text/html" /><updated>2026-05-29T08:23:13+00:00</updated><id>https://takebackyourdata.org/feed.xml</id><title type="html">Take Back Your Data</title><subtitle>A publication arguing that your data is your property, and building the economic, legal, and political case for why that matters.</subtitle><author><name>Roman Belaire</name></author><entry><title type="html">On the Rivalry of Data</title><link href="https://takebackyourdata.org/essays/on-the-rivalry-of-data/" rel="alternate" type="text/html" title="On the Rivalry of Data" /><published>2026-05-27T00:00:00+00:00</published><updated>2026-05-27T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/on-the-rivalry-of-data</id><content type="html" xml:base="https://takebackyourdata.org/essays/on-the-rivalry-of-data/"><![CDATA[<p>Economic goods are separated into two categories, rivalrous and nonrivalrous, where the consumption of one unit of the former means one less unit for others to consume (think oil, steel, appointment slots), while consuming the latter does not reduce the remaining utility (think street lights, downloading a song). Congestible goods describe partial rivalry, or the category between these two extremes where the consumption of one unit is non-rivalrous, but the level of rivalry increases with the number of consumers — fisheries, freeways. Data, here defined as the informational inputs to a machine learning model, is often described as a non-rivalrous good, since copying it is free, but this is a mistake. 
While <em>consuming</em> the data is non-rivalrous in that consumers aren’t competing for the same resource, data does exhibit a diminishing marginal product in production, where, when one unit of data is consumed, it diminishes the utility of informationally similar data for that consumer only. Think of watching a movie on Netflix: my watching the movie for the first time doesn’t affect other viewers’ experience at all, and once I have watched the movie, a second viewing will be experientially different for me only.</p>

<p>Diminishing marginal product in production is quite a mouthful, so I will call data a “self-rivalrous” good instead. The mistake that economics forums make, including the IMF and the American Economic Review, is that they consider data entirely non-rivalrous because they define its utility only through comparisons between prospective buyers of said data. This probably comes from an understandable unfamiliarity with how data is actually used. It nonetheless has consequences for public policy: treating data as entirely non-rivalrous disempowers the individuals from whom the data is derived.</p>

<p>Specifically, treating data as non-rivalrous is both born of and reinforces the notion that the platform or aggregator is the first-class party through which data gains value by being transacted, because data is non-rivalrous only between aggregators. Self-rivalry, by contrast, acknowledges that data has a finite and limited utility, and recognizes that data fundamentally describes something about a real, living person. In fact, the self-rivalry and the value of data are both derived from the fact that it is created by and is about a person, and data is only economically valuable today insofar as it is <em>personal</em>. Overlooking these facts has created a rhetorical environment in which data is treated as an infinite resource to be extracted from the aether that is simultaneously valuable to its consumer and valueless to its source. In reality, data is derived from human behavior and experience, and its value reflects the underlying individual that it describes. This misaligned worldview has allowed broad violence to be conducted against individual autonomy in the last two decades because economists, policymakers, and media communicators lacked the technical understanding necessary to recognize that the value of data comes from its source, and thus could not consider that the individual who created the data could have rights over it.</p>

<h2 id="what-is-self-rivalry">What is Self-Rivalry?</h2>

<p>To understand data’s economic utility structure, we have to first look at how data is consumed. Machine learning models, the underpinning technology of artificial intelligence, are in simplified terms a system of equations that transform an input of dimension M into an output of dimension N. Think of the input like a row in an Excel sheet, and the output like a probability for each potential outcome. If the model is trying to classify a 16-pixel by 16-pixel image as either a cat or a dog, then the input has a size M=256 (total pixels in the image) and the output is size N=2. If the model is using your age, gender, graduate status, and income to predict the probability you vote for each of five candidates, then M=4 and N=5. Roughly, this system of equations operates by applying a coefficient value (also called a weight) to each input value, and evaluating the sum of these values to get the output. The weights are <em>optimized</em> when the outcome of the equation is correct for as many inputs as possible, as far as we have in the training data set.</p>
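<p>As a rough illustration of the paragraph above, here is a minimal Python sketch of a weighted-sum model for the voting example (M=4 inputs, N=5 outputs). The weight values, the input scaling, and the <code>predict</code> function are invented for illustration. The softmax step that turns the sums into probabilities is also an assumption; the description above only covers the weighted sum.</p>

```python
import math


def predict(weights, x):
    """Weighted sum of the inputs for each outcome, normalized into probabilities."""
    # One row of weights per outcome (N rows), one weight per input value (M columns).
    scores = [sum(w * value for w, value in zip(row, x)) for row in weights]
    # Softmax (an assumption, not from the essay): exponentiate and normalize.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical voter, M = 4: age (scaled), gender flag, graduate flag, income (scaled).
voter = [0.35, 1.0, 1.0, 0.60]

# Hypothetical, hand-picked weights, N = 5 candidates. Training would tune these
# so that predictions match as many rows of the training data as possible.
weights = [
    [0.8, -0.2, 0.5, 0.1],
    [-0.4, 0.3, 0.9, -0.6],
    [0.1, 0.1, -0.7, 1.2],
    [-0.9, 0.6, 0.2, 0.4],
    [0.3, -0.5, -0.1, -0.3],
]

probabilities = predict(weights, voter)
for candidate, p in enumerate(probabilities, start=1):
    print(f"candidate {candidate}: {p:.3f}")
```

<p>The five printed values sum to one. Optimizing the model means adjusting the twenty numbers in <code>weights</code>, nothing more.</p>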

<p><em>As many inputs as possible.</em> This describes the diminishing value of similar pieces of data. Let’s say I am making an AI vending machine that identifies stray animals and dispenses the appropriate food based on that classification. If I want to distinguish cats from dogs, it makes sense that I would want to see many, many types of dogs and cats. There are over 400 breeds of dogs (depending on who you ask) that are very phenotypically diverse. If I see only a few types of cats and dogs, my machine learning model may learn to associate pointy ears and short snouts with cats, and long snouts with dogs — and it would mistake the Hmong Bobtail for a cat.</p>

<p>The Hmong Bobtail is not a well-known dog, and there are far more pictures of Labradors and Chihuahuas than of Hmong Bobtails. Moreover, the Hmong Bobtail looks far more like a cat than a Chihuahua does; only a very bad model would mistake a Chihuahua for a cat. So, a photo of the Bobtail helps us optimize our model weights to distinguish cats from dogs more than a photo of a Chihuahua would. However, once I have added a few photos of the Hmong Bobtail to my data set, the next photo added doesn’t improve my model much. The tenth Chihuahua photo added wouldn’t improve my model at all, and might even get removed. However, when I download each of these photos for my machine learning model, it doesn’t diminish the value of those photos for <em>your</em> cat/dog classification model, because you have not yet used them.</p>
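<p>The effect can be sketched with a toy novelty score. The Python below is not how a real model measures the value of a photo: it assumes each photo is reduced to two made-up features, and it scores a new photo by its distance to the nearest photo a given consumer already holds. The function <code>marginal_value</code> and all feature values are hypothetical.</p>

```python
import math


def marginal_value(photo, already_held):
    """Toy novelty score: distance from a new photo to the nearest photo already held."""
    if not already_held:
        return math.inf
    return min(math.dist(photo, held) for held in already_held)


# Hypothetical two-number features per photo: (snout length, ear pointiness).
labradors = [(0.85, 0.15), (0.83, 0.18)]
chihuahuas = [(0.50, 0.60), (0.52, 0.58), (0.49, 0.61)]
bobtail_1, bobtail_2 = (0.20, 0.95), (0.22, 0.93)
another_chihuahua = (0.51, 0.59)

my_data = labradors + chihuahuas  # what my model has already seen
your_data = list(labradors)       # what your model has already seen

chihuahua_value = marginal_value(another_chihuahua, my_data)
first_bobtail_value = marginal_value(bobtail_1, my_data)
your_value_before = marginal_value(bobtail_1, your_data)

my_data.append(bobtail_1)         # I use the photo; nothing is taken from you

second_bobtail_value = marginal_value(bobtail_2, my_data)
your_value_after = marginal_value(bobtail_1, your_data)

print(f"another Chihuahua, to me: {chihuahua_value:.3f}")
print(f"first Bobtail, to me:     {first_bobtail_value:.3f}")
print(f"second Bobtail, to me:    {second_bobtail_value:.3f}")
print(f"first Bobtail, to you:    {your_value_before:.3f} before, {your_value_after:.3f} after")
```

<p>Under these assumptions the first Bobtail scores far higher than another Chihuahua, the second Bobtail scores far lower than the first, and the score on your side is unchanged by my use.</p>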

<p>Looking at human-centered data, consider a social media platform with forty thousand records of college-aged male users clicking on Ray-Ban ads. The forty-thousand-and-first such record adds nothing, as the model already knows this demographic buys sunglasses. But if those same users start clicking on bucket hat ads, that record is valuable; the model has something new to learn. A second platform that already models bucket hat purchases faces the inverse: sunglasses records are valuable, bucket hat records are not. Neither platform’s use of a record diminishes its value to the other, and in fact, the diversity of records makes the aggregated data between the two even more valuable.</p>

<p>So, while data does not create competition between two people seeking to use it — since it can be copied essentially for free and is not consumed exclusively by one party — it does create competition with <em>itself</em>, or more importantly, with other bits of data similar to itself. In economic terms, we can call this self-rivalrous as the good exhibits diminishing marginal value in production individually, to each consumer. This is distinct from classical rivalry, where consumption by one party reduces availability to another. Data is not classically rivalrous, since copying is free and my use of a record does not prevent yours. But it is not non-rivalrous either, because the marginal value of any piece of data to a given consumer is finite and contingent on what that consumer has already seen. So, we can say that a piece of data is in competition with itself and its close neighbors, or that it is self-rivalrous.</p>

<h2 id="uniqueness-as-value">Uniqueness as Value</h2>

<p>This definition of self-rivalry, particularly the notion that data competes with its neighbors, sheds light on the property from which data’s value is largely derived: its uniqueness. Looking back to the cat/dog vending machine, the photos of the Hmong Bobtail are valuable because they depict an unusual animal, distinct from most breeds of dog. In human data, the uniqueness is a property of the individual whose behavior and preferences the data describes. Therefore, the value of human data is generated <em>by the human</em>, not the platform that collected it. Recognizing this, the individual must then have a residual claim over the value that their personal data provides.</p>

<p>This lens provides a proper justifying framework with which we can address past failures related to data. Particularly, it provides concrete reasoning as to why scandals like the Cambridge Analytica case felt so wrong, and why the resolution feels so lackluster.</p>

<h2 id="regulatory-errors">Regulatory Errors</h2>

<p>Cambridge Analytica was a British political consulting firm that used Facebook’s platform to collect information from Facebook users without their consent, and then used it for political ad targeting. A survey was issued to roughly a quarter million people, who consented to their responses being used for academic purposes only. Their personal information was then scraped from their Facebook profiles to build psychological profiles, and the same was done for every profile with a friend connection to an initial participant, totaling up to 87 million people <a href="https://about.fb.com/news/2018/04/restricting-data-access/">as reported by Facebook themselves</a>.</p>

<p>In response, Cambridge Analytica was shut down, and in 2019 Facebook agreed to pay a $5B USD penalty to the FTC for violating a privacy order that the FTC had imposed in 2012, which was itself issued over Facebook’s failure to protect the privacy of its users’ data. The 87 million individuals whose data was harvested received nothing and had no independent legal standing. The harm was framed as Facebook violating contractual obligations, rather than individual rights.</p>

<p>This lack of recognition of data’s relationship to individuals continues to propagate through regulatory architecture even today. Legal frameworks like the GDPR and CCPA, while steps in the right direction, impose fines on platforms that are paid to the government, instead of to the individuals whom they have harmed. Although the aim of those regulations is to protect the consumer, individuals are left powerless to bring property claims or other such arguments because they have no formally recognized interest to enforce. A fully developed legal framework, properly informed by the individual’s rights to their data, should center these rights and provide tools to enforce them, much in the way an individual can enforce their property or religious rights.</p>

<p>Establishing individual legal standing over data is a necessary step, but recognition alone does not return value to its source. The question is how individuals, once recognized as generators of data’s value, can empower themselves to participate in (or abstain from) its exchange. The project of this publication is to establish and promote the architecture of collective mechanisms that will make such participation practical.</p>

<div class="references">

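<h2 id="references">References</h2>

<ol>
<li>IMF, "The Economics and Implications of Data: An Integrated Perspective." <a href="https://www.imf.org/en/publications/departmental-papers-policy-papers/issues/2019/09/20/the-economics-and-implications-of-data-an-integrated-perspective-48596">https://www.imf.org/en/publications/departmental-papers-policy-papers/issues/2019/09/20/the-economics-and-implications-of-data-an-integrated-perspective-48596</a></li>
<li>Jones and Tonetti 2018, "Nonrivalry and the Economics of Data." <a href="https://christophertonetti.com/files/papers/JonesTonetti_DataNonrivalry.pdf">https://christophertonetti.com/files/papers/JonesTonetti_DataNonrivalry.pdf</a></li>
</ol>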

</div>]]></content><author><name>Roman Belaire</name></author><category term="economics" /><category term="theory" /><category term="AI" /><summary type="html"><![CDATA[Economic goods are separated into two categories, rivalrous and nonrivalrous, where the consumption of one unit of the former means one less unit for others to consume (think oil, steel, appointment slots), while consuming the latter does not reduce the remaining utility (think street lights, downloading a song). Congestible goods describe partial rivalry, or the category between these two extremes where the consumption of one unit is non-rivalrous, but the level of rivalry increases with the number of consumers — fisheries, freeways. Data, here defined as the informational inputs to a machine learning model, is often described as a non-rivalrous good, since copying it is free, but this is a mistake. While consuming the data is non-rivalrous in that consumers aren’t competing for the same resource, data does exhibit a diminishing marginal product in production, where, when one unit of data is consumed, it diminishes the utility of informationally similar data for that consumer only. 
Think of watching a movie on Netflix: my watching the movie for the first time doesn’t affect other viewers’ experience at all, and once I have watched the movie, a second viewing will be experientially different for me only.]]></summary></entry><entry><title type="html">A/B Testing Is Human Subjects Research</title><link href="https://takebackyourdata.org/essays/ab-testing-is-human-subjects-research/" rel="alternate" type="text/html" title="A/B Testing Is Human Subjects Research" /><published>2026-05-14T00:00:00+00:00</published><updated>2026-05-14T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/ab-testing-is-human-subjects-research</id><content type="html" xml:base="https://takebackyourdata.org/essays/ab-testing-is-human-subjects-research/"><![CDATA[<p>A/B testing is a hundred years old and almost universally beneficial. A marketing firm prints half its mailers in red and half in green, counts the responses, and prints more red ones. A grocery chain tests two shelf arrangements and keeps the one that moves more product. The method is simple, and no one is harmed. However, in the world of software, which is now fully integrated with the human experience, a few structural distinctions exist that make newspaper headline testing fundamentally different from social media algorithm tuning. In fact, the intentional manipulation of user emotional states reads far closer to scientific testing than many intuit.</p>

<p>The scale and structure of internet products added both considerable complexity and newfound ability to such tests. In 2012, Facebook <a href="https://www.npr.org/sections/alltechconsidered/2014/06/30/326929138/facebook-manipulates-our-moods-for-science-and-commerce-a-roundup">conducted a study</a> on nearly 700,000 of its users to determine whether changing the emotional content of their feeds would in turn change the users’ own emotional state. More than a decade later, this same experiment runs continuously, for all users, on short-form content platforms optimized for “engagement.” Thanks to excellent marketing, whatever a private firm does to its users has remained exempt from the moral framework we apply to all other forms of experimentation, from cancer research to graduate studies in psychology.</p>

<p>Regarding the continuous nature of these online optimizations, the relationship between modern media and its users provides insight into a particularly distressing aspect of A/B testing in that world. While previous A/B testing was between choices within an environment (which ad/headline/flavor do you like better), the integration of digital media into our social, leisure, and economic lives means the tests change our environment itself. As much as Coca-Cola spent on its marketing and product design to make Coke as enticing as possible, it could never restructure your entire diet to see how your behavior changed. TikTok, Instagram, YouTube, Netflix, and all the other media platforms, meanwhile, maintain not just a master model of what button shapes make people click more often, but per-user profiles that detail our preferences and predict what makes us spend more time with the content. Every session a user spends on-platform is simultaneously its own product being sold to advertisers and a data point refining a behavioral model whose purpose is to predict what the user does next.</p>

<h2 id="common-rule">Common Rule</h2>

<p>We have long-standing precedent on how to recognize and address these experiments. <a href="https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html">The Common Rule</a>, the federal regulation that has governed human subjects research in the United States since 1991, defines research as “a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge,” and a human subject as “a living individual about whom an investigator obtains information through intervention or interaction.” A typical platform A/B test satisfies every clause.</p>

<p>It is a systematic investigation: subjects are randomized into groups, treatments are administered, outcomes are measured, and statistical inference is performed. It is designed to contribute to generalizable knowledge: the findings are not consumed and discarded but written up internally, distilled into institutional knowledge. It involves living individuals and is conducted through intervention — the test actively modifies the user experience. It also uses identifiable private information, since modern platform experiments are tied to user accounts and to behavioral profiles assembled over years.</p>
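<p>To make those four steps concrete, here is a minimal Python sketch of the skeleton of such a test: randomized assignment by hashing an account id, followed by a standard two-proportion z-test on the measured outcome. This is an illustration under stated assumptions, not any platform’s actual pipeline. The function names, the experiment name, and the counts are all hypothetical.</p>

```python
import hashlib
import math


def assign_arm(user_id, experiment):
    """Randomize an account into arm A or B, stably, by hashing its id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic and two-sided p-value for a difference between two rates."""
    rate_a, rate_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: the same account always lands in the same arm.
arm = assign_arm("user-123", "feed-ranking-v2")

# Hypothetical outcome counts: users who took the measured action in each arm.
z, p_value = two_proportion_z(hits_a=480, n_a=10_000, hits_b=560, n_b=10_000)
print(f"user-123 is in arm {arm}; z = {z:.2f}, p = {p_value:.4f}")
```

<p>Because assignment is keyed to the account, every result can be joined back to an identifiable individual and their behavioral profile.</p>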

<p>The Common Rule provides a full framework for upholding moral standards in human trial research, principally centered on the notion of informed consent. Essentially, the regulation was written with the understanding that the most important aspect of a human clinical trial is that the subject fully understands the personal risks and benefits, the description and experimental status of the procedures, and the purpose of the study, and consents. Notably, the “consent” given by accepting a terms of service agreement famously does <em>not</em> count as informed consent, least of all for unreasonable terms.</p>

<p>Obviously, it is unreasonable to state that software companies should not perform A/B testing at all, simply because they made a product that their users intertwine their lives with. Instead, they just need to make a good-faith effort to inform users of what is being tested and why, and allow an opt-out for those nonconsenting.</p>

<h2 id="affect-interventions-and-design-preference">Affect Interventions and Design Preference</h2>

<p>A loophole that should be closed immediately is the option for companies to claim a low risk of harm for simple UI changes. The Common Rule and other IRB regulations allow for approved exemptions for studies with minimal or no risk to participants. As such, companies could hide manipulative changes behind innocuous-looking design choices like color or font size, and waive the informed consent process. Any rule should explicitly account for this, because a century of design research and the entire discipline of graphic design <em>know</em> that visual choices move the affective experience. A designer claiming otherwise is either ignorant, and therefore unfit to run tests on millions, or intentionally misleading.</p>

<p>However, this introduces a problem of scale. If every visual choice carries weight, then every change would require informed consent, which is operationally impossible. A line must be drawn somewhere, and using the risk of harm or the magnitude of design change is somewhat arbitrary. Instead, we should look at the A/B test’s goals in relation to user preference, and whether it is trying to meet or to manipulate them. A platform making a button more visually appealing is meeting preference (users prefer attractive interfaces); a platform testing whether red borders trigger impulse purchases by anxiety induction is manipulating preference.</p>

<p>The last required definition to draw this line, then, is what constitutes preference. There is a distinction in economics between what people <em>say</em> they want (stated preference) and what people <em>do</em> (revealed preference). The classical position, from Paul Samuelson in the late 1930s, is that revealed preference is the authoritative signal. If a person says they want to eat healthier and yet orders pizza three nights a week, the pizza is the preference. This distinction between elicited and observed signals catches a failure mode in self-reported data, where people misrepresent themselves not only to interviewers but also to themselves, and has contributed to many quality of life improvements over time.</p>

<p>It has, however, been weaponized. The argument is familiar to anyone who has watched a congressional hearing in the last decade. The CEO of whatever in-vogue media platform testifies, saying that they simply optimize for user preference. Users come back, they click; these are revealed preferences. Who are we to second-guess the user? We are merely meeting them where they are.</p>

<h2 id="circular-optimization">Circular Optimization</h2>

<p>This argument has a flaw, which stems from a structure of online media that has never existed before. Revealed preference, when conceived, assumed that the environment in which choices were made was roughly neutral: a person was offered a free choice between pre-existing options (say, restaurants) and had up-to-date information on how the options worked. Now, users are being adversarially optimized against by a system that has spent a decade learning their specific exploitable patterns. Users largely do not know how the product they are using operates or towards what goal. Moreover, there is no choice. A/B tests work en masse and aggregate results across users rather than elicit active choices. Finally, there is the environment, the app itself, where every aspect is carefully chosen (through more testing) to change some part of the user’s behavior.</p>

<p>Consider a casino. A casino is engineered, in every detail of its lighting and carpet pattern and drink service, to extract money from the people inside it. By staying and losing money repeatedly, its patrons reveal a preference for the activity. Obviously, no one treats casinos as ethical on those grounds. The whole environment is the manipulation; the revealed behavior is what manipulation looks like when it succeeds. Revealed preference theory is a tool for inferring desires in conditions of free choice. When the conditions of free choice have been violated, engineered to produce particular behaviors, we cannot rely on revealed preference as a motivation for design.</p>

<p>The casino analogy, already damning, actually understates the case. A casino is a single engineered environment applied uniformly to everyone who walks in, optimized for the average user. A platform’s A/B testing apparatus, combined with a persistent per-user behavioral profile, produces something worse: a casino whose architecture is continuously revised around each individual. The house knows the odds, and it knows that <em>you</em> specifically are susceptible to loss-aversion framing after 11 pm, that your scroll slows when you encounter a particular kind of outrage, and that ASMR content reliably extends your session by eight minutes.</p>

<p>Moreover, this casino also changes you through its user profile. The user profile, initially built on information about what keeps you on longer, begins to predict how you’ll behave, and influences the content served. A user who initially resists doomscrolling is gradually reconditioned through content sequencing until the resistance erodes. The casino gets better at extracting from them, and they get worse at resisting it. This is the experiment that has never required — and would never gain — informed consent, and is the logical endpoint of the marriage of environment design and user profiling.</p>

<p>No reasonable position holds that software companies should freeze their interfaces in place, or that every button-color test requires a consent form. Instead, we should just recognize that a meaningful category of platform experimentation already meets the federal definition of human subjects research and should be governed accordingly. Behavioral reconditioning, and any test that draws on longitudinal user profiles, should require informed consent or a workable opt-out for those who decline. This is not an unreasonable burden on a trillion-dollar industry.</p>

<div class="references">

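<h2 id="references">References</h2>

<ol>
<li>Facebook Emotional Contagion Study, NPR. <a href="https://www.npr.org/sections/alltechconsidered/2014/06/30/326929138/facebook-manipulates-our-moods-for-science-and-commerce-a-roundup">https://www.npr.org/sections/alltechconsidered/2014/06/30/326929138/facebook-manipulates-our-moods-for-science-and-commerce-a-roundup</a></li>
<li>The Common Rule, US Department of Health and Human Services. <a href="https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html">https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html</a></li>
<li>45 CFR 46.101(a), US DHHS. <a href="https://www.ecfr.gov/current/title-45/part-46#p-46.101(a)">https://www.ecfr.gov/current/title-45/part-46#p-46.101(a)</a></li>
<li>45 CFR 46.116(b), US DHHS. <a href="https://www.ecfr.gov/current/title-45/part-46#p-46.116(b)">https://www.ecfr.gov/current/title-45/part-46#p-46.116(b)</a></li>
<li>Terms of Service, Berkeley Technology Law Journal. <a href="https://btlj.org/2014/11/terms-of-service-didnt-read-might-not-be-a-problem-if-its-browsewrap/">https://btlj.org/2014/11/terms-of-service-didnt-read-might-not-be-a-problem-if-its-browsewrap/</a></li>
<li>"A Note on the Pure Theory of Consumer's Behaviour," Paul Samuelson. <em>Economica</em>, 1938.</li>
</ol>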

</div>]]></content><author><name>Roman Belaire</name></author><category term="technology" /><category term="ethics" /><category term="law" /><category term="AI" /><summary type="html"><![CDATA[A/B testing is a hundred years old and almost universally beneficial. A marketing firm prints half its mailers in red and half in green, counts the responses, and prints more red ones. A grocery chain tests two shelf arrangements and keeps the one that moves more product. The method is simple, and no one is harmed. However, in the world of software, which is now fully integrated with the human experience, a few structural distinctions exist that make newspaper headline testing fundamentally different from social media algorithm tuning. In fact, the intentional manipulation of user emotional states reads far closer to scientific testing than many intuit.]]></summary></entry><entry><title type="html">The Standard Objection: Why Would Google Ever Negotiate?</title><link href="https://takebackyourdata.org/essays/the-standard-objection/" rel="alternate" type="text/html" title="The Standard Objection: Why Would Google Ever Negotiate?" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/the-standard-objection</id><content type="html" xml:base="https://takebackyourdata.org/essays/the-standard-objection/"><![CDATA[<p>In the current data ecosystem, large firms like Google or Meta already capture data, attention, and dollars without sharing power or revenue. Assuming their leadership pursues rational business goals — minimize costs, maximize profits — there is no reason for them to engage with a cooperative of users unless they are forced to. “Forced,” in this case, being either by regulation or by the cooperative reaching a critical mass such that a boycott would be significantly detrimental to business. 
Both are unlikely to happen, since there is currently no proven legal or normative blueprint to break the ice. Moreover, large data firms will actively avoid ceding bargaining power over data to protect their control and margin, as mandated by their duty to their shareholders.</p>

<p>However, as legislatures worldwide start to develop consumer protection laws around data (e.g., GDPR, CCPA, PDPA), the cost-benefit analysis changes. As increasing regulatory pressure forces firms to build expensive internal consent and governance infrastructure, an opportunity arises. A cooperative that provides a single interface through which firms access user data can cut their compliance costs drastically. In return, users gain a seat at the negotiation table.</p>

<h2 id="whats-in-it-for-me">What’s in it for <em>me?</em></h2>

<p>As noted above, a rational data firm would only negotiate with its users on the terms of their data agreement if there were something to be gained by doing so. Today, as has been the case since the beginning of the internet, there is no such benefit.</p>

<p>Even as this idea picks up steam and the public (and even legal) understanding of data ownership changes, our tech lords will fight tooth and nail to retain control over their hoard. As such, while the political and philosophical arm of Data as Property reduces barriers to <em>entry</em> for the user, it must also be accompanied by a technological framework that reduces barriers to <em>acceptance</em> by the companies. The primary barrier is, of course, the cost of compliance and of managing the data itself.</p>

<p>A well-designed data cooperative will exhibit the following.</p>

<h2 id="ease-of-use">Ease of Use</h2>

<p>A data cooperative must be easier for a company to work with than the status quo. That means a single technical integration for consent, access, and logging; standard contract templates; and predictable governance timelines for new uses of data.</p>

<p>It should appear as an external consent- and data-stewardship service: one API, one legal document set, and one point of contact, instead of thousands of fragmented relationships.</p>

<h2 id="exported-transparency">Exported Transparency</h2>

<p>Exported Transparency is the idea that the cooperative, by retaining control over its own data, outsources the burden of transparency away from the firm. Since the cooperative manages governance and can be internally transparent, users should be able to see what data they have contributed, which licenses their data is part of, what activity the cooperative has approved for their data, and how these decisions have come to be — voting records, internal debates, town halls, etc.</p>

<p>Externally, this also means easing transparency in the more conventional, regulatory sense. The cooperative will have a record connecting data from its users to the export interface, so when regulators or other third parties audit the firms for compliance, they need only show provenance for one consistent source of data, instead of many thousands of individuals.</p>

<h2 id="compliance-by-design">Compliance by Design</h2>

<p>Many regulatory documents, like the GDPR and other emerging standards, emphasize “privacy by design” and accountability as an ongoing process. A well-designed data cooperative should be able to reduce this friction by moving accountability infrastructure closer to the user. Take the right to deletion, for example: a cooperative’s infrastructure should, at the very least, provide an upfront and verifiable way for a user to exclude themselves from a pool of data, before that data reaches the firms. This reduces the need for a company to spend resources on compliance and data provenance. Similarly, data minimization, consent, and privacy can be written into agreed-upon standards set by the cooperative’s members, so that users know exactly what they are agreeing to each time.</p>
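<p>The deletion example can be sketched as follows. This is a minimal in-memory illustration in Python, assuming a hypothetical <code>CooperativePool</code> class with invented method names. A real cooperative would need durable storage, authentication, and a tamper-evident log, none of which is shown here.</p>

```python
from dataclasses import dataclass, field


@dataclass
class CooperativePool:
    """Hypothetical pool of member records, with exclusion applied before export."""
    records: dict = field(default_factory=dict)    # member id -> list of records
    excluded: set = field(default_factory=set)     # members who have opted out
    audit_log: list = field(default_factory=list)  # provenance trail for auditors

    def contribute(self, member_id, record):
        """Add one record on behalf of a member."""
        self.records.setdefault(member_id, []).append(record)

    def exclude(self, member_id):
        """A member removes themselves from all future exports."""
        self.excluded.add(member_id)
        self.audit_log.append(("exclude", member_id))

    def export(self, licensee):
        """Build the dataset a firm receives, skipping excluded members."""
        included = sorted(m for m in self.records if m not in self.excluded)
        self.audit_log.append(("export", licensee, tuple(included)))
        return [record for m in included for record in self.records[m]]


pool = CooperativePool()
pool.contribute("alice", {"clicked": "sunglasses"})
pool.contribute("bob", {"clicked": "bucket hat"})
pool.exclude("bob")

exported = pool.export("example-firm")
print(exported)
print(pool.audit_log)
```

<p>In this sketch the exclusion happens on the cooperative’s side, so the firm never receives the excluded member’s records, and the log gives an auditor one place to check which members fed each export.</p>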

<h2 id="high-quality-data">High Quality Data</h2>

<p>Finally, a member-owned cooperative has a structural incentive to push for accuracy, richness, and appropriate contextual metadata, because better data means better licensing terms and more revenue for members.</p>

<p>For firms, that translates into higher signal-to-noise ratios than typical web scraping: cleaner schemas, clearer documentation, and fewer duplicates and errors. In an era where AI systems are scrutinized for bias, explainability, and provenance, access to a cooperative’s high-quality data can become a competitive advantage.</p>

<h2 id="conclusion">Conclusion</h2>

<p>A well-designed data cooperative provides benefits to its members and incentives to companies by cutting compliance costs, externalizing transparency over consent and usage, and delivering high-quality data through a single, auditable interface. As the cost of compliance rises and regulatory bodies continue, rightly, to develop, the argument for data cooperatives only gets stronger.</p>

<div class="references">

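<h2 id="references">References</h2>

<ul>
  <li>Data Cooperatives: <a href="https://arxiv.org/html/2504.10058v1">https://arxiv.org/html/2504.10058v1</a></li>
  <li>Cost of GDPR compliance: <a href="https://secureprivacy.ai/blog/cost-of-gdpr-compliance">https://secureprivacy.ai/blog/cost-of-gdpr-compliance</a></li>
  <li>Mechanisms of Data Stewardship: <a href="https://www.adalovelaceinstitute.org/report/legal-mechanisms-data-stewardship/">https://www.adalovelaceinstitute.org/report/legal-mechanisms-data-stewardship/</a></li>
  <li>The Need for Intermediaries: <a href="https://hai.stanford.edu/news/radical-proposal-data-cooperatives-could-give-us-more-power-over-our-data">https://hai.stanford.edu/news/radical-proposal-data-cooperatives-could-give-us-more-power-over-our-data</a></li>
</ul>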

</div>]]></content><author><name>Roman Belaire</name></author><category term="cooperative" /><category term="economics" /><category term="law" /><summary type="html"><![CDATA[In the current data ecosystem, large firms like Google or Meta already capture data, attention, and dollars without sharing power or revenue. Assuming their leadership pursues rational business goals — minimize costs, maximize profits — there is no reason for them to engage with a cooperative of users unless they are forced to. “Forced,” in this case, being either by regulation or by the cooperative reaching a critical mass such that a boycott would be significantly detrimental to business. Both are unlikely to happen, since there is currently no proven legal or normative blueprint to break the ice. Moreover, large data firms will actively avoid ceding bargaining power over data to protect their control and margin, as mandated by their duty to their shareholders.]]></summary></entry><entry><title type="html">What is a Data Cooperative?</title><link href="https://takebackyourdata.org/essays/what-is-a-data-cooperative/" rel="alternate" type="text/html" title="What is a Data Cooperative?" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/what-is-a-data-cooperative</id><content type="html" xml:base="https://takebackyourdata.org/essays/what-is-a-data-cooperative/"><![CDATA[<p>The writings on this publication aim to normalize personal data as a form of property, with the ultimate goal of returning agency to the individual via legal ownership. A significant barrier to this goal is the challenge of mediating data transactions between millions of individuals and the big data brokers that wish to do business with them, leading to a lopsided data market akin to the labor market. 
Following this logic, it makes sense to apply the <em>cooperative</em> model that many workers, farmers, and small business owners have used worldwide — but for data.</p>

<p>This essay is predated by <a href="https://arxiv.org/html/2504.10058v1">“Democratic Models for Ethical Data Stewardship”</a> (Mendonça et al.), which provides more robust definitions for those interested.</p>

<h2 id="stewardship-vs-ownership">Stewardship vs Ownership</h2>

<p>The basic moral claim I wish to make across this project is that the individual has a claim over the data that is derived from them, that it should be recognized legally, and that a framework needs to exist to facilitate this. Under this view, the individual must maintain traditional ownership of their data, which implies exclusive control and the right to refuse access by others. Collective ownership, where a group pools its data and shares legal ownership over the entire pool, violates this protocol by adding friction to the process of opt-out. Collective <em>stewardship</em>, on the other hand, defers only the maintenance, governance, and facilitation of data to the collective while retaining individuals’ claims to ownership.</p>

<p>Mendonça provides a good roadmap by insisting on the clear delineation between data cooperatives and data unions. They define a data union as a legal entity established to pool data for stronger negotiations, without necessarily creating a statute of ownership within the pool. In practice, this would mean a member contributes their data to the pool under the promise that it will be fairly and properly managed by the union organizer.</p>

<h2 id="cooperatives-vs-unions">Cooperatives vs Unions</h2>

<p>Data cooperatives, in contrast, maintain a clear statute of ownership via a democratic voting bloc where members collectively determine what to do with their individually held data. This approach is more aligned with the ethos of Data as Property, though the facilitation of usage and enforcement of internal policies is less implicit and must be actively handled.</p>

<p>Mendonça also briefly describes two ideas that are less aligned with Data as Property, but are worth mentioning to more clearly define what we want out of a cooperative.</p>

<p><strong>Data trusts</strong>, such as the Mayo Clinic’s patient data trust, are a way to ensure data is handled with care and proper governance by a trusted party. This method is the easiest on the part of the members; however, it offers the least democratic control (indeed, none), forgoing ownership for security and accountability. In contrast, a data cooperative’s purpose is to empower its participants with agency over their data, in exchange for an increase in personal responsibility.</p>

<p><strong>Data commons</strong>, on the other hand, are somewhat antithetical to the idea of individual rights towards data: a data commons is similar to a data union concerning communal access and control; however, it also imposes the idea that data is a shared resource. Data as Property takes the opposite stance, that data belongs to the individual, on the premise that the current attitude towards data has treated it largely as a common resource, and has resulted in various legal, economic, and moral harms against society.</p>

<h2 id="what-is-a-data-cooperative">What is a Data Cooperative?</h2>

<p>Following these definitions, a data cooperative can then be defined as a voting body that facilitates the governance, operations, and compensation matters surrounding the use of member data. The guiding principles of a cooperative, established by the International Cooperative Alliance and based on the Rochdale Principles of the 19th century, relate to individual ownership of data via:</p>

<ol>
  <li><strong>Voluntary Membership:</strong> The right to be deleted, a commonly held legislative right similar to the right to privacy, requires that members can opt out at will.</li>
  <li><strong>Democratic Member Control:</strong> To maintain the individual claim over one’s data, democratic control must be maintained.</li>
  <li><strong>Member Economic Participation:</strong> Individual claims over data also require fair and proportional economic incentives.</li>
  <li><strong>Autonomy and Independence:</strong> This is the core argument behind Data as Property.</li>
  <li><strong>Education, Training, and Information:</strong> One of the primary harms resulting from the current legal position around data is that there is an incentive <em>against</em> educating individuals about how their data is used. As such, education is a necessary component of trust and transparency.</li>
  <li><strong>Cooperation Amongst Cooperatives:</strong> While different cooperatives may align on different goals, there is a technological incentive for cooperatives to share information: if cooperatives share a similar or identical governance or information structure, their barriers to adoption, sales, and recognition are drastically reduced.</li>
  <li><strong>Concern for Community:</strong> Because data derives part of its value from the relationships between individuals, it is in everyone’s best interest that their community’s data is also treated fairly and with respect.</li>
</ol>

<h2 id="governance">Governance</h2>

<p>A data cooperative must set, manage, and enforce the handling of member data. This is both internal and external — governance relates to the voting structure, membership criteria, and personal and collective bargaining requirements when it comes time to negotiate. Roughly, a data cooperative needs to determine:</p>

<ul>
  <li>Negotiating structure</li>
  <li>Voting model</li>
  <li>Data policy (what can buyers do with our data?)</li>
  <li>Participation rules (e.g., can a member opt out of one negotiation and maintain membership?)</li>
</ul>

<h2 id="operations">Operations</h2>

<p>A data cooperative must also manage the practical operations of data stewardship, such as:</p>

<ul>
  <li>Retaining legal counsel to pursue damages</li>
  <li>Facilitating external functions (transactions, audits, etc.)</li>
  <li>Determining technical operations</li>
  <li>Carrying out internal functions (voting, policy enforcement, etc.)</li>
</ul>

<p>The specifics of each would rely on the governance and economic requirements of the collective.</p>

<h2 id="economic">Economic</h2>

<p>Lastly, a data cooperative needs to manage the economic aspects of its model. This entails both the pricing and negotiations for external use, and also the design of economic incentives for member participation to begin with. It is unclear as of yet if there is a preferable model, or if there even <em>should</em> be a preferable model, but examples include direct participation (1 KB = 1 cent, for example), weighted participation (medical data &gt; retail history), and cumulative weight (compensation based on tenure and overall participation).</p>
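
<p>To make the three example models concrete, here is a minimal sketch of how each payout might be computed. Only the one-cent-per-kilobyte rate comes from the example above; the category weights, the tenure bonus, and all function names are assumptions chosen for illustration, not a proposed pricing scheme.</p>

```python
def direct_payout(kilobytes, cents_per_kb=1):
    """Direct participation: pay a flat rate per kilobyte contributed."""
    return kilobytes * cents_per_kb


# Hypothetical weights; the text only says medical data outranks retail history.
CATEGORY_WEIGHTS = {"medical": 3.0, "retail": 1.0}


def weighted_payout(kilobytes_by_category, cents_per_kb=1):
    """Weighted participation: scale the flat rate by a per-category weight."""
    return sum(
        kb * cents_per_kb * CATEGORY_WEIGHTS[category]
        for category, kb in kilobytes_by_category.items()
    )


def cumulative_payout(base_cents, tenure_years, bonus_per_year=0.05):
    """Cumulative weight: add a hypothetical tenure bonus to a base payout."""
    return base_cents * (1 + bonus_per_year * tenure_years)


print(direct_payout(200))
print(weighted_payout({"medical": 50, "retail": 150}))
print(cumulative_payout(200, tenure_years=4))
```

<p>In this toy setup the same 200 KB earns more under the weighted and cumulative models than under the direct one, which is the kind of trade-off a cooperative’s members would have to vote on.</p>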

<p>The economic structure of a cooperative is the most important aspect, as membership growth is nominally dependent on payout and, in turn, directly contributes to bargaining power. At a minimum, members should feel that participation is always economically preferable to non-participation. To avoid the case of moral injury where destitute citizens feel forced to sell intimate knowledge of themselves, the economic reward of participating should be balanced with the moral reward of self-determination regarding opting in or out of a decision; something like a minimal incentive may work to this end.</p>

<h2 id="technological-framework">Technological Framework</h2>

<p>The last piece — which I exclude from the above three pillars, as it is not so much a guiding principle — is the technological framework used to facilitate the cooperative. The challenge with data cooperatives versus normal workers’ co-ops is the technical knowledge required to manage the data. In a traditional co-op, each worker knows exactly what the utility, operation, and storage requirements are for whatever they are contributing; this cannot easily be said today for user data. Moreover, due to its digital nature, the scale of these features requires specific infrastructure and treatment.</p>

<p>While any centralized platform could be developed, a blockchain infrastructure may be a worthy candidate. On a blockchain, voting records, governance logs, and transaction logs would all be decentralized and tamper-evident, facilitating individual agency within a large collective while remaining transparent and stable. However, this must be treated as infrastructure and not the incentive structure itself, meaning the blockchain should be divorced from any nominal value, that is, from an associated cryptocurrency. Attaching a tradable asset to participation in the cooperative would introduce volatility and misaligned incentives and open the cooperative up to abusive behavior.</p>
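
<p>The tamper-evident property does not depend on a token. As a minimal sketch, the hash-chained log below uses only Python’s standard library; the entry fields and function names are hypothetical, and the example deliberately leaves out the decentralized part (replication and consensus across members), which a real deployment would need.</p>

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash, payload):
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Append a payload, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload,
                "hash": entry_hash(prev_hash, payload)})


def verify(log):
    """Return True only if every entry still matches its recorded hash."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"type": "vote", "proposal": "license-42", "member": "alice", "choice": "yes"})
append(log, {"type": "vote", "proposal": "license-42", "member": "bob", "choice": "no"})
print(verify(log))

log[0]["payload"]["choice"] = "no"  # tamper with an earlier vote
print(verify(log))
```

<p>The first check passes and the second fails, because editing an earlier vote changes its hash and breaks the chain. No asset or currency is involved at any point.</p>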

<h2 id="conclusion">Conclusion</h2>

<p>This essay describes what a data cooperative is meant to be: a member-owned, democratically governed structure for stewarding personal data while preserving individual ownership, consent, and opt-out rights. A well-defined data cooperative manages governance, operations, and incentives so members can collectively negotiate the use of their data without surrendering personal control. Technological progress is making a data cooperative more feasible than it was even a few years ago, with better tools for secure online voting and tamper-evident governance; these developments make the cooperative model a practical design for self-determination over personal data.</p>]]></content><author><name>Roman Belaire</name></author><category term="cooperative" /><category term="economics" /><category term="law" /><category term="theory" /><summary type="html"><![CDATA[The writings on this publication aim to normalize personal data as a form of property, with the ultimate goal of returning agency to the individual via legal ownership. A significant barrier to this goal is the challenge of mediating data transactions between millions of individuals and the big data brokers that wish to do business with them, leading to a lopsided data market akin to the labor market. 
Following this logic, it makes sense to apply the cooperative model that many workers, farmers, and small business owners have used worldwide — but for data.]]></summary></entry><entry><title type="html">Data as Labor</title><link href="https://takebackyourdata.org/essays/data-as-labor/" rel="alternate" type="text/html" title="Data as Labor" /><published>2026-03-24T00:00:00+00:00</published><updated>2026-03-24T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/data-as-labor</id><content type="html" xml:base="https://takebackyourdata.org/essays/data-as-labor/"><![CDATA[<p>The central thesis of this publication is that data should be considered property, though only in the colloquial sense that individuals should have a recognized claim over the data they produce. However, for the purposes of <em>legal</em> definitions and treatments of data, it is useful to compare it to labor and property and their corresponding legal precedents.</p>

<p>This essay will expand on ideas from <a href="https://doi.org/10.1093/jla/laz004"><em>Is Data Labor?</em></a> by Julian Jonker and <a href="https://journals.sagepub.com/doi/10.1177/20539517211020220"><em>Data as Commodity</em></a> by Sam Popowich, both of which provide excellent analyses of the relationship between data and the individual.</p>

<p class="note"><em>Note: this essay refers to "data" under the assumption of personal data, ignoring other forms of data — e.g., weather reports.</em></p>

<p>I will approach the treatment of data from three angles (as labor, capital, and commodity) and compare how intuitive each is as a classification for data. By the end, we will see that data is not quite any one of them, but some mix.</p>

<h2 id="commodity">Commodity</h2>

<p>Nearly a decade ago, major news media began publishing the headline “Data is the New Oil.” This was meant to capture the idea that data was the newest and most valuable resource to enter the economy. While the intuitive comparison is easy to digest, there are a few key differences between data and oil.</p>

<p>First, oil is a limited, non-renewable resource, while data is practically limitless. As such, the laws of supply and demand would dictate that the price of data should be zero, yet that is clearly not the case when we observe the modern economy. Second, data is not an exhaustible resource like oil or wood; it is more like air or dirt. Data is sometimes considered non-rival in that one party using it does not prevent another party from doing the same; however, this is not always the case, as elaborated later. Finally, data exhibits network effects that physical resources do not: 100 barrels of oil have 100 times the value of a single barrel, while 100 points of data are worth far more than 100 times the value of a single datum.</p>

<p>The result of these differences is that it is difficult to view data properly as a resource; rather, it more closely aligns with capital due to its longevity, ubiquity, and network properties. Suffice it to say, viewing data as a commodity or resource only makes sense on a surface level, and while it makes for an easy economic argument (that platforms are simply gathering the data as a resource), this view ignores the more intertwined relationship between the individual and his or her data.</p>

<p><strong>On the non-rivalry of data.</strong> Data is often considered a non-rival asset, in that using one dataset to train an AI model does not destroy the value of the dataset to others, and in fact does not destroy the data at all, creating an “infinite” amount of data. While this is generally true, there are some caveats to consider that are becoming more and more prevalent as AI models improve. Broadly, there is a corporate effort to privatize or otherwise restrict access to high-quality data, making it a <em>competitive</em> asset, though in principle this doesn’t affect the value of the data itself.</p>

<p>More recently, however, there are conversations surrounding the supposed infinite supply of data. As new models require fresher, higher-quality data, this creates a “saturation” effect that bounds the marginal value of a data point. The older a dataset is, the less valuable it is to new models, which have already encoded the relevant behaviors within the model; this creates a sort of self-rivalry where each firm can use a dataset a limited number of times, but unlimited competitors can use the same data without exhaustion. All this to say that data has some unique qualities that separate it from the class of economic objects known as “commodities.”</p>

<h2 id="capital">Capital</h2>

<p>If data is not a good, is it instead capital? While the network effects it experiences would usually point to this, it differs greatly from other forms of capital when considering its relationship to the final product.</p>

<p>Consider a textile factory and its machines: the production of cotton fabric occurs within these bits of capital (the looms) without changing them, while the output is necessarily a transformation of the input (the cotton). In an AI datacenter, the electronics, server racks, and processors clearly constitute capital, and the output (the AI model) is definitively a transformation of the data. However, the data itself remains in its original state, thanks to its intangibility. Jonker posits that, because of this, data is more like <em>human capital</em>, where the benefit of the data is similar to the benefit of a worker’s skill. However, data can be separated from the human who produced it, unlike skills and knowledge.</p>

<p>Thus, if we equate data directly to capital and consider it as some kind of digital machine, we risk erasing the fact that data originates from users. This enables an extortive claim that users are not entitled to recognition or compensation for their contributions, since the data is treated as an autonomous asset rather than a trace of human activity. In doing so, we lose the inherent connection between users and the data they produce.</p>

<p>This leaves us with an incomplete picture of data in its relationship to the economy. It is certainly not a commodity, and while it shares features with capital, treating it as such ignores its human origin. Data is more relatable as an input to a business than as the means of production itself, which leaves the remaining factor of production, labor.</p>

<h2 id="labor">Labor</h2>

<p>While classifying data as a form of <em>labor</em> is counterintuitive, it is the remaining economic object that we can compare it to in hopes of deriving some legal precedent for ownership. Labor, like data, originates in human activity, carries the imprint of individual contribution, and raises natural questions of ownership and compensation. Beyond these shared traits, data also creates a similar dynamic between the user and platform (that is, between worker and owner). Per Jonker, data is most like labor in that:</p>

<ul>
  <li>Nearly everyone prefers compensation over free contributions,</li>
  <li>Buyers of data have a systemic bargaining advantage over individual sellers of data,</li>
  <li>The terms of authority between the platform and user are open-ended and therefore open to abuse. For example, a platform may use your data for political advertising without your knowledge; this is un- or under-specified in the terms and conditions to allow the platform to adapt to new business conditions.</li>
</ul>

<p>Clearly, data occupies the same socioeconomic niche as labor, even though the “physical” manifestation differs. Still, there are some differences. Labor is best understood as a <em>process</em>, while data is a <em>thing</em>. Labor is limited and rival, while data is portable and often non-rival.</p>

<h2 id="conclusion">Conclusion</h2>

<p>We are left at a crossroads: recognizing data as capital ignores the instinctual claim over our own behavior and privacy, while recognizing data as labor captures political aspects but fails to account for the fact that data is an object.</p>

<p>As such, perhaps we can stake out a new treatment of data, wherein it is indeed capital but necessarily an outcome of some human effort. That is, it is a distinct, transferable economic entity with a utility and an exchange value, but its principal first owner is the individual from whom it is derived. Under this treatment, we can recognize the moral claim an individual has over their “labor,” without degrading the utility of data by removing its transferability.</p>

<p>To operationalize this, it would be prudent to set up the organizational frameworks described elsewhere in this publication — that is, data cooperatives — both to serve as a legal precedent for data ownership and to empower the sovereign individual.</p>]]></content><author><name>Roman Belaire</name></author><category term="economics" /><category term="theory" /><category term="law" /><summary type="html"><![CDATA[The central thesis of this publication is that data should be considered property, though only in the colloquial sense that individuals should have a recognized claim over the data they produce. However, for the purposes of legal definitions and treatments of data, it is useful to compare it to labor and property and their corresponding legal precedents.]]></summary></entry><entry><title type="html">You’ve Already Opted In</title><link href="https://takebackyourdata.org/essays/youve-already-opted-in/" rel="alternate" type="text/html" title="You’ve Already Opted In" /><published>2026-03-13T00:00:00+00:00</published><updated>2026-03-13T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/youve-already-opted-in</id><content type="html" xml:base="https://takebackyourdata.org/essays/youve-already-opted-in/"><![CDATA[<p>In 2026, it is more accurate to say that people have already been enrolled in AI training than to ask whether they wish to “opt in.” Generative and predictive models are built on large mixtures of web-scraped text, social-media content, product logs, and advertising telemetry that reflect years of ordinary online behaviour. Most individuals never encounter an AI-specific consent screen; their contribution is mediated instead through terms of service, tracking cookies, and product settings that treat model training as “service improvement.”</p>

<p>Regulators have begun to acknowledge that obtaining explicit, revocable consent from each person whose data appears in training corpora is largely impracticable at internet scale, especially for web scraping and cross-platform tracking. As a result, the dominant legal framing in Europe and elsewhere rests on “legitimate interests,” a legal term that removes individual control over personal data as the primary object of regulation. Instead, it <em>assumes</em> that large-scale data processing is unavoidable, and focuses on balancing that activity against individual rights after the fact.</p>

<p>Contemporary AI systems rely on large datasets assembled from online activity so heavily that most people are already inside the training environment, whether or not they ever consciously engage with AI. Legal and technical infrastructures have been built on the premise of ubiquitous data availability: regulators increasingly surrender to the idea that individual consent is impracticable at scale for key practices such as web scraping, while platforms and corporations normalize surveillance. The result is an environment where “big data” is not just a quantitative descriptor but the precondition for competitive participation in AI development, and in which meaningful non-participation of the end user has become effectively impossible.</p>

<p>This essay examines four major pipelines through which personal data flows into contemporary AI systems. Each traces a different route — mass text harvesting, social media, advertising, and downstream models — showing how ordinary digital behaviour becomes embedded in the infrastructure that trains and powers modern AI.</p>

<h2 id="big-data-is-a-structural-condition">Big Data is a Structural Condition</h2>

<p>First, we should set the record straight: data is no longer a byproduct or measurement of our time spent online. It is now a primary source of value; it is the substrate in which our digital (and increasingly, physical) lives are grown.</p>

<p>From the perspective of service providers, the relevant question is no longer whether data can be collected, but how much and from where. Large language and multimodal models are trained on mixtures of scraped web text, digitised books, code repositories, images, and other online content, precisely because such data provides the <strong>only</strong> available corpora at the scale required for current architectures. Reporting and technical analyses disclose that these training sets routinely include <strong>personal information</strong> — names, contact details, biographical profiles — because they mirror what people expose online via day-to-day activities.</p>

<p>This dependence on large-scale datasets translates directly into competitive advantage. Organizations that control major web platforms, social networks, or adtech infrastructure possess high-quality libraries of behavioral and interaction data that can be used to train models and direct user attention. Meanwhile, empirical work on model privacy risks shows that trained models may retain statistically detectable traces of individual records, which contradicts common narratives stating that data can easily be anonymized. Together, these dynamics fix data as both an economic and a technical prerequisite for contemporary AI, and they make the idea of a clean boundary between opting in and out of training sets increasingly untenable.</p>

<h2 id="pipeline-1-web-scraping">Pipeline 1: Web Scraping</h2>

<p>Web scraping is the foundational practice through which AI developers obtain large text and image datasets. Automated programs systematically copy publicly reachable pages — including blogs, forums, documentation sites, news articles, and other publisher content — into datasets that can reach billions of pages in length. Analyses of prominent datasets and legal commentary alike confirm that these collections frequently contain personal information and copyrighted material, reflecting the composition of the web itself rather than any fine-grained selection for consent or licensing status.</p>

<p>Regulators have been explicit that obtaining informed, granular consent from each individual whose data appears in scraped content is, in practice, unworkable. The UK Information Commissioner’s Office, for example, has suggested that “legitimate interests” will often be the only realistic lawful basis for web scraping in the context of generative AI, subject to implicit permissions rather than individual opt-ins. Submissions to the ICO’s consultation process acknowledge that data collectors have no direct relationship with most data subjects and cannot feasibly notify or solicit consent across billions of web pages. In this sense, scraping presupposes that people have already “agreed” by virtue of publishing on the open web, even though data-protection laws do not treat public availability as consent for arbitrarily repurposed processing.</p>

<p>Attempts to retrofit transparency into this environment tend to focus on high-level disclosures and partial opt-outs. Some initiatives advocate for labelling of models trained predominantly on licensed or public-domain content, while others explore technical signals (such as <code class="language-plaintext highlighter-rouge">robots.txt</code> directives) to express site-level preferences about scraping. Yet these measures operate under a core assumption: that the public web is, by default, a mineable resource.</p>
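
<p>For readers unfamiliar with how such a site-level signal works, the sketch below parses a small <code class="language-plaintext highlighter-rouge">robots.txt</code> with Python’s standard <code class="language-plaintext highlighter-rouge">urllib.robotparser</code> module and asks whether two crawlers may fetch a page. The crawler names and the URL are hypothetical placeholders, not real user agents.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a personal site: block one AI crawler, allow the rest.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/essays/my-post/"
print(parser.can_fetch("ExampleAIBot", url))   # the named crawler is refused
print(parser.can_fetch("SomeOtherBot", url))   # every other crawler is allowed
```

<p>The check is purely advisory: it expresses a preference for the whole site, applies only to crawlers that choose to consult it, and says nothing about the individuals whose data appears on the page.</p>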

<blockquote>
  <p><em>Scraping presupposes that people have already “agreed” by virtue of publishing on the open web, even though legal precedent does not treat public availability as consent.</em></p>
</blockquote>

<h2 id="pipeline-2-social-platforms-and-shadow-profiles">Pipeline 2: Social Platforms and Shadow Profiles</h2>

<p>Social platforms occupy a different position in the data landscape. They capture information about personal ties, interactions, and preferences — follows, likes, comments, shares — that is valuable for both narrow recommendation systems and broader AI models. Policy updates and reporting over the past several years show major platforms stating that public posts and images may be used to train AI tools, including generative models.</p>

<p>A central feature of these systems is that they additionally create and enrich records about people who are not active participants, or who participate only minimally. Research and investigative work on “shadow profiles” documents how platforms take contact lists, tagged photos, and other users’ uploads to construct profiles of non-users or of abandoned accounts, sometimes long before or after any intentional use. Platforms typically claim that collected data is used to improve services such as friend recommendations, security checks, and targeted advertising, all functions that primarily affect active users. When the same data is incorporated into AI training or fine-tuning pipelines, however, this boundary begins to collapse. In training datasets, the significance of a record often lies less in the choices of the individual who generated it than in its statistical relationship to other records.</p>

<p>Even individuals who never post publicly, or who avoid creating accounts on major platforms, can be represented in training data through other people’s disclosures and ubiquitous contact-sync features. Opting out, in such a context, would require coordinated non-participation across social networks and a redesign of platform infrastructures that currently treat all captured signals as potential inputs into learning systems.</p>

<blockquote>
  <p><em>Opting out would require coordinated non-participation across social networks and a redesign of platform infrastructures.</em></p>
</blockquote>

<h2 id="pipeline-3-advertising-telemetry">Pipeline 3: Advertising Telemetry</h2>

<p>Contemporary advertising relies on continuous collection of behavioral telemetry: page views, clicks, time on site, approximate location, device characteristics, referrers, and trackers. This infrastructure was built to support targeting and measurement, but the same logs are now routinely used as training data for models that predict clicks or purchases, and that construct targeting profiles for advertisers.</p>

<p>Empirical investigations of industry “opt-out” tools show that even motivated individuals struggle to prevent their data from being collected and propagated across the adtech ecosystem, due to fragmented interfaces, opaque identifiers, and the persistence of historical logs. From a model-development perspective, these constraints are not incidental: the value of behavioral data lies precisely in its continuity and coverage, and the cost of honoring per-person retroactive withdrawal would be substantial for systems already trained on large corpora.</p>

<h2 id="pipeline-4-synthetic-data">Pipeline 4: “Synthetic” Data</h2>

<p>Across web scraping, social platforms, and adtech pipelines, the immediate output is not a model but a set of large datasets assembled from digital traces. These collections are cleaned, normalized, and filtered before being combined into training corpora for machine learning systems. Because their composition largely mirrors what is available online, they inevitably include personal data in many forms. Thanks to shadow profiles, even when individual records are anonymized, they retain value through aggregation.</p>

<p>Once trained on these broad datasets, models are further refined using narrower and often more sensitive data sources. Fine-tuning stages incorporate application-specific dialogues, user interactions, and engagement signals like comment threads, clicks, or viewing behavior. The resulting systems are deployed on social media feeds, recommendation engines, and advertising exchanges to optimize for your attention. Deployment itself generates additional data, which is subsequently fed back into the training pipeline. In effect, data collection and model improvement become mutually reinforcing, creating a feedback loop that advantages those who already control large-scale user data streams.</p>

<p>This pipeline raises perhaps the most consequential privacy implications. Research in machine learning privacy has demonstrated that trained models can sometimes reveal information about their training data through techniques such as membership-inference attacks. Even if raw records are later deleted or anonymized, their statistical influence can persist in the model parameters derived from them. In this sense, personal data does not merely pass through the system — it becomes embedded within models that may be reused, fine-tuned, and deployed far downstream from the original point of collection.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Taken together, these pipelines illustrate how personal data moves from ordinary digital activity into the infrastructure of AI. Web scraping absorbs the public web into large training libraries; social platforms transform interactions into personal information; advertising telemetry tracks behavior across much of the online economy; and downstream training converts these datasets into models that are widely reused and redeployed. At no point in this process is participation meaningfully negotiated with the individuals whose data is involved. Instead, participation emerges as a structural consequence of how digital systems are organized.</p>

<p>This arrangement reflects a broader shift in how personal information functions within the digital economy. Data is no longer merely collected to provide discrete services or features. It has become a raw material for model development and a tangible input into systems whose effects extend far beyond the platforms where the data was first generated. As AI systems are trained, fine-tuned, and deployed across industries, the traces of everyday online activity become embedded within technical infrastructures that persist and evolve independently of their original sources.</p>

<p>Seen in this light, debates about whether individuals should “opt in” to AI training describe only a small part of the phenomenon. The pipelines described here suggest that participation in AI development has already been distributed across the population through the routine operation of digital platforms. Rather than a question of individual choice, the relationship between people and AI systems is increasingly defined by the structural conditions of a data-saturated environment.</p>

<div class="references">

<h2 id="references">References</h2>

<ol>
  <li>Bender, Emily M., Zhao, Ben, et al. “Your Personal Information Is Probably Being Used to Train Generative AI Models.” <em>Scientific American</em>, 19 October 2023.</li>
  <li>Information Commissioner’s Office (ICO). “The lawful basis for web scraping to train generative AI models.” 31 August 2025.</li>
  <li>Hamlins. “Decoding the ICO’s Generative AI guidelines: what you need to know.” 5 March 2025.</li>
  <li>Center for Data Innovation. “Written Evidence Submission on the Lawful Basis for Web Scraping to Train Generative AI Models.” 2024.</li>
  <li>Al Jazeera. “Are tech companies using your private data to train AI models?” 24 November 2025.</li>
  <li>Microsoft Research. “Collecting telemetry data privately.” 7 December 2017.</li>
  <li>TrustArc. “Tracking Technologies: The Hidden Backbone of AdTech and the Privacy Minefield It Creates.” 21 September 2025.</li>
  <li>The Markup. “I Tried to Use the Ad Tech Industry’s Tool to Opt Out of Personalized Ads. Did It Work?” 24 March 2021.</li>
  <li>Vox. “The tricky truth about how generative AI uses your data.” 26 July 2023.</li>
  <li>OpenMined. “ML Privacy Meter: Aiding Regulatory Compliance by Quantifying the Privacy Risks of Machine Learning.” 2024.</li>
  <li>Dev.to. “The Ghost in the Machine: How Social Media AI Builds Shadow Profiles on People Who Never Signed Up.” 2026.</li>
  <li>ICO. “What are the conditions for processing?” 2023.</li>
</ol>

</div>]]></content><author><name>Roman Belaire</name></author><category term="AI" /><category term="technology" /><category term="privacy" /><category term="law" /><summary type="html"><![CDATA[In 2026, it is more accurate to say that people have already been enrolled in AI training than to ask whether they wish to “opt in.” Generative and predictive models are built on large mixtures of web-scraped text, social-media content, product logs, and advertising telemetry that reflect years of ordinary online behaviour. Most individuals never encounter an AI-specific consent screen; their contribution is mediated instead through terms of service, tracking cookies, and product settings that treat model training as “service improvement.”]]></summary></entry><entry><title type="html">Your Data is Not Yours</title><link href="https://takebackyourdata.org/essays/your-data-is-not-yours/" rel="alternate" type="text/html" title="Your Data is Not Yours" /><published>2026-03-03T00:00:00+00:00</published><updated>2026-03-03T00:00:00+00:00</updated><id>https://takebackyourdata.org/essays/your-data-is-not-yours</id><content type="html" xml:base="https://takebackyourdata.org/essays/your-data-is-not-yours/"><![CDATA[<p>Imagine the morning routine of a 29-year-old elementary school teacher in Columbus, Ohio — we’ll call her Tracy — who thinks of herself as “not online.” She doesn’t engage in arguments on Reddit, and her news comes from headlines and friends instead of always-on political podcasts; she rarely, if ever, consciously shares things about herself online. When Tracy wakes up, she briefly checks her phone for the weather and the day’s news, gets dressed, and grabs a coffee before work. After paying for the coffee with her credit card, she enjoys it in the parking lot while scrolling through Instagram, then heads inside to greet her dear students. At lunch, she browses online for a new dress, adds a few to her cart for later consideration, and goes back to the classroom. 
After work, she watches a recent episode of the latest island-based dating show on her current streaming service, meditates, and goes to sleep.</p>

<p>At no point does Tracy willingly offer any part of herself to another, save for the kindness and knowledge she bestows on her students and colleagues that day. Yet behind the screen, her “digital twin,” as it is known in the data industry, is busy in the shadowy belly of the internet, selling slices of her on the data marketplace. Every app she opens, website she visits, and purchase she makes innocuously documents some small piece of her: search terms, click paths, the device she’s using, the clothes she wants, even the things she doesn’t want, like the articles she scrolls past. These pieces are aggregated into dossiers and sold to data brokers for pennies. No part of Tracy is off-limits: the media she consumes, the places where she shops, the profiles of others who walk past her, the type of advertisement she takes just a little longer to skip; it is all recorded, saved, packaged, shipped, distributed, and sold without her knowledge or informed consent.</p>

<p>This profile is then sold and resold to advertisers, insurers, background-check firms, and credit agencies, among others, often fed into computer algorithms to predict, classify, and group her behavior into categories used to target Tracy with what big conglomerates determine people “like her” will buy or believe. Characteristics are inferred from collected data: even if Tracy doesn’t use online health services or a menstrual tracking app (which, by the way, <a href="https://www.ftc.gov/business-guidance/blog/2023/02/location-health-and-other-sensitive-information-ftc-committed-fully-enforcing-law-against-illegal">have been caught sharing data</a>), data analysis firms are routinely able to predict correlated behaviors by, for example, counting searches for “luteal phase length” to send her a perfectly-timed ad for family planning services, or promote a particularly infuriating headline to her suggested searches.</p>

<p>Tracy never sees these profiles or any money from them, and she cannot track (let alone negotiate with) the dozens of intermediaries recording and trading little bits of her life. Yet her digital twin, a poltergeist brought to life by this data ecosystem, is continuously used to target ads, determine prices and offerings, and train models that will, in turn, shape what she sees tomorrow.</p>

<hr />

<p>When the commercial internet took off in the 1990s, personal data was treated as a kind of harmless byproduct that the marketing industry called “clickstream exhaust.” Companies logged page views and search terms for debugging or crude traffic stats, and regulators mostly focused on narrow sectors like health and finance, not on pervasive tracking of everyday life. This changed in the 2000s when, largely due to economies of scale, advertising became the primary business model for the internet. As advertisers demanded higher click-through rates, the previously unused information exhaust was directed towards a growing adtech ecosystem. Data brokers emerged to barrel up this newly tapped reservoir of information and resell it, combining website visitation behavior with transaction data, loyalty programs, loan inquiries, and social media to build “digital twins,” or persistent user profiles for millions of people.</p>

<p>By the 2010s, this ecosystem had matured into a full surveillance advertising industry, with real-time bidding systems that expose detailed information about you to thousands of firms every time a webpage loads an ad slot. Laws like the GDPR in Europe and the CCPA in California put guardrails around this trade by introducing consent banners, access rights, and opt-outs, but they leave the underlying business model intact: the commodification of behavior.</p>

<p>The most recent evolution in this consumer panopticon is generative AI. The same user data, now including TikTok videos, voice snippets, and forum posts, is scraped and fed into large models that learn from it at scale, turning humans’ rich individual lives into training data for systems meant to generate new “content” specifically engineered to capture attention, clicks, and dollars, further perpetuating this cycle. Yet in most jurisdictions, there is still no clear recognition that the people whose data underwrites these systems have any ownership claim over it at all; at best, they have procedural privacy rights that are difficult to exercise and easy to route around.</p>

<hr />

<p>Over the past decade, the dominant answer to data abuse has been more privacy laws and more regulation. Laws such as the GDPR and CCPA provide important individual rights: to see what information companies hold about you, to correct it, and, above all, to have it deleted. This is progress, but notice what these laws do not do. They do not allow you to say that your data is <em>yours</em> the way you can about your house, your bank account, or your labor. They do not give you the ability to make a clear claim like “you used my data without my permission, you owe me.” Instead, they treat data as something companies are generally allowed to collect and monetize, subject to a set of compliance duties and opt-out mechanisms. In practice, that means the default is still extraction; your rights arrive late, are hard to exercise, and are easy to design around.</p>

<p>A property framing starts from a different place. Much personal data, especially the data you directly hold on your devices and in your accounts, already looks like an asset in the legal sense: it is definable, excludable, economically valuable, and transferable. Recognizing it as such would make explicit what is currently obscured: when a company copies, trades, or uses that data to train models without meaningful consent, it is not just violating an abstract “privacy interest,” it is appropriating something of value that belongs to you.</p>

<p>I am not arguing that privacy law is useless, or that we should replace it with a pure market in data. The point is that privacy law regulates the <em>manner</em> in which firms exploit your data; a property definition questions who should hold the primary entitlement in the first place. Once we say that individuals hold a property claim over their data, it becomes natural to talk about negotiation, licensing, and compensation.</p>

<hr />

<p>Of course, treating data as property invites many challenges and critiques. Critics of data ownership warn that turning personal information into property could backfire. They worry it will create a new market where the rich can afford privacy and the poor are pushed to sell ever more intimate details of their lives, or that the transaction costs of negotiating millions of tiny licenses will entrench the giants who already dominate today’s markets. Others point out that many data points are relational (such as family and community links) and don’t fit neatly into a story of individual ownership. Even if we recognize data as property, these critiques remain valid so long as every individual has the <em>obligation</em> to steward their own data as a shepherd would his herd; this is not the model I am advocating. On the contrary, I take these concerns as reasons to enforce individual <em>rights</em> and protect them the same way one would protect one’s time and labor.</p>

<p>Specifically, I am advocating for a model that allows individuals to opt in to a “labor market” for their data while protecting those who choose not to participate. Instead of leaving each person with the impossible, expensive task of monitoring hundreds of companies and enforcing their rights one by one, we should match individual autonomy with collective governance: the individual holds the primary claim over personal and behavioral data, which they can pool into a member-owned union that negotiates fair terms, sets red lines, and refuses toxic deals altogether. Property, in this view, is not the end state; it is the legal foundation that empowers the individual and makes meaningful collective bargaining over data possible in the first place.</p>

<h2 id="three-arguments">Three Arguments</h2>

<p>To these ends, I propose three main arguments: moral autonomy, economic fairness, and balance of power.</p>

<p>There is, first and foremost, the moral argument for autonomy and identity. As individuals in a modern and empowered society, we each have the inalienable right to self-determination. Our lives are ever more intertwined with “data,” such that it no longer merely records what we do but determines how we do it; it tangibly affects our lives, from the news we see, to the purchases we make, to the food we eat. Framing data as merely “information about you” understates the bidirectional relationship we have with it, while treating it as something we own gives clarity to an intuition most people already have: it is wrong to take detailed records of your life, manipulate you with them, and turn a profit without your consent or say in the terms.</p>

<p>Second is the argument of economic fairness. Personal data is now the primary input into a vast ecosystem of targeted advertising, risk scoring, and content generation. This makes it an <em>asset</em> that generates real revenue for firms that harvest and process it. Yet you — the person creating this capital simply by living your life — see none of that value. Property law is how we recognize and organize claims over productive assets; it gives us the vocabulary to return credit to producers. Without a property claim, individuals are not producers; we are raw material.</p>

<p>Last, we must consider the structure of power that surrounds our data. Framing this conversation around privacy focuses only on compliance within existing business models, predicated on the idea that the platforms are producing the data, not extracting it from users. Treating data as property provides a different starting point, giving primary entitlement and leverage to the person instead of the platform intermediating their activity. Once it is established that any further use requires strict permission from the individual, it becomes natural to talk about contracts, licenses, conditions, revocation, and, crucially, about pooling entitlements to negotiate a better deal.</p>

<hr />

<p>The case for individual ownership of data is clear, though it has limitations that must be addressed. Like freelance labor, individual management of data is simply too exhausting for many. Transaction costs, negotiations, invoice follow-ups, and abusive markets would render such a system inoperable at scale. Instead, like-minded individuals can pool resources to build negotiating leverage, set terms, delegate accounting, and reduce costs. Modern technology has never been better suited to “democratization,” and the purpose of this project is to provide a direction and platform for efficient data sovereignty.</p>

<p>Like it or not, the world we live in is now one where our lives are defined by data. The question is not whether data will be collected and used, but who it will serve. Sticking our heads in the sand and pretending otherwise only cements a status quo in which others capture value from information drawn out of our lives while we absorb the risks and none of the rewards.</p>

<p>The alternative is to treat our data as something we own, and to build organizations that explain what is happening to our data, how we can use it, and how we can finally benefit from a resource we already produce. These organizations can negotiate democratically, turning individuals from natural resources into constituents whose vote and veto matter. Firms that want access to rich, high-quality data will have to approach cooperatives as counterparties, not as resources to be mined. Licenses will come with democratic conditions attached: no resale, fair use, audit rights, and compensation. Lawmakers, in turn, will be able to point to concrete institutions that manage data as personal property and use them as models for stronger rights in statute.</p>

<p>This is the future I am arguing for. Data exists, we produce it, and it won’t go away. It affects our lives and produces tangible value, generating financial prosperity and freedom for its owners. It belongs to you — let’s start acting like it.</p>]]></content><author><name>Roman Belaire</name></author><category term="economics" /><category term="law" /><category term="privacy" /><category term="theory" /><summary type="html"><![CDATA[Imagine the morning routine of a 29-year-old elementary school teacher in Columbus, Ohio — we’ll call her Tracy — who thinks of herself as “not online.” She doesn’t engage in arguments on Reddit, and her news comes from headlines and friends instead of always-on political podcasts; she rarely, if ever, consciously shares things about herself online. When Tracy wakes up, she briefly checks her phone for the weather and the day’s news, gets dressed, and grabs a coffee before work. After paying for the coffee with her credit card, she enjoys it in the parking lot while scrolling through Instagram, then heads inside to greet her dear students. At lunch, she browses online for a new dress, adds a few to her cart for later consideration, and goes back to the classroom. After work, she watches a recent episode of the latest island-based dating show on her current streaming service, meditates, and goes to sleep.]]></summary></entry></feed>