Open Web, Closed Data
A recent Wired exposé used openly available cell phone location logs to tie cell network pings on Jeffrey Epstein's island to the owners of those devices, or at least to the people likely to be using them, based on which location each device sat at during the day and which it returned to at night. OSINT YouTuber Ryan McBeth has demonstrated the tools he used to assess whether one social media influencer was actually overseas working on a project he claimed to be a part of.
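To make the day/night trick concrete, here is a minimal Python sketch of that heuristic: the cell a device pings most often at night is probably "home," and the daytime cell is probably where the owner works. Everything in it, the device ID, timestamps, coordinates, and the 9pm-to-6am cutoff, is an invented assumption, not the actual methodology from the reporting.

```python
from collections import Counter
from datetime import datetime

# Hypothetical ping records: (device_id, ISO timestamp, coarse lat/lon cell).
pings = [
    ("device-42", "2024-03-01T02:15:00", (18.30, -64.83)),
    ("device-42", "2024-03-01T11:40:00", (18.34, -64.93)),
    ("device-42", "2024-03-02T03:05:00", (18.30, -64.83)),
    ("device-42", "2024-03-02T14:20:00", (18.34, -64.93)),
]

def infer_home_work(records):
    """Return (likely home cell, likely daytime cell) for one device."""
    night, day = Counter(), Counter()
    for _, ts, cell in records:
        hour = datetime.fromisoformat(ts).hour
        # Treat 9pm-6am as "night"; everything else as "day".
        (night if hour >= 21 or hour < 6 else day)[cell] += 1
    return night.most_common(1)[0][0], day.most_common(1)[0][0]

home, work = infer_home_work(pings)
print(f"likely home cell: {home}, likely daytime cell: {work}")
```

The unnerving part is how little this requires: no names, no account data, just timestamps and coordinates that data brokers sell openly.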
There are two competing forces in a land grab over open source data. The first is identity stitching, which tries to piece together who a person is from the breadcrumbs they leave behind. It could be IP addresses tied to an account, the time and location of access to a website, or proximity to others who have looked at similar items. There are tens of thousands of people whose work is dedicated to finding out who you are. In the famous case study, Target mailed baby-product advertisements to a teenage girl, arousing the ire of her father, only for Target's prediction of her pregnancy to turn out correct.
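As a toy illustration of the kind of join involved (a sketch, not any company's actual pipeline), the snippet below attributes an anonymous page view to a known account because both were seen on the same IP within an hour. Every user, IP, page, and the one-hour window are hypothetical.

```python
from datetime import datetime, timedelta

accounts = [  # events where a user was logged in
    {"user": "jane_doe", "ip": "203.0.113.7", "ts": datetime(2024, 3, 1, 9, 0)},
]
anonymous = [  # events with no login, only breadcrumbs
    {"page": "/maternity-wear", "ip": "203.0.113.7", "ts": datetime(2024, 3, 1, 9, 12)},
    {"page": "/garden-tools", "ip": "198.51.100.2", "ts": datetime(2024, 3, 1, 9, 30)},
]

def stitch(accounts, anonymous, window=timedelta(hours=1)):
    """Attribute anonymous events to any user seen on the same IP nearby in time."""
    stitched = []
    for ev in anonymous:
        for acct in accounts:
            if ev["ip"] == acct["ip"] and abs(ev["ts"] - acct["ts"]) <= window:
                stitched.append((acct["user"], ev["page"]))
    return stitched

print(stitch(accounts, anonymous))  # [('jane_doe', '/maternity-wear')]
```

Real systems stitch across far more signals than an IP address, but the shape is the same: join enough weak identifiers and an anonymous visitor stops being anonymous.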
The second, countering force is data privacy. This force's power ebbs and wanes with the political climate. Privacy laws prevent the collection of data that can be used for identity stitching, the sale of our data to others, and the use of that data for nefarious purposes. Sometimes this force goes too far, making it nearly impossible to obtain data, even anonymized data, to train new AI systems. It also hands an unfair advantage to incumbents who already had access to data to train large models that newer startups can't use.
However, when the countering force is too weak, bad things happen. First, authoritarian regimes can use the data to track and crush dissent. Second, a Wild West of unchecked private industry can make our lives miserable through biased behavior. Could our insurance premiums go up based on the websites we've visited? Could adaptive pricing mean we're spending more than others on the same items because an algorithm has marked us as gullible? At one extreme, some companies trying to make money from ad revenue dox private individuals, positioning themselves as contact-information brokers.
From a consumer perspective, there are simple protections that limit our exposure. We don't have to accept every cookie presented to us, or go on the Internet from our home IP, or sign up for the next fly-by-night AI video generation site because someone tweeted about it (maybe that example is too personally specific). We can also actively seek out our own information and request its deletion.
From a company perspective, requiring user data to fulfill a service customers are paying for is a reasonable ask. 23andMe can use our DNA to generate reports and provide insights back to us. Can it retain that DNA to share further insights it might glean in the future? Sure… if it does so securely and anonymously. Selling customized drugs based on our DNA? Only if we agree first.
If we're not careful to ensure oversight, we're going to discover the damage only after the most extreme companies have gone too far with our data. If we're too heavy-handed with data privacy regulation, the result will be little innovation in AI, poor-quality and biased predictions, and less democratization of technology, leaving incumbents free to price-gouge their AI services.