One of the most common criticisms directed at Silicon Valley enterprises over the last two decades concerns their lack of transparency about the data they store on users. Platforms like Instagram and Facebook hold countless types of user details in their very own data fortresses. If you are not a tech enthusiast or an internet jockey, you might not be aware of the risks this poses. To put it briefly, user data is extremely valuable to companies whose revenue depends on analysing users and adapting to them: advertising services, social networking platforms, and many more. Now there is a new source of personal material for these companies to grab: LLMs.

Around 2021, right after the pandemic, the world was about to witness an interesting new technological revolution. After years of research and development, a few companies managed to ship a line of production-ready transformer-based neural networks.

For the internet, it was like magic. Suddenly, we all had our own Jarvis to do our homework with. For the first year or two, people were hesitant to share their real information with it, since it was basically a stranger, but then they started to warm up to it. At first it was like a Google you could talk to, but then it became something else. The internet is now on a course of isolation: every day, more people prefer to get their answers directly from an LLM rather than searching for them on websites written by humans, or even on Google. With growing dependence on such an intensely personal network, a new threat to the safety of user information is underway.

Traditionally, on social media platforms, only a small portion of data is obtained directly. Most of it is derived from user behaviour, such as dwell time or other actions that indicate a level of interest. These systems reached a point where it was possible to predict what a user would want to see on their screen in a given circumstance. You might wonder what is wrong with collecting data for personalisation, and the answer is nothing in itself; however, that data is also a great source of revenue for the Silicon Valley tech leviathans. The Cambridge Analytica scandal proved as much, and that was just the tip of the iceberg. With the recent changes in the search ecosystem, the way we get information has shifted significantly. We can now chat with the results, and by chatting with them, we tend to reveal more about ourselves, more directly, than ever before.

People ask LLMs what they should buy, where they should go, and even share their full names so they can spend less effort later. Beyond the practicality of this, the data provided is really valuable. This personal information is stored in a memory layer so the LLM can personalise its future responses accordingly. However, the better the LLM knows the user, the more information the user provides, and that information can reveal a far deeper knowledge of the user. Unlike action-based tracking, LLMs can extract what you like, what you are uncertain about, and even what you fear.

College applicants are a good example. High schoolers go through an application period that is the most stressful stretch of their educational journey. As their results come in, the pressure on them is huge, and as it mounts, they seek someone to talk to; whenever they cannot find someone, they turn to ChatGPT for help. In times of desperation, people often cannot really regulate their decisions, and that is exactly what happens to these high schoolers trying to cool themselves off. In doing so, they hand a concerning amount of information to third parties. That information could then be sold to advertising agencies so they can target the user and hit them in the right spots, making the user far more likely to get baited. By the time the user wakes up to this, it is too late: the profile is already built, stored on some node in those data fortresses.
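To make the idea of a memory layer concrete, here is a minimal sketch of how self-disclosed facts could be harvested from ordinary chat messages into a persistent profile. The patterns, field names, and class below are illustrative assumptions for this post, not any vendor's actual implementation; real systems would use the model itself to extract facts, but even naive pattern matching shows how quickly a profile accumulates.

```python
import re

# Illustrative patterns for facts users volunteer in conversation.
# These names ("name", "worry", "interest") are assumptions, not a real schema.
FACT_PATTERNS = {
    "name": re.compile(r"\bmy name is (\w+)", re.IGNORECASE),
    "worry": re.compile(r"\bI(?:'m| am) worried about ([\w\s]+)", re.IGNORECASE),
    "interest": re.compile(r"\bI (?:like|love) ([\w\s]+)", re.IGNORECASE),
}

class MemoryLayer:
    """Accumulates a user profile from messages, one fact at a time."""

    def __init__(self):
        self.profile = {}

    def observe(self, message: str) -> None:
        # Scan the message for self-disclosed facts and append each match.
        for field, pattern in FACT_PATTERNS.items():
            match = pattern.search(message)
            if match:
                self.profile.setdefault(field, []).append(match.group(1).strip())

memory = MemoryLayer()
memory.observe("Hi, my name is Ada.")
memory.observe("I'm worried about my college application results")
memory.observe("I love hiking in the mountains")
print(memory.profile)
```

Three casual messages are enough to yield a name, an anxiety, and a hobby: exactly the kind of triple an advertiser would pay for.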


