Synopsis: Grass's network uses your excess bandwidth to scrape AI data from the internet. By exploring how AI labs train their language models, we can learn a bit about what types of material the network accesses, and why your personal data is not part of the equation.
By now, you probably know that Grass uses your excess internet bandwidth to scrape AI data from the public web. As we’ve explained a bit in the past, it takes massive amounts of web data to train a language model. But what kind of data is it, exactly? And why is it needed?
To understand the answers to these questions, we need to learn a bit about how large language models work. So strap in for a minute while we take a quick look at what’s going on behind the scenes. Today, we’ll try to explain LLMs like you’re five — or at least like you’re a sophisticated adult who wants to understand AI a little better. So where to begin?
LLMs and the Word Vectors They Produce
Let’s start simple: LLMs are AI algorithms to which you can pose questions in plain language and get an actual answer. You may ask for a summary of a given topic, a translation of a particular passage, or a detailed solution to a complex problem. In response, they will generate predictive text to satisfy whatever prompt you’ve decided to input. To the untrained eye, it’s a robot that can speak English.
But how do they work? Ultimately, LLMs comb through massive amounts of written language, find patterns in the ways that certain words relate to one another, then translate these words into strings of numbers that reflect these relationships. These numbers are the language that LLMs actually speak, and they are known as “word vectors.” Let’s give an example to see how they work.
Say you’re in the mood to eat something with meatballs, but you can’t remember the name of that pasta that goes with them. If you ask an LLM what to call this mysterious noodle, it will search for a noun that is A) a pasta, and B) likely to appear in the same sentence as “meatball.” Voila: “Spaghetti.”
In a very simple model, trained only to answer meatball-related questions for forgetful diners, each word vector might have only two dimensions.
1: Does this word describe a noodle? (1 for yes and 0 for no.)
2: How strong is the correlation between this word and “meatballs” in written text?
In this case, spaghetti might be represented as [1, 0.95], with the 1 signifying that spaghetti is a noodle and the 0.95 signifying a 95% correlation with the word “meatball.” This is a higher score than any other word the model has encountered, and thus most likely to be the correct answer. There you have it: Spaghetti and Meatballs.
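For the curious, here is a minimal Python sketch of that two-dimensional toy model. Only the spaghetti values come from the example above; the other words, their scores, and the function name are made up purely for illustration, and no real LLM works this simply.

```python
# Hypothetical toy vocabulary: [is_noodle, correlation_with_meatballs].
# Only the spaghetti entry comes from the article; the rest are invented.
toy_vectors = {
    "spaghetti": [1, 0.95],
    "penne": [1, 0.40],
    "gravy": [0, 0.60],
    "bicycle": [0, 0.01],
}

def best_noodle_for_meatballs(vectors):
    """Return the noodle word with the highest meatball correlation."""
    # Dimension 1: keep only the words that describe a noodle.
    noodles = {word: vec for word, vec in vectors.items() if vec[0] == 1}
    # Dimension 2: pick the noodle most strongly tied to "meatballs".
    return max(noodles, key=lambda word: noodles[word][1])

answer = best_noodle_for_meatballs(toy_vectors)
print(answer)
```

The first dimension acts as a filter and the second as a ranking, which is all our forgetful diner needs.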
So now we understand how LLMs communicate a word’s relationship with other words, but what happens when the questions become more complicated? Instead of asking for the name of the pasta, what if you asked what a seven-year-old would call spaghetti?
To find out, you’d have to read quotes from millions of seven-year-olds and determine which word has the highest correlation with “meatball” in these very specific contexts. As it turns out, seven-year-olds (hardly known for their facility with the Italian language) are liable to mispronounce the word as “sketti” or “basketti.” At least, that’s what ChatGPT reported back a few moments ago.
Now, this raises a few questions. When answering our prompt simply required a two-dimensional assessment of general correlation, it was easy to comb through limited data and see which word appeared in the most sentences with “meatballs.” As soon as we started asking more complex questions, though, the word vectors needed many more dimensions, and the model needed far larger banks of information to fill them in. Perhaps you can see where this is going. If you want to train an LLM to answer any question a user could possibly ask, you’re going to have to access much larger datasets.
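To make the "which word appeared in the most sentences with meatballs" idea concrete, here is a rough Python sketch. The five-sentence corpus and the function name are hypothetical; a real model would read billions of sentences and use far more sophisticated statistics than a simple count.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
sentences = [
    "we ate spaghetti and meatballs for dinner",
    "spaghetti with meatballs is a classic",
    "penne tastes great with pesto",
    "the kids asked for meatballs and spaghetti again",
    "he ordered penne and meatballs once",
]

def cooccurrence_counts(corpus, target):
    """Count how many sentences each word shares with the target word."""
    counts = Counter()
    for sentence in corpus:
        words = set(sentence.split())  # each word counts once per sentence
        if target in words:
            counts.update(words - {target})
    return counts

counts = cooccurrence_counts(sentences, "meatballs")
print(counts["spaghetti"], counts["penne"])
```

With only five sentences, the count is trivial. Narrow the question to what seven-year-olds say and most of the corpus stops being relevant, which is why the appetite for data grows so quickly.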
While the scientists in our example above may be content to study meatballs alone, major AI labs are working to create incredibly refined LLMs that will someday have access to all recorded human knowledge. This requires them to spit out word vectors with far more than two dimensions, which can capture more subtle relationships between the words they read. To illustrate, let’s use this model, which was trained on the entire English Wikipedia.
Consider the word “Donkey.” In English, it’s spelled D-O-N-K-E-Y. Vectorized, it’s spelled -0.092339 followed by another 5,507 digits: a mouthful to say, and impossible to remember.
The word vectors in this model are so long because the model is trained on 199,430 unique words, and it produces a vector for each of them that communicates its relationship with all the others. Because the model was trained this way on the entirety of Wikipedia, it’s able to answer questions about the material contained in the articles within. The vectors, each more than 5,000 characters long, belie the sheer amount of information that each one relates back to. So it’s not hard to see that the correlations these models draw between words, and the patterns they discover in written content, get more and more accurate as the datasets they’re trained on get larger and larger.
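How do long lists of numbers capture a relationship between words? One common way to compare two word vectors is cosine similarity, which measures how closely they point in the same direction. The sketch below assumes that measure (we can’t say which one the Wikipedia model actually uses) and relies on made-up four-dimensional vectors rather than real ones.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional vectors, invented for illustration.
# Real word vectors have many more dimensions and are learned from text.
vectors = {
    "donkey": [0.9, 0.8, 0.1, 0.0],
    "horse": [0.8, 0.9, 0.2, 0.1],
    "spaghetti": [0.0, 0.1, 0.9, 0.8],
}

donkey_horse = cosine_similarity(vectors["donkey"], vectors["horse"])
donkey_spaghetti = cosine_similarity(vectors["donkey"], vectors["spaghetti"])
print(donkey_horse > donkey_spaghetti)
```

In this toy setup, “donkey” lands much closer to “horse” than to “spaghetti,” which is the kind of relationship a well-trained set of vectors is meant to encode.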
But how could an AI lab possibly access this much data?
The Grass Connect
This is where it all ties back to you, and the role you play by running a node on Grass. If you look at the list of models on the website we linked earlier, you can see that a variety of them are available. One was trained by reading all of the words on Wikipedia, one by combing through mountains of Google News articles, and one on the British National Corpus. Whatever data a lab wants its model to be trained on, this is the content they need to access in order to train it.
Here’s the thing: this is relatively simple when the data is crystallized and the answers won’t change. If someone asks an LLM when Columbus discovered America, the answer will always be 1492. A lab could train its model on the Encyclopedia Britannica.
But what if an LLM wants to answer questions about contemporary information? What if it wants to answer questions about popular sentiment, or how the average person feels about a certain topic? Where could you find billions of people expressing their thoughts and opinions on any topic imaginable, refreshed eternally in a never-ending stream? Modern problems, as they say, require modern solutions. In this case, the solution is social media.
Accessing this information, however, requires a nonstop connection to the internet, viewed from every corner of the Earth, capable of downloading unfathomable volumes of written language. This, my friend, is where Grass steps in. It uses its network of nodes to view all manner of information like this, and scrape training data from the public web. It's a beautiful thing.
So now you understand who these labs are, the LLMs they are trying to train, the types of data they use to train them, and how they can access it with the help of our network. This is only the most rudimentary explanation of how LLMs are trained, and we’ve obviously left a lot out in the interest of simplicity. But hopefully it goes some of the way towards explaining what's going on behind the scenes here, and how AI labs use the public data submitted on social media websites to train their AI models.
You’ll notice that nowhere in this conversation is your personal data mentioned even once, and that’s because it doesn’t factor into the equation. When we tell people the network is used to access web data, that’s often their first assumption — that they’re giving up their own data, just like they do by using social media in the first place. We just wanted to write this primer so you would know that this doesn’t happen in any way, not even 1 percent — the network is used to access public web data, often from sites like Reddit, and nothing about you is visible whatsoever. So you can rest assured that your privacy is intact — and maybe you learned something along the way.