By Daniel · December 24, 2008
Data packaging is a procedure that is heavily used in computing, but rarely are we aware that it is occurring. In broad terms, data packaging is the process of combining multiple data pieces into one large chunk, or "package", of data. In this sense, a data package is similar to an array. But unlike an array, a data package can be passed through scripts, files, sockets, and other mediums as if it were a single value. Data packages can be especially useful in Game Maker, where the programmer has only minimal control over the scope of his variables and arrays. To simulate a script returning multiple values, for example, a programmer could pack them into one data package and return the result in the form of a string.
Delimiters
You've probably used something like data packaging in the past. Maybe you wrote a login script that sends a user/password combination delimited by a pipe character ("username|password"). If you're encrypting your data using ROT13 then all is well, but what happens when you bust out your AES implementation and it spits out half a dozen pipe characters? You could find another encryption algorithm that only returns alphanumerics, but if you want your program to be clean and extensible, you shouldn't be relying on special delimiters anyway. They put constraints on what data can and cannot be sent, and this is generally bad programming form. Delimitation techniques are perfectly fit for some simple uses, but they have their limits.
Size Declarations
Back in the very early ages of computer programming, some guru had the idea of using size declarations to store variable-length data pieces. If we put a piece of data at the beginning telling us how many characters are contained in the username, then our program can figure out where the username data stops and the password data begins. In fact, with this technique we are not limited to two values (such as username and password). We can pack and unpack an arbitrary number of data pieces if we simply prefix each one with its own size declaration.
For most purposes, this technique works just fine. But it's not scalable. If we use 8-bit size definitions, then we get 2^8 = 256 possible sizes. Most players will select names that are less than 256 characters, but you can never be sure. If we want more flexibility, we can use 16-bit size definitions to give us 2^16 = 65,536 size options. This should be plenty for usernames, but what if we need to send a 70k image? Bumping the length declaration up to 24 or 32 bits would put you in the safe zone, but if you're packing data to be sent over a network or you just have lots of little data pieces, you can't afford to be wasteful. You could keep track of how long the length declaration is for each piece of data, but what a mess that would be! We'd surely be breaching at least four of the laws of programming.
Meta Size Declarations
In this tutorial, we address this problem using variable sized length definitions. In other words, we will introduce a meta length declaration to tell us how long the main length declaration is. But to avoid circular regression, we will structure the meta length declaration differently. First of all, we will convert everything to binary. (If you are not familiar with binary representation, I suggest reading an introduction like this one.) Then we will check the size of the binary sequence corresponding to the main length declaration -- in other words, how many of the "0"s and "1"s correspond to the main length declaration. Then we will write out that many "0"s, and put them at the very front of the data segment, before the main length declaration. Those "0"s are our meta length declaration -- they tell the program how many bits the main length declaration will consist of.
What if the main length declaration itself begins with one or more zeros? We don't want our program to interpret those zeros as part of the meta length declaration! To solve this, we will put a "1" between the meta and main length declarations to act as a separator. Even if our main length data begins with a "1", we should still add an extra "1", because the unpacking function is going to discard it as a meaningless delimiter.
Let's look at this from the other side -- the unpacking procedure. We have some unknown number of "0"s, then a "1". Our program counts the zeros preceding the one and stores it as a meta-size variable, say "sizeofsize". We throw away the "1" since it's just a delimiter. Now we take the next size_of_size bits in the binary sequence, and convert the result into integer form. This is our main size -- call it data_size. Finally, we read the next data_size bits of data. This is the actual content of our data piece. Everything else left in the package is part of the next data piece and should be handled separately.
So far we've been treating the data package as a binary sequence. If our programming language supported binary sequences as a special data type we might just leave it in that form, but in Game Maker we have to store them as strings of "0"s and "1"s if we want the bit values to be easily accessible. Strings in Game Maker are meant to store 256 possible values per character, and we're only using two of them ("0" or "1"), so we'll want to convert our binary sequence back into a "real" string at the end to make the final package smaller.
Final Words
Let's take a step back and analyze what we did. Normally when we're dealing with size declarations, we read a part of a binary sequence and convert it to integer form. The trouble is that we normally don't know how much of the binary sequence corresponds to the size declaration, unless we assign it a fixed length (which is suboptimal). So we came up with an alternative system for representing integers in binary form, by simply writing out a number of "0"s equal to the value of the integer. That way, when the "0"s terminate we know we've hit the end of the size declaration. Now, we could have just expressed the main size declaration this way, but then we would be doubling the length of our data piece. So we combined the two systems -- we used a standard binary representation for our main size quantity, but used the alternative representation to describe the (meta) size of the (main) size.
The beauty of this technique is that while it produces data packages of roughly the same size as those produced with similar methods, its scalability is impeccable. If you need to pack small pieces of data just a few bytes each, this technique will do so while adding very little overhead. If you need to pack all 32 volumes of the Encyclopædia Britannica, this technique can do that too (given adequate computational resources) -- and still the overhead is minimal.
If you'd like to download a scripted GML implementation of this technique, you can do so here. (It comes with a few encryption scripts as well.) Those scripts implement some simple details that weren't covered in this tutorial, like storing data types. You can also download a demonstration program here.
Categories: Data processing
There are no comments to display.
You must be signed in to post comments.