Break stuff. Now.

Files Are Just Numbers


September 1, 2019

When we think of files, we often think of them in terms of their presentation. For .jpg, .png, .gif, we see a picture. For .mp4, .mov, .avi, .mkv, we see a video. For .docx, .xlsx, .pptx, we see a document. For .html, .js, .css, we see code in the form of text. For .7z, .rar, .zip, we see an archive file.

But under the hood, beyond the encoding and endianness sorcery, files are just numbers.

Files are just streams of structured binary data. Various pieces of information about the file are stored as bits, whether on disk or in memory. For example, every .png file starts with the same 8-byte signature: the bytes at offsets 1-3 spell "PNG" in ASCII to identify the file as a PNG, and the rest are control bytes.
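To make the PNG example concrete, here's a small Python sketch. The 8 signature bytes come from the PNG specification; the helper name looks_like_png is just something I made up for illustration.

```python
# The fixed 8-byte signature at the start of every PNG file (per the PNG spec).
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_png(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


# The bytes at offsets 1-3 are the ASCII letters "PNG".
print(PNG_SIGNATURE[1:4])    # b'PNG'
# The same 8 bytes, viewed as plain numbers.
print(list(PNG_SIGNATURE))   # [137, 80, 78, 71, 13, 10, 26, 10]
```

To check a real file, you could read its first 8 bytes in binary mode and pass them to the helper.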

While binary is used to represent data in computers, it's also used to represent numbers. Instead of the base 10 that we use for decimal numbers, binary uses base 2, and we count using only 1s and 0s. The number 97 (61 in hexadecimal), for example, is 01100001 in binary.
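If you want to check the bit pattern yourself, Python's built-in number formatting is enough. This is just a quick sketch; the variable name n is arbitrary.

```python
# The bit pattern 01100001, written as a Python binary literal.
n = 0b01100001

print(n)                  # 97  (decimal)
print(hex(n))             # 0x61  (hexadecimal)
print(format(n, "08b"))   # 01100001  (binary, padded to 8 bits)
print(chr(n))             # a  (the ASCII character with this code)
```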

So if files are binary, and binary is just numbers, then I conclude that files are numbers!

It's easier to visualize this by using an example. Let's say you have a text file containing "a" and a trailing newline (LF). If you open that file in a hex editor (I use hexdump on VS Code), you'll get the following:

  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 61 0A                                              a.

Reading the bytes right to left (that is, treating the first byte as the least significant one), that's 00001010 01100001 in binary, which is 2657 in decimal.

Now let's write "abcd" in that text file instead. We get:

  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 61 62 63 64 0A                                     abcd.

That's 00001010 01100100 01100011 01100010 01100001 in binary which is 44633907809 in decimal.
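Here's a rough Python sketch of the same conversion. It assumes the right-to-left reading used above, which is little-endian byte order. The function names bytes_to_number and number_to_bytes are my own, not from any library.

```python
def bytes_to_number(data: bytes) -> int:
    """Read the bytes as one integer, first byte least significant."""
    return int.from_bytes(data, byteorder="little")


def number_to_bytes(n: int) -> bytes:
    """Inverse of bytes_to_number, using the fewest bytes that fit n."""
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, byteorder="little")


print(bytes_to_number(b"a\n"))       # 2657
print(bytes_to_number(b"abcd\n"))    # 44633907809
print(number_to_bytes(44633907809))  # b'abcd\n'
```

One caveat with this sketch: a file that ends in zero bytes won't round-trip exactly, because those bytes become leading zeros of the number and get dropped. For a real file, you could pass in the result of reading it in binary mode.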

These examples may be small and contrived, but the idea is the same for any file. Whether it's that .html file you edit at work, or that .exe you download for games, or that .mp4 you open for videos, files are just numbers. They can also be colors if you read the bytes in groups of 3 (RGB) or 4 (RGBA).
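As a quick illustration of the color idea, this sketch chops a run of bytes into groups of 3 and treats each group as one (red, green, blue) value. The function name bytes_to_rgb is made up, and dropping the leftover bytes at the end is just one way to handle input that doesn't divide evenly.

```python
def bytes_to_rgb(data: bytes) -> list:
    """Group bytes into (r, g, b) tuples, ignoring any leftover bytes."""
    usable = len(data) - len(data) % 3
    return [tuple(data[i:i + 3]) for i in range(0, usable, 3)]


print(bytes_to_rgb(b"abcdef"))   # [(97, 98, 99), (100, 101, 102)]
print(bytes_to_rgb(b"abcd\n"))   # [(97, 98, 99)]
```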

This opens up some really interesting cases. I recently watched a video from Numberphile where they discuss this from a copyright perspective. In their example, if a movie is under copyright protection, files that depict this movie are also under copyright protection. But if files are just numbers, does that mean the numbers equivalent to these files are also under copyright protection? Is distributing the number considered illegal?

Hmmm... 🤔