Thinking in Bits

The number 65,536 is an awkward figure to everyone except a hacker, who recognizes it more readily than his own mother’s date of birth.

Snow Crash, Neal Stephenson

A question which I like to use when interviewing C++ programmers: what is the range of a 32-bit integer?

I don’t use this question very often anymore, or at least I don’t let it influence my decision very much (which is why I don’t mind spilling the beans here), because the correlation with other technical skills turns out to be not as strong as I thought it would be. Still, there are some interesting patterns in the kind of people who know the answer versus those who do not.

Among the people who do not know the answer, some of them act quite affronted that they would even be expected to know it. What is the point, they will ask, in having memorized some little piece of trivia which they could Google up in a few seconds? Isn’t “knowing where to find it” a much more useful skill? Ask me about architecture! Ask me about design patterns and data structures! All of these objections have some validity, and indeed we will certainly ask about those other things during the interview. Still, I believe that it is perfectly reasonable to expect a good developer to know the answer to the above question by heart.

There are two reasons why I believe that. The first is simple: if you don’t know the range of the data types you’re working with, how can you write reliable code? “I will look it up when I need it” is a common excuse, but that doesn’t fly with me. It would be a perfectly valid answer if I had asked you to recite from memory the arguments to some obscure library function, but we’re talking about the integers here! The most commonly used datatype in any programming language. Are you seriously telling me that, every single time you write

for (int i = 0; i < n; ++i)

you look up in the compiler documentation whether the range of i is large enough that you can safely assume n to fall within it? And then you promptly forget about it so that you need to look it up again the next time? So that, when you arrive in my interviewing room with a CV that says you have five years of C++ experience, you have looked up that particular bit of trivia many thousands of times by now, and you still don’t remember it? What’s more likely is that you have never looked it up, which means that you are basically, as Joel Spolsky would call it, programming by superstition.
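
For the record, you don’t even have to take my word for the numbers: the compiler will happily tell you. A minimal sketch using the standard <limits> header — the values in the comment assume the 32-bit int found on today’s mainstream desktop and server platforms; the C++ standard itself only guarantees 16 bits:

    #include <iostream>
    #include <limits>

    int main() {
        // On today's mainstream platforms int is 32 bits, so this prints
        // 2147483647 (a little over two billion) and -2147483648.
        std::cout << std::numeric_limits<int>::max() << '\n'
                  << std::numeric_limits<int>::min() << '\n';
    }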

The other reason why you need to know this stuff is that you will be a more effective programmer if you are able to think of what you are doing in terms of the underlying processor instructions — a skill which is useful even when working in very high-level languages, but which is absolutely necessary when working in C or C++.

For example, even a non-programmer may be aware that the reason why the world is rapidly moving to 64-bit processors and operating systems right now is that the 32-bit ones are unable to use more than 4GB of memory. But as a professional programmer, I would expect you to know why that is. The answer is simple: it’s because 32-bit machines use 32-bit memory addresses, and with those you can only access 2³² = 4,294,967,296 different bytes. If you don’t know that, you are lacking a fairly basic piece of understanding of how computers work.
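
A tiny sketch of that arithmetic, for the sceptical:

    #include <cstdint>
    #include <iostream>

    int main() {
        // 2^32 distinct byte addresses; the shift must be done in a 64-bit
        // type, because 1 << 32 would itself overflow a plain 32-bit int.
        const std::uint64_t addressable = std::uint64_t{1} << 32;
        std::cout << addressable << " bytes = "
                  << addressable / (1024 * 1024 * 1024) << " GiB\n";
    }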

Now, of course I don’t expect people to know the exact number 4,294,967,296 by heart, but “a little over four billion” (or two billion for a signed integer) would not be too much to ask. If you cannot tell whether it is more or less than a million, as was the case with several ‘senior’ candidates I had the pleasure to interview, you’d better hit the ball out of the ballpark on every other part of the interview. Which they did not. (Please note that I’m using the American usage of ‘billion’ to mean a thousand million.)

Another point is that C++ has largely fallen out of favour nowadays for general application programming. So when it is still used, it is mostly to support performance-critical algorithms or very high-volume data processing. In that kind of code, having a good mental picture of the data structures you’re using is really crucial. If you want to have millions of instances of a given object in memory simultaneously, shaving a few bytes off each instance can really matter — which means that you need to know when you can safely use a char or a short instead of just blithely using ints for everything.
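
To make the saving concrete, here is a small sketch; the struct and its field names are made up for illustration, and the exact sizes assume the typical alignment rules of mainstream platforms:

    #include <cstdint>
    #include <iostream>

    // The blithe version: four 32-bit ints, 16 bytes per instance.
    struct OrderLineWide {
        std::int32_t year;      // never needs more than 16 bits
        std::int32_t month;     // fits in 8 bits
        std::int32_t day;       // fits in 8 bits
        std::int32_t quantity;  // suppose we know it never exceeds 65,535
    };

    // The trimmed version: 6 bytes per instance on typical platforms,
    // which adds up quickly when millions of these live in memory at once.
    struct OrderLinePacked {
        std::int16_t  year;
        std::uint8_t  month;
        std::uint8_t  day;
        std::uint16_t quantity;
    };

    static_assert(sizeof(OrderLinePacked) < sizeof(OrderLineWide),
                  "the trimmed struct really is smaller");

    int main() {
        std::cout << sizeof(OrderLineWide) << " vs "
                  << sizeof(OrderLinePacked) << " bytes per instance\n";
    }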

So, here are the facts which I expect every C++ programmer, as well as the better C# and Java ones, to know more readily than their mother’s birthday:

  • A byte is 8 bits, and can hold 256 different values. Examples: the number of different characters in a non-Unicode text file, the number of different shades of red, green and blue in a true-colour image. Also the maximum filename length on ext3 and NTFS, and an unfortunate limit on path length in many programs.
  • A ‘short’, on most C/C++ compilers, is 16 bits, which means it ranges from -32,768 to 32,767 when signed, or from 0 to 65,535 when unsigned. A decade or two ago, it was quite common for the default size of int to be 16 bits on PC-based C compilers, and you may still encounter this in the embedded world. On many systems today the wchar_t type is 16 bits, so that it supports all characters in the UCS-2 subset of Unicode, which contains all of the characters you actually care about. Beware of the difference between UCS-2 and UTF-16, though. Before true-colour displays became the norm, there were video cards which supported 16-bit colour; usually this was actually 15 bits, with 5 bits (32 different shades) each for red, green and blue, although sometimes the remaining bit was used to allow 64 different shades of green, the colour to which the human eye is most sensitive. 16 bits is also the default integer size on some database systems, which can give you some nasty surprises when your e-commerce system has been in production for a few months and a customer submits the 65,536th order.
  • 24 bits (three bytes) gives you about 16.8 million (16,777,216) different combinations. Among other things, this is the number of different colours on a true-colour display. In November 2006, Slashdot had to do an emergency update of their database: they had used the MySQL ‘mediumint’ type for comment IDs, so the system broke when the 16,777,216th comment was posted.
  • 32 bits is the default integer size for almost any modern programming language — yes, even on 64-bit platforms, which will break any code which assumes that a pointer can be cast into an integer and back. As stated above, a signed integer ranges from minus to plus 2.1 billion, while the unsigned range goes up to almost 4.3 billion. This is a fairly generous range for most problem domains, but sooner or later it will bite you if you’re not aware of it. Among many other things, this is why many programs have trouble with data files larger than 2GB. Also the absolute upper limit on the number of unique IPv4 addresses (the actual limit is much lower because of the way the address space is organized).
  • A float value uses 24 bits for the mantissa, which translates into about 7 decimal digits of precision, which is less than most programmers intuitively expect. A double uses 53 bits, which is plenty for most purposes. Anyway, the most important thing you need to know about working with floating-point numbers is that it is fraught with gotchas and special cases which will trip you up sooner or later (see the sketch just after this list). And never, ever use them for monetary values.
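
To make a couple of these points concrete, a short sketch — it assumes the usual 32-bit int and IEEE-754 float and double, which is what you will find on practically every platform today:

    #include <cstdint>
    #include <iomanip>
    #include <iostream>

    int main() {
        // The 65,536th-order problem: an unsigned 16-bit counter wraps silently.
        std::uint16_t orders = 65535;
        ++orders;
        std::cout << orders << '\n';                      // prints 0

        // float carries only about 7 significant decimal digits:
        // 16,777,217 (2^24 + 1) is not representable and rounds down.
        float f = 16777217.0f;
        std::cout << std::setprecision(10) << f << '\n';  // prints 16777216

        // And the classic reason never to use floating point for money.
        std::cout << std::setprecision(20) << 0.1 + 0.2 << '\n';  // not exactly 0.3
    }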

I would also expect a serious programmer to know the powers-of-two table at least up to 2¹⁶ = 65,536. You simply run into those numbers so often that you can’t help memorizing them after a while, and it’s surprising how often this knowledge can help you in debugging various kinds of overflow issues.

So what about larger integer sizes? Well, unless you work in cryptography or some other specialized field (in which case I really hope that you didn’t learn anything new from this article), you can safely assume that 64 bits is enough for anything you will ever want to do. But here is a handy little rule of thumb I use to convert between bits and decimal digits: 10 bits gives you 1,024 values and 3 decimal digits give you 1,000, so to go from number-of-bits to number-of-digits you just divide by 10 and then multiply by 3. (Or for an even rougher approximation, just divide by 3.)

For instance, an unsigned 64-bit integer represents roughly 3 * 6.4 = 19.2 decimal digits, so it can safely be used to store numbers well over 1,000,000,000,000,000,000. Likewise, if you want to know how many bits you need to store a 12-digit number, you divide by 3 and multiply by 10, which tells you that you need approximately 40 bits.
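
The same rule of thumb written out as a couple of helper functions (the names are mine, purely for illustration), with the exact log10-based figure alongside for comparison:

    #include <cmath>
    #include <iostream>

    // Rule of thumb: 10 bits correspond to roughly 3 decimal digits.
    int digits_for_bits(int bits)   { return bits * 3 / 10; }
    int bits_for_digits(int digits) { return digits * 10 / 3; }

    int main() {
        // 64 bits -> about 19 digits; the exact figure is 64 * log10(2) = 19.27.
        std::cout << digits_for_bits(64) << " digits (exact: "
                  << 64 * std::log10(2.0) << ")\n";

        // A 12-digit number needs about 40 bits.
        std::cout << bits_for_digits(12) << " bits\n";
    }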