While working on an unrelated problem, I stumbled across very surprising (to me) behavior of a C compiler. My code was the equivalent of the following:
#include <stdio.h>
int arr[42];
int main( void ) {
printf( "%u\n", (unsigned)sizeof( &arr ) );
return( 0 );
}
The & operator produces the address of its operand, and if the type of the operand is T, the type of the resulting expression is pointer to T, at least in C89 and later. So the size of that type really, really should be the size of a pointer. Here’s what Visual C++ 1.52c, the last 16-bit Microsoft compiler, declared as ANSI-compliant, does with it:
C:\temp>cl -W4 asz.c
Microsoft (R) C/C++ Optimizing Compiler Version 8.00c
Copyright (c) Microsoft Corp 1984-1993. All rights reserved.
asz.c
Microsoft (R) Segmented Executable Linker Version 5.60.339 Dec 5 1994
...
C:\temp>asz
84
84? That’s the size of the array, not the size of a pointer to the array. Needless to say, that is at odds not only with any modern compiler (gcc, clang) but also with many other mid-1990s compilers (IBM or Watcom for example). What was Microsoft doing there?
The Visual C++ 1.52c behavior is odd, but OK, it’s an old compiler, so let’s try Visual C++ 6.0, the workhorse of 32-bit Windows development for many years:
C:\temp>cl -W4 asz.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804 for 80x86
Copyright (C) Microsoft Corp 1984-1998. All rights reserved.
asz.c
Microsoft (R) Incremental Linker Version 6.00.8447
...
C:\temp>asz
168
Well… that’s the exact same thing, only with int being 32 bits instead of 16 (168 = 2 * 84). Definitely not the size of a pointer. Still no complaints from the compiler even with the highest warning level, either. Now let’s fast-forward a few more years, to Visual Studio 2005:
C:\temp>cl -W4 asz.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
asz.c
Microsoft (R) Incremental Linker Version 8.00.50727.42
...
M:\temp>asz
168
OK, same weird thing in the 21st century. But wait… that’s not the final word. In Visual Studio 2008, Microsoft changed its mind:
C:\temp>cl -W4 asz.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
asz.c
Microsoft (R) Incremental Linker Version 9.00.21022.08
...
C:\temp>asz.exe
4
Hooray! Sanity has been restored. But why did it take so long? And more importantly, why was Microsoft doing the weird stuff in the first place? To get a good clue, we have to go all the way back to Microsoft C 5.1 (1988):
C:\temp>cl asz.c
Microsoft (R) C Optimizing Compiler Version 5.10
Copyright (c) Microsoft Corp 1984, 1985, 1986, 1987, 1988. All rights reserved.
asz.c
asz.c(7) : warning C4046: '&' on function/array, ignored
Microsoft (R) Overlay Linker Version 3.65
...
C:\temp>asz
84
Now that’s interesting. Notice that Microsoft C 5.1 warns about this construct (taking the address of an array) even at the default warning level. Starting with Microsoft C 6.0 (1990), there is no warning even at the highest level. So what changed? Obviously, the ANSI C Standard (C89)/ISO C Standard (C90) was published. Microsoft C 5.1 is fairly close to ANSI C, but was released before the Standard was published, and cannot therefore be expected to be fully compliant.
The warning produced by Microsoft C 5.1 gives an excellent hint as to what is going on. The & operator is simply ignored, which means that of course the sizeof operator produces the size of the array, rather than the size of a pointer to the array. For some inexplicable reason, Microsoft C 6.0 kept the existing behavior but eliminated the arguably very useful warning. That behavior persisted through version 14.0 of the compiler (Visual Studio 8.0 aka 2005) and changed in version 15.0 (Visual Studio 9.0 aka 2008).
To understand why the Microsoft compiler did what it did, it is useful to review the second edition of Kernighan and Ritchie’s The C Programming Language (1988). The short answer is on page 260 (Appendix C) within a list of changes in the proposed ANSI C standard relative to the 1978 edition of The C Programming Language. One of the numerous changes is the following: “Applying the address-of operator to arrays is permitted, and the result is a pointer to the array.” That is why any compiler compliant with C89 and later must produce the size of a pointer in the example given at the beginning. In K&R C, applying the address-of operator (&) to an array was actually not defined.
K&R vs. ANSI
Microsoft decided to simply ignore the operator, in a nod to the very close relationship between pointers and arrays in C. In many contexts, ignoring the address-of operator makes sense, because “an array” and “a pointer to the array” have the same value (the address of the first element in the array). Applying the sizeof operator is one area where there is a clear difference.
How the pre-ANSI behavior in Microsoft C persisted for almost 20 years after the Standard was published (and Microsoft C was supposedly compliant with it) is unclear. The most likely explanation is that precisely because taking the address of an array was not defined in K&R C, it was a rarely used feature, and no one noticed the difference.
Arrays vs. Pointers in C
Many C programmers know (or think they know) that arrays and pointers are almost the same in C. That is not a coincidence. K&R states: “In C, there is a strong relationship between pointers and arrays, strong enough that pointers and arrays should be discussed simultaneously. Any operation that can be achieved by array subscripting can also be done with pointers.”
In the language of ISO C99, the link between arrays and pointers is described as follows (6.3.2.1); note the explicit exception for the address-of operator:
Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type “array of type” is converted to an expression with type “pointer to type” that points to the initial element of the array object and is not an lvalue.
Arrays and pointers in C are syntactically and semantically so similar that programmers sometimes forget what the differences are. A pointer is a variable which is usually modifiable; it has its own storage (typically four or eight bytes), but it is entirely separate from whatever it points to. An array on the other hand could be thought of as a label identifying the start of the array in memory. The label itself does not have any storage; it is effectively a constant. But defining an array variable does allocate storage for the array.
What an array isn’t is a modifiable lvalue (in the language of the C Standard), meaning that it can’t be assigned to. The term lvalue as used in the C Standard can be slightly misleading at first glance (it is actually well defined in the text). The naming refers to an E1 = E2 style expression where E1 is sometimes called an lvalue because it’s on the left-hand side of the expression. In the C Standard, E1 would actually be called a modifiable lvalue (because, obviously, it can be modified if it can be assigned to). The more general term lvalue means an expression which designates an object. Examples of lvalues that are not modifiable lvalues are constants (in the const-qualified type sense, not preprocessor macro sense) and arrays.
Note that there is a significant but incomplete overlap between lvalues and expressions whose address can be taken. A function designator is not an lvalue but its address can be taken, whereas a bit-field or a variable with register storage is an lvalue yet its address cannot be taken.
Moral
What have we learned? C is just complex enough that there are obscure corners of the language which rarely get exercised. C also underwent a significant transformation going from K&R C to ANSI C in the 1980s. A compiler written in the K&R days (Microsoft C was initially written around 1984) could sometimes exhibit vestiges of the old behavior even after it was overhauled to be ANSI compliant.
And pointers and arrays in C are close enough that the syntactically very minor differences between them can be misleading.
The obvious solution would be to make ‘a[]’ a regular lvalue,
representing the array as a whole.
A slight problem with that is that an array need not have a known size, precisely because it is semantically a label.
Technically it could have been done, philosophically it may have been rejected because one line of code should not perform potentially very expensive operations. That is one of the nastiest things about C++, a simple assignment can be an incredibly complicated and expensive operation. The result is impenetrably complex bloatware.
Using sizeof in that case is an error, too.
And mewas talking about C 🙂 We all know how awful C++ is…
“one line of code should not perform potentially very expensive
operations.”
Don’t mind if metakes exception to this. Me’s just rebuilding a kernel
with the following line of code:
% make -j6
Certainly an expensive operation on that old craptop, but that’s just
the power of good tools.
Running ‘make’ is like executing a program, not writing code. And of course makefiles are not C, one line often takes seconds or even minutes to execute. The comment was specifically about C source code, perhaps excluding macros, because those can be a good way to hide complexity.
Hiding complexity is bad, me’ll agree. However, when things are sanely
written and properly documented (often an utopia, meknows), there isn’t
much of a problem in allowing “one line” to have a nontrivial effect…
It would be uncharacteristically fascist of C to forbid something purely
on the basis that it can be misused in a higher-level extension (C++)…
And me’s always grokked the shell as a scripting environment, as opposed
to a program launcher… and don’t forget about the old UNIX dream of
merging C and the shell together =)
Does the “/Za” flag (in versions that support it) have any effect on this behaviour?
It does not. I thought it might, but didn’t find any Microsoft C version where it would.