I’ve recently spent some time debugging curious hangs/aborts in two more or less exotic operating systems, Plan 9 and QNX 4.25. Both turned out to be caused by the same innocuous-looking BIOS change, even though the circumstances were somewhat different and the symptoms initially didn’t look similar at all.
With Plan 9, the system simply hung during boot. With QNX 4.25, the graphical installer aborted (without any obvious hint as to what might be going wrong) soon after beginning to detect devices. QNX itself continued to work.
Investigating the Plan 9 hangs was somewhat easier because the system died soon after the error occurred. The proximate cause turned out to be a corrupted stack pointer, never a good thing.
Not very real mode
With Plan 9, the system was crashing/hanging in real mode when Plan 9 detected memory, or in some cases later on when calling the VESA BIOS to set a graphics mode (behavior dependent on the exact Plan 9 release). One of the root causes was that Plan 9 has a very strange idea of what real mode looks like, miles away from how Intel defines it.
The Plan 9 real mode switcher does a half-hearted job of switching to real mode, ignoring parts of the prescribed sequence (see section 9.9.2, “Switching Back to Real-Address Mode” in Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A): namely DS/ES/FS/GS/SS selectors are not loaded with real-mode compatible limits and attributes.
That doesn’t make too much difference except for the SS selector. Plan 9 does not change the SS attributes, which means that supposedly real-mode code is run with a 32-bit stack(!), although Plan 9 sets ESP to a 16-bit value. As a reminder, a 32-bit stack signifies that instructions which implicitly use the stack (PUSH, POP, CALL, RET, etc.) use ESP rather than SP, even when executing 16-bit code. When ESP contains a 16-bit value, that makes very little difference… most of the time.
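As a rough illustration, the effect of the stack segment’s B (big) bit on an implicit stack operation can be sketched in Python (a simplified model, not real emulation; the register values are made up):

```python
# Simplified model: how a 16-bit PUSH updates the stack pointer
# depending on the SS segment's B (big) bit.

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def push16(esp, big_stack):
    """Return ESP after a PUSH of a 16-bit value."""
    if big_stack:
        # 32-bit stack: the full ESP register is decremented
        return (esp - 2) & MASK32
    # 16-bit stack: only SP changes, the high word of ESP is untouched
    return (esp & ~MASK16) | ((esp - 2) & MASK16)

# With a 16-bit value in ESP, the two cases are indistinguishable...
assert push16(0x8000, True) == push16(0x8000, False) == 0x7FFE
# ...the difference only shows once the high word of ESP is nonzero.
print(hex(push16(0x10000, True)))   # 0xfffe
print(hex(push16(0x10000, False)))  # 0x1fffe
```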
Now comes the BIOS part. One might be forgiven for thinking it’s a good idea to replace stack frame prolog/epilog sequences with ENTER and LEAVE instructions. Especially replacing the MOV SP, BP/POP BP instruction sequence with a single-byte LEAVE opcode is a no-brainer when optimizing for space.
ENTER freely, but you shall not LEAVE
Unfortunately, when running 16-bit code with a 32-bit stack, good ideas can turn bad. The problem is that in such an environment, the ENTER/LEAVE instructions aren’t symmetric. One does not learn that from Intel’s documentation; in fact, Intel’s current documentation of the ENTER instruction is simply fantasy—certainly not how the instruction really works.
Intel claims that in 16-bit code (i.e. default operand size being 16 bits) with a 32-bit stack, ENTER will move ESP to EBP. Since the LEAVE instruction moves EBP back to ESP, that makes sense. Sadly, while the LEAVE instruction really does move EBP to ESP (when running with a 32-bit stack), the ENTER instruction only updates BP.
Plan 9 artfully exploits this hole by leaving garbage in the high bits of EBP. Since 16-bit code normally only uses BP, that doesn’t matter. But while ENTER does not disturb the high word of EBP, LEAVE propagates it to ESP. In other words, when running 16-bit code with a 32-bit stack, the ENTER/LEAVE sequence will also move the high word of EBP to the high word of ESP as a side effect. Unless both happen to be the same, this is fatal.
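The fatal sequence can be sketched in a few lines of Python (a simplified model of the behavior described above, with made-up register values, not real emulation):

```python
# Simplified model of 16-bit ENTER 0,0 and LEAVE on a 32-bit stack,
# assuming the observed behavior: ENTER updates only BP, while LEAVE
# copies all 32 bits of EBP into ESP. Pushed/popped values are ignored.

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def enter16_big(esp, ebp):
    """ENTER 0,0: push BP, then set BP = SP (high word of EBP untouched)."""
    esp = (esp - 2) & MASK32                 # push BP on the 32-bit stack
    ebp = (ebp & ~MASK16) | (esp & MASK16)   # only BP is written!
    return esp, ebp

def leave16_big(esp, ebp):
    """LEAVE: ESP = EBP (all 32 bits!), then pop BP."""
    esp = ebp                                # high word of EBP lands in ESP
    return (esp + 2) & MASK32                # pop BP

esp, ebp = 0x7FFC, 0xDEAD0000                # garbage in EBP's high word
esp, ebp = enter16_big(esp, ebp)
esp = leave16_big(esp, ebp)
print(hex(esp))                              # 0xdead7ffc -- ESP corrupted
```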
How not to call the PCI BIOS
And what of QNX 4.25? The setting is different but the core of the problem is the same. The QNX 4.25 installer uses the PCI BIOS to probe for installed devices. For whatever reason, QNX calls the 16-bit protected-mode services rather than 32-bit services. But that’s not the problem.
The real problem is that QNX calls into the 16-bit protected mode PCI BIOS with a 32-bit stack, which is explicitly forbidden (see the PCI BIOS Specification, revision 2.1, section 3.2 “Calling Conventions”). Again, QNX uses a 16-bit ESP value but does not clear the high word of EBP, which is a problem when executing an ENTER/LEAVE instruction sequence. See above for what that does exactly.
The interesting bit about QNX is that the operating system itself survives this just fine. The advantages of a microkernel…
Mismatched stack sizes: a bad idea?
It is fairly apparent that the whole idea of running 16-bit code with a 32-bit stack is not fully workable. It can only function as long as the 16-bit code obeys certain restrictions; specifically, it only works when the 16-bit code makes minimal use of the stack.
It falls apart with any code which uses high-level language style stack frames, or even just parameters and/or local variables stored on the stack. The problem is that any explicit stack access will fail if the data is located above the 64K boundary. 16-bit code inevitably uses 16-bit addressing, which makes it very difficult to call into 16-bit code with a stack located above 64K; but if the stack is located entirely below 64K, what’s the point of using a 32-bit stack segment?
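The 64K problem can be sketched as follows (a simplified model with hypothetical addresses; `base` stands for the linear base of the stack segment):

```python
# Simplified model: a 16-bit [bp+disp] memory access computes its
# effective offset modulo 64K, so the access wraps around instead of
# reaching data located above the 64K boundary.

MASK16 = 0xFFFF

def ea_16bit(base, bp, disp):
    """Linear address accessed by a 16-bit [bp+disp] operand."""
    return base + ((bp + disp) & MASK16)   # offset arithmetic wraps at 64K

# BP just below 64K, the addressed data just above it:
print(hex(ea_16bit(0x100000, 0xFFF0, 0x20)))  # 0x100010, not 0x110010
```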
Using a 16-bit stack with 32-bit code is perhaps less troublesome, as long as the high word of ESP is kept zeroed.
The most problematic part of using a stack size that does not match the default operand size is that code which saves/restores the stack pointer effectively uses the “wrong” size. Even with a 32-bit stack, 16-bit code can only save SP/BP, otherwise the stack frame layout wouldn’t be what the called code expects. 32-bit code has a similar but less severe problem with a 16-bit stack, needing to save/restore full 32 bits of ESP even though only SP is implicitly used.
Intel clearly designed this feature to make it easier for mixed 16-bit/32-bit systems to share specific routines. Unfortunately this does not work in the general case, only when the called code is very carefully written. Nowadays it’s more likely that an operating system forgets to set the stack size properly and gets tripped up by the unintended consequences of such an omission.
Another question is why the LEAVE instruction is not symmetric with ENTER, or why it is not an equivalent of MOV xSP, xBP/POP xBP (as programmers are typically led to believe). It is difficult to see any sensible reason for such behavior, although this behavior has at least been properly documented.
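The asymmetry is easy to see by contrasting LEAVE with the explicit MOV SP, BP/POP BP sequence (again a simplified Python model with made-up register values): MOV SP, BP has 16-bit operand size and writes only SP, so the high word of ESP survives, while LEAVE copies all of EBP into ESP.

```python
# Simplified model of the two 16-bit epilogues on a 32-bit stack.
# Pushed/popped values are ignored; only stack pointer updates matter.

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def epilogue_mov_pop(esp, ebp):
    """MOV SP,BP / POP BP: only the low word of ESP is written."""
    esp = (esp & ~MASK16) | (ebp & MASK16)   # MOV SP, BP
    return (esp + 2) & MASK32                # POP BP on the 32-bit stack

def epilogue_leave(esp, ebp):
    """LEAVE: ESP = EBP (all 32 bits), then pop BP."""
    return (ebp + 2) & MASK32

esp, ebp = 0x00007FFA, 0xDEAD7FFA            # garbage in EBP's high word
print(hex(epilogue_mov_pop(esp, ebp)))       # 0x7ffc     -- ESP intact
print(hex(epilogue_leave(esp, ebp)))         # 0xdead7ffc -- ESP trashed
```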
It’s no wonder some have called IBM’s choice of the 8088 CPU for the original IBM PC “the most expensive decision [in] the history of the human race” (Steve Gibson in InfoWorld, Feb 19, 1990, page 30).
What is fun is to call 16-bit code in ring 0 in long mode, when on an interrupt the SS base is forced to zero.
Can you elaborate? A long mode interrupt handler points to 64-bit code which does not care about SS base.
But ESP is still the same.
And the SS base is treated as if it were zero, while it was non-zero before the interrupt occurred.
Two points… one, 16-bit code does not have to execute with a 16-bit stack (whether that’s actually workable is a different question). Two, the IST mechanism can enforce a stack switch regardless of what code was interrupted.
The first thing that comes to my mind is: why does Plan 9 have to go to real mode at all? One could think it would work in protected mode only. And second, this must have worked on some CPU, I think; if not, why didn’t they correct it?
Plan 9 goes to real mode in order to execute VESA BIOS and do a couple of other things. The need to execute BIOS code is real. Why they don’t use V86 mode or emulation is not something I can answer. Note that this is not a CPU problem, it depends on BIOS implementation.
Why they haven’t corrected this — again not something I can answer. I would hazard a guess that Plan 9 is not tested on a wide variety of platforms so problems like this can easily go unnoticed. I have no explanation at all as to why Plan 9 does not set up the real mode environment properly; that looks like a bug.
3 years later and Intel has corrected their manual: as far as I now understand, you had better make sure you use an operand size prefix for ENTER and then also for LEAVE, in order to ensure that the operand size matches the stack size. As far as I can tell this will make ENTER “symmetrical” to LEAVE.
I guess the IBM HW developers themselves got tangled up in this operand size/address size/stack size mess …
Do I remember correctly that OS/2 executes virtual device drivers with a default operand and address size of 32 bits (32-bit code segment) but with a stack size of 16 bits?
Yes, OS/2 likes to combine 32-bit code and 16-bit stacks, I think that would apply to VDDs. It’s an excellent test of any x86 emulator/virtualizer 🙂