Last weekend I had the pleasure of debugging a curious case of older PCI configuration code (circa 2005) failing on newer (post-2010) hardware. The code was well tested on many 1990s and 2000s PCs (and some non-PCs) with PCI/AGP/PCIe and never showed problems. But at least on some newer machines (in my case a Sandy Bridge laptop) it just hung.
I quickly found out that code meant to walk over a PCI capability list got stuck in a loop. But the reason why it got stuck was unexpected. For reasons lost to the mists of time, the code reads the next capability pointer and the capability type (total of 16 bits of information) as two bytes, not one word. That was not the problem though. The problem was that the code always read a full DWORD from the config space and discarded the top bits. On the problematic system, that worked as long as the read was DWORD-aligned, but not otherwise. The code logic may seem unusual but it worked perfectly well on so many older systems, so why not anymore? And who violated the PCI specification, the code or (unlikely) the hardware?
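To make the pattern concrete, here is a minimal sketch of such a capability walk. This is not the original code; pci_cfg_read8() is a hypothetical helper (a possible implementation is sketched further down), and in the code in question each of these byte reads was in reality a full DWORD read with the unwanted bits thrown away.

    unsigned char find_capability(int bus, int dev, int fn, unsigned char wanted_id)
    {
        /* Offset 34h holds the capabilities pointer; its low 2 bits are reserved. */
        unsigned char cap = pci_cfg_read8(bus, dev, fn, 0x34) & 0xFC;

        while (cap != 0) {
            unsigned char id   = pci_cfg_read8(bus, dev, fn, cap);     /* capability ID    */
            unsigned char next = pci_cfg_read8(bus, dev, fn, cap + 1); /* next cap pointer */

            if (id == wanted_id)
                return cap;        /* found it; return its config space offset */
            cap = next & 0xFC;     /* mask the low bits to stay DWORD aligned  */
        }
        return 0;                  /* no such capability */
    }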
Let’s start with the second question. Is such a misaligned access something that should work or not? A very strong case can be made on both sides: On the one hand, the code works perfectly fine on all PCs made prior to 2010 or so, and such fundamental behavior should not change. On the other hand, Intel presumably knows how PCI should work and wouldn’t get that wrong.
When one looks closer, a very unusual thing happens—the picture becomes fuzzier instead of clearer. It turns out that the PCI specification does not actually explicitly define this behavior.
Let’s quickly summarize how the PCI configuration works. So-called configuration mechanism #1 (there used to be mechanism #2, too) uses two I/O ports in the host CPU’s I/O space: CF8h aka CONFIG_ADDRESS, and CFCh aka CONFIG_DATA. The address port has some interesting behaviors but that’s irrelevant here; the key takeaway is that the address port selects a 32-bit DWORD in the PCI configuration space; it is not possible to address bytes/words. However, it is possible to read/write bytes or words using byte enables (that is general PCI behavior). And that’s where the seams between PCI and x86 CPUs start showing.
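As a rough sketch, assuming Watcom-style outpd()/inpd() port helpers like the ones used by the code in question, a mechanism #1 DWORD read looks something like this:

    #define PCI_CONFIG_ADDRESS 0xCF8
    #define PCI_CONFIG_DATA    0xCFC

    unsigned long pci_cfg_read32(int bus, int dev, int fn, int reg)
    {
        unsigned long addr = 0x80000000UL                /* enable bit                    */
                           | ((unsigned long)bus << 16)  /* bus number                    */
                           | ((unsigned long)dev << 11)  /* device number                 */
                           | ((unsigned long)fn  << 8)   /* function number               */
                           | (reg & 0xFC);               /* DWORD-aligned register offset */

        outpd(PCI_CONFIG_ADDRESS, addr);                 /* select the DWORD              */
        return inpd(PCI_CONFIG_DATA);                    /* read all 32 bits of it        */
    }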
From the CPU’s point of view, the CONFIG_DATA port actually responds not only at the CFCh address but also at CFDh, CFEh, and CFFh. A byte read from CFEh, for example, translates into a DWORD-sized PCI config space access with only the third byte enabled. It works as everyone expects: the PCI bus sees a 32-bit access, but the CPU (and the device) only deals with a single byte.
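In code, a byte-granular read might look like the sketch below (same assumed helpers; inp() reads a single byte). The address register still selects a whole DWORD; the low two bits of the offset merely determine which byte enable is asserted on the bus:

    unsigned char pci_cfg_read8(int bus, int dev, int fn, int reg)
    {
        outpd(PCI_CONFIG_ADDRESS, 0x80000000UL
                                | ((unsigned long)bus << 16)
                                | ((unsigned long)dev << 11)
                                | ((unsigned long)fn  << 8)
                                | (reg & 0xFC));
        /* A byte read from CFCh+1/2/3 becomes a DWORD config cycle with only the
           corresponding byte enable asserted. */
        return (unsigned char)inp(PCI_CONFIG_DATA + (reg & 3));
    }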
Now what happens if you read not a byte from CFEh but a DWORD? Well… in the usual case, some logic in the host bridge/chipset will probably break it down to two word accesses, one at CFEh and the other at D00h. The data (when reading) will be glued back together and the CPU gets a DWORD back. Everyone is happy.
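Conceptually, the split looks something like this (it happens in the chipset/host bridge, not in software; the inpw() helper name is just as assumed as the others):

    /* What a DWORD read from CFEh turns into on an older platform: */
    unsigned short lo = inpw(0xCFE);                        /* word within CONFIG_DATA    */
    unsigned short hi = inpw(0xD00);                        /* word just past CONFIG_DATA */
    unsigned long  value = lo | ((unsigned long)hi << 16);  /* glued back together        */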
Except when you have a modern CPU, there is not necessarily any chipset involved. There are PCI devices in the CPU itself and, well, these kinds of accesses now behave differently. In the case at hand, a DWORD read from within CONFIG_DATA that’s not DWORD-aligned will simply return FFFFFFFFh (computer speak for “there ain’t nothing here, dude”).
My conclusion is twofold. The code was always kind of wrong, because it probably read from I/O port D00h, at least in some cases. But that was almost certainly entirely harmless and the code worked as expected.
The other conclusion is not so much a conclusion as just wondering if Intel even realized that they were changing the existing semantics there. Perhaps they didn’t.
OMG, that’s so messy it reminds me of the time when I [graphic description omitted for sanity]. In these cases I really do find it rather difficult to say where one error ends and the next one starts: bloated i86, messy PCI (as if ISA wasn’t already bad enough in that regard!), organizations promulgating standards only to cheerfully (“it’s the modern world, dude!”) violate them, and in the process sabotaging the efforts of a hapless programmer who had to make the best of a very bad situation in the first place.
Sometimes I really do wonder if I’m awake or if this is all just a surreal dream. How the fsck do they get away with this? The mind boggles…
“On the other hand, Intel presumably knows how PCI should work and wouldn’t get that wrong.” That’s a bad assumption. Intel has broken the PCI bus numerous times.
More than a decade ago, when XScale was being pushed, we made the mistake of choosing it for a design that would have used the PCI bus (not found on most ARM SoCs of the time) to connect a UDMA ATA controller. We reviewed the specs and errata beforehand and everything looked OK. A couple of prototypes were assembled, then we tried to talk to a disk, and got nothing…
After several weeks of intense research and trying every possible avenue, we had to give up and cancel the whole thing. Why? Well, these XScale chips were marketed as network processors with a couple of NPEs included, and the PCI bus was meant for hooking up 802.11 hardware (which did work just fine). Of course, the network hardware does pretty simple I/O, mostly full DWORD accesses. So Intel cut some corners, one of which was leaving out the logic needed to mask a read down to less than 32 bits. It was possible to do BYTE and WORD sized writes, but reads were always DWORD sized no matter what. I imagine the decision came from some EE who figured this was a good way to save $0.0000000001 by leaving out fewer than a dozen gates, and the software could just work around it by ignoring the unneeded bits. That sounds reasonable, except…
Traditional IDE (pre-AHCI) controllers are rather complicated and carry considerable legacy baggage. Probing the IDE bus for a disk, for example, requires WORD sized reads of various addresses in a particular order, with each read causing the state machine to advance. The reads are of course not in linear order, so doing a DWORD read throws the disk or controller into a bad state. It didn’t matter which controller brand/model was used, or whether it was PATA or SATA; the ATA protocol absolutely requires performing smaller-than-DWORD reads on the PCI bus, which was impossible given Intel’s boneheaded decision.
It turned out that this problem had been known, but rather than document it in the errata, Intel buried it in a document titled “Technical Specification Clarification.” To be clear: they shat on the PCI spec and then called that a spec clarification. They did offer a work-around: burn a few GPIOs to drive a decoder+gate that would correctly mask PCI reads by selectively cutting the four byte enable lines (which were hard-wired to be asserted during read cycles on the IXP4xx). That work-around not only had a much higher parts cost than including the required gates in the processor would have had, it also meant extra CPU cycles spent manipulating the read mask before every read operation, which of course means you can forget about any UDMA/MWDMA; all access would be PIO with extra bonus wait cycles. The only possibility for performant storage was therefore SCSI disks, and that blew the budget. Thus, project death.
DO NOT count on Intel to follow ANY specification EVER, including those they write.
re missing address masks on xscale:
Didn’t a similar issue exist with some early Alpha boxes, and the answer was simply to shift the address lines?
i.e. rather than have the IDE connected to A0-A4, connect it to A1-A5.
That means the registers end up split somewhat, but a 32 bit read becomes a 16 bit read.
To be honest, I don’t see Intel at fault here. For PCI config space, Intel specifies that with the CONFIG_ADDRESS register you specify a “quadruple” of bus/dev/func/reg that addresses an offset “reg” in the PCI config space of the device addressed by bus/dev/func. Intel makes it clear that “reg” not only addresses a DWORD but that that offset also has to be DWORD aligned. It furthermore states that both the CONFIG_ADDRESS and CONFIG_DATA registers are 32-bit registers addressed in I/O space, which also means that these registers have to be aligned to 4 bytes and read/written as such.
Taking it all together and looking at Wikipedia and various pieces of source code, it’s clear that only DWORD aligned DWORDS can be read from or written to.
Additionally, the PCI spec clearly states that all capability descriptor blocks have to be aligned on a DWORD boundary.
I can however understand the other poster’s frustration about the limitation in I/O mapped register access over PCI, because that problem could not have been expected.
I find it more problematic that there are subtle differences between AMD and Intel for various instructions:
Take “sidt” and “sgdt” as an example: the Intel spec goes through a lengthy explanation of why you only get a 24-bit base address (with the upper 8 bits zeroed) when the operand size is 16 bits. AMD on the other hand will always return 32 bits for the base address, regardless of operand size. Fortunately, AMD will also ignore an operand size prefix prepended to “sgdt”/“sidt”.
For a true 32-bit OS this difference will not matter, but for a mixed segment size OS like OS/2 it can become a problem: a 16-bit device driver uses “sgdt” from a 16-bit code segment to query the GDT base address, but of course it needs all 32 bits of the base address and not only 24 bits of it! This forces you to prepend the “sgdt” instruction with the 066h operand size prefix, which you have to do “manually”, as no assembler will emit that prefix for you in 16-bit code.
I seem to remember that the OS/2 designers were smart enough to place GDTs and IDTs (can be different addresses for each core) below the 16 MB address boundary …
Likewise there are subtle differences of usability of SYSCALL/SYSRET and SYSENTER/SYSEXIT from 32-bit code and 64-bit code respectively.
In short: you will need to find the “largest common denominator” between AMD and Intel if you want a common implementation for both without implementing things twice.
You’re arguing from some Intel specifications (which ones exactly?), but the fact is that prior to about 2008, such non-aligned accesses worked on every PC platform. It’s pretty obvious that Intel changed things, and almost certainly without realizing they were changing things.
You say: “It furthermore states that both the CONFIG_ADDRESS and CONFIG_DATA registers are 32-bit registers addressed in I/O space, which also means that these registers have to be aligned to 4 bytes and read/written as such”, but that does not follow. There is absolutely no architectural requirement for I/O space accesses to be aligned in any way, for example.
Some time ago I ran into a related problem with a USB OHCI host driver of an unnamed OS. The OHCI specification clearly says that only DWORD-sized, DWORD-aligned accesses are allowed to the OHCI registers. The driver writer inadvertently violated that and accessed random bytes. That broke on an emulated OHCI implementation but worked fine on real hardware. What do you think happened, did the emulation vendor argue that “the specification doesn’t allow that”, or was the emulation fixed to match real hardware behavior?
Re SGDT/SIDT. Thanks for the good laugh, because I dealt with this problem a few months ago. News flash: SGDT/SIDT behaves the same on Intel and AMD processors, but only one of the vendors documents the actual behavior. You need to spend less time trusting Intel manuals and more time examining what Intel products actually do. The Intel SGDT/SIDT documentation is, to put it mildly, very interesting fiction. See for example https://pcem-emulator.co.uk/phpBB3/viewtopic.php?f=2&t=159 for a public discussion — note that Win32s simply does not work on a CPU where SGDT only stores 24 bits of the 32-bit address when 16-bit operand size is in effect. Are you going to return your broken Intel CPU now? 🙂
1) I looked into the PCI Local Bus Specification 2.2 from 1998. Everything I said about alignment in PCI config space came from there.
2) The Intel processor manual has this to say about I/O accesses (Volume 1, chapter 18):
… any two consecutive 8-bit ports can be treated as a 16-bit port, and any four consecutive ports can be a 32-bit port. …. Like words in memory, 16-bit ports should be aligned to even addresses … so that all 16 bits can be transferred in a single bus cycle. Likewise, 32-bit ports should be aligned to addresses that are a multiple of four. ….
The exact order of bus cycles used to access unaligned ports is undefined and is not guaranteed to remain the same in future IA-32 processors. If hardware or software requires that I/O ports be written to in a particular order, that order must be specified explicitly. For example, to load a word-length I/O port at address 2H and then another word port at 4H, two word-length writes must be used, rather than a single doubleword write at 2H.
If I were a HW designer I’d make damn sure that my 16-bit ports align to even addresses and my 32-bit ports align to 4-byte multiple addresses in order to avoid “the undefined”. If I were a SW developer and had no better doc otherwise I would ASSUME that it is that way.
3) “sgdt”/“sidt”: oops, I look like an idiot now. But you forgot to mention the rest of the discussion at the link you gave: it is “lgdt”/“lidt” that seemingly suffer from this problem.
Net effect for me: in order to be ABSOLUTELY sure that I work with a 32-bit table base address, I’d prepend an operand size prefix to ALL of these instructions when using them from 16-bit code. The worst that can happen is that the prefix is ignored.
4) OHCI for the “unnamed OS”: I fixed this error about 6 years ago in the USB HC drivers for that “unnamed OS”. I wish the virtualization platform provider had not “fixed” that problem in the emulation, so that it would have become obvious 🙂
Question: How can you say for sure that it worked just fine in this broken way on ALL HW? Fact is, you don’t know.
By the way: is the PCI config space problem located in a graphics driver that also ran under that same “unnamed OS”?
In short: Call it “defensive programming”. I have been working 20+ years doing SW development for embedded HW where you really don’t want to have SW bugs …
As to specs: Maybe we all have to get used to dealing with alternate facts these days …
I’ll check the 2.1 PCI spec again. But note that the post was not exactly about PCI bus accesses, and it was not about unaligned access either, but rather about smaller accesses within a DWORD.
The thing with I/O alignment (again, not the problem at hand) is that like with memory, it’s optional. Except “funny things” can happen, especially using MMIO. I won’t argue that using unaligned accesses is a good idea, because it’s not, or that Intel CPUs behave uniformly, because they don’t.
Re broken OHCI — the problem is that customers don’t care if it works on all hardware or only 98% of it. Their argument is very simple, if it works on their hardware but not in their VM, the VM is broken, end of discussion.
LGDT/LIDT — there is no problem there because the CPUs (as with SGDT/SIDT) behave the same and Intel and AMD documentation is also in agreement (operand size does matter). I’m working on a longer post on this topic, it’s interesting enough.
You did say that the problem with PCI config space was that a read was done not only from 0xCFC/0xCFD/0xCFE/0xCFF but also from 0xD00 and beyond, because the code was reading a 32-bit word not from 0xCFC but from 0xCFD, 0xCFE, or 0xCFF. That’s what I understood from the end of the article. And that’s certainly something that should be avoided.
For me the strange thing is that the code got into that odd state at all. The initial capability pointer and all capability blocks are DWORD aligned (but you have to be sure to mask out the lower 2 bits of the NEXT_CAPABILITY pointer, as required by the PCI spec!).
So how in the world could you run into that problem of unaligned access crossing a DWORD boundary in the first place by just following the capability chain ?
The NEXT_CAPABILITY pointer field is not DWORD aligned but off by 1 byte; but then, it’s only 1 byte, and even if you wanted to do something unorthodox it would have been sufficient to read a single byte from 0xCFD.
Ugh, sorry, you’re right, I got confused. The code in question calculates the address (“iobase = 0xCFC + (index & 3);”) and then reads a DWORD for byte/word/dword access (“value = inpd(iobase);”). And for reasons that are not completely clear to me, this behaves differently when the access goes through the chipset vs. going to an on-CPU PCI device.
Either for the off-CPU access something breaks down the I/O into aligned accesses or the chipset just behaves differently than the CPU with regard to unaligned reads.
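For comparison, the safer variant hinted at above, a byte-sized read from within CONFIG_DATA rather than a DWORD read from a misaligned port, would look something like this (a sketch, using the same assumed inp() helper):

    iobase = 0xCFC + (index & 3);
    value  = inp(iobase);   /* byte read: stays inside CONFIG_DATA, only the
                               matching byte enable is asserted */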
For their processors, Intel leaves the door open by saying
The exact order of bus cycles used to access unaligned ports is undefined and is not guaranteed to remain the same in future IA-32 processors.
So it’s the typical “maybe it works, maybe not, it could work today but maybe it won’t work tomorrow” answer to have all engineering freedom to change things later on.
As an engineer who has to use this stuff I would never rely on that “blabla” statement.
Nonetheless the article was very interesting and gives a lesson in why engineers are fairly well paid (because no one else understands how it really works…)
I am looking forward to your “why sgdt/sidt and lgdt/lidt are not symmetrical” article 🙂
The ordering shouldn’t matter in this case, it’s all just reading constant data.
I think the basic problem is that cross-referencing hundreds of thousands of lines of code against thousands of pages of documentation is not really feasible, especially when the documentation changes (and it does).
Working on the LGDT/SGDT etc. article 🙂