Last week I amused myself by porting some source code from Watcom C to Microsoft C. In general that is not difficult, because Watcom C was intended to achieve a high degree of compatibility with Microsoft’s C dialect.
Yet one small-ish program kept crashing when built with Microsoft C. It didn’t seem to be doing anything suspicious and didn’t produce any noteworthy warnings when built with either compiler.
After some head scratching and debugging, I traced the difference to a piece of code like this:
if( read( hdl, buf, BUF_SIZE ) != BUF_SIZE ) {
    // Last file block read, deal with EOF
} else {
    // Not near end of file
}
To my surprise, the return value from read() is rather different between the two compilers’ run-time libraries when the file is opened with the O_TEXT flag (and therefore meant to translate line endings from CR/LF to LF when reading).
It’s hard to call it a bug because both run-time libraries behave as documented. Here’s what the Watcom documentation says:
The read function returns the number of bytes of data transmitted from the file to the buffer (this does not include any carriage-return characters that were removed during the transmission). Normally, this is the number given by the len argument. When the end of the file is encountered before the read completes, the return value will be less than the number of bytes requested.
While this is perhaps not as clear as it might be, Watcom’s read() will attempt to fill the entire buffer, regardless of how many CR characters it might need to delete.
Microsoft’s documentation says something different (quoted from Microsoft C 5.0, but it has not substantially changed):
The read function returns the number of bytes actually read, which may be less than count if there are fewer than count bytes left in the file or if the file was opened in text mode (see below).
…
If the file was opened in text mode, the return value may not correspond to the number of bytes actually read. When text mode is in effect, each carriage-return-line-feed pair (CR-LF) is replaced with a single line-feed character (LF). Only the single line-feed character is counted in the return value. The replacement does not affect the file pointer.
The discrepancy stems from the fact that for files opened in text mode, the total number of bytes stored in the file on disk will likely be higher than the total number of bytes read into application buffers, because any CR/LF sequence shrinks to just LF.
Microsoft chose to interpret the length argument to read() as the number of bytes read from disk, whereas Watcom interprets it as the number of bytes written to the application’s buffer.
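To make the two interpretations concrete, here is a minimal sketch that models them against an in-memory "file". Nothing here is taken from either run-time library: struct mem_file, translate(), ms_style_read() and watcom_style_read() are hypothetical names, and the CR handling is deliberately simplified (a CR is dropped only when an LF immediately follows it).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a file opened in text mode. */
struct mem_file {
    const char *data;   /* raw bytes as stored "on disk" */
    size_t      size;   /* total number of raw bytes */
    size_t      pos;    /* current raw file position */
};

/* Consume at most raw_max raw bytes and store at most out_max bytes,
 * dropping every CR that is immediately followed by LF.
 * Returns the number of bytes stored in buf. */
size_t translate(struct mem_file *f, char *buf, size_t raw_max, size_t out_max)
{
    size_t raw = 0, out = 0;

    assert(f != NULL && buf != NULL);
    while (raw < raw_max && out < out_max && f->pos < f->size) {
        char c = f->data[f->pos++];
        raw++;
        if (c == '\r' && f->pos < f->size && f->data[f->pos] == '\n')
            continue;               /* CR of a CR/LF pair: not stored */
        buf[out++] = c;
    }
    return out;
}

/* Model of the Microsoft interpretation: count limits the raw bytes
 * consumed, so the return value shrinks by one for every CR removed. */
size_t ms_style_read(struct mem_file *f, char *buf, size_t count)
{
    return translate(f, buf, count, count);
}

/* Model of the Watcom interpretation: count limits the bytes stored,
 * so the buffer is filled unless the end of the file is reached. */
size_t watcom_style_read(struct mem_file *f, char *buf, size_t count)
{
    return translate(f, buf, (size_t)-1, count);
}
```

With the ten raw bytes "ab\r\ncd\r\nef" and a request for 6 bytes, the first model consumes 6 raw bytes and returns 5, while the second consumes 8 raw bytes and returns 6, even though neither call is anywhere near the end of the data.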
Although both approaches make some logical sense, the approach chosen by Microsoft and other library writers (including at least Borland and IBM) has two drawbacks:
- It makes it impossible to test for end-of-file and error conditions by simply checking whether the number of bytes read equals the number of bytes requested
- It is inconsistent with the behavior of fread() for text files
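One way to sidestep the first drawback is to stop comparing the return value with the requested size and instead keep calling read() until it returns 0 or fails. The sketch below assumes a POSIX-style environment (unistd.h, ssize_t, EINTR); with the DOS compilers discussed here the declarations would come from io.h and the return type would be int. The name read_full() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Read until buf holds count bytes, the end of the file is reached,
 * or an error occurs. *eof is set to 1 only when read() returned 0.
 * Returns the number of bytes stored, or -1 on error. */
long read_full(int fd, char *buf, size_t count, int *eof)
{
    size_t total = 0;

    assert(buf != NULL && eof != NULL);
    *eof = 0;
    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n == 0) {               /* a zero return is the EOF signal */
            *eof = 1;
            break;
        }
        if (n < 0) {
            if (errno == EINTR)     /* interrupted before any data: retry */
                continue;
            return -1;
        }
        total += (size_t)n;         /* short count: just ask for the rest */
    }
    return (long)total;
}
```

A caller can then test the eof flag rather than the byte count, which works the same way whether or not the library counts removed CR characters.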
In all the run-time libraries tested, fread() on text files behaves like Watcom’s read(). That is, fread() attempts to fill the destination buffer with the number of bytes specified by the caller, regardless of how many bytes actually need to be read from disk. This is perhaps understandable given the specification of fread(), which does not take a single size argument but rather uses a product of “number of items” and “size of item” to determine the number of bytes read.
The library writers quite possibly felt that a call such as
n = fread( buf, 1, BUF_SIZE, f );
could conceivably return less than BUF_SIZE for text files, analogous to what Microsoft’s read() does, but a call like
n = fread( buf, BUF_SIZE, 1, f );
would by necessity have to return zero in such a case… and that would be pretty useless. It is much better to try to fill the specified buffer.
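The item-count point can be checked with standard C alone, because fread() returns the number of complete items read. The sketch below uses a binary temporary file that is simply shorter than the request, so it illustrates only the return-value arithmetic, not text-mode translation. The helper name fread_result() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write len bytes to a temporary file, rewind it, and return what
 * fread() reports for the given item size and item count.
 * Returns (size_t)-1 if the temporary file cannot be set up. */
size_t fread_result(const char *data, size_t len,
                    size_t item_size, size_t item_count)
{
    char   buf[64];
    FILE  *f;
    size_t n;

    assert(data != NULL);
    if (item_size * item_count > sizeof buf)
        return (size_t)-1;          /* request would not fit in buf */
    f = tmpfile();                  /* opened in binary update mode */
    if (f == NULL)
        return (size_t)-1;
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return (size_t)-1;
    }
    rewind(f);
    n = fread(buf, item_size, item_count, f);
    fclose(f);
    return n;
}
```

For a 10-byte file, fread_result("0123456789", 10, 1, 16) yields 10, whereas fread_result("0123456789", 10, 16, 1) yields 0: a single 16-byte item cannot be partially counted, so a short result in the second form carries no information about how much data actually arrived.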
Needless to say, the behavior of read() for text files is not specified by any standard, because the one standard which does specify read() behavior (that is, POSIX and its successors) knows nothing of text files.
The behavior of the Watcom runtime in this regard has not changed since at least version 8.5 (1991); likewise the Microsoft runtime hasn’t changed at least since Microsoft C 5.0 (1987). Most likely it never changed. The fact that the discrepancy exists probably indicates that very few programmers use the POSIX file I/O calls with text files at all.
There are also differences in exactly how different run-time libraries deal with CR characters in text files (Microsoft, Watcom, and Borland run-times all behave differently), but that’s perhaps something for a different blog post.
Hello Michal,
Regarding read ()’s behaviour, I think POSIX.1-2024 does have a bit to say — about short read counts:
“The value returned [by read ()] may be less than nbyte […] if the read() request was interrupted by a signal [etc. …]
“If a read() is interrupted by a signal before it reads any data, it shall return -1 with errno set to [EINTR].” (https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html)
So even on POSIX systems where programs do not need to do newline conversion, a short read count does not necessarily mean that an end-of-file is reached. I normally test for end-of-file by checking if read () returns exactly 0 (with no error).
Thank you!
I do recall using read() under DOS years ago, generally in binary mode, but occasionally in text mode.
That would have been with the Borland C compilers.
However the method used to detect EOF was not return length less than specified read length, but rather read() eventually returning 0.
Simply because that is what one has to do under UNIX in order to detect EOF irrespective of what the fd is connected to.
i.e. short reads are possible even before EOF for “slow” devices, e.g. tty’s, sockets, etc.
True. The code in question was meant to work with disk files, and I’m fairly certain any read shorter than the required size really meant either EOF or an error.
True. A good reminder that the C stdio interface is much easier to use, in addition to being more portable.