Learn Something Old Every Day, Part XIV: read() Return Value May Surprise

Last week I amused myself by porting some source code from Watcom C to Microsoft C. In general that is not difficult, because Watcom C was intended to achieve a high degree of compatibility with Microsoft’s C dialect.

Yet one small-ish program kept crashing when built with Microsoft C. It didn’t seem to be doing anything suspicious and didn’t produce any noteworthy warnings when built with either compiler.

After some head scratching and debugging, I traced the difference to a piece of code like this:

  if( read( hdl, buf, BUF_SIZE ) != BUF_SIZE ) {
      // Last file block read, deal with EOF
  } else {
      // Not near end of file
  }

To my surprise, the return value from read() is rather different between the two compilers’ run-time libraries when the file is opened with the O_TEXT flag (and line endings are therefore translated from CR/LF to LF when reading).

It’s hard to call it a bug because both run-time libraries behave as documented. Here’s what the Watcom documentation says:

The read function returns the number of bytes of data transmitted from the file to the buffer (this does not include any carriage-return characters that were removed during the transmission). Normally, this is the number given by the len argument. When the end of the file is encountered before the read completes, the return value will be less than the number of bytes requested.

While this is perhaps not as clear as it might be, it means that read() will attempt to fill the entire buffer, regardless of how many CR characters it needs to delete along the way.

Microsoft’s documentation says something different (quoted from Microsoft C 5.0, but it has not substantially changed):

The read function returns the number of bytes actually read, which may be less than count if there are fewer than count bytes left in the file or if the file was opened in text mode (see below).

If the file was opened in text mode, the return value may not correspond to the number of bytes actually read. When text mode is in effect, each carriage-return-line-feed pair (CR-LF) is replaced with a single line-feed character (LF). Only the single line-feed character is counted in the return value. The replacement does not affect the file pointer.

The discrepancy stems from the fact that for files opened in text mode, the total number of bytes stored in the file on disk will likely be higher than the total number of bytes read into application buffers, because, of course, any CR/LF sequence will shrink to just LF.

Microsoft chose to interpret the length argument of read() as the number of bytes to read from disk, while Watcom instead interprets it as the number of bytes to deliver into the application’s buffer.
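To make the difference concrete, here is a minimal test sketch (not the program from the original port; error checking omitted) that writes a CR/LF file in binary mode and reads it back in text mode with a buffer smaller than the file. The plain open/read/O_TEXT spellings are the DOS-era ones; current Microsoft compilers want _open, _read, and _O_TEXT.

    #include <stdio.h>
    #include <fcntl.h>
    #include <io.h>      /* open(), read(), close() in these run-times */

    #define BUF_SIZE 8

    int main( void )
    {
        char buf[BUF_SIZE];
        int  hdl, n;
        FILE *f;

        /* Write a 16-byte file with CR/LF line endings; binary mode
           puts the bytes on disk exactly as written. */
        f = fopen( "crlf.txt", "wb" );
        fputs( "ab\r\ncd\r\nef\r\ngh\r\n", f );
        fclose( f );

        /* Read it back in text mode, BUF_SIZE bytes at a time. */
        hdl = open( "crlf.txt", O_RDONLY | O_TEXT );
        while( (n = read( hdl, buf, BUF_SIZE )) > 0 )
            printf( "read() returned %d\n", n );
        close( hdl );
        return 0;
    }

Taking the two sets of documentation literally, the Microsoft-style run-time should print 6 twice (eight disk bytes minus the two stripped CRs per call), while the Watcom run-time should print 8 and then 4, filling the buffer until the file runs out.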

Although both approaches make some logical sense, the approach chosen by Microsoft and other library writers (including at least Borland and IBM) has two drawbacks:

  • It makes it impossible to test for end-of-file and error conditions by simply checking whether the number of bytes read equals the number of bytes requested (a portable check is sketched after this list)
  • It is inconsistent with the behavior of fread() for text files
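
With the Microsoft-style semantics, the only reliable test is the one Unix programmers use anyway: keep calling read() until it returns 0 (end of file) or -1 (error). A sketch of that loop, reusing the hdl, buf, and BUF_SIZE names from the snippet above:

    int n;

    while( (n = read( hdl, buf, BUF_SIZE )) > 0 ) {
        // Process n bytes; in text mode, n may be less
        // than BUF_SIZE well before the end of the file
    }
    if( n == 0 ) {
        // Clean end of file
    } else {
        // n == -1: read error, consult errno
    }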

In all the run-time libraries tested, fread() on text files behaves like Watcom’s read(). That is, fread() attempts to fill the destination buffer with the number of bytes specified by the caller, regardless of how many bytes actually need to be read from disk. This is perhaps understandable given the specification of fread(), which does not take a single size argument but rather uses the product of “number of items” and “size of item” to determine the number of bytes to read.

The library writers quite possibly felt that although a call such as

    n = fread( buf, 1, BUF_SIZE, f );

could conceivably return less than BUF_SIZE for text files, analogous to what Microsoft’s read() does, a call like

    n = fread( buf, BUF_SIZE, 1, f );

would by necessity have to return zero in such a case… and that would be pretty useless. It is much better to try and fill the specified buffer.
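
For completeness, the stdio equivalent is consistent across the libraries tested above, and feof()/ferror() remove the guesswork entirely; something along these lines:

    size_t n;

    while( (n = fread( buf, 1, BUF_SIZE, f )) > 0 ) {
        // Process n translated bytes; a short count here
        // means end of file or an error was encountered
    }
    if( ferror( f ) ) {
        // Read error
    } else {
        // feof( f ) is now true: normal end of file
    }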

Needless to say, the behavior of read() for text files is not specified by any standard, because the one standard which does specify read() behavior (that is, POSIX and its successors) knows nothing of text files.

The behavior of the Watcom run-time in this regard has not changed since at least version 8.5 (1991); likewise, the Microsoft run-time has not changed since at least Microsoft C 5.0 (1987). Most likely neither ever changed. The fact that the discrepancy exists probably indicates that very few programmers use the POSIX file I/O calls with text files at all.

There are also differences in exactly how different run-time libraries deal with CR characters in text files (Microsoft, Watcom, and Borland run-times all behave differently), but that’s perhaps something for a different blog post.


11 Responses to Learn Something Old Every Day, Part XIV: read() Return Value May Surprise

  1. tk says:

    Hello Michal,

    Regarding read ()’s behaviour, I think POSIX.1-2024 does have a bit to say — about short read counts:

    “The value returned [by read ()] may be less than nbyte […] if the read() request was interrupted by a signal [etc. …]
    “If a read() is interrupted by a signal before it reads any data, it shall return -1 with errno set to [EINTR].” (https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html)

    So even on POSIX systems where programs do not need to do newline conversion, a short read count does not necessarily mean that an end-of-file is reached. I normally test for end-of-file by checking if read () returns exactly 0 (with no error).
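
    Something like this little wrapper (hypothetical, not from the standard) captures both rules, retrying on EINTR and treating 0 as end-of-file:

        #include <errno.h>
        #include <unistd.h>

        /* Retry read() when it is interrupted before transferring any
           data; a return value of 0 still means end-of-file. */
        ssize_t read_retry(int fd, void *buf, size_t nbyte)
        {
            ssize_t n;
            do {
                n = read(fd, buf, nbyte);
            } while (n == -1 && errno == EINTR);
            return n;
        }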

    Thank you!

  2. Derek says:

    I do recall using read() under DOS years ago, generally in binary mode, but occasionally in text mode.

    That would have been with the Borland C compilers.

    However, the method used to detect EOF was not a returned length less than the specified read length, but rather read() eventually returning 0.

    Simply because that is what one has to do under UNIX in order to detect EOF irrespective of what the fd is connected to.

    i.e. short reads are possible even before EOF for “slow” devices, e.g. tty’s, sockets, etc.

  3. Michal Necasek says:

    True. The code in question was meant to work with disk files, and I’m fairly certain any read shorter than the required size really meant either EOF or an error.

  4. Michal Necasek says:

    True. A good reminder that the C stdio interface is much easier to use, in addition to being more portable.

  5. zeurkous says:

    @Necasek: Sorry, but the superficial convenience of stdio(3) fails to
    charm me. There are too many edge cases, too many obfuscations of
    what’s really going on. On modern systems, the buffering largely just
    gets in the way, and then there are “physically impossible” calls like
    ungetc(3), implemented through hackery. Ugh.

    With read(2), the return value (combined with errno if it’s -1) gives
    all the required information needed to proceed. No feof(3), no
    ferror(3) — just the return value (and errno).

    Me understands that a “text mode” kludge, however unfortunate, is
    needed here (bad design decisions…), but is the ensuing confusion
    about ‘len’ (which meknows as ‘nbytes’) really the fault of the read(2)
    interface? Medoesn’t think so.

  6. Michal Necasek says:

    Which was the bad design decision? Using LF for line endings? Using CR/LF for line endings? Something else?

  7. zeurkous says:

    That’s actually a rather good question. Me’ll try to provide a
    somewhat satisfying answer.

    There are several layers of design mistakes involved here. Zeroth,
    there’s the fact that ASCII does not have a separate newline character
    (though something like RS could be used). This made sense at the time,
    it reduced complexity; just like it made sense for some operating
    systems to settle on either LF xor CR for a newline, in lieu of a
    dedicated character. But in hindsight, they’re still mistakes. In fact,
    it’s an instance of the classic pattern of compounding one design
    mistake with another.

    Me supposes that, at the time, few realized how important line-based
    (as opposed to traditional record-based) processing would become; lines
    have virtually replaced old-school records, at least in lower-level
    applications. If they had, the issue would’ve probably been approached
    with a bit more care at the time.

    A *really* painful mistake, though, is to try to compensate for all that
    in something low-level like the read(2) routine. Sure, it makes porting
    of UNIX programs to mess-dos superficially easier, but as you found out,
    it confuses the interface. Death to O_TEXT! 🙂

  8. zeurkous says:

    Whoa, when me posted the above comment, WordPress returned a comment
    page where me new comment was #5 — me orig comment and your response
    were missing. You might want to have another look at the caching
    there… (Refreshing the page, of course, fixed it.)

  9. MiaM says:

    @zeurkous:
    The WordPress caching or whatnot seems to suck most of the time. Pro tip: The RSS feed seems to be updated even when the actual web page has an old cached copy.

    In general, re CR/LF and whatnot, I have a strong suspicion that this has crept into what I think were supposed to be binary or at least application-specific file formats.

    In particular, you can copy a Firefox profile directory from a Linux computer and use it on a Windows computer. Web sites you visit using that profile even still think you are using Linux, as the profile seems to contain the browser ID and whatnot. But trying to do it the other way around, copying a Firefox profile from Windows to Linux, seems to just not work.

    This is just strong speculation, but I think that the file I/O for the Firefox profile uses “cooked” files, i.e. relies on some part of the compiler (or OS library/API) to use CR/LF on Windows and only LF on Linux, and the code on Windows accepts a lone LF but the code on Linux just gets angry if it sees CR+LF.

    In hindsight I think it would have been great if ASCII, the actual committee, had at some point in the late 70’s put its foot down and told everyone that, except for output to mechanical devices like printers, one of CR or LF would be deprecated and any system/OS using the deprecated code would not be eligible for government contracts in the US. I.e., say that any computer put on the market starting in 1980, that isn’t software-wise clearly based on an earlier product from the same company, would have to use whichever of CR or LF ASCII decided on. It would have taken some time for this to propagate, but in particular all the 16-bit and 32-bit systems would have had the same standard, i.e. the IBM PC, Mac and for that sake Amiga, Atari ST and whatnot. Also, all the home computers that weren’t already on the market, or based on older systems, would have been compatible. (I.e. Commodore would have gotten away with continuing to use CR on all their 8-bit computers even if LF had been selected by ASCII, as the original PET used CR and the APIs of all later 8-bit computers from Commodore were add-ons to the API on the PET.)

  10. Michal Necasek says:

    I am sorry, but if some standard is designed to solve a specific problem (such as ANSI and teletypes/terminals), how is solving that problem a “design mistake”? ANSI had nothing to do with files and almost nothing to do with computers.

  11. zeurkous says:

    Me already admitted that caveat: likely, at the time, no-one realized
    that–

    a) ASCII’s main use would be in computing, not in oldskool
    telecomms;
    b) line-by-line processing would become more important than
    record-by-record processing, at least in lower-level applications;
    and, thus:
    c) that a separate, dedicated newline character would solve a lot of
    future problems.

    As MiaM pointed out, though, the issue could’ve been addressed
    relatively early on, and it wasn’t. Another way of doing so could’ve
    been to issue another revision of the standard. (In the end, we did get
    a ‘NEL’ character in C1, but that just opened another can of worms as
    implementing C1 turns ASCII from 7-bit into 8-bit.)
