Projects relating to cpplib

As of 11 September 2000, cpplib is largely complete. It has had six months of testing as the only preprocessor used by development GCC, and I'm pretty happy with its stability at this point.

cpplib is now linked into the C, C++ and Objective-C front ends; this will be the case for GCC 3.0 as well.

How to help test

No special testing effort is really necessary: the integrated preprocessor is now the build default, so every build of development GCC exercises it. If you do test deliberately, be prepared for odd glitches - see below for the list of known problems.

The best thing to test with the integrated preprocessor is large packages that (ab)use the preprocessor heavily. The compiler itself is pretty good at that, but doesn't cover all the bases. If you've got cycles to burn, please pick one or more such packages and try building them.

A bug report saying 'package FOO won't compile on system BAR' is useless. We need short test cases with no system dependencies. Aim for fewer than fifty lines and no #includes at all. I recognize this won't always be possible.

Also, please file off everything that would cause us legal trouble if we were to roll your test case into the distributed test suite. Short test cases will almost always fall under fair-use guidelines, so don't sweat it too much. An example of a problem: a test case that includes a 200-line comment detailing the inner workings of your program. (A 200-line comment might be what you need to provoke the bug, but its contents are unlikely to matter - try running it through "tr A-Za-z x".)
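
To give the idea, here is the shape of a good report - a hypothetical, self-contained test case for a macro-expansion bug, with the expected result stated up front:

    /* Hypothetical minimal test case: stringizing through two levels
       of macro expansion.  Self-contained - no #includes, no system
       dependencies.  */
    #define str(x) #x
    #define xstr(x) str(x)
    #define foo bar
    const char *s = xstr(foo);   /* should expand to "bar" */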

As usual, report bugs to gcc-bugs@gcc.gnu.org. But please read the rest of this document first!

Bug reports for code that must be compiled with gcc -traditional are of interest, but much lower priority than bugs in standard-conforming C or C++. Traditional mode is implemented by a separate program, not by cpplib. Oh, and the lack of support for varargs macros in traditional mode is a deliberate feature.

Work recently completed

  1. We decided to treat the sequence backslash, whitespace, newline as a line continuation, with a warning (illustrated after this list). It is almost always an editing mistake, and rejecting it causes floods of errors. The new behaviour works everywhere - within comments, between tokens, and within tokens.
  2. The lexical analyzer has been rehashed again. It is quite clean now, still single-pass, and steps backwards in the input stream in only one place - when handling trigraphs and escaped newlines. That too should be fixable if necessary, and it means multibyte character and UCN escape support within cpplib should now be fairly straightforward.
  3. -traditional and -save-temps now work with the integrated preprocessor.
  4. The macro expander has been rewritten; it fixes all known bugs, including one present in previous versions of GCC.
  5. A macro defined to be exactly itself now bypasses the macro expander entirely (example below).
  6. C99's _Pragma operator has been implemented (example below).
  7. Integrated CPP is now the build default. Use configure --disable-c-cpplib to force use of the external preprocessor.
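
Some of these changes are easier to see than to describe. The fragment below is an illustrative sketch of items 1, 5 and 6; DO_PRAGMA is the usual idiom for generating a #pragma from a macro:

    /* Item 1: backslash, whitespace, newline is now a line
       continuation, with a warning.  There is a stray space after the
       backslash below; rejecting this common editing mistake outright
       would flood the user with errors.  */
    #define PI \ 
      3.14159

    /* Item 5: a macro defined to be exactly itself never enters the
       macro expander; this "errno" simply stands for itself.  */
    #define errno errno

    /* Item 6: the C99 _Pragma operator lets a #pragma be produced by
       macro expansion, which the #pragma directive itself cannot be.  */
    #define DO_PRAGMA(x) _Pragma (#x)
    DO_PRAGMA (GCC dependency "parse.y")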

Known Bugs

  1. Character sets that are not strict supersets of ASCII may cause cpplib to mangle the input file, even in comments or strings. Unfortunately, that includes important character sets such as Shift JIS and UCS-2. (Please see the discussion of character set issues, below.)
  2. Massively parallel builds may cause problems if your system has a global limit on the number of files mapped into memory. I am not aware of any system with this problem; it is purely theoretical.
  3. It's reported that on some targets that define their own #pragmas, the Fortran and Java compilers fail to link: the target-specific code expects a routine called c_lex, which does not exist in those compilers. Possibly affected targets are the c4x, i370, i960, and v850.

Missing User-visible Features

  1. Character sets that are strict supersets of ASCII are safe to use, but extended characters cannot appear in identifiers. This has to be coordinated with the C and C++ front ends. See character set issues, below.
  2. C99 universal character escapes (\uxxxx, \Uxxxxxxxx) are not recognized except in string or character constants, and will be misinterpreted in character constants appearing in #if directives. Again, proper support has to be coordinated with the compiler proper.
  3. Precompiled headers are commonly requested; this entails the ability for cpp to dump out and reload all its internal state. You can get some of this with the debug switches, but not all, and not in a reloadable format. The front end must cooperate also.
  4. The dependency generator is lacking in several areas. Tom Tromey has a proposal for improving it; the added features include the ability to control the name of the output file and the target of the generated rule, and the ability to add dummy rules that prevent lossage when a header is deleted. I would also like to see a mode in which GCC suppresses system headers from the dependency list based on where they are found, rather than on what sort of quotation marks were used to include them (which is what -MM currently does).

Internal work that needs doing

  1. The lexical analyzer and macro expander need to be profiled and tuned.
  2. We allocate lots of itty bitty items with malloc. Some work has been done on aggregating these into big blocks using obstacks (a sketch follows this list), but we could do even more. Again, this can be a performance issue.
  3. VMS support has suffered extreme bit rot. There may be problems with support for DOS, Windows, MVS, and other non-Unixy platforms. No one has complained, though.
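
As a sketch of the obstack approach in item 2 (this uses the glibc obstack interface; the specifics are illustrative, not cpplib's actual allocation code):

    #include <obstack.h>
    #include <stdlib.h>

    /* obstack.h requires the client to supply its chunk allocator.  */
    #define obstack_chunk_alloc malloc
    #define obstack_chunk_free  free

    void
    example (void)
    {
      struct obstack ob;
      obstack_init (&ob);

      /* A thousand itty bitty allocations, but only a handful of
         calls to malloc: each obstack_alloc carves its result out of
         the current large chunk.  */
      for (int i = 0; i < 1000; i++)
        {
          char *p = (char *) obstack_alloc (&ob, 8);
          p[0] = 'x';
        }

      obstack_free (&ob, NULL);   /* release everything at once */
    }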

Integrating cpplib with the C and C++ front ends

This is mostly done.

  1. Front ends need to use cpplib's line and column numbering interface directly. The existing code copies cpplib's internal state into the state used by diagnostic.c, which is better than writing out and processing linemarker commands, but still suboptimal.
  2. The identifier hash tables used by cpplib and the front end should be unified. In breadboard tests, this can net up to a 10% speedup, mainly because the hash table the front ends use now (see tree.c) is no good.
  3. If Yacc did not insist on assigning its own values for token codes, there would be no need for a translation layer between the codes returned by cpplib and the codes used by the parser (a sketch follows this list). Noises have been made about a recursive-descent parser that could handle all of C, C++, and Objective-C; if that ever happens, it should use cpplib's token codes.
  4. The work currently done by c-lex.c converting constants of various stripes to their internal representations might be better off done in cpplib. I can make a case either way.
  5. If the integrated preprocessor is used, -g3 does not add information about macro definitions to the debugging output. This is minor; -g3 only works with the obsolete DWARF version 1, and no one seems to mind.
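
As a sketch of the translation layer in item 3 - the token names resemble cpplib's, but the table and function here are invented for illustration:

    /* Yacc assigns its own token numbers, so every token cpplib
       returns must be remapped before the parser sees it.  */
    enum cpp_ttype { CPP_NAME, CPP_NUMBER, CPP_PLUS /* , ... */ };

    static const int yacc_code[] = {
      [CPP_NAME]   = 258,    /* IDENTIFIER, a number Yacc chose */
      [CPP_NUMBER] = 259,    /* CONSTANT */
      [CPP_PLUS]   = '+',
    };

    static int
    translate_token (enum cpp_ttype t)
    {
      return yacc_code[t];   /* pure per-token overhead */
    }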

Optimizations

  1. At the moment, we cache file buffers in memory as they appear on disk. It might be worthwhile to do lexical analysis over the entire file and cache it like that, before directive processing and macro expansion. This would save a good deal of work for files that are included more than once. However, it would be less efficient for files included only once due to increased memory requirements; how do we tell the difference?
  2. A complement to the usual one-huge-file scheme of precompiled headers would be to cache files on disk after lexical analysis. You could run a cruncher over /usr/include and save the results in a .jar file or similar, bypassing filesystem overhead as well as the work of lexical analysis.
  3. Wrapper headers - files containing only an #include of another file - should be optimized out on reinclusion. (Just tweak the include-file table entry of the wrapper to point to the file it reads.)
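
For example (file names invented), a wrapper header is nothing but:

    /* wrapper.h - exists only to pull in the real header */
    #include <real.h>

Once wrapper.h has been read the first time, its table entry could point straight at real.h, so reincluding it would cost no file access or lexing at all.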

Character set issues

Proper non-ASCII character handling is a hard problem. Users want to be able to write comments and strings in their native language. They want the strings to come out in their native language, not gibberish, after translation to object code. Some users also want to use their own alphabet for identifiers in their code. There is no one-to-one or many-to-one map between languages and character set encodings. The subset of ASCII included in most modern-day character sets does not include all the punctuation C uses; some of the missing punctuation may be present, but at a different place than it occupies in ASCII. The subset described in ISO 646 may not be the smallest such subset out there.

At the present time, GCC supports the use of any encoding for source code, as long as it is a strict superset of 7-bit ASCII. By this I mean that all printable (including whitespace) ASCII characters, when they appear as single bytes in a file, stand only for themselves, no matter what the context is. This is true of ISO8859.x, KOI8-R, and UTF8. It is not true of Shift JIS and some other popular Asian character sets. If they are used, GCC may silently mangle the input file. The only known specific example is that a Shift JIS multibyte character ending with 0x5C will be mistaken for a line continuation if it occurs at the end of a line. 0x5C is "\" in ASCII.
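
For illustration, assume a source file encoded in Shift JIS that contains the character 表, whose encoding is the byte pair 0x95 0x5C:

    int price = 100;   // price table: 表
    int count = 2;

A byte-oriented scanner takes the trailing 0x5C as a backslash at the end of the line and splices the two lines together, so "int count = 2;" is silently absorbed into the comment and never compiled.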

Assuming a safe encoding, characters not in the base set listed in the standard (C99 5.2.1) are syntax errors if they appear outside strings, character constants, or comments. In strings and character constants, they are taken literally - converted blindly to numeric codes, or copied to the assembly output verbatim, depending on the context. If you use the C99 \u and \U escapes, you get UTF8, no exceptions. These too are only supported in string and character constants.
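
For example, under the current scheme:

    const char *e_acute = "\u00E9";   /* always the UTF8 bytes 0xC3 0xA9
                                         for U+00E9, even if the execution
                                         character set is Latin-1, where
                                         e-acute is the single byte 0xE9 */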

We intend to improve this as follows:

  1. cpplib will be reworked so that it can handle any character set in wide use, whether or not it is a strict superset of 7-bit ASCII. This means that cpplib will never confuse non-ASCII characters with C punctuators, comment delimiters, or whatever.
  2. Any character will naturally be permitted to appear in comments.
  3. All Unicode code points which are permitted by C99 Annex D to appear in identifiers, will be accepted in identifiers. All source-file characters which, when translated to Unicode, correspond to permitted code points, will also be accepted. In assembly output, identifiers will be encoded in UTF8, and then reencoded using some mangling scheme if the assembler cannot handle UTF8 identifiers. (Does the new C++ ABI have anything to say about this? What does the Java compiler do?)
    Unicode U+0024 will be permitted in identifiers if and only if $ is permitted.
  4. In strings and character constants, GCC will translate from the character set of the file (selectable on a per-file basis) to the current execution character set (chosen once per compilation), which may or may not be Unicode. UCN escapes will also be converted from Unicode to the execution character set; this happens independently of the source character set.
  5. Each file referenced by the compiler may state its own character set with a #pragma, or rely on the default established by the user with locale or a command line option. The #pragma, if used, must be the first line in the file. This will not prevent the multiple include optimization from working. GCC will also recognize MULE (Multilingual Emacs) magic comments, byte order marks, and any other reasonable in-band method of specifying a file's character set.

It's worth noting that the standard C library facilities for "multibyte character sets" are not adequate to implement the above. The basic problem is that neither C89 nor C99 gives you any way to specify the character set of a file directly. You can manipulate the "locale," which indirectly specifies the character set, but that's a global change. Further, locale names are not defined by the C standard nor is there any consistent map between them and character sets.

The Single Unix specification, and possibly also POSIX, provide the nl_langinfo and iconv interfaces which mostly circumvent these limitations. We may require these interfaces to be present for complete non-ASCII support to be functional.
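
A minimal sketch of how the two interfaces fit together - querying the locale's character set with nl_langinfo and converting a buffer to UTF8 with iconv (error handling abbreviated):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      /* Pick up the user's locale; CODESET then names its charset.  */
      setlocale (LC_CTYPE, "");
      const char *charset = nl_langinfo (CODESET);   /* e.g. "EUC-JP" */

      iconv_t cd = iconv_open ("UTF-8", charset);
      if (cd == (iconv_t) -1)
        {
          perror ("iconv_open");
          return 1;
        }

      char in[] = "source text in the locale's charset";
      char out[256];
      char *inp = in, *outp = out;
      size_t inleft = strlen (in), outleft = sizeof out;

      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
      fwrite (out, 1, sizeof out - outleft, stdout);

      iconv_close (cd);
      return 0;
    }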

One final note: EBCDIC is, and will be, supported as a source character set if and only if GCC is compiled for a host (not a target) which uses EBCDIC natively.