As of 11 September 2000, cpplib has largely been completed. It has received six months of testing as the only preprocessor used by development gcc, and I'm pretty happy with its stability at this point.
cpplib is now linked into the C, C++ and Objective C front ends; this will be the case for GCC 3.0 too.
Dedicated testing is not really necessary. If you do test, be prepared for odd glitches; see below for the list of known problems.
The best thing to test with the integrated preprocessor is large packages that (ab)use the preprocessor heavily. The compiler itself is pretty good at that, but doesn't cover all the bases. If you've got cycles to burn, please try one or more of:
A bug report saying 'package FOO won't compile on system BAR' is useless. We need short testcases with no system dependencies. Aim for less than fifty lines and no #includes at all. I recognize this won't always be possible.
Also, please file off everything that would cause us legal trouble
if we were to roll your test case into the distributed test suite.
Short test cases will almost always fall under fair use guidelines, so
don't sweat it too much. An example of a problem is if your test case
includes a 200-line comment detailing inner workings of your program.
(A 200-line comment might be what you need to provoke a bug, but its
contents are unlikely to matter. Try running it through "tr A-Za-z x".)
As usual, report bugs to gcc-bugs@gcc.gnu.org. But please read the rest of this document first!
Bug reports in code which must be compiled with gcc -traditional
are of interest, but have much lower priority than reports in
standard-conforming C/C++. Traditional mode is implemented by a
separate program, not by cpplib. Oh, and the lack of support for
varargs macros in traditional mode is a deliberate feature.
-traditional and -save-temps now work with the integrated preprocessor.
_Pragma operator has been implemented.
configure --disable-c-cpplib to force use of the external preprocessor.
#pragmas, the Fortran and Java compilers fail to link; the
target-specific code expects a routine called c_lex which does not
exist in those compilers. Possibly affected targets are the c4x, i370,
i960, and v850.
Universal character escapes (\uxxxx, \Uxxxxxxxx) are not recognized
except in string or character constants, and will be misinterpreted in
character constants appearing in #if directives. Again, proper support
has to be coordinated with the compiler proper.
-MM currently does).
This is mostly done.
diagnostic.c, which is better than writing out and processing
linemarker commands, but still suboptimal.
tree.c) is no good.
c-lex.c converting constants of various stripes to their internal
representations might be better off done in cpplib. I can make a case
either way.
-g3 does not add information about macro definitions to the debugging
output. This is minor; -g3 only works with the obsolete DWARF
version 1, and no one seems to mind.
/usr/include and save the results in a .jar file or similar, bypassing
filesystem overhead as well as the work of lexical analysis.
Proper non-ASCII character handling is a hard problem. Users want to be able to write comments and strings in their native language. They want the strings to come out in their native language and not gibberish after translation to object code. Some users also want to use their own alphabet for identifiers in their code. There is no one-to-one or many-to-one map between languages and character set encodings. The subset of ASCII that is included in most modern day character sets does not include all the punctuation C uses; some of the missing punctuation may be present but at a different place than where it is in ASCII. The subset described in ISO646 may not be the smallest subset out there.
At the present time, GCC supports the use of any encoding for source code, as long as it is a strict superset of 7-bit ASCII. By this I mean that all printable (including whitespace) ASCII characters, when they appear as single bytes in a file, stand only for themselves, no matter what the context is. This is true of ISO8859.x, KOI8-R, and UTF8. It is not true of Shift JIS and some other popular Asian character sets. If they are used, GCC may silently mangle the input file. The only known specific example is that a Shift JIS multibyte character ending with 0x5C will be mistaken for a line continuation if it occurs at the end of a line. 0x5C is "\" in ASCII.
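The Shift JIS hazard described above can be illustrated with a small check. This is only a sketch and not code from cpplib: the function names are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are an assumption about the Shift JIS variant in use. It reports whether a line ends in a 0x5C byte that is really the second byte of a two-byte character.

```c
#include <stddef.h>

/* Hypothetical helper: is c the first byte of a two-byte Shift JIS
   character?  The ranges are an assumption about the variant in use. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return 1 if the line (newline already stripped) ends in a byte 0x5C
   that is the second byte of a two-byte character, so that treating it
   as a backslash line continuation would be wrong.  Return 0 otherwise. */
int sjis_false_continuation(const unsigned char *line, size_t len)
{
    size_t i = 0;
    int last_was_trail = 0;

    /* Walk forward from the start; the role of a byte cannot be decided
       by looking at that byte alone. */
    while (i < len) {
        if (sjis_is_lead_byte(line[i]) && i + 1 < len) {
            i += 2;              /* consume lead byte and trail byte */
            last_was_trail = 1;
        } else {
            i += 1;              /* single-byte character */
            last_was_trail = 0;
        }
    }
    return len > 0 && last_was_trail && line[len - 1] == 0x5C;
}
```

For example, a comment line ending in the byte pair 0x95 0x5C is flagged, while a line ending in a genuine single-byte backslash is not. The forward scan is the point of the sketch: a byte-oriented lexer that only inspects the final byte cannot tell the two cases apart.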
Assuming a safe encoding, characters not in the base set listed in
the standard (C99 5.2.1) are syntax errors if they appear outside
strings, character constants, or comments. In strings and character
constants, they are taken literally - converted blindly to numeric
codes, or copied to the assembly output verbatim, depending on the
context. If you use the C99 \u and \U escapes, you get UTF8, no
exceptions. These too are only supported in string and character
constants.
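As a rough illustration of the \u behaviour, the following sketch checks how one escape is stored in a string constant. The function name is hypothetical, and the result assumes a compiler that encodes these escapes as UTF8 in narrow strings, as described above; other compilers or character set options may behave differently.

```c
#include <string.h>

/* U+00E9 (e with acute accent) written with a C99 \u escape. */
static const char e_acute[] = "\u00e9";

/* Return 1 if the escape was stored as the two UTF8 bytes 0xC3 0xA9,
   0 if the compiler chose some other encoding. */
int escape_is_utf8(void)
{
    return strlen(e_acute) == 2
        && (unsigned char)e_acute[0] == 0xC3
        && (unsigned char)e_acute[1] == 0xA9;
}
```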
We intend to improve this as follows:
U+0024 will be permitted in identifiers if and only if $ is permitted.
#pragma, or rely on the default established by the user with locale or
a command line option.
The #pragma, if used, must be the first line in the file. This will
not prevent the multiple include optimization from working. GCC will
also recognize MULE (Multilingual Emacs) magic comments, byte order
marks, and any other reasonable in-band method of specifying a file's
character set.
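Of the in-band methods listed above, byte order marks are the simplest to sketch. The code below is an illustration only, not GCC's implementation; the enum and function names are hypothetical. It inspects the first bytes of a buffer for the UTF-8 and UTF-16 byte order marks.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Classify the byte order mark, if any, at the start of buf.
   Limitation: a UTF-32 little-endian mark (FF FE 00 00) begins with the
   same two bytes as the UTF-16 little-endian mark and is reported as
   BOM_UTF16_LE here. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A file with no mark yields BOM_NONE, in which case the character set would have to come from one of the other methods or from the user's default.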
It's worth noting that the standard C library facilities for "multibyte character sets" are not adequate to implement the above. The basic problem is that neither C89 nor C99 gives you any way to specify the character set of a file directly. You can manipulate the "locale," which indirectly specifies the character set, but that's a global change. Further, locale names are not defined by the C standard nor is there any consistent map between them and character sets.
The Single Unix specification, and possibly also POSIX, provide the
nl_langinfo and iconv interfaces which mostly circumvent these
limitations. We may require these interfaces to be present for
complete non-ASCII support to be functional.
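A minimal sketch of how these two interfaces fit together follows. It is not taken from GCC: the function names are hypothetical, the choice of UTF-8 as the conversion target is an assumption made for the example, and the charset names accepted by iconv_open vary between systems.

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Name of the character set implied by the user's locale, for example
   "UTF-8" or "ISO-8859-1".  Note that setlocale is a global change. */
const char *locale_charset_name(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Convert inlen bytes in the character set named `from` to UTF-8.
   Returns the number of bytes written to out, or (size_t)-1 if the
   character set is unknown or the conversion fails. */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", from);
    char *inp = (char *)in;     /* iconv takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

Unlike the locale, the conversion descriptor is per call, so each file could in principle be converted from its own character set. On some systems iconv lives in a separate library and the program must be linked with -liconv.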
One final note: EBCDIC is, and will be, supported as a source character set if and only if GCC is compiled for a host (not a target) which uses EBCDIC natively.