Hello,
I had some fun recently trying to apply the technique of diverse double-compiling (DDC), and the result I obtained would indicate that the gcc compiler is corrupted. I don't believe that for now, but I can't explain the result either.
To set the stage, I looked at the ACSAC paper "Countering Trusting Trust through Diverse Double-Compiling". It describes a technique that could answer the catch-22 of a corrupted compiler, as explained and demonstrated historically by Ken Thompson in his seminal paper:
https://www.ece.cmu.edu/~ganger/712....1-thompson.pdf
I set out to do the opposite: rather than recompiling gcc with another compiler, tcc, as suggested, I recompiled tcc with tcc, as tcc is much, much simpler to compile than gcc, and is said to be self-compiling, as it should be.
I first tried this on a recent Ubuntu install, and because I had difficulties, I downloaded the old Ubuntu server installer for Dapper, Ubuntu 6.06.2 LTS (Dapper Drake), and installed the x86 server version in VirtualBox.
I downloaded the tcc source from the release index (Index of /releases/tinycc).
I installed the gcc, build-essential and texinfo packages on my virtual machine.
I built and installed tcc (version 0.9.26) exactly as described in the README file (all standard options), using ./configure, make, make test and sudo make install. All this went smoothly.
Of course, at this point I had compiled the tcc source with the gcc of the Dapper installation, so this was not tcc compiled by itself. But as I now had a working tcc binary installed, I could rebuild tcc, this time using tcc itself.
This is done by cleaning out the directory with make clean, and using:
./configure --cc=tcc
followed by make, make test, and make install.
Of course, as the tcc source was now compiled with tcc and not with gcc, it is normal that we obtained ANOTHER binary of tcc (it is the output of a different compiler: tcc instead of gcc). So it was no surprise that the binaries were different.
But here comes the strange part: I set aside all the *.o and *.a files, the lib and the tcc executable in another directory, cleaned out the build directory AGAIN, and recompiled the tcc suite AGAIN using tcc.
Note the subtlety: the first time, I compiled the tcc suite using tcc, but which had been built by gcc. The second time, I used the result of that, tcc built with tcc, to build the tcc suite.
In principle, I was expecting identical binaries. It turns out that the *.o files resulting from this second tcc compilation are indeed identical, but the strange thing is that the resulting executable, tcc, is DIFFERENT.
Stranger still: I installed this tcc over the former tcc, and when I rebuilt the tcc suite AGAIN, a third time, the binaries were identical.
According to the DDC test, this is the indication of a tampered compiler binary, if one doesn't find any other explanation. With my tin foil hat on, I would say: there's a backdoor in the gcc binary since at least 2006 (Dapper)! Of course, I think there's something else that's wrong, but I wouldn't know what.
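The sequence of builds described above can be written down as a small driver. This is only a sketch: the function names `stage_commands` and `run_stage` are made up for illustration, the only configure option used is the `--cc=tcc` one quoted above, and setting the build outputs aside between stages is left out. It also assumes that `make install` can be run with sufficient rights (the first install in this post used sudo).

```python
import subprocess


def stage_commands(compiler=None, clean_first=True):
    """Commands for one build of the tcc suite, in the order used in this post.

    compiler=None means a plain ./configure (the host's default compiler,
    gcc here); otherwise --cc=<compiler> is passed to configure.
    """
    configure = ["./configure"]
    if compiler is not None:
        configure.append(f"--cc={compiler}")
    commands = [configure, ["make"], ["make", "test"], ["make", "install"]]
    if clean_first:
        commands.insert(0, ["make", "clean"])
    return commands


def run_stage(source_dir, compiler=None, clean_first=True, runner=subprocess.run):
    """Run one stage inside source_dir, stopping at the first failing command."""
    for argv in stage_commands(compiler, clean_first):
        runner(argv, cwd=source_dir, check=True)


# The chain of builds in this post, by label:
#   A: run_stage(src, compiler=None, clean_first=False)   built by gcc
#   B: run_stage(src, compiler="tcc")                      built by A
#   C: run_stage(src, compiler="tcc")                      built by B
#   D: run_stage(src, compiler="tcc")                      built by C
```

Because each stage installs its result, the "tcc" picked up by the next stage is the one that was just built, which is what makes B, C and D three different steps even though the command lines are the same.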
Here are the byte lengths of the different versions:
tcc executable (A) from tcc suite compiled with gcc: 385632 bytes
tcc executable (B) from tcc suite compiled with (A): 489020 bytes
tcc executable (C) from tcc suite compiled with (B): 486564 bytes
tcc executable (D) from tcc suite compiled with (C): 486564 bytes
C and D are byte for byte identical.
If gcc didn't have any back door, I was expecting from the DDC test that B, C and D would be identical.
The OBJECT files *.o from the compiled suites B, C and D are, however, identical, so it is apparently tcc as a linker that is the culprit.
Note that the *.a files are different, but that is normal, as the *.a files contain the file dates, and the object files are of course created at different moments. But as they are made up of identical object files, that shouldn't really matter. They have identical lengths between the B, C and D versions, and they differ from A, as expected:
libtcc1.a has length 40020 and libtcc.a has length 511102 in version A; they have lengths 29670 and 474498 respectively in versions B, C and D.
This is pretty puzzling. Does anybody have an idea what can explain the difference in the executable between version B (tcc compiled with tcc which had been compiled with gcc) and versions C and D?
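For anyone who wants to repeat the comparison, here is a minimal sketch of a byte-for-byte check that also reports where two files first diverge. The helper names are hypothetical, and nothing here is specific to tcc; knowing the offset of the first difference can help to see which part of the executable changed.

```python
import hashlib


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One blob is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def describe(label: str, blob: bytes) -> str:
    """One-line summary: label, length in bytes and a short SHA-256 prefix."""
    digest = hashlib.sha256(blob).hexdigest()[:16]
    return f"{label}: {len(blob)} bytes, sha256 {digest}"


def compare_files(path_a: str, path_b: str):
    """Read both files fully and return the first differing offset (or None)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_difference(fa.read(), fb.read())
```

Calling `compare_files` on two saved copies of the tcc executable returns `None` when they are identical and a byte offset otherwise.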
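The claim that the archives differ only in their dates can be checked directly. The sketch below assumes the common Unix ar layout: an 8-byte magic string, then for each member a 60-byte header (name, modification time, owner, group, mode, size) followed by the data, padded to an even length. It compares member names and contents and ignores the rest of the header. `build_ar` exists only to make small test archives; all function names are made up for illustration.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60


def ar_members(blob: bytes):
    """Return [(name, data), ...], dropping mtime, uid, gid and mode."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    pos = len(AR_MAGIC)
    while pos + HEADER_SIZE <= len(blob):
        header = blob[pos:pos + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError(f"bad member header at offset {pos}")
        name = header[0:16].rstrip(b" ")
        size = int(header[48:58].decode("ascii").strip())
        start = pos + HEADER_SIZE
        members.append((name, blob[start:start + size]))
        pos = start + size + (size % 2)  # data is padded to an even length
    return members


def same_payload(a: bytes, b: bytes) -> bool:
    """True if both archives hold the same members with the same contents."""
    return ar_members(a) == ar_members(b)


def build_ar(members, mtime: int) -> bytes:
    """Build a tiny archive in memory, for demonstration only."""
    out = bytearray(AR_MAGIC)
    for name, data in members:
        out += b"%-16s%-12d%-6d%-6d%-8s%-10d" % (name, mtime, 0, 0, b"100644", len(data))
        out += b"`\n" + data
        if len(data) % 2:
            out += b"\n"
    return bytes(out)
```

Two archives built from the same members at different times compare as different files but as the same payload, which is the situation described above for the B, C and D versions of libtcc1.a and libtcc.a.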
The reason why this shouldn't happen is the following.
Consider that you have a source code S in language L which describes a program that, say, takes text files as input, and puts out a text file that is the alphabetic list of words in the input file, one per line. Normally, the source code S and the language definition L determine perfectly the output file O.txt that one should obtain, if one gives it an input file I.txt.
If one compiles the source code S with a compiler C1 to obtain an executable E1, and one compiles the same source code with a compiler C2 to obtain an executable E2, then of course, E1 and E2 will be different executables. Their performance can differ. But they should be *semantically* equivalent: executable E1 should take the text file I.txt and produce an output O.txt, and executable E2 should take text file I.txt and produce the same output O.txt, because that is what the source code S and the language definition L specify! If the executable does other things than the source code indicates, then the executable is not a correctly compiled executable of that source code and language.
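As a small illustration of what "semantically equivalent" means here, the two functions below play the roles of E1 and E2 for the word-sorting program. They are built differently inside but must agree on every input. The details are assumptions made for the example: words are split on whitespace, duplicates are kept, and ordering is plain string comparison.

```python
import bisect


def sorted_words_e1(text: str) -> str:
    """Sort all words at once with the built-in sort."""
    return "".join(word + "\n" for word in sorted(text.split()))


def sorted_words_e2(text: str) -> str:
    """Insert words one by one into an already sorted list."""
    words = []
    for word in text.split():
        bisect.insort(words, word)
    return "".join(word + "\n" for word in words)


sample = "pear apple fig\napple"
print(sorted_words_e1(sample) == sorted_words_e2(sample))  # True
```

The internals (and the running time) differ, just as E1 and E2 differ as binaries, but for the same I.txt both produce the same O.txt.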
Now, in my test, the source code S is the source code of tcc, and is written in C. It is the source code of a compiler, that is, a program that takes source code I.txt, and produces a binary executable O.txt.
If I compile source code S with two different compilers, namely, gcc, and tcc, I should of course obtain different binaries, but these binaries are both executables that should do what the source code S (here, the source code of tcc) and language L (here, C) tells it to do.
If I apply these two different executables to the same input I.txt (here, the source code of tcc), I expect the same output (the generated binary of tcc). The fact that the two binaries (B) and (C) are different outputs of the two executables E1 and E2 (in my exercise, versions A and B of the tcc executable) for the same input indicates that E1 and E2 cannot both be correct executables corresponding to the same source code S.
So my question is: what is happening? I have difficulty believing that this is a genuine security issue in the gcc compiler, but if not, what is it?
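The expectation can be stated as a toy model. This is not a model of tcc or gcc, only of the argument: an executable is reduced to two labels, the source it implements and the compiler logic that generated its bytes, and it assumes that a correct compiler's output depends only on its own semantics and on the input source. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Executable:
    semantics: str      # the source code this binary faithfully implements
    generated_by: str   # the compiler logic that produced its bytes


def run_compiler(compiler: Executable, source: str) -> Executable:
    """Apply a correct compiler binary to a source.

    The output implements `source`; its bytes are shaped only by what the
    compiler *does* (its semantics), not by how the compiler itself was built.
    """
    return Executable(semantics=source, generated_by=compiler.semantics)


gcc = Executable(semantics="gcc source", generated_by="unknown")
A = run_compiler(gcc, "tcc source")   # tcc built by gcc
B = run_compiler(A, "tcc source")     # tcc built by A
C = run_compiler(B, "tcc source")     # tcc built by B
D = run_compiler(C, "tcc source")     # tcc built by C

print(A == B)        # False: different code generators
print(B == C == D)   # True: the expectation stated above
```

In this model B, C and D coincide. The measured result (B differs, C and D agree) means that at least one assumption of the model does not hold for the real build.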
Can a linker (here, tcc) which has been compiled with two different compilers, but from the same source code, generate different executables from the same object files without something fishy going on ?
PS: btw, I had posted a very similar message to codingforums.com, but it didn't get any replies, so maybe this place is more suited.