My boss just brought this book over to my desk. It's one of those theory books that's all about measuring lines of code and whatnot. But the reason he brought it over is Chapter 8: "Beyond Lines of Code: Do We Need More Complex Metrics?". Two pages in, it begins with:

> **Measuring the Source Code**
>
> We have selected for our case study the ArchLinux software distribution (http://archlinux.org), which contains thousands of packages, all open source. ArchLinux is a lightweight GNU/Linux distribution whose maintainers refuse to modify the source code packaged for the distribution, in order to meet the goal of drastically reducing the time that elapses between the official release of a package and its integration into the distribution. ... Because of the size of ArchLinux, using it as a case study gives us access to the original source code of thousands of open source projects, through the build scripts used by ABS (see Example 8-1).

The chapter goes on with statistics and all that junk. They're not studying Arch itself, but using Arch as a launch pad for getting large amounts of open source code to analyze. The numbers are interesting:

> The ArchLinux repositories contained 4096 packages (as of April 2010), with some of the packages being different versions of the same upstream project. After removing different versions, we obtained a sample of 4015 packages, containing 1272748 source code files. Among all those files, 576511 were written in C. However, there were repeated files. In the overall sample, only 776573 were unique files; in the C subsample, only 338831 were unique files. From these unique C files, 212167 were nonheader files and 126664 were header files.
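For anyone who wants to sanity-check those figures, here is a quick Python sketch. The counts are copied straight from the quoted passage; the constant names and the `share` and `summarize` helpers are my own, made up for illustration, and nothing here comes from the book's actual tooling.

```python
# Counts quoted from the chapter (ArchLinux repositories, April 2010).
PACKAGES_TOTAL = 4096
PACKAGES_SAMPLED = 4015
TOTAL_FILES = 1_272_748
C_FILES = 576_511
UNIQUE_FILES = 776_573
UNIQUE_C_FILES = 338_831
C_NONHEADER = 212_167
C_HEADER = 126_664


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize() -> dict:
    """Derive a few ratios from the quoted counts (hypothetical helper)."""
    return {
        # Packages dropped as extra versions of the same upstream project.
        "duplicate_packages": PACKAGES_TOTAL - PACKAGES_SAMPLED,
        # Portion of all source files written in C.
        "c_share_pct": share(C_FILES, TOTAL_FILES),
        # Portion of all files that survive de-duplication.
        "unique_share_pct": share(UNIQUE_FILES, TOTAL_FILES),
        # Portion of C files that survive de-duplication.
        "unique_c_share_pct": share(UNIQUE_C_FILES, C_FILES),
        # Portion of unique C files that are headers.
        "header_share_pct": share(C_HEADER, UNIQUE_C_FILES),
        # The header and nonheader counts should add up to the unique C total.
        "c_split_is_consistent": C_NONHEADER + C_HEADER == UNIQUE_C_FILES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

The header and nonheader counts do add up to the unique C total, so the quoted numbers are at least internally consistent.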