[arch-dev-public] O'Reilly book: Making Software
My boss just brought this book over to my desk. It's one of those theory books that's all about measuring lines of code and whatnot.

But the reason he brought it over is Chapter 8: "Beyond Lines of Code: Do We Need More Complex Metrics?". Two pages in, it begins with:

**Measuring the Source Code**

We have selected for our case study the ArchLinux software distribution (http://archlinux.org), which contains thousands of packages, all open source. ArchLinux is a lightweight GNU/Linux distribution whose maintainers refuse to modify the source code packaged for the distribution, in order to meet the goal of drastically reducing the time that elapses between the official release of a package and its integration into the distribution.

...

Because of the size of ArchLinux, using it as a case study gives us access to the original source code of thousands of open source projects, through the build scripts used by ABS (see Example 8-1)

The chapter goes on with statistics and all that junk. They're not studying Arch, but using Arch as a launch pad for getting large amounts of open source source code for analysis. The numbers are interesting:

The ArchLinux repositories contained 4096 packages (as of April 2010), with some of the packages being different versions of the same upstream project. After removing different versions, we obtained a sample of 4015 packages, containing 1272748 source code files. Among all those files, 576511 were written in C. However, there were repeated files. In the overall sample, only 776573 were unique files; in the C subsample, only 338831 were unique files. From these unique C files, 212167 were nonheader files and 126664 were header files.
On 22 November 2010 19:39, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
My boss just brought this book over to my desk. It's one of those theory books that's all about measuring lines of code and whatnot.
Great! Proud to see Arch here :) The author might also be an Archer, who knows. BTW I also like the idea of your boss coming to give you the book like "Oh, and there's this O'Reilly book about some of the stuff you do...".

--
Guillaume
On Mon, 22 Nov 2010 12:39:57 -0600 Aaron Griffin <aaronmgriffin@gmail.com> wrote:
The ArchLinux repositories contained 4096 packages (as of April 2010), with some of the packages being different versions of the same upstream project. After removing different versions, we obtained a sample of 4015 packages, containing 1272748 source code files. Among all those files, 576511 were written in C. However, there were repeated files. In the overall sample, only 776573 were unique files; in the C subsample, only 338831 were unique files. From these unique C files, 212167 were nonheader files and 126664 were header files.
hmmm.. 39% of all files in our packages are duplicates. That's interesting. Wonder where they come from.

Dieter
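The 39% figure can be checked from the counts quoted in the book excerpt. The following is a small sketch of that arithmetic; the variable and function names are illustrative and do not come from the book or the thread.

```python
# Counts quoted from the book excerpt (sample as of April 2010).
total_files = 1272748
unique_files = 776573
c_files = 576511
unique_c_files = 338831


def duplicate_share(total, unique):
    """Fraction of files that are repeats of another file in the sample."""
    return (total - unique) / total


overall = duplicate_share(total_files, unique_files)
c_only = duplicate_share(c_files, unique_c_files)

print(f"overall duplicates: {overall:.1%}")
print(f"C file duplicates:  {c_only:.1%}")

# The header / nonheader split quoted in the excerpt adds up to the unique C count.
assert 212167 + 126664 == unique_c_files
```

The same calculation on the C subsample gives a slightly higher share of repeats than the overall sample.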
On 23/11/10 07:09, Dieter Plaetinck wrote:
hmmm.. 39% of all files in our packages are duplicates. That's interesting. Wonder where they come from.
A lot of projects include sources from their dependencies inside their tarball.

Allan
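One way to look for bundled copies like this is to group files by a hash of their contents across unpacked source trees. This is a sketch only: the thread does not say how the book's authors identified repeated files, and `find_duplicates` is a hypothetical helper, not part of any Arch tooling.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Group regular files under `root` by the SHA-256 of their contents.

    Returns a dict mapping digest -> list of paths, keeping only digests
    shared by more than one file (byte-identical copies).
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Pointed at a directory holding several unpacked tarballs, each returned group lists the places where the same file appears, which would show a dependency's sources shipped inside more than one project.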
participants (4)
- Aaron Griffin
- Allan McRae
- Dieter Plaetinck
- Guillaume ALAUX