On 12/12/2009, at 10:11 PM, Allan McRae wrote:
Sebastian Nowicki wrote:
As you may have heard, I started a proper PKGBUILD parser[1], which parses according to shell semantics and does a little interpreting. I just released the first version, which doesn't handle errors or multi-line values (like arrays or escaped newlines) very well. It does, however, support split packages. I'm in the process of modifying parched to essentially turn it into Python bindings[2] for pkgparse. You probably already have a parser at this point, so I'm not sure how useful this would be to you (it might be overkill anyway); I just thought I'd let you know.

[1]: http://github.com/sebnow/pkgparse
[2]: http://github.com/sebnow/parched/tree/pkgparse_pyrex
Looks interesting. I will take it for a spin later. I assume this is going towards AUR2?
I had not done any further work on my parser as I was uncertain about the best way to go in developing a makepkg test suite. Given that the makepkg test suite will use a safe set of PKGBUILDs, I was thinking of just using bash to parse them.
That would probably be the simplest and most accurate way (you don't want bugs in the parser to fail tests). The namcap method is a good way to go. You could look at the current AUR2 implementation to get information (it sources the PKGBUILD and spits out python code)[1]. It is indeed for AUR2. I thought that since you were contemplating a PKGBUILD parser, this might be of use to you.

On 12/12/2009, at 11:44 PM, Xavier wrote:
I can't help but think this whole situation is stupid.
I would suppose that PKGBUILDs were written in bash for simplicity: makepkg just needs to source them, and that's it. Whole parsing done for free. And now we realize that with untrusted sources, we cannot do that anymore. And now we basically have to rewrite a bash parser from scratch. I mean, it's hard to imagine a more flawed design, or a more complex solution to a simple problem.
It is, but as Allan pointed out, makepkg's goals are different. AUR effectively uses PKGBUILDs in an unconventional way.
Somehow we manage to go from a very KISS solution to a completely anti-KISS one.
I only see two solutions:
- we keep using bash, but try to do that in the most restricted environment possible (e.g. the namcap way, or maybe there is something even more restrictive and secure?)
I don't think it's possible to make the bash environment any more restrictive than namcap's way. There are fundamental things about bash that make it insecure for blind parsing. The first example that comes to mind is infinite loops. They may not be deliberate, or occur often, but it's a possibility. The others concern various utilities which are required for a sane build environment (mv, install, cat, etc.) but could be dangerous when used maliciously. I recall Callan (or someone else) provided a nice example of this on the aur-dev mailing list. It's probably lost now. A custom parser avoids these problems by not executing commands (and, as a consequence, possibly getting inaccurate metadata, e.g. when a value is computed with sed) and by restricting language constructs like loops (e.g. a maximum number of iterations before an error is raised). There are of course pros and cons to both approaches.
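To make the "not executing commands" point concrete, here is a minimal Python sketch. It is not how pkgparse or parched works; `extract_metadata` is a hypothetical name, and the approach is an assumption for illustration. It only reads single-line `name=value` and `name=(...)` assignments and treats everything else, including command substitutions, as inert text.

```python
import re
import shlex

# Matches a shell-style assignment such as `pkgname=foo` or `depends=('a' 'b')`.
ASSIGNMENT = re.compile(r'^([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def extract_metadata(text):
    """Collect single-line assignments from PKGBUILD text without running it.

    Hypothetical helper: no variable expansion, no command execution.
    """
    meta = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line.strip())
        if not match:
            continue  # not an assignment (function body, command, comment...)
        name, raw = match.groups()
        try:
            if raw.startswith('(') and raw.endswith(')'):
                # Single-line array: split the contents using shell quoting rules.
                meta[name] = shlex.split(raw[1:-1], comments=True)
            else:
                # Scalar: strip quoting, keep the text literally.
                meta[name] = ' '.join(shlex.split(raw, comments=True))
        except ValueError:
            # Unbalanced quotes, e.g. a multi-line value; skipped in this sketch.
            continue
    return meta
```

Something like `url=$(touch /tmp/pwned)` ends up stored as the literal string `$(touch /tmp/pwned)`, which is safe but also shows the accuracy trade-off mentioned above: values computed by commands are not resolved. The sketch also does not distinguish top-level assignments from ones inside functions.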
It is not particularly possible given all the bash "tricks" used in PKGBUILDs.
Why is that? Things like calling external commands and such? And would the alternative parser support these bash tricks anyway?
So why don't we just forbid / not support them ?
Some "tricks" are required for the PKGBUILDs to be sane and remove duplication. One particular trick that comes to mind is: if [ "$CARCH" = "x86_64" ]; then ... fi. This is something that is very common with binary sources, and the PKGBUILD format has no native support for it. Removing support for it would require two almost identical PKGBUILDs. Due to things like this, a bit of interpretation is required for the parser to work properly. I plan on supporting conditionals like these in pkgparse. I have also seen for loops populating fields (I think in the kernel package). Tricks like these should not be supported, as they are due to the laziness of the maintainer. Essentially, for building packages, or extracting metadata from safe packages, using bash directly is the way to go. For environments where arbitrary packages are parsed and security is an issue, a parser is currently required.
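As a rough illustration of the "bit of interpretation" needed for the CARCH case, here is a Python sketch. It is an assumption for discussion, not pkgparse's design; `select_lines` is a hypothetical name. It recognises only the exact form `if [ "$CARCH" = "..." ]; then` (plus `==`, `!=`, `else` and `fi` on their own lines) and refuses any other `if`, so nothing is ever executed.

```python
import re

# Only this one conditional form is understood; anything else is rejected.
IF_CARCH = re.compile(r'^if \[ "\$CARCH" (==?|!=) "([^"]*)" \]; then$')


def select_lines(text, carch):
    """Return the lines of `text` that are active for the given architecture."""
    stack = []   # one boolean per open `if`: is the current branch taken?
    active = []
    for line in text.splitlines():
        stripped = line.strip()
        match = IF_CARCH.match(stripped)
        if match:
            op, value = match.groups()
            equal = (carch == value)
            stack.append(not equal if op == '!=' else equal)
        elif stripped.startswith('if '):
            # Any other condition would need real shell evaluation.
            raise ValueError('unsupported conditional: ' + stripped)
        elif stripped == 'else' and stack:
            stack[-1] = not stack[-1]
        elif stripped == 'fi' and stack:
            stack.pop()
        elif all(stack):
            # Kept only if every enclosing branch is taken.
            active.append(line)
    return active
```

The surviving lines could then be handed to whatever extracts the assignments, so a later `source=(...)` inside the taken branch overrides an earlier default. Loops are deliberately not handled here.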
- we decide that the PKGBUILD format is a flawed design, too limited for our needs, and switch to a new one (in which case Xyne's brainstorming could help: http://xyne.archlinux.ca/ideas/pkgmeta )
I have been thinking of designing a new package metadata format myself, one that is more universal (i.e. not specific to pacman). I doubt this is achievable, but it's an interesting experiment. The idea is mostly inspired by the rockspec format for luarocks[2]. It seems very flexible and provides lots of data. The Lua syntax resembles JSON, so that format can be used instead. YAML could also be used to make it "prettier" and even more flexible. Anyway, that's a discussion for another thread. A new format is unlikely to be used by Arch Linux any time soon. Sorry for going a little off-topic there.

[1]: http://github.com/sebnow/aur2/blob/1ad21387a58eab0d3cf4296f08b6edee0a1e27d2/...
[2]: http://www.luarocks.org/en/Creating_a_rock