[pacman-dev] [RFC] support for URL query strings and fragments
Hi,

pacman/libalpm currently only supports "bare" server URIs of the form

    <scheme name> : <hierarchical part>

whereas a full URI is

    <scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]

Support for query strings would allow for more flexible server configurations using dynamic content. For the sake of a concrete example, I'm in the middle of rewriting Pacserve and I could really use "?repo=$repo&arch=$arch" to keep all packages in one apparent server directory while still being able to correctly redirect to external mirrors. Redirection requires both values to determine and interpolate the server URL before returning it to Pacman with a 307.

At first I thought this would be relatively easy to do. I took a quick look at the code but I didn't find a common function to affix the file name to the URL (although I did see "sanitize_url" in db.c).

I see two ways of doing this:

1) Support a "$file" variable in the URL. If the URL doesn't contain the variable, add the file name to the end as usual.

2) Remove the query and fragment, treat the URL as usual, then restore the query and fragment.

The first would allow for file names in query strings (e.g. "?file=$file"). It could be used for a build server, for example. Of course, you can extract the file name from the request path, but that requires hacking the server code or using something like mod_rewrite to mangle URLs. Having the file name sent in a GET variable is much more convenient for server-side programming.

In either case, it should be enough to have a single, central function that accepts the template URL and the filename (pkg or db) and returns the full URL. The sanitize_url function would also need to handle query strings and fragments.

Would there be any objection to this if a patch were submitted?
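To make the proposal concrete, here is a minimal Python sketch of such a central function. It is only an illustration of the two options, not libalpm code (which is C): the name build_file_url is hypothetical, and percent-encoding the file name is an assumption of this sketch rather than something pacman is known to do.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_file_url(template: str, filename: str) -> str:
    """Combine a server URL template and a file name into a download URL.

    Hypothetical helper: the name and behaviour are assumptions of this sketch.
    """
    encoded = quote(filename, safe="")  # percent-encode everything reserved
    if "$file" in template:
        # Option 1: the template says explicitly where the file name goes.
        return template.replace("$file", encoded)
    # Option 2: set the query and fragment aside, append the file name to
    # the path as usual, then restore the query and fragment.
    scheme, netloc, path, query, fragment = urlsplit(template)
    path = path.rstrip("/") + "/" + encoded
    return urlunsplit((scheme, netloc, path, query, fragment))
```

With $repo and $arch already interpolated, build_file_url("http://example.com/pkgs/?repo=bar&arch=x86_64", "foo-1.3-4-x86_64.pkg.tar.xz") gives "http://example.com/pkgs/foo-1.3-4-x86_64.pkg.tar.xz?repo=bar&arch=x86_64", and a bare URL without a query string behaves as it does today.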
On Fri, May 10, 2013 at 02:34:34PM +0000, Xyne wrote:
> Hi,
> pacman/libalpm currently only supports "bare" server URIs of the form
> <scheme name> : <hierarchical part>
> whereas a full URI is
> <scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]
> Support for query strings would allow for more flexible server configurations using dynamic content. For the sake of a concrete example, I'm in the middle of rewriting Pacserve and I could really use "?repo=$repo&arch=$arch" to keep all packages in one apparent server directory while still being able to correctly redirect to external mirrors. Redirection requires both values to determine and interpolate the server URL before returning it to Pacman with a 307.
> At first I thought this would be relatively easy to do. I took a quick look at the code but I didn't find a common function to affix the file name to the URL (although I did see "sanitize_url" in db.c).
> I see two ways of doing this:
> 1) Support a "$file" variable in the URL. If the URL doesn't contain the variable, add the file name to the end as usual.
> 2) Remove the query and fragment, treat the URL as usual, then restore the query and fragment.
> The first would allow for file names in query strings (e.g. "?file=$file"). It could be used for a build server, for example. Of course, you can extract the file name from the request path but that requires hacking the server code or using something like mod_rewrite to mangle URLS. Having the file name sent in a get variable is much more convenient for server-side programming.

Maybe I didn't understand your problem. But wouldn't using the Content-Disposition header solve it?

> In either case, it should be enough to have a single, central function that accepts the template URL and the filename (pkg or db) and returns the full URL. The sanitize_url function would also need to handle query strings and fragments.
> Would there be any objection to this if a patch were submitted?
On 2013-05-10 18:02 +0300 Mohammad_Alsaleh wrote:
> > The first would allow for file names in query strings (e.g. "?file=$file"). It could be used for a build server, for example. Of course, you can extract the file name from the request path but that requires hacking the server code or using something like mod_rewrite to mangle URLS. Having the file name sent in a get variable is much more convenient for server-side programming.
> Maybe I didn't understand your problem. But wouldn't using the Content-Disposition header solve it?
That is unrelated.

Going back to the Pacserve example: the server runs on localhost. When a package is requested via an HTTP GET request, it checks the local cache for the package and returns it if it is there. If not, it queries other Pacserves on the LAN and sends a simple 303 response to redirect to a Pacserve server that has the package, if there is one.

Because Pacserve only serves packages (and not databases), there is no need to have different directories for different repos and architectures. The architecture is contained in the package name and the repo is unimportant. The same reasons permit the official servers to keep all of their packages in the "pool" directories without any confusion.

The problem arises when no local Pacserve servers have the package and Pacserve needs to redirect to an external mirror. It then needs to know which repo and which architecture the package is for so that it can select the correct URL from pacman.conf and replace the "$repo" and "$arch" variables in the URL before returning it to the client with the package name appended.

Currently this information must be gathered by creating paths such as

    /core/os/i686
    /core/os/x86_64
    /extra/os/i686
    /extra/os/x86_64
    ...

on the server. Even if you have access to the server software and can tweak the configuration or settings for path rewriting (e.g. with Apache's mod_rewrite, directly or via .htaccess), it's still a pain and it's silly if all you need are the repo and arch values.

It should be possible to pass those values via GET parameters in such a way that pacman can convert:

    Server = http://example.com/pkgs/?repo=$repo&arch=$arch

to

    http://example.com/pkgs/foo-1.3-4-x86_64.pkg.tar.xz?repo=bar&arch=x86_64

Pacman blindly interpolates $repo and $arch, so that works (although it really should percent-encode them to be sure), but it does not understand the query string and fragment parts of the URI, so it can't append the name to the path.
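For illustration, the server side of that redirect could be sketched in Python roughly as follows. This is not Pacserve's actual code: the function name mirror_redirect and the mirror host in the usage note are hypothetical, and the mirror template is assumed to have been read from pacman.conf already.

```python
import posixpath
from urllib.parse import parse_qs, quote, urlsplit


def mirror_redirect(request_target: str, mirror_template: str) -> str:
    """Build the external mirror URL for a package request.

    request_target is the path plus query string from the GET request line;
    mirror_template is a Server value still containing $repo and $arch.
    Raises KeyError if the request lacks a repo or arch parameter.
    """
    parts = urlsplit(request_target)
    # The package name is the last path segment, left exactly as received.
    filename = posixpath.basename(parts.path)
    params = parse_qs(parts.query)
    repo = params["repo"][0]
    arch = params["arch"][0]
    # Interpolate the mirror template, percent-encoding both values.
    base = mirror_template.replace("$repo", quote(repo, safe=""))
    base = base.replace("$arch", quote(arch, safe=""))
    return base.rstrip("/") + "/" + filename
```

For a request target of "/pkgs/foo-1.3-4-x86_64.pkg.tar.xz?repo=bar&arch=x86_64" and the made-up template "http://mirror.example.org/$repo/os/$arch", this returns "http://mirror.example.org/bar/os/x86_64/foo-1.3-4-x86_64.pkg.tar.xz", which would then go into the Location header of the redirect.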
For now I have worked around it with

    Server = http://localhost:15678/pkg/?repo=$repo&arch=$arch&file=

but that is a kludge that requires additional server processing and it still generates malformed URLs because it will be converted to "...&file=/foo...". The forward slash should be percent-encoded and there is no way to get Pacman to omit the slash, so even if it works in some cases, it is technically wrong.

Obviously I have started this discussion because I could really use this for Pacserve, but it would also be very useful for scripting package servers. You could send all the desired information to the server via GET parameters (repo, arch, package) and have the server locate or build the package. With this, everything is controlled entirely by a single script on the server. Without it, the path has to be mangled, which requires access to the server software or particular settings.

To preempt one possible argument: Arch might not officially support such URIs, but pacman aims to be distro-agnostic. Besides, modularizing code and making it more robust is a good thing.

I hope that clears up the idea. I'm tired and likely rambling, so I'll shut up now.
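The workaround and the extra server processing it needs can be sketched in Python as below. The helper names are hypothetical, and naive_join only models the behaviour described in this message (a slash and the file name appended to the Server value); it is not taken from the libalpm source. The repo value "core" is an arbitrary example.

```python
from urllib.parse import parse_qs, urlsplit

# Server value from the workaround, with $repo and $arch already interpolated.
server = "http://localhost:15678/pkg/?repo=core&arch=x86_64&file="
filename = "foo-1.3-4-x86_64.pkg.tar.xz"


def naive_join(server: str, filename: str) -> str:
    """Model of the current behaviour: append "/" and the file name."""
    return server + "/" + filename


def recover_filename(url: str) -> str:
    """Server-side cleanup: read the file parameter and drop the stray slash."""
    value = parse_qs(urlsplit(url).query)["file"][0]
    return value.lstrip("/")


url = naive_join(server, filename)
print(url)                    # the file parameter begins with an unwanted "/"
print(recover_filename(url))  # the server has to strip it to get the name back
```

The point of the sketch is only that the slash ends up inside the query value, so every server script has to know about it and remove it.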
participants (2): Mohammad_Alsaleh, Xyne