[pacman-dev] [PATCH v2 5/8] Avoid problematic use of Python's StringIO.

Mon Oct 14 16:48:03 EDT 2013

On 15 October 2013 02:01, Jeremy Heiner <scalaprotractor at gmail.com> wrote:
> On Sun, Oct 13, 2013 at 7:55 PM, Allan McRae <allan at archlinux.org> wrote:
>> I am going to merge all these patches apart from this one and the final
>> patch.  If a consensus can be found on how to deal with this issue, I
>> will pull it in - I am not familiar enough with python issues to make
>> the decision myself.
>
> Thanks, Allan. I'm gratified that I can help make (some small) improvements.
>
> Sorry I got delayed, but I said I would explain how the Python 2
> string gotchas impact the pacman testing framework. I think I found a
> way to shorten this from what I had anticipated, so hopefully it won't
> be completely boring...
>
> There are two pmtests with non-7bit-ascii chars: remove071 and
> sync600. remove071 creates one pmpkg (p2) and adds it to the "local"
> pmdb. sync600 copy-n-pastes that same p2 pmpkg setup, but also creates
> and adds sp2 to the "sync" pmdb.
>
> The framework does very different things for the pmdbs: "local" stuff
> gets written to the filesystem (simulating in Python code what pacman
> would do to install), while "sync" stuff gets written to a tarfile (for
> later processing by the pacman binary being tested). That is the key
> difference and stumbling block (and also why this can't be dealt with
> in sync600).
>
> Python's filesystem write API gracefully handles strings of all sorts,
> automatically converting char-to-byte as needed, so the "local" pmpkg
> p2 (in both pmtests) works great, but...
>
> The tarfile.addfile API requires a fileobject, so the caller of that
> API is responsible for handling the low-level char-to-byte conversion.
> Python 2.7's StringIO meets that need. But in 3.x there aren't just
> fileobjects, there's BufferedIOBase (the parent class for BytesIO) which
> reads and writes bytes, and TextIOBase (the parent class for StringIO)
> which reads and writes chars. tarfile.addfile writes bytes, so in 3.x
> it fails when it tries to read bytes from a TextIOBase.
>
> So how do we feed tarfile.addfile what it wants without special-casing
> for the Python runtime version? Rather than typing up a long
> explanation of why there is no way to meet that goal I've attached a
> Python script that tries all the options I could think of and produces
> a nice printout of the reason for failure in each case. The last line
> of the printout lists the successful options - those that work for
> that particular Python runtime. Running it on 2.7 and 3.3 shows no
> single option is successful for both.
>
> The attached script covers what Martin suggested (assuming I haven't
> misunderstood what he meant). And if anyone can think of an option
> that I didn't, please post a reply - I love learning new things.
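
For reference, the Python 3 failure described above can be reproduced
in a few lines. This is only a sketch under my own assumptions, not the
attached script: the helper try_addfile and the member name "desc" are
made up for illustration, and the archive is written to an in-memory
io.BytesIO rather than to a real sync db.

```python
import io
import tarfile


def try_addfile(fileobj, size):
    """Return None if addfile succeeds, else the TypeError it raised."""
    # The archive itself is a binary stream held in memory.
    archive = tarfile.open(fileobj=io.BytesIO(), mode="w")
    info = tarfile.TarInfo(name="desc")  # hypothetical member name
    info.size = size
    try:
        archive.addfile(info, fileobj)
        return None
    except TypeError as exc:
        return exc
    finally:
        archive.close()


text = u"错误"
data = text.encode("utf-8")

# A text stream hands back characters, which the binary archive rejects.
text_error = try_addfile(io.StringIO(text), len(text))

# A byte stream works; note the size is counted in bytes, not characters.
bytes_error = try_addfile(io.BytesIO(data), len(data))
```

On Python 3, text_error ends up holding a TypeError while bytes_error
stays None.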

Here are two suggestions:

1. If you put the “u” prefix on the Chinese string, it becomes a
Unicode string in Python 2, and encode("utf-8") then works for 2.7 and
3.3+ (Python 3.0 to 3.2 reject the “u” prefix as a syntax error). You
can also have the “u” prefix on the other two ASCII strings but that
is optional:

# . . .
for entry in ["ascii", u"错误", u"7bit"]:
    # . . .

2. Put the following __future__ line right at the top of the file
(after the coding line, which Python 2 needs in order to accept the
non-ASCII literal), and remove all the “u” prefixes on the strings.
This effectively makes them all Unicode strings in Python 2 and 3. The
encode("utf-8") call should work for 2.7, and since the “u” prefix can
be removed, it should also work for Python 3.2 and earlier.

# coding=utf8
from __future__ import unicode_literals

import io
import os
# . . .

for entry in ["ascii", "错误", "7bit"]:
    # . . .
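
With either option the loop ends up holding Unicode strings, so the
encoded bytes can be wrapped in io.BytesIO and passed to
tarfile.addfile. Here is a rough sketch of that step; the helper
add_text_member and the "entryN" member names are hypothetical, not
taken from the test framework, and the “u” prefixes could be dropped
if the unicode_literals import from option 2 is in effect.

```python
import io
import tarfile


def add_text_member(archive, name, text):
    """Encode text as UTF-8 and add it to an open tarfile as one member."""
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name=name)
    info.size = len(data)  # size must be the byte count, not len(text)
    archive.addfile(info, io.BytesIO(data))


# Write the three example strings into an in-memory archive.
buf = io.BytesIO()
archive = tarfile.open(fileobj=buf, mode="w")
for index, entry in enumerate([u"ascii", u"错误", u"7bit"]):
    add_text_member(archive, "entry%d" % index, entry)
archive.close()

# Read the members back and decode them to check the round trip.
buf.seek(0)
readback = tarfile.open(fileobj=buf, mode="r")
contents = [readback.extractfile(member).read().decode("utf-8")
            for member in readback.getmembers()]
readback.close()
```

The important detail is that info.size comes from the encoded bytes;
using len() of the Unicode string would be too short for the Chinese
entry.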

I think the second option is the better one, since it uses proper
Python 3 syntax and should allow compatibility with Python 3.2. I am
going away from the Internet for a week, but if you are still having
trouble coming up with a good solution after that I might be motivated
to actually run the test suite myself and come up with a patch :P