On Mon, Apr 25, 2011 at 7:49 PM, Dan McGee <dpmcgee@gmail.com> wrote:
On Mon, Apr 25, 2011 at 9:35 PM, elij <elij.mx@gmail.com> wrote:
While converting the aur codebase from mysql to postgresql (just for fun/to see if I could), I ran into a few mysql specific gotchas, and a few oddities. Here are some of my notes from the experience.
Mind sending the diff along for funzies? Even if we can't apply it as a patch, it would be good to see and maybe chop up into "this will be compatible with all DBs so we should fix it anyway" and "if we switch we will need to look into this".
Sure. I will probably just do a git format-patch, dump the patches into a tarball and upload it somewhere, since it is quite large, and not exactly 'split into single change patches'. ;)
* I ran into a few SQL comparisons against integer columns that were written as string comparisons. postgres didn't like this, but mysql was fine with it, e.g.: `$q .= "WHERE s.UsersId = '" . $userID . "' ";`
Should definitely be fixed anyway...awesome.
Yeah. Some of these fixes in the above patchset are wrapped up in other things, but I can try to split out the "these can be fixed regardless" and "there were fixes to get postgres working".
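For that particular example the fix is basically just dropping the quotes and forcing an integer, something along these lines (the intval() cast is my suggestion here, not necessarily exactly what the patch does):

    // numeric comparison that both mysql and postgres are happy with
    $q .= "WHERE s.UsersId = " . intval($userID) . " ";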
* The code uses the MySQL-only limit/offset format `"LIMIT x,y"`. postgresql doesn't support this format, so I had to track each occurrence down and change it to `"LIMIT y OFFSET x"`.
This is SQL standard, so you could use this form anyway, correct? Of course the MySQL docs claim this: "For compatibility with PostgreSQL, MySQL also supports the LIMIT row_count OFFSET offset syntax."
Yeah. LIMIT w/OFFSET is sql standard. Not sure why mysql had to create their own limit/offset standard, but well.. yeah.
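For pagination-style queries the portable form ends up looking roughly like this ($per_page and $offset are just illustrative names, not variables from the AUR code):

    // mysql-only form would be: "LIMIT $offset,$per_page"
    // portable form accepted by both mysql and postgres:
    $q.= "LIMIT " . intval($per_page) . " OFFSET " . intval($offset) . " ";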
* A few instances of `"LIMIT 0,10"` where a plain `"LIMIT 10"` would have been sufficient. Not sure why an offset was specified in these few instances.
Sure it wasn't auto-generated? In that case it is easier for the DB query analyzer to just throw it away anyway.
It looks like there were just two instances.

web/html/rss.php:

    $q = "SELECT * FROM Packages ";
    $q.= "WHERE DummyPkg != 1 ";
    $q.= "ORDER BY SubmittedTS DESC ";
    $q.= "LIMIT 0 , 20";

web/lib/stats.inc:

    $q = 'SELECT * FROM Packages WHERE DummyPkg != 1 ORDER BY GREATEST(SubmittedTS,ModifiedTS) DESC LIMIT 0 , 10';
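So the fix for the rss.php one would just be the following (stats.inc is analogous):

    // an offset of 0 adds nothing, so just cap the row count
    $q.= "LIMIT 20";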
* A pair of column names were reserved words: 'user' and 'end' in the TU_VoteInfo table. I changed the column names to username and ending, to avoid issues and/or having to quote the names all the time.
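Without the rename, every query touching those columns would have needed double quoting to keep postgres happy, roughly like this (the query itself is only illustrative, not taken from the code):

    // "user" and "end" are reserved words in postgres, so unquoted references
    // either error out or (in the case of "user") silently mean something else
    $q = 'SELECT "user", "end" FROM TU_VoteInfo WHERE ID = ' . intval($voteid);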
* postgresql table and column names are lowercase (when not quoted). This required me to either quote all instances of table and column names (ugh) or change the fetch_assoc php references to lowercase. I chose the latter, even though it was tedious to track down all instances and change them.
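To make the second option concrete, here is roughly what the lookups turn into (assuming the pg_* functions; the column names are only an example):

    // with unquoted identifiers postgres folds names to lowercase, so a column
    // declared as "Name" comes back under the key 'name'
    $result = pg_query($dbh, "SELECT ID, Name FROM Packages LIMIT 1");
    $row = pg_fetch_assoc($result);
    echo $row['name'];    // $row['Name'] would be an undefined index under postgres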
* postgres has no mysql_insert_id counterpart. Instead I appended to the query "RETURNING ID" and then used pg_fetch_result to get the ID that was created.
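Roughly like this (assuming a pg_* connection handle; the table and column names are just an example):

    // postgres replacement for mysql_insert_id(): have the INSERT hand back the ID
    $result = pg_query($dbh, "INSERT INTO PackageComments (PackageID, UsersID, Comments) "
                           . "VALUES (12, 34, 'test comment') RETURNING ID");
    $new_id = pg_fetch_result($result, 0, 0);   // row 0, field 0 is the new ID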
* postgres LIKE and ILIKE queries did full table scans and were quite slow. I think mysql was caching these queries when I was doing my performance tests. The end result was postgres performing poorly in my load test for LIKE/ILIKE operations. So I created a postgres fulltext search column for the packages table, indexed it, defined triggers to keep it updated, and converted to using postgres fulltext searching (`"@@ to_tsquery('term')"`). The result was quite speedy and gave nice results. The syntax is a bit different for 'prefix searching' though, and the results were not the same as the LIKE/ILIKE queries, because fulltext uses stemming instead of partial string comparison.
And case sensitivity would come into play, no?
I built the fulltext index as follows:

    alter table packages add column tsv tsvector;
    update packages set tsv = to_tsvector('english',
        coalesce(lower(name), '') || ' ' || coalesce(lower(description), ''));

then added a trigger to update it for new data:

    DROP FUNCTION IF EXISTS package_fulltext_trigger() CASCADE;
    CREATE FUNCTION package_fulltext_trigger() RETURNS trigger AS $$
    begin
        new.tsv := setweight(to_tsvector('english', coalesce(lower(new.name), '')), 'A') ||
                   setweight(to_tsvector('english', coalesce(lower(new.description), '')), 'B');
        return new;
    end
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS packages_tsv_update ON packages;
    CREATE TRIGGER packages_tsv_update BEFORE INSERT OR UPDATE ON packages
        FOR EACH ROW EXECUTE PROCEDURE package_fulltext_trigger();
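For reference, a search against that column then looks roughly like this (the GIN index and the exact query text here are from memory, so treat them as illustrative rather than what the patch literally contains):

    // one-time index so the @@ match doesn't scan the whole table:
    //   CREATE INDEX packages_tsv_idx ON packages USING gin(tsv);

    // normal word search against the tsv column
    $q = "SELECT Name, Description FROM Packages ";
    $q.= "WHERE tsv @@ to_tsquery('english', 'firefox') ";

    // prefix search needs the ':*' suffix (the rough equivalent of LIKE 'fire%')
    $q2 = "SELECT Name, Description FROM Packages ";
    $q2.= "WHERE tsv @@ to_tsquery('english', 'fire:*') ";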
* There is quite a bit of sql code sprinkled all over the place. It might make sense to try and consolidate some of the sql to fewer files.
* It might make sense to generalize the 'mysql_real_escape_string' instances behind a wrapper, 'db_escape' or something similar. This would make things a bit cleaner and ease possible future porting to other databases (fewer things to track down and change individually).
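Something as small as this would probably do; db_escape and the USE_PGSQL switch are made-up names here, not anything that currently exists in the codebase:

    // hypothetical wrapper; callers use db_escape() instead of calling
    // mysql_real_escape_string() / pg_escape_string() directly
    function db_escape($str, $dbh) {
        if (defined('USE_PGSQL') && USE_PGSQL) {   // made-up config switch
            return pg_escape_string($dbh, $str);
        }
        return mysql_real_escape_string($str, $dbh);
    }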
* postgres doesn't support `"UNIX_TIMESTAMP()"`. Instead of changing all instances of it, I defined a few convenience sql functions.
    --
    -- helper functions to ease conversion from mysql
    -- from: http://janusz.slota.name/blog/2009/06/mysql-from_unixtime-and-unix_timestamp...
    --

    -- no params
    CREATE OR REPLACE FUNCTION unix_timestamp() RETURNS BIGINT AS '
        SELECT EXTRACT(EPOCH FROM CURRENT_TIMESTAMP(0))::bigint AS result;
    ' LANGUAGE 'SQL';

    -- timestamp without time zone (i.e. 1973-11-29 21:33:09)
    CREATE OR REPLACE FUNCTION unix_timestamp(TIMESTAMP) RETURNS BIGINT AS '
        SELECT EXTRACT(EPOCH FROM $1)::bigint AS result;
    ' LANGUAGE 'SQL';

    -- timestamp with time zone (i.e. 1973-11-29 21:33:09+01)
    CREATE OR REPLACE FUNCTION unix_timestamp(TIMESTAMP WITH TIME ZONE) RETURNS BIGINT AS '
        SELECT EXTRACT(EPOCH FROM $1)::bigint AS result;
    ' LANGUAGE 'SQL';
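With those in place, existing queries that call UNIX_TIMESTAMP() should keep working on postgres without PHP-side changes; for example something like this ($pkgid is just an illustrative variable):

    // unquoted function names are folded to lowercase, so UNIX_TIMESTAMP()
    // resolves to the unix_timestamp() helper defined above
    $q = "UPDATE Packages SET ModifiedTS = UNIX_TIMESTAMP() WHERE ID = " . intval($pkgid);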
I think that is it.