[aur-dev] some patches to gendummydata (take 2)
Updated patchset to take comments and criticisms into account.

- tabs to spaces is now its own patch, with appropriate commit message
- removal of DBUG variable
- removal of straggling 're' module import
- removal of 'working...' log messages (collapsed later patch into earlier)
- broke 'wrap sql' patch into its own

----

Some patches for gendummydata script.

- remove need for sql connection. this allows someone to run the script on a dev box with no sql connection (for example) and then ship the output sql wherever needed.
- remove need to have category names. only the actual numbers are needed, and if you are using dummy data, you are likely using the base schema. even if that is not the case, as long as _at least_ the base number of categories are present, the dummy data is still 'fine' (e.g. if more categories are added, then no dummy packages will use those categories until the counter in the script is incremented)
- use logging module
- remove 'progress' logging output. the script doesn't run slow enough to warrant the extra noise
- use spaces in the python script. spaces in python are _a damn good idea_
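As a rough sketch of the "only the actual numbers are needed" point, the category pick reduces to a random integer below a fixed count. The names CATEGORIES_COUNT and genCategory come from the patch; the value 17 is the one the patch uses for the base schema. Note that randrange(0, N) yields 0 through N-1, so if the schema's category IDs start at 1 the range would need shifting; that is an open assumption here, not something the patch states.

```python
import random

CATEGORIES_COUNT = 17  # number of categories in the base aur-schema (value from the patch)


def genCategory():
    # Pick a category by number only; no database lookup of names is needed.
    # randrange(0, N) returns an integer from 0 up to and including N - 1.
    return random.randrange(0, CATEGORIES_COUNT)


picks = [genCategory() for _ in range(1000)]
print(min(picks), max(picks))
```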
- remove need to use mysql for generating the sql
- just consider categories an integer range, specified to the size of
  that in the aur-schema.
- use the logging module instead of writing directly to stderr
  this makes the code cleaner as it removes the numerous tests for the
  value of DBUG, yet allows devs to control the level of output verbosity.
---
 support/schema/gendummydata.py |  106 +++++++++------------------------------
 1 files changed, 25 insertions(+), 81 deletions(-)

diff --git a/support/schema/gendummydata.py b/support/schema/gendummydata.py
index 7b1d0cf..8ed9f69 100755
--- a/support/schema/gendummydata.py
+++ b/support/schema/gendummydata.py
@@ -15,9 +15,9 @@ import os
 import sys
 import cStringIO
 import commands
+import logging

-
-DBUG = 1
+log_level = logging.DEBUG # logging level. set to logging.INFO to reduce output
 SEED_FILE = "/usr/share/dict/words"
 DB_HOST = os.getenv("DB_HOST", "localhost")
 DB_NAME = os.getenv("DB_NAME", "AUR")
@@ -33,6 +33,7 @@ PKG_FILES = (8, 30) # min/max number of files in a package
 PKG_DEPS = (1, 5) # min/max depends a package has
 PKG_SRC = (1, 3) # min/max sources a package has
 PKG_CMNTS = (1, 5) # min/max number of comments a package has
+CATEGORIES_COUNT = 17 # the number of categories from aur-schema
 VOTING = (0, .30) # percentage range for package voting
 RANDOM_PATHS = ( # random path locations for package files
 	"/usr/bin", "/usr/lib", "/etc", "/etc/rc.d", "/usr/share", "/lib",
@@ -45,44 +46,25 @@ RANDOM_URL = ("http://www.", "ftp://ftp.", "http://", "ftp://")
 RANDOM_LOCS = ("pub", "release", "files", "downloads", "src")
 FORTUNE_CMD = "/usr/bin/fortune -l"

+# setup logging
+logformat = "%(levelname)s: %(message)s"
+logging.basicConfig(format=logformat, level=log_level)
+log = logging.getLogger()

 if len(sys.argv) != 2:
-	sys.stderr.write("Missing output filename argument");
+	log.error("Missing output filename argument")
 	raise SystemExit

 # make sure the seed file exists
 #
 if not os.path.exists(SEED_FILE):
-	sys.stderr.write("Please install the 'words' Arch package\n");
-	raise SystemExit
-
-# Make sure database access will be available
-#
-try:
-	import MySQLdb
-except:
-	sys.stderr.write("Please install the 'mysql-python' Arch package\n");
-	raise SystemExit
-
-# try to connect to database
-#
-try:
-	db = MySQLdb.connect(host = DB_HOST, user = DB_USER,
-			db = DB_NAME, passwd = DB_PASS)
-	dbc = db.cursor()
-except:
-	sys.stderr.write("Could not connect to database\n");
+	log.error("Please install the 'words' Arch package")
 	raise SystemExit

-esc = db.escape_string
-
-
 # track what users/package names have been used
 #
 seen_users = {}
 seen_pkgs = {}
-categories = {}
-category_keys = []
 user_keys = []

 # some functions to generate random data
@@ -95,14 +77,14 @@ def genVersion():
 		ver.append("%d" % random.randrange(0,100))
 	return ".".join(ver) + "-u%d" % random.randrange(1,11)
 def genCategory():
-	return categories[category_keys[random.randrange(0,len(category_keys))]]
+	return random.randrange(0,CATEGORIES_COUNT)
 def genUID():
 	return seen_users[user_keys[random.randrange(0,len(user_keys))]]


 # load the words, and make sure there are enough words for users/pkgs
 #
-if DBUG: print "Grabbing words from seed file..."
+log.debug("Grabbing words from seed file...")
 fp = open(SEED_FILE, "r")
 contents = fp.readlines()
 fp.close()
@@ -117,7 +99,7 @@ else:

 # select random usernames
 #
-if DBUG: print "Generating random user names..."
+log.debug("Generating random user names...")
 user_id = USER_ID
 while len(seen_users) < MAX_USERS:
 	user = random.randrange(0, len(contents))
@@ -130,7 +112,7 @@ user_keys = seen_users.keys()

 # select random package names
 #
-if DBUG: print "Generating random package names..."
+log.debug("Generating random package names...")
 num_pkgs = PKG_ID
 while len(seen_pkgs) < MAX_PKGS:
 	pkg = random.randrange(0, len(contents))
@@ -149,22 +131,6 @@ while len(seen_pkgs) < MAX_PKGS:
 #
 contents = None

-# Load package categories from database
-#
-if DBUG: print "Loading package categories..."
-q = "SELECT * FROM PackageCategories"
-dbc.execute(q)
-row = dbc.fetchone()
-while row:
-	categories[row[1]] = row[0]
-	row = dbc.fetchone()
-category_keys = categories.keys()
-
-# done with the database
-#
-dbc.close()
-db.close()
-
 # developer/tu IDs
 #
 developers = []
@@ -179,8 +145,7 @@ out.write("BEGIN;\n")

 # Begin by creating the User statements
 #
-if DBUG: print "Creating SQL statements for users.",
-count = 0
+log.debug("Creating SQL statements for users.")
 for u in user_keys:
 	account_type = 1 # default to normal user
 	if not has_devs or not has_tus:
@@ -201,22 +166,18 @@ for u in user_keys:
 			# a normal user account
 			#
 			pass
-	
+
 	s = "INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd) VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n" % (seen_users[u], account_type, u, u, u)
 	out.write(s)
-	if count % 10 == 0:
-		if DBUG: print ".",
-	count += 1
-if DBUG: print "."
-if DBUG:
-	print "Number of developers:", len(developers)
-	print "Number of trusted users:", len(trustedusers)
-	print "Number of users:", (MAX_USERS-len(developers)-len(trustedusers))
-	print "Number of packages:", MAX_PKGS
+
+log.debug("Number of developers: %d" % len(developers))
+log.debug("Number of trusted users: %d" % len(trustedusers))
+log.debug("Number of users: %d" % (MAX_USERS-len(developers)-len(trustedusers)))
+log.debug("Number of packages: %d" % MAX_PKGS)

 # Create the package statements
 #
-if DBUG: print "Creating SQL statements for packages.",
+log.debug("Creating SQL statements for packages.")
 count = 0
 for p in seen_pkgs.keys():
 	NOW = int(time.time())
@@ -237,26 +198,21 @@ for p in seen_pkgs.keys():
 			genCategory(), NOW, uuid, muid)

 	out.write(s)
-	if count % 100 == 0:
-		if DBUG: print ".",
 	count += 1

 	# create random comments for this package
 	#
 	num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
 	for i in range(0, num_comments):
-		fortune = esc(commands.getoutput(FORTUNE_CMD).replace("'",""))
+		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
 		now = NOW + random.randrange(400, 86400*3)
 		s = "INSERT INTO PackageComments (PackageID, UsersID, Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n" % (seen_pkgs[p], genUID(), fortune, now)
 		out.write(s)

-if DBUG: print "."
-
 # Cast votes
 #
 track_votes = {}
-if DBUG: print "Casting votes for packages.",
-count = 0
+log.debug("Casting votes for packages.")
 for u in user_keys:
 	num_votes = random.randrange(int(len(seen_pkgs)*VOTING[0]),
 			int(len(seen_pkgs)*VOTING[1]))
@@ -270,9 +226,6 @@ for u in user_keys:
 				track_votes[pkg] = 0
 			track_votes[pkg] += 1
 			out.write(s)
-		if count % 100 == 0:
-			if DBUG: print ".",
-		count += 1

 # Update statements for package votes
 #
@@ -282,8 +235,7 @@ for p in track_votes.keys():

 # Create package dependencies and sources
 #
-if DBUG: print "."; print "Creating statements for package depends/sources.",
-count = 0
+log.debug("Creating statements for package depends/sources.")
 for p in seen_pkgs.keys():
 	num_deps = random.randrange(PKG_DEPS[0], PKG_DEPS[1])
 	this_deps = {}
@@ -307,17 +259,9 @@ for p in seen_pkgs.keys():
 				seen_pkgs[p], src)
 		out.write(s)

-	if count % 100 == 0:
-		if DBUG: print ".",
-	count += 1
-
-
 # close output file
 #
 out.write("COMMIT;\n")
 out.write("\n")
 out.close()
-
-if DBUG: print "."
-if DBUG: print "Done."
-
+log.debug("Done.")
-- 
1.7.4.1
On Tue, Apr 05, 2011 at 11:57:46PM -0700, elij wrote:
- remove need to use mysql for generating the sql
- just consider categories an integer range, specified to the size of
  that in the aur-schema.
- use the logging module instead of writing directly to stderr
  this makes the code cleaner as it removes the numerous tests for the
  value of DBUG, yet allows devs to control the level of output verbosity.
---
 support/schema/gendummydata.py |  106 +++++++++------------------------------
 1 files changed, 25 insertions(+), 81 deletions(-)
I agree with both changes, but please split that one into two separate patches.
diff --git a/support/schema/gendummydata.py b/support/schema/gendummydata.py index 7b1d0cf..8ed9f69 100755 --- a/support/schema/gendummydata.py +++ b/support/schema/gendummydata.py @@ -15,9 +15,9 @@ import os import sys import cStringIO import commands +import logging
-
-DBUG = 1
+log_level = logging.DEBUG # logging level. set to logging.INFO to reduce output
I'm not a Python coder, but is there any reason to use lowercase here whereas we use uppercase for all other constants?
SEED_FILE = "/usr/share/dict/words" DB_HOST = os.getenv("DB_HOST", "localhost") DB_NAME = os.getenv("DB_NAME", "AUR") @@ -33,6 +33,7 @@ PKG_FILES = (8, 30) # min/max number of files in a package PKG_DEPS = (1, 5) # min/max depends a package has PKG_SRC = (1, 3) # min/max sources a package has PKG_CMNTS = (1, 5) # min/max number of comments a package has +CATEGORIES_COUNT = 17 # the number of categories from aur-schema VOTING = (0, .30) # percentage range for package voting RANDOM_PATHS = ( # random path locations for package files "/usr/bin", "/usr/lib", "/etc", "/etc/rc.d", "/usr/share", "/lib", @@ -45,44 +46,25 @@ RANDOM_URL = ("http://www.", "ftp://ftp.", "http://", "ftp://") RANDOM_LOCS = ("pub", "release", "files", "downloads", "src") FORTUNE_CMD = "/usr/bin/fortune -l"
+# setup logging +logformat = "%(levelname)s: %(message)s" +logging.basicConfig(format=logformat, level=log_level) +log = logging.getLogger()
if len(sys.argv) != 2: - sys.stderr.write("Missing output filename argument"); + log.error("Missing output filename argument") raise SystemExit
 # make sure the seed file exists
 #
 if not os.path.exists(SEED_FILE):
-	sys.stderr.write("Please install the 'words' Arch package\n");
-	raise SystemExit
-
-# Make sure database access will be available
-#
-try:
-	import MySQLdb
-except:
-	sys.stderr.write("Please install the 'mysql-python' Arch package\n");
-	raise SystemExit
-
-# try to connect to database
-#
-try:
-	db = MySQLdb.connect(host = DB_HOST, user = DB_USER,
-			db = DB_NAME, passwd = DB_PASS)
-	dbc = db.cursor()
-except:
-	sys.stderr.write("Could not connect to database\n");
+	log.error("Please install the 'words' Arch package")
 	raise SystemExit
Shouldn't we rather use "sys.exit(1);" here instead of raising a SystemExit exception? That way we'd have a proper exit status, also. Might be something to include in the debugging/error handling patch.
-esc = db.escape_string - - # track what users/package names have been used # seen_users = {} seen_pkgs = {} -categories = {} -category_keys = [] user_keys = []
# some functions to generate random data @@ -95,14 +77,14 @@ def genVersion(): ver.append("%d" % random.randrange(0,100)) return ".".join(ver) + "-u%d" % random.randrange(1,11) def genCategory(): - return categories[category_keys[random.randrange(0,len(category_keys))]] + return random.randrange(0,CATEGORIES_COUNT) def genUID(): return seen_users[user_keys[random.randrange(0,len(user_keys))]]
# load the words, and make sure there are enough words for users/pkgs # -if DBUG: print "Grabbing words from seed file..." +log.debug("Grabbing words from seed file...") fp = open(SEED_FILE, "r") contents = fp.readlines() fp.close() @@ -117,7 +99,7 @@ else:
# select random usernames # -if DBUG: print "Generating random user names..." +log.debug("Generating random user names...") user_id = USER_ID while len(seen_users) < MAX_USERS: user = random.randrange(0, len(contents)) @@ -130,7 +112,7 @@ user_keys = seen_users.keys()
# select random package names # -if DBUG: print "Generating random package names..." +log.debug("Generating random package names...") num_pkgs = PKG_ID while len(seen_pkgs) < MAX_PKGS: pkg = random.randrange(0, len(contents)) @@ -149,22 +131,6 @@ while len(seen_pkgs) < MAX_PKGS: # contents = None
-# Load package categories from database -# -if DBUG: print "Loading package categories..." -q = "SELECT * FROM PackageCategories" -dbc.execute(q) -row = dbc.fetchone() -while row: - categories[row[1]] = row[0] - row = dbc.fetchone() -category_keys = categories.keys() - -# done with the database -# -dbc.close() -db.close() - # developer/tu IDs # developers = [] @@ -179,8 +145,7 @@ out.write("BEGIN;\n")
# Begin by creating the User statements # -if DBUG: print "Creating SQL statements for users.", -count = 0 +log.debug("Creating SQL statements for users.") for u in user_keys: account_type = 1 # default to normal user if not has_devs or not has_tus: @@ -201,22 +166,18 @@ for u in user_keys: # a normal user account # pass - + s = "INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd) VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n" % (seen_users[u], account_type, u, u, u) out.write(s) - if count % 10 == 0: - if DBUG: print ".", - count += 1 -if DBUG: print "." -if DBUG: - print "Number of developers:", len(developers) - print "Number of trusted users:", len(trustedusers) - print "Number of users:", (MAX_USERS-len(developers)-len(trustedusers)) - print "Number of packages:", MAX_PKGS + +log.debug("Number of developers: %d" % len(developers)) +log.debug("Number of trusted users: %d" % len(trustedusers)) +log.debug("Number of users: %d" % (MAX_USERS-len(developers)-len(trustedusers))) +log.debug("Number of packages: %d" % MAX_PKGS)
# Create the package statements # -if DBUG: print "Creating SQL statements for packages.", +log.debug("Creating SQL statements for packages.") count = 0 for p in seen_pkgs.keys(): NOW = int(time.time()) @@ -237,26 +198,21 @@ for p in seen_pkgs.keys(): genCategory(), NOW, uuid, muid)
out.write(s) - if count % 100 == 0: - if DBUG: print ".", count += 1
 	# create random comments for this package
 	#
 	num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
 	for i in range(0, num_comments):
-		fortune = esc(commands.getoutput(FORTUNE_CMD).replace("'",""))
+		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
Why did you drop escape_string() here?
now = NOW + random.randrange(400, 86400*3) s = "INSERT INTO PackageComments (PackageID, UsersID, Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n" % (seen_pkgs[p], genUID(), fortune, now) out.write(s)
-if DBUG: print "." - # Cast votes # track_votes = {} -if DBUG: print "Casting votes for packages.", -count = 0 +log.debug("Casting votes for packages.") for u in user_keys: num_votes = random.randrange(int(len(seen_pkgs)*VOTING[0]), int(len(seen_pkgs)*VOTING[1])) @@ -270,9 +226,6 @@ for u in user_keys: track_votes[pkg] = 0 track_votes[pkg] += 1 out.write(s) - if count % 100 == 0: - if DBUG: print ".", - count += 1
# Update statements for package votes # @@ -282,8 +235,7 @@ for p in track_votes.keys():
# Create package dependencies and sources # -if DBUG: print "."; print "Creating statements for package depends/sources.", -count = 0 +log.debug("Creating statements for package depends/sources.") for p in seen_pkgs.keys(): num_deps = random.randrange(PKG_DEPS[0], PKG_DEPS[1]) this_deps = {} @@ -307,17 +259,9 @@ for p in seen_pkgs.keys(): seen_pkgs[p], src) out.write(s)
- if count % 100 == 0: - if DBUG: print ".", - count += 1 - - # close output file # out.write("COMMIT;\n") out.write("\n") out.close() - -if DBUG: print "." -if DBUG: print "Done." - +log.debug("Done.") -- 1.7.4.1
On Wed, Apr 6, 2011 at 12:04 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
I agree with both changes, but please split that one into two separate patches.
ugh. yeah I can probably do that.
-DBUG = 1
+log_level = logging.DEBUG # logging level. set to logging.INFO to reduce output
I'm not a Python coder, but is there any reason to use lowercase here whereas we use uppercase for all other constants?
No reason really. I am just not used to uppercasing variables. When I refactor to split the above changes I will try to remember to make that variable format similar to the others.
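For reference, a minimal sketch of what the renamed setting might look like once it follows the uppercase convention of the other constants. LOG_LEVEL is only a guess at the eventual name; the format string and the log messages are taken from the patch.

```python
import logging

# Uppercase to match SEED_FILE, DB_HOST and the other module-level constants.
# LOG_LEVEL is a hypothetical name; set it to logging.INFO to reduce output.
LOG_LEVEL = logging.DEBUG

logging.basicConfig(format="%(levelname)s: %(message)s", level=LOG_LEVEL)
log = logging.getLogger()

log.debug("Grabbing words from seed file...")  # suppressed when LOG_LEVEL is logging.INFO
log.error("Missing output filename argument")  # emitted at either level
```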
raise SystemExit
Shouldn't we rather use "sys.exit(1);" here instead of raising a SystemExit exception? That way we'd have a proper exit status, also. Might be something to include in the debugging/error handling patch.
Possibly. sys.exit actually raises SystemExit, if I remember correctly. Setting a shell exit value is a good idea though. I will add that in.
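A small sketch of the difference, under the assumption that the goal is a non-zero status on the error paths. The helper exit_code_of is made up for illustration; it catches the exception only so the code can be inspected without ending the process.

```python
import sys


def exit_code_of(func):
    """Call func and return the code carried by the SystemExit it raises."""
    try:
        func()
    except SystemExit as exc:
        return exc.code
    return None


def bare_raise():
    # What the script does today: no code attached to the exception.
    raise SystemExit


def explicit_exit():
    # sys.exit() raises SystemExit too, but with the given status attached.
    sys.exit(1)


print(exit_code_of(bare_raise))     # None, which the interpreter reports as exit status 0
print(exit_code_of(explicit_exit))  # 1
```

So both spellings raise the same exception; the practical difference is that a bare "raise SystemExit" leaves the shell seeing success even on the error paths.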
 	num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
 	for i in range(0, num_comments):
-		fortune = esc(commands.getoutput(FORTUNE_CMD).replace("'",""))
+		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
Why did you drop escape_string() here?
It relies upon mysql, and since the other instance of mysql usage was removed by one of my patches, I removed this as well (to remove the dep entirely). For dummy data there really isn't a danger of sql injection, and removing ' characters from the fortune_cmd result string should be enough to keep from causing the written sql to be badly formatted.
On Wed, Apr 06, 2011 at 12:35:32PM -0700, elij wrote:
On Wed, Apr 6, 2011 at 12:04 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
 	num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
 	for i in range(0, num_comments):
-		fortune = esc(commands.getoutput(FORTUNE_CMD).replace("'",""))
+		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
Why did you drop escape_string() here?
It relies upon mysql, and since the other instance of mysql usage was removed by one of my patches, I removed this as well (to remove the dep entirely). For dummy data there really isn't a danger of sql injection, and removing ' characters from the fortune_cmd result string should be enough to keep from causing the written sql to be badly formatted.
The problem is not someone actually trying to exploit this but fortunes containing single quotes which will lead to broken MySQL queries. There are two things we can do here:

* Keep the mysql-python dependency just for escape_string().
* Implement escape_string() in Python and use it instead (should be no more than 10 lines).
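A rough sketch of the second option, to show the size of it. This is not MySQLdb's implementation; it is a hypothetical stand-in that backslash-escapes the characters MySQL documents as special inside quoted strings, and it assumes the server's default SQL mode (backslash escapes enabled) and plain byte-oriented text such as fortune output.

```python
def escape_string(value):
    """Hypothetical pure-Python stand-in for MySQLdb's escape_string()."""
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ("\0", "\\0"),   # NUL byte
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\x1a", "\\Z"), # Ctrl-Z
        ("'", "\\'"),
        ('"', '\\"'),
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


fortune = "It's a \"quoted\" line with a backslash \\ in it\n"
print("INSERT INTO PackageComments (Comments) VALUES ('%s');" % escape_string(fortune))
```

With something like this in place the .replace("'","") call would no longer be needed, and a fortune containing a backslash could not swallow the closing quote of the statement.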
---
 support/schema/gendummydata.py |   34 +++++++++++++++++++++++-----------
 1 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/support/schema/gendummydata.py b/support/schema/gendummydata.py
index 8ed9f69..78dc5c9 100755
--- a/support/schema/gendummydata.py
+++ b/support/schema/gendummydata.py
@@ -167,7 +167,9 @@ for u in user_keys:
 			#
 			pass

-	s = "INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd) VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n" % (seen_users[u], account_type, u, u, u)
+	s = ("INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd)"
+		" VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n")
+	s = s % (seen_users[u], account_type, u, u, u)
 	out.write(s)

 log.debug("Number of developers: %d" % len(developers))
@@ -191,11 +193,15 @@ for p in seen_pkgs.keys():
 	uuid = genUID() # the submitter/user

 	if muid == 0:
-		s = "INSERT INTO Packages (ID, Name, Version, CategoryID, SubmittedTS, SubmitterUID, MaintainerUID) VALUES (%d, '%s', '%s', %d, %d, %d, NULL);\n" % (seen_pkgs[p], p, genVersion(),
-			genCategory(), NOW, uuid)
+		s = ("INSERT INTO Packages (ID, Name, Version, CategoryID,"
+			" SubmittedTS, SubmitterUID, MaintainerUID) VALUES"
+			" (%d, '%s', '%s', %d, %d, %d, NULL);\n")
+		s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid)
 	else:
-		s = "INSERT INTO Packages (ID, Name, Version, CategoryID, SubmittedTS, SubmitterUID, MaintainerUID) VALUES (%d, '%s', '%s', %d, %d, %d, %d);\n" % (seen_pkgs[p], p, genVersion(),
-			genCategory(), NOW, uuid, muid)
+		s = ("INSERT INTO Packages (ID, Name, Version, CategoryID,"
+			" SubmittedTS, SubmitterUID, MaintainerUID) VALUES "
+			" (%d, '%s', '%s', %d, %d, %d, %d);\n")
+		s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid, muid)

 	out.write(s)
 	count += 1
@@ -206,7 +212,9 @@ for p in seen_pkgs.keys():
 	for i in range(0, num_comments):
 		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
 		now = NOW + random.randrange(400, 86400*3)
-		s = "INSERT INTO PackageComments (PackageID, UsersID, Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n" % (seen_pkgs[p], genUID(), fortune, now)
+		s = ("INSERT INTO PackageComments (PackageID, UsersID,"
+			" Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n")
+		s = s % (seen_pkgs[p], genUID(), fortune, now)
 		out.write(s)

 # Cast votes
@@ -220,7 +228,9 @@ for u in user_keys:
 	for v in range(num_votes):
 		pkg = random.randrange(1, len(seen_pkgs) + 1)
 		if not pkgvote.has_key(pkg):
-			s = "INSERT INTO PackageVotes (UsersID, PackageID) VALUES (%d, %d);\n" % (seen_users[u], pkg)
+			s = ("INSERT INTO PackageVotes (UsersID, PackageID)"
+				" VALUES (%d, %d);\n")
+			s = s % (seen_users[u], pkg)
 			pkgvote[pkg] = 1
 			if not track_votes.has_key(pkg):
 				track_votes[pkg] = 0
@@ -230,7 +240,8 @@ for u in user_keys:
 # Update statements for package votes
 #
 for p in track_votes.keys():
-	s = "UPDATE Packages SET NumVotes = %d WHERE ID = %d;\n" % (track_votes[p], p)
+	s = "UPDATE Packages SET NumVotes = %d WHERE ID = %d;\n"
+	s = s % (track_votes[p], p)
 	out.write(s)

 # Create package dependencies and sources
@@ -243,7 +254,8 @@ for p in seen_pkgs.keys():
 	while i != num_deps:
 		dep = random.randrange(1, len(seen_pkgs) + 1)
 		if not this_deps.has_key(dep):
-			s = "INSERT INTO PackageDepends VALUES (%d, %d, NULL);\n" % (seen_pkgs[p], dep)
+			s = "INSERT INTO PackageDepends VALUES (%d, %d, NULL);\n"
+			s = s % (seen_pkgs[p], dep)
 			out.write(s)
 			i += 1

@@ -255,8 +267,8 @@ for p in seen_pkgs.keys():
 			p, RANDOM_TLDS[random.randrange(0,len(RANDOM_TLDS))],
 			RANDOM_LOCS[random.randrange(0,len(RANDOM_LOCS))],
 			src_file, genVersion())
-		s = "INSERT INTO PackageSources VALUES (%d, '%s');\n" % (
-				seen_pkgs[p], src)
+		s = "INSERT INTO PackageSources VALUES (%d, '%s');\n"
+		s = s % (seen_pkgs[p], src)
 		out.write(s)

 # close output file
-- 
1.7.4.1
reformat with recommendation from pep8 for using spaces --- support/schema/gendummydata.py | 256 ++++++++++++++++++++-------------------- 1 files changed, 128 insertions(+), 128 deletions(-) diff --git a/support/schema/gendummydata.py b/support/schema/gendummydata.py index 78dc5c9..3354722 100755 --- a/support/schema/gendummydata.py +++ b/support/schema/gendummydata.py @@ -36,10 +36,10 @@ PKG_CMNTS = (1, 5) # min/max number of comments a package has CATEGORIES_COUNT = 17 # the number of categories from aur-schema VOTING = (0, .30) # percentage range for package voting RANDOM_PATHS = ( # random path locations for package files - "/usr/bin", "/usr/lib", "/etc", "/etc/rc.d", "/usr/share", "/lib", - "/var/spool", "/var/log", "/usr/sbin", "/opt", "/usr/X11R6/bin", - "/usr/X11R6/lib", "/usr/libexec", "/usr/man/man1", "/usr/man/man3", - "/usr/man/man5", "/usr/X11R6/man/man1", "/etc/profile.d" + "/usr/bin", "/usr/lib", "/etc", "/etc/rc.d", "/usr/share", "/lib", + "/var/spool", "/var/log", "/usr/sbin", "/opt", "/usr/X11R6/bin", + "/usr/X11R6/lib", "/usr/libexec", "/usr/man/man1", "/usr/man/man3", + "/usr/man/man5", "/usr/X11R6/man/man1", "/etc/profile.d" ) RANDOM_TLDS = ("edu", "com", "org", "net", "tw", "ru", "pl", "de", "es") RANDOM_URL = ("http://www.", "ftp://ftp.", "http://", "ftp://") @@ -52,14 +52,14 @@ logging.basicConfig(format=logformat, level=log_level) log = logging.getLogger() if len(sys.argv) != 2: - log.error("Missing output filename argument") - raise SystemExit + log.error("Missing output filename argument") + raise SystemExit # make sure the seed file exists # if not os.path.exists(SEED_FILE): - log.error("Please install the 'words' Arch package") - raise SystemExit + log.error("Please install the 'words' Arch package") + raise SystemExit # track what users/package names have been used # @@ -70,16 +70,16 @@ user_keys = [] # some functions to generate random data # def genVersion(): - ver = [] - ver.append("%d" % random.randrange(0,10)) - ver.append("%d" % 
random.randrange(0,20)) - if random.randrange(0,2) == 0: - ver.append("%d" % random.randrange(0,100)) - return ".".join(ver) + "-u%d" % random.randrange(1,11) + ver = [] + ver.append("%d" % random.randrange(0,10)) + ver.append("%d" % random.randrange(0,20)) + if random.randrange(0,2) == 0: + ver.append("%d" % random.randrange(0,100)) + return ".".join(ver) + "-u%d" % random.randrange(1,11) def genCategory(): - return random.randrange(0,CATEGORIES_COUNT) + return random.randrange(0,CATEGORIES_COUNT) def genUID(): - return seen_users[user_keys[random.randrange(0,len(user_keys))]] + return seen_users[user_keys[random.randrange(0,len(user_keys))]] # load the words, and make sure there are enough words for users/pkgs @@ -89,25 +89,25 @@ fp = open(SEED_FILE, "r") contents = fp.readlines() fp.close() if MAX_USERS > len(contents): - MAX_USERS = len(contents) + MAX_USERS = len(contents) if MAX_PKGS > len(contents): - MAX_PKGS = len(contents) + MAX_PKGS = len(contents) if len(contents) - MAX_USERS > MAX_PKGS: - need_dupes = 0 + need_dupes = 0 else: - need_dupes = 1 + need_dupes = 1 # select random usernames # log.debug("Generating random user names...") user_id = USER_ID while len(seen_users) < MAX_USERS: - user = random.randrange(0, len(contents)) - word = contents[user].replace("'", "").replace(".","").replace(" ", "_") - word = word.strip().lower() - if not seen_users.has_key(word): - seen_users[word] = user_id - user_id += 1 + user = random.randrange(0, len(contents)) + word = contents[user].replace("'", "").replace(".","").replace(" ", "_") + word = word.strip().lower() + if not seen_users.has_key(word): + seen_users[word] = user_id + user_id += 1 user_keys = seen_users.keys() # select random package names @@ -115,17 +115,17 @@ user_keys = seen_users.keys() log.debug("Generating random package names...") num_pkgs = PKG_ID while len(seen_pkgs) < MAX_PKGS: - pkg = random.randrange(0, len(contents)) - word = contents[pkg].replace("'", "").replace(".","").replace(" ", "_") 
- word = word.strip().lower() - if not need_dupes: - if not seen_pkgs.has_key(word) and not seen_users.has_key(word): - seen_pkgs[word] = num_pkgs - num_pkgs += 1 - else: - if not seen_pkgs.has_key(word): - seen_pkgs[word] = num_pkgs - num_pkgs += 1 + pkg = random.randrange(0, len(contents)) + word = contents[pkg].replace("'", "").replace(".","").replace(" ", "_") + word = word.strip().lower() + if not need_dupes: + if not seen_pkgs.has_key(word) and not seen_users.has_key(word): + seen_pkgs[word] = num_pkgs + num_pkgs += 1 + else: + if not seen_pkgs.has_key(word): + seen_pkgs[word] = num_pkgs + num_pkgs += 1 # free up contents memory # @@ -147,30 +147,30 @@ out.write("BEGIN;\n") # log.debug("Creating SQL statements for users.") for u in user_keys: - account_type = 1 # default to normal user - if not has_devs or not has_tus: - account_type = random.randrange(1, 4) - if account_type == 3 and not has_devs: - # this will be a dev account - # - developers.append(seen_users[u]) - if len(developers) >= MAX_DEVS * MAX_USERS: - has_devs = 1 - elif account_type == 2 and not has_tus: - # this will be a trusted user account - # - trustedusers.append(seen_users[u]) - if len(trustedusers) >= MAX_TUS * MAX_USERS: - has_tus = 1 - else: - # a normal user account - # - pass + account_type = 1 # default to normal user + if not has_devs or not has_tus: + account_type = random.randrange(1, 4) + if account_type == 3 and not has_devs: + # this will be a dev account + # + developers.append(seen_users[u]) + if len(developers) >= MAX_DEVS * MAX_USERS: + has_devs = 1 + elif account_type == 2 and not has_tus: + # this will be a trusted user account + # + trustedusers.append(seen_users[u]) + if len(trustedusers) >= MAX_TUS * MAX_USERS: + has_tus = 1 + else: + # a normal user account + # + pass - s = ("INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd)" - " VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n") - s = s % (seen_users[u], account_type, u, u, u) - out.write(s) + s = 
("INSERT INTO Users (ID, AccountTypeID, Username, Email, Passwd)" + " VALUES (%d, %d, '%s', '%s@example.com', MD5('%s'));\n") + s = s % (seen_users[u], account_type, u, u, u) + out.write(s) log.debug("Number of developers: %d" % len(developers)) log.debug("Number of trusted users: %d" % len(trustedusers)) @@ -182,94 +182,94 @@ log.debug("Number of packages: %d" % MAX_PKGS) log.debug("Creating SQL statements for packages.") count = 0 for p in seen_pkgs.keys(): - NOW = int(time.time()) - if count % 2 == 0: - muid = developers[random.randrange(0,len(developers))] - else: - muid = trustedusers[random.randrange(0,len(trustedusers))] - if count % 20 == 0: # every so often, there are orphans... - muid = 0 + NOW = int(time.time()) + if count % 2 == 0: + muid = developers[random.randrange(0,len(developers))] + else: + muid = trustedusers[random.randrange(0,len(trustedusers))] + if count % 20 == 0: # every so often, there are orphans... + muid = 0 - uuid = genUID() # the submitter/user + uuid = genUID() # the submitter/user - if muid == 0: - s = ("INSERT INTO Packages (ID, Name, Version, CategoryID," - " SubmittedTS, SubmitterUID, MaintainerUID) VALUES" - " (%d, '%s', '%s', %d, %d, %d, NULL);\n") - s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid) - else: - s = ("INSERT INTO Packages (ID, Name, Version, CategoryID," - " SubmittedTS, SubmitterUID, MaintainerUID) VALUES " - " (%d, '%s', '%s', %d, %d, %d, %d);\n") - s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid, muid) + if muid == 0: + s = ("INSERT INTO Packages (ID, Name, Version, CategoryID," + " SubmittedTS, SubmitterUID, MaintainerUID) VALUES" + " (%d, '%s', '%s', %d, %d, %d, NULL);\n") + s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid) + else: + s = ("INSERT INTO Packages (ID, Name, Version, CategoryID," + " SubmittedTS, SubmitterUID, MaintainerUID) VALUES " + " (%d, '%s', '%s', %d, %d, %d, %d);\n") + s = s % (seen_pkgs[p], p, genVersion(), genCategory(), NOW, uuid, muid) 
-	out.write(s)
-	count += 1
+    out.write(s)
+    count += 1
 
-	# create random comments for this package
-	#
-	num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
-	for i in range(0, num_comments):
-		fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
-		now = NOW + random.randrange(400, 86400*3)
-		s = ("INSERT INTO PackageComments (PackageID, UsersID,"
-		     " Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n")
-		s = s % (seen_pkgs[p], genUID(), fortune, now)
-		out.write(s)
+    # create random comments for this package
+    #
+    num_comments = random.randrange(PKG_CMNTS[0], PKG_CMNTS[1])
+    for i in range(0, num_comments):
+        fortune = commands.getoutput(FORTUNE_CMD).replace("'","")
+        now = NOW + random.randrange(400, 86400*3)
+        s = ("INSERT INTO PackageComments (PackageID, UsersID,"
+             " Comments, CommentTS) VALUES (%d, %d, '%s', %d);\n")
+        s = s % (seen_pkgs[p], genUID(), fortune, now)
+        out.write(s)
 
 # Cast votes
 #
 track_votes = {}
 log.debug("Casting votes for packages.")
 for u in user_keys:
-	num_votes = random.randrange(int(len(seen_pkgs)*VOTING[0]),
-			int(len(seen_pkgs)*VOTING[1]))
-	pkgvote = {}
-	for v in range(num_votes):
-		pkg = random.randrange(1, len(seen_pkgs) + 1)
-		if not pkgvote.has_key(pkg):
-			s = ("INSERT INTO PackageVotes (UsersID, PackageID)"
-			     " VALUES (%d, %d);\n")
-			s = s % (seen_users[u], pkg)
-			pkgvote[pkg] = 1
-			if not track_votes.has_key(pkg):
-				track_votes[pkg] = 0
-			track_votes[pkg] += 1
-			out.write(s)
+    num_votes = random.randrange(int(len(seen_pkgs)*VOTING[0]),
+            int(len(seen_pkgs)*VOTING[1]))
+    pkgvote = {}
+    for v in range(num_votes):
+        pkg = random.randrange(1, len(seen_pkgs) + 1)
+        if not pkgvote.has_key(pkg):
+            s = ("INSERT INTO PackageVotes (UsersID, PackageID)"
+                 " VALUES (%d, %d);\n")
+            s = s % (seen_users[u], pkg)
+            pkgvote[pkg] = 1
+            if not track_votes.has_key(pkg):
+                track_votes[pkg] = 0
+            track_votes[pkg] += 1
+            out.write(s)
 
 # Update statements for package votes
 #
 for p in track_votes.keys():
-	s = "UPDATE Packages SET NumVotes = %d WHERE ID = %d;\n"
-	s = s % (track_votes[p], p)
-	out.write(s)
+    s = "UPDATE Packages SET NumVotes = %d WHERE ID = %d;\n"
+    s = s % (track_votes[p], p)
+    out.write(s)
 
 # Create package dependencies and sources
 #
 log.debug("Creating statements for package depends/sources.")
 for p in seen_pkgs.keys():
-	num_deps = random.randrange(PKG_DEPS[0], PKG_DEPS[1])
-	this_deps = {}
-	i = 0
-	while i != num_deps:
-		dep = random.randrange(1, len(seen_pkgs) + 1)
-		if not this_deps.has_key(dep):
-			s = "INSERT INTO PackageDepends VALUES (%d, %d, NULL);\n"
-			s = s % (seen_pkgs[p], dep)
-			out.write(s)
-			i += 1
+    num_deps = random.randrange(PKG_DEPS[0], PKG_DEPS[1])
+    this_deps = {}
+    i = 0
+    while i != num_deps:
+        dep = random.randrange(1, len(seen_pkgs) + 1)
+        if not this_deps.has_key(dep):
+            s = "INSERT INTO PackageDepends VALUES (%d, %d, NULL);\n"
+            s = s % (seen_pkgs[p], dep)
+            out.write(s)
+            i += 1
 
-	num_sources = random.randrange(PKG_SRC[0], PKG_SRC[1])
-	for i in range(num_sources):
-		src_file = user_keys[random.randrange(0, len(user_keys))]
-		src = "%s%s.%s/%s/%s-%s.tar.gz" % (
-			RANDOM_URL[random.randrange(0,len(RANDOM_URL))],
-			p, RANDOM_TLDS[random.randrange(0,len(RANDOM_TLDS))],
-			RANDOM_LOCS[random.randrange(0,len(RANDOM_LOCS))],
-			src_file, genVersion())
-		s = "INSERT INTO PackageSources VALUES (%d, '%s');\n"
-		s = s % (seen_pkgs[p], src)
-		out.write(s)
+    num_sources = random.randrange(PKG_SRC[0], PKG_SRC[1])
+    for i in range(num_sources):
+        src_file = user_keys[random.randrange(0, len(user_keys))]
+        src = "%s%s.%s/%s/%s-%s.tar.gz" % (
+            RANDOM_URL[random.randrange(0,len(RANDOM_URL))],
+            p, RANDOM_TLDS[random.randrange(0,len(RANDOM_TLDS))],
+            RANDOM_LOCS[random.randrange(0,len(RANDOM_LOCS))],
+            src_file, genVersion())
+        s = "INSERT INTO PackageSources VALUES (%d, '%s');\n"
+        s = s % (seen_pkgs[p], src)
+        out.write(s)
 
 # close output file
 #
-- 
1.7.4.1

On Tue, Apr 05, 2011 at 11:57:48PM -0700, elij wrote:
> reformat with recommendation from pep8 for using spaces
> ---
>  support/schema/gendummydata.py |  256 ++++++++++++++++++++--------------------
>  1 files changed, 128 insertions(+), 128 deletions(-)

Won't push that one, as long as we don't agree on an amendment of our coding standards. Refer to "HACKING".
On Wed, Apr 6, 2011 at 12:06 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
> On Tue, Apr 05, 2011 at 11:57:48PM -0700, elij wrote:
> > reformat with recommendation from pep8 for using spaces
> > ---
> >  support/schema/gendummydata.py |  256 ++++++++++++++++++++--------------------
> >  1 files changed, 128 insertions(+), 128 deletions(-)
>
> Won't push that one, as long as we don't agree on an amendment of our coding standards. Refer to "HACKING".

Ah. That is too bad. I consider pep8 coding convention to be a 'good smell' when contributing to python codebases.

Based on patch feedback so far, it seems like our standards and conventions are too dissimilar to be beneficial.

I will fix up the previous patch-set and resend it, then I will stop wasting my time and yours.

On Wed, Apr 06, 2011 at 12:59:10PM -0700, elij wrote:
> On Wed, Apr 6, 2011 at 12:06 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
> > On Tue, Apr 05, 2011 at 11:57:48PM -0700, elij wrote:
> > > reformat with recommendation from pep8 for using spaces
> > > ---
> > >  support/schema/gendummydata.py |  256 ++++++++++++++++++++--------------------
> > >  1 files changed, 128 insertions(+), 128 deletions(-)
> >
> > Won't push that one, as long as we don't agree on an amendment of our coding standards. Refer to "HACKING".
>
> Ah. That is too bad. I consider pep8 coding convention to be a 'good smell' when contributing to python codebases.
>
> Based on patch feedback so far, it seems like our standards and conventions are too dissimilar to be beneficial.

I didn't say this is a no-go, but we need to discuss this one and fix our coding guidelines if we agree on pushing this changeset. There already is some inconsistency with aurblup using spaces for indentation, so this certainly is an area we need to work on.

> I will fix up the previous patch-set and resend it, then I will stop wasting my time and yours.

Dude, I hope you don't take our feedback personally. I'm just trying to keep inconsistent and inconvenient stuff out of the code base as it already is way too patchy. Your patches are highly appreciated and I'll definitely push some of them if you clean them up.

On Wed, Apr 6, 2011 at 1:16 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
> On Wed, Apr 06, 2011 at 12:59:10PM -0700, elij wrote:
> > On Wed, Apr 6, 2011 at 12:06 PM, Lukas Fleischer <archlinux@cryptocrack.de> wrote:
> > > On Tue, Apr 05, 2011 at 11:57:48PM -0700, elij wrote:
> > > > reformat with recommendation from pep8 for using spaces
> > > > ---
> > > >  support/schema/gendummydata.py |  256 ++++++++++++++++++++--------------------
> > > >  1 files changed, 128 insertions(+), 128 deletions(-)
> > >
> > > Won't push that one, as long as we don't agree on an amendment of our coding standards. Refer to "HACKING".
> >
> > Ah. That is too bad. I consider pep8 coding convention to be a 'good smell' when contributing to python codebases.
> >
> > Based on patch feedback so far, it seems like our standards and conventions are too dissimilar to be beneficial.
>
> I didn't say this is a no-go, but we need to discuss this one and fix our coding guidelines if we agree on pushing this changeset. There already is some inconsistency with aurblup using spaces for indentation, so this certainly is an area we need to work on.
>
> > I will fix up the previous patch-set and resend it, then I will stop wasting my time and yours.
>
> Dude, I hope you don't take our feedback personally. I'm just trying to keep inconsistent and inconvenient stuff out of the code base as it already is way too patchy. Your patches are highly appreciated and I'll definitely push some of them if you clean them up.

It is nothing personal. It just seems that my convention and coding style are not compatible with the aur team of today. Like I said, I am just trying to not waste your time or mine. Don't feel bad, or as if you drove someone away. You are the aur lead, so things *do* need to be optimized for your consumption and style/convention.
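
The indentation inconsistency raised in this thread (tabs in some files, spaces in others such as aurblup) can be measured before any guideline change is discussed. What follows is only a minimal sketch, not part of the AUR tree: the function name `indent_styles` and its result keys are hypothetical, chosen for illustration. It counts how many indented lines of a source text start with tabs only, spaces only, or a mix of both.

```python
from collections import Counter


def indent_styles(source):
    """Count indented lines by the kind of leading whitespace they use.

    Hypothetical helper for illustration; returns a Counter with the
    keys "tabs", "spaces" and "mixed".
    """
    counts = Counter()
    for line in source.splitlines():
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if not indent or not stripped:
            continue  # skip unindented lines and whitespace-only lines
        kinds = set(indent)
        if kinds == {"\t"}:
            counts["tabs"] += 1
        elif kinds == {" "}:
            counts["spaces"] += 1
        else:
            counts["mixed"] += 1  # both tabs and spaces in one indent
    return counts
```

Running this over each file's contents (for example the text of `support/schema/gendummydata.py`) gives a per-file tally, which may help show how far the code base already deviates from whatever "HACKING" prescribes.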