thoughts by clayg

Friday, August 14, 2009

TypeError: dict objects are unhashable

My mom told me to update my blog. Hi mom.

I've been wanting to write this one for a while anyway.

In retrospect it was rather naive - but I wonder who hasn't at one time tried to create a set of dictionaries:

|>>> parts = [
|... {'id':1, 'desc':'widget', 'detail':'rear widget'},
|... {'id':1, 'desc':'widget', 'detail':'front widget'},
|... {'id':2, 'desc':'gear', 'size':4},
|... {'id':3, 'desc':'cog', 'type':'green'},
|... ]
|>>> myset = set(parts)
|Traceback (most recent call last):
|  File "<stdin>", line 1, in ?
|TypeError: dict objects are unhashable

What did I expect it to do? I think the first time I tried this it seemed more reasonable. I think my list of dictionaries actually contained exactly the same keys, with some exact duplicates, and I needed to uniquify the list. I actually ended up doing something like the example at the bottom...

Dict objects are not hashable. If it isn't obvious why objects in a set need to support a __hash__ method, read about hash tables and sets.

A hash function is a really simple idea:
two equal objects MUST return the same hash
two unequal objects should RARELY return the same hash
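Both rules are easy to poke at with built-in types. A small sketch in Python 3 syntax; the helper name same_hash is made up for illustration:

```python
def same_hash(a, b):
    """Return True when a and b produce the same hash value."""
    return hash(a) == hash(b)

# Rule 1: equal objects must share a hash, even across types.
print(1 == 1.0, same_hash(1, 1.0))

# Tuples hash by content, so two equal tuples share a hash as well.
print(same_hash(('id', 1), ('id', 1)))

# Rule 2 is only a "should": a collision between unequal objects is legal,
# the container just falls back to == to tell them apart.
```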

But there's no really obvious reasonable way to implement hash on a dictionary of arbitrary keys and values.
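For the narrower case where every value in the dictionary is itself hashable, one way around the problem is to leave the dict alone and hash a frozen snapshot of its items. This is only a sketch in Python 3 syntax, not what I did at the time; dict_key and unique_dicts are names invented for the example:

```python
def dict_key(d):
    """Build a hashable stand-in for a dict (assumes all values are hashable)."""
    return frozenset(d.items())

def unique_dicts(dicts):
    """Return the dicts with exact duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for d in dicts:
        key = dict_key(d)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result

parts = [
    {'id': 1, 'desc': 'widget', 'detail': 'rear widget'},
    {'id': 1, 'desc': 'widget', 'detail': 'rear widget'},   # exact duplicate
    {'id': 1, 'desc': 'widget', 'detail': 'front widget'},
    {'id': 2, 'desc': 'gear', 'size': 4},
]
print(len(unique_dicts(parts)))   # 3: only the exact duplicate is dropped
```

The snapshot is taken at the moment of the call, so changing a dict afterwards does not disturb the set of keys already seen.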

Here's a couple dumb ideas for adding a hash function to a dict:

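|def __hash__(self):
|    # constant hash: legal, but every dict collides with every other
|    key = 0
|    return key

|def __hash__(self):
|    # combine the hashes of the (key, value) pairs; values must be hashable
|    total = 0
|    for pair in sorted(self.items()):
|        total += hash(pair)
|    return hash(total)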

And they "work" - at least in the sense that they remove the error:

|>>> class Part(dict):
|...     def __hash__(self):
|...         return self['id']
|...
|>>> myset = set([Part(x) for x in parts])
|>>> for part in myset:
|...     print part
|...
|{'desc': 'widget', 'detail': 'rear widget', 'id': 1}
|{'desc': 'gear', 'id': 2, 'size': 4}
|{'type': 'green', 'id': 3, 'desc': 'cog'}
|{'desc': 'widget', 'detail': 'front widget', 'id': 1}

And that's great: if my parts list contained EXACTLY equal dictionaries they would be removed - I could turn it back into a list and continue on with everything uniquified! It might be worth noting that the two parts with 'id' = 1 were not considered equal just because they returned the same hash. When there is a hash collision, the inherited __eq__() method recognizes that 'rear widget' != 'front widget' and that the two parts are distinct.

But what's really interesting is what I've done by making a mutable object hashable...

It can be used as a dictionary key with surprisingly bad results:

|>>> part.__class__ # part is an instance of my Part class
|<class '__main__.Part'>
|>>> part # a mutable object with a hash function
|{'desc': 'widget', 'detail': 'front widget', 'id': 1}
|>>> mydict = {} # mydict is a plain dictionary
|>>> mydict[part] = 1 # i can use part as a key!
|>>> part['id'] = 2 # I then modify the key
|>>> mydict[part] # there is no value assigned to this new "modified" key
|Traceback (most recent call last):
|  File "<stdin>", line 1, in ?
|KeyError: {'desc': 'widget', 'detail': 'front widget', 'id': 2}
|>>> mydict # or is there?
|{{'desc': 'widget', 'detail': 'front widget', 'id': 2}: 1}
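The same trap can be reproduced as a script. This is a sketch in Python 3 syntax with a cut-down Part class (the real one is above); the variable names are only there for the demonstration:

```python
class Part(dict):
    """Deliberately unsafe: the hash depends on a value that can change."""
    def __hash__(self):
        return self['id']

part = Part({'id': 1, 'desc': 'widget'})
mydict = {part: 1}

found_before = part in mydict      # hash is 1, matches the stored entry
part['id'] = 2                     # the key's hash silently becomes 2
found_after = part in mydict       # the lookup now goes looking for hash 2
still_stored = any(key is part for key in mydict)

print(found_before, found_after, still_stored)
```

The entry is still physically in the dictionary, filed under the old hash, which is why iterating finds it while a normal lookup does not.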

Here's a fairly reasonable attempt at making a custom class that is a mostly mutable dictionary, but has a safe and reasonable hash function. I've also overridden __eq__ to ignore minor differences in the 'detail' between objects.

|>>> class Part(dict):
|...     def __init__(self, part_dict):
|...         if 'id' not in part_dict:
|...             raise TypeError("Parts must have an id")
|...         dict.__init__(self, part_dict)
|...     def __setitem__(self, key, value):
|...         if key == 'id':
|...             raise ValueError("Part id's can't change - create a new part")
|...         return dict.__setitem__(self, key, value)
|...     def __hash__(self):
|...         return self['id']
|...     def __eq__(self, other):
|...         # compare everything except 'detail' (which may be missing)
|...         a = self.copy()
|...         a.pop('detail', None)
|...         b = other.copy()
|...         b.pop('detail', None)
|...         return a == b
|...
|>>> for part in set([Part(x) for x in parts]):
|... print part
|...
|{'desc': 'widget', 'detail': 'rear widget', 'id': 1}
|{'desc': 'gear', 'id': 2, 'size': 4}
|{'type': 'green', 'id': 3, 'desc': 'cog'}

So if parts is a list of build materials, and you wanted to know how many distinct parts it takes to build this thing... the above Part class might be on the right track. Aside from the ambiguity added by overriding "==" like that... does anyone see any other problems with this?
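One gap that seems worth checking: the guard lives only in __setitem__, and on CPython dict.update() does not route through an overridden __setitem__. A sketch in Python 3 syntax, using a trimmed copy of the Part class (no __eq__) so that only the guard is under test:

```python
class Part(dict):
    """Trimmed copy of the class above: id guard plus hash only."""
    def __init__(self, part_dict):
        if 'id' not in part_dict:
            raise TypeError("Parts must have an id")
        dict.__init__(self, part_dict)

    def __setitem__(self, key, value):
        if key == 'id':
            raise ValueError("Part ids can't change - create a new part")
        dict.__setitem__(self, key, value)

    def __hash__(self):
        return self['id']

part = Part({'id': 1, 'desc': 'widget'})

try:
    part['id'] = 2                 # blocked by the __setitem__ guard
    guarded = False
except ValueError:
    guarded = True

part.update({'id': 2})             # slips past the guard
print(guarded, part['id'], hash(part))
```

Deleting the key (del part['id']), pop() and clear() are not guarded either, so a fully safe version would need to cover every mutating method or keep the id outside the dict.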

Thursday, February 5, 2009

pymssql and sqlalchemy

At the time of this writing the latest version of sqlalchemy (0.5.2) does not support the recent re-write of pymssql (1.0.0), which was released last week.

Attempting to create a sqlalchemy engine object will result in an exception:

File "/lib/python2.5/site-packages/SQLAlchemy-0.5.2-py2.5.egg/sqlalchemy/databases/mssql.py", line 1294, in create_connect_args
self.dbapi._mssql.set_query_timeout(self.query_timeout)
AttributeError: 'module' object has no attribute 'set_query_timeout'
>>>

According to the pymssql news page, the low level module in this major version release is not backwards compatible:

BEWARE however, if you were using the lower level _mssql module, it changed in incompatible way. You will need to change your scripts, or continue to use pymssql 0.8.0. This is why major version number was incremented.

As a 'work-around' you can always install an older stable version of pymssql (0.8.0):

$ easy_install pymssql==0.8.0

Tuesday, January 6, 2009

Quake Live Beta Invites Crash Test

I finally got my QuakeLive Closed Beta Account - and you may have too!

The QuakeLive Beta is warming up!

on Monday we're going to be sending out our largest number of new beta invites ever - hopefully more than doubling our current active player base.

- QuakeLive News

id is planning a big "crash test" on Wednesday afternoon - Jan 7th 2009. Go check your email, and be ready for a three part sign-up and activation rig-o-mo-rag...

id has asked us to "BRING THE HEAT"

Monday, December 29, 2008

Set PYTHONPATH

To set the environment variable PYTHONPATH in bash:
export PYTHONPATH=/path/to/modules

Just setting PYTHONPATH=/path/to/modules won't work - you have to use export. If you want a variable passed on to child processes, it has to show up when you type 'env'.

But since you used export - next time you start python, '/path/to/modules' will automatically be added near the front of your sys.path, ahead of the standard library directories.

Obviously '/path/to/modules' should be the full path to wherever you're keeping your modules - something like /home/clayg/lib. (Relative paths will work, and ~/ only works if the shell expands it before Python sees it, so it's probably best to avoid both.) You can separate multiple directories with a colon:

export PYTHONPATH=/path/to/modules:/path/to/other/modules
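To see what a child Python process actually ends up with, you can start one with PYTHONPATH set and ask it for its sys.path. A sketch in Python 3 syntax; child_sys_path is a made-up helper and the temporary directories merely stand in for /path/to/modules and /path/to/other/modules:

```python
import json
import os
import subprocess
import sys
import tempfile

def child_sys_path(pythonpath):
    """Start a child interpreter with PYTHONPATH set and return its sys.path."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    out = subprocess.check_output(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
    )
    return json.loads(out.decode())

with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    # os.pathsep is ':' on Linux and macOS, the same separator used above
    paths = child_sys_path(first + os.pathsep + second)
    print(first in paths, second in paths)
```

Passing env= to the child is the Python-side equivalent of export: the variable only reaches the new process because it is part of the environment handed to it.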

Setting PYTHONPATH is handy if, for example, you are using setuptools or distutils to install python modules on a system on which you do not have root privileges.

Just download the source dist and find the directory with setup.py, then run:
python setup.py build

Then re-locate the folder with the __init__.py (usually ./build/lib/packagename) to your /path/to/modules folder.

If you want your PYTHONPATH to stick around, add the export line to your .bashrc.

import packagename should be good to go!

Thursday, December 4, 2008

get dates from excel with python xlrd

a1_as_datetime = datetime.datetime(*xlrd.xldate_as_tuple(a1, 0))

UPDATE: Please read the discussion of the second argument to xldate_as_tuple - "datemode" in the comments section of this post before using this example. It is LIKELY that hard-coding the "datemode" option will not meet your long term needs.

I had to piece this line together from two other articles; sorry, I don't remember which.

Full Example:

>>> import datetime, xlrd
>>> book = xlrd.open_workbook("myfile.xls")
>>> sh = book.sheet_by_index(0)
>>> a1 = sh.cell_value(rowx=0, colx=0)
>>> print "Cell A1 is ", a1
Cell A1 is 39811.0
>>> a1_as_datetime = datetime.datetime(*xlrd.xldate_as_tuple(a1, book.datemode))
>>> print 'datetime: %s' % a1_as_datetime
datetime: 2008-12-29 00:00:00
>>>

This might make more sense if you're familiar with xlrd - A Python module for extracting data from MS Excel ™ spreadsheet files.

If you are familiar with xlrd, then the only part really worth discussing is xldate_as_tuple, which will convert the float that Excel uses to store the date into something more useful, like a tuple:
(2008, 12, 29, 0, 0, 0)

Note that the first argument to the xldate_as_tuple function is the variable I defined as a1. xldate_as_tuple will not accept a cell reference 'a1' or some such thing - you have to give it the float!
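To get a feel for why the "datemode" argument matters, here is a rough stdlib-only sketch of the arithmetic behind a serial date, in Python 3 syntax. It is not xlrd's implementation; serial_to_datetime and the EPOCHS table are illustrative names, and the epochs are my assumption about the two Excel date systems (1900-based for datemode 0, 1904-based for datemode 1):

```python
import datetime

# Assumed epochs: 1899-12-30 for the 1900 date system (datemode 0, only valid
# for serials after Excel's phantom 1900-02-29) and 1904-01-01 for datemode 1.
EPOCHS = {
    0: datetime.datetime(1899, 12, 30),
    1: datetime.datetime(1904, 1, 1),
}

def serial_to_datetime(serial, datemode):
    """Convert an Excel serial date (a float counting days) to a datetime."""
    return EPOCHS[datemode] + datetime.timedelta(days=serial)

print(serial_to_datetime(39811.0, 0))   # 2008-12-29 00:00:00
print(serial_to_datetime(39811.0, 1))   # same float read as a 1904-system date
```

The same float lands roughly four years apart depending on the date system, which is why taking the value from book.datemode is safer than hard-coding 0.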

The datetime.datetime constructor requires at minimum three positional arguments:

datetime( year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

This would also work:
a1_as_date = datetime.date(*xlrd.xldate_as_tuple(a1, book.datemode)[:3])

You can pass the items of the tuple as positional arguments by prefacing the tuple with an asterisk (wtg python!)
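A tiny illustration of the asterisk without xlrd in the picture, in Python 3 syntax. The tuple is typed in by hand to match the one shown earlier:

```python
import datetime

date_tuple = (2008, 12, 29, 0, 0, 0)

# The asterisk spreads the six items across the positional parameters.
unpacked = datetime.datetime(*date_tuple)
spelled_out = datetime.datetime(2008, 12, 29, 0, 0, 0)

# datetime.date only takes year, month and day, hence the [:3] slice.
just_date = datetime.date(*date_tuple[:3])

print(unpacked == spelled_out, just_date)
```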

Friday, October 17, 2008

Simple example of Threads in Python

The first time it was immediately obvious to me that there would be a significant gain from 'threading' a program I had written - was in the context of screen scraping. I had a handful of HTTP GET requests from almost 20 pages that were being processed one... after... the... other. I realized of course that if I would just start the next request before waiting on the last one to finish - the entire process would be over much more quickly.

In this example the screen scraping 'worker' function is replaced with a simpler 'random wait' function:


#! /usr/bin/env python

import sys
import threading
import time
import random

# The worker function does the processing
def worker(arg):
 arg = random.randint(2,10)
 time.sleep(arg)
 return arg

# The myThreadObj wraps the worker function in a thread
class myThreadObj(threading.Thread):
 def __init__(self, arg):
  threading.Thread.__init__(self)
  self.arg = arg
  self.value = 0
 def run(self):
  self.value = worker(self.arg)
  print 'Thread %d Ended.' % self.arg

# my array of arguments to be processed by the worker function
myArgs = range(5)

# create a myThreadObj to process each argument
myThreadList = []
for i in myArgs:
 myThreadList.append(myThreadObj(i))
 # and start it immediately
 myThreadList[i].start()

# wait for all threads to finish
for each in myThreadList:
 each.join()

print 'All threads have completed.'

for i in myArgs:
 print "myThreadList[%d] = %d" % (i, myThreadList[i].value)

The myThreadObj wrapper should accept whatever arguments you normally pass to the worker, and when the worker is completed - it will store the returned value in 'self.value'

The .join() method blocks until the thread has finished (that is, until .isAlive() would return False). I join each thread in turn to verify that all have completed. It doesn't matter if .join() blocks for 8 seconds while it's waiting on the first thread, or if it reaches a thread that already finished 6 seconds ago because an earlier .join() was waiting on a thread that took longer. The point is that, by the time all of the .join() calls return - ALL THREADS HAVE FINISHED.

Once the threads are done we expect myThreadObj.value to contain the return value of the worker function.
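For anyone on Python 3, the same wrapper pattern might look like the sketch below. Note that .isAlive() was removed in Python 3.9 in favour of .is_alive(). WorkerThread and the short fixed delays are stand-ins chosen for the example, not the original script:

```python
import threading
import time

def worker(delay):
    """Stand-in for slow I/O: sleep, then hand back a result."""
    time.sleep(delay)
    return delay * 2

class WorkerThread(threading.Thread):
    """Runs worker(arg) in its own thread and keeps the return value."""
    def __init__(self, arg):
        super().__init__()
        self.arg = arg
        self.value = None

    def run(self):
        self.value = worker(self.arg)

threads = [WorkerThread(delay) for delay in (0.2, 0.05, 0.1)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # returns once this particular thread is done

print([t.is_alive() for t in threads])
print([t.value for t in threads])
```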

If your 'worker' function is something like an API call, or database query - anything with some built in lag from a system that's designed to serve multiple simultaneous requests - as long as you can queue them up - threading will provide a significant improvement.

e.g.


clayg@m-net:~$ cat nonthread.py
#! /usr/bin/env python

import sys
import threading
import time
import random

# The worker function does the processing
def worker(arg):
 arg = random.randint(2,10)
 time.sleep(arg)
 return arg

myArgs = range(5)

for i in myArgs:
 print "myThreadList[%d] = %d" % (i, worker(i))
clayg@m-net:~$ time ./nonthread.py ; echo ; time ./simplethread.py
myThreadList[0] = 5
myThreadList[1] = 5
myThreadList[2] = 10
myThreadList[3] = 2
myThreadList[4] = 9

real    0m31.073s
user    0m0.045s
sys     0m0.024s

Thread 4 Ended.
Thread 0 Ended.
Thread 1 Ended.
Thread 2 Ended.
Thread 3 Ended.
All threads have completed.
myThreadList[0] = 5
myThreadList[1] = 10
myThreadList[2] = 10
myThreadList[3] = 10
myThreadList[4] = 2

real    0m10.078s
user    0m0.045s
sys     0m0.030s
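On Python 3 the same comparison can be sketched with the standard library's concurrent.futures module, which handles the start/join bookkeeping for you. This swaps in a thread pool in place of the hand-written Thread subclass; run_all and the short delays are assumptions made to keep the example quick:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def worker(delay):
    """Stand-in for an API call or query: wait, then return something."""
    time.sleep(delay)
    return delay

def run_all(delays):
    """Run worker() for every delay at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(worker, delays))

delays = [0.3, 0.1, 0.2]
start = time.time()
results = run_all(delays)
elapsed = time.time() - start
print(results, "took %.2fs, one after the other would take %.2fs" % (elapsed, sum(delays)))
```

As with the timings above, the threaded run should take about as long as the slowest worker, not the sum of all of them.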

Wednesday, October 1, 2008

Parsing a list of numbers in Python

I find that I often need to get a selection of numbers in a range as input. I'm using Python more and more these days it seems, so I needed to port this classic function over. I must have done this 4 weeks ago - I'd been meaning to put it up here.

The valid input will be a comma separated list of integers, which could possibly contain a 'range' defined as "x-y" - where x and y are both integers.

I tried not to make any special stipulation for the order of these integers, or even that the input string would not contain bad characters.

Here it is:

#! /usr/local/bin/python

# return a set of selected values when a string in the form:
# 1-4,6
# would return:
# 1,2,3,4,6
# as expected...

def parseIntSet(nputstr=""):
    selection = set()
    invalid = set()
    # tokens are comma separated values
    tokens = [x.strip() for x in nputstr.split(',')]
    for i in tokens:
        try:
            # typically tokens are plain old integers
            selection.add(int(i))
        except ValueError:
            # if not, then it might be a range
            try:
                token = [int(k.strip()) for k in i.split('-')]
                if len(token) > 1:
                    token.sort()
                    # we have items separated by a dash
                    # try to build a valid range
                    first = token[0]
                    last = token[-1]
                    for x in range(first, last + 1):
                        selection.add(x)
            except ValueError:
                # not an int and not a range...
                invalid.add(i)
    # Report invalid tokens before returning valid selection
    print "Invalid set: " + str(invalid)
    return selection
# end parseIntSet

print 'Generate a list of selected items!'
nputstr = raw_input('Enter a list of items: ')

selection = parseIntSet(nputstr)
print 'Your selection is: '
print str(selection)
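If you want the same idea on Python 3 without the interactive prompt, here is one possible sketch. It differs from the script above in two ways that are my own choices: it returns the invalid tokens alongside the selection (it does not print them), and it accepts only two-ended ranges, so '1-2-3' is reported as invalid. parse_int_set is just a name picked for the example:

```python
def parse_int_set(text):
    """Parse input such as '1-4,6' into (selection, invalid_tokens)."""
    selection = set()
    invalid = set()
    for token in (t.strip() for t in text.split(',')):
        try:
            # Plain integer, e.g. '6'
            selection.add(int(token))
            continue
        except ValueError:
            pass
        try:
            # Range, e.g. '1-4' (the reversed form '4-1' is accepted too)
            first, last = sorted(int(p.strip()) for p in token.split('-'))
        except ValueError:
            # Not an int and not a two-ended range
            invalid.add(token)
            continue
        selection.update(range(first, last + 1))
    return selection, invalid

print(parse_int_set("1-4, 6, 9-7, x, 2-"))
```

Returning both sets leaves it to the caller to decide whether bad input should be printed, logged or treated as an error.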

When trying to copy this from someone else I came across a similar function written in Ruby, in case you needed that instead.