Library reference

The Reader classes

cdblib.Reader reads standard “32-bit” cdb files, such as those produced by the cdbmake CLI tool. cdblib.Reader64 reads “64-bit” cdb files, which can be produced by this package.
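
For orientation, a classic cdb file begins with a table of 256 (position, length) pairs, each field a 32-bit little-endian unsigned integer, which is why offsets in the standard format cannot exceed 4 GiB. The sketch below computes the size of that table with the standard struct module. The 64-bit layout shown is an assumption (both fields widened to 64 bits); this reference does not spell it out.

```python
import struct

# Classic cdb: 256 (position, length) pairs of 32-bit little-endian integers.
PAIR_32 = struct.Struct('<LL')
# Assumption: the "64-bit" variant widens both fields to 64 bits.
PAIR_64 = struct.Struct('<QQ')

header_size_32 = 256 * PAIR_32.size
header_size_64 = 256 * PAIR_64.size
print(header_size_32, header_size_64)
```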

The Reader classes can be instantiated by passing one positional argument, a bytes-like object with a database’s content:

>>> import cdblib
>>> with open('info.cdb', 'rb') as f:
...     data = f.read()
>>> reader = cdblib.Reader(data)

Alternatively, you can use the .from_file_path() and .from_file_obj() constructors as context managers, passing a file path or a file-like object opened in binary mode, respectively.

>>> with cdblib.Reader.from_file_path('info.cdb') as reader:
...     print(reader.items())

>>> with open('info.cdb', 'rb') as f:
...     with cdblib.Reader.from_file_obj(f) as reader:
...         print(reader.items())

When using the .from_file_path() or .from_file_obj() constructors, a memory-mapped file object is created. This keeps the whole database from being read into memory. See the Python docs for more information on mmap.
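
To illustrate what memory mapping buys you, here is a minimal sketch that uses only the standard mmap module on a throwaway file (not a real cdb database, and not cdblib's own code). Slicing the mapping reads just the requested range rather than the whole file.

```python
import mmap
import os
import tempfile

# Write a small throwaway file to stand in for a cdb database.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'0123456789' * 1000)
    path = tmp.name

try:
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; the OS loads pages on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = bytes(mm[5000:5010])  # Copies only these ten bytes
            total = len(mm)
finally:
    os.remove(path)

print(chunk, total)
```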

Retrieving data

The .items() method returns a list of (key, value) tuples representing all of the records stored in the database (in insertion order). Note that a single key can have multiple values associated with it.

>>> reader.items()
[(b'k1', b'v1'), (b'k2', b'v2a'), (b'k2', b'v2b')]

The .iteritems() method is like .items(), but it returns an iterator over the items rather than a list.

The .keys() method returns a list of the keys stored in the database (in insertion order). The .iterkeys() method returns an iterator over the keys. Note that keys will be repeated if a single key has multiple values associated with it.

The .values() method returns a list of the values stored in the database (in insertion order). The .itervalues() method returns an iterator over the values.

Calling len() on a Reader instance returns the number of records (key-value pairs) stored in the database.

>>> len(reader)
3

The in operator can be used to test whether a key is present in the database.

>>> b'k1' in reader
True
>>> b'k3' in reader
False

The .get() method returns the first value in the database for key. If the key isn’t in the database, None will be returned. To use a different default value, use the default keyword:

>>> reader.get(b'k2')
b'v2a'
>>> print(reader.get(b'missing'))
None
>>> reader.get(b'missing', default=b'fallback')
b'fallback'

The .gets() method returns an iterator over all the values associated with key.

>>> list(reader.gets(b'k2'))
[b'v2a', b'v2b']
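
The relationship between .get() and .gets() can be modeled with a plain list of records. This is only a sketch of the semantics described above: the functions below are hypothetical stand-ins, and a real Reader looks keys up through the cdb hash tables rather than scanning linearly.

```python
# Hypothetical stand-in for a database: records in insertion order.
records = [(b'k1', b'v1'), (b'k2', b'v2a'), (b'k2', b'v2b')]


def gets(key):
    """Yield every value stored under key, in insertion order."""
    return (value for k, value in records if k == key)


def get(key, default=None):
    """Return the first value stored under key, or default if absent."""
    return next(gets(key), default)


first = get(b'k2')
all_values = list(gets(b'k2'))
fallback = get(b'missing', default=b'fallback')
print(first, all_values, fallback)
```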

Reader instances also support dict-like retrieval of the first value associated with key. KeyError will be raised if the requested key isn’t in the database.

>>> reader[b'k2']
b'v2a'
>>> reader[b'missing']
KeyError: b'missing'

Note that the values retrieved by the .get() and .gets() methods are bytes objects.

If the values in the database represent integers, you can retrieve them as Python int objects with the .getint() and .getints() methods.

>>> reader.get(b'key_with_int_value')
b'1'
>>> reader.getint(b'key_with_int_value')
1
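
The output above suggests that integers are stored as ASCII decimal text. Assuming that representation (it is not stated explicitly here), the conversion can be approximated with the built-in int(), which accepts bytes directly:

```python
raw = b'1'                                # What .get() returned above
as_int = int(raw)                         # int() parses ASCII decimal bytes
round_trip = str(as_int).encode('ascii')  # Assumed storage form for an int
negative = int(b'-42')                    # Signs are handled as well
print(as_int, round_trip, negative)
```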

Similarly, the .getstring() and .getstrings() methods will retrieve the values as str objects.

>>> reader.get(b'key_with_str_value')
b'text data'
>>> reader.getstring(b'key_with_str_value')
'text data'

You may specify an encoding with the encoding keyword argument.

>>> reader.get(b'fancy_a_or_f')
b'\xc4'
>>> reader.getstring(b'fancy_a_or_f', encoding='cp1252')
'Ä'
>>> reader.getstring(b'fancy_a_or_f', encoding='mac-roman')
'ƒ'
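
The same effect can be reproduced with bytes.decode() alone, which shows why the encoding used for reading should match the one used for writing. This sketch uses only the standard library; the single byte is the value from the example above.

```python
raw = b'\xc4'

as_cp1252 = raw.decode('cp1252')        # Ä
as_mac_roman = raw.decode('mac-roman')  # ƒ

# The same byte is not valid UTF-8 on its own, so decoding it fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_cp1252, as_mac_roman, utf8_ok)
```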

Encoding and strict mode

Database keys are stored as bytes objects. By default, Reader instances will attempt to convert str keys and int keys automatically.

>>> reader.get(b'1')  # Binary key
b'value_for_1'
>>> reader.get('1')  # Text key
b'value_for_1'
>>> reader.get(1)  # Integer key
b'value_for_1'

To disable this behavior, pass strict=True when creating the Reader instance. Skipping the conversion step can improve read performance, and strict mode is useful when you want to deal with bytes keys only.

>>> import cdblib
>>> with open('info.cdb', 'rb') as f:
...     data = f.read()
>>> reader = cdblib.Reader(data, strict=True)
>>> reader.get(b'1')  # Binary key
b'value_for_1'
>>> reader.get(1)
...
TypeError: key must be of type 'bytes'
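
The behavior shown above can be sketched as a small key-normalizing helper. The function name, the UTF-8 default for text keys, and the decimal form for integer keys are assumptions made for illustration; cdblib's internals may differ.

```python
def normalize_key(key, strict=False, encoding='utf-8'):
    """Hypothetical helper: coerce a key to bytes unless strict is set."""
    if isinstance(key, bytes):
        return key
    if strict:
        # Strict mode skips all conversion and rejects non-bytes keys.
        raise TypeError("key must be of type 'bytes'")
    if isinstance(key, str):
        return key.encode(encoding)
    if isinstance(key, int):
        # Assumed: integers become their ASCII decimal representation.
        return str(key).encode('ascii')
    raise TypeError('unsupported key type: %r' % type(key))


print(normalize_key(b'1'), normalize_key('1'), normalize_key(1))
```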

The Writer classes

cdblib.Writer produces standard “32-bit” cdb files, which should be readable by other cdb tools like cdbget and cdbdump. cdblib.Writer64 produces “64-bit” cdb files, which can be read by this package.

The Writer classes take one positional argument, a file-like object opened in binary mode.

>>> import cdblib
>>> with open('info.cdb', 'wb') as f:
...     writer = cdblib.Writer(f)
...     writer.put(b'k1', b'v1a')
...     writer.finalize()

Writer instances don’t create readable databases until their .finalize() method is called. Use them as context managers wherever possible; this ensures that .finalize() is called.

>>> with open('info.cdb', 'wb') as f:
...     with cdblib.Writer(f) as writer:
...         writer.put(b'k1', b'v1a')
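
The context-manager pattern itself is simple. Here is a minimal sketch with a hypothetical stand-in class (not cdblib's Writer) whose __exit__ calls finalize(). The sketch finalizes unconditionally; how the real Writer behaves when the block raises is not covered here.

```python
class FinalizingWriter:
    """Hypothetical stand-in illustrating finalize-on-exit."""

    def __init__(self):
        self.records = []
        self.finalized = False

    def put(self, key, value):
        self.records.append((key, value))

    def finalize(self):
        # A real writer would emit its hash tables and header here.
        self.finalized = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.finalize()
        return False  # Do not suppress exceptions from the block


with FinalizingWriter() as writer:
    writer.put(b'k1', b'v1a')

print(writer.finalized)
```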

Storing data

The .put() method is used to create a database record for a binary key and a binary value.

>>> import io
>>> import cdblib
>>> f = io.BytesIO()  # Use an in-memory database
>>> writer = cdblib.Writer(f)
>>> writer.put(b'k1', b'v1a')

The .puts() method adds multiple binary values at the same key.

>>> writer.puts(b'k2', [b'v2a', b'v2b'])

To store integer values, use .putint() or .putints().

>>> writer.putint(b'key_with_int_values', 1)
>>> writer.putints(b'key_with_int_values', [2, 3])

To store text data, use .putstring() or .putstrings(), with an optional encoding keyword argument. The default encoding is ‘utf-8’.

>>> writer.putstring(b'fancy_a', 'Ä')  # stores b'\xc3\x84'
>>> writer.putstring(b'fancy_a', 'Ä', encoding='cp1252')  # stores b'\xc4'
>>> writer.putstrings(b'boring_a', ['a', 'A'])

As above, don’t forget to call .finalize() if you’re not using a context manager; the database is not readable until you do.

>>> writer.finalize()

Encoding and strict mode

Database keys are stored as bytes objects. As with Reader instances, Writer instances will attempt to convert text keys and integer keys automatically.

To disable this behavior, pass strict=True when creating the Writer instance. Skipping the conversion step can improve write performance, and strict mode is useful when you want to deal with bytes keys only.

Advanced usage

Alternate hash functions

By default python-pure-cdb will use the standard cdb hash function described on djb’s page.
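
For reference, the standard cdb hash starts at 5381 and, for each byte, multiplies the running value by 33 and XORs in the byte, keeping 32 bits. A pure-Python sketch follows; the function name is chosen for illustration and this is not necessarily how the package implements it.

```python
def djb_hash(data):
    """cdb hash: h = ((h << 5) + h) ^ byte, starting from 5381, mod 2**32."""
    h = 5381
    for byte in data:
        h = (((h << 5) + h) ^ byte) & 0xffffffff
    return h


print(djb_hash(b''), djb_hash(b'a'))
```

Any replacement hash function should likewise accept a bytes object and return a non-negative integer that fits in 32 bits.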

You can substitute in your own hash function when using a Writer instance, if you’re so inclined. This will of course require you to use the same hash function when reading the database.

>>> import io
>>> import zlib
>>> import cdblib
>>> def custom_hash(x):
...     # Mask the checksum so it always fits in 32 bits
...     return zlib.adler32(x) & 0xffffffff
...
>>> with io.BytesIO() as f:
...     with cdblib.Writer(f, hashfn=custom_hash) as writer:
...         writer.put(b'k1', b'v1a')
...         writer.puts(b'k2', [b'v2a', b'v2b'])
...     # Read back with the same hash function that wrote the data
...     reader = cdblib.Reader(f.getvalue(), hashfn=custom_hash)
...     reader.items()
...
[(b'k1', b'v1a'), (b'k2', b'v2a'), (b'k2', b'v2b')]

C extension hash function

When using CPython, you can build a C extension that speeds up the cdb hash function.

Set the ENABLE_DJB_HASH_CEXT environment variable when executing setup.py to enable the extension:

$ ENABLE_DJB_HASH_CEXT=1 python setup.py install