[R6RS] comments on rough draft for hash tables

Wed Mar 8 07:14:38 EST 2006

Thanks, Anton.  Here are my first reactions to the rough draft.

> Procedure: hash-table-get/flag hash-table key flag  => object

To me a flag is either a boolean or a special value
that's treated differently from other values.  Neither
is implied here.  I'd prefer hash-table-get/default,
by analogy with the name used by SRFI-69:
hash-table-ref/default.

We also should have something like

Procedure: hash-table-get/call hash-table key f => object

    Returns the value associated with key in the hash-table
if the hash table contains key; otherwise tail-calls f on key.

In my opinion, hash-table-get/call is more useful and more
efficient than hash-table-get/thunk, which often requires
the user to obtain the thunk by writing things like
(lambda () (f key)).

> Procedure: hash-table-contains hash-table

This should be

Procedure: hash-table-contains? hash-table key

If we change /flag to /default or add the /call version,
then we should make the corresponding change or addition
to the update procedures.

> Procedure: hash-table-fold hash-table procedure init => value
> Procedure: hash-table-for-each procedure hash-table => unspecified

What happens if the procedure adds or deletes keys from
the hash-table?  The specification of hash-table-clear!
implies that this is allowed.

Allowing the procedure to perform side effects on the
hash table being traversed can create race conditions
and other sources of nondeterminism that are probably best
resolved by making a copy of the hash-table every time one
of these procedures is called.  Many programmers are
unlikely to anticipate the cost of making that copy.

In Java, the problem is "resolved" for java.util.Iterator
by encouraging an Iterator to throw a
ConcurrentModificationException whenever the Iterator
is used after a modification to the data structure the
Iterator is traversing.  This condition cannot be
detected reliably, however, so the exception is rasied
on a best-effort basis.

> Procedure: eq-hash object => integer
>
>     Returns a hash value for object based on its location.
> This procedure is the hash function used by make-eq-hash-table.

No, we shouldn't guarantee that this procedure is used by
make-eq-hash-table.  In some systems it may be, but it is
unlikely to be the same hash function in systems whose
garbage collectors move objects.  In those systems, the
eq-hash procedure is likely to be implemented by entering
every hashed object into a global hash table that was
created at system startup by calling make-eq-hash-table.
Calling eq-hash on an object may prevent that object from
ever being garbage collected, or may delay its collection for
a very long time even if it becomes eligible for collection.

Many programmers are unlikely to anticipate the cost of
using eq-hash, so the R6RS should explain this (if it has
eq-hash at all).

> Procedure: eqv-hash object => integer
>
>     This procedure is the hash function used by make-eqv-hash-table.

Ditto.

> Procedure: hash-table-hash-function hash-table => procedure
>
>     Returns the hash function used by hash-table.

Should these procedures be defined on hash tables that
were created by calling make-eq-hash-table or
make-eqv-hash-table?  The point of having separate
constructors for those tables is to allow a hash function
whose behavior changes over time and is synchronized with
periodic rehashing of the table.

If hash-table-hash-function is defined on all hash tables,
then the primitive eq? and eqv?-compatible tables will have
to return eq-hash and eqv-hash.  Those are *not* necessarily
the hash functions that are actually used by the hash-table.
They may be much more expensive in both time and space.

> If so, then an approach more like that of MIT Scheme could be
> used, in which a rehash-after-gc? flag is provided to a hash
> table constructor-constructor.  The MIT Scheme approach is
> more general, allowing the programmer to define gc-dependent
> hash functions.

Hah.  The implementors of MIT Scheme did not anticipate
garbage collectors that run 100 times per second.

> Complexity: It may be appropriate to specify constraints on
> complexity, such as: O(1) for retrieval of elements from hash
> tables; constant time for hash-table-size.

O(1) retrieval cannot be guaranteed for arbitrary hash
functions.  You can't even guarantee amortized O(1)
retrieval for the system's hash functions.  The best
you can do is to guarantee average-case amortized O(1)
retrieval for the system's hash functions.

> Concurrency: R6RS does not deal with concurrency.  Should
> this proposal say any thing about it?

We might not say anything about it, but we need to think
about it.  Any implementation that supports concurrency is
going to have to implement some kind of mutual exclusion
for operations that have side effects, and some will need
a bit of mutual exclusion even for hash-table-ref.

As specified, the updating operations are not atomic, so
they create no new problems.  (Some implementors might try
to implement them as atomic operations, but we can let
them learn the hard way.)

The hash-table-fold and hash-table-for-each procedures
already have a problem, even without concurrency.

Will