hashgen
.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public Licencse as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later option.
The hashgen
utility generates the source of a lookup function
for a given static keyset (reserved words of a programming
language, assembler mnemonics, XML-based language elements,
etc.)
Every keyset's entry contains a key (either string or binary)
and arbitrary user-defined atributes associated with that key.
The keys must be unique.
The generated function will find any keyword in constant time using exactly one comparison. A typical lookup may take as little as 100ns on an average desktop machine.
The source code can be generated in C++, ANSI C, or K&R C.
The hashgen
utility does basically the same thing and uses the same
basic algorithm as the gperf
utility by Douglas C. Schmidt
(see Acknowledgements).
While gperf
is a great piece of software, I found it a bit
difficult to use. The major points addressed by hashgen
are these:
gperf
one has to select the character key positions
by hand, which may be a daunting task for large keysets with
long, highly redundant keys. Just selecting all positions,
may yield an unnecessary complex hash function (and will not necessarily
resolve collisions).
Hashgen
will automatically select a minimal set of positions
that is enough to distinguish the keys of a given keyset.
gperf
, the hash table of a keyset with two
or more keys that consist of the same characters and differ only in their
order will always have collisions (so gperf
cannot generate
a perfect hash function for, say, Bourne shell keywords with its
if
/fi
and case
/esac
pairs.)
Hashgen
will always generate a perfect hash function
for such keysets (but see Bugs).
With hashgen
you don't have to care about the algorithmic
details (although there are options that allow you to finetune the
results).
Gperf
's input file format is not very intuitive (especially if you are not familiar with lex/yacc) and is not self-contained.
There are two (interdependent) sources of input information--the
input file and the command-line options.
With hashgen
, all options can be specified in the input
file. Its format is comprehensible and one can easily tailor one
of the samples packaged with hashgen
even without reading
any documentation (Example Input). Any option still can be
specified on the command-line.
results).
hashgen
is generation of the
lookup test source that both tests the generated lookup function
and measures the average lookup time.
Hashgen
allways retains the original order of the keys.
So in addition to hashed lookups you can also traverse the keys in a
well defined order, or do binary searches to find the closest
key if the key are sorted. For C++ this functionality can be
easily added in a class derived from the generated class.
Synopsis:
hashgen [options]... [input-file]
options are expected in either standard short or GNU-style long format.
input-file contains the keylist, options, and C declarations. If input-file is not specified, the standard input is read. Only one input file can be handled at a time.
The generated code is written to the standard output or to the files
specified with -o
, -c
, and -h
options.
An option can be specified in either the [options]
section
of the input file or as a command-line option. The latter may
be specified in either short or GNU-style long form.
The long form of a command-line option has the same name as
in the [options]
section of the input file with two
dashes prepended.
Example. A fragment of an input file like this
[options] size=3 header=keylookup.h ...
is equivalent to the following command line fragment:
./hashgen --size=3 --header=keylookup.h ...
or, in the short form,
./hashgen -N3 -h keylookup.h ...
A Boolean option foo
can be turned on and off like this
foo=yes foo=no
or like this
foo=1 foo=0
or just
foo no-foo
All of the above forms can be used on the command-line with
two preceding dashes, e.g. --no-foo
.
For a short form of a Boolean option, say i
,
-i
and -i+
turn i
on,
-i-
turns it off.
Short Boolean options can be combined; the last option in a cluster
of options can have non-Boolean argument, e.g.
-zi-l C++
, will turn z
on,
i
off, and l
will have the argument C++
.
The blank between a short non-Boolean command-line option and its
argument is not required.
Options specified in the input file override the options specified
on the command-line. That is, if in the [options]
section
of the input file you specified output=foo
, but on the
command-line --output=bar
, the output will still go to
foo
, no warnings will be issued.
This section list all program's options. All Boolean options are
given in their non-default form. E.g., zero-term
is on by
default, so it's listed here as ---no-zero-term
or -z-
.
--output=FILE
-o FILE
output
option is specified, the generated
lookup code goes to the specified file.
In order to separate the header and the implementation, use
header
and source
options instead.
--header=HEADER
--source=SOURCE
-h HEADER
-c SOURCE
-h
option,
and the name of the implementation file with -c
option.
If you specify, say, only -h
, the implementation source
will go to the file specified with -o
if it's given, or
to the standard output if it's not.
--language=LANG
-l LANG
--namespace-name=NAME
-n NAME
--lookup-func-name=NAME
-f NAME
struct entry* lookup(const char* key, size_t len);
--entry-struct-name=NAME
-e NAME
--key-name=NAME
-k NAME
--lookup-struct-name=NAME
-s NAME
Even if the selected language is C, all the lookup tables are bundled
into a struct with this name.
--lookup-data-name=NAME
-d NAME
--key-table-name=NAME
--weight-table-name=NAME
--hash-table-name=NAME
--length-table-name=NAME
--collision-table-name=NAME
--test=FILE
-T FILE
If the header
option was specified, the generated test source
will include the lookup header (and you will have to link the test program
with the lookup implementation). Otherwise, it will contain a copy
of the lookup source, and the test executable may be compiled
from the test source alone.
--inline-lookup-func
-I
The inline
prefix will be #ifdef
ed, it will expand to
inline
for C++, __inline__
for GNU, or,
if INLINE
macro is defined (define it in the [declarations]
section), to INLINE
.
--static-lookup-data
-t
--no-include-string-h
-i-
strcmp()
(or memcmp()
), and NULL
(for C.)
If the option is disabled, it becomes you responsibility.
--size=SIZE
-N SIZE
SIZE
must be between 0 and 9, the default value is 0.
Affects the size of the generated hash table, the larger is the
value, the larger will be the hash table (this is true
only for large keysets).
Larger hash tables will decrease time necessary to decide that a
key does not belong to the set (there will be more empty slots in the
hash table; when we hit an empty slot there is no need to compare
keys). At the same time, very large hash tables may increase time of
successful searches a negligible amount because of worse locality
(more CPU cache misses.)
--min-expected-char=MINCHARCODE
--max-expected-char=MAXCHARCODE
-m MINCHARCODE
-M MAXCHARCODE
M??CHARCODE
is decimal, hexadecimal (0xnn),
or octal (0nnn) character code.
The default values are 0x01 and 0xff respectively.
More likely than not the keys you are going to look up are not random byte arrays but rather the result of some kind of parsing. E.g., a C keyword is going to consist of nothing but lower case letters.
Narrowing the range of expected characters will result in smaller lookup tables. But be warned that if the character range is violated at runtime, the results may be disastrous.
If min-expected-char
is 0, the zero-term
option
is turned off (a zero terminated string cannot contain zeroes).
--no-zero-term
-z-
hashgen
assumes that the input keys are standard
zero terminated C strings and the generated code will use
strcmp()
for key comparison.
If the option is disabled, memcmp()
will be used instead,
the keys at runtime don't have to be zero terminated and may
contain NUL characters.
The option is automattically disabled when a NUL character is
encountered in one of the input keys, or the min-expected-char
option is set to 0.
--no-length-switch
-S-
If you know that the target compiler cannot do that, disable the option,
in this case if
/goto
pairs will be generated instead.
--optimize-hits
-H
hashgen
will try to minimize the time
of both successful searches and failures. When the option
is enabled, successful searches may become a couple of cycles
faster at the cost of failures, which will become a bit
slower.
This option affects only the generated lookup code, not the algorithm (the generated hash table will be the same.)
You may want to enable this option if you know that the lookup function
is going to be called much more often for valid keys (in this case it
also makes sense to try to generate a smaller hash table).
--random
-r
hashgen
execution time will decrease.
--randimize
-R
--random
option.
--debug
--D
--version
--help
The input file format is similar to that of MS Windows (R) .ini
files.
An input file is divided into sections. Every section starts with a section
header line (e.g. [options]
) and
terminates at the beginning of the next section or end of file.
There are three kinds of sections: [options]
,
[declarations]
, and [entries]
.
The order of sections does not matter.
Empty lines, and lines starting with "#" are ignored (except for
the [definitions]
section, which contains C/C++ declarations, and
is inserted into the generated source verbatim.)
All sections except [entries]
are optional. The [entries]
section header may be omitted if this section is first in the file.
Thus, a minimal valid input file contains just a list of keywords,
each starting on a new line.
[options]
Section
[declarations]
Section
[entries]
Section
[options]
SectionThe [options]
section contains all the program's options.
This section is optional, and if it's omitted, all the options
will get their default values, or values specified on the command
line.
Each option starts on a new line, and has name=value
format.
Empty lines and lines starting with the hash sign ("#") are
ignored.
For detailed description of the program's options see the Invoking Chapter (Invoking.)
[declarations]
SectionThe [declaration]
section contains arbitrary C/C++ declarations
that are copied verbatim into the generated code.
Within the [declarations]
section, lines starting with "#"
and empty lines are not ignored. If you need to comment
something, use C/C++ comments instead. No line may start with the
"[" character.
One thing that must be present in the [definitions]
section is declaration of the structure of the key table entry.
It must be a struct with the same name as specified by the
entry-struct-name
option (the default value is entry
).
The first field of the struct must be name
(or the name
specified with the key-name
option), and be of type
char*
(const char*
) or char[]
.
Instead of defining the entry structure in the input file, you can do it
in some other file and just #include it in the [declarations]
section.
If the [declarations]
section is empty, the following entry
declaration will be generated:
struct entry { char* name; };
(the names entry
and name
may be different, as specified
by entry-struct-name
and key-name
options.)
If the [declarations]
section is not empty, the entry
structure will not be defined automatically.
Example:
[options] entry-struct-name = TokenEntry [declarations] struct TokenEntry { const char* name; int token; int flags; }; [entries] "for", 1, 0 "while", 2, 0 "do", 3, 0
Also see the Example Chapter (Example).
[entries]
SectionEach line of the [entries]
section contains a key and
attributes associated with that key separated with commas. The
first field is the key itself, the attributes are optional. Empty
lines and lines starting with "#" are ignored.
The syntax of each line must be suitable for C structure initializer
(without the enclosing braces). It will be used to initialize an
entry in the key table, the structure of the entry is defined in
the [declarations]
section (see [declarations]).
The [entries]
section is assumed by default when the parsing begins,
so a minimal valid input file may contain just a list of keywords,
each starting on a new line.
Example:
[declarations] struct entry { char* name; int id; }; [entries] "foo", 1 "bar", 2 "baz", 3
Within the key field C backslash escapes are allowed. If the key
does not contain double quotes, commas, spaces, backslash escapes,
and does not start with "[", the surrounding double quotes may
be omitted:
[entries] foo, 1 bar, 2 baz, 3
The attributes (after the first comma and up to the following newline character) are not parsed at all, just included in the generated initializer as is. It's your responsibility to ensure their syntax is correct.
For instance, for entries like these
key1, 1 /* one */, 2, something key2 key3, 3 ...
the following initializers will be generated
{ "key1", 1 /* one */, 2, something }, { "key2" }, { "key3", 3 }, ...
The following sections show an example of the input file and lookup sources generated from it.
# Hashgen input file example: # Lookup class for C++ reserved words. # The [entries] section: lists keys and associated data. # The [entries] section header is implied by default. # Double quotes around the key values are omitted. asm, TOKEN_ASM auto, TOKEN_AUTO break, TOKEN_BREAK case, TOKEN_CASE catch, TOKEN_CATCH char, TOKEN_CHAR class, TOKEN_CLASS const, TOKEN_CONST continue, TOKEN_CONTINUE default, TOKEN_DEFAULT delete, TOKEN_DELETE do, TOKEN_DO double, TOKEN_DOUBLE else, TOKEN_ELSE enum, TOKEN_ENUM extern, TOKEN_EXTERN float, TOKEN_FLOAT for, TOKEN_FOR friend, TOKEN_FRIEND goto, TOKEN_GOTO if, TOKEN_IF inline, TOKEN_INLINE int, TOKEN_INT long, TOKEN_LONG new, TOKEN_NEW operator, TOKEN_OPERATOR overload, TOKEN_OVERLOAD private, TOKEN_PRIVATE protected, TOKEN_PROTECTED public, TOKEN_PUBLIC register, TOKEN_REGISTER return, TOKEN_RETURN short, TOKEN_SHORT signed, TOKEN_SIGNED sizeof, TOKEN_SIZEOF static, TOKEN_STATIC struct, TOKEN_STRUCT switch, TOKEN_SWITCH template, TOKEN_TEMPLATE this, TOKEN_THIS typedef, TOKEN_TYPEDEF union, TOKEN_UNION unsigned, TOKEN_UNSIGNED virtual, TOKEN_VIRTUAL void, TOKEN_VOID volatile, TOKEN_VOLATILE while, TOKEN_WHILE [options] # All the options set explicitly for demonstration purposes # (for most the default values would be okay.) language = C++ header = cxx_keywords.h source = cxx_keywords.cxx test = cxx_keywords_test.cxx namespace-name = Scanner entry-struct-name = Token key-name = text lookup-struct-name = TokenHash lookup-data-name = tokenHash lookup-func-name = lookup key-table-name = tokens weight-table-name = charWeights length-table-name = keyLengths hash-table-name = hashTable collision-table-name = hashCollisions optimize-hits = yes length-switch = yes #inline-lookup-func = yes static-lookup-data = no min-expected-char = 0x30 max-expected-char = 0x7a # The [declarations] sections: defines the structure # of a key table entry (`Token' in this example). # Also may contain any additional declarations, # everything is copied verbatim into the generated header. [declarations] struct Token { const char* text; int token; }; enum { TOKEN_ASM, TOKEN_AUTO, TOKEN_BREAK, TOKEN_CASE, TOKEN_CATCH, TOKEN_CHAR, TOKEN_CLASS, TOKEN_CONST, TOKEN_CONTINUE, TOKEN_DEFAULT, TOKEN_DELETE, TOKEN_DO, TOKEN_DOUBLE, TOKEN_ELSE, TOKEN_ENUM, TOKEN_EXTERN, TOKEN_FLOAT, TOKEN_FOR, TOKEN_FRIEND, TOKEN_GOTO, TOKEN_IF, TOKEN_INLINE, TOKEN_INT, TOKEN_LONG, TOKEN_NEW, TOKEN_OPERATOR, TOKEN_OVERLOAD, TOKEN_PRIVATE, TOKEN_PROTECTED, TOKEN_PUBLIC, TOKEN_REGISTER, TOKEN_RETURN, TOKEN_SHORT, TOKEN_SIGNED, TOKEN_SIZEOF, TOKEN_STATIC, TOKEN_STRUCT, TOKEN_SWITCH, TOKEN_TEMPLATE, TOKEN_THIS, TOKEN_TYPEDEF, TOKEN_UNION, TOKEN_UNSIGNED, TOKEN_VIRTUAL, TOKEN_VOID, TOKEN_VOLATILE, TOKEN_WHILE };
// Generated by hashgen 1.0b (Dec 8 2002) // language: C++ // 47 keys, hash table size 68 (145%), 0 collision(s) // expected char range: 0x30-0x7a // Mon Dec 9 14:50:21 2002 #ifndef TokenHash_INCLUDED #define TokenHash_INCLUDED #include <cstring> namespace Scanner { struct Token { const char* text; int token; }; enum { TOKEN_ASM, TOKEN_AUTO, TOKEN_BREAK, TOKEN_CASE, TOKEN_CATCH, TOKEN_CHAR, TOKEN_CLASS, TOKEN_CONST, TOKEN_CONTINUE, TOKEN_DEFAULT, TOKEN_DELETE, TOKEN_DO, TOKEN_DOUBLE, TOKEN_ELSE, TOKEN_ENUM, TOKEN_EXTERN, TOKEN_FLOAT, TOKEN_FOR, TOKEN_FRIEND, TOKEN_GOTO, TOKEN_IF, TOKEN_INLINE, TOKEN_INT, TOKEN_LONG, TOKEN_NEW, TOKEN_OPERATOR, TOKEN_OVERLOAD, TOKEN_PRIVATE, TOKEN_PROTECTED, TOKEN_PUBLIC, TOKEN_REGISTER, TOKEN_RETURN, TOKEN_SHORT, TOKEN_SIGNED, TOKEN_SIZEOF, TOKEN_STATIC, TOKEN_STRUCT, TOKEN_SWITCH, TOKEN_TEMPLATE, TOKEN_THIS, TOKEN_TYPEDEF, TOKEN_UNION, TOKEN_UNSIGNED, TOKEN_VIRTUAL, TOKEN_VOID, TOKEN_VOLATILE, TOKEN_WHILE }; class TokenHash { public: static size_t size() { return 47; } static const Token* at(size_t index) { return tokens + index; } static const Token* lookup(const char* key) { return lookup(key, std::strlen(key)); } static Token* lookup(const char* key, size_t len); protected: static unsigned char hashTable[68]; static unsigned char charWeights[75]; static Token tokens[47]; }; extern TokenHash tokenHash; } #endif
#include "cxx_keywords.h" namespace Scanner { TokenHash tokenHash; unsigned char TokenHash::hashTable[68] = { // Hash table. Each element either points to a key table entry, // or is null (47) if there is no key with this hash value. /* 0 */ 47, 47, 47, 0, 47, 47, 35, 40, 1, 47, /* 10 */ 36, 3, 7, 13, 34, 8, 31, 38, 19, 4, /* 20 */ 15, 6, 30, 25, 28, 11, 17, 23, 29, 10, /* 30 */ 9, 39, 37, 20, 24, 18, 14, 16, 32, 2, /* 40 */ 41, 45, 22, 47, 5, 33, 47, 26, 43, 47, /* 50 */ 47, 42, 47, 21, 47, 27, 47, 12, 47, 47, /* 60 */ 44, 47, 47, 47, 47, 47, 47, 46 }; unsigned char TokenHash::charWeights[75] = { // Weight table (one character hash function). /* 40 */ 255, 255, /* 50 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, /* 60 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, /* 70 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, /* 80 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, /* 90 */ 255, 255, 255, 255, 255, 255, 255, 0, 28, 7, /* 100 */ 23, 0, 23, 14, 27, 8, 255, 255, 9, 1, /* 110 */ 31, 0, 9, 255, 6, 0, 0, 4, 33, 26, /* 120 */ 14, 0, 255 }; Token TokenHash::tokens[47] = { // Key table. Holds keys and associated data (order preserved.) /* 0 */ { "asm", TOKEN_ASM }, /* 1 */ { "auto", TOKEN_AUTO }, /* 2 */ { "break", TOKEN_BREAK }, /* 3 */ { "case", TOKEN_CASE }, /* 4 */ { "catch", TOKEN_CATCH }, /* 5 */ { "char", TOKEN_CHAR }, /* 6 */ { "class", TOKEN_CLASS }, /* 7 */ { "const", TOKEN_CONST }, /* 8 */ { "continue", TOKEN_CONTINUE }, /* 9 */ { "default", TOKEN_DEFAULT }, /* 10 */ { "delete", TOKEN_DELETE }, /* 11 */ { "do", TOKEN_DO }, /* 12 */ { "double", TOKEN_DOUBLE }, /* 13 */ { "else", TOKEN_ELSE }, /* 14 */ { "enum", TOKEN_ENUM }, /* 15 */ { "extern", TOKEN_EXTERN }, /* 16 */ { "float", TOKEN_FLOAT }, /* 17 */ { "for", TOKEN_FOR }, /* 18 */ { "friend", TOKEN_FRIEND }, /* 19 */ { "goto", TOKEN_GOTO }, /* 20 */ { "if", TOKEN_IF }, /* 21 */ { "inline", TOKEN_INLINE }, /* 22 */ { "int", TOKEN_INT }, /* 23 */ { "long", TOKEN_LONG }, /* 24 */ { "new", TOKEN_NEW }, /* 25 */ { "operator", TOKEN_OPERATOR }, /* 26 */ { "overload", TOKEN_OVERLOAD }, /* 27 */ { "private", TOKEN_PRIVATE }, /* 28 */ { "protected", TOKEN_PROTECTED }, /* 29 */ { "public", TOKEN_PUBLIC }, /* 30 */ { "register", TOKEN_REGISTER }, /* 31 */ { "return", TOKEN_RETURN }, /* 32 */ { "short", TOKEN_SHORT }, /* 33 */ { "signed", TOKEN_SIGNED }, /* 34 */ { "sizeof", TOKEN_SIZEOF }, /* 35 */ { "static", TOKEN_STATIC }, /* 36 */ { "struct", TOKEN_STRUCT }, /* 37 */ { "switch", TOKEN_SWITCH }, /* 38 */ { "template", TOKEN_TEMPLATE }, /* 39 */ { "this", TOKEN_THIS }, /* 40 */ { "typedef", TOKEN_TYPEDEF }, /* 41 */ { "union", TOKEN_UNION }, /* 42 */ { "unsigned", TOKEN_UNSIGNED }, /* 43 */ { "virtual", TOKEN_VIRTUAL }, /* 44 */ { "void", TOKEN_VOID }, /* 45 */ { "volatile", TOKEN_VOLATILE }, /* 46 */ { "while", TOKEN_WHILE } }; // Key characters must be in 0x30-0x7a range. Token* TokenHash::lookup(const char* key, size_t len) { using namespace std; register struct Token* p; register unsigned h = len; switch (h) { default: return 0; case 9: case 8: case 7: case 6: case 5: case 4: h += tokenHash.charWeights[(unsigned char)key[3]-48]; case 3: case 2: h += tokenHash.charWeights[(unsigned char)key[1]-48]; case 1: h += tokenHash.charWeights[(unsigned char)key[0]-48]; case 0: ; } if (h > 67) return 0; h = tokenHash.hashTable[h]; if (h == 47) return 0; p = &tokenHash.tokens[h]; if (*key == *p->text && !strcmp(key, p->text)) return p; return 0; } // Please notice that in expression `a[x-c]', where `a' is a static // array and `c' is a constant, `a-c' is a constant sub-expression, // so the offset should not incur additional runtime overhead. }
// Test for hashed lookup code generated by hashgen 1.0b. // Measures the average time of both successful and unsuccessful lookups. #include "cxx_keywords.h" #include <ctime> #include <iostream> #include <iomanip> using namespace std; using namespace Scanner; #ifndef CLK_TCK #define CLK_TCK CLOCKS_PER_SEC #endif int main() { clock_t tick; double time; unsigned count; unsigned i; // Hit test. If/goto overhead is negligible, so it's okay to test for // both correctness and speed at the same time. There is no sense in // checking the key pointed to by the lookup function's return value // after a successful search, since the only place where a non-null // pointer is returned is right after successful strcmp() or memcmp(). count = 0; tick = clock(); do { for (i = 0; i < 21276; i++) { if (TokenHash::lookup("asm", 3) == 0) goto err_miss; if (TokenHash::lookup("auto", 4) == 0) goto err_miss; if (TokenHash::lookup("break", 5) == 0) goto err_miss; if (TokenHash::lookup("case", 4) == 0) goto err_miss; if (TokenHash::lookup("catch", 5) == 0) goto err_miss; if (TokenHash::lookup("char", 4) == 0) goto err_miss; if (TokenHash::lookup("class", 5) == 0) goto err_miss; if (TokenHash::lookup("const", 5) == 0) goto err_miss; if (TokenHash::lookup("continue", 8) == 0) goto err_miss; if (TokenHash::lookup("default", 7) == 0) goto err_miss; if (TokenHash::lookup("delete", 6) == 0) goto err_miss; if (TokenHash::lookup("do", 2) == 0) goto err_miss; if (TokenHash::lookup("double", 6) == 0) goto err_miss; if (TokenHash::lookup("else", 4) == 0) goto err_miss; if (TokenHash::lookup("enum", 4) == 0) goto err_miss; if (TokenHash::lookup("extern", 6) == 0) goto err_miss; if (TokenHash::lookup("float", 5) == 0) goto err_miss; if (TokenHash::lookup("for", 3) == 0) goto err_miss; if (TokenHash::lookup("friend", 6) == 0) goto err_miss; if (TokenHash::lookup("goto", 4) == 0) goto err_miss; if (TokenHash::lookup("if", 2) == 0) goto err_miss; if (TokenHash::lookup("inline", 6) == 0) goto err_miss; if (TokenHash::lookup("int", 3) == 0) goto err_miss; if (TokenHash::lookup("long", 4) == 0) goto err_miss; if (TokenHash::lookup("new", 3) == 0) goto err_miss; if (TokenHash::lookup("operator", 8) == 0) goto err_miss; if (TokenHash::lookup("overload", 8) == 0) goto err_miss; if (TokenHash::lookup("private", 7) == 0) goto err_miss; if (TokenHash::lookup("protected", 9) == 0) goto err_miss; if (TokenHash::lookup("public", 6) == 0) goto err_miss; if (TokenHash::lookup("register", 8) == 0) goto err_miss; if (TokenHash::lookup("return", 6) == 0) goto err_miss; if (TokenHash::lookup("short", 5) == 0) goto err_miss; if (TokenHash::lookup("signed", 6) == 0) goto err_miss; if (TokenHash::lookup("sizeof", 6) == 0) goto err_miss; if (TokenHash::lookup("static", 6) == 0) goto err_miss; if (TokenHash::lookup("struct", 6) == 0) goto err_miss; if (TokenHash::lookup("switch", 6) == 0) goto err_miss; if (TokenHash::lookup("template", 8) == 0) goto err_miss; if (TokenHash::lookup("this", 4) == 0) goto err_miss; if (TokenHash::lookup("typedef", 7) == 0) goto err_miss; if (TokenHash::lookup("union", 5) == 0) goto err_miss; if (TokenHash::lookup("unsigned", 8) == 0) goto err_miss; if (TokenHash::lookup("virtual", 7) == 0) goto err_miss; if (TokenHash::lookup("void", 4) == 0) goto err_miss; if (TokenHash::lookup("volatile", 8) == 0) goto err_miss; if (TokenHash::lookup("while", 5) == 0) goto err_miss; } count++; time = (clock() - tick) / (double)CLK_TCK; } while ((time < 0.5) && (count < 1000000)); cout << "successful search time " << setw(6) << int(time * 1e9 / 21276 / 47 / count) << "ns (" << setw(0) << int(999972 * count / time) << " searches/sec)\n"; // Miss test. Characters of the test keys are random within range // from min-expected-char to max-expected-char, their length from 0 // to a bit more than maximum length of the existing keys. // The results should not be taken too seriously, the keys are too // random, so strcmp() or memcmp() is not called most of the time // because the calculated hash values are out of range. count = 0; tick = clock(); do { for (i = 0; i < 21276; i++) { if (TokenHash::lookup("<T", 2)) goto err_hit; if (TokenHash::lookup("hPHv]", 5)) goto err_hit; if (TokenHash::lookup("UlkW", 4)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup("h16QrP@", 7)) goto err_hit; if (TokenHash::lookup("=Wk", 3)) goto err_hit; if (TokenHash::lookup("`JN4?", 5)) goto err_hit; if (TokenHash::lookup("0", 1)) goto err_hit; if (TokenHash::lookup("d1[GsilcF5", 10)) goto err_hit; if (TokenHash::lookup("3jx>", 4)) goto err_hit; if (TokenHash::lookup("77EDG5i", 7)) goto err_hit; if (TokenHash::lookup("O<1GP", 5)) goto err_hit; if (TokenHash::lookup("w9P@9H", 6)) goto err_hit; if (TokenHash::lookup("v0Edr", 5)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup("??wFTtGZb", 9)) goto err_hit; if (TokenHash::lookup("bWycwd_jmp", 10)) goto err_hit; if (TokenHash::lookup(":", 1)) goto err_hit; if (TokenHash::lookup("@:mteniZg", 9)) goto err_hit; if (TokenHash::lookup("?fRm2e=", 7)) goto err_hit; if (TokenHash::lookup("6", 1)) goto err_hit; if (TokenHash::lookup("Ei", 2)) goto err_hit; if (TokenHash::lookup("b6mT_8G<H", 9)) goto err_hit; if (TokenHash::lookup("yB", 2)) goto err_hit; if (TokenHash::lookup("Vd", 2)) goto err_hit; if (TokenHash::lookup("BbGxnm0Xdz", 10)) goto err_hit; if (TokenHash::lookup("7x5n_pJl", 8)) goto err_hit; if (TokenHash::lookup("R8JjBHe7", 8)) goto err_hit; if (TokenHash::lookup("O?", 2)) goto err_hit; if (TokenHash::lookup("6?gt", 4)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup(";O8R", 4)) goto err_hit; if (TokenHash::lookup("6X2e6L@ZW", 9)) goto err_hit; if (TokenHash::lookup("tGDve", 5)) goto err_hit; if (TokenHash::lookup("R9CutRJ", 7)) goto err_hit; if (TokenHash::lookup("=SaE[9TaJ@", 10)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup("EEcmFFmCu", 9)) goto err_hit; if (TokenHash::lookup("", 0)) goto err_hit; if (TokenHash::lookup("5", 1)) goto err_hit; if (TokenHash::lookup("Z", 1)) goto err_hit; if (TokenHash::lookup("CfI", 3)) goto err_hit; if (TokenHash::lookup("\\lP", 3)) goto err_hit; if (TokenHash::lookup("MYKg\\Dl1Z6", 10)) goto err_hit; if (TokenHash::lookup("L6crI]N", 7)) goto err_hit; if (TokenHash::lookup("LQtKd", 5)) goto err_hit; } count++; time = (clock() - tick) / (double)CLK_TCK; } while ((time < 0.5) && (count < 1000000)); cout << "search failure time " << setw(6) << int(time * 1e9 / 21276 / 47 / count) << "ns (" << setw(0) << int(999972 * count / time) << " searches/sec)\n"; return 0; err_hit: cerr << "cxx_keywords.cxx: non-existing key found\n"; exit(1); err_miss: cerr << "cxx_keywords.cxx: key not found\n"; exit(1); }
Bellow is a frament of the assembly code generated by GCC 2.95.3
(i386, FreeBSD 4.4) for the following line of the test source:
if (TokenHash::lookup("catch", 5) == 0) goto err_miss;
The lookup function is inlined and called with constant arguments.
GCC did not generate the code to check length of the key at all since
it's known at compile time. .LC2
contains "catch", it's
long enough to include all 3 hash positions, for shorter strings
the compliler generated access only to included positions (lines
marked with "*").
movzbl .LC2+3,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%edx * movzbl .LC2+1,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax * leal 5(%eax,%edx),%edx movzbl .LC2,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax * addl %eax,%edx cmpl $67,%edx ja .L540 movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx cmpl $47,%edx je .L540 leal 0(,%edx,8),%eax leal _Q27Scanner9TokenHash$tokens(%eax),%ebx movl _Q27Scanner9TokenHash$tokens(%eax),%edx movb (%edx),%al cmpb %al,.LC2 jne .L577 addl $-8,%esp pushl %edx pushl $.LC2 call strcmp movl %eax,%edx addl $16,%esp movl %ebx,%eax testl %edx,%edx je .L567 .L577: xorl %eax,%eax .L567: testl %eax,%eax je .L540
If the lookup function is not inline and the invocation context is
not known, a jump table is generated for the switch on the key's length
(jmp *.L21(,%edx,4)
):
lookup__Q27Scanner9TokenHashPCcUi: pushl %ebp movl %esp,%ebp subl $20,%esp pushl %ebx movl 8(%ebp),%ecx movl 12(%ebp),%edx cmpl $9,%edx ja .L24 jmp *.L21(,%edx,4) .p2align 2,0x90 .section .rodata .p2align 2 .p2align 2 .L21: .long .L9 .long .L19 .long .L18 .long .L18 .long .L16 .long .L16 .long .L16 .long .L16 .long .L16 .long .L16 .text .p2align 2,0x90 .L16: movzbl 3(%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L18: movzbl 1(%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L19: movzbl (%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L9: cmpl $67,%edx ja .L24 movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx cmpl $47,%edx je .L24 leal 0(,%edx,8),%eax leal _Q27Scanner9TokenHash$tokens(%eax),%ebx movl _Q27Scanner9TokenHash$tokens(%eax),%edx movb (%edx),%al cmpb %al,(%ecx) jne .L24 addl $-8,%esp pushl %edx pushl %ecx call strcmp testl %eax,%eax jne .L24 movl %ebx,%eax jmp .L26 .p2align 2,0x90 .L24: xorl %eax,%eax .L26: movl -24(%ebp),%ebx leave ret
hashgen
The keys may be of any reasonable length, but only the first 32 characters are used for hashing, they must be unique, otherwise the generated hash function will not be perfect.
hashgen
consumes lots of RAM when the keyset is large (~20Mb
for 3,000 keys, for larger keysets it switches to a slower and less
memory-hungry mode (~4Mb for 10,000 keys)).
This estimation is very inaccurate, the actual figures depend on the
average length of a key and its assotiated data, character code range,
etc.
There are many misspellings, misplaced definite articles, and other mistakes and inaccuracies in this manual. I would be grateful to anyone willing to help me to clean this mess up.
The hashgen
utility was written by Vladimir Shiryaev.
The program does basically the same thing and uses the same
basic algorithm (with a few improvements) as the gperf
utility by Douglas C. Schmidt.
The differencies in functionality are briefly outlined in
the Overview Chapter (Overview).
As Doug says in the gperf
manual, the general idea for the
perfect hash function generator was inspired by Keith Bostic's
algorithm, created at the University of California, Irvine.