Merge pull request #855 from Unidata/newhash0.dmh
Higher performance hash for metadata: step 0
WardF committed Feb 22, 2018
2 parents a242afe + 8e16917 commit fc6ab98
Showing 33 changed files with 2,761 additions and 620 deletions.
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -1277,8 +1277,10 @@ CHECK_TYPE_SIZE("_Bool" SIZEOF__BOOL)

CHECK_TYPE_SIZE("size_t" SIZEOF_SIZE_T)


CHECK_TYPE_SIZE("ssize_t" HAVE_SSIZE_T)


# __int64 is used on Windows for large file support.
CHECK_TYPE_SIZE("__int64" SIZEOF___INT_64)
CHECK_TYPE_SIZE("int64_t" SIZEOF_INT64_T)
2 changes: 1 addition & 1 deletion cf
@@ -1,6 +1,6 @@
#!/bin/bash
#NB=1
-DB=1
+#DB=1
#X=-x
FAST=1

6 changes: 3 additions & 3 deletions cf.cmake
@@ -1,9 +1,9 @@
# Visual Studio

# Is netcdf-4 and/or DAP enabled?
-NC4=1
-#DAP=1
-#CDF5=1
+#NC4=1
+DAP=1
+CDF5=1
#HDF4=1

case "$1" in
3 changes: 3 additions & 0 deletions config.h.cmake.in
@@ -559,6 +559,9 @@ are set when opening a binary file on Windows. */
/* Define to `unsigned int' if <sys/types.h> does not define. */
#cmakedefine size_t unsigned int

/* Define to `unsigned long if <sys/types.h> does not define. */
#cmakedefine uintptr_t unsigned long

/* Define strcasecmp, snprintf on Win32 systems. */
#ifdef _WIN32
#ifndef HAVE_STRCASECMP
1 change: 1 addition & 0 deletions docs/Doxyfile.in
@@ -750,6 +750,7 @@ INPUT = \
@abs_top_srcdir@/docs/install-fortran.md \
@abs_top_srcdir@/docs/types.dox \
@abs_top_srcdir@/docs/internal.dox \
@abs_top_srcdir@/docs/indexing.dox \
@abs_top_srcdir@/docs/windows-binaries.md \
@abs_top_srcdir@/docs/guide.dox \
@abs_top_srcdir@/docs/OPeNDAP.dox \
2 changes: 1 addition & 1 deletion docs/Makefile.am
@@ -9,7 +9,7 @@ mainpage.dox tutorial.dox guide.dox types.dox cdl.dox \
architecture.dox internal.dox windows-binaries.md \
building-with-cmake.md CMakeLists.txt groups.dox install.md notes.md \
install-fortran.md all-error-codes.md credits.md auth.md \
-obsolete/fan_utils.html bestpractices.md filters.md
+obsolete/fan_utils.html bestpractices.md filters.md indexing.dox

# Turn off parallel builds in this directory.
.NOTPARALLEL:
218 changes: 218 additions & 0 deletions docs/indexing.dox
@@ -0,0 +1,218 @@
/** \file

\internal

\page nchashmap Indexed Access to Metadata Objects

\tableofcontents

The original internal representations of metadata in memory
relied on linear searching of lists to locate various objects
by name or by numeric id (e.g. varid or grpid).

In recent years, the flaws in that approach have become obvious
as users create files with extremely large numbers of objects:
groups, variables, attributes, and dimensions. One case
has 14 megabytes of metadata. Creating and (especially) later
opening such files was exceedingly slow.

This problem was partially alleviated in both netcdf-3 (libsrc)
and netcdf-4 (libsrc4) by adding name hashing tables.
However, and especially for netcdf-4, linear search still prevailed.

A pervasive change has been made to try to remove (almost) all
occurrences of linear search and replace it with either hashing
(for name-based lookup) or vectors (for numeric id-based
lookup). The cases left as linear search include these.

1. Enum constants for an enumeration
2. Dimensions associated with a variable
3. Fields of Compound types

This document describes the architecture and details of the netCDF
internal object lookup mechanisms now in place.

\section S1 Indexed Searches

There are, as a rule, two searches that are used to locate
a metadata object: (1) search by name and (2) search by
externally visible id (e.g. dimid or varid).

Currently, and after all the metadata is read or created,
hashing is used for locating objects by name. In all other
cases -- apparently -- lookup is by linear search of some
kind of linked list or a vector.

It is relevant that, once created, no metadata object -- except
attributes -- can be deleted. They can be renamed, but that
does not change the associated id. Deletion only occurs when an
error occurs in creating an object or on invoking nc_close.

The numeric identifiers for dimensions, types, and groups are
all globally unique across a file. But note that variable id's
are not globally unique (IMO a bad design decision) but are only
unique within the containing group. Thus, in order to provide a
unique id for a variable it must be composed of the containing
group id plus the variable id.

Note also that names are unique only within a group and with respect
to some kind of metadata. That is, a group cannot have, e.g., two
dimensions with the same name. But it can have a variable and a dimension
with the same name (as with coordinate variables).

Finally, attribute names are unique only with respect to each other
and with respect to the containing object (a variable or a group).

\section S2 Basic Data Structures

The basic data structures used by the new lookup mechanisms
are described in the following sections.

\subsection SS1_1 NClist

With rare exceptions, vectors of objects are maintained as
instances of NClist, which provides a dynamically extendible
vector of pointers: pointers to metadata objects in this case.
It is possible to append new objects or insert at a specific
vector offset, or overwrite an existing pointer at a specific
offset.

The definition is as follows.

\code
typedef struct NClist {
size_t alloc;
size_t length;
void** content;
} NClist;
\endcode
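
To illustrate how such a vector behaves, here is a minimal sketch of a
growable pointer vector with the same three fields. The DemoList type and
the demo_list_* functions are hypothetical names for this illustration and
are not the library's NClist API; the doubling growth policy is likewise
an assumption.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical stand-in for NClist; same three fields as shown above.
typedef struct DemoList {
    size_t alloc;   // slots allocated in content
    size_t length;  // slots currently in use
    void** content; // the pointer vector itself
} DemoList;

// Append a pointer, doubling the allocation when the vector is full.
// Returns 1 on success, 0 on allocation failure.
static int demo_list_push(DemoList* l, void* elem)
{
    if (l->length == l->alloc) {
        size_t newalloc = (l->alloc == 0 ? 4 : 2 * l->alloc);
        void** p = realloc(l->content, newalloc * sizeof(void*));
        if (p == NULL) return 0;
        l->content = p;
        l->alloc = newalloc;
    }
    l->content[l->length++] = elem;
    return 1;
}

// Fetch the pointer at a given offset, or NULL if the offset is out of range.
static void* demo_list_get(const DemoList* l, size_t i)
{
    return (i < l->length ? l->content[i] : NULL);
}

// Overwrite the pointer at an existing offset; returns 0 if out of range.
static int demo_list_set(DemoList* l, size_t i, void* elem)
{
    if (i >= l->length) return 0;
    l->content[i] = elem;
    return 1;
}
```

A zero-initialized DemoList is a valid empty vector; the caller releases
the content field with free() when done.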


\subsection SS1_2 NC_hashmap

The NC_hashmap type is a hash table mapping a name to a pointer.
As a rule, the pointer points to a metadata object. The current
implementation supports table expansion when the number of entries in
the table starts to get too large. Basically a simple linear
rehash is used for collisions and no separate hash-chain is
used. This means that when expanded, it must be completely
rebuilt. The performance hit for this has yet to be determined.

The hashtable definition is as follows.

\code
typedef struct NC_hashmap {
size_t size;
size_t count;
NC_hentry* table;
} NC_hashmap;
\endcode

where size is the current allocated size and count is the
number of active entries in the table. The "table" field is
a vector of entries of this form.

\code
typedef struct NC_hentry {
int flags;
void* data;
size_t hashkey; /* Hash id */
char* key; /* actual key; do not free */
} NC_hentry;
\endcode

The flags indicate the state of the entry and can be one of three states:

1. ACTIVE - there is an object referenced in this entry
2. DELETED - an entry was deleted, but must be marked so
that linear rehash will work.
3. EMPTY - unused
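
The role of the DELETED state is easiest to see in code. The following is
a minimal sketch of linear rehash over a fixed-size table, assuming the
entry layout shown above. All demo_* names, the DEMO_* flag values, and
the toy hash function are hypothetical and are not taken from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical entry states and entry layout, modeled on the text above.
enum { DEMO_EMPTY = 0, DEMO_ACTIVE = 1, DEMO_DELETED = 2 };

typedef struct DemoEntry {
    int flags;
    void* data;
    size_t hashkey;
    const char* key; // borrowed pointer; the table never frees it
} DemoEntry;

// Toy string hash for the sketch; the library's real hash function differs.
static size_t demo_hash(const char* s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// Probe linearly from the home slot. DELETED slots are stepped over because
// the wanted key may lie beyond them; an EMPTY slot ends the search.
static DemoEntry* demo_lookup(DemoEntry* table, size_t size, const char* key)
{
    size_t h = demo_hash(key);
    size_t i;
    for (i = 0; i < size; i++) {
        DemoEntry* e = &table[(h + i) % size];
        if (e->flags == DEMO_EMPTY) return NULL;
        if (e->flags == DEMO_ACTIVE && e->hashkey == h
            && strcmp(e->key, key) == 0)
            return e;
    }
    return NULL; // every slot was visited and the key is absent
}

// Insert into the first non-ACTIVE slot on the probe path.
// Assumes the key is not already present. Returns 1 on success, 0 if full.
static int demo_insert(DemoEntry* table, size_t size, const char* key, void* data)
{
    size_t h = demo_hash(key);
    size_t i;
    for (i = 0; i < size; i++) {
        DemoEntry* e = &table[(h + i) % size];
        if (e->flags != DEMO_ACTIVE) {
            e->flags = DEMO_ACTIVE;
            e->data = data;
            e->hashkey = h;
            e->key = key;
            return 1;
        }
    }
    return 0;
}

// Mark an entry DELETED rather than EMPTY so that later probes keep going.
static int demo_remove(DemoEntry* table, size_t size, const char* key)
{
    DemoEntry* e = demo_lookup(table, size, key);
    if (e == NULL) return 0;
    e->flags = DEMO_DELETED;
    return 1;
}
```

If demo_remove reset the slot to EMPTY instead, any key that had been
pushed past that slot by a collision would become unreachable.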

There is an important WARNING with respect to the "key" field.
The key is not a copy of the object's name, but in fact is a duplicate
pointer to that same string. This means (1) that it should never be
free()'d and (2) if the name of the metadata object is changed, then
it must be removed and re-inserted into the table so that the key
points to the current name.

The "data" field is of type void*. Often it is a pointer to an instance
of a variable, or dimension, or other object. When used as part of an
NC_listmap (see below), the data field is instead an integer index into the
associated vector. In order to do this correctly, we need to rely
on the type "uintptr_t". It is supposed to be the case
that a value of type uintptr_t is an integer of sufficient size to
hold a void* pointer. Usually, but not always, this would be the same
size as an "unsigned long" value. Using this allows the hashtable
to store either pointers or integer indices.
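
As a sketch of that idea, the two helpers below pack a vector index into
a void* slot and recover it again. The function names are hypothetical.
Note the assumption: C only guarantees the pointer-to-integer-to-pointer
round trip through uintptr_t, so storing an arbitrary integer this way
relies on the common case where uintptr_t is at least as wide as the
index type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Pack a vector index into a void* slot by way of uintptr_t.
static void* demo_index_to_data(size_t index)
{
    return (void*)(uintptr_t)index;
}

// Recover the index from a slot that was filled by demo_index_to_data.
static size_t demo_data_to_index(void* data)
{
    return (size_t)(uintptr_t)data;
}
```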

One further WARNING: any object that will be inserted into an NC_hashmap
must have its name as the first field so it can be cast
to char** for use with the hashtable.
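
The following sketch shows why the first-field rule makes the cast work.
The two struct types and the demo_name_of function are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <string.h>

// Two hypothetical metadata objects; each keeps its name as the FIRST field.
typedef struct DemoDim { char* name; size_t len; } DemoDim;
typedef struct DemoVar { char* name; int ndims; } DemoVar;

// Read the name of any such object without knowing its concrete type.
// This is valid only because a pointer to a struct, suitably converted,
// also points to the struct's first member.
static const char* demo_name_of(void* obj)
{
    return *(char**)obj;
}
```

If the name were moved to any other position in the struct, the same cast
would read whatever field happened to come first.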

\subsection SS1_3 NC_listmap

A listmap is a combination of an NClist and an NC_hashmap.
It is used to provide name-based lookup with respect to a
specific list of metadata objects. For example, the subgroups
of a group are stored using a listmap, where the list is a
vector of pointers to the subgroup objects and the hashmap
maps the subgroup name (unique to that group, remember) to
the corresponding index into the vector. In theory, only
the hashmap is needed because it could be walked to get all
of the metadata objects. However, the creation order is sometimes
important, so that is maintained by the vector.
This is especially important for attribute storage.
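
A minimal sketch of the combination follows. It uses fixed-size arrays to
stay short, where the real structure grows dynamically, and every name in
it (DemoListmap, demo_listmap_*, the toy hash) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CAP 8     // fixed capacity keeps the sketch short
#define DEMO_SLOTS 16  // hash slots; deliberately more than DEMO_CAP

// Hypothetical object type; only the name matters here.
typedef struct DemoObj { const char* name; } DemoObj;

typedef struct DemoListmap {
    DemoObj* list[DEMO_CAP];  // creation order: list[i] is the i-th object added
    size_t length;
    size_t slot[DEMO_SLOTS];  // holds index+1 into list; 0 means empty
} DemoListmap;

// Toy string hash for the sketch only.
static size_t demo_lm_hash(const char* s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// Append obj to the vector and record name -> index in the hash slots.
// Assumes the name is not already present. Returns the index, or -1 if full.
static int demo_listmap_add(DemoListmap* lm, DemoObj* obj)
{
    size_t h, i;
    if (lm->length == DEMO_CAP) return -1;
    h = demo_lm_hash(obj->name);
    for (i = 0; i < DEMO_SLOTS; i++) {
        size_t s = (h + i) % DEMO_SLOTS;
        if (lm->slot[s] == 0) {
            lm->list[lm->length] = obj;
            lm->slot[s] = ++lm->length; // store index+1
            return (int)(lm->length - 1);
        }
    }
    return -1;
}

// Name-based lookup: hash slots give an index, the vector gives the object.
static DemoObj* demo_listmap_get(const DemoListmap* lm, const char* name)
{
    size_t h = demo_lm_hash(name);
    size_t i;
    for (i = 0; i < DEMO_SLOTS; i++) {
        size_t s = (h + i) % DEMO_SLOTS;
        if (lm->slot[s] == 0) return NULL;
        if (strcmp(lm->list[lm->slot[s] - 1]->name, name) == 0)
            return lm->list[lm->slot[s] - 1];
    }
    return NULL;
}

// Index-based lookup bypasses the hash and uses the vector directly.
static DemoObj* demo_listmap_ith(const DemoListmap* lm, size_t i)
{
    return (i < lm->length ? lm->list[i] : NULL);
}
```

Walking list[0] through list[length-1] yields the objects in creation
order, which the hash slots alone could not provide.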

Note that currently, NC_listmap is only used in libsrc4,
but if performance issues warrant, it will also be used in
libsrc.

\section S3 Global Object Access

As mentioned, dimension, group, and type external id's (dimid,
grpid, typeid) are unique across the whole file. It is therefore
convenient to store in memory a per-file vector for each object
type such that the external id of the object is the same as the
position of that object in the corresponding per-file
vector. This makes lookup by external id efficient.
Note that this was already the case for netcdf-3 (libsrc) so
this is a change for libsrc4 only.

The global set of dimensions, types, and groups is maintained by
three instances of NClist in the NC_HDF5_FILE_INFO structure:
alldims, alltypes, and allgroups.
The position of the object within the corresponding list determines
the object's external id. Thus, the position of a dimension object within the
"alldims" field of the file structure determines its dimid. Similarly
for types and groups.
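
The position-equals-id rule can be sketched as follows for dimensions.
The DemoFile and DemoDim types, the fixed capacity, and the demo_*
functions are hypothetical; only the field name alldims is borrowed from
the description above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXDIMS 16 // fixed capacity keeps the sketch short

typedef struct DemoDim {
    const char* name;
    size_t len;
    int dimid; // external id, assigned on registration
} DemoDim;

typedef struct DemoFile {
    DemoDim* alldims[DEMO_MAXDIMS]; // per-file vector of all dimensions
    size_t ndims;
} DemoFile;

// Register a dimension; its external id is its position in the vector.
// Returns the new dimid, or -1 if the vector is full.
static int demo_add_dim(DemoFile* f, DemoDim* d)
{
    if (f->ndims == DEMO_MAXDIMS) return -1;
    d->dimid = (int)f->ndims;
    f->alldims[f->ndims++] = d;
    return d->dimid;
}

// Lookup by external id is a bounds check plus one array access.
static DemoDim* demo_find_dim(const DemoFile* f, int dimid)
{
    if (dimid < 0 || (size_t)dimid >= f->ndims) return NULL;
    return f->alldims[dimid];
}
```

Because objects other than attributes are never deleted once created, an
id assigned this way stays valid for the lifetime of the open file.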

\section S4 Per-Group Object Access

Each group object (NC_GRP_INFO) contains four
instances of NC_listmap. One is for dimensions, one is for
types, one is for subgroups, and one is for variables. A
listmap is used for two reasons. First, it allows name-based lookup
for these items. Second, the declaration order is maintained by
the list within the listmap's vector. Note that the position of
an object in a group listmap vector has no necessary
relationship to the position of that object within the global
vectors. Note also that there is no global vector for variables
because variable external ids are unique only within the
group. In this special case, the external id for the variable is
the same as its offset in the listmap's vector for the group.

A note about typeids. Since user defined types have an external
id starting at NC_FIRSTUSERTYPEID, we leave the global type
vector entries 0..NC_FIRSTUSERTYPEID-1 empty.

\section S5 Exceptions

NC_listmap is currently not used for enum constants and compound fields.
Additionally, it is not used for listing the dimensions associated
with a variable.

References between meta-data objects (e.g. group parent or
containing group) are stored directly and not using any kind
of vector or hashtable.

*/
2 changes: 1 addition & 1 deletion examples/C/filter_example.c
@@ -214,7 +214,7 @@ test_bzip2(void)
/* Show chunking */
printf("show chunks:");
for(i=0;i<actualdims;i++)
-printf("%s%ld",(i==0?" chunks=":","),chunks[i]);
+printf("%s%ld",(i==0?" chunks=":","),(unsigned long)chunks[i]);
printf("\n");

/* prepare to write */
2 changes: 1 addition & 1 deletion include/Makefile.am
@@ -21,7 +21,7 @@ ncuri.h ncutf8.h ncdispatch.h ncdimscale.h netcdf_f.h err_macros.h \
ncbytes.h nchashmap.h ceconstraints.h rnd.h nclog.h ncconfigure.h \
nc4internal.h nctime.h nc3internal.h onstack.h nc_hashmap.h ncrc.h \
ncauth.h ncoffsets.h nctestserver.h nc4dispatch.h nc3dispatch.h \
-ncexternl.h ncwinpath.h ncfilter.h hdf4dispatch.h
+ncexternl.h ncwinpath.h ncfilter.h hdf4dispatch.h nclistmap.h

if USE_DAP
noinst_HEADERS += ncdap.h
28 changes: 13 additions & 15 deletions include/nc3internal.h
@@ -55,7 +55,7 @@ typedef enum {
NC_ATTRIBUTE = 12
} NCtype;


#ifndef NEWHASHMAP /* temporary hack until new hash is complete */
/*! Hashmap-related structs.
NOTE: 'data' is the dimid or varid which is non-negative.
we store the dimid+1 so a valid entry will have
@@ -72,14 +72,14 @@ typedef struct s_hashmap {
unsigned long size;
unsigned long count;
} NC_hashmap;

#endif

/*
* NC dimension structure
*/
typedef struct {
/* all xdr'd */
NC_string *name;
NC_string* name;
size_t size;
} NC_dim;

@@ -124,19 +124,21 @@ elem_NC_dimarray(const NC_dimarray *ncap, size_t elem);
*/
typedef struct {
size_t xsz; /* amount of space at xvalue */
/* below gets xdr'd */
/* begin xdr */
NC_string *name;
nc_type type; /* the discriminant */
size_t nelems; /* length of the array */
void *xvalue; /* the actual data, in external representation */
/* end xdr */
} NC_attr;

typedef struct NC_attrarray {
size_t nalloc; /* number allocated >= nelems */
/* below gets xdr'd */
/* begin xdr */
/* NCtype type = NC_ATTRIBUTE */
size_t nelems; /* length of the array */
NC_attr **value;
/* end xdr */
} NC_attrarray;

/* Begin defined in attr.c */
@@ -177,31 +179,28 @@ typedef struct NC_var {
size_t xsz; /* xszof 1 element */
size_t *shape; /* compiled info: dim->size of each dim */
off_t *dsizes; /* compiled info: the right to left product of shape */
/* below gets xdr'd */
NC_string *name;
/* begin xdr */
NC_string* name;
/* next two: formerly NC_iarray *assoc */ /* user definition */
size_t ndims; /* assoc->count */
int *dimids; /* assoc->value */
NC_attrarray attrs;
nc_type type; /* the discriminant */
size_t len; /* the total length originally allocated */
off_t begin;
/* end xdr */
int no_fill; /* whether fill mode is ON or OFF */
} NC_var;

typedef struct NC_vararray {
size_t nalloc; /* number allocated >= nelems */
/* below gets xdr'd */
/* begin xdr */
/* NCtype type = NC_VARIABLE */
size_t nelems; /* length of the array */
NC_hashmap *hashmap;
NC_var **value;
NC_hashmap *hashmap;
NC_var **value;
} NC_vararray;

/* Begin defined in lookup3.c */

/* End defined in lookup3.c */

/* Begin defined in var.c */

extern void
@@ -267,7 +266,6 @@ extern void NC_hashmapDelete(NC_hashmap*);

/* end defined in nc_hashmap.c */


#define IS_RECVAR(vp) \
((vp)->shape != NULL ? (*(vp)->shape == NC_UNLIMITED) : 0 )
