Array Systems Assessment

This section lists the systems under test, the criteria used, and the test data sets. Additionally, for each system one or more "champions" are indicated who will perform the actual tests. As a side effect, the evaluations will yield tuning parameters and other best practices, collected at the end of this section.

Note: Several systems today are capable of integrating external code (e.g., SciDB, rasdaman). Therefore, it is indispensable to state clearly for each functionality feature whether or not it is an integral part implemented in the core engine.

Functional Comparison

Logical Model

This is functionality the user (i.e., query writer) sees.

Data model expressiveness:

  • number of dimensions: how many dimensions can an array have? Today, 3-D x/y/t image timeseries and x/y/z voxel cubes are prominent, but so are 4-D x/y/z/t gas and fluid simulations, such as atmospheric weather predictions. However, other dimensions occur as well: 1-D and 2-D data appear not only standalone (as sensor and image data, resp.), but also as extraction results from any-dimensional datacubes (such as a pixel's history or image time slices). Higher dimensions occur regularly, too: climate modelers like to think in 5-D cubes (with a second time axis), and statistical datacubes can have a dozen dimensions. Any array engine should support this spectrum of dimensions.
  • extensibility of extent along dimensions: can an existing array be extended along each dimension's lower and upper bound? Imagine a map has been defined for a country and is now to be extended to cover the whole continent. This means: every axis must be extensible, on both its lower and its upper bound.
  • cell data types: support for numeric data types, for composite cells (e.g., red/green/blue pixels), etc. While radar imagery consists of single values (complex numbers), satellite images may have dozens or even hundreds of "bands". Climate modelers consider 50 or more "variables" for each location in the atmosphere, indicating measures like temperature, humidity, wind speed, trace gases, etc. (A cell-type sketch follows this list.)
  • null values: is there support for null values? For a single null value vs. several null values? Proper treatment of null values in operations? Null values are well known in databases, and scientific data definitely require them, too. However, instrument observations typically involve more than one null value (such as "value unknown", "value out of range", "no value delivered", etc.), and these meanings typically are piggybacked on some value from the data type (such as -9999 for "unknown depth"). Such null values should be supported by array databases, too, and operations must treat them appropriately so that they do not falsify results. (A null-value sketch follows this list.)
  • data integration: can queries integrate array handling with data represented in another model, such as relational tables, XML stores, RDF stores, or others? This is important, e.g., for data/metadata integration: arrays never come standalone, but are ornamented with metadata critically contributing to their semantics. Such metadata typically already reside under orderly data management (much more so than the arrays themselves, traditionally), frequently utilizing some well-known data model.
  • domain specificity: is the system tied to a particular domain? Array databases per se are domain independent and, hence, can be used in all application domains where arrays occur. However, some systems have been crafted with a particular domain in mind, such as geo datacubes, and consequently may be less applicable to other domains, such as medical imagery.
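
To make the composite-cell criterion above concrete, here is a minimal sketch in Python with NumPy as a neutral stand-in for an array engine; it illustrates the data model only and is not the API of any particular system under test:

    import numpy as np

    # A 2-D array whose cells are records of three 8-bit channels,
    # modeling red/green/blue pixels of an image.
    rgb = np.dtype([("r", np.uint8), ("g", np.uint8), ("b", np.uint8)])
    img = np.zeros((1024, 768), dtype=rgb)

    # Individual "bands" can be addressed and combined, e.g., band math:
    diff = img["r"].astype(float) - img["g"].astype(float)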
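
Likewise, a sketch for the null-value criterion; NumPy masked arrays merely emulate the intended semantics here:

    import numpy as np
    import numpy.ma as ma

    # Depth soundings where -9999 means "unknown depth" and -8888 means
    # "no value delivered"; both meanings must act as nulls.
    raw = np.array([12.5, -9999.0, 47.1, -8888.0, 3.3])
    nulls = (raw == -9999.0) | (raw == -8888.0)
    depth = ma.masked_array(raw, mask=nulls)

    # Operations must not let nulls falsify results:
    print(depth.mean())  # 20.966..., computed over the valid cells only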

Processing model expressiveness

  • query language expressiveness (built-in): This section investigates functionality which is readily available through the primary query language and directly supported by the system (i.e., not through extension mechanisms).
    • formal semantics: is there a mathematical semantics definition underlying the data and query model? While this may seem an academic exercise, a formal semantics is indispensable to verify that the set of functionality provided is sufficiently complete (for a particular requirements set), consistent, and without gaps. Practically speaking, a well-defined semantics enables safe machine-to-machine communication, such as automatic query generation without human intervention.
    • declarative: does the system offer a high-level, declarative query language? Low-level procedural languages (such as C, C++, Java, Python, etc.) have several distinct disadvantages: (i) they force users to write down concrete algorithms rather than just describing the intended result; (ii) they constrain the server in its potential for optimizing queries; (iii) declarative code, in contrast, can be analyzed by the server, e.g., to estimate costs and, based on this, enforce quotas; (iv) a server accepting arbitrary procedural code has a substantial security hole. SQL still is the role model for declarative languages.
    • optimizable: can queries be optimized in the server to achieve performance improvements? What techniques are available? Procedural code typically is hard to optimize on the server side, except for "embarrassingly parallel" operations, i.e., operations where parallelization is straightforward. Declarative languages usually open up vistas for more complex optimizations, such as query rewriting, query splitting, etc. (See also the discussion on system architectures later.)
    • subsetting (trim, slice) operations: can arrays be subset along all dimensions in one request? Extraction of sub-arrays is the most fundamental operation on arrays. Trimming means reducing the extent by indicating new lower and upper bounds (which both lie inside the array under inspection), whereas slicing means extracting a slab at a particular position on an axis. Hence, trimming keeps the number of dimensions in the output while slicing reduces it; for example, a trim in x and y plus a slice in t would extract, from a 4-D x/y/z/t datacube, a 3-D x/y/z timeslice. Systems must support server-side trimming and slicing on any number of dimensions simultaneously to avoid transporting excessive amounts of data. (A trim/slice sketch follows this list.)
    • common arithmetic, Boolean, trigonometric operations, etc.: can all (unary and binary) operations which are available on the cell types known to the system also be applied element-wise to arrays? Example: a+b is defined on numbers, so A+B should be possible on arrays. (See the element-wise sketch after this list.)
    • array construction: can new arrays be created in the database (as opposed to creating arrays only by importing files)? For example, a histogram is a 1-D array derived from some other array(s). (Illustrated in the element-wise sketch after this list.)
    • aggregation operations: can aggregates be derived from an array, supporting common operations like sum, average, min, max? Can an aggregation query deliver scalars, aggregated arrays, or both? Note that aggregation does not always deliver just a single number: aggregation may well involve only selected axes, hence return a (lower-dimensional) array as a result. (See the aggregation sketch after this list.)
    • array joins: can two or more arrays be combined into a result array? Can they have different dimensions, extents, cell types? While such functionality is indispensable (think of overlaying two map images), it is nontrivial to implement (think of diverging array partitioning schemes), hence not supported by all systems. (See the overlay sketch after this list.)
    • Tomlin's Map Algebra support: are local, focal, zonal, and global operations expressible in queries? Essentially, this allows arithmetic expressions as array indexes, such as in "a[x+1] - a[x-1]". Image filtering and convolution are maybe the most prominent applications of such addressing, but there are many important operations requiring sophisticated array cell access. (See the focal-operation sketch after this list.)
  • external function invocation (also called UDF, User-Defined Functions): can external code be linked into the server at runtime so that this code can be invoked from within the query language? Commonly, array query languages are restricted in their expressiveness to remain "safe in evaluation". Operations which are more complex, or for which code already exists, can be implemented through UDFs, that is: server-side code external to the DBMS which gets linked into the server at invocation time. Obviously, UDFs can greatly enhance DBMS functionality, e.g., by adding domain-specific functionality. Some systems even implement core array functionality via UDFs. To avoid confusion we list built-in and UDF-enabled functionality separately. (See the UDF sketch after this list.)
  • import / export capabilities
    • data formats: what data formats are supported, and to what degree?
    • what mechanisms exist to deal with inconsistent and incomplete import data?
    • how selectively can array cells be updated?
  • client interfaces
    • domain-independent interfaces: which domain-independent clients exist for sending queries and presenting results?
    • domain-specific interfaces: which domain-specific clients exist for sending queries and presenting results?
  • functionality beyond arrays: can queries perform operations transcending the array paradigm?
    • polygon/raster clipping: 2D? nD?
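
The following sketches illustrate several of the above criteria in Python with NumPy as a neutral stand-in; they show the intended semantics only, not the query syntax of any particular system. First, trimming versus slicing on a 4-D x/y/z/t datacube:

    import numpy as np

    cube = np.zeros((40, 40, 20, 365))   # 4-D x/y/z/t datacube

    # Trimming: new lower/upper bounds on x and y; dimensionality is kept.
    trimmed = cube[10:20, 30:40, :, :]   # still 4-D: (10, 10, 20, 365)

    # Slicing: extract the slab at t=180; dimensionality is reduced.
    sliced = cube[:, :, :, 180]          # now 3-D: (40, 40, 20)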
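
Next, element-wise application of operations defined on cell types, together with array construction, here deriving a histogram as a new 1-D array:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(0, 256, size=(512, 512))
    B = rng.integers(0, 256, size=(512, 512))

    # a+b is defined on numbers, so A+B applies it to each pair of cells:
    C = A + B

    # Array construction: a histogram is a new 1-D array derived from A.
    hist, _ = np.histogram(A, bins=256, range=(0, 256))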
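
Aggregation may deliver a scalar or, when only selected axes are aggregated, a lower-dimensional array:

    import numpy as np

    cube = np.random.default_rng(1).random((100, 100, 365))  # x/y/t

    total = cube.sum()              # full aggregation: a single scalar
    daily = cube.mean(axis=(0, 1))  # aggregate x and y only:
                                    # a 1-D array of 365 daily means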
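
An array join combines two or more arrays cell-wise into a result array, e.g., overlaying a map layer with a cloud mask:

    import numpy as np

    land = np.random.default_rng(2).random((1000, 1000))
    cloud = np.random.default_rng(3).random((1000, 1000)) > 0.8

    # Overlay join: keep the land value except where the mask is set.
    overlay = np.where(cloud, np.nan, land)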
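
A focal operation in Tomlin's sense requires arithmetic expressions as array indexes; the central difference "a[x+1] - a[x-1]" can be sketched as follows (an n×n Sobel filter generalizes this to 2-D neighborhoods):

    import numpy as np

    a = np.random.default_rng(4).random(100)

    # Focal operation: each output cell combines neighboring input cells,
    # addressed via the index expressions a[x+1] and a[x-1].
    diff = np.empty(98)
    for x in range(1, 99):
        diff[x - 1] = a[x + 1] - a[x - 1]

    # Equivalently, vectorized:
    assert np.allclose(diff, a[2:] - a[:-2])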
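
Finally, for the UDF criterion, a sketch of the kind of domain-specific code a server might link in; the function is purely illustrative, and the registration mechanism is deliberately left out since it is system-specific:

    import numpy as np

    # Hypothetical UDF body: NDVI, a common vegetation index. How such
    # code gets registered and linked into a server differs per system;
    # no concrete extension API is implied here.
    def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
        return (nir - red) / (nir + red)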

Physical model

This is what the administrator sees, or what remains invisible altogether but makes the system more efficient.

  • Tuning Parameters (accessible to the administrator)
    • partitioning
    • compression
    • distribution
    • caching
    • other
  • Optimization techniques (built-in and invisible)
    • query rewriting
    • cost-based optimization
    • other

System assessment

tbd: per system, an assessment against the criteria list. If too long, spawn subpages. Make sure every claim is supported by a reference / demo feature available online.

Synoptic feature table

tbd: In a final consolidation step, a synoptic feature table will be established.

Architectural Comparison

  • storage organization
    • does the system support partitioning (tiling, chunking) of arrays?
    • does the system support non-regular tiling schemes? Which ones?
    • what mechanisms does the system support for managing data partitioning?
    • can tiles of an array reside on separate computers, while the system maintains a logically integrated view on the array? (See the placement sketch after this list.)
    • can the system process data maintained externally, not controlled by the DBMS?
  • parallelism
    • which parallelization mechanisms does the system support: local single-thread vs. multicore-local vs. multinode-cluster/cloud vs. federation?
    • does the system have a single point of failure?
  • security
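
As an illustration of the distribution criterion above, the following Python sketch assigns tiles of a logically integrated array to cluster nodes; the hash-based placement is a generic assumption for illustration, not the strategy of any specific system:

    # Hypothetical placement: map each tile's grid coordinate to a node.
    def node_for_tile(tile_coord, num_nodes):
        return hash(tile_coord) % num_nodes

    # A 2-D array cut into 4x4 tiles, spread over 3 nodes:
    placement = {(i, j): node_for_tile((i, j), 3)
                 for i in range(4) for j in range(4)}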

Performance Comparison (Benchmarks)

Environment

All systems ideally get installed on identical platforms. Should that not be possible, the target environment must be described precisely enough to still allow comparison.

Criteria

  • storage
    • subsetting effectiveness: data read vs. data delivered for a particular query (see the read-amplification sketch after this list)
    • compression (including sparsity)
    • distributed storage
    • other
  • processing: speed of standardized operations
    • cell-wise operations ("local operations" in the Tomlin categorization), such as log(A)
    • generation of new arrays
    • aggregation
    • array join
    • focal operations, such as an n×n Sobel filter
    • statistics and Linear Algebra operations, such as histograms and matrix multiplication
  • import:
    • insertion of a file forming a new array
    • updating an array from a file
  • export:
    • generating, say, a NetCDF file for 1-D through 5-D data, from 10 kB to 10 GB (note: larger sizes do not make much sense from a practical viewpoint, as such Big Data technology should provide crisp output to the user; Big Data are "too big to transport", and the transfer time would soon dominate)
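
The subsetting effectiveness criterion above can be quantified as read amplification: the ratio of data read to data delivered. In a tiled store, a query box forces reading every tile it intersects; a small Python calculation, assuming 256x256 tiles as an example:

    import math

    TILE = 256
    lo, hi = (100, 100), (400, 400)   # a 300x300 query box

    # Number of tiles the box intersects along each axis, multiplied up:
    tiles_read = math.prod(
        (h - 1) // TILE - l // TILE + 1 for l, h in zip(lo, hi))

    cells_read = tiles_read * TILE * TILE                       # 4 tiles = 262144 cells
    cells_delivered = math.prod(h - l for l, h in zip(lo, hi))  # 90000 cells
    print(cells_read / cells_delivered)                         # ~2.9x read amplification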

Testing approaches

  • processing efficiency: single-thread performance
  • processing scalability: effectiveness of parallel/distributed processing

Results

Synoptic performance table? diagrams?

Issue: reproducibility for proprietary software?

References

  • G. Merticariu, D. Misev, P. Baumann: Measuring Storage Access Performance in Array Databases. Proc. 7th Workshop on Big Data Benchmarking (WBDB), December 14-15, 2015, New Delhi, India
  • V. Liaukevich, D. Misev, P. Baumann, V. Merticariu: Location and Processing Aware Datacube Caching. Proc. 29th Intl. Conf. on Scientific and Statistical Database Management (SSDBM '17), ACM, New York, USA, Article 34

Tuning Parameters

Which tuning parameters does the system provide?

  • partitioning
  • compression
  • distribution
  • caching
  • other

Standards Supported

(in particular the standards listed below)

  • ISO/IEC 9075 SQL Part 15: Multi-Dimensional Arrays (SQL/MDA)
  • OGC Web Coverage Processing Service (WCPS) Interface Language

Data

Test data sets are to be provided (linked) here, allowing the technologies to be compared synoptically.

Systems under Test

  • rasdaman: Jacobs University / rasdaman GmbH
  • tbd