Why Do We Need "Arrays"?
The significant increase in scientific data that occurred in the past decade – such as NASA’s archive growth from some hundred Terabytes in 2000 [2] to 32 Petabytes of climate observation data [3] – marked a change in the workflow of researchers and pro-grammers. Early approaches consisted mainly of retrieving a number of files from an FTP server, followed by manual filtering and extracting, and then either running a batch of computation processes on the user’s local workstation, or tediously writing and optimizing sophisticated single-use-case software designed to run on expensive supercomputing infrastructures. This is not feasible any more when dealing with Petabytes of data which need to be stored, filtered and processed beforehand. When data providers discovered this they started providing custom tools themselves, often leading to silo solutions which turn out to erode over time and make maintenance and evolution hard if not impossible. An alternative finding attention only recently are database-centric approaches, as these have shown significant potential; meantime, we find both small institutions [4] and large data-centers [6] using modern database architectures for massive spatio-temporal data sets.
Arrays - also called "raster data" or "gridded data" or, more recently, "datacubes" [42] - constitute an abstraction that appears in virtually all areas of science and engineering, and even beyond:
- Earth sciences: 1-D sensor data, 2-D satellite imagery, 3-D x/y/t image timeseries and x/y/z subsurface voxel data, 4-D x/y/z/t climate and weather data; etc.
- Life sciences: microarray data, image modalities like X-ray, sonography, PET, and fMRI delivering 2-D and increasingly 3-D data about human and non-human brains and further organs; 2-D through 4-D gene expression data; etc.
- Space sciences: optical and radio telescope data; 4-D x/y/z/t cosmological simulation output; planetary surface and subsurface data; etc.
- Statistics: "Datacubes" are known since long in the context of Data Warehousing and OLAP [43] where, instead of spatial, abstract axes are defined, usually together with a time axis. A main difference to the above data is that statistical datacubes are rather sparse (say, 3% - 5% of the data space is occupied by values) whereas Earth, Space, and Life science and engineering data tend to be rather dense, often completely dense (i.e., most or all cell positions in the grid hold some non-null value).
Fig. 1 gives an impression of the variety of different observed data specifically in Ocean science. Generally, arrays typically represent sensor, image, simulation, and statistics data of spatio-temporal or "abstract" dimensions.
Fig. 1. Integrating a variety of data sources and types in an Ocean Science Interoperability Experiment (src: OGC membership)
For decades now, SQL has proven its value in any-size data services in companies as well as public administration. Part of this success is the versatility of the query lanuage approach, as well as the degree of freedom for vendors to enhance performance through server-side scalability methods. Unfortunately, scientific and engineering environments could benefit only to a limited extent. The main reason is a fundamental lack in data structure support: While flat tables are suitable for accounting and product catalogues, science needs additional information categories, such as hierarchies, graphs, and arrays. The consequence of this missing support has been a historical divide been ”data” (large, constrained to download, no search) and ”metadata” (small, agile, searchable).
Array databases [44] provide flexible, scalable services on massive multi-dimensional arrays. consisting of storage management and processing functionality for multi-dimensional arrays which form a core data structure in science and engineering. They have been specifically designed to fill the gaps of the relational world when dealing with large binary datasets of structured information and have gained traction in the last years, in scientific communities as well as in industrial sectors like agriculture, mineral resource exploitation etc. 1-D sensor data, 2-D satellite and medical imagery, 3-D image timeseries, 4-D climate models are all at the core of virtually all science and engineering domains [6]. The currently most influential array database implementations are, in historical order, rasdaman [1][5][9][10][21] and SciDB [8][16]; Fig. 2 gives a brief outline on the historical development of this field. Each of them allows querying the data based on the array’s properties and contents using declarative languages that usually allow for a large degree of flexibility in both query formulization and internal query optimization techniques. Processing of arrays is core functionality in such databases with large sets of operations, ranging from simple sub-setting up to statistics, signal and image processing, and general Linear Algebra. A first Array Database workshop has been held in Uppsala in 2011 [11].
Fig. 2: History of Array Databases
Theoretical Foundations
Formally, a d-dimensional array is a function a: D → V with a domain consisting of the d-fold Euclidean cross product of closed integer intervals:
D = {lo1, …, hi1} x … x {lod, …, hid} with loi≤hii for 1≤i≤d
where V is some non-empty value set, called the array’s cell type. Single elements in such an array we call cells. Arrays sometimes are popularly referred to as datacubes emphasizing the higher dimensions, from 3 onwards. Still, the concept encompasses all dimensions, including 1-D and 2-D (0-dimensional arrays constitute a border case which can be considred as a single value - it is more or less a matter of taste whether to consider them arrays or not).
Tomlin has established a so-called Map Algebra [30] which categorizes array operations depending on how many cells of an input array contribute to each cell of the result array; here is an excellent compressed introduction. While Map Algebra was 2-D and has been extended to 3-D lateron, AFATL Image Algebra [31] is n-dimensional by design. Array Algebra [6] has been influenced by AFATL Image Algebra when establishing a formal framework for n-D arrays suitable for a database query language.
Querying Arrays
Although array query languages heavily overap there is not yet a common consensus on operations and their representation. For the brief introduction we rely on Array Algebra [6] because it is a minimal formal framework of well understood expressiveness and also forms the theoretical underpinning of the forthcoming ISO Array SQL standard, SQL/MDA (see standards section). In passing we note that array operations, being 2nd order with functions as parameters, introduce functionals. Array Algebra relies on only three core operators: An array constructor, marray, an aggregator called condenser, and an array sort operation (which we skip for this introduction). We introduce these in turn.
Deriving arrays
The marray operator creates an array of a given extent and assigns values to each cell through some expression which may contain occurrences of the cell’s coordinate. Sounds complicated? Let us start simple: assume we want to obtain a subset of an array A. This subset is indicated through array coordinates, i.e., we extract a sub-array. For a d-dimensional array this subset can be defined through a d-dimensional interval given by the lower corner coordinate (lo1, ..., lod) and upper corner coordinate (hi1,...,hid), respectively. To create the subset array we write
marray x in [ lo1:hi1, ..., lod:hid ] values a[x]
This extraction, which retains the dimensionality of the cube, is called trimming. Commonly this is abbreviated as
a[ lo1:hi1, ..., lod:hid ]
We can also reduce the dimension of the result by applying slicing in one or more coordinates. This means, instead of the "loi:hii" interval we provide only one coordinate, the slice position "si". Notably, if we slice d times we obtain a single value (or, if you prefer, a 0-d array), written as:
marray x in [ s1, ..., sd ] values a[x]
or in its shorthand
a[ s1, ..., sd ]
which resembles the common array cell access in programming languages. Figure X shows some examples of trimming and slicing on a 3-D array.
Fig. X: Various types of subsetting from an array: trimming (left, which keeps the original dimension) and slicing (which reduces the number of dimensions, right)
Now let as assume we want to change the individual cell values rather than doing extraction, for example deriving the logarithm of some input array of given domain extent D:
marray x in D values log( a[x] )
An example for a binary operator is addition of two images:
marray x in D values a[x] + b[x]
In fact, any unary or binary operation defined on the input arrays’ cell types "induces" a corresponding array operation. For binary operations - also referred to as array joins - we require that both operand arrays share the same spatial extent so that the pairwise matching of array cells is defined. Syntactically, we abbreviate such marray operations so that the above example can be written as
a + b
With this simple rule we have obtained already all the well-known arithmetic, Boolean, exponential, and trigonometric operations.
Extending binary to n-ary functions we find two practically useful operations, the case and concat operators. Following the syntax of SQL we can write an array case (or "if" operator) as in the following example which performs a traffic light classification of array values, based on thresholds t1 and t2:
case when a > t2 then {255,0,0} when a > t1 then {0,255,255} else {0,255,0} end
Another useful operation is array concatenation. We define, for two arrays a with domain A and b with domain B,
a concat b := marray x in A union B values case when x in A then a[x] else b[x] end
Obviously, the union of the input domains must be a valid array domain again. It is straightforward to extend concatenation to an n-ary function provided the input array domains altogether form a valid array partition.
Condensing arrays
All the above operations have served to derive a new array from one or more given arrays. Next, we look at the condenser which - in analogy to SQL aggregation - allows to derive summary values. The general condenser iterates over an array covering the domain indicated and aggregates the values found; actually, each value can be given by a location-aware expression. The following example adds all cell values of a in domain D (which obviously must be equal to or contained in the domain of array a):
condense + over x in D using a[x]
This can be abbreviated as
sum(a)
Not all operations can act as condensers as they must be injective, commutative, and associative. Common candidates are sum, avg, min, max, exists, and forall.
Operator combinations
The operators illustrated can all be combined freely to form expresssions of arbitrary complexity. We demonstrate this through two examples.
Example 1: The matrix product of a and b, yielding an mxp result matrix.
marray i in [0:m], j in [0:p] values condense + over k in [0:n] using a [ i, k ] * b [ k, j ]
Example 2: A histogram over an 8-bit greyscale image.
marray bucket in [0:255] values count_cells( img = bucket )
This way, general operations from image / signal processing, statistics, and Linear Algebra can be expressed.
Array integration
Some systems operate on arrays standalone, others integrate them into a host data model, typically: relations. Following ISO SQL we embed arrays into the relational model as a new column type which is shared by the majority of systems such as rasdaman, PostgreSQL, Oracle, and Teradata. This offers several practical advantages, such as a clear separation of concerns in query optimization and evaluation which eases mixed optimization [17]. For example, we can define a table of Landsat images as follows:
create table LandsatScenes( id: integer not null, acquired: date, scene: row( band1: integer, ..., band7: integer ) mdarray [ 0:4999,0:4999] )
which can be queried like this example shows:
select id, encode(scene.band1-scene.band2)/(scene.nband1+scene.band2)), "image/tiff" ) from LandsatScenes where acquired between "1990-06-01" and "1990-06-30" and avg( scene.band3-scene.band4)/(scene.band3+scene.band4)) > 0
A notable effect is that now data and metadata reside in the same information space and can be accessed and combined in one and the same query. Hence, in future the age-old distinction between data and metadata can be overcome.
Array Database Architectures
Storage
Access patterns on arrays are strongly linked to the Euclidean neighborhood of array cells (Fig. X), therefore it must be a main goal of any storage engine to preserve proximity on persistent storage through some suitable spatial clustering. It is common, therefore, to partition n-D arrays into n-D sub-arrays called tiles [1] or chunks [22] which then form the unit of access to persistent storage.
Fig. X n-D Euclidean neighborhood of array cells
Obviously, the concrete partitioning chosen greatly affects disk traffic and, hence, overall query performance. By adjusting the partitioning – statically in advance or dynamically at query time – to the workloads, the number of partitions fetched from persistent storage can be minimized, ideally: to a single disk access (Fig. X). The challenge is to find a partitioning which supports a given workload. For example, when building x/y/t remote sensing data cubes imagery comes in x/y slices with a thickness of 1 along t. Time series analysis, on the contrary calls for cutouts with long time extent and (possibly) limited spatial x/y extent.
While this principle is generally accepted partitioning techniques vary to some extent. PostGIS Raster [40] allows only 2D x/y tiles and suggests small sizes, like 100x100 pixels. Teradata arrays are limited to less than 64 kB [41]. SciDB offers a two-level partitioning where smaller partitions can be gathered in container partitions. Further, SciDB allows overlapping partitions so that queries requiring adjacent pixels (link in convolution operations) do not require reading the neighboring partitions [23]. In rasdaman, a storage layout sublanguage allows to define partition-ing along several strategies [5]. For example, in “directional tiling” ratios of partition edge extents are indicated, rather than absolute sizes; this allows to balance mixed workloads containing, e.g., spatial timeslice extraction and temporal timeseries analysis. In the “area of interest tiling” strategy, hot spots are indicated and the system automatically determines an optimal partitioning.
Fig. X Sample tiling of 2-D and 3-D arrays (left) and rasdaman tiling strategies area-of-interest, regular, and directional (right)
To quickly determine the partitions required – a typical range query – some spatial index, such as the R-Tree, proves advantageous. As opposed to spatial (i.e., vector) databases the situation with arrays is relatively simple: the target objects, which have a box structure (as opposed to general polygons), partition a space of known extent. Hence, most spatial indexes can be expected to perform decently.
Often, compression of tiles is advantageous [9]. Still, in face of very large array databases tertiary storage may be required, such as tape robots [22] [21].
Processing (tbd)
When it comes to query evaluation it turns out that, in general, array operations are heavily CPU bound; this is contrary to relational query processing which typically is I/O bound. Some array operations are trivially parallelizable, such as cell-wise pro¬cessing and combination (so-called “local” operations [22]) and simple aggregations. These can easily be distributed both on local processing nodes like multicore CPUs and general-purpose GPUs and remote nodes, like servers in a cloud network. Others have to be carefully analyzed, transformed and sometimes even rewritten in different sets of operations to gain such parallelizable characteristics, e.g. joins on differently partitioned arrays, histogram generators and, in general, array constructors with non-serial access patterns.
The following is a non-exhaustive list of optimizations proven effective in Array DBMSs like rasdaman:
- query rewriting
- parallelization. The fact that array operations involve applying the same operation on a large number of values, and also the observation that tiles map naturally to CPU cores sometimes leads to the hasty conclusion that array operations per se are "embarrassingly parallel". While this holds for simple operations, such as unary induced operations like "log(a)", this is not true in general. Already binary operations like "a+b" pose challenges - for example, both operand arrays can reside on diferent nodes, even data centers, and they may have an incompatible tiling which calls for advanced methods like Array Join [45]. Additional complexity, but also opportunities, come with complex queries.
- Cost-based optimization attempts to find an eficient execution plan out of the - usually large - search space of possible plans for a given query.
- Mixed hardware.
- approximative caching.
Client interfacing
While "datacubes" represent a convenient logical view on massive multi-dimensional data this does not mean clients need to see data in such a shape. Very often, clients will do some extraction and aggregation, thereby reducing and changing dimensionality away from the original. More importantly even, users should be able to remain as much as possible within their comfort zone of well known tools - for example, simple map naigation should still be able through clients like OpenLayers and Leaflet, embedding into Web GIS should support tools like QGIS and ESRI ArcGIS, virtual globes like NASA WebWorldWind and Cesium should be supported, whereas high-end analytics calls for access to datacubes through R and python.
Related Technology
Array databases, by definition, are characterized by offering a declarative query language on n-D arrays. Such technology can e implemented in various ways - as will be demonstrated by the systems overview in the next section - each coming with its individual characteristics. However, we will also look beyond the pure Array Database category and give a glance at other array technology, including array engines offering only procedural interfaces (rather than a query language) as well as libraries and command-line tools which form possible components of array services, but do not constitute a complete service tool per se. This way, we aim at providing a context for the novel category of Array Databases.
References
- 110+ publications on Array Databases and related topics
- Wikipedia: Array Databases
- [1] Baumann P (1994): On the Management of Multidimensional Discrete Data, VLDB Journal 4(3), pp. 401 - 444. Special Issue on Spatial Database Systems
- [5] Baumann P, Feyzabadi S, Jucovschi C (2010): Putting Pixels in Place: A Storage Layout Language for Scientific Data. Proc. IEEE ICDM Workshop on Spatial and Spatiotemporal Data Mining (SSTDM), December 14, 2010, Sydney, Australia, pp. 194 – 201
- [9] Dehmel A (2002): A Compression Engine for Multidimensional Array Database Systems. PhD thesis, TU München, 2002
- [20] PostGIS: PostGIS Raster manual. Seen on 2014-07-29
- [21] Reiner B, Hahn K (2002): Hierarchical Storage Support and Management for Large-Scale Multidimensional Array Database Management Systems. Proc. DEXA, 2002, Aix en Provence, France
- [22] Sarawagi S, Stonebraker M (1994): Efficient Organization of Large Multidimensional Arrays. Proc. International Conference on Data Engineering ICDE, Houston, USA, 1994, pp. 328-336
- [23] Soroush E, Balazinska M, Wang D (2011): ArrayStore: A Storage Manager for Complex Parallel Array Processing. Proc. ACM SIGMOD, Athens, Greece, pp. 253 – 264
- [25] Teradata: User‑Defined Data Type, ARRAY Data Type, and VARRAY Data Type Limits. Seen on 2014-07-29
- [2] W. Gibson, “Data, data everywhere”, The Economist Special report: Managing information, 2010. http://www.economist.com/node/15557443
- [3] P. Webster, “Supercomputing the Climate: NASA’s Big Data Mission”, CSC World. Computer Sciences Corporation, 2012
- [4] J. H. P. Oosthoek, A. P. Rossi, P. Baumann, D. Misev, and P. Campalani, “PlanetServer: Towards online analysis of planetary data”, in Planetary Data, 2012.
- [6] P. Baumann et al, “Towards Ad-Hoc Analytics on Big Earth Data: the EarthServer Initiative”, 2013.
- [13] M. Koubarakis, M. Datcu, C. Kontoes, U. Di Giammatteo, S. Manegold, and E. Klien, “TELEIOS: a database-powered virtual earth observatory,” Proc. VLDB Endow., vol. 5, pp. 2010–2013, 2012
- [16] P. Cudre-Mauroux et al, “A demonstration of SciDB: a science-oriented DBMS,” Proc. VLDB Endow., 2(2)2009, pp. 1534–1537, 2009
- [10] P. Baumann, “Management of multidimensional discrete data”, VLDB J., 3(4)1994, pp. 401–444, 1994
- [8] P. G. Brown, “Overview of sciDB: large scale array storage, processing and analysis”, in Proc. ACM SIGMOD, 2010, pp. 963–968
- [30] D. Tomlin, "A map algebra", Harvard Graduate School of Design, 1990.
- [31] Ritter, G. Wilson, J.; Davidson, J., "Image algebra: An Overview", Computer Vision, Graphics, and Image Processing, 49(1):297-331, 1990
- [11] P. Baumann, B. Howe, K. Orsborn, S. Stefanova: Proceedings of the 2011 EDBT/ICDT Workshop on Array Databases, Uppsala, Sweden, March 25, 2011 ACM 2011
- [17] D. Misev, P. Baumann: Enhancing Science Support in SQL. Proc. Workshop on Data and Computational Science Tech-nologies for Earth Science Research (co-located with IEEE BigData), Santa Clara, US, October 29, 2015
- [40] PostGIS: PostGIS Raster manual. Seen on 2014-07-29
- [41] Teradata: User‑Defined Data Type, ARRAY Data Type, and VARRAY Data Type Limits. Seen on 2014-07-29
- [42] P. Baumann: The Datacube Manifesto. Available on http://earthserver.eu/tech/datacube-manifesto, seen on 2017-06-27
- [43] M. Blaschka, C. Sapia, G. Höfling, B. Dinter: Finding Your Way through Multidimensional Data Models. DEXA Workshop Data Warehouse Design and OLAP Technology (DWDOT'98), Vienna, Austria, August 24-28, 1998, pp. 198-203
- [44] P. Baumann: Array Databases. In: T. Özsu, L. Liu (eds.): Encyclopedia of Database Systems, Springer, 2017
- [45] P. Baumann, V. Merticariu: On the Efficient Evaluation of Array Joins. Proc. Workshop Big Data in the Geo Sciences (co-located with IEEE Big Data), Santa Clara, US, October 29, 2015
- 11575 reads