Open Data, Open Source, Open Standards

In an era where openness has become highly valued, we sometimes observe confusion about the meaning and consequences of an "open X". We therefore briefly discuss three core terms heavily debated in the science data domain. Note that the goal is not to define or even explain them in detail - this has been done many times elsewhere already - but to specifically relate these three terms to each other.

Open Data

This addresses the accessibility of data. Data are said to be open if they can be accessed without restrictions, such as limitations to particular user groups. A related term is "Free Data", meaning that access is free of charge.

An indirect obstacle to free access, aside from and independent of organizational restrictions, can be the difficulty of access due to reasons such as uncommon data formats, unwieldy data granules (such as 100 GB TIFF files), access interfaces requiring a strong technical background, or interfaces posing particular hardware requirements (high client-side resource needs, a high-bandwidth connection, etc.). Hence, offering open data also has implications for the ease of use of the data offered. In this context, an interesting and widely embraced initiative has been launched by the USGS Landsat team, coining the term Analysis Ready Data. In this approach, data centers undertake considerable effort in preparing (homogenizing, restructuring, cleansing, etc.) data in a way that reduces such intrinsic obstacles to data access.
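
To make the ease-of-use point concrete, here is a minimal Python sketch of reading only a small subset of a large raster granule instead of downloading it in full. It assumes the third-party rasterio package and a purely hypothetical file URL; the exact access pattern will differ per data center and format.

    # Sketch only: read a small window from a (hypothetically cloud-hosted) large
    # GeoTIFF instead of transferring the entire ~100 GB granule.
    # Requires the third-party "rasterio" package; the URL below is made up.
    import rasterio
    from rasterio.windows import Window

    PATH = "https://example.org/landsat/scene_huge.tif"  # hypothetical granule

    with rasterio.open(PATH) as src:
        # Request only a 512 x 512 pixel window of band 1.
        subset = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))
        print(subset.shape, subset.dtype)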

Open Source

This term refers to the software used, e.g., to serve or access data (i.e., servers and clients in Web-based information systems). By way of background, most programs are written by human developers in some high-level language which is closer to human concepts than the computer's machine language - hence, programming becomes more efficient and less error-prone, and the resulting programs are easier to maintain. For each language there are special programs - called compilers or interpreters - translating this "source code" into "object code" which can be executed by a particular CPU. Note that for one and the same language different compilers may exist, and do so in practice - we will need this fact later.
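
As a small illustration, the following sketch uses Python's standard dis module to show the lower-level code generated from human-readable source. (Python compiles to bytecode for its virtual machine rather than directly to CPU instructions, but the principle of translating source code into a lower-level executable form is the same.)

    # Sketch: inspect the lower-level bytecode the Python interpreter derives
    # from human-readable source code (standard-library "dis" module).
    import dis

    def add(a, b):
        return a + b

    # Prints instructions such as LOAD_FAST and an add instruction
    # (exact instruction names vary by Python version).
    dis.dis(add)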

Obviously, the machine code is hard for humans to understand, as opposed to the high-level source code which is digestible at least by programming experts. Hence, source code allows one to find out what a program really does - whether it does the right thing, performs computations correctly and without flaws like undue rounding errors, does not contain malicious code, etc. Of course, detecting any such issue requires considerable effort by skilled programmers, so not everybody is able to benefit from the openness of the source code.
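
For instance, only a look at the source reveals whether a numeric routine sums floating-point values naively or uses a compensated algorithm - a difference that is invisible at the interface but visible in the result. A minimal Python sketch:

    # Sketch: the same mathematical sum computed two ways - only the source code
    # reveals which approach a given tool actually uses.
    import math

    values = [0.1] * 10
    print(sum(values))        # naive summation: 0.9999999999999999 on IEEE-754 hardware
    print(math.fsum(values))  # compensated summation: 1.0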

Further, even open source code runs in the particular "environment" of the computer hosting the service. As such, the program will use external libraries whose source code may or may not be open, and it has been derived from the source code through a compiler which itself may or may not be open. Hence, even when inspected by experts, openness of the source code of the tool under consideration does not necessarily guarantee complete insight into its effects.

In particular, for data scientists (i.e., not computer scientists) it is generally not possible to verify the validity of open source code - if only for lack of time to inspect all the source code involved.

Generally speaking, both open-source and proprietary approaches to building and maintaining software have their own advantages and disadvantages. In today's planetary software ecosystem we find a wide spectrum of business models, ranging from fully open source through mixed models (such as dual licensing) to fully closed, proprietary software (such as license sales or leases) - and often we find them in combination (such as running open-source clients on closed-source MS Windows).

Open Standards

In Information Technology, standards typically establish data formats and interfaces between software components so that software artifacts written by different, independent producers (say, different companies or different departments within a company) can still communicate and perform a given task jointly. Building software based only on interface knowledge, without knowledge of the internals by which a component establishes the behaviour described by the interface definition, is a key achievement of Software Engineering; without such boxed thinking, the complexity of today's software would be absolutely intractable and unmanageable.
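
The following Python sketch illustrates this boxed thinking with made-up names: a client is written against an interface only and works with any independently produced implementation of it, without knowing anything about the internals.

    # Sketch (illustrative names only): interface-based composition.
    from typing import Protocol

    class DataSource(Protocol):          # the agreed, "standardized" interface
        def read(self, key: str) -> bytes: ...

    class VendorAStore:                  # one producer's implementation
        def read(self, key: str) -> bytes:
            return b"payload from vendor A for " + key.encode()

    class VendorBStore:                  # an independent implementation
        def read(self, key: str) -> bytes:
            return b"payload from vendor B for " + key.encode()

    def client(source: DataSource, key: str) -> int:
        # The client relies only on the interface, not on implementation internals.
        return len(source.read(key))

    print(client(VendorAStore(), "scene-42"))
    print(client(VendorBStore(), "scene-42"))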

As with data, a standard is called open if it is accessible to everybody without discrimination; some of these standards are additionally free of charge (such as those of the Open Geospatial Consortium, OGC), while others are available for a (usually moderate) fee (such as those of ISO).

Some standardization bodies offer compliance suites which allow validating implementations against the standard. One example is the extensive OGC compliance test suite.

Importantly, for some tool it is sufficient to know its interface specification ("if I input X, I will get Y"). If this specification is an open standard, and if the tool has been confirmed to pass the corresponding compliance test, then the behaviour of this tool can be trusted with respect to this standard (of course, there may be further unwanted behaviour not addressed by the compliance test - for example, such a test will typically concentrate on functionality, but not on security).
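
A highly simplified Python sketch conveys the flavour of such black-box checking; it is not the actual OGC compliance suite, and the endpoint and expected element name below are hypothetical placeholders.

    # Simplified sketch of a compliance-style check (NOT the real OGC test suite):
    # request a capabilities document from a hypothetical endpoint and verify that
    # the response is well-formed XML with the expected root element.
    import urllib.request
    import xml.etree.ElementTree as ET

    ENDPOINT = "https://example.org/ows?service=WCS&request=GetCapabilities"  # hypothetical

    def check_capabilities(url: str, expected_root: str = "Capabilities") -> bool:
        with urllib.request.urlopen(url, timeout=30) as response:
            if response.status != 200:
                return False
            root = ET.fromstring(response.read())
        # Strip a possible XML namespace before comparing the root element's local name.
        local_name = root.tag.rsplit("}", 1)[-1]
        return local_name == expected_root

    print(check_capabilities(ENDPOINT))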

Examples are manifold: we trust SQL query language implementations, regardless of whether the database management system is open or closed source; we trust our C++ compilers, Python engines, numerical libraries, operating systems, etc. - at least concerning the core question addressed here: does this code provide me with the true, valid data (read from disk or processed)? And, for that matter, we trust the underlying hardware which ultimately executes the code.
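
The SQL case can be made tangible with a short sketch using Python's built-in sqlite3 engine; the same standard SQL statement could equally be sent to any other SQL-conformant system holding identical data, and a conformant implementation must return the same result.

    # Sketch: a query written in standard SQL, executed here with the built-in
    # sqlite3 engine purely for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (station TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",
        [("A", 1.5), ("A", 2.5), ("B", 4.0)],
    )

    query = "SELECT station, AVG(value) FROM measurements GROUP BY station ORDER BY station"
    print(conn.execute(query).fetchall())   # [('A', 2.0), ('B', 4.0)]
    conn.close()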

Conclusion

In conclusion, open data, open source, and open standards are three different concepts, each one addressing separate concerns in the first place. Open data access is desirable in many respects, although there are valid reasons for some data not to be openly accessible. The service software in particular plays an instrumental role in delivering the promise of open data. Open source as such, though, is neither a guarantee nor a required prerequisite for open data - open standards serve this purpose much better, although with the caveat that standards do not make a statement about the complete client or server stack, but only about the particular aspect addressed by the standard. However, by using well-crafted standards (ideally coming with a solid mathematical underpinning), such as the ISO SQL relational query language or the OGC WCPS geo datacube query language, a substantial contribution towards the Holy Grail of open data can be made. The interoperability established thereby - in this context: different servers using identical data will deliver identical results - constitutes a major advantage whose benefits are by far not leveraged in full today.