For example, JPEG changes data values in ways that are not readily noticeable to
human vision because the changes are designed to exploit limitations and
characteristics of human vision. As a consequence, formats such as JPEG are most
suitable for images intended for human consumption. However, such changes may be
very significant to analytic functions the data is intended to support. As a general rule,
if the data is to support analysis, only lossless compression should be used.
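The distinction can be illustrated with a minimal sketch. It uses zlib as a lossless codec and simple quantization as a stand-in for a lossy codec; the quantization step is only an illustrative assumption and is not how JPEG itself works. The sample values are invented for the example.

```python
import zlib

# Hypothetical 8-bit cell values from a small raster strip.
values = bytes([12, 13, 13, 14, 200, 201, 203, 90])

# Lossless: decompression restores every value exactly.
restored = zlib.decompress(zlib.compress(values))
lossless_ok = restored == values

# Lossy stand-in: round each value down to a multiple of 16.
# The result may look similar when displayed, but the numbers have changed.
quantized = bytes((v // 16) * 16 for v in values)
max_error = max(abs(a - b) for a, b in zip(values, quantized))
```

Any analysis that depends on the exact cell values (thresholds, differences, statistics) would see the error introduced by the second approach but not the first.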
3.3.3 Raster Formats
As with vector data, there are a number of formats in common use for raster data.
Simple Raster Formats
A number of simple raster formats, some dating back to the days when data was read
directly from tape drives, remain in active use today. BIL (band interleaved by line),
BIP (band interleaved by pixel), and BSQ (band sequential) are formats for multi-
band raster data, though it would be more accurate to describe these as generic data
organization techniques that can be employed by formats. For example, colour USGS
digital orthophotos were initially organized as BIP, divided into fixed-length records
with an ASCII header (the USGS has since switched to GeoTIFF).
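The three organizations differ only in the order in which samples are written. A minimal sketch, assuming a headerless file with one sample per cell and zero-based row, column and band indices (the function name is hypothetical):

```python
def sample_index(layout, row, col, band, rows, cols, bands):
    """Return the position (counted in samples) of one value in the file."""
    if layout == "BSQ":
        # All of band 0, then all of band 1, and so on.
        return band * rows * cols + row * cols + col
    if layout == "BIL":
        # For each row: the line for band 0, then the line for band 1, ...
        return row * bands * cols + band * cols + col
    if layout == "BIP":
        # For each pixel: the values for all bands stored together.
        return (row * cols + col) * bands + band
    raise ValueError("unknown layout: " + layout)
```

Multiplying the result by the sample size in bytes, and adding any header length, gives the byte offset. This is why the dimensions, band count and sample type must be preserved alongside such files.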
Arc/Info ASCII GRID and USGS DEM are simple, open, ASCII formats for
single-band raster data. Each simply lists raster cell values in left-to-right and
top-to-bottom order, augmented with georeferencing information in the header and/or
trailer records. These formats still find use in converting and processing raster data.
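The simplicity of such formats is easiest to see in code. The following is a minimal sketch of a reader for an ASCII grid, assuming the commonly documented header keywords (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) followed by whitespace-separated cell values; the function name is hypothetical, and real files may vary.

```python
def parse_ascii_grid(text):
    """Parse an ASCII grid from a string into (header dict, list of rows)."""
    tokens = text.split()
    header = {}
    i = 0
    # Header entries are keyword/value pairs; cell data starts at the
    # first token that does not begin with a letter.
    while i < len(tokens) and tokens[i][0].isalpha():
        header[tokens[i].lower()] = float(tokens[i + 1])
        i += 2
    ncols, nrows = int(header["ncols"]), int(header["nrows"])
    values = [float(t) for t in tokens[i:]]
    if len(values) != ncols * nrows:
        raise ValueError("cell count does not match ncols * nrows")
    # Values are listed left to right, top to bottom.
    rows = [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]
    return header, rows


sample = """ncols 3
nrows 2
xllcorner 500000
yllcorner 100000
cellsize 10
NODATA_value -9999
1 2 3
4 5 -9999
"""
header, rows = parse_ascii_grid(sample)
```

A reader of this size, written from the published description alone, is part of what makes these formats low-risk for long-term preservation.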
From a preservation perspective, these simple raster formats pose little curation
difficulty due to their open standards, widespread support, and ease of
More Complex Raster Formats
TIFF has emerged as a common format for storing and delivering raster data owing
to its open standard (the standard is controlled by Adobe, but openly published and
not subject to license), its flexibility in describing multiple bands and data types, its
extensible framework for embedded metadata ("tags"), and its popularity in the
desktop publishing world. TIFF itself defines the semantics of a few tags; GeoTIFF
is an open standard that defines additional tags applicable to geospatial raster data,
including complete coordinate reference information.
JPEG 2000 is a relatively new standard that supports progressive, wavelet-based
compression. It offers a wealth of other features, including lossy and lossless
compression techniques, selective and adaptive compression, etc. JPEG 2000 also
allows arbitrary XML metadata to be embedded in image files, and the OGC has
defined a standard for embedding GML documents in JPEG 2000. By exploiting the
full capabilities of GML, this opens up the possibility of embedding in image fields
not just coordinate reference system information, but also coverage metadata,
annotations, and even vector features. More detailed information about JPEG 2000
from a preservation point of view can be found in the relevant DPC Technology
Watch Report.
A number of proprietary formats have also emerged for handling very large geospatial
imagery datasets, including ECW from ER Mapper (now part of ERDAS) and MrSID
from LizardTech, which use wavelet compression methods to reduce file size.
Support for JPEG 2000 is increasing, but today GeoTIFF is arguably the most
survivable format for geospatial raster data due to its widespread use and support.
3.3.4 Mosaicked Raster Data
Raster data is often used to represent continuous phenomena (e.g., surface elevation),
but for convenience of data management and delivery it is packaged into fixed-size
tiles divided along arbitrary tile boundaries. This means that it is often desirable to
mosaic the tiles back together into a seamless whole, and to thereby allow users to
browse and crop out just the portion of the entire dataset that is of interest to them.
The ramifications of mosaicking for preservation purposes depend greatly on the
implementation specifics. If the raster tiles are stored as files in a filesystem, for
example as GeoTIFFs, each independently carrying metadata and georeferencing
information, and if the mosaicking system is entirely automated, then the preservation
problem may be no more difficult than the problem of preserving a collection of files.
In this case, the mosaic can be viewed purely as an access mechanism. Preservation of
the raster tile files alone is sufficient to recreate the mosaic in the future, but only if
the coordinates of the tiles are preserved.
However, sophisticated mosaicking systems often perform edge alignment and colour
balancing across tile boundaries, and even allow for fully manual and/or manually
directed adjustments. In this case, the mosaicked image may effectively become a
new data product derived from source raster tiles, and as such it may merit
preservation independent of the source tiles.
3.3.5 Stereo, Oblique and Ground-Level Imagery
Our discussion of raster data so far has focused on data that can serve as a
representation of the Earth's surface, and hence is suitable for projection and layering.
Stereo and oblique imagery are types of imagery that are captured at varying angles to
the vertical, and are used to create stereo pair images and 3D models. Such imagery
requires additional metadata to describe the 3D spatial orientation of the images.
Ground-level photographs are not suitable for projection, but they can be point
georeferenced. The EXIF metadata standard defines a means of capturing coordinate
reference system information in JPEG files, and is compatible with GPS systems that
are often the source of such metadata.
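EXIF records a GPS position as degrees, minutes and seconds, each stored as a rational number, together with a hemisphere reference letter. A minimal sketch of converting such a value to signed decimal degrees follows; the function name and the tuple layout of the input are assumptions made for illustration, since EXIF libraries differ in how they expose these tags.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert ((deg_num, deg_den), (min_num, min_den), (sec_num, sec_den))
    plus a reference letter ('N', 'S', 'E' or 'W') to decimal degrees."""
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    return float(-value if ref in ("S", "W") else value)


latitude = dms_to_decimal(((51, 1), (30, 1), (0, 1)), "N")
longitude = dms_to_decimal(((0, 1), (7, 1), (30, 1)), "W")
```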
3.3.6 Raster Data Size
Because raster data is continuous over an area it can require several orders of
magnitude more storage than the equivalent area represented through a vector data
source. There is little that can be done about this. While it is possible to convert data
from raster to vector representation (and vice versa), doing so is a highly analytic,
lossy process that changes the essential character and functionality of the data. Thus,
raster data generally must be preserved as raster data.
Compounding the problem of size is that automatic capture methods such as digital
aerial photography and satellite remote sensing make it possible to quickly amass
volumes of raster data that are large by any measure: MODIS, for example, acquires
a terabyte of imagery per day. Raster data access mechanisms may impose additional
storage requirements. Image tile pyramids that support efficient panning and zooming
of large images add at least 30% to the data size.
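The size of this overhead can be estimated with a short calculation. The sketch below assumes a pyramid in which each level halves the width and height of the level beneath it, and it ignores compression and tile padding; under those assumptions the extra levels approach one third of the base image size.

```python
def pyramid_overhead(width, height):
    """Return the extra cells in the reduced-resolution levels as a
    fraction of the cells in the full-resolution image."""
    base = width * height
    extra = 0
    w, h = width, height
    while w > 1 or h > 1:
        # Each level halves both dimensions, rounding up.
        w = max(1, (w + 1) // 2)
        h = max(1, (h + 1) // 2)
        extra += w * h
    return extra / base


overhead = pyramid_overhead(1024, 1024)
```

For a 1024 x 1024 image the result is just under one third. A pyramid that stops after only a few levels, or that stores its levels in a different format from the base image, will give a different figure.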
As a consequence, in comparing raster data to vector, preservation of raster data is a
quantitatively larger problem to such a degree that it is a qualitatively different
problem. Large raster datasets will generally require custom engineered storage and
processing systems. If raster data is stored in a spatial database the preservation
problems due to size may compound the inherent migration and snapshot problems of
preserving spatial databases.
3.4 Emerging Data Formats
Additional geospatial data formats are used for data representation, data visualization,
and as network payloads occurring within web-based transfers of information. A
number of new formats such as KML, which is used for geographic visualization,
annotation, and navigation, and GeoRSS, which is used for geographically enabling
RSS and Atom feeds, have emerged. These have especially found use in neogeography
applications. These formats might not be used in the creation or
management of geospatial information; rather data files occurring in these formats are
often created by transforming existing geospatial data. Data in some of these formats
might not be obvious targets for archival acquisition since the original data will tend
to be more complete. Yet the manner in which such data is represented in
visualization environments may be of importance in recording how information has
been shown and to record the basis for decision-making.
3.4.1 KML
KML, formerly known as Keyhole Markup Language, is an XML language focused
on geographic visualization, including annotation of maps or images in digital globe
or mapping environments. KML was initially used solely within Google Earth but is
now used in a range of software environments, and in April 2008 KML version 2.2
was approved as an international implementation standard by the OGC. KML
provides support for both feature data, in the form of points, lines, and polygons, and
image data, in the form of ground and photo overlays.
In Introduction to Neogeography, by Andrew J. Turner, O'Reilly 2006, Neogeography is described
as "'new geography' and consists of a set of techniques and tools that fall outside the realm of
KML files may be associated with images, models, or textures that exist in separate
files. KMZ files are archive files which allow one or more KML files to be bundled
together along with other ancillary files required for the presentation, allowing for
ease of transfer of the entire collection. KMZ files are also compressed in the ZIP
archive format, resulting in reduced file size. KML files may refer to external
resources and other KML files via "network links" (a link to a local or remote
resource), which are used to link related data files and to facilitate data updates. Large
data resources such as imagery datasets may be divided into a large number of smaller
image files which are then made available via network links on an as needed basis.
KML presentations using network links pose a preservation challenge in that any data
available via the links may no longer be available in the future.
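An archive could audit a KMZ file for such dependencies before accepting it. The following is a minimal sketch, assuming the KML 2.2 namespace and that links are expressed as href elements inside NetworkLink elements; the function name is hypothetical, and other kinds of external reference (icons, models, textures) are not checked.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents (an assumption about the input).
KML_NS = "{http://www.opengis.net/kml/2.2}"


def network_link_hrefs(kmz_bytes):
    """Return the href of every NetworkLink in each .kml member of a KMZ."""
    hrefs = []
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as kmz:
        for name in kmz.namelist():
            if not name.lower().endswith(".kml"):
                continue
            root = ET.fromstring(kmz.read(name))
            for link in root.iter(KML_NS + "NetworkLink"):
                for href in link.iter(KML_NS + "href"):
                    if href.text:
                        hrefs.append(href.text.strip())
    return hrefs
```

A non-empty result indicates that the presentation depends on resources outside the archive file, which would need to be harvested separately if the presentation is to remain usable.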
3.4.2 PDF and GeoPDF
PDF is commonly used to provide end-user representations of data in which
multiple datasets may be combined and other value-added elements may be added
such as annotations, symbolization and classification of the data according to data
attributes. While these finished data views, typically maps, can be captured in a
simple image format, PDF provides some opportunity to add additional features such
as attribute value lookup and toggling of individual data layers.
GeoPDF, which specifies a method for geopositioning of map frames within a PDF
document, originated as a proprietary format developed by TerraGo Technologies, a
strategic partner of Adobe. GeoPDF has proven to be a powerful format for
presentation of complex geospatial content to diverse audiences that are not familiar
with geospatial technologies. In September 2008 TerraGo Technologies approached
the OGC with a proposal to introduce the GeoPDF encoding specification to the OGC
standards process to make it an open standard, and it is now published as a "Best
Practice". In parallel, Adobe introduced its own method for geo-registration into
the ISO standards process for PDF.
The preservation challenges that accrue to complex PDF documents will accrue to
these documents as well. While the PDF/A specification has been developed to define
an archive-friendly version of PDF, some of the more advanced functionality that is
put to use in geospatial implementations is not supported by the current PDF/A
specification. The history of complex geospatial PDF documents is rather short, and
risks associated with external dependencies (e.g., fonts) and reliance on specialized
software will require close attention by the preservation community.
See the DPC Technology Watch Report at http://www.dpconline.org/docs/reports/dpctw08-02.pdf
for an analysis of PDF for preservation.
3.5 Spatial Databases
Spatial databases reach a higher level of complexity than individual data files, as they
are capable of storing multiple datasets along with dataset relationships, behaviours,
annotations, and data models, all of which are hosted in a relational database system.
Spatial databases have played an increasingly prominent role in data production and
management, while dataset-oriented formats are often still used for data distribution.
A variety of commercial database management systems, some using spatial
extensions, have the ability to store geospatial data, including Oracle Spatial,
Informix Spatial DataBlade and Microsoft SQL Server. A prominent open source
option is the PostgreSQL-based PostGIS spatial database. These spatial extensions
generally allow the user to store raster and vector data by adding spatial data types to
the database that support storing and querying of spatial data. Access to the spatial
data in these databases can be directly through the database or, more commonly,
through a connection to a desktop or web-based client.
Spatial databases have a number of features in common, including support for:
- Continuous (large geographic extent) datasets
- Large volumes of data (raster and vector)
- Complex data models (spatial data and business models)
- Long transactions, multi-user editing and versioning
These features make the long term preservation of data in spatial databases much
more complex as it is often not possible to extract and transfer individual components
of this data into other systems without losing some information. Preserving
geospatial databases in general is likely to be particularly challenging as all the
problems of preserving relational databases are inherited: the need to take snapshots
of running databases; storage of snapshots in proprietary database dump formats;
complex dump formats; and large, monolithic sizes of snapshots.
3.5.1 ESRI Geodatabases
A prominent spatial database format is the ESRI Geodatabase. The ESRI
Geodatabase (often just referred to as Geodatabase) came into use in the late 1990s
with the advent of the ArcGIS software environment. The Geodatabase can store a
range of data types including geographic features, attribute information, satellite and
aerial imagery, surface modelling data, and survey measurements. In addition to
storing data, Geodatabases can also model the relationships between data and handle
data validation and versioning.
Until recently, there were two forms of the Geodatabase: ArcSDE Geodatabases and
Personal Geodatabases. ArcSDE Geodatabases store the data in a relational database
management system (RDBMS) and support multiple users; Personal Geodatabases are
stored in Microsoft Access and cannot be larger than two gigabytes in size. The
requirement of a commercial relational database connection has made transfers of
ESRI Geodatabases greater than two gigabytes of size difficult.
Database preservation as such is outside the scope of this report. However, there is much research
going on in this area; see http://www.dcc.ac.uk/resource/briefing-papers/database-archiving/
for a brief summary of the topic.
In ArcGIS version 9.2 the File Geodatabase was created as a standalone database not
requiring a commercial back-end database. All information is stored in a directory of
files that can scale up to one terabyte in size, potentially increasing portability and
making the format more useful in archival transfers. However, as yet the format
specifications of the File Geodatabase have not been made publicly available and
there are issues over compatibility between versions, making its immediate appeal
for preservation problematic.
There are a number of approaches to exporting content from the ESRI Geodatabase.
Feature classes (vector layers) may be extracted as Shapefiles or converted to other
formats such as GML for distribution or archiving. Raster datasets may also be
extracted from a Geodatabase in a number of formats, including ERDAS Imagine,
JPEG and TIFF. Starting with ArcGIS version 9 a new, openly specified XML export
became available for the Geodatabase, making it possible to interchange
Geodatabase content with other technical environments, yet it is not clear what
support there will be in future versions of ArcGIS for re-importing XML exports
created from previous versions of the Geodatabase.
3.6 Dynamic Geospatial Data
Geospatial web services allow end-user applications as well as server applications to
make requests for sets of data over the web. Requests might also be made for
particular data processes, such as finding a route or locating a street address.
In web service client applications, data is drawn from one or possibly many different
sources and presented in map form to the user. These mapping environments take the
burden of data acquisition and processing away from the user. While it is typically
possible for the user to save service state (e.g., map area or view, zoom level, what
data is shown etc.), it is usually not possible to save the state of the data within the
service, creating a preservation challenge with regard to capturing such interactions.
3.6.1 Web Map Services (WMS)
The OGC WMS specification was released in 2000 and by virtue of its simplicity
gained wide adoption and vendor support. WMS is a lightweight web service at the
core of which is the "GetMap" request, which allows the client application to request
an image representation of a specific data layer. Requests can be made from
individual clients such as desktop GIS software, web browsers, as well as other map
servers which might blend data sources from a number of different servers. The Web
Map Context specification was developed by the OGC to formalize how a specific
grouping of one or more maps from one or more map servers can be described in a
portable, platform-independent format. The Styled Layer Descriptor profile of the
Web Map Service (SLD) provides a means of specifying the styling of features
delivered by a WMS using the Symbology Encoding (SE) language. If preservation
of the cartographic representation of a map delivered by a WMS is important then it
may be necessary to preserve the associated SLD (if there is one). WMS tiling efforts
have come as a response to the experience of Google Maps and other commercial map
services, which demonstrated the speed with which static tiled imagery could be
presented in user applications. Efforts have been made to develop a standard
approach to provide access to static map tiles and the OGC have produced a candidate
Web Map Tiling Service (WMTS) Interface Standard.
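Because a WMS map is generated on request, the request parameters are themselves a record of what was shown. The following is a minimal sketch of building a GetMap request URL using the parameter names of WMS version 1.1.1; the function name, server address and layer name are hypothetical, and a real client would first consult the server's capabilities document.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def getmap_url(base_url, layer, bbox, width, height,
               srs="EPSG:4326", fmt="image/png"):
    """Build a WMS 1.1.1 GetMap URL for a single layer with default styling."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer, for illustration only.
url = getmap_url("http://example.org/wms", "elevation",
                 (-180, -90, 180, 90), 800, 400)
```

Recording such a URL alongside the returned image documents the layer, extent and size that were requested, although it does not capture the state of the data on the server at the time.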
3.6.2 Web Feature Services (WFS)
Web Feature Services, which handle vector data, stream the actual data in the form of
GML. WFS, which was first released as a standard in 2002, has not been implemented
on as wide a scale as WMS, partly due to a higher level of complexity. WFS could
potentially be used in the future to automate data harvests, perhaps using
Transactional Web Feature Service (WFS-T) for making updates to a central archive.
3.6.3 Other OGC Web Services
Many other web services specifications have been released by the OGC, including the
Web Coverage Service (WCS), which addresses content such as satellite images,
digital aerial photos, digital elevation data, and other phenomena represented by
values at each measurement point. OGC members are also specifying a variety of
interoperability interfaces and metadata encodings that enable real time integration of
sensor webs into the information infrastructure. In general OGC services will pose
data persistence challenges related to schema evolution, URI/URN persistence and
Due to the ephemeral nature of the data in web services, new challenges in
maintaining data persistence are also created. It might also be argued that the
availability of web services-based access to data has decreased the incentive to
replicate data resources to additional locations that might otherwise retain copies of
the data. Details of all the OGC specifications can be found on the Open Geospatial
Consortium (OGC) website.
3.7 Legal Issues
The legal framework in which geospatial data is made available can cause a
considerable amount of uncertainty, and this may have an impact on the ability to
preserve and make use of geospatial data in the future. Intellectual property rights in
geospatial data are carefully – sometimes aggressively – protected. Most geospatial
data originates with an underlying dataset licensed from a third party – either from a
mapping agency or through a satellite imagery supplier. This means that many
geospatial datasets have an implied dependence on a third party supplier who may
take a view on preservation and access. Consequently, archivists and repository
managers would be well advised to examine the licences under which data is
presented to them. There have been various studies and books written about GIS
legal issues, including a report produced by the JISC-funded GRADE project, which
considered the licensing issues for sharing and re-using geospatial data within the UK
research and education sector.
For example, George Cho, Geographic Information Science: Mastering the Legal Issues