By Norman Barker
Geospatial data serves as the foundation for many mission-critical and time-sensitive applications, including Earth observation, location-based services, defense, population health, and more. However, geospatial data is not just one thing; it is quite heterogeneous, coming in many different forms, including point clouds (e.g., lidar and sonar), polygons (e.g., buildings and areas of interest), and rasters.
Like other forms of data, geospatial data becomes inherently richer the more there is and the more it’s analyzed in aggregate. But the cost and complexity of managing all these different data types can be quite steep.
Consider collaborative projects that amalgamate data from many different partners around the world. Such projects demand a highly efficient data storage mechanism that can overcome several challenges, including the fact that the different parties involved may each be modifying the data and sharing it with others.
This constant sharing and duplication can make it difficult to keep track of the original source and create data governance risks. In addition, underlying data formats may not be well supported, and existing APIs often don’t work well with object stores, which can render the data effectively unusable and unable to be included in larger initiatives.
There are additional technical challenges involved with storing and managing massive amounts of geospatial data, including:
Initiating and managing clusters at scale to enable large-scale data processing in parallel. It can be very tricky to allocate resources efficiently in a cluster: if you under-provision, jobs are likely to run far too slowly or even crash, but if you over-provision, you waste resources.
Installing, configuring, and maintaining spatial relational databases. These databases allow efficient spatial data storage and retrieval through optimized indexing mechanisms: spatial indexing speeds up query processing by organizing data based on its geometric properties (a brief indexing sketch follows this list). However, installing and configuring them properly, along with their ongoing maintenance, is a significant challenge, and many organizations do not have the resources for it.
Bringing together disparate geospatial pipelines and making sure the data is complete and accurate. Geospatial data taken straight from the wild usually contains a lot of errors and may not be in its best shape. This is why roughly 80 percent to 90 percent of data scientists’ time is spent cleaning their data and preparing it for merging. You cannot expect your analysis to be accurate unless your data is in pristine condition. It’s a case of “garbage in, garbage out.”
Supporting diverse data modalities, which helps today’s data teams avoid having to rely on different, bespoke databases to manage different types of data. It can take an inordinately long time, sometimes months or more, to knit different systems together in a way that yields meaningful insights. Support for diverse data types is crucial for overlays, the process of superimposing layers of geographic data that cover the same area in order to study and understand the relationships between them.
An example is land-suitability studies, where certain parcels of land are assigned ratings based on vegetation type, soil type, and proximity to flood plains. These “layers” can be superimposed to create a new layer that combines all of this information into a comprehensive rating showing the most suitable areas for development based on an overall understanding of land characteristics.
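To make the overlay idea concrete, here is a minimal land-suitability sketch using GeoPandas. The library choice, file names, and rating columns are illustrative assumptions rather than anything prescribed above.

```python
# A minimal land-suitability overlay sketch using GeoPandas (an assumption:
# no specific library is prescribed above). File names and rating columns
# are hypothetical.
import geopandas as gpd

# Each layer covers the same area and carries a per-polygon rating.
vegetation = gpd.read_file("vegetation.geojson")  # columns: geometry, veg_rating
soils = gpd.read_file("soils.geojson")            # columns: geometry, soil_rating
flood = gpd.read_file("flood_zones.geojson")      # columns: geometry, flood_rating

# Intersect the layers so every output parcel carries all three attributes.
combined = gpd.overlay(vegetation, soils, how="intersection")
combined = gpd.overlay(combined, flood, how="intersection")

# Combine the ratings into a single suitability score (weights are illustrative).
combined["suitability"] = (
    combined["veg_rating"] + combined["soil_rating"] + combined["flood_rating"]
)

# The most suitable parcels for development.
print(combined.nlargest(5, "suitability")[["suitability", "geometry"]])
```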
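And to illustrate the spatial indexing point from the database challenge above, here is a small sketch using Shapely’s STRtree, one common in-memory spatial index; the text above names no particular library, and the point data here is synthetic.

```python
# A minimal sketch of why spatial indexing matters: queries touch only the
# candidates whose bounding boxes intersect the query window, instead of
# scanning every geometry. Shapely's STRtree stands in for the index a
# spatial database would maintain.
import numpy as np
from shapely.geometry import Point, box
from shapely.strtree import STRtree

# 100,000 random points standing in for a large geometry set.
rng = np.random.default_rng(0)
points = [Point(x, y) for x, y in rng.uniform(0, 1000, size=(100_000, 2))]

# Build the index once; subsequent window queries are cheap.
tree = STRtree(points)
window = box(100, 100, 110, 110)
hits = tree.query(window)  # indices of candidate points (Shapely 2.x)

print(f"{len(hits)} candidate points fall in the 10 x 10 query window")
```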
A New Approach to Managing Geospatial Data
Given the mission-critical and time-sensitive nature of the applications being supported, there needs to be an easier way to handle all this geospatial data: a single, unified solution that manages the geospatial data objects along with the raw original data (e.g., images and text files), the machine learning embedding models, and all other data modalities in an application (tables, rasters, point clouds, etc.).
Moreover, this unified database need not be only a geospatial database. It can be one place to store, manage, and analyze all geospatial, tabular, and ML data, as well as files.
However, when it comes to geospatial data specifically, providing an efficient mechanism for querying and storing geometries at a grand scale (hundreds of billions of points and polygons) is particularly important. “Geometries” are the points, lines, and polygons that represent real-world geographic objects, and supporting them directly gives users simpler workflows for processing their geospatial data.
In addition to supporting all data modalities, a unified solution also includes code and compute, so processing happens right next to the data, reducing egress and downloads. This is especially critical for geometries and point clouds, which often number in the billions and require fast query response times.
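The point above is about colocating compute with the data. As a narrower, client-side illustration of keeping downloads small, a spatial filter can be pushed into the read itself so that only the queried window crosses the network; the URL below is hypothetical, and formats such as FlatGeobuf carry a spatial index that makes these partial reads possible.

```python
# Read only the features inside an area of interest rather than downloading
# the whole (potentially huge) dataset. The remote URL is hypothetical.
import geopandas as gpd

aoi = (-122.52, 37.70, -122.35, 37.83)  # minx, miny, maxx, maxy
buildings = gpd.read_file(
    "https://example.com/data/buildings.fgb",  # hypothetical remote dataset
    bbox=aoi,
)
print(len(buildings), "features downloaded for the area of interest")
```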
Advances in serverless computing make it possible to ingest huge geospatial datasets in parallel, in just a few minutes and at reduced cost. Coupling serverless compute and code with the actual data increases performance while reducing cost and time to insight.
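As a sketch of what parallel ingestion can look like, the snippet below fans independent partitions out to workers. No serverless platform is named above, so a local process pool stands in for serverless function invocations; ingest_partition, the file list, and the Parquet target are hypothetical.

```python
# Parallel ingestion sketch: each partition is independent, so all of them
# can be ingested at once. On a serverless platform, each submitted task
# would map to its own function invocation rather than a local process.
from concurrent.futures import ProcessPoolExecutor, as_completed

import geopandas as gpd


def ingest_partition(path: str) -> int:
    """Read one source partition, write it to shared storage, return its row count."""
    gdf = gpd.read_file(path)
    gdf.to_parquet(path.replace(".geojson", ".parquet"))  # hypothetical target
    return len(gdf)


if __name__ == "__main__":
    # Hypothetical list of independent source partitions.
    source_files = [f"tiles/part-{i:05d}.geojson" for i in range(256)]

    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(ingest_partition, p) for p in source_files]
        total = sum(f.result() for f in as_completed(futures))

    print(f"Ingested {total} features across {len(source_files)} partitions")
```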
The future of geospatial data requires a new approach to data management in the form of a multimodal, cloud-native database. The benefits of such an approach include efficiency and high performance, an easier path to better data quality, and a simpler way to superimpose layers. This will be the key to tapping into various modalities, along with geometry support, to pose completely new questions and achieve so-called “full picture” views never before possible.
Norman Barker is vice president of geospatial at TileDB.