and How Can it Turn You into a Geospatial Rock Star?
Imagine a world where there was only one data type for 3D, 2D and point clouds. Wouldn’t it be bliss?
I have to confess that about 30 percent of my time, when I work in GIS, is spent working between different data formats…a DWG from the architect, an RCP from the engineer, an SHP from the council, and then the client wants a realistic 3D representation of their new building. Enter ETL.
Featured image: Screenshots of all three ETL tools, FME (left), GDAL (top) and the ArcGIS Pro Interoperability tool (center).
What is ETL?
ETL stands for Extract, Transform and Load (you may also see GTL, for Geospatial Transform and Load). It usually refers to a tool or library used to move data between these many formats without too much hindrance. Sometimes it is integrated into the software itself, though this is normally done by leveraging a third-party library. Before we get stuck into the jargon, let’s delve into why we need it.
When creating software, the focus is primarily on the business need and then the deliverable. In the early stages you are more focused on getting the application out to consumers so that they can use it and play with it. When we look at many of the big players in geospatial, most of their systems were designed over 20 years ago, when the words “GeoPackage” and “GeoJSON” had not even been thought of. In fact, geospatial standards bodies such as the OGC (Open Geospatial Consortium) did not have the reach they have had over the last 10 years. Therefore, the system designers had to create their own data formats to ensure the system could do everything required, and do it efficiently; hence some of the formats we have now, like shapefile, KML and DWG.
ETL: The backstory
It wasn’t until around the year 2000 that we had ways of working between some of these different data formats in a stable, reliable and supported fashion. One was FME by Safe Software and the other was GDAL. FME, which stands for Feature Manipulation Engine, is a whole software package with a great interface that lets you put together building blocks, like a flowchart, both to automate some of the work and to translate data from one format to another. It is proprietary, but a business case can easily be written based on the amount of time saved by automating work. The other tool, GDAL (Geospatial Data Abstraction Library), is an open-source library that comes with a set of command-line utilities and converts data extremely efficiently.
Types of ETL
These two tools are so good that they have been absorbed into the two main GIS packages, Esri’s ArcGIS and QGIS. ArcGIS Pro, by Esri, uses FME for its “Data Interoperability” geoprocessing tool. If you have an FME license or are licensed to use the tool through Esri, you can convert all the geospatial formats available within FME through the Esri software. Although it is frustrating to have to pay for two pieces of software to do one job, it is extremely powerful and allows the user to work with almost any data.
QGIS, however, integrates GDAL (purists may call this GDAL/OGR) into its core, so right out of the box it is all open source. Although there aren’t quite as many formats available as in FME (approximately 80 percent), you can use the translation in almost all parts of the software, rather than it being an extra tool. This means you can import GeoJSON, shapefiles, GeoPackages and another 66 vector formats (to date), even though QGIS’s native data format is SpatiaLite (SQLite).
Without going off track, many of the QGIS users I talk with use QGIS for simple GIS tasks and for its ability to work with many different formats. It is quite common to find GIS users running both QGIS and Esri’s ArcGIS on the same machine (and no, they don’t affect one another).
As stated earlier, FME and GDAL also run independently. FME is a stand-alone software package which provides a complete solution for translating most geospatial and other data formats. The interface is quite technical but opens up a whole world of potential, with the capability to scrape websites and automatically download and translate data through flowchart-type models. Many companies use it to automate and translate data. I have personally worked with a few companies where it is integrated into the core of the infrastructure.
GDAL is run from the command line and requires a little knowledge of its syntax. Because there is no GUI, processing is much faster and more efficient. It feels extremely powerful. For more advanced users, it is possible to put together batch files which automate a few of the tasks and, with a little Python knowledge, whole workflows can be achieved, from download to conversion to load. Unfortunately, there is a bit of a learning curve, and if you aren’t a command-line type of person it can be a little intimidating, especially when you get errors which don’t make sense. I tend to find that the syntax for translation is a little back-to-front, but copy and paste is your friend…and there is a lot of useful help on the web.
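To make the “back-to-front” syntax concrete, here is a minimal Python sketch that only assembles an ogr2ogr command, with the destination listed before the source. The file names are hypothetical, and the helper function is my own illustration rather than part of GDAL; it assumes ogr2ogr is installed if you actually want to run the result.

```python
import shutil


def build_ogr2ogr_command(src, dst, driver):
    """Build the argument list for a simple ogr2ogr format conversion.

    ogr2ogr takes the destination file first and the source file second,
    which is the "back-to-front" ordering that trips people up.
    """
    return ["ogr2ogr", "-f", driver, dst, src]


if __name__ == "__main__":
    # Hypothetical file names, used purely for illustration.
    cmd = build_ogr2ogr_command(
        "council_boundaries.shp", "council_boundaries.gpkg", "GPKG"
    )
    print(" ".join(cmd))

    # Report whether ogr2ogr is on the PATH; nothing is executed here.
    if shutil.which("ogr2ogr") is None:
        print("ogr2ogr was not found on this machine.")
```

Passing the resulting list to `subprocess.run(cmd, check=True)` inside a loop over files is one way to grow this into the kind of batch workflow described above.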
Both tools have other benefits over and above the ETL/GTL capabilities I’ve described. In both tools there is the ability to alter and analyse the data. You can see in the FME screenshot that there is the ability to transform the data and add 3D extrusion or convert the color, as well as hundreds of other manipulation tools. With GDAL you can convert coordinate systems, warp and dissolve, as well as many other operations.
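As a rough sketch of the coordinate system conversion mentioned above, the snippet below assembles two GDAL commands using the `-t_srs` (target spatial reference) option: ogr2ogr for vector data and gdalwarp for rasters. The helper names and file names are hypothetical, and the commands are printed rather than executed.

```python
def reproject_vector_command(src, dst, driver, target_srs):
    """ogr2ogr: reproject vector data (destination first, then source)."""
    return ["ogr2ogr", "-f", driver, "-t_srs", target_srs, dst, src]


def reproject_raster_command(src, dst, target_srs):
    """gdalwarp: reproject raster data (source first, then destination)."""
    return ["gdalwarp", "-t_srs", target_srs, src, dst]


if __name__ == "__main__":
    # Hypothetical file names; EPSG:4326 is WGS 84 latitude/longitude.
    print(" ".join(reproject_vector_command(
        "parcels.shp", "parcels_wgs84.gpkg", "GPKG", "EPSG:4326")))
    print(" ".join(reproject_raster_command(
        "elevation.tif", "elevation_wgs84.tif", "EPSG:4326")))
```

Note that the two utilities order their file arguments differently, which is worth double-checking before copying a command from one to the other.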
There are, of course, other methods of translating geospatial data using Python and/or individual libraries, but for your day-to-day geospatial needs, you can’t go wrong with picking one of these methods.
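To give a flavour of the plain Python route, here is a small sketch of the three ETL steps using only the standard library: it reads points from CSV text, turns them into GeoJSON features and writes them out as text. The column names (`name`, `lon`, `lat`) and the sample rows are made up for illustration, and real data would need validation that this sketch leaves out.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: turn each row into a GeoJSON point feature."""
    features = []
    for row in rows:
        # Everything except the coordinate columns becomes a property.
        props = {k: v for k, v in row.items() if k not in ("lon", "lat")}
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [float(row["lon"]), float(row["lat"])],
            },
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


def load(collection):
    """Load: serialise the feature collection as GeoJSON text."""
    return json.dumps(collection, indent=2)


# Made-up sample data with assumed column names.
SAMPLE = "name,lon,lat\nTown Hall,-1.5491,53.8008\nLibrary,-1.5480,53.8001\n"

if __name__ == "__main__":
    print(load(transform(extract(SAMPLE))))
```

For anything beyond simple points, FME or GDAL will handle the edge cases (coordinate systems, encodings, geometry types) that a hand-rolled script like this ignores.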