The case
In this post we're going to take a look at spatial ETL. A lot of BI work traditionally exists of getting our data from A to B and transforming it in the process. So what if our source data contains spatial information?
For example, we have two tables in our source database. One is containing 9 maintenance areas that are assigned to a particular supervisor. The other is containing over 3200 incidents that have been reported in the area. Both tables are stored in an Oracle 11g database, contain a spatial column and can be visualised with GIS software to show where they're situated.
|
All incidents |
|
9 maintenance areas, colored by unique area |
Now imagine we want to determine the designated supervisor for each incident to report progress of incident management on a dashboard. No primary-foreign key construction exists between the two tables. There is no other way to link these two objects together than by use of the spatial relationship.
Common ETL tools usually don't know how to treat spatial data columns. But with the right combination of functionality, you can certainly do spatial ETL. In this example, I will be using
Pentaho Data Integration 6.0 (PDI) and the additional
AtolCD Pentaho GIS plugin. The plugin adds two specific improvements to regular Pentaho:
- The ability to treat spatial columns as column type 'geometry' within the ETL.
- A set of transformation steps that enhance Pentaho with geospatial operations.
The solution path
I've created a simple Pentaho-transformation to combine the incidents and maintenance area. As you will notice the transformation looks like any other, with the exception of a couple of geospatial steps that will be explained further on. All steps in the transformation shown below are referred to in the text by use of double quotes and starting with a capital (e.g. "Table incidents").
|
PDI transformation with added geospatial steps. |
We see that both tables are loaded with two standard table input steps, "Table incidents" and "Table maintenance areas", resulting in two data streams. When we examine the output of these steps, we see the following data.
|
Incidents table preview |
The incident table contains about 3200 incidents that each have an unique ID and an incident location. The maintenance area table contains 9 maintenance areas, that all have a name, a supervisor name and an area location.
|
Maintenance areas table preview |
Notice how the columns INCIDENT_LOCATION and MAINTENANCE_AREA_LOCATION contain some kind of coordinate definition? We can clearly see one contains points and the other contains polygons. The added geospatial functionality enables PDI to show the contents of the geospatial columns in readable text. We can see the field types when examining the output fields for the two table input steps; the location fields are marked as type 'Geometry'.
|
Output fields for the incidents stream |
|
Output fields for the maintenance area stream |
The text "SRID=28992" tells us that the spatial data is in a Dutch coordinate system. It is important that the data from both geometry columns is in the same coordinate system. Otherwise it isn't possible to relate the two columns. We verify the coordinate system with a "Geometry information" step for each datastream. This step is part of the additional steps added by the GIS plugin.
The next step is to join both data streams based on a "Cartesian join". This step doesn't yet recognize the spatial relationship. It will simply join each record from one stream with each record from the other. For this example, this means that because we started with 3000 incidents and 9 maintenance area, the resulting data stream will contain about 27000 records.
|
Cartesian join result |
|
|
|
|
|
In the join result we can clearly see that incident ID no 1012 is joined to all 9 area names. But an incident can't be in all 9 maintenance areas at the same time. We wanted to identify the maintenance area for each incident, remember?
As a result, the datastream now contains two geometry columns. We need to filter the datastream based on the spatial relationship. For this, we use the step called "Determine spatial relationship", which is also part of the GIS plugin.
|
Define the spatial relationship between incidents and maintenance area. |
In this step we can define the spatial relationship between the two geometry columns. The step supports a lot of different spatial relationships that can be described to relate two objects by use of location. The one we're looking for is simply called 'Within', as we want to identify each incident within each maintenance area. The spatial analysis settings shown above can be summarized as a select statement that selects the records where 'INCIDENT_LOCATION is within MAINTENANCE_AREA_LOCATION' evaluates to TRUE. When this step is run, only the records where this statement is true are kept in the resulting datastream.
|
Incidents within their maintenance area. |
As the image above shows, the resulting datastream contains one record per incident ID with the corresponding maintenance area information. However, the datastream still contains two geometry columns. This is no problem for the database table we're about to create, but most common GIS software only supports one geometry type per data layer. So we make a selection of the output fields with the "Select values" step, assuring that the resulting table only contains the incidents point geometry attribute, combined with ID, area name and supervisor name.
|
Select the necessary fields |
After this, the final step is to write the resulting data to a new database table with the "Table output" step.
(Another approach could be to add the maintenance area and supervisor
name to the existing incidents table; in that scenario we might consider
using an "Insert/update" step.) Of course, the receiving database must support geometry type database fields.
|
Output the table to the database |
Now, when we visualize the resulting dataset with GIS software, we can see that each incident is nicely matched with its corresponding maintenance area by color.
|
Incidents colored by maintenance area |
Conclusion
Spatial relationships allow us to define relationships between objects that have no other relation to each other than the locational one. Extending standard ETL tools with geospatial capabilities is a very cool way to make use of the specific advantage of geospatial data.
Granted, there are some caveats in the use of spatial properties in ETL tooling.
- The AtolCD Pentaho GIS plugin
used for PDI supports Oracle Spatial and PostgreSQL/PostGIS geometry
(and also MySQL, but this I haven't tested), but lacks support for MS
SQL Server geometry.
- Geospatial data can get quite large in
size. Substantially larger datasets
like cadastral parcel information (400.000+ records) that are joined by a
cartesian join will result in billions of records in the
datastream. This might hamper performance but I haven't tested yet.
- I have yet to try this in other ETL tools like SSIS or Talend.
Still, I'm confident these issues will eventually get resolved.