edwinvanvugt: april 2016

The case

In this post we're going to take a look at spatial ETL. A lot of BI work traditionally exists of getting our data from A to B and transforming it in the process. So what if our source data contains spatial information?

For example, we have two tables in our source database. One is containing 9 maintenance areas that are assigned to a particular supervisor. The other is containing over 3200 incidents that have been reported in the area. Both tables are stored in an Oracle 11g database, contain a spatial column and can be visualised with GIS software to show where they're situated.

All incidents

9 maintenance areas, colored by unique area

Now imagine we want to determine the designated supervisor for each incident to report progress of incident management on a dashboard. No primary-foreign key construction exists between the two tables. There is no other way to link these two objects together than by use of the spatial relationship.

Common ETL tools usually don't know how to treat spatial data columns. But with the right combination of functionality, you can certainly do spatial ETL. In this example, I will be using Pentaho Data Integration 6.0 (PDI) and the additional AtolCD Pentaho GIS plugin. The plugin adds two specific improvements to regular Pentaho:

The ability to treat spatial columns as column type 'geometry' within the ETL.
A set of transformation steps that enhance Pentaho with geospatial operations.

The solution path

I've created a simple Pentaho-transformation to combine the incidents and maintenance area. As you will notice the transformation looks like any other, with the exception of a couple of geospatial steps that will be explained further on. All steps in the transformation shown below are referred to in the text by use of double quotes and starting with a capital (e.g. "Table incidents").

PDI transformation with added geospatial steps.

We see that both tables are loaded with two standard table input steps, "Table incidents" and "Table maintenance areas", resulting in two data streams. When we examine the output of these steps, we see the following data.

Incidents table preview

The incident table contains about 3200 incidents that each have an unique ID and an incident location. The maintenance area table contains 9 maintenance areas, that all have a name, a supervisor name and an area location.

Maintenance areas table preview

Notice how the columns INCIDENT_LOCATION and MAINTENANCE_AREA_LOCATION contain some kind of coordinate definition? We can clearly see one contains points and the other contains polygons. The added geospatial functionality enables PDI to show the contents of the geospatial columns in readable text. We can see the field types when examining the output fields for the two table input steps; the location fields are marked as type 'Geometry'.

Output fields for the incidents stream

Output fields for the maintenance area stream

The text "SRID=28992" tells us that the spatial data is in a Dutch coordinate system. It is important that the data from both geometry columns is in the same coordinate system. Otherwise it isn't possible to relate the two columns. We verify the coordinate system with a "Geometry information" step for each datastream. This step is part of the additional steps added by the GIS plugin.

The next step is to join both data streams based on a "Cartesian join". This step doesn't yet recognize the spatial relationship. It will simply join each record from one stream with each record from the other. For this example, this means that because we started with 3000 incidents and 9 maintenance area, the resulting data stream will contain about 27000 records.


Cartesian join result

In the join result we can clearly see that incident ID no 1012 is joined to all 9 area names. But an incident can't be in all 9 maintenance areas at the same time. We wanted to identify the maintenance area for each incident, remember?

As a result, the datastream now contains two geometry columns. We need to filter the datastream based on the spatial relationship. For this, we use the step called "Determine spatial relationship", which is also part of the GIS plugin.

Define the spatial relationship between incidents and maintenance area.

In this step we can define the spatial relationship between the two geometry columns. The step supports a lot of different spatial relationships that can be described to relate two objects by use of location. The one we're looking for is simply called 'Within', as we want to identify each incident within each maintenance area. The spatial analysis settings shown above can be summarized as a select statement that selects the records where 'INCIDENT_LOCATION is within MAINTENANCE_AREA_LOCATION' evaluates to TRUE. When this step is run, only the records where this statement is true are kept in the resulting datastream.

Incidents within their maintenance area.

As the image above shows, the resulting datastream contains one record per incident ID with the corresponding maintenance area information. However, the datastream still contains two geometry columns. This is no problem for the database table we're about to create, but most common GIS software only supports one geometry type per data layer. So we make a selection of the output fields with the "Select values" step, assuring that the resulting table only contains the incidents point geometry attribute, combined with ID, area name and supervisor name.

Select the necessary fields

After this, the final step is to write the resulting data to a new database table with the "Table output" step. (Another approach could be to add the maintenance area and supervisor name to the existing incidents table; in that scenario we might consider using an "Insert/update" step.) Of course, the receiving database must support geometry type database fields.

Output the table to the database

Now, when we visualize the resulting dataset with GIS software, we can see that each incident is nicely matched with its corresponding maintenance area by color.

Incidents colored by maintenance area

Conclusion

Spatial relationships allow us to define relationships between objects that have no other relation to each other than the locational one. Extending standard ETL tools with geospatial capabilities is a very cool way to make use of the specific advantage of geospatial data.

Granted, there are some caveats in the use of spatial properties in ETL tooling.

The AtolCD Pentaho GIS plugin used for PDI supports Oracle Spatial and PostgreSQL/PostGIS geometry (and also MySQL, but this I haven't tested), but lacks support for MS SQL Server geometry.
Geospatial data can get quite large in size. Substantially larger datasets like cadastral parcel information (400.000+ records) that are joined by a cartesian join will result in billions of records in the datastream. This might hamper performance but I haven't tested yet.
I have yet to try this in other ETL tools like SSIS or Talend.

Still, I'm confident these issues will eventually get resolved.

We're going to look for geography in source systems in this post. When designing a BI solution, most data will be stored in one or more source systems that follow a 3NF relational model stored in a database. It is fairly uncommon for such source systems to have true geographical information in geography type database fields, but that doesn't mean these sources don't contain geographical information.

Let's take for example the Sakila database model for MySQL (the original data model by Matt Casters, Jos van Dongen & Roland Bouman can be found here). This very well normalized sample database shows us three entities that are likely to have a geographical component. The addresses, that are represented with point geometries, and the city and country that are polygon geometries with their respective city limits and country borders. Address, city and country geometries are easily obtained through a geocoding service (see my previous post about geometry types).

Geometry attributes added to address, city and country. Geographical relationships in red.

The relations between address, city and country in this data model are stored with primary-foreign key relationships. However, adding geometry to these entities allows us to perform additional analysis on the data model. In the Sakila data model we can also see that three entities, namely store, customer and staff, are related to the address entity. By adding geometry to the address we can analyse which store the customers live closest to, and compare this to the store in which they rent the most movies at.

Geographical relation between store and customer, based on address geography.

We might learn that some customers rather go to a store further away from their home and try to explain this with other data like sales data. Additionaly, plotting these customer addresses and their relation to the stores frequently visited on a map can give us a clear indication about 'blind spots' in our store coverage.

So at this point we really only have to decide how to incorporate the geographical information in our design. If we're building a datawarehouse solution, we might consider storing the geographical attribute of the addresses, cities and countries in our datawarehouse on a regular basis.

In reality, source systems are hardly ever properly normalized like the example above. Very often we find address information is scattered over several tables in the data model. For instance, the customer, store and staff tables might all contain it's own address attributes. For geographical analysis, this doesn't make much difference. We can still perform analysis like described above.

The geographical address entity is very easy to spot in source systems, however there are more entities that can identified as having a geographical attribute. Not so much in financial systems, but latent geography can be found in source systems at organisations that have their operations taking place in the public space.

The 'geodatabase'

Organisations that handle a lot of geographical data often have a separate GIS department. This GIS department will most likely have their information stored in a dedicated geographically enabled database they call the 'geodatabase'. But this geodatabase has little of the characteristics you can expect of a BI database. GIS systems usually don't require any 3NF datamodel to be used in a database environment. Instead, GIS systems are equipped with functionality to add and remove tables with geodata to any database that supports geography.

Let's take a look at the image below. It shows the datamodel of a particular schema from our geodatabase. Notice the lack of any relationships between the tables.

One of the schemas in our 'geodatawarehouse'. No key-based relationships between tables.

In traditional 3NF systems you will find relationships between tables based on keys (PK/FK, many-to-many etc). In a geodatabase, every table is related through the spatial reference of the geographical attribute, as explained in this previous post. In the above datamodel, each table has an attribute named GEOMETRY (with type OTHER because SQL PowerArchitect doesn't support geometry types) which relates it to every other table in the database. Seen from one of the tables, these relationships would look something like this:

Table MINSET_STUW is related to all other tables (ánd selfreferencing) through the geometry attribute.

However, if we would draw relationships for all tables it might look something like this:

All tables related to all other tables

You can imagine that in a geodatabase with between 200 and 1000 separate tables this would generate an unreadable datamodel.

In essence, any geodatabase is mainly a storage platform for every table related to geography. From a BI standpoint it should be treated as a source system, albeit a strange one. As I mentioned earlier, a very specific property of the geodatabase is that GIS folk tend to add and remove tables to/from the geodatabase at any given time. Because no table is explicitly related to anything, this poses no problems from a GIS standpoint. But it will certainly hurt your BI deliverables if your carefully crafted report, that is depending on geographical relations, suddenly isn't working anymore because a table has been removed from the source system.

The hybrid database

So far we've discussed source systems that are either purely administrative or purely geographical in nature. There is a third flavour source systems that are made of a normalized 3NF data model ánd contain geographical attributes in the database tables. These hybrid databases are extremely rare. If you encounter one you can safely use it for your BI solution as it will offer you the best of both worlds:

a tightly managed data model
integrated geographical attributes

Feel free to discuss, and if you have any questions please let me know.

edwinvanvugt

maandag 18 april 2016

Using spatial relationships in ETL

The case

The solution path

Conclusion

dinsdag 5 april 2016

Geography in 3NF normalized source systems

The 'geodatabase'

The hybrid database