dinsdag 5 april 2016

Geography in 3NF normalized source systems

We're going to look for geography in source systems in this post. When designing a BI solution, most data will be stored in one or more source systems that follow a 3NF relational model stored in a database. It is fairly uncommon for such source systems to have true geographical information in geography type database fields, but that doesn't mean these sources don't contain geographical information.

Let's take for example the Sakila database model for MySQL (the original data model by Matt Casters, Jos van Dongen & Roland Bouman can be found here). This very well normalized sample database shows us three entities that are likely to have a geographical component. The addresses, that are represented with point geometries, and the city and country that are polygon geometries with their respective city limits and country borders. Address, city and country geometries are easily obtained through a geocoding service (see my previous post about geometry types).

Geometry attributes added to address, city and country. Geographical relationships in red.
The relations between address, city and country in this data model are stored with primary-foreign key relationships. However, adding geometry to these entities allows us to perform additional analysis on the data model. In the Sakila data model we can also see that three entities, namely store, customer and staff, are related to the address entity. By adding geometry to the address we can analyse which store the customers live closest to, and compare this to the store in which they rent the most movies at.


Geographical relation between store and customer, based on address geography.
We might learn that some customers rather go to a store further away from their home and try to explain this with other data like sales data. Additionaly, plotting these customer addresses and their relation to the stores frequently visited on a map can give us a clear indication about 'blind spots' in our store coverage.

So at this point we really only have to decide how to incorporate the geographical information in our design. If we're building a datawarehouse solution, we might consider storing the geographical attribute of the addresses, cities and countries in our datawarehouse on a regular basis. 

In reality, source systems are hardly ever properly normalized like the example above. Very often we find address information is scattered over several tables in the data model. For instance, the customer, store and staff tables might all contain it's own address attributes. For geographical analysis, this doesn't make much difference. We can still perform analysis like described above.

The geographical address entity is very easy to spot in source systems, however there are more entities that can identified as having a geographical attribute. Not so much in financial systems, but latent geography can be found in source systems at organisations that have their operations taking place in the public space.

The 'geodatabase'

Organisations that handle a lot of geographical data often have a separate GIS department. This GIS department will most likely have their information stored in a dedicated geographically enabled database they call the 'geodatabase'. But this geodatabase has little of the characteristics you can expect of a BI database. GIS systems usually don't require any 3NF datamodel to be used in a database environment. Instead, GIS systems are equipped with functionality to add and remove tables with geodata to any database that supports geography.

Let's take a look at the image below. It shows the datamodel of a particular schema from our geodatabase. Notice the lack of any relationships between the tables.
One of the schemas in our 'geodatawarehouse'. No key-based relationships between tables.

In traditional 3NF systems you will find relationships between tables based on keys (PK/FK, many-to-many etc). In a geodatabase, every table is related through the spatial reference of the geographical attribute, as explained in this previous post. In the above datamodel, each table has an attribute named GEOMETRY (with type OTHER because SQL PowerArchitect doesn't support geometry types) which relates it to every other table in the database. Seen from one of the tables, these relationships would look something like this:
Table MINSET_STUW is related to all other tables (ánd selfreferencing) through the geometry attribute.
However, if we would draw relationships for all tables it might look something like this:
All tables related to all other tables
You can imagine that in a geodatabase with between 200 and 1000 separate tables this would generate an unreadable datamodel.

In essence, any geodatabase is mainly a storage platform for every table related to geography. From a BI standpoint it should be treated as a source system, albeit a strange one. As I mentioned earlier, a very specific property of the geodatabase is that GIS folk tend to add and remove tables to/from the geodatabase at any given time. Because no table is explicitly related to anything, this poses no problems from a GIS standpoint. But it will certainly hurt your BI deliverables if your carefully crafted report, that is depending on geographical relations, suddenly isn't working anymore because a table has been removed from the source system.

The hybrid database

So far we've discussed source systems that are either purely administrative or purely geographical in nature. There is a third flavour source systems that are made of a normalized 3NF data model ánd contain geographical attributes in the database tables. These hybrid databases are extremely rare. If you encounter one you can safely use it for your BI solution as it will offer you the best of both worlds:
  • a tightly managed data model
  • integrated geographical attributes
Feel free to discuss, and if you have any questions please let me know.

Geen opmerkingen:

Een reactie posten