Thursday, 22 December 2016

Open data is the new solar energy

Data is the new oil. This phrase has been with us for several years now; just googling the term yields about 197 million hits. Data fuels our digital economy, making it a valuable asset for anyone able to leverage its potential. Right now, organizations everywhere are improving their data infrastructure, trying to gain an advantage over their competitors.

But viewing data as the new oil has a downside. These initiatives are mainly aimed inwards, keeping data within the confines of the company. Like oil, data benefits the large players who own it the most: it requires vast investments in acquiring, transporting, processing and refining data before it truly benefits an organization. An organization that isn't able to produce its own data might buy it from a data vendor, but that makes it dependent on that vendor.

Open data, on the other hand, provides a level playing field and can advance society as a whole. Open data, like sunlight, is democratic and can be captured by anyone, companies and citizens alike. It can serve both big and small data needs: you don't need big investments in software to profit from available open data sources. Open data spurs new initiatives that can become self-sustaining.

If data is the new oil, then open data is the new solar energy. And we need to move to sustainable data sources quickly.

Source: United Nations Photo, https://www.flickr.com/photos/un_photo/5105193914/

Tuesday, 20 December 2016

Spatial business keys

In the previous examples we've considered the location of a business object as just another one of its attributes. Most geographical data sources are indeed fitted with a proper business key. Business datasets like cadastral parcel registrations, addresses and assets such as watercourses and culverts contain a unique attribute that is used in natural business language.

Sometimes, however, a geographical dataset doesn't seem to contain a clear natural key. A technical key is surely present (because most GIS-oriented software requires one), but it is mostly meaningless to the business. In business discussions, these objects are instead referenced by their location: people literally point at a map to identify the object.

This makes me wonder: can location be the natural key of an object, and therefore a candidate business key for a hub entity?

To answer this question, we need to take a look at the definition of the natural business key. There are several different definitions for the business key, but they all share the same characteristics. Business keys:
  • are unique for the object (at least within the dataset, preferably enterprise-wide, ideally worldwide)
  • exist in the real world
  • have meaning for the business users
With these criteria in mind, let's take a look at some examples.

Example #1: local zoning ordinance plans


Dutch municipalities regulate building activities by issuing zoning ordinance plans. All zoning ordinance plans are published in a nationwide interactive webservice on the Dutch national open webservices portal www.pdok.nl. The webservice used is https://www.pdok.nl/nl/service/wms-ruimtelijke-plannen.

Each zone in a plan defines which activities are allowed within the zone, like opening a shop or building a new home. A zoning ordinance plan consists of a map and a set of zoning regulations. Each polygon on the map refers to any number of these regulations. An extract of the map may look like this.

Example of local zoning ordinance zones

In the picture we can see various zones with the same legend color. Two of the yellow zones with designation 'urban housing' (in Dutch: 'wonen') have been labeled for the example. If we look at the attributes for these two zones, we see the following:

Results for FeatureType 'http://plu.geonovum.nl:Enkelbestemming':
--------------------------------------------
bestemmingshoofdgroep = wonen
historisch = false
hoofdfuncties = null
identificatie = NL.IMRO.0846.EP30158152437-00
naam = Wonen
ondergeschiktefuncties = null
plangebied = NL.IMRO.0846.BP2008BOS01Boskant-oh02
planstatus = onherroepelijk
typeplan = bestemmingsplan
versieimro = IMRO2008
verwijzingnaarobjectgerichteteksturl = null
verwijzingnaarteksturl = http://ruimtelijkeplannen.nl/documents/NL.IMRO.0846.BP2008BOS01Boskant-oh02/r_NL.IMRO.0846.BP2008BOS01Boskant-oh02_2.13.html
dossierid = NL.IMRO.0846.BP2008BOS01Boskant
dossierstatus = geheel onherroepelijk in werking
datum = 2012-03-14
verwijzingnaarexternplan = null
geometrie = [GEOMETRY (Polygon) with 8 points]
--------------------------------------------
 
Results for FeatureType 'http://plu.geonovum.nl:Enkelbestemming':
--------------------------------------------
bestemmingshoofdgroep = wonen
historisch = false
hoofdfuncties = null
identificatie = NL.IMRO.0846.EP34158152442-00
naam = Wonen
ondergeschiktefuncties = null
plangebied = NL.IMRO.0846.BP2008BOS01Boskant-oh02
planstatus = onherroepelijk
typeplan = bestemmingsplan
versieimro = IMRO2008
verwijzingnaarobjectgerichteteksturl = null
verwijzingnaarteksturl = http://ruimtelijkeplannen.nl/documents/NL.IMRO.0846.BP2008BOS01Boskant-oh02/r_NL.IMRO.0846.BP2008BOS01Boskant-oh02_2.13.html
dossierid = NL.IMRO.0846.BP2008BOS01Boskant
dossierstatus = geheel onherroepelijk in werking
datum = 2012-03-14
verwijzingnaarexternplan = null
geometrie = [GEOMETRY (Polygon) with 10 points]
--------------------------------------------


The attributes of these yellow-colored polygons are all equal except for two columns: 'identificatie' and 'geometrie', the latter containing the geometry. The first attribute, 'identificatie', contains what looks like a business key but is in reality a composite technical key with business aspects: the first part references the country ('NL'), domain ('IMRO') and plan ('0846'), while the last part ('EP30158152437-00') is essentially a sequence.

What makes each ordinance zone truly unique is its location, captured in the second attribute 'geometrie'. Without the geometry of the zones, the zones don't even exist and the 'identificatie' attribute is useless. This might make the geometry of the zones the true business key of this dataset. Let's check how it holds up to the three characteristics of a business key:
  • The geometry is unique for each zone. There is no overlap between zones.
  • Each zone references real public space. You can live and work in a zone. If you want to change the designated purpose of a building within the zone, you need to file a zoning request.
  • The map is the main entry point for using the business object. Each zoning request is evaluated by comparing the location of the request to the zone it falls in. This can only be done with the geometry of the zone.
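If the geometry is to serve as a business key in practice (for example as the basis for a hub key in Data Vault), it helps to reduce it to a deterministic, hashable form, since geometry types can't be compared or joined like scalar keys. A minimal sketch in T-SQL; the table name 'zoning_zones' is hypothetical, the column names are taken from the attribute listing above:

```sql
-- Hedged sketch (T-SQL): derive a compact, comparable surrogate from the
-- geometry. STAsBinary() gives a deterministic binary form, which
-- HASHBYTES turns into a fixed-size value suitable for key comparisons.
SELECT identificatie
     , geometrie.STAsText()                          AS geometrie_wkt
     , HASHBYTES('SHA2_256', geometrie.STAsBinary()) AS geometrie_hashkey
  FROM zoning_zones;
```

Note that on SQL Server versions before 2016, HASHBYTES input is limited to 8000 bytes, which large polygons can exceed.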

 

Example #2: Watercourse signs 


Along the watercourses in our area, bridges are being fitted with name signs. Each sign contains the name of the watercourse that is crossed. Below is a picture of a sign on a bridge crossing the river Dommel.

Name sign for river Dommel


Of course, a single watercourse can be crossed many times by different bridges. For example, there are currently 53 locations where one or more signs with the description 'Dommel' are fitted. So the signs themselves are not unique; in fact, they're manufactured in bulk.

All signs are managed in a table in our GIS database, which looks like this. Most columns in the table contain purely descriptive data, which is not very suitable as a business key. Again, two candidate business keys are present: the fields 'PNT' and 'GEOMETRY'.

Table WATERCOURSE_NAMESIGN
The first candidate is the field named 'PNT'. It is clearly a unique key but it lacks business meaning. The fact that its values are equal to the field 'OBJECTID' (which is a sequence populated by the GIS-software) indicates that it is a technical key at best.

Watercourse signs on the map


The second candidate 'GEOMETRY' is, as the name suggests, a spatial column holding the point location for each sign location. Again, let's check if it conforms to the definition elements of a business key:
  • Each sign location is unique as there are no two bridges on the same location. 
  • The location undoubtedly exists in the real world as you can physically touch the objects on their location. 
  • And it is definitely used by the business users to refer and to navigate to the signs.
Positioning each sign's location on the map gives it its specific meaning. When the signs at a location are unscrewed from the bridge railing and put away in a storage locker, the record ceases to exist.
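The uniqueness criterion can be verified directly in the database. A sketch (hedged; it assumes the table layout described above): since SQL Server can't group on a geometry column itself, we group on a hash of its well-known binary representation instead.

```sql
-- Find sign locations that occur more than once; this should return no
-- rows if GEOMETRY is a valid business key.
SELECT HASHBYTES('SHA2_256', [GEOMETRY].STAsBinary()) AS location_hash
     , COUNT(*)                                       AS occurrences
  FROM WATERCOURSE_NAMESIGN
 GROUP BY HASHBYTES('SHA2_256', [GEOMETRY].STAsBinary())
HAVING COUNT(*) > 1;
```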

In conclusion


So, can a spatial column be the business key for a table? I think it can, and in the case of geospatial data sources, it will prove to be quite common. Of course, this has implications for the way information with a spatial business key is handled in methods like Data Vault. That is a topic for a future post.

Friday, 25 November 2016

How spatial links form new business entities

In a previous post I showed how links can be created based on the spatial relationship between two business entities. In most cases, a spatial link is just a link like any other, defining a relationship between two geographical objects and answering questions like "how far apart are these two objects?" or "how many objects are in the vicinity of this location?"

In some cases, however, a spatial link might result in the creation of new business information. Let's consider the following example. From time to time, heavy rainfall causes parcels near watercourses to flood. These floods damage the crops that farmers grow on these parcels. In order to compensate the farmers, we determine which areas are flooded, how big they are, and which crops were grown. In business terms, we have two entities of data:

  1. A collection of parcels, including information about ownership and crop growth.
  2. A flooded area that is either digitized or modeled, containing information about the date, time and nature of the flood.


Both entities have a spatial attribute holding their location; in both cases these are polygons. On a map, they look like this.

On the left, the parcels. On the right, the flooded area.

 
Data model for floods
The source data model is shown above, and is actually quite simple. We just have two source tables, each with some attributes and both with a geometry column. They are spatially connected based on the geometries being at the same spot on the earth's surface.

Data vault model for floods
Modeling this source data with the data vault approach gives us the following data vault model. Again, it's a very simple model. The two source tables result in two hubs that are spatially linked (see the previous post about spatial links here).

But in this particular case, the parcels and the flood are not just at the same location; they interact with each other. When we intersect the flood contour with the parcels, we get the part of each parcel that is actually flooded.


Flooded areas (in red)
In doing this, we're actually creating new business entities, each possessing its own unique spatial location and combination of attributes. As you can see above, the 3 red polygons form new business entities called 'flooded areas'. So, when overlaying two hub entities we're essentially creating new business information by intersecting their spatial attributes. Remember the 3 types of spatial references? The creation of new spatial business entities through spatial linking is mostly the result of the overlap-type spatial reference.
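In SQL, this overlay is a combination of a spatial join and an intersection. A hedged sketch of how the red polygons could be derived in T-SQL, with hypothetical, simplified table and column names:

```sql
-- Sketch: derive the new 'flooded area' entities.
-- STIntersects filters to parcels the flood actually touches;
-- STIntersection computes the overlapping part of each parcel.
SELECT p.parcel_id
     , f.flood_id
     , p.geom.STIntersection(f.geom) AS intersection_geometry
  FROM parcels p
  JOIN floods  f
    ON p.geom.STIntersects(f.geom) = 1;
```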

These new business entities can be modeled as a link with a satellite, or as a new hub with a satellite attached to the existing link. There's currently some debate about the validity of the link satellite that centers on exactly this question; I'm not going to weigh in on that debate here. Either way, the resulting geometry of the intersection (the red polygons) is stored as a new geometry in the satellite of either the link or the newly created hub. This new geometry is called INTERSECTION_GEOMETRY. Below, I've modeled both scenarios to give you an idea of what this looks like, so you can take your pick.

New business entity modeled as link satellite

New business entity modeled as a hub

Comments or questions? Please let me know.



Monday, 18 July 2016

"There is no geodata"

Last week I had a conversation with some colleagues about a project. While we were talking about the inventory of all attributes for a particular object across all systems, one colleague said: "Our main focus is limited to the geodata."

At this point, my reply was: "But there is no 'geodata'!"

Everyone was shocked. "Then what have we been doing the past year?", they replied. I explained that doing a data inventory is always a good thing, but if you limit yourself to a partial set, your inventory will be incomplete. The discussion ended quickly after this, but the subject kept nagging me. So, time for a post.


"80% of all data has a spatial component." Every geographer has heard this quote at some point in their career. It has been cited numerous times, and the big GIS companies use it in their marketing to sell geospatial products. The validity of the statistic has been questioned repeatedly (e.g. here and here). It might or might not be true, but that's not the point. The quote obscures the simple fact that, even in a dataset that features geographic data, most of the data is non-spatial.

Instead, location is just one attribute of an object.

Just think about it. Every geospatial data table is comprised of columns, ranging from a few to several hundreds. Only one column in the table contains the location of the objects. All other columns contain descriptive data about the objects. They're filled with numbers, strings, hyperlinks and binaries. Data that is clearly non-spatial.

In GIS, data is used and presented in the form of maps. Most datasets with that spatial component are locked up in specific geospatial databases, accessible only through GIS software operated by trained geospatial specialists. GIS has staked out its own domain in the IT landscape, only loosely coupled with mainstream IT. More often than not, there is a lack of mutual understanding between GIS and IT people. I suspect this has something to do with the fact that GIS emerged from the domain of environmental land use management, rather than from the IT domain.

Because of this, mainstream IT concepts, methodologies and technologies tend to be underused in the GIS domain. Which is a pity, because these can provide so much more value to spatial data. On the other hand, in mainstream data solutions the spatial potential of data is underused as well. The main reason for me to start this blog was to try and connect spatial data to BI concepts and methodologies.

I have always felt, and still feel, very strongly that GIS should reconnect with its IT origins. We can start by no longer isolating the datasets we call 'geodata' and acknowledging that spatial data is part of something bigger. Our spatial data must be integrated into our mainstream IT solutions, applications and databases. After all, spatial data is data too. Who's with me?

Friday, 17 June 2016

Creating a spatial link

In my previous post I described the way spatial data is handled in Data Vault modeling using a modeling tool that is based on a standard 3NF-ruleset. This ruleset evaluates the data source tables to determine whether a source table has to be modeled as a hub or as a link. The tables that contain primary-foreign key constraints are modeled as links because they represent a relationship, and the tables that only contain a primary key are modeled as hubs.

Inevitably, the resulting raw data vault contains only a set of hubs and satellites. After all, our geospatial data source model doesn't contain any PK-FK constraints. All geospatial relationships in the database are implicit, remember? They can be constructed on the fly. So our source model does contain relationship data between the entities; however, these relationships can't be automatically generated. There are several different types of spatial relations, categorized as proximity, adjacency and overlap relations, and there's no way to automatically tell which spatial relationship type is relevant for any given combination of two (or more) entities. This leads to the following principle:

"The business context of spatial relationships defines the nature of spatial data vault links."
The implication of this principle is that data vault links based on spatial relationship must be part of a business data vault layer in your data warehouse architecture. And they must be manually constructed. This is not necessarily a bad thing. It will force you to analyze your business processes to determine which spatial relationships are worth capturing into your data warehouse.

It might be possible to construct some form of DV ruleset to automatically generate all spatial links between source tables, similar to the way tools like Quipu use the relational source model to generate a raw data vault. This would be a very bad idea. Why? Because, in a spatial source system, each table containing geography is technically linked to every other table by some spatial relation. Generating a raw spatial data vault would result in numerous links that are mostly irrelevant. So stick with the business data vault approach as far as spatial links are concerned.

So why is it worth the trouble to model spatial relationships as links in your data vault? You might argue that it is possible to model the spatial relationship downstream, in your data marts, or maybe even in real time in the end user interface. After all, the spatial information for each object is nicely retained in the raw data vault. The answer is simple: spatial analysis requires a fair amount of processing time and can put a real strain on the application if done in real time. Having the spatial relationship already available in a link can save a lot of time on the user side, and it enables you to use the relationship even if the chosen end user interface doesn't support spatial capabilities. It really comes down to a design choice.
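Whichever design you choose, the cost of the spatial joins themselves can be reduced with a spatial index. A hedged sketch for the incidents staging table used in the example below; the bounding box values are illustrative (roughly the Dutch RD coordinate range), not prescriptive, and the table needs a clustered primary key before a spatial index can be created on it:

```sql
-- Sketch (T-SQL): a spatial index to speed up the spatial joins used to
-- build the links. For the geometry type a bounding box is mandatory;
-- tune it to the extent of your own data.
CREATE SPATIAL INDEX six_incidents_ligging
    ON sta_spatial.incidents (ligging)
 USING GEOMETRY_GRID
  WITH (BOUNDING_BOX = (xmin = 0, ymin = 300000, xmax = 280000, ymax = 625000));
```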

Example: linking incidents to maintenance areas


Let's take another look at the spatial raw data vault that I modeled in my last post. It consisted of a collection of 4 hubs and 4 corresponding satellites. There were no links in the data model. However, all 4 hub satellites contain a geographical attribute. It makes sense to define some form of relationship between the entities in the raw data vault based on these attributes. In this example, we decide to only model the relationship between incidents and maintenance areas.

Part of the raw data vault model with entities visualized on a map

As we can see above, each incident falls inside one (and only one) maintenance area at a specific time. Over time, maintenance areas will change and new incidents will be reported. The location of an incident is fixed once assigned, but incidents can be deleted from the source. The relation between incidents and maintenance areas gives us useful information about the supervisors' responsibilities for incident management. The relationship link and satellite are added to the data vault model created in the previous post, linking the incidents hub and the maintenance areas hub together.




The link and satellite are constructed with DDL scripting. Both get a primary key constraint, plus an additional index to support the loading process.

CREATE TABLE edw.bdv.incident_in_area_l
( incident_in_area_lid INT NOT NULL IDENTITY(1,1)
, incidents_hid INT NOT NULL
, maintenance_areas_hid INT NOT NULL
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk__incident_in_area_l
  PRIMARY KEY( incident_in_area_lid
        )
);

CREATE UNIQUE INDEX ux__incident_in_area_l ON edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
);

CREATE TABLE edw.bdv.incident_in_area_l_s
( incident_in_area_lid INT NOT NULL
, load_date DATETIME NOT NULL
, load_date_end DATETIME DEFAULT '9999-12-31'
, voided BIT
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, audit_spatial_relation VARCHAR(80)
, CONSTRAINT pk__incident_in_area_l_s
  PRIMARY KEY( incident_in_area_lid
        , load_date
        )
);

CREATE INDEX pd__incident_in_area_l_s ON edw.bdv.incident_in_area_l_s
( load_date
, load_date_end
);


The link model is fairly straightforward and complies with standard data vault modeling rules. You might notice that the satellite contains very little context information: the link is based solely on the spatial relation and doesn't have any other context in the source model. One crucial addition to the data vault model is the extra field AUDIT_SPATIAL_RELATION, which stores the specific type of spatial relationship that forms the basis for the link. This is necessary because geographical objects can have many different types of spatial relationships: objects can touch, partially or completely overlap, contain or be near each other. The nature of the spatial relationship is therefore business context, and has to be stored as such, in the link satellite.

Next, we need to create the ETL necessary to populate the link with the appropriate data. This is done in a two-step approach:
  1. Join the incidents and maintenance areas tables in the staging area with a spatial WHERE-clause;
  2. Join the incidents and maintenance hubs from the raw data vault based on the business key.
This gives us the following query.

------ insert new business keys data into 'incident_in_area_l'
INSERT INTO edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
, audit_timestamp
, audit_recordsource
)
SELECT DISTINCT h0.incidents_hid
              , h1.maintenance_areas_hid
              , CURRENT_TIMESTAMP as audit_timestamp
              , 'edw.sta_spatial.incidents' as audit_recordsource
    FROM  sta_spatial.incidents stg0
    JOIN edw.sta_spatial.maintenance_areas stg1 ON (stg0.ligging.STWithin(stg1.ligging) = 1)
    LEFT JOIN sdv_spatial.incidents_h h0 ON stg0.MELDINGID = h0.MELDINGID
    LEFT JOIN sdv_spatial.maintenance_areas_h h1 ON stg1.GAG_ID = h1.GAG_ID
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l l
                    WHERE (h0.incidents_hid = l.incidents_hid OR (h0.incidents_hid IS NULL AND l.incidents_hid IS NULL))
                      AND (h1.maintenance_areas_hid = l.maintenance_areas_hid OR (h1.maintenance_areas_hid IS NULL AND l.maintenance_areas_hid IS NULL))
                  )
;


The spatial WHERE-clause in this example uses the MS SQL Server specific STWithin method, which identifies all features from one table that lie within the features of a second table. Conceptually this is a cartesian join filtered by the spatial predicate. Because maintenance areas don't overlap, each incident point will be located in only one area; hence each incident is joined to exactly one maintenance area, resulting in the desired link.

Finally, we need to create the proper ETL to fill and maintain the link satellite. It consists of three queries: insert context for new links, insert context for deleted links, and end-date the deleted links. These queries are pretty straightforward, apart from the fact that the link isn't based on one source table but on the implicit relationship. Notable in the first query is the insertion of the value 'inside' into the field AUDIT_SPATIAL_RELATION, as explained above.

-- insert new business keys data from 'incident_in_area_l' into 'incident_in_area_l_s'
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP  -- there is no staging for this link, so load date equals audit timestamp
     , 0
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
 FROM edw.bdv.incident_in_area_l l
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l_s dv
                    WHERE (l.incident_in_area_lid = incident_in_area_lid OR (l.incident_in_area_lid IS NULL AND incident_in_area_lid IS NULL))
                      AND (dv.voided = 0 OR dv.voided IS NULL)
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.bdv.incident_in_area_l_s
                                        WHERE incident_in_area_lid = dv.incident_in_area_lid
                                        and voided = 0
                                      )
                  )
;


-- insert deleted business keys data into 'incident_in_area_l_s
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP
     , 1
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
  FROM edw.bdv.incident_in_area_l_s dv
 WHERE load_date = (SELECT MAX(load_date)
                      FROM edw.bdv.incident_in_area_l_s
                     WHERE incident_in_area_lid = dv.incident_in_area_lid
                   )
   AND dv.voided = 0
   AND NOT EXISTS (SELECT 1
                   FROM edw.bdv.incident_in_area_l l
                   LEFT JOIN sdv_spatial.incidents_h_s t0s
                   ON ((t0s.incidents_hid = l.incidents_hid) and (t0s.voided = 0) and (t0s.load_date_end = '9999-12-31 00:00:00.000'))
                   LEFT JOIN edw.sdv_spatial.maintenance_areas_h_s t1s
                   ON ((t1s.maintenance_areas_hid = l.maintenance_areas_hid) and (t1s.voided = 0) and (t1s.load_date_end = '9999-12-31 00:00:00.000'))
                   WHERE (t0s.ligging.STWithin(t1s.ligging) = 1)
                   AND l.incident_in_area_lid = dv.incident_in_area_lid
                  )
;


-- Satellite end dating in 'incident_in_area_l_s'
UPDATE edw.bdv.incident_in_area_l_s
   SET load_date_end = (SELECT MIN(s.load_date)
                          FROM edw.bdv.incident_in_area_l_s s
                         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
                           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
                       )
 WHERE load_date_end = '9999-12-31' -- in ISO 8601 format
   AND (SELECT MIN(s.load_date_end)
          FROM edw.bdv.incident_in_area_l_s s
         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
       ) IS NOT NULL
;


All ETL shown above is built by copying a typical generated link-satellite query set from the raw data vault ETL and manually altering it to accommodate the spatial functionality needed to construct the spatial link. When we run the queries and check the result, we get the following.
Selecting the link and satellite


Subsequently, if an incident has been deleted in the source, the change in the link relationship is detected by the ETL. The link is properly end-dated in the satellite and the newly inserted satellite record is marked as voided, so it is clear the link is no longer valid.

Link between incident and maintenance area; the incident has been deleted in the source.

Conclusion


It is possible to model a data vault link based on a spatial relationship in the source. With some minor adjustments, standard ETL can be modified to create the SQL code needed to populate the link and satellite. All spatial links must be based on business context to avoid generation of useless links.

Tuesday, 17 May 2016

Loading spatial data in a raw data vault

In this post I'm showing an example of spatial data in combination with the Data Vault methodology, based on the philosophy implemented at our company. There are several philosophies on how to model your Enterprise Data Warehouse using Data Vault (DV), and each has its merits and drawbacks. This blog isn't meant to discuss those philosophies; there are far more suitable blogs for that. Instead, we're going to focus on one specific question: how to incorporate spatial data in the Data Vault method. Data Vault consists of three main components: hubs, links and satellites. Hubs contain the business entities, links contain the relationships between business entities, and satellites contain the context of hubs and links.

So, how does Data Vault handle spatial data? The geometry attributes of an object provide context about the location of the object in the real world. Most RDBMSs support spatial information as a specific object type, so at this point it is safe to assume that geometry will find its place in the data vault. But, as we've seen in a previous post, geometry also provides spatial reference to other objects; in other words, it forms relationships between objects. Will this affect the way DV handles spatial data?

The example


In our architecture we are using a raw data vault that serves as a technical copy of the source systems. Prior to loading the raw data vault, source systems are copied to a staging area. After loading, data is processed into several data marts that are accessed by end users through reporting, analysis (OLAP) and, in the near future, dashboard solutions.

Our Enterprise BI architecture

As I said, it is one of several philosophies. This is a fairly technical approach, with the benefit that a lot of work building the EDW can be automated. The routines needed for loading the staging areas and the raw data vault are generated with the use of the DV modeling tool Quipu (which was open source at the time we started building our EDW). The manual labor required to transform the data for end user purposes (the big T) is offloaded to the data mart phase.

#1: Import the source system


For this demo, I've created a new example source system with 4 spatial tables selected from our company geodatabase, containing incidents, maintenance areas, and the border of our service area both as a line and as a polygon (in two separate tables). All 4 tables contain a GEOMETRY type column, named either LIGGING or GEOM.

Spatial source system model example


The source system model is imported into Quipu. The first thing we notice is that Quipu doesn't recognize geometry type columns; we need to edit the column type and native type manually.

Source model in Quipu, with geometry type manually altered

 

 #2: Loading the staging area


In the next step, we create the staging area, which is a copy of the source with some added metadata fields. The appropriate staging area DDL and ETL is generated by Quipu. Let's take a look at the DDL code for the maintenance areas table below.

CREATE TABLE edw.sta_spatial.maintenance_areas
( GAG_ID bigint
, LIGGING_WKT varchar(MAX)
, AREA_NAME varchar(50)
, SUPERVISOR_NAME varchar(50)
, SRID int
, ligging geometry
, stg_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
, stg_sequence INT DEFAULT 1
, voided BIT DEFAULT 0
, audit_id INT
);


As we can see, the DDL is generated with the correct geometry type field after we manually altered it in the previous step. With this DDL, we can create the proper staging area tables without any further edits.

In this process, Quipu also generated the ETL needed to fill the staging area. In our case, this ETL consists of Pentaho Data Integration (PDI) code. As explained in an earlier post, PDI can support geometry type fields; unfortunately, it only does so in combination with Oracle and PostgreSQL. Since our EDW is built on the MS SQL Server platform, we needed a workaround. We solved this by inserting the geometry into the staging area in well-known text (WKT) format (stored in the field LIGGING_WKT) and converting it to a geometry with an INSTEAD OF trigger, which allows us to fill the staging area properly.

CREATE TRIGGER [dbo].[maintenance_areas_trg] ON [dbo].[maintenance_areas]
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO maintenance_areas (
       [GAG_ID]
      ,[LIGGING_WKT]
      ,[AREA_NAME]
      ,[SUPERVISOR_NAME]
      ,[SRID]
      ,[ligging]
    ) SELECT
       [GAG_ID]
      ,[LIGGING_WKT]
      ,[AREA_NAME]
      ,[SUPERVISOR_NAME]
      ,[SRID]
      ,geometry::STGeomFromText ( LIGGING_WKT , 28992 )
     FROM
        INSERTED
END


The insertion of the field 'LIGGING' is bypassed by the trigger. In the code above, we can see that the field is filled by converting text to actual geometry using the MS SQL Server method STGeomFromText.
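As a minimal illustration of the conversion the trigger performs (the coordinate pair below is made up for the example, not taken from the actual dataset):

```sql
-- Convert a WKT string plus SRID into a geometry instance
-- (SRID 28992 = the Dutch RD New coordinate system, as used in the trigger)
SELECT geometry::STGeomFromText('POINT (155000 463000)', 28992) AS ligging;
```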

This is not the only possible solution. There are several other options, ranging from using another ETL tool to fill the staging area to changing the EDW database platform. For now, it suffices to note that filling the staging area with spatial data is possible.

#3: Designing the raw data vault


The next step in this example consists of modeling the raw data vault. Our modeling software Quipu automatically generates a data vault style model based on the technical properties of the staging area. It also allows us to manually identify the business key in staging tables that lack a primary key in the source model.





Generating the spatial raw data vault example

After analyzing our source model (shown above in paragraph #1), Quipu generates a data vault model consisting of 4 hubs and corresponding satellites. The geometry type fields (highlighted in green) have become part of the satellites, which is logical as they are identified as context properties.
Spatial raw data vault model based on example source system


Notice the absence of any links in this model? As I explained in previous posts (here and here), geographical databases rely heavily on implicit spatial relationships to tie data together. However, a logical rule engine generating a raw data vault will ignore these spatial relationships. As far as a raw data vault is concerned, this is correct behavior. As a consequence, geographical data sources with 100+ tables will inevitably be generated into raw data vaults with 100+ hubs and satellites and virtually no links.

After generating the model, we can continue with generating the DDL and ETL necessary to create the model in our EDW environment. The DDL below shows the generation of the maintenance areas hub and satellite, including the geometry field in the satellite. The generated DDL is flawless because we altered the type and native type column for the geometry field named 'LIGGING' manually in step #1.

CREATE TABLE edw.sdv_spatial.maintenance_areas_h
( maintenance_areas_hid INT NOT NULL IDENTITY(1,1)
, GAG_ID bigint
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk_maintenance_areas_h
  PRIMARY KEY( maintenance_areas_hid
        )
);

CREATE UNIQUE INDEX bk_maintenance_areas_h ON edw.sdv_spatial.maintenance_areas_h
( GAG_ID
);

CREATE TABLE edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid INT NOT NULL
, load_date DATETIME NOT NULL
, load_date_end DATETIME DEFAULT '9999-12-31'
, LIGGING_WKT varchar(MAX)
, AREA_NAME varchar(50)
, SUPERVISOR_NAME varchar(50)
, SRID int
, ligging geometry
, voided BIT
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk_maintenance_areas_h_s
  PRIMARY KEY( maintenance_areas_hid
        , load_date
        )
);

CREATE INDEX pd_maintenance_areas_h_s ON edw.sdv_spatial.maintenance_areas_h_s
( load_date
, load_date_end
);


We also generate the ETL steps needed to make the data vault run. These steps compare new data loaded into the staging area with the data already present in the raw data vault, identifying changes and storing them in the data vault.

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h'
INSERT INTO edw.sdv_spatial.maintenance_areas_h
( GAG_ID
, audit_timestamp
, audit_recordsource
)
SELECT DISTINCT GAG_ID
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM edw.sta_spatial.maintenance_areas stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h
                    WHERE (stg.GAG_ID = GAG_ID OR (stg.GAG_ID IS NULL AND GAG_ID IS NULL))
                  )
;

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h_s'
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , stg_timestamp
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , voided
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM ( SELECT (SELECT MAX(maintenance_areas_hid)
                   FROM edw.sdv_spatial.maintenance_areas_h
                  WHERE (src.GAG_ID = GAG_ID OR (src.GAG_ID IS NULL AND GAG_ID IS NULL))
                ) as maintenance_areas_hid
              , stg_timestamp
              , LIGGING_WKT
              , AREA_NAME
              , SUPERVISOR_NAME
              , SRID
              , ligging
              , voided
           FROM edw.sta_spatial.maintenance_areas src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h_s dv
                    WHERE (stg.maintenance_areas_hid = maintenance_areas_hid OR (stg.maintenance_areas_hid IS NULL AND maintenance_areas_hid IS NULL))
                      AND (stg.LIGGING_WKT = LIGGING_WKT OR (stg.LIGGING_WKT IS NULL AND LIGGING_WKT IS NULL))
                      AND (stg.AREA_NAME = AREA_NAME OR (stg.AREA_NAME IS NULL AND AREA_NAME IS NULL))
                      AND (stg.SUPERVISOR_NAME = SUPERVISOR_NAME OR (stg.SUPERVISOR_NAME IS NULL AND SUPERVISOR_NAME IS NULL))
                      AND (stg.SRID = SRID OR (stg.SRID IS NULL AND SRID IS NULL))
                      AND ((stg.ligging.STEquals(ligging) = 1) OR (stg.ligging IS NULL AND ligging IS NULL))
                      AND (stg.voided = voided OR (stg.voided IS NULL AND voided IS NULL))
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.sdv_spatial.maintenance_areas_h_s
                                        WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                                      )
                  )
;

-- insert deleted business keys data from 'maintenance_areas' into 'maintenance_areas_h_s'
-- entries in the edw.dbo.edw_load with is_full_load = 1 are treated as full loads!
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , COALESCE(ld.load_timestamp,CURRENT_TIMESTAMP)
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , 1
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM edw.sdv_spatial.maintenance_areas_h_s dv
 INNER
  JOIN edw.dbo.edw_load ld
    ON ld.table_name = 'edw.sta_spatial.maintenance_areas'
   AND ld.is_full_load = 1
 WHERE load_date = (SELECT MAX(load_date)
                      FROM edw.sdv_spatial.maintenance_areas_h_s
                     WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                   )
   AND dv.voided = 0
   AND NOT EXISTS (SELECT 1
                     FROM edw.sta_spatial.maintenance_areas src
                     LEFT OUTER HASH
                     JOIN edw.sdv_spatial.maintenance_areas_h t1821432
                       ON (t1821432.GAG_ID = src.GAG_ID OR (t1821432.GAG_ID IS NULL AND src.GAG_ID IS NULL))
                    WHERE t1821432.maintenance_areas_hid = dv.maintenance_areas_hid
                  )
;

-- Satellite end dating in 'maintenance_areas_h_s'
UPDATE edw.sdv_spatial.maintenance_areas_h_s
   SET load_date_end = (SELECT MIN(s.load_date)
                          FROM edw.sdv_spatial.maintenance_areas_h_s s
                         WHERE s.maintenance_areas_hid = edw.sdv_spatial.maintenance_areas_h_s.maintenance_areas_hid
                           AND s.load_date > edw.sdv_spatial.maintenance_areas_h_s.load_date
                       )
 WHERE load_date_end = '9999-12-31' -- in ISO 8601 format
   AND (SELECT MIN(s.load_date_end)
          FROM edw.sdv_spatial.maintenance_areas_h_s s
         WHERE s.maintenance_areas_hid = edw.sdv_spatial.maintenance_areas_h_s.maintenance_areas_hid
           AND s.load_date > edw.sdv_spatial.maintenance_areas_h_s.load_date
       ) IS NOT NULL
;


This code needs one small but crucial alteration for each generated satellite. The code that inserts new data into the satellite compares the data already present with the data newly loaded into the staging area, using an '=' comparison. This operator won't work for geometry type fields. Instead, the database platform offers specific operators for handling spatial data. Let's take a closer look at the satellite insert statement.

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h_s'
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , stg_timestamp
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , voided
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM ( SELECT (SELECT MAX(maintenance_areas_hid)
                   FROM edw.sdv_spatial.maintenance_areas_h
                  WHERE (src.GAG_ID = GAG_ID OR (src.GAG_ID IS NULL AND GAG_ID IS NULL))
                ) as maintenance_areas_hid
              , stg_timestamp
              , LIGGING_WKT
              , AREA_NAME
              , SUPERVISOR_NAME
              , SRID
              , ligging
              , voided
           FROM edw.sta_spatial.maintenance_areas src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h_s dv
                    WHERE (stg.maintenance_areas_hid = maintenance_areas_hid OR (stg.maintenance_areas_hid IS NULL AND maintenance_areas_hid IS NULL))
                      AND (stg.LIGGING_WKT = LIGGING_WKT OR (stg.LIGGING_WKT IS NULL AND LIGGING_WKT IS NULL))
                      AND (stg.AREA_NAME = AREA_NAME OR (stg.AREA_NAME IS NULL AND AREA_NAME IS NULL))
                      AND (stg.SUPERVISOR_NAME = SUPERVISOR_NAME OR (stg.SUPERVISOR_NAME IS NULL AND SUPERVISOR_NAME IS NULL))
                      AND (stg.SRID = SRID OR (stg.SRID IS NULL AND SRID IS NULL))
                      AND ((stg.ligging.STEquals(ligging) = 1) OR (stg.ligging IS NULL AND ligging IS NULL))
                      AND (stg.voided = voided OR (stg.voided IS NULL AND voided IS NULL))
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.sdv_spatial.maintenance_areas_h_s
                                        WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                                      )
                  )
;


Highlighted in green we see the part where the newly inserted geometries are compared to the geometry data already present in our data vault. In the Quipu-generated statement, they were simply compared using the '=' operator, which won't work. So we alter the statement slightly, using the MS SQL Server specific method STEquals (other RDBMSs offer the same functionality with different syntax). This method compares the actual spatial data stored in the two geometry fields. This small edit needs to be made for each generated satellite ETL; in our example, this means altering four SQL statements. After this edit, our generated ETL works as expected, identifying all altered objects in our staging area.
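For comparison, on a PostgreSQL/PostGIS platform the same predicate would use the ST_Equals function. This is an illustrative fragment of the WHERE clause only, not code generated by Quipu:

```sql
-- PostGIS equivalent of the geometry comparison predicate
AND ( ST_Equals(stg.ligging, dv.ligging)
   OR (stg.ligging IS NULL AND dv.ligging IS NULL) )
```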

Conclusion


When integrating spatial data in a raw data vault style EDW, there are two important questions to consider:
  • Does my ETL tooling support spatial object types?
  • What alterations do I have to make to my code?
This example shows us that, with some minor alterations to our code, we can easily adapt a raw data vault to support storing and tracking spatial attribute information. In the next post, we will take a look at spatial in the business data vault.

maandag 18 april 2016

Using spatial relationships in ETL

The case

In this post we're going to take a look at spatial ETL. A lot of BI work traditionally consists of getting our data from A to B and transforming it in the process. So what if our source data contains spatial information?

For example, we have two tables in our source database. One contains 9 maintenance areas, each assigned to a particular supervisor. The other contains over 3,200 incidents that have been reported in the area. Both tables are stored in an Oracle 11g database, contain a spatial column and can be visualised with GIS software to show where they're situated.
All incidents

9 maintenance areas, colored by unique area


Now imagine we want to determine the designated supervisor for each incident, to report the progress of incident management on a dashboard. No primary key-foreign key construction exists between the two tables. There is no way to link these two objects together other than by their spatial relationship.

Common ETL tools usually don't know how to treat spatial data columns. But with the right combination of functionality, you can certainly do spatial ETL. In this example, I will be using Pentaho Data Integration 6.0 (PDI) and the additional AtolCD Pentaho GIS plugin. The plugin adds two specific improvements to regular Pentaho:
  1. The ability to treat spatial columns as column type 'geometry' within the ETL.
  2. A set of transformation steps that enhance Pentaho with geospatial operations.

The solution path

I've created a simple Pentaho transformation to combine the incidents and maintenance areas. As you will notice, the transformation looks like any other, with the exception of a couple of geospatial steps that are explained further on. All steps in the transformation shown below are referred to in the text in double quotes, starting with a capital (e.g. "Table incidents").

PDI transformation with added geospatial steps.

We see that both tables are loaded with two standard table input steps, "Table incidents" and "Table maintenance areas", resulting in two data streams. When we examine the output of these steps, we see the following data.

Incidents table preview

The incident table contains about 3,200 incidents that each have a unique ID and an incident location. The maintenance area table contains 9 maintenance areas that all have a name, a supervisor name and an area location.

Maintenance areas table preview

Notice how the columns INCIDENT_LOCATION and MAINTENANCE_AREA_LOCATION contain some kind of coordinate definition? We can clearly see that one contains points and the other polygons. The added geospatial functionality enables PDI to show the contents of the geospatial columns as readable text. We can see the field types when examining the output fields for the two table input steps; the location fields are marked as type 'Geometry'.

Output fields for the incidents stream

Output fields for the maintenance area stream

The text "SRID=28992" tells us that the spatial data is in a Dutch coordinate system. It is important that the data from both geometry columns is in the same coordinate system. Otherwise it isn't possible to relate the two columns. We verify the coordinate system with a "Geometry information" step for each datastream. This step is part of the additional steps added by the GIS plugin.
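Since the source tables live in Oracle 11g, the SRID can also be verified directly on the database. A sketch in Oracle syntax, using the table and column names from this example:

```sql
-- Inspect the SRID stored in the SDO_GEOMETRY objects;
-- if all geometries share the coordinate system, this returns one row (28992)
SELECT DISTINCT i.INCIDENT_LOCATION.SDO_SRID
  FROM incidents i;
```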

The next step is to join both data streams with a "Cartesian join" step. This step doesn't yet recognize the spatial relationship. It will simply join each record from one stream with each record from the other. For this example, this means that because we started with about 3,200 incidents and 9 maintenance areas, the resulting data stream will contain almost 29,000 records.
Cartesian join result




In the join result we can clearly see that incident ID no 1012 is joined to all 9 area names. But an incident can't be in all 9 maintenance areas at the same time. We wanted to identify the maintenance area for each incident, remember?

As a result, the datastream now contains two geometry columns. We need to filter the datastream based on the spatial relationship. For this, we use the step called "Determine spatial relationship", which is also part of the GIS plugin.
Define the spatial relationship between incidents and maintenance area.

In this step we can define the spatial relationship between the two geometry columns. The step supports many different spatial relationships for relating two objects by location. The one we're looking for is simply called 'Within', as we want to identify each incident that lies within a maintenance area. The spatial analysis settings shown above can be summarized as a select statement that keeps the records where 'INCIDENT_LOCATION is within MAINTENANCE_AREA_LOCATION' evaluates to TRUE. When this step is run, only those records remain in the resulting datastream.
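Outside PDI, the same spatial analysis could be expressed as a join. Purely as an illustration, in MS SQL Server geometry syntax (the actual processing here happens inside the PDI step, and the column names are those of this example):

```sql
-- Join each incident to the maintenance area it falls in,
-- keeping only the pairs where 'within' evaluates to true
SELECT i.INCIDENT_ID, m.AREA_NAME, m.SUPERVISOR_NAME
  FROM incidents i
  JOIN maintenance_areas m
    ON i.INCIDENT_LOCATION.STWithin(m.MAINTENANCE_AREA_LOCATION) = 1;
```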
Incidents within their maintenance area.

As the image above shows, the resulting datastream contains one record per incident ID with the corresponding maintenance area information. However, the datastream still contains two geometry columns. This is no problem for the database table we're about to create, but most common GIS software only supports one geometry type per data layer. So we make a selection of the output fields with the "Select values" step, ensuring that the resulting table only contains the incidents' point geometry, combined with ID, area name and supervisor name.
Select the necessary fields
After this, the final step is to write the resulting data to a new database table with the "Table output" step. (Another approach could be to add the maintenance area and supervisor name to the existing incidents table; in that scenario we might consider using an "Insert/update" step.) Of course, the receiving database must support geometry type database fields.
Output the table to the database
Now, when we visualize the resulting dataset with GIS software, we can see that each incident is nicely matched with its corresponding maintenance area by color.

Incidents colored by maintenance area

Conclusion

Spatial relationships allow us to define relationships between objects that have no other relation to each other than the locational one. Extending standard ETL tools with geospatial capabilities is a very cool way to make use of the specific advantage of geospatial data.

Granted, there are some caveats in the use of spatial properties in ETL tooling.
  • The AtolCD Pentaho GIS plugin used for PDI supports Oracle Spatial and PostgreSQL/PostGIS geometry (and also MySQL, though I haven't tested that), but lacks support for MS SQL Server geometry.
  • Geospatial data can get quite large. When substantially larger datasets, such as cadastral parcel information (400,000+ records), are combined with a Cartesian join, the datastream will contain billions of records. This might hamper performance, but I haven't tested that yet.
  • I have yet to try this in other ETL tools like SSIS or Talend.
Still, I'm confident these issues will eventually get resolved.