vrijdag 17 juni 2016

Creating a spatial link

In my previous post I described the way spatial data is handled in Data Vault modeling using a modeling tool that is based on a standard 3NF-ruleset. This ruleset evaluates the data source tables to determine whether a source table has to be modeled as a hub or as a link. The tables that contain primary-foreign key constraints are modeled as links because they represent a relationship, and the tables that only contain a primary key are modeled as hubs.

Inevitably, the resulting raw data vault contains only a set of hubs and satellites. After all, our geospatial data source model doesn't contain any PK-FK constraints. All geospatial relationships in the database are implicit, remember? They can be constructed on the fly. So this means that our source model dóes contain relationship data between the entities. However, these relationships can't be automatically generated. There are several different types of spatial relations, categorized in proximity, adjacency and overlap relations. There's no way to automatically tell which spatial relationship type is relevant for any combination of two (or more) entities. This leads to the following principle:

"The business context of spatial relationships defines the nature of spatial data vault links."
The implication of this principle is that data vault links based on spatial relationship must be part of a business data vault layer in your data warehouse architecture. And they must be manually constructed. This is not necessarily a bad thing. It will force you to analyze your business processes to determine which spatial relationships are worth capturing into your data warehouse.

It might be possible to construct some form of DV-ruleset to automatically generate all spatial links between source tables, similar to the way tools like Quipu use the relational source model to generate a raw data vault. This would be a very bad idea. Why? Because, in a spatial source system, each table containing geography is technically linked to each other table by their spatial relation. Generating a raw spatial data vault would result in numerous links that are probably mostly irrelevant. So stick with the business data vault approach as far as spatial links are concerned.

So why is it worth the trouble to model spatial relationships as links in your data vault? You might argue that it is possible to model the spatial relationship downstream, in your data marts, or maybe even real-time in the end user interface. After all, the spatial information for each object is nicely retained in the raw data vault. The answer is simple. Spatial analysis requires a fair amount of processing time. It can put a real strain on the application if done in real-time. Having the spatial relationship already available in a link can save a lot of time on the user side. And it can enable you to use the relationship if the chosen end user interface doesn't support spatial capabilities. It really comes down to a design choice.

Example: linking incidents to maintenance areas


Let's take another look at the spatial raw data vault that I modeled in my last post. It consisted of a collection of 4 hubs and 4 corresponding satellites. There were no links in the data model. However, all 4 hub satellites contain a geographical attribute. It makes sense to define some form of relationship between the entities in the raw data vault based on these attributes. In this example, we decide to only model the relationship between incidents and maintenance areas.

Part of the raw data vault model with entities visualized on a map

As we can see above, each incident will fall inside one (and only one) maintenance area at a specific time. Over time, maintenance areas will change and new incidents will be reported. The location of incidents are fixed once assigned, but incidents can be deleted from the source. The relation between incidents and maintenance areas gives us useful information about the supervisors' responsibilities towards incident management. The relationship link and satellite are added to the data vault model created in the previous post. It links the incidents hub and maintenance area hub together.




The link and satellite are constructed with the use of DDL scripting. Both link and satellite get an index on the primary key.

CREATE TABLE edw.bdv.incident_in_area_l
( incident_in_area_lid INT NOT NULL IDENTITY(1,1)
, incidents_hid INT NOT NULL
, maintenance_areas_hid INT NOT NULL
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk__incident_in_area_l
  PRIMARY KEY( incident_in_area_lid
        )
);

CREATE UNIQUE INDEX ux__incident_in_area_l ON edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
);

CREATE TABLE edw.bdv.incident_in_area_l_s
( incident_in_area_lid INT NOT NULL
, load_date DATETIME NOT NULL
, load_date_end DATETIME DEFAULT '9999-12-31'
, voided BIT
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, audit_spatial_relation VARCHAR(80)
, CONSTRAINT pk__incident_in_area_l_s
  PRIMARY KEY( incident_in_area_lid
        , load_date
        )
);

CREATE INDEX pd__incident_in_area_l_s ON edw.bdv.incident_in_area_l_s
( load_date
, load_date_end
);


The link model is fairly straightforward and complies with standard data vault modeling rules. You might notice that the satellite contains very little context information. The link is based solely on the spatial relation, and doesn't have any other context in the source model. One crucial addition to the data vault model is the extra field called AUDIT_SPATIAL_RELATION. In this field the specific type of spatial relationship that is the basis for the link is stored. This is necessary because geographical objects can have many different types of spatial relationships; objects can touch, partially or completely overlap, contain or be near eachother. The nature of the spatial relationship therefore is business context, and has to be stored as such, in the link satellite.

Next, we need to create the ETL necessary to populate the link with the appropriate data. This is done in a two-step approach:
  1. Join the incidents and maintenance tables in the staging area with a spatial WHERE-clause (highlighted in green below);
  2. Join the incidents and maintenance hubs from the raw data vault based on the business key.
This gives us the following query.

------ insert new business keys data into 'incident_in_area_l'
INSERT INTO edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
, audit_timestamp
, audit_recordsource
)
SELECT DISTINCT h0.incidents_hid
              , h1.maintenance_areas_hid
              , CURRENT_TIMESTAMP as audit_timestamp
              , 'edw.sta_spatial.incidents' as audit_recordsource
    FROM  sta_spatial.incidents stg0
    JOIN edw.sta_spatial.maintenance_areas stg1 ON (stg0.ligging.STWithin(stg1.ligging) = 1)
    LEFT JOIN sdv_spatial.incidents_h h0 ON stg0.MELDINGID = h0.MELDINGID
    LEFT JOIN sdv_spatial.maintenance_areas_h h1 ON stg1.GAG_ID = h1.GAG_ID
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l l
                    WHERE (h0.incidents_hid = l.incidents_hid OR (h0.incidents_hid IS NULL AND l.incidents_hid IS NULL))
                      AND (h1.maintenance_areas_hid = l.maintenance_areas_hid OR (h1.maintenance_areas_hid IS NULL AND l.maintenance_areas_hid IS NULL))
                  )
;


The spatial WHERE-clause in this example uses the MS SQL Server specific STWithin command, which identifies all features from one table that lay within the features from a second table. The result is a cartesian join. Because maintenance areas don't overlap, each incident point will be located in only one area. Hence each incident is joined to a specific maintenance area, resulting in the desired link.

Finally, we need to create the proper ETL to fill and maintain the link satellite. It consists of three queries, to insert context for new links, insert context for deleted links and end-date the deleted links. These queries are pretty straightforward, besides the fact that the link isn't based on one source table but rather on the implicit relationship. Noticeable in the first query is the insertion of the value 'inside' for the field AUDIT_SPATIAL_RELATION, as explained above and shown in green.

-- insert new business keys data from 'incident_in_area_l' into 'incident_in_area_l_s'
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP  -- there is no staging for this link, so load date equals audit timestamp
     , 0
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
 FROM edw.bdv.incident_in_area_l l
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l_s dv
                    WHERE (l.incident_in_area_lid = incident_in_area_lid OR (l.incident_in_area_lid IS NULL AND incident_in_area_lid IS NULL))
                      AND (dv.voided = 0 OR dv.voided IS NULL)
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.bdv.incident_in_area_l_s
                                        WHERE incident_in_area_lid = dv.incident_in_area_lid
                                        and voided = 0
                                      )
                  )
;


-- insert deleted business keys data into 'incident_in_area_l_s
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP
     , 1
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
  FROM edw.bdv.incident_in_area_l_s dv
 WHERE load_date = (SELECT MAX(load_date)
                      FROM edw.bdv.incident_in_area_l_s
                     WHERE incident_in_area_lid = dv.incident_in_area_lid
                   )
   AND dv.voided = 0
   AND NOT EXISTS (SELECT 1
                   FROM edw.bdv.incident_in_area_l l
                   LEFT JOIN sdv_spatial.incidents_h_s t0s
                   ON ((t0s.incidents_hid = l.incidents_hid) and (t0s.voided = 0) and (t0s.load_date_end = '9999-12-31 00:00:00.000'))
                   LEFT JOIN edw.sdv_spatial.maintenance_areas_h_s t1s
                   ON ((t1s.maintenance_areas_hid = l.maintenance_areas_hid) and (t1s.voided = 0) and (t1s.load_date_end = '9999-12-31 00:00:00.000'))
                   WHERE (t0s.ligging.STWithin(t1s.ligging) = 1)
                   AND l.incident_in_area_lid = dv.incident_in_area_lid
                  )
;


-- Satellite end dating in 'incident_in_area_l_s'
UPDATE edw.bdv.incident_in_area_l_s
   SET load_date_end = (SELECT MIN(s.load_date)
                          FROM edw.bdv.incident_in_area_l_s s
                         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
                           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
                       )
 WHERE load_date_end = '9999-12-31' -- in ISO 8601 format
   AND (SELECT MIN(s.load_date_end)
          FROM edw.bdv.incident_in_area_l_s s
         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
       ) IS NOT NULL
;


All ETL shown above is built by copying a typical generated link-satellite queryset from the raw data vault ETL and manually altering it to accomodate for the spatial functionality that is needed to construct the spatial link. When we run the queries and check out the result, we get the following.
Selecting the link and satellite


Subsequently, if an incident has been deleted in the source the change in the link relationship is detected by the ETL. The link is properly end-dated in the satellite and the newly inserted satellite record is set to voided, so it is clear the link is no longer valid.

Link between incident and maintenance area; the incident has been deleted in the source.

Conclusion


It is possible to model a data vault link based on a spatial relationship in the source. With some minor adjustments, standard ETL can be modified to create the SQL code needed to populate the link and satellite. All spatial links must be based on business context to avoid generation of useless links.