Friday, 25 November 2016

How spatial links form new business entities

In a previous post I showed how links can be created based on the spatial relationship between two business entities. In most cases, a spatial link is just a link like any other, defining a relationship between two geographical objects and answering questions like "how far apart are these two objects" or "how many objects are in the vicinity of this location".

In some cases however, a spatial link might result in the creation of new business information.
Let's consider the following example. From time to time, heavy rainfall causes parcels near watercourses to flood. These floods damage the crops that farmers grow on these parcels. To compensate the farmers, we determine which areas are flooded, how big they are, and which crops were grown there. In business terms, we have two data entities:

  1. A collection of parcels, including information about ownership and crop growth.
  2. A flooded area that is either digitized or modeled, containing information about the date, time and nature of the flood.


On the left, the parcels. On the right, the flooded area.

Both entities have a spatial attribute containing their location, which in both cases is a polygon. On a map, they look like this.

 
Data model for floods
The source data model, shown above, is actually quite simple: just two source tables, each with some attributes and a geometry column. They are spatially connected because their geometries occupy the same spot on the earth's surface.

Data vault model for floods
Modeling this source data with the data vault approach gives us the following data vault model. Again, it's a very simple model. The two source tables result in two hubs that are spatially linked (see the previous post about spatial links here).

But in this particular case, the parcels and the flood are not just in the same location; they interact with each other. When we intersect the flood contour with the parcels, we get the part of each parcel that is actually flooded.


Flooded areas (in red)
In doing this, we're actually creating new business entities that possess their own unique spatial location and combination of attributes. As you can see above, the three red polygons form new business entities called 'flooded areas'. So, by overlaying two hub entities we're essentially creating new business information by intersecting their spatial attributes. Remember the three types of spatial references? The creation of new spatial business entities through spatial linking is mostly the result of the overlap-type spatial reference.
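The intersection step can be sketched in a few lines of Python. This is an illustration only, with parcels simplified to axis-aligned rectangles and made-up IDs and coordinates; in the database the same result would come from a spatial intersection of the actual polygons.

```python
# Derive new 'flooded area' entities by intersecting a flood contour with
# parcels. Shapes are axis-aligned rectangles (xmin, ymin, xmax, ymax).

def intersect(a, b):
    """Return the intersection rectangle of a and b, or None if disjoint."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin >= xmax or ymin >= ymax:
        return None
    return (xmin, ymin, xmax, ymax)

parcels = {"P1": (0, 0, 4, 4), "P2": (4, 0, 8, 4), "P3": (8, 0, 12, 4)}
flood = (2, 1, 9, 3)  # hypothetical flood contour

flooded_areas = {}  # the new business entities, one per intersected parcel
for parcel_id, geom in parcels.items():
    overlap = intersect(geom, flood)
    if overlap is not None:
        w, h = overlap[2] - overlap[0], overlap[3] - overlap[1]
        flooded_areas[parcel_id] = {"geometry": overlap, "area": w * h}

print(flooded_areas)
```

Each entry in `flooded_areas` carries its own geometry and area, exactly the kind of new attribute combination described above.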

These new business entities can be modeled either as a link with a satellite, or as a new hub with a satellite attached to the existing link. There's currently some debate about the validity of the link satellite that centers around this question, but I'm not going to weigh in on that debate here. Either way, the resulting geometry of the intersection (the red polygons) is stored as a new geometry, called INTERSECTION_GEOMETRY, in the satellite of either the link or the newly created hub. Below I've modeled both scenarios to give you an idea of how this looks, so you can take your pick.

New business entity modeled as link satellite

New business entity modeled as a hub

Comments or questions? Please let me know.



Monday, 18 July 2016

"There is no geodata"

Last week I had a conversation with some colleagues about a project. While we were talking about the inventory of all attributes for a particular object across all systems, one colleague said: "Our main focus is limited to the geodata."

At this point, my reply was: "But, there is no 'geodata'!"

Everyone was shocked. "Then what have we been doing the past year?", they replied. I explained that doing a data inventory is always a good thing, but if you limit yourself to only a partial set, your inventory will be incomplete. The discussion ended quickly after this, but the subject kept nagging me. So, time for a post.


"80% of all data has a spatial component." Every geographer has heard this quote somewhere in their career. It has been cited numerous times, and the big GIS companies use it in their marketing to sell geospatial products. The validity of the statistic has been questioned repeatedly (e.g. here and here). It might or might not be true, but that's not the point. The quote obscures the simple fact that, even in a dataset that features geographic data, most of the data is non-spatial.

Instead, location is just one attribute of an object.

Just think about it. Every geospatial data table consists of columns, ranging from a few to several hundred. Only one column in the table contains the location of the objects. All the other columns contain descriptive data about the objects: numbers, strings, hyperlinks and binaries. Data that is clearly non-spatial.

In GIS, data is used and presented in the form of maps. Most datasets with that spatial component are locked up in specific geospatial databases, accessible only through GIS software operated by trained geospatial specialists. GIS has staked out its own domain in the IT landscape, only loosely coupled with mainstream IT. More often than not, there is a lack of mutual understanding between GIS and IT people. I figure this might have something to do with the fact that GIS emerged from the domain of environmental land use management rather than from IT.

Because of this, mainstream IT concepts, methodologies and technologies tend to be underused in the GIS domain. Which is a pity, because these can provide so much more value to spatial data. On the other hand, in mainstream data solutions the spatial potential of data is underused as well. The main reason for me to start this blog was to try and connect spatial data to BI concepts and methodologies.

I have always felt, and still feel, very strongly that GIS should reconnect with its IT origin. We can start by no longer isolating datasets that we call 'geodata', and by acknowledging that spatial data is part of something bigger. Our spatial data must be integrated into our mainstream IT solutions, applications and databases. After all, spatial data is data too. Who's with me?

Friday, 17 June 2016

Creating a spatial link

In my previous post I described the way spatial data is handled in Data Vault modeling using a modeling tool that is based on a standard 3NF-ruleset. This ruleset evaluates the data source tables to determine whether a source table has to be modeled as a hub or as a link. The tables that contain primary-foreign key constraints are modeled as links because they represent a relationship, and the tables that only contain a primary key are modeled as hubs.

Inevitably, the resulting raw data vault contains only a set of hubs and satellites. After all, our geospatial data source model doesn't contain any PK-FK constraints. All geospatial relationships in the database are implicit, remember? They can be constructed on the fly. So our source model does contain relationship data between the entities; however, these relationships can't be generated automatically. There are several different types of spatial relations, categorized into proximity, adjacency and overlap relations, and there's no way to automatically tell which spatial relationship type is relevant for any combination of two (or more) entities. This leads to the following principle:

"The business context of spatial relationships defines the nature of spatial data vault links."
The implication of this principle is that data vault links based on spatial relationships must be part of a business data vault layer in your data warehouse architecture, and that they must be constructed manually. This is not necessarily a bad thing: it forces you to analyze your business processes to determine which spatial relationships are worth capturing in your data warehouse.
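To make the three relation types concrete, here is a small Python sketch, not part of any DV tooling and with shapes simplified to axis-aligned rectangles: proximity is a distance threshold, adjacency is touching without overlapping, and overlap is a shared interior.

```python
# Three spatial relation types on rectangles (xmin, ymin, xmax, ymax).

def gap(a, b):
    """Shortest distance between two rectangles (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def overlaps(a, b):
    """Overlap relation: the rectangles share interior points."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def adjacent(a, b):
    """Adjacency relation: touching edges, but no shared interior."""
    return gap(a, b) == 0 and not overlaps(a, b)

def near(a, b, threshold):
    """Proximity relation: within a business-defined distance threshold."""
    return gap(a, b) <= threshold

r1, r2, r3 = (0, 0, 2, 2), (2, 0, 4, 2), (5, 0, 7, 2)
print(adjacent(r1, r2), overlaps(r1, r2), near(r1, r3, 3.5))  # → True False True
```

Note that the proximity check needs a threshold that only the business context can supply, which is exactly why these links can't be generated blindly.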

It might be possible to construct some form of DV ruleset to automatically generate all spatial links between source tables, similar to the way tools like Quipu use the relational source model to generate a raw data vault. This would be a very bad idea, though. Why? Because in a spatial source system, each table containing geography is technically linked to every other table by some spatial relation. Generating a raw spatial data vault would result in numerous links that are mostly irrelevant. So stick with the business data vault approach as far as spatial links are concerned.

So why is it worth the trouble to model spatial relationships as links in your data vault? You might argue that it is possible to model the spatial relationship downstream, in your data marts, or maybe even in real time in the end user interface. After all, the spatial information for each object is nicely retained in the raw data vault. The answer is simple: spatial analysis requires a fair amount of processing time and can put a real strain on the application if done in real time. Having the spatial relationship already available in a link can save a lot of time on the user side, and it lets you use the relationship even if the chosen end user interface doesn't support spatial capabilities. It really comes down to a design choice.

Example: linking incidents to maintenance areas


Let's take another look at the spatial raw data vault that I modeled in my last post. It consisted of a collection of 4 hubs and 4 corresponding satellites. There were no links in the data model. However, all 4 hub satellites contain a geographical attribute. It makes sense to define some form of relationship between the entities in the raw data vault based on these attributes. In this example, we decide to only model the relationship between incidents and maintenance areas.

Part of the raw data vault model with entities visualized on a map

As we can see above, each incident falls inside one (and only one) maintenance area at a specific time. Over time, maintenance areas will change and new incidents will be reported. The location of an incident is fixed once assigned, but incidents can be deleted from the source. The relation between incidents and maintenance areas gives us useful information about the supervisors' responsibilities for incident management. The relationship link and satellite are added to the data vault model created in the previous post, linking the incidents hub and the maintenance areas hub together.




The link and satellite are constructed with the use of DDL scripting. Both link and satellite get an index on the primary key.

CREATE TABLE edw.bdv.incident_in_area_l
( incident_in_area_lid INT NOT NULL IDENTITY(1,1)
, incidents_hid INT NOT NULL
, maintenance_areas_hid INT NOT NULL
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk__incident_in_area_l
  PRIMARY KEY( incident_in_area_lid
        )
);

CREATE UNIQUE INDEX ux__incident_in_area_l ON edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
);

CREATE TABLE edw.bdv.incident_in_area_l_s
( incident_in_area_lid INT NOT NULL
, load_date DATETIME NOT NULL
, load_date_end DATETIME DEFAULT '9999-12-31'
, voided BIT
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, audit_spatial_relation VARCHAR(80)
, CONSTRAINT pk__incident_in_area_l_s
  PRIMARY KEY( incident_in_area_lid
        , load_date
        )
);

CREATE INDEX pd__incident_in_area_l_s ON edw.bdv.incident_in_area_l_s
( load_date
, load_date_end
);


The link model is fairly straightforward and complies with standard data vault modeling rules. You might notice that the satellite contains very little context information: the link is based solely on the spatial relation and doesn't have any other context in the source model. One crucial addition is the extra field AUDIT_SPATIAL_RELATION, which stores the specific type of spatial relationship that forms the basis for the link. This is necessary because geographical objects can have many different types of spatial relationships; objects can touch, partially or completely overlap, contain or be near each other. The nature of the spatial relationship is therefore business context, and has to be stored as such in the link satellite.

Next, we need to create the ETL that populates the link with the appropriate data. This is done in two steps:
  1. Join the incidents and maintenance areas tables in the staging area with a spatial join condition (the STWithin clause in the query below);
  2. Join the incidents and maintenance areas hubs from the raw data vault based on the business key.
This gives us the following query.

------ insert new business keys data into 'incident_in_area_l'
INSERT INTO edw.bdv.incident_in_area_l
( incidents_hid
, maintenance_areas_hid
, audit_timestamp
, audit_recordsource
)
SELECT DISTINCT h0.incidents_hid
              , h1.maintenance_areas_hid
              , CURRENT_TIMESTAMP as audit_timestamp
              , 'edw.sta_spatial.incidents' as audit_recordsource
    FROM edw.sta_spatial.incidents stg0
    JOIN edw.sta_spatial.maintenance_areas stg1 ON (stg0.ligging.STWithin(stg1.ligging) = 1)
    LEFT JOIN edw.sdv_spatial.incidents_h h0 ON stg0.MELDINGID = h0.MELDINGID
    LEFT JOIN edw.sdv_spatial.maintenance_areas_h h1 ON stg1.GAG_ID = h1.GAG_ID
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l l
                    WHERE (h0.incidents_hid = l.incidents_hid OR (h0.incidents_hid IS NULL AND l.incidents_hid IS NULL))
                      AND (h1.maintenance_areas_hid = l.maintenance_areas_hid OR (h1.maintenance_areas_hid IS NULL AND l.maintenance_areas_hid IS NULL))
                  )
;


The spatial join in this example uses the SQL Server-specific STWithin method, which identifies all features from one table that lie within the features of a second table. The join is effectively a cartesian product filtered by the spatial predicate. Because maintenance areas don't overlap, each incident point is located in only one area, so each incident is joined to exactly one maintenance area, resulting in the desired link.
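For readers without a spatial database at hand, the effect of the STWithin join can be mimicked in plain Python. The areas and incident coordinates below are made up; containment is tested with the classic ray-casting algorithm.

```python
def contains(polygon, point):
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge spans the point's y-coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# hypothetical maintenance areas (vertex lists) and incident points
areas = {"north": [(0, 0), (5, 0), (5, 5), (0, 5)],
         "south": [(5, 0), (10, 0), (10, 5), (5, 5)]}
incidents = {101: (2, 2), 102: (7, 1)}

# the spatial join: link each incident to the area that contains it
links = {inc: area for inc, pt in incidents.items()
         for area, poly in areas.items() if contains(poly, pt)}
print(links)  # {101: 'north', 102: 'south'}
```

Because the areas don't overlap, each incident ends up in exactly one link, just as in the SQL version.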

Finally, we need to create the proper ETL to fill and maintain the link satellite. It consists of three queries: one to insert context for new links, one to insert context for deleted links, and one to end-date the deleted links. These queries are pretty straightforward, apart from the fact that the link isn't based on one source table but on the implicit relationship. Notable in the first query is the insertion of the value 'inside' into the field AUDIT_SPATIAL_RELATION, as explained above.

-- insert new business keys data from 'incident_in_area_l' into 'incident_in_area_l_s'
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP  -- there is no staging for this link, so load date equals audit timestamp
     , 0
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
 FROM edw.bdv.incident_in_area_l l
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.bdv.incident_in_area_l_s dv
                    WHERE (l.incident_in_area_lid = incident_in_area_lid OR (l.incident_in_area_lid IS NULL AND incident_in_area_lid IS NULL))
                      AND (dv.voided = 0 OR dv.voided IS NULL)
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.bdv.incident_in_area_l_s
                                        WHERE incident_in_area_lid = dv.incident_in_area_lid
                                        and voided = 0
                                      )
                  )
;


-- insert deleted business keys data into 'incident_in_area_l_s
INSERT INTO edw.bdv.incident_in_area_l_s
( incident_in_area_lid
, load_date
, voided
, audit_timestamp
, audit_recordsource
, audit_spatial_relation
)
SELECT incident_in_area_lid
     , CURRENT_TIMESTAMP
     , 1
     , CURRENT_TIMESTAMP
     , 'edw.bdv.incident_in_area'
     , 'inside' as audit_spatial_relation
  FROM edw.bdv.incident_in_area_l_s dv
 WHERE load_date = (SELECT MAX(load_date)
                      FROM edw.bdv.incident_in_area_l_s
                     WHERE incident_in_area_lid = dv.incident_in_area_lid
                   )
   AND dv.voided = 0
   AND NOT EXISTS (SELECT 1
                   FROM edw.bdv.incident_in_area_l l
                   LEFT JOIN edw.sdv_spatial.incidents_h_s t0s
                   ON ((t0s.incidents_hid = l.incidents_hid) and (t0s.voided = 0) and (t0s.load_date_end = '9999-12-31 00:00:00.000'))
                   LEFT JOIN edw.sdv_spatial.maintenance_areas_h_s t1s
                   ON ((t1s.maintenance_areas_hid = l.maintenance_areas_hid) and (t1s.voided = 0) and (t1s.load_date_end = '9999-12-31 00:00:00.000'))
                   WHERE (t0s.ligging.STWithin(t1s.ligging) = 1)
                   AND l.incident_in_area_lid = dv.incident_in_area_lid
                  )
;


-- Satellite end dating in 'incident_in_area_l_s'
UPDATE edw.bdv.incident_in_area_l_s
   SET load_date_end = (SELECT MIN(s.load_date)
                          FROM edw.bdv.incident_in_area_l_s s
                         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
                           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
                       )
 WHERE load_date_end = '9999-12-31' -- in ISO 8601 format
   AND (SELECT MIN(s.load_date_end)
          FROM edw.bdv.incident_in_area_l_s s
         WHERE s.incident_in_area_lid = edw.bdv.incident_in_area_l_s.incident_in_area_lid
           AND s.load_date > edw.bdv.incident_in_area_l_s.load_date
       ) IS NOT NULL
;
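The end-dating pattern above (close a record as soon as a newer one exists for the same key) is generic. A minimal Python sketch of the same logic over in-memory rows, with made-up dates, looks like this:

```python
# End-date satellite rows: an open-ended row is closed with the load date of
# the next row for the same link id, mirroring the UPDATE statement above.
OPEN_END = "9999-12-31"

rows = [
    {"lid": 1, "load_date": "2016-06-01", "load_date_end": OPEN_END},
    {"lid": 1, "load_date": "2016-06-10", "load_date_end": OPEN_END},
    {"lid": 2, "load_date": "2016-06-05", "load_date_end": OPEN_END},
]

for row in rows:
    later = [r["load_date"] for r in rows
             if r["lid"] == row["lid"] and r["load_date"] > row["load_date"]]
    if row["load_date_end"] == OPEN_END and later:
        row["load_date_end"] = min(later)  # MIN(load_date) of the newer rows

# the first row for lid 1 is end-dated; the latest rows stay open-ended
```

ISO-formatted date strings compare correctly as plain strings, which is why the sketch gets away without a date type.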


All ETL shown above is built by copying a typical generated link-satellite queryset from the raw data vault ETL and manually altering it to accommodate the spatial functionality needed to construct the spatial link. When we run the queries and check the result, we get the following.
Selecting the link and satellite


Subsequently, if an incident has been deleted in the source, the change in the link relationship is detected by the ETL. The link is properly end-dated in the satellite, and the newly inserted satellite record is set to voided, so it is clear the link is no longer valid.

Link between incident and maintenance area; the incident has been deleted in the source.

Conclusion


It is possible to model a data vault link based on a spatial relationship in the source. With some minor adjustments, standard ETL code can be modified to populate the link and satellite. All spatial links must be based on business context, to avoid generating useless links.

Tuesday, 17 May 2016

Loading spatial data in a raw data vault

In this post I'm showing an example of spatial data in combination with the Data Vault methodology, based on the philosophy implemented at our company. There are several philosophies on how to model your Enterprise Data Warehouse using Data Vault (DV), each with its merits and drawbacks. This blog isn't meant to discuss those philosophies; there are far more suitable blogs for that. Instead, we're going to focus on one specific item: how to incorporate spatial data in the Data Vault method. Data Vault consists of three main components: hubs, links and satellites. Hubs contain the business entities, links contain the relationships between business entities, and satellites contain the context of hubs and links.

So, how does Data Vault handle spatial data? The geometry attributes of an object provide context about its location in the real world. Most RDBMSs support spatial information as a specific object type, so at this point it is safe to assume that geometry will find its place in the data vault. But, as we've seen in a previous post, geometry also provides spatial reference to other objects. In other words, it forms relationships between objects. Will this affect the way DV handles spatial data?

The example


In our architecture we are using a raw data vault that serves as a technical copy of the source systems. Prior to loading the raw data vault, source systems are copied to a staging area. After loading, data is processed into several data marts that are accessed by end users through reporting, analysis (OLAP) and, in the near future, dashboard solutions.

Our Enterprise BI architecture

As I said, it is one of several philosophies. This is a fairly technical approach, with the benefit that a lot of work building the EDW can be automated. The routines needed for loading the staging areas and the raw data vault are generated with the use of the DV modeling tool Quipu (which was open source at the time we started building our EDW). The manual labor required to transform the data for end user purposes (the big T) is offloaded to the data mart phase.

#1: Import the source system


For this demo, I've created a new example source system with 4 spatial tables selected from our company geodatabase, containing incidents, maintenance areas, and the border of our service area both as a line and as a polygon (in two separate tables). All 4 tables contain a GEOMETRY type column, named either LIGGING or GEOM.

Spatial source system model example


The source system model is imported into Quipu. The first thing we notice is that Quipu doesn't recognize geometry type columns, so we need to edit the column type and native type manually.

Source model in Quipu, with geometry type manually altered

 

#2: Loading the staging area


In the next step, we create the staging area, which is a copy of the source with some added metadata fields. The appropriate staging area DDL and ETL is generated by Quipu. Let's take a look at the DDL code for the maintenance areas table below.

CREATE TABLE edw.sta_spatial.maintenance_areas
( GAG_ID bigint
, LIGGING_WKT varchar(MAX)
, AREA_NAME varchar(50)
, SUPERVISOR_NAME varchar(50)
, SRID int
, ligging geometry
, stg_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
, stg_sequence INT DEFAULT 1
, voided BIT DEFAULT 0
, audit_id INT
);


As we can see, the DDL is generated with the correct geometry type field after the manual alteration in the previous step. With this DDL, we can create the proper staging area tables without any further edits.

In this process, Quipu also generated the ETL needed to fill the staging area; in our case, this ETL consists of Pentaho Data Integration (PDI) code. As explained in an earlier post, PDI can support geometry type fields. Unfortunately, PDI only offers geometry support in combination with Oracle and PostgreSQL. Since our EDW is built on the MS SQL Server platform, we needed a workaround: we insert the geometry into the staging area in well-known text format (stored in the field LIGGING_WKT) and convert it to an actual geometry with an INSTEAD OF trigger, which allows us to fill the staging area properly.

CREATE TRIGGER [dbo].[maintenance_areas_trg] ON [dbo].[maintenance_areas]
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO maintenance_areas (
       [GAG_ID]
      ,[LIGGING_WKT]
      ,[AREA_NAME]
      ,[SUPERVISOR_NAME]
      ,[SRID]
      ,[ligging]
    ) SELECT
       [GAG_ID]
      ,[LIGGING_WKT]
      ,[AREA_NAME]
      ,[SUPERVISOR_NAME]
      ,[SRID]
      ,geometry::STGeomFromText ( LIGGING_WKT , 28992 )
     FROM
        INSERTED
END


The trigger bypasses the direct insertion of the field LIGGING. In the code above, we can see the field is filled by converting text to actual geometry using the SQL Server method STGeomFromText.

This is not the only possible solution; alternatives range from using another ETL tool to fill the staging area to changing the EDW database platform. For now, suffice it to say that filling the staging area with spatial data is possible.
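The WKT round trip that the trigger performs can be illustrated in Python. This is a toy parser that only handles POINT (real WKT covers many more geometry types), and the coordinates below are made up:

```python
# Minimal well-known text (WKT) round trip for points, illustrating what
# STGeomFromText does on the database side: text in, geometry out.

def point_from_wkt(wkt):
    """Parse 'POINT (x y)' into a coordinate tuple."""
    assert wkt.startswith("POINT")
    coords = wkt[wkt.index("(") + 1 : wkt.index(")")].split()
    return tuple(float(c) for c in coords)

def point_to_wkt(pt):
    """Serialize a coordinate tuple back to WKT."""
    return f"POINT ({pt[0]} {pt[1]})"

p = point_from_wkt("POINT (155000 463000)")  # hypothetical RD New-style coordinates
print(p)  # (155000.0, 463000.0)
```

Note that WKT itself carries no SRID, which is why the staging table stores the SRID in a separate column and the trigger passes 28992 explicitly.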

#3: Designing the raw data vault


The next step in this example consists of modeling the raw data vault. Our modeling software Quipu will automatically generate a data vault style model based on the technical properties of the staging area. It also allows us to manually identify the business key in the staging tables that lack a primary key in the source model.





Generating the spatial raw data vault example

After analyzing our source model (shown above in paragraph #1), Quipu generates a data vault model consisting of 4 hubs and corresponding satellites. The geometry type fields (highlighted in green) have become part of the satellites, which is logical as they are identified as context properties.
Spatial raw data vault model based on example source system


Notice the absence of any links in this model? As I explained in previous posts (here and here), geographical databases rely heavily on implicit spatial relationships to tie data together. However, a raw data vault generated by a logical rule engine ignores these spatial relationships. As far as a raw data vault is concerned, this is correct behavior. As a consequence, geographical data sources with 100+ tables will inevitably be generated into raw data vaults with 100+ hubs and satellites and virtually no links.

After generating the model, we can continue with generating the DDL and ETL necessary to create the model in our EDW environment. The DDL below shows the generation of the maintenance areas hub and satellite, including the geometry field in the satellite. The generated DDL is flawless because we altered the type and native type column for the geometry field named 'LIGGING' manually in step #1.

CREATE TABLE edw.sdv_spatial.maintenance_areas_h
( maintenance_areas_hid INT NOT NULL IDENTITY(1,1)
, GAG_ID bigint
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk_maintenance_areas_h
  PRIMARY KEY( maintenance_areas_hid
        )
);

CREATE UNIQUE INDEX bk_maintenance_areas_h ON edw.sdv_spatial.maintenance_areas_h
( GAG_ID
);

CREATE TABLE edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid INT NOT NULL
, load_date DATETIME NOT NULL
, load_date_end DATETIME DEFAULT '9999-12-31'
, LIGGING_WKT varchar(MAX)
, AREA_NAME varchar(50)
, SUPERVISOR_NAME varchar(50)
, SRID int
, ligging geometry
, voided BIT
, audit_timestamp DATETIME
, audit_recordsource VARCHAR(128)
, CONSTRAINT pk_maintenance_areas_h_s
  PRIMARY KEY( maintenance_areas_hid
        , load_date
        )
);

CREATE INDEX pd_maintenance_areas_h_s ON edw.sdv_spatial.maintenance_areas_h_s
( load_date
, load_date_end
);


We also generate the ETL steps needed to make the data vault run. These steps compare new data loaded into the staging area with the data already present in the raw data vault, identifying changes and storing them in the data vault.

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h'
INSERT INTO edw.sdv_spatial.maintenance_areas_h
( GAG_ID
, audit_timestamp
, audit_recordsource
)
SELECT DISTINCT GAG_ID
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM edw.sta_spatial.maintenance_areas stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h
                    WHERE (stg.GAG_ID = GAG_ID OR (stg.GAG_ID IS NULL AND GAG_ID IS NULL))
                  )
;

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h_s'
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , stg_timestamp
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , voided
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM ( SELECT (SELECT MAX(maintenance_areas_hid)
                   FROM edw.sdv_spatial.maintenance_areas_h
                  WHERE (src.GAG_ID = GAG_ID OR (src.GAG_ID IS NULL AND GAG_ID IS NULL))
                ) as maintenance_areas_hid
              , stg_timestamp
              , LIGGING_WKT
              , AREA_NAME
              , SUPERVISOR_NAME
              , SRID
              , ligging
              , voided
           FROM edw.sta_spatial.maintenance_areas src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h_s dv
                    WHERE (stg.maintenance_areas_hid = maintenance_areas_hid OR (stg.maintenance_areas_hid IS NULL AND maintenance_areas_hid IS NULL))
                      AND (stg.LIGGING_WKT = LIGGING_WKT OR (stg.LIGGING_WKT IS NULL AND LIGGING_WKT IS NULL))
                      AND (stg.AREA_NAME = AREA_NAME OR (stg.AREA_NAME IS NULL AND AREA_NAME IS NULL))
                      AND (stg.SUPERVISOR_NAME = SUPERVISOR_NAME OR (stg.SUPERVISOR_NAME IS NULL AND SUPERVISOR_NAME IS NULL))
                      AND (stg.SRID = SRID OR (stg.SRID IS NULL AND SRID IS NULL))
                      AND ((stg.ligging.STEquals(ligging) = 1) OR (stg.ligging IS NULL AND ligging IS NULL))
                      AND (stg.voided = voided OR (stg.voided IS NULL AND voided IS NULL))
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.sdv_spatial.maintenance_areas_h_s
                                        WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                                      )
                  )
;

-- insert deleted business keys data from 'maintenance_areas' into 'maintenance_areas_h_s
-- entries in the edw.dbo.edw_load with is_full_load = 1 are treated as full loads!
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , COALESCE(ld.load_timestamp,CURRENT_TIMESTAMP)
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , 1
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM edw.sdv_spatial.maintenance_areas_h_s dv
 INNER
  JOIN edw.dbo.edw_load ld
    ON ld.table_name = 'edw.sta_spatial.maintenance_areas'
   AND ld.is_full_load = 1
 WHERE load_date = (SELECT MAX(load_date)
                      FROM edw.sdv_spatial.maintenance_areas_h_s
                     WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                   )
   AND dv.voided = 0
   AND NOT EXISTS (SELECT 1
                     FROM edw.sta_spatial.maintenance_areas src
                     LEFT OUTER HASH
                     JOIN edw.sdv_spatial.maintenance_areas_h t1821432
                       ON (t1821432.GAG_ID = src.GAG_ID OR (t1821432.GAG_ID IS NULL AND src.GAG_ID IS NULL))
                    WHERE t1821432.maintenance_areas_hid = dv.maintenance_areas_hid
                  )
;

-- Satellite end dating in 'maintenance_areas_h_s'
UPDATE edw.sdv_spatial.maintenance_areas_h_s
   SET load_date_end = (SELECT MIN(s.load_date)
                          FROM edw.sdv_spatial.maintenance_areas_h_s s
                         WHERE s.maintenance_areas_hid = edw.sdv_spatial.maintenance_areas_h_s.maintenance_areas_hid
                           AND s.load_date > edw.sdv_spatial.maintenance_areas_h_s.load_date
                       )
 WHERE load_date_end = '9999-12-31' -- in ISO 8601 format
   AND (SELECT MIN(s.load_date_end)
          FROM edw.sdv_spatial.maintenance_areas_h_s s
         WHERE s.maintenance_areas_hid = edw.sdv_spatial.maintenance_areas_h_s.maintenance_areas_hid
           AND s.load_date > edw.sdv_spatial.maintenance_areas_h_s.load_date
       ) IS NOT NULL
;


This code needs one small but crucial alteration for each generated satellite. The query that inserts new business key data into the satellite compares the existing data with the new data loaded into the staging area using an '=' comparison. This operator won't work for geometry type fields; instead, the database platform offers specific methods for handling spatial data. Let's take a closer look at the satellite insert statement.
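Before we look at that statement, the reason '=' fails for geometries can be shown without a database: two vertex rings can describe the same polygon while differing in start vertex and direction. The Python sketch below normalizes simple rings before comparing, which is roughly what an STEquals-style test accounts for (real spatial equality handles much more, such as collinear extra vertices):

```python
# Compare polygon rings for spatial equality, ignoring start vertex,
# direction, and an optional repeated closing vertex.

def normalize(ring):
    if ring[0] == ring[-1]:          # drop a closing vertex that repeats the first
        ring = ring[:-1]
    candidates = []
    for r in (ring, ring[::-1]):     # both orientations
        for i in range(len(r)):      # every rotation
            candidates.append(tuple(r[i:] + r[:i]))
    return min(candidates)           # canonical form

def rings_equal(a, b):
    return normalize(a) == normalize(b)

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
same_square = [(4, 4), (4, 0), (0, 0), (0, 4)]  # different start and direction
print(rings_equal(square, same_square))  # True, although '==' would say False
```

A naive `square == same_square` comparison would report a change and insert a spurious satellite record, which is exactly the problem the STEquals alteration avoids.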

-- insert new business keys data from 'maintenance_areas' into 'maintenance_areas_h_s'
INSERT INTO edw.sdv_spatial.maintenance_areas_h_s
( maintenance_areas_hid
, load_date
, LIGGING_WKT
, AREA_NAME
, SUPERVISOR_NAME
, SRID
, ligging
, voided
, audit_timestamp
, audit_recordsource
)
SELECT maintenance_areas_hid
     , stg_timestamp
     , LIGGING_WKT
     , AREA_NAME
     , SUPERVISOR_NAME
     , SRID
     , ligging
     , voided
     , CURRENT_TIMESTAMP
     , 'edw.sta_spatial.maintenance_areas'
  FROM ( SELECT (SELECT MAX(maintenance_areas_hid)
                   FROM edw.sdv_spatial.maintenance_areas_h
                  WHERE (src.GAG_ID = GAG_ID OR (src.GAG_ID IS NULL AND GAG_ID IS NULL))
                ) as maintenance_areas_hid
              , stg_timestamp
              , LIGGING_WKT
              , AREA_NAME
              , SUPERVISOR_NAME
              , SRID
              , ligging
              , voided
           FROM edw.sta_spatial.maintenance_areas src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM edw.sdv_spatial.maintenance_areas_h_s dv
                    WHERE (stg.maintenance_areas_hid = maintenance_areas_hid OR (stg.maintenance_areas_hid IS NULL AND maintenance_areas_hid IS NULL))
                      AND (stg.LIGGING_WKT = LIGGING_WKT OR (stg.LIGGING_WKT IS NULL AND LIGGING_WKT IS NULL))
                      AND (stg.AREA_NAME = AREA_NAME OR (stg.AREA_NAME IS NULL AND AREA_NAME IS NULL))
                      AND (stg.SUPERVISOR_NAME = SUPERVISOR_NAME OR (stg.SUPERVISOR_NAME IS NULL AND SUPERVISOR_NAME IS NULL))
                      AND (stg.SRID = SRID OR (stg.SRID IS NULL AND SRID IS NULL))
                      AND ((stg.ligging.STEquals(ligging) = 1) OR (stg.ligging IS NULL AND ligging IS NULL))
                      AND (stg.voided = voided OR (stg.voided IS NULL AND voided IS NULL))
                      AND load_date = (SELECT MAX(load_date)
                                         FROM edw.sdv_spatial.maintenance_areas_h_s
                                        WHERE maintenance_areas_hid = dv.maintenance_areas_hid
                                      )
                  )
;


The geometry comparison (the STEquals condition near the end of the statement) is the part where the newly inserted geometries are compared to the geometry data already present in our data vault. In the Quipu-generated statement they were simply compared using the '=' operator, which won't work. So we alter the statement slightly, using the SQL Server specific method STEquals (other RDBMSs offer equivalent functions with slightly different syntax, such as ST_Equals in PostGIS). This method compares the actual spatial data stored in the two geometry fields. This small edit needs to be done for each satellite ETL that is generated; in our example this means altering four SQL statements. After this edit, our generated ETL works as expected, identifying all altered objects in our staging area.
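To see concretely why a plain '=' comparison falls short, here is a small pure-Python sketch (all names are invented for illustration). The same polygon can be stored with a different start vertex or winding order; a value comparison then fails, while a spatial comparison in the spirit of STEquals still recognizes the match.

```python
def normalize_ring(ring):
    """Return a canonical form of a closed polygon ring.

    Rotating the vertex list to start at the smallest vertex and fixing the
    winding direction makes spatially equal rings compare equal.
    """
    pts = list(ring)
    if pts[0] == pts[-1]:          # drop the closing duplicate vertex
        pts = pts[:-1]
    start = pts.index(min(pts))    # rotate: smallest vertex first
    pts = pts[start:] + pts[:start]
    if pts[1] > pts[-1]:           # normalize winding direction
        pts = [pts[0]] + pts[:0:-1]
    return tuple(pts)

def rings_spatially_equal(a, b):
    return normalize_ring(a) == normalize_ring(b)

# The same square, digitized from a different start vertex:
sq1 = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
sq2 = [(1, 1), (0, 1), (0, 0), (1, 0), (1, 1)]

print(sq1 == sq2)                       # False: '=' style value comparison
print(rings_spatially_equal(sq1, sq2))  # True: spatial comparison
```

Real spatial equality (as implemented by STEquals or ST_Equals) is more involved than this ring normalization, but the principle is the same: compare the shapes, not the stored representations.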

Conclusion


When integrating spatial data in a raw data vault style EDW, there are two important questions to consider:
  • Does my ETL tooling support spatial object types?
  • What alterations do I have to make to my code?
This example shows us that, with some minor alterations to our code, we can easily adapt a raw data vault to support storing and tracking spatial attribute information. In the next post, we will take a look at spatial data in the business data vault.

maandag 18 april 2016

Using spatial relationships in ETL

The case

In this post we're going to take a look at spatial ETL. A lot of BI work traditionally consists of getting our data from A to B and transforming it in the process. So what happens when our source data contains spatial information?

For example, we have two tables in our source database. One contains 9 maintenance areas, each assigned to a particular supervisor. The other contains over 3,200 incidents that have been reported in the area. Both tables are stored in an Oracle 11g database, contain a spatial column and can be visualised with GIS software to show where they're situated.
All incidents

9 maintenance areas, colored by unique area


Now imagine we want to determine the designated supervisor for each incident, to report the progress of incident management on a dashboard. No primary-foreign key construction exists between the two tables. There is no way to link these two objects together other than by their spatial relationship.

Common ETL tools usually don't know how to treat spatial data columns. But with the right combination of functionality, you can certainly do spatial ETL. In this example, I will be using Pentaho Data Integration 6.0 (PDI) and the additional AtolCD Pentaho GIS plugin. The plugin adds two specific improvements to regular Pentaho:
  1. The ability to treat spatial columns as column type 'geometry' within the ETL.
  2. A set of transformation steps that enhance Pentaho with geospatial operations.

The solution path

I've created a simple Pentaho transformation to combine the incidents and maintenance areas. As you will notice, the transformation looks like any other, with the exception of a couple of geospatial steps that are explained further on. All steps in the transformation shown below are referred to in the text in double quotes, starting with a capital (e.g. "Table incidents").

PDI transformation with added geospatial steps.

We see that both tables are loaded with two standard table input steps, "Table incidents" and "Table maintenance areas", resulting in two data streams. When we examine the output of these steps, we see the following data.

Incidents table preview

The incident table contains about 3,200 incidents that each have a unique ID and an incident location. The maintenance area table contains 9 maintenance areas, each with a name, a supervisor name and an area location.

Maintenance areas table preview

Notice how the columns INCIDENT_LOCATION and MAINTENANCE_AREA_LOCATION contain some kind of coordinate definition? We can clearly see one contains points and the other contains polygons. The added geospatial functionality enables PDI to show the contents of the geospatial columns as readable text. We can see the field types when examining the output fields for the two table input steps; the location fields are marked as type 'Geometry'.

Output fields for the incidents stream

Output fields for the maintenance area stream

The text "SRID=28992" tells us that the spatial data is in the Dutch national coordinate system. It is important that the data in both geometry columns is in the same coordinate system; otherwise it isn't possible to relate the two columns. We verify the coordinate system with a "Geometry information" step for each data stream. This step is part of the additional steps added by the GIS plugin.

The next step is to join both data streams with a "Cartesian join". This step doesn't yet recognize the spatial relationship; it simply joins each record from one stream with each record from the other. For this example, this means that because we started with about 3,200 incidents and 9 maintenance areas, the resulting data stream will contain roughly 28,800 records.
Cartesian join result




In the join result we can clearly see that incident ID no 1012 is joined to all 9 area names. But an incident can't be in all 9 maintenance areas at the same time. We wanted to identify the maintenance area for each incident, remember?

As a result of the join, the datastream now contains two geometry columns. We need to filter the datastream based on the spatial relationship. For this, we use the step called "Determine spatial relationship", which is also part of the GIS plugin.
Define the spatial relationship between incidents and maintenance area.

In this step we define the spatial relationship between the two geometry columns. The step supports many different spatial relationships that describe how two objects relate by location. The one we're looking for is simply called 'Within', as we want to identify each incident within a maintenance area. The spatial analysis settings shown above can be summarized as a select statement that keeps the records where 'INCIDENT_LOCATION is within MAINTENANCE_AREA_LOCATION' evaluates to TRUE. When this step is run, only those records are kept in the resulting datastream.
Incidents within their maintenance area.
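The logic of these two steps, a cartesian join followed by a 'Within' filter, can be sketched outside PDI in a few lines of Python (a minimal ray-casting point-in-polygon test with invented incident and area data, purely illustrative):

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: count how often a ray from pt to the right crosses
    a polygon edge; an odd count means the point lies inside."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

incidents = [(1010, (2.0, 2.0)), (1012, (8.5, 1.0))]    # (id, point)
areas = [("North", [(0, 0), (5, 0), (5, 5), (0, 5)]),   # (name, polygon)
         ("South", [(5, 0), (10, 0), (10, 5), (5, 5)])]

# Cartesian join, then keep only the rows where the 'Within' relation holds
matched = [(iid, name)
           for iid, pt in incidents
           for name, poly in areas
           if point_in_polygon(pt, poly)]
print(matched)  # [(1010, 'North'), (1012, 'South')]
```

The GIS plugin's step does the same filtering for us, only with real geometries and a full set of spatial predicates.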

As the image above shows, the resulting datastream contains one record per incident ID with the corresponding maintenance area information. However, the datastream still contains two geometry columns. This is no problem for the database table we're about to create, but most common GIS software supports only one geometry type per data layer. So we make a selection of the output fields with the "Select values" step, ensuring that the resulting table only contains the incident's point geometry, combined with ID, area name and supervisor name.
Select the necessary fields
After this, the final step is to write the resulting data to a new database table with the "Table output" step. (Another approach could be to add the maintenance area and supervisor name to the existing incidents table; in that scenario we might consider using an "Insert/update" step.) Of course, the receiving database must support geometry type database fields.
Output the table to the database
Now, when we visualize the resulting dataset with GIS software, we can see that each incident is nicely matched with its corresponding maintenance area by color.

Incidents colored by maintenance area

Conclusion

Spatial relationships allow us to define relationships between objects that have no other relation to each other than the locational one. Extending standard ETL tools with geospatial capabilities is a very cool way to make use of the specific advantage of geospatial data.

Granted, there are some caveats in the use of spatial properties in ETL tooling.
  • The AtolCD Pentaho GIS plugin used for PDI supports Oracle Spatial and PostgreSQL/PostGIS geometry (and also MySQL, but this I haven't tested), but lacks support for MS SQL Server geometry.
  • Geospatial data can get quite large. Substantially larger datasets like cadastral parcel information (400,000+ records), joined with a cartesian join, will result in billions of records in the datastream. This may hamper performance, but I haven't tested this yet.
  • I have yet to try this in other ETL tools like SSIS or Talend.
Still, I'm confident these issues will eventually get resolved.

dinsdag 5 april 2016

Geography in 3NF normalized source systems

We're going to look for geography in source systems in this post. When designing a BI solution, most data will be stored in one or more source systems that follow a 3NF relational model stored in a database. It is fairly uncommon for such source systems to have true geographical information in geography type database fields, but that doesn't mean these sources don't contain geographical information.

Let's take for example the Sakila database model for MySQL (the original data model by Matt Casters, Jos van Dongen & Roland Bouman can be found here). This well-normalized sample database shows us three entities that are likely to have a geographical component: the address, represented with a point geometry, and the city and country, represented as polygons with their respective city limits and country borders. Address, city and country geometries are easily obtained through a geocoding service (see my previous post about geometry types).

Geometry attributes added to address, city and country. Geographical relationships in red.
The relations between address, city and country in this data model are stored with primary-foreign key relationships. However, adding geometry to these entities allows us to perform additional analysis on the data model. In the Sakila data model we can also see that three entities, namely store, customer and staff, are related to the address entity. By adding geometry to the address we can analyse which store each customer lives closest to, and compare this to the store at which they rent the most movies.


Geographical relation between store and customer, based on address geography.
We might learn that some customers would rather go to a store further away from their home, and try to explain this with other data, like sales data. Additionally, plotting these customer addresses and their relation to the stores they frequently visit on a map can give us a clear indication of 'blind spots' in our store coverage.

So at this point we really only have to decide how to incorporate the geographical information in our design. If we're building a datawarehouse solution, we might consider storing the geographical attribute of the addresses, cities and countries in our datawarehouse on a regular basis. 

In reality, source systems are hardly ever as properly normalized as the example above. Very often, address information is scattered over several tables in the data model. For instance, the customer, store and staff tables might each contain their own address attributes. For geographical analysis this doesn't make much difference; we can still perform analysis as described above.

The geographical address entity is very easy to spot in source systems, but more entities can be identified as having a geographical attribute. Not so much in financial systems, but latent geography can be found in the source systems of organisations whose operations take place in public space.

The 'geodatabase'

Organisations that handle a lot of geographical data often have a separate GIS department. This GIS department will most likely have its information stored in a dedicated geographically enabled database they call the 'geodatabase'. But this geodatabase has few of the characteristics you would expect of a BI database. GIS systems usually don't require a 3NF datamodel in their database environment. Instead, GIS systems are equipped with functionality to add and remove tables with geodata to any database that supports geography.

Let's take a look at the image below. It shows the datamodel of a particular schema from our geodatabase. Notice the lack of any relationships between the tables.
One of the schemas in our 'geodatawarehouse'. No key-based relationships between tables.

In traditional 3NF systems you will find relationships between tables based on keys (PK/FK, many-to-many etc.). In a geodatabase, every table is related to every other through the spatial reference of the geographical attribute, as explained in this previous post. In the datamodel above, each table has an attribute named GEOMETRY (with type OTHER, because SQL Power Architect doesn't support geometry types) which relates it to every other table in the database. Seen from one of the tables, these relationships would look something like this:
Table MINSET_STUW is related to all other tables (and self-referencing) through the geometry attribute.
However, if we would draw relationships for all tables it might look something like this:
All tables related to all other tables
You can imagine that in a geodatabase with between 200 and 1000 separate tables this would generate an unreadable datamodel.
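That unreadability is easy to quantify: when every table is implicitly related to every other, the number of pairwise relationships grows quadratically with the number of tables.

```python
# Pairwise relationships among n mutually related tables: n * (n - 1) / 2
for n in (10, 200, 1000):
    print(n, "tables ->", n * (n - 1) // 2, "relationships")
# 10 tables -> 45 relationships
# 200 tables -> 19900 relationships
# 1000 tables -> 499500 relationships
```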

In essence, a geodatabase is mainly a storage platform for tables with a geographical attribute. From a BI standpoint it should be treated as a source system, albeit a strange one. As I mentioned earlier, a very specific property of the geodatabase is that GIS folk tend to add and remove tables at any given time. Because no table is explicitly related to anything, this poses no problem from a GIS standpoint. But it will certainly hurt your BI deliverables when a carefully crafted report that depends on geographical relations suddenly stops working because a table has been removed from the source system.

The hybrid database

So far we've discussed source systems that are either purely administrative or purely geographical in nature. There is a third flavour of source systems, built on a normalized 3NF data model and containing geographical attributes in the database tables. These hybrid databases are extremely rare. If you encounter one, you can safely use it for your BI solution, as it offers you the best of both worlds:
  • a tightly managed data model
  • integrated geographical attributes
Feel free to discuss, and if you have any questions please let me know.

dinsdag 5 januari 2016

The spatial reference in BI

In the previous post I wrote about the fact that geographical information can be of great business value. Currently, geography is mostly used in Business Intelligence solutions for visualisation purposes. In Business Intelligence we're always looking for relationships in data. With structured datasets based on RDBMS sources, the relationships between objects are managed through primary and foreign keys. In unstructured data, as Bill Inmon (http://www.b-eye-network.com/view/14120) explained, you have to depend on taxonomies.

But there's another option that might be applied to both structured and unstructured data. Geographical information contains a unique property making it possible to relate even the most seemingly unrelated datasets: the spatial one. Every object with a spatial location can be related to every other object that has one. These relationships can be queried with common SQL (though the approach is dependent on the database platform), calculated with spatially enabled ETL tools or just with use of a nice JavaScript or Python-library.

Let's have a look at some examples of spatial relationships.  

Example 1: Distance to a specific location

Imagine you have two source tables in your database: a table with home addresses and a table with pubs. At first glance, just looking at the tables doesn't tell you much. But using their respective locations, you can determine the proximity of each pub to each home address with a simple SQL statement. A spatial relationship like this is called a proximity relationship.

The map shows 3 pubs as green diamonds. Addresses are symbolized as blue dots.

Below is an example for PostgreSQL.
SELECT a.address
,      dist.name as pubname
,      dist.distance
FROM adressen a
LEFT JOIN (SELECT a.address, p.name, ST_Distance(p.geom, a.geom) as distance
           FROM adressen a
           LEFT JOIN pubs p ON ST_DWithin(a.geom, p.geom, 10000)
          ) as dist ON (a.address = dist.address)
WHERE dist.name = (SELECT p.name
                   FROM pubs p
                   ORDER BY p.geom <-> a.geom LIMIT 1)
ORDER BY a.address;
The query is designed to find the nearest pub for each address within 10km of the address. With the use of the subquery in the LEFT JOIN we determine the distance to each pub for each individual address, basically a cartesian join between addresses and pubs on the locational relationship. The subquery in the WHERE clause is optional to determine the closest pub for each address, again based on the spatial relationship.

The resulting table is shown below; you can clearly see the distance from each address to the nearest pub. In Oracle, this query would be somewhat simpler, but both queries yield the same result.
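The essence of the query, comparing every address with every pub and keeping the nearest, can be sketched in plain Python (planar distances and invented names, purely illustrative):

```python
import math

def distance(p, q):
    """Planar distance; a database would use ST_Distance on real geometries."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

addresses = {"Main St 1": (0, 0), "Church Rd 5": (9, 9)}
pubs = {"The Crown": (1, 1), "The Anchor": (8, 8)}

# Cartesian comparison of every address against every pub, keep the nearest
for addr, a_loc in addresses.items():
    pub, dist = min(((name, distance(a_loc, p_loc))
                     for name, p_loc in pubs.items()),
                    key=lambda t: t[1])
    print(addr, "->", pub, round(dist, 2))
# Main St 1 -> The Crown 1.41
# Church Rd 5 -> The Anchor 1.41
```

The SQL version adds one refinement: ST_DWithin limits the cartesian comparison to pubs within 10 km, so the database doesn't have to compute every distance.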

Example 2: Area of influence
In rural areas the government keeps record of all nuisance caused by farms. Around each farm, a stench contour defining the area where air quality is affected is registered as a polygon in a table. Another table containing house addresses can be queried to see which addresses lie within the contour.
The farm is shown in yellow, the area of influence with the red contour. The addresses are the house-shaped blueish polygons.

Here too are sample queries, first for Oracle, then for PostgreSQL.
-- Oracle
SELECT a.street
,      a.number
,      a.city
FROM address a
,    farm_contour f
WHERE SDO_RELATE(a.geometry, f.geometry, 'mask=inside') = 'TRUE';

-- PostgreSQL
SELECT a.street
,      a.number
,      a.city
FROM address a
,    farm_contour f
WHERE ST_Within(a.geometry, f.geometry);
These two queries already look more alike. Again, both return the same result: a list of all addresses within a stench contour. Addresses outside all contours are excluded by the spatial predicate in the WHERE clause (SDO_RELATE returns the string 'TRUE', while ST_Within returns a boolean). This second kind of spatial relationship is called an overlap; both full and partial overlaps can be determined, depending on the spatial operator used.

Example 3: Neighbors along ownership
My organization deals with a lot of land owners whose parcels neighbor our property. Very often, we need to figure out which parcels directly touch the parcels owned by a specific owner, in this case our organization.
Left the parcel owned, right the parcels that directly touch the left parcel.
-- Oracle
SELECT k.parcel_code
,      k.owner
,      o.parcel_code
FROM parcels k
,    parcels o
WHERE SDO_RELATE(k.geometry, o.geometry, 'mask=touch') = 'TRUE'
AND o.owner = '<our organization>';

-- PostgreSQL
SELECT k.parcel_code
,      k.owner
,      o.parcel_code
FROM parcels k
,    parcels o
WHERE ST_Touches(k.geometry, o.geometry)
AND o.owner = '<our organization>';
Notice that these queries only select parcels that share a common boundary. This is a third kind of spatial relationship, called adjacency.
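A minimal sketch of the adjacency test, simplified to axis-aligned rectangular parcels (real parcels are arbitrary polygons, which is exactly what SDO_RELATE and ST_Touches handle for us): two parcels touch when their boundaries meet but their interiors do not overlap.

```python
def rects_touch(a, b):
    """Adjacency test for axis-aligned rectangles (xmin, ymin, xmax, ymax):
    they touch when the closed rectangles intersect but the open interiors
    do not overlap."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    overlap = ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2  # interiors
    meet = ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2  # boundaries
    return meet and not overlap

parcels = {"A1": (0, 0, 2, 2), "B2": (2, 0, 4, 2), "C3": (5, 0, 7, 2)}
ours = parcels["A1"]
neighbors = [code for code, geom in parcels.items()
             if geom != ours and rects_touch(ours, geom)]
print(neighbors)  # ['B2']
```

The same predicate structure, 'boundaries intersect, interiors don't', is how the touch relationship is defined in the OGC simple features model.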

Implicit relationship
The cool thing about spatial relationships is that they are implicit. Each table can be maintained on its own, without the need to maintain a relation table or primary-foreign key constraint between tables. In the samples above, the addresses table and the pubs table can be stored and edited independently of each other. New addresses can be added to the address table without any specific requirement regarding the pub locations. The same goes for the addition of new houses or parcels. Once two tables are queried, the spatial relationship can be determined on the fly. This is possible because all spatial data shares a common property: it is all positioned somewhere on (or below) the earth's surface.

Self-intersecting
In the last query set, the parcels table is referenced twice in the FROM clause: in example 3, both the central parcel and the adjacent parcels are stored in the same parcels table, and the spatial predicate in the WHERE clause references the same geometry column twice. This behavior is typical of spatial data: objects from within the same table can be related to each other, effectively a spatial self-join.

Relevance for BI
These kinds of relationships are very useful in BI. With geographical information in your data sources, you can determine and use spatial relationships based on proximity, adjacency and overlap. The implicit relationship allows us to relate data from different sources that seem unrelated from a BI perspective. An example of such an implicit spatial relationship is the relationship between a store and a customer. Administratively, this relationship may be based on the fact that a customer bought something in the store, which will undoubtedly show up when examining the source systems. But the geographical locations of the store's and the customer's addresses are also implicitly related to one another. This can allow for additional insight into shopping behavior: maybe the relationships will reveal a pattern of customers shopping at particular stores in spite of other store locations being closer to their home.

The spatial relationships described in this post are present in both structured and unstructured data sources. In structured data, the spatial reference may, as a bonus, reveal relationships between your data sources you never knew existed. For unstructured data, the spatial reference is already proving to be the most promising way to relate all this data together. Virtually every post, tweet, picture or video already has a location associated with it. And for documents: if there's an address in it, you can very easily add location to the mix.

Any thoughts? Leave them below. Next time, we'll be taking a closer look at spatial information in 3NF source systems. Be sure to check back in!