dinsdag 5 januari 2016

The spatial reference in BI

In the previous post I wrote about the fact geographical information can be of great business value. Currently, in Business Intelligence solutions geography is mostly used for visualisation purposes. In Business Intelligence, we're always looking for the relationships in data. With structured datasets based on RDBMS sources, the relationships between objects are managed through primary and foreign keys. In unstructured data, as Bill Inmon (http://www.b-eye-network.com/view/14120) explained, you have to depend on taxonomies.

But there's another option that might be applied to both structured and unstructured data. Geographical information contains a unique property making it possible to relate even the most seemingly unrelated datasets: the spatial one. Every object with a spatial location can be related to every other object that has one. These relationships can be queried with common SQL (though the approach is dependent on the database platform), calculated with spatially enabled ETL tools or just with use of a nice JavaScript or Python-library.

Let's have a look at some examples of spatial relationships.  

Example 1: Distance to specific location Imagine you have two data source tables in your database: a table with home addresses and a table with pubs. At first glance, just looking at the tables doesn't tell you much. But with the use of their respective locations you can determine the proximity to each pub for each home address with a simple SQL statement. A spatial relationship like this is called a proximity relationship.

The map shows 3 pubs as green diamonds. Addresses are symbolized as blue dots.

Below is an example for PostgreSQL.
SELECT a.address
,      dist.name as pubname
,      dist.distance
FROM adressen a
LEFT JOIN (SELECT a.address, p.name, ST_Distance(p.geom, a.geom) as distance 
                FROM adressen a
                LEFT JOIN pubs p ON ST_DWithin(a.geom, p.geom, 10000)
                ) as dist
                on (a.address = dist.address)
WHERE a.address = dist.address 
AND dist.name = (SELECT p.name
               from pubs p
                order by p.geom <-> a.geom LIMIT 1)
order by a.address;
The query is designed to find the nearest pub for each address within 10km of the address. With the use of the subquery in the LEFT JOIN we determine the distance to each pub for each individual address, basically a cartesian join between addresses and pubs on the locational relationship. The subquery in the WHERE clause is optional to determine the closest pub for each address, again based on the spatial relationship.

The resulting table would be as followed. Clearly, you can see the distance from each address to the nearest pub. In Oracle, this query would be somewhat simpler, but both queries would yield the same result.

Example 2: Area of influence
In rural areas the government keeps record of all nuisance caused by farms. Around each farm, the stench contour defining the area where air quality is affected is registered as a polygon in a table. Another table containing house addresses can be queried to see which addresses are within the contour.
The farm is shown in yellow, the area of influence is shown with the red contour. The adresses are the house-shaped blue-ish polygons.

Also here are sample queries for Oracle and PostgreSQL.
SELECT a.street
,      a.number
,      a.city
FROM address a
,    farm_contour f
WHERE SDO_RELATE(a.geometry, f.geometry,'mask=inside') = 'TRUE';  
SELECT a.street
,      a.number
,      a.city
FROM address a
,    farm_contour f
WHERE ST_Within(a.geometry, f.geometry) = 'TRUE';
These queries already look somewhat more similar. Again, both queries will return the same results: a list of all addresses within a stench contour. Addresses that are outside all contours are excluded by the 'TRUE' condition in the spatial where clause. This second kind of spatial relationship is called an overlap. Both full and partial overlaps can be determined depending on the spatial operator used.

Example 3: neighbors along ownership
My organization has to deal with a lot of land owners that own parcels that are neighboring our property. Very often, we need to figure out which parcels directly touch parcels owned by a specific owner, in this case our organization
Left the parcel owned, right the parcels that directly touch the left parcel.
SELECT k.parcel_code
,      k.owner
,      o.parcel_code
FROM parcels k   --
,    parcels o
WHERE SDO_RELATE(k.geometry, o.geometry,'mask=touch') = 'TRUE'
AND o.owner = '<our organization>';

SELECT k.parcel_code
,      k.owner
,      o.parcel_code
FROM parcels k   --
,    parcels o
WHERE ST_Touches(k.geometry, o.geometry) = 'TRUE'
AND o.owner = '<our organization>';
We notice the queries only select the parcels that share a common boundary. This is a third kind of spatial relationship called adjacency.

Implicit relationship
The cool thing about spatial relationships is that they are implicit. Each table can be maintained on it's own without the need to maintain a relational table or primary-foreign key constraint between tables. In the above samples, the addresses table and pubs table can be stored and edited independent of each other. New addresses can be added to the address table without any specific requirement regarding the pubs locations. The same goes for the addition of new houses or parcels. Once two tables are queried, the spatial relationship can be determined on-the-fly. This is possible because all spatial data shares a common attribute: they are all positioned somewhere on (or below) the earth surface.

Self-intersecting
In the last query set we can notice the parcels table is referenced twice in the FROM-part of our query. In example 3 we see that both the central parcel and the adjacent parcels are stored in the same parcels table. The geometry relation in the WHERE clause of the query is referencing the same column (geom) twice. This behavior is exemplary for spatial data. Each spatial relationship is self-intersecting, meaning that objects from within the same table can be referenced to each other.

Relevance for BI
These kinds of relationships are very common in BI. With geographical information in your data sources you can determine and use spatial relationships based on proximity, adjacency and overlap. The implicit relationship allows us to relate data from different sources that are seemingly unrelated from a BI perspective. An example of an implicit and self-intersecting spatial relationship is the relationship between a store and a customer. This relationship can be administratively based on the fact that a customer bought something in the store, which will undoubtedly show up when examining source systems. But on the other hand the geographical location of the addresses of both the store and the customer are implicitly related to one another based on their respective locations. This can allow for additional insight in shopping behavior. Maybe the relationships will reveal a pattern of customers shopping at particular stores inspite of the fact other store locations are closer to their home.

The spatial relationships described in this post will be present in both structured and unstructured data sources. In structured data, the spatial reference may reveal new relationships between your data sources you never knew existed as a bonus. But for unstructured data, the spatial references already is proving to be the most promising way to relate all this data together. Virtually every post, tweet, picture or video already has location associated with it. And for documents, if it has an address in it, you can very easily add location to the mix.

Any thoughts? Leave them below. Next time, we'll be taking a closer look at spatial information in 3NF source systems. Be sure to check back in!