Big Query Demystified |what really happens when you unnest customDimentions in ga_sessions_* (hint data inaccuracy)

I wanted to share with a problem i found while working in BigQuery and google analytics 360 data: ga_sessions_*

The following query

select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h , h.customDimensions as hcd
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3

returns 1724266 distinct fullvisitorid rows.

when use the same query only  remove the “, h.customDimensions as hcd” from the FROM section :

select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3

i get 1754580 distinct fullvisitorid rows.

 

What is the reason unnset of a custom dimension causes this distinct values difference?

what happens when i do select * from a,b,c in bigQuery?

As you probably know when you add “, something” to the FROM clause you are doing a CROSS JOIN between the first element and the recently added. For example:

select * FROM table tablename as a, table tablename2 as b

This will perform a cross join between the tables tablename and tablename2.

Then, when you do a CROSS JOIN with null values, this rows will be suppressed.
Let me show you an example:

#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1,      3,      5,      7,      11,      13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[2] AS primes_array)

SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime

This query returns 7 rows, however the next one returns only 6:

#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1,      3,      5,      7,      11,      13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[ ] AS primes_array)

SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime

As you can see the only difference is in the line “[2] AS primes_array” which changes to a null value, “[ ] AS primes_array”. Then, this data is discarded with the CROSS JOIN, so when you do the query make sense that you see a difference in the results if there’s any NULL value within that “customDimensions”.

 


Credit for finding the problem: Evyatar Aviram

Contact me:

Want more big data quality content? Join our meetup, subscribe to youtube channels

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s