I finished the 1st course and onto the 2nd course, Data to Insights. Haven’t really learned anything that new that I haven’t done before in some other program like IBM SPSS Modeler. The course is interesting especially when the data set is using the google ecommerce data. Seeing how to apply it to an ecommerce data set is what I’ve found the most interesting.
Introduced to the term ‘Data deduplication’. After watching more videos in the course I did do this but never knew about the term. Easiest way of getting rid of duplicated is using the GROUP BY in the SQL. Did not know that there was a term for this.
Some new SQL functions I’ve been introduced to are:
- WITH
- RANK
- PARTITION BY
- _TABLE_SUFFIX
I can take advantage of WITH and decrease the complexity of certain queries I have. Looking forward to introducing to a co worker who writes reports where the SQL query is over 200 lines.
The one concept that I found new was the data deduplication. I haven’t really had a use case for it since the data sources I work with luckily haven’t had that much of duplicate repeating data. Learning about data deduplication should come in handy in the future when the situation arises.
WITH product_query AS (
SELECT
DISTINCT
v2ProductName,
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE v2ProductName IS NOT NULL
)
SELECT k.* FROM (
# aggregate the products into an array and
# only take 1 result
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM product_query x
GROUP BY productSKU # this is the field we want deduplicated
);