Coursera – Data to Insights – Data Deduplicating

I finished the 1st course and onto the 2nd course, Data to Insights. Haven’t really learned anything that new that I haven’t done before in some other program like IBM SPSS Modeler. The course is interesting especially when the data set is using the google ecommerce data. Seeing how to apply it to an ecommerce data set is what I’ve found the most interesting.

Introduced to the term ‘Data deduplication’. After watching more videos in the course I did do this but never knew about the term. Easiest way of getting rid of duplicated is using the GROUP BY in the SQL. Did not know that there was a term for this.

Some new SQL functions I’ve been introduced to are:

WITH
RANK
PARTITION BY
_TABLE_SUFFIX

I can take advantage of WITH and decrease the complexity of certain queries I have. Looking forward to introducing to a co worker who writes reports where the SQL query is over 200 lines.

The one concept that I found new was the data deduplication. I haven’t really had a use case for it since the data sources I work with luckily haven’t had that much of duplicate repeating data. Learning about data deduplication should come in handy in the future when the situation arises.

WITH product_query AS (
  SELECT 
  DISTINCT 
  v2ProductName,
  productSKU
  FROM `data-to-insights.ecommerce.all_sessions_raw` 
  WHERE v2ProductName IS NOT NULL 
)

SELECT k.* FROM (

  # aggregate the products into an array and 
  # only take 1 result
  SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
  FROM product_query x 
  GROUP BY productSKU # this is the field we want deduplicated
);

This entry was posted in Coding and tagged BigQuery, coursera, SQL. Bookmark the permalink.

Coursera – Data to Insights – Data Deduplicating

Leave a comment Cancel reply

Archives

Meta

Categories

Top Clicks

Blog Stats

Coursera – Data to Insights – Data Deduplicating

Rate this:

Share this:

Related

Leave a comment Cancel reply

Archives

Meta

Categories

Top Clicks