Coursera – Data to Insights – Data Deduplicating

I finished the 1st course and onto the 2nd course, Data to Insights.  Haven’t really learned anything that new that I haven’t done before in some other program like IBM SPSS Modeler.  The course is interesting especially when the data set is using the google ecommerce data.  Seeing how to apply it to an ecommerce data set is what I’ve found the most interesting.

Introduced to the term ‘Data deduplication’.  After watching more videos in the course I did do this but never knew about the term.  Easiest way of getting rid of duplicated is using the GROUP BY in the SQL.  Did not know that there was a term for this.

Some new SQL functions I’ve been introduced to are:

  • WITH
  • RANK

I can take advantage of WITH and decrease the complexity of certain queries I have.  Looking forward to introducing to a co worker who writes reports where the SQL query is over 200 lines.

The one concept that I found new was the data deduplication.  I haven’t really had a use case for it since the data sources I work with luckily haven’t had that much of duplicate repeating data.  Learning about data deduplication should come in handy in the future when the situation arises.

WITH product_query AS (
  FROM `data-to-insights.ecommerce.all_sessions_raw` 
  WHERE v2ProductName IS NOT NULL 


  # aggregate the products into an array and 
  # only take 1 result
  FROM product_query x 
  GROUP BY productSKU # this is the field we want deduplicated
This entry was posted in Coding and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s