Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Group by using binary columns in Snowflake

Getting an error of duplicate data when trying to perform a merge that uses a column of binary format

MERGE INTO  TARGET_TABLE tgt               
USING (         
      SELECT src.*
        FROM SRC_TABLE src
   ) src ON (NVL(tgt.brand,'N') = NVL(src.brand,'N')
        AND tgt.host = src.host 
        AND tgt.binary_col = src.binary_col )
  WHEN MATCHED AND (src.TOTAL_DISTINCT != tgt.TOTAL_DISTINCT 
OR src.TOTAL_DISTINCT_2 != tgt.TOTAL_DISTINCT_) THEN
      UPDATE
        SET tgt.TOTAL_DISTINCT=src.TOTAL_DISTINCT,
            tgt.TOTAL_DISTINCT_2=src.TOTAL_DISTINCT_2
  WHEN NOT MATCHED THEN
    INSERT(BRAND,   HOST,binary_col,TOTAL_DISTINCT,TOTAL_DISTINCT_2) 
          values(src.BRAND,src.HOST, 
  src.binary_col,src.TOTAL_DISTINCT,src.TOTAL_DISTINCT_2);

When I run the below code, I do get a count = 2 for a couple of my records.

SELECT nvl(src.brand,'N'), src.host, TO_CHAR(src.binary_col, 'HEX'), count(*)
  FROM src_table src
GROUP BY nvl(src.brand,'N'), src.host, TO_CHAR(src.binary_col, 'HEX')
HAVING count(*)> 1;

However when I try to find those individual records with below query, the query only returns 1 record for the ones that the group by returned 2.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

SELECT src.*
FROM src_table src
WHERE 
 brand is null
 and host in ('www.myhost.com')
 and src.binary_col in ('6D5F2AC3300942F6A4CE066F056CAF0B'::BINARY)

Is there a workaround when using binary columns in merge/group by statements?

>Solution :

A better way to to quickly identify "duplicated" rows is usage of QUALIFY and windowed COUNT:

SELECT COALESCE(src.brand,'N'), src.host, TO_CHAR(src.binary_col, 'HEX'), *
FROM src_table src
QUALIFY COUNT(*) OVER (PARITITON BY COALESCE(src.brand,'N'), src.host,
                                    TO_CHAR(src.binary_col, 'HEX')) > 1;

This approach allows to quickly identify duplicates rows and do not collapse them like GROUP BY/HAVING approach.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading