Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to not return duplicates when comparing records in same table (A:B and B:A)

I have been stuck with this problem for a while now and can’t resolve it, would greatly appreciate some guidance

I am comparing records in a persons table to see if they’re possibly the same. To do this I am using a with statement to take the values I need and looking for matches

CREATE TABLE persons (
  serialno VARCHAR(20) NOT NULL,
  given VARCHAR(30) NOT NULL,
  family  VARCHAR(30) NOT NULL,
  dob  DATE NOT NULL,
  gender VARCHAR2(20 BYTE),
  address VARCHAR2(64 BYTE)
);

INSERT ALL 
    INTO persons ( serialno, given, family,dob,gender,address ) VALUES ( '001', 'Mick', 'Dundon','01/01/1970','Male','Main St' )
    INTO persons ( serialno, given, family, dob,gender,address) VALUES ( '002', 'Mick', 'Dundon','01/01/1970', 'Male','Montague St' )
    INTO persons ( serialno, given, family,dob,gender,address ) VALUES ( '003', 'Dave', 'Doyle', '13/10/1981','Male', 'Rathmines')
    INTO persons ( serialno, given, family,dob,gender,address ) VALUES ( '004', 'Jim', 'Morrison', '15/08/1956','Male','Newtown')
    INTO persons ( serialno, given, family, dob,gender,address) VALUES ( '005', 'Sam', 'Wise', '12/12/1992','Male','High St')
SELECT 1 FROM dual;

with rec as
(select serialno,given,family,dob,gender,address 
from persons)
select * 
from rec r1
join rec r2
on r1.given = r2.given
and r1.family = r2.family
and r1.gender = r2.gender
and r1.serialno <> r2.serialno

the code works fine except I end up with duplicates as the R1 record will appear further down in the output as R2, and vice versa.
Is there a simple way I can avoid this kind of duplication?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You can get all the duplicates without a self-join by using the analytic COUNT function:

SELECT serialno, given, family, dob, gender, address
FROM   (
  SELECT serialno, given, family, dob, gender, address,
         COUNT(*)  OVER (PARTITION BY given, family, gender) AS num_matches
  FROM   persons
)
WHERE num_matches > 1;

If you also want to compare the values to the row with the same given/family/gender combination and the minimum serial number then, again you can avoid a self-join by using analytic functions:

SELECT serialno, given, family, dob, gender, address,
       min_serialno, min_dob, min_address
FROM   (
  SELECT serialno,
         given,
         family,
         dob,
         gender,
         address,
         MIN(serialno) OVER (PARTITION BY given, family, gender) AS min_serialno,
         MIN(dob) KEEP (DENSE_RANK FIRST ORDER BY serialno)
           OVER (PARTITION BY given, family, gender) AS min_dob,
         MIN(address) KEEP (DENSE_RANK FIRST ORDER BY serialno)
           OVER (PARTITION BY given, family, gender) AS min_address
  FROM   persons
)
WHERE serialno > min_serialno;

If, in Oracle, you want to get all possible combinations then you can avoid a self-join by using a hierarchical query:

SELECT serialno, given, family, dob, gender, address,
       PRIOR serialno AS p_serialno,
       PRIOR dob      AS p_dob,
       PRIOR address  AS p_address
FROM   persons
WHERE  LEVEL = 2
CONNECT BY
       PRIOR gender   = gender
AND    PRIOR given    = given
AND    PRIOR family   = family
AND    PRIOR serialno < serialno

db<>fiddle here

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading