I want to compare and get number of corresposndeces between two arrays in PostgresSQL.
For example:
given the table
| id | array |
|---|---|
| 1 | a, b, d |
| 2 | a, e |
| 3 | a, b, d, e |
| 4 | b, c, d |
I would like to obtain a comparison table:
| id_1 | id_2 | number_of_matches |
|---|---|---|
| 1 | 2 | 1 |
| 1 | 3 | 3 |
| 1 | 4 | 2 |
| 2 | 3 | 2 |
| 2 | 4 | 0 |
| 3 | 4 | 2 |
There is no repeated values in each array, and the order of items inside arrays does not matter.
I found here about the overlap operator (&&) but it only returns a boolean if there is or not any correspondence between two arrays, but not the number of correspondences.
CREATE TABLE data (id integer, val text[]);
INSERT INTO data VALUES (1, '{a,b,d}');
INSERT INTO data VALUES (2, '{a,e}');
INSERT INTO data VALUES (3, '{a,b,d,e}');
INSERT INTO data VALUES (4, '{b,c,d}');
SELECT a.id id_1, b.id id_2, a.val && b.val match
FROM data a, data b
WHERE a.id < b.id;
returns
| id_1 | id_2 | match |
|---|---|---|
| 1 | 2 | t |
| 1 | 3 | t |
| 1 | 4 | t |
| 2 | 3 | t |
| 2 | 4 | f |
| 3 | 4 | t |
>Solution :
WITH normalized AS ( -- 2
SELECT
id,
unnest(val) as array_element -- 1
FROM data
)
SELECT
n1.id,
n2.id,
COUNT(*)
FROM normalized n1
JOIN normalized n2 -- 3
ON (n1.array_element = n2.array_element AND n1.id <> n2.id)
GROUP BY n1.id, n2.id -- 4
ORDER BY n1.id, n2.id
- Normalize your data (note: You should think about normalizing your data directly in the table; means: no arrays) by creating one row per array element
- Use a CTE to "store" your normalized result
- Self-join your normalized result (but not the same
ids) at thearray_element GROUPandCOUNT