Describe null handling in anti join #3745

Merged
merged 2 commits into from
Sep 29, 2024

Conversation

obacht17
Contributor

@obacht17 obacht17 commented Sep 28, 2024

Current state

The documentation says that an ANTI JOIN and WHERE NOT IN (...) are equivalent. This is not true when the right table of the ANTI JOIN (or the subquery of the NOT IN (...)) contains NULL values.

Detailed demonstration

create table t1 (id1 integer);
create table t2 (id1 integer);
insert into t1 select id1 from generate_series(1, 1000) tmp1(id1);
insert into t2 select id1 from generate_series(1, 100) tmp1(id1);
insert into t2 values(null);  -- a null value on the right side is the key here
select count(*) as q1 from t1 where id1 not in (select id1 from t2);  
--> q1: 0
select count(*) as q2 from t1 anti join t2 using (id1);
--> q2: 900
--> i.e. WHERE id1 NOT IN (...) is not equivalent to ANTI JOIN
select count(*) as q3 from t1 where id1 not in (select id1 from t2 where id1 is not null);
--> q3: 900
--> i.e. excluding NULLs from the subquery makes NOT IN agree with the anti join, because NULLs on the right side never match in the anti join.
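
The q1 result follows from SQL's three-valued logic: `x NOT IN (a, b, NULL)` expands to `x <> a AND x <> b AND x <> NULL`, and the last comparison is never true. The following is a minimal sketch with literal values (the constants and the column aliases r1 to r3 are illustrative, not part of the original demonstration):

```sql
-- A NULL in the list makes NOT IN evaluate to NULL (unknown) instead of true,
-- and WHERE keeps only rows whose predicate is true.
SELECT 101 NOT IN (1, 2, NULL) AS r1;  -- NULL: 101 <> NULL is unknown
SELECT 1   NOT IN (1, 2, NULL) AS r2;  -- false: 1 is found in the list
SELECT 101 NOT IN (1, 2)       AS r3;  -- true: no NULL, no match
```

Since every row of t1 yields either false or NULL in q1, no row passes the filter.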

Verification using Postgres

To make sure I had not run into a bug in DuckDB, I ran q1 and q3 in Postgres (online tool, PostgreSQL 14.11). The results were the same as in DuckDB 1.1.1. Please note that Postgres parses ANTI JOIN syntax in q2, but does definitely not apply an ANTI JOIN, but an INNER JOIN.
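
Because q2 cannot be checked as written in Postgres, one portable way to cross-check the anti join result is a NOT EXISTS subquery, which uses plain standard SQL. This is a sketch against the tables from the demonstration above; the query label q4 is made up here:

```sql
-- NOT EXISTS only asks whether a matching row exists. The NULL row in t2
-- never satisfies t2.id1 = t1.id1, so it is simply ignored.
select count(*) as q4
from t1
where not exists (select 1 from t2 where t2.id1 = t1.id1);
--> q4: 900
```

With the demonstration data this counts the 900 rows of t1 that have no partner in t2, the same as q2 and q3.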

WHERE id1 NOT IN (SELECT id1 FROM rhs) does not give the same result as ANTI JOIN rhs USING (id1) in cases where rhs contains NULL values in id1. Expand the description (analogy between NOT IN and ANTI JOIN) and example code.
Collaborator

szarnyasg commented Sep 29, 2024

Hi @obacht17, thanks for reporting this and submitting a PR with the clarification. This is indeed an important thing to mention.

Please note that Postgres parses ANTI JOIN syntax in q2, but does definitely not apply an ANTI JOIN, but an INNER JOIN.

Postgres interprets the string anti as an alias for t1, i.e. these two are equivalent:

select count(*) as q2 from t1    anti join t2 using (id1);
select count(*) as q2 from t1 AS anti join t2 using (id1);
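
The alias reading can be checked in Postgres by qualifying a column with `anti`. The sketch below assumes the t1 and t2 tables from the PR description; the label q2_alias is made up for illustration:

```sql
-- Postgres only: "anti" is an ordinary table alias for t1, so it can qualify a column.
select count(anti.id1) as q2_alias from t1 anti join t2 using (id1);
-- The same query with the alias and join type spelled out:
select count(anti.id1) as q2_alias from t1 as anti inner join t2 using (id1);
```

With the demonstration data, both forms should count the 100 ids that t1 and t2 share, since the NULL row in t2 does not match in an inner join either.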

@szarnyasg szarnyasg merged commit c5d7412 into duckdb:main Sep 29, 2024
4 checks passed
2 participants