Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gh 5121 fedx bind left join #5122

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

aschwarte10
Copy link
Contributor

@aschwarte10 aschwarte10 commented Sep 6, 2024

Note: The PR is still WIP (some further changes, and especially unit test coverage will follow) however it is ready for early review. The goal is to have this available in RDF4J 5.1

GitHub issue resolved: #5121

Briefly describe the changes proposed in this PR:

The PR consists of multiple commits which add an implementation for bind left joins. It is recommended to follow the change commit by commit (as the first ones are preparational refactoring)

Commit 1: GH-5121: refactor the bind join logic into a reusable base class

Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.

For backwards code compatibility the "ControlledWorkerBoundJoin" is
kept, but no longer used. Instead the new code is in
ControlledWorkerBindJoin.

Commit 2: GH-5121: prepare execution of left joins in the federation strategy

Prepare to execute a specific implementation of a left join
implementation through the federation strategy.

Commit 3: GH-5121: implementation of left bind join operator

This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

  • execute left bind join using regular bound join query
  • process result iteration similar to BoundJoinVALUESConversionIteration
  • remember seen set of bindings (using index) and add original bindings
    to those, i.e. put to result return all non-seen bindings directly from
    the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with a unit tests

Commit 4: GH-5121: configurability of bind left joins

Bind left joins for OPTIONAL can be disabled using the
"enableOptionalAsBindJoin" flag in the federation config

Integrate the switch between implementations in the unit test as
parameterized test

Commit 5: GH-5121: code simplifications in bind join implementation

  • use for-each loop for iterating bindingset
  • use IntHashSet
  • use Literal#intValue instead of Integer#parseInt

TODOs

  • further unit test coverage
  • validate check left bind join pattern
  • configurability to turn bind left joins off
  • ...

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@aschwarte10 aschwarte10 added the 📦 fedx fedx: optimized federated query support label Sep 6, 2024
Comment on lines 26 to 38
/**
* A {@link LookAheadIteration} for processing bind left join results.
*
* Algorithm:
*
* <ul>
* <li>execute left bind join using regular bound join query</li>
* <li>process result iteration similar to {@link BoundJoinVALUESConversionIteration}</li>
* <li>remember seen set of bindings (using index) and add original bindings to those, i.e. put to result return all
* non-seen bindings directly from the input</li>
*
*
* @author Andreas Schwarte
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that this is used for OPTIONAL clauses. Is it used for anything else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in the docs. This is only for OPTIONALs

protected BindingSet convert(BindingSet bIn, int bIndex) throws QueryEvaluationException {
QueryBindingSet res = new QueryBindingSet();
Iterator<Binding> bIter = bIn.iterator();
while (bIter.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a for-each look?

for (Binding bs : bIn){
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, can be simplified like that. I think when initially writing the bind join logic it was not possible

Comment on lines 60 to 61
int bIndex = Integer.parseInt(
bIn.getBinding(BoundJoinVALUESConversionIteration.INDEX_BINDING_NAME).getValue().stringValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the binding Value object ever a literal? So it would be possible to check if it's instanceof IntegerLiteral or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is always an integer literal. is passed as that in the VALUES clause. Have optimized it


protected final ControlledWorkerScheduler<BindingSet> scheduler;

protected final Phaser phaser = new Phaser(1);
Copy link
Contributor

@hmottestad hmottestad Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen in other code that we call phaser.forceTermination() in the handleClose method. Would that be wise here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good observation, seems present in the code, see line 125 (not sure if I added in the force-push, but likely it was already present before)

protected final CloseableIteration<BindingSet> iter;
protected final List<BindingSet> bindings;

protected Set<Integer> seenBindingIndexes = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the IntHashSet be appropriate here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the impact of performance is, but it does the job. I guess this datastructure (just looking at its name is more optimized and hence more appropriate)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to wrap the integer in an object (autoboxing?) so it uses a lot less memory and is overall a lot faster. Mostly the memory pressure that would be beneficial here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it should be accessible from your code, as in the dependency is transitively included in the fedx maven sub-project.

@aschwarte10
Copy link
Contributor Author

@hmottestad thanks for the feedback. I will get back to this PR soon (we are currently in the release finalization phase of our product and this requires most of my attention).

Quick question: is there a timeline for the 5.1 release? I would like to see this change included (if I get it ready) such that we can make use of it in our next release. Ideally the 5.1 release would be available in the second half of november. What are the current project plans?

@hmottestad
Copy link
Contributor

We don't have any plans at the moment.

I'm currently working on the last SHACL paths that aren't supported. I also have a PR for configuring the Apache HttpClient timeouts. And there is a request for adding support for configuring the document loader in the new JSON-LD 1.1 parser.

Would really want the timeouts included, and if I start on the JSON-LD document loader then it shouldn't take more than a few weeks to get it merged.

So we can say mid november for RDF4J 5.1.

@aschwarte10 aschwarte10 force-pushed the GH-5121-fedx-bind-left-join branch 2 times, most recently from c058da3 to 40b03a5 Compare September 29, 2024 09:52
Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the new bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.

For backwards code compatibility the "ControlledWorkerBoundJoin" is
kept, but no longer used. Instead the new code is in
ControlledWorkerBindJoin.
Prepare to execute a specific implementation of a left join
implementation through the federation strategy.
This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

- execute left bind join using regular bound join query
- process result iteration similar to BoundJoinVALUESConversionIteration
- remember seen set of bindings (using index) and add original bindings
to those, i.e. put to result return all non-seen bindings directly from
the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with some unit tests
Bind left joins for OPTIONAL can be disabled using the
"enableOptionalAsBindJoin" flag in the federation config

Integrate the switch between implementations in the unit test as
parameterized test
- use for-each loop for iterating bindingset
- use IntHashSet
- use Literal#intValue instead of Integer#parseInt
@aschwarte10
Copy link
Contributor Author

The PR should not be fully ready for review. I have addressed my remaining TODOS and the feedback from @hmottestad (Thanks for that btw).

Can I ask for reviews on this?

Note: when this is merged, I will create some additional improvements as part of orthogonal improvement issues (still to be created)

-So we can say mid november for RDF4J 5.1.

That sounds really great. Let's aim for that. Might be helpful for us in our product to test the 5.1 branch earlier in our product (to be able to react when there are still issues encountered during testing / using). Note: we cannot really use RDF4J snapshot versions. Would it be poissible to create a milestone build some time end of October for acceptance testing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 fedx fedx: optimized federated query support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants