Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate COSINE VectorSimilarity function #13308

Closed
wants to merge 45 commits into from

Conversation

Pulkitg64
Copy link
Contributor

Description

Tries to solve #13281 by deprecating COSINE VectorSimilarity function for Vector Search.

@Pulkitg64
Copy link
Contributor Author

Pulkitg64 commented Apr 15, 2024

TestInt8HnswBackwardsCompatibility.java and TestBasicBackwardsCompatibility test cases failing for Lucene versions older than LUCENE_10_0_0. For older Lucene version, we try to read Index directory from resources folder. For example for Lucene-9.10.0, the directory path is located at lucene/backward-codecs/src/test/org/apache/lucene/backward_index/index.9.10.0-cfs.zip.

As per my understanding the segment present at above directory already has KNNVectorField with MAXIMUM_INNER_PRODUCT as vector similarity function and in above test we try to change similarity function to DOT_PRODUCT due to which there is conflict happening when adding more docs to it and hence these test cases failing.

@jpountz
Copy link
Contributor

jpountz commented Apr 17, 2024

Oh okay, that means we need to remove support for COSINE just from indexing side not from searching side?

Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes.

It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users.

@Pulkitg64
Copy link
Contributor Author

Thanks @jpountz for confirming and suggestions.

@@ -302,7 +301,6 @@ private static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) th
List.of(
VectorSimilarityFunction.EUCLIDEAN,
VectorSimilarityFunction.DOT_PRODUCT,
VectorSimilarityFunction.COSINE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is exactly the problem. There must be a null entry, so the ordinals in index format stay alive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ The reason for the list here is exactly to support and allow such evolution and separation from the enum ordinals. A null here trivially need to a change of List implementation type. And tests need to be carefully updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, null does not work with default lists :-) Also downstream the code needs to check for a null value.

In any case we can later also go back to a Map<Integer,VectorSimilarity> or to an array.

Copy link
Contributor Author

@Pulkitg64 Pulkitg64 Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @uschindler @ChrisHegarty for the pointers, but IIUC, we cannot add a null entry to VectorSimilarityFunction map for COSINE in these files because we still want to use COSINE similarity function for Lucene versions < 10.0, right? Instead we need to create a separate file for new versions. Please correct me if I am wrong here.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a larger review.

I think it would be prudent to first attempt to deprecate and handle internal usage, etc. After that is merged and backported, another PR can be made to fully remove it from main only on writes.

At least, that would make reviewing & testing this simpler :)

@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB
private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major;
private static final String KNN_VECTOR_FIELD = "knn_field";
private static final FieldType KNN_VECTOR_FIELD_TYPE =
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call addDoc), meaning you will be trying to index a dot_product knn field into an index with a cosine field.

You will have to handle the index versions from before a certain version (thus on cosine) to ones created after (and thus dot-product).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense.

private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f};
private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same sort of comment as above on versioning and that we index fields into an old index.

@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
}
},

/**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to deprecate this first, get all the internal paths updated to not use it.

Then the deprecation can be merged and backported. Then work on removing it from main.

Mostly because the deprecation path and internal code usage will all have to occur anyways :)

@@ -30,23 +28,6 @@
public class ScalarQuantizedRandomVectorScorer
extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> {

public static float quantizeQuery(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine.

return (1 + cosine(v1, v2)) / 2;
}
},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, etc. get everything passing, etc.), field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version".

Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish).

@Pulkitg64
Copy link
Contributor Author

Thanks @benwtrent for all the feedback.
I have raised another revision just to mark the COSINE function as deprecated. I didn't find any internal usages of this function.
In a followup PR, I will remove COSINE from writes for LUCENE version >=10.0.0

@Pulkitg64 Pulkitg64 requested a review from benwtrent May 7, 2024 17:26
@benwtrent
Copy link
Member

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

@Pulkitg64
Copy link
Contributor Author

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Sorry I think I am missing something, I think we cannot remove COSINE from those places, since we require them in LUCENE version < 10.0. They are mostly being used in FieldInfosFormat, VectorReader, ScalarQuantizedVectorWriter classes.

I think the main task is to first get rid of dependency on enum for VectorSimilarityFunction and instead have some interface to get those functions and support.

@benwtrent
Copy link
Member

@Pulkitg64 could you deprecate also VectorUtil.cosine? That is technically a public API.

Comment on lines 105 to 106
* GITHUB#13281: Mark COSINE VectorSimilarityFunction as deprecated. (Pulkit Gupta)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be under 9.11 as we will backport the deprecation as a way to warn users that it is going away in 10.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Make sense. Will change in next revision
Thanks for mentioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though you it would be good get another reviewers approval too.

@Pulkitg64 Pulkitg64 requested a review from benwtrent May 16, 2024 16:14
@Pulkitg64
Copy link
Contributor Author

Hi @benwtrent, could we merge this change, if everything looks good to you?

Pulkitg64 and others added 9 commits May 28, 2024 19:35
…ntries; don't even consider "forbidden" entries anymore (apache#13429)

Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore
Pull request apache#13406 inadvertly broke Lucene's handling of tragic exceptions by
stopping after the first `DocValuesProducer` whose `close()` calls throws an
exception, instead of keeping calling `close()` on further producers in the
list.

This moves back to the previous behavior.

Closes apache#13434
…e vector values is empty (apache#13444)

This commit ensures that SimpleText[Float|Byte]VectorValues::scorer returns null when the vector values is empty, as per the scorer javadoc. Other KnnVectorsReader implementations have specialised empty implementations that do similar, e.g. OffHeapFloatVectorValues.EmptyOffHeapVectorValues. The VectorScorer interface in new in Lucene 9.11, see apache#13181

An existing test randomly hits this, but a new test has been added that exercises this code path consistently. It's also useful to verify other KnnVectorsReader implementations.
… format (apache#13445)

When int4 scalar quantization was merged, it added a new way to dynamically calculate quantiles.

However, when that was merged, I inadvertently changed the default behavior, where a null confidenceInterval would actually calculate the dynamic quantiles instead of doing the previous auto-setting to 1 - 1/(dim + 1).

This commit formalizes the dynamic quantile calculate through setting the confidenceInterval to 0, and preserves the previous behavior for null confidenceIntervals so that users upgrading will not see different quantiles than they would expect.
…ern. (apache#13450)

This applies to files where performing readahead could help:
 - Doc values data (`.dvd`)
 - Norms data (`.nvd`)
 - Docs and freqs in postings lists (`.doc`)
 - Points data (`.kdd`)

Other files (KNN vectors, stored fields, term vectors) keep using a `RANDOM`
advice.
This follows a similar approach as postings and only prefetches the first page
of data.

I verified that it works well for collectors such as `TopFieldCollector`, as
`IndexSearcher` first pulls a `LeafCollector`, then a `BulkScorer` and only
then starts feeding the `BulkScorer` into the `LeafCollector`. So the
background I/O for the `LeafCollector` which will prefetch the first page of
doc values and the background I/O for the `BulkScorer` will run in parallel.
@benwtrent
Copy link
Member

@Pulkitg64 sorry for radio silence, I was out of office and swamped with other things. If you could move the change log to 9.12, I can merge and backport.

bugmakerrrrrr and others added 25 commits June 5, 2024 15:02
…che#13322)

* implement Weight#count for vector values

* add change log

* apply review comment

* apply review comment

* changelog

* remove null check
This commit updates the Gradle wrapper to 8.8, which has support for Java 22.
If Caller requires Weight then they have to keep track of Weight with which Scorer was created in the first place instead of relying on Scorer.

Closes apache#13410
…e#13460)

Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap saving memory when 
writing big indices. Fixes an exotic bug when calling clear where not all attributes were cleared.
MultiTermQuery return null for ScoreSupplier if there are no terms in an index that
 match query terms.

With the introduction of PR apache#12156 we saw degradation in performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which
would shortcut the whole bool query.

This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field
doesn't contain any query terms.

Relates to PR apache#12156
We consume a lot of memory for the `indexIn` slices. If `indexIn` is of
type `MemorySegmentIndexInput` the overhead of keeping loads of slices
around just for cloning is far higher than the extra 12b per reader this
adds (the slice description alone often costs a lot).
In a number of Elasticsearch example uses with high segment counts I
investigated, this change would save up to O(GB) of heap.
also make links to CONTRIBUTING.md more prominent, and demote link to dev-docs
@Pulkitg64
Copy link
Contributor Author

I somehow messed up with my branch while pulling changes from main branch. @benwtrent I have created a new PR for marking the cosine VectorSImilarity function as deprecated.

#13473

Sorry for the mess!

@Pulkitg64 Pulkitg64 closed this Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.