feat: Updating docs to include model inference guidelines (#4416)
Signed-off-by: Francisco Javier Arceo <[email protected]>
franciscojavierarceo committed Aug 16, 2024
1 parent 0baeeb5 commit cebbe04
Showing 5 changed files with 98 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -23,6 +23,7 @@
* [Push vs Pull Model](getting-started/architecture/push-vs-pull-model.md)
* [Write Patterns](getting-started/architecture/write-patterns.md)
* [Feature Transformation](getting-started/architecture/feature-transformation.md)
* [Feature Serving and Model Inference](getting-started/architecture/model-inference.md)
* [Components](getting-started/components/README.md)
* [Overview](getting-started/components/overview.md)
* [Registry](getting-started/components/registry.md)
4 changes: 4 additions & 0 deletions docs/getting-started/architecture/README.md
@@ -19,3 +19,7 @@
{% content-ref url="feature-transformation.md" %}
[feature-transformation.md](feature-transformation.md)
{% endcontent-ref %}

{% content-ref url="model-inference.md" %}
[model-inference.md](model-inference.md)
{% endcontent-ref %}
1 change: 1 addition & 0 deletions docs/getting-started/architecture/feature-transformation.md
@@ -3,6 +3,7 @@
A *feature transformation* is a function that takes some set of input data and
returns some set of output data. Feature transformations can happen on either raw data or derived data.

## Feature Transformation Engines
Feature transformations can be executed by three types of "transformation engines":

1. The Feast Feature Server
88 changes: 88 additions & 0 deletions docs/getting-started/architecture/model-inference.md
@@ -0,0 +1,88 @@
# Feature Serving and Model Inference

Production machine learning systems can choose from four approaches to serving machine learning predictions (the output
of model inference):
1. Online model inference with online features
2. Precomputed (batch) model predictions without online features
3. Online model inference with online features and cached predictions
4. Online model inference without features

*Note: online features can be sourced from batch, streaming, or request data sources.*

These four approaches have different tradeoffs and, in general, differ significantly in how they are implemented.

## 1. Online Model Inference with Online Features
Online model inference with online features is a powerful approach to serving data-driven machine learning applications.
This requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe).
This approach is particularly useful for applications where request-time data is required to run inference.
```python
features = store.get_online_features(
    feature_refs=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
    ],
    entity_rows=[{"user_id": 1}],
)
model_predictions = model_server.predict(features)
```
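In the snippet above, `store` is a Feast `FeatureStore` and `model_server` stands in for whatever client your
model-serving platform provides. Below is a minimal setup sketch: only `FeatureStore` is part of Feast, while the
`ModelServer` wrapper, its endpoint URL, and the repository path are hypothetical placeholders shown in the style of a
KServe-like HTTP prediction endpoint.
```python
import requests

from feast import FeatureStore

# Feast feature store, initialized from a local feature repository (path is an assumption).
store = FeatureStore(repo_path=".")


class ModelServer:
    """Hypothetical client for a model-serving endpoint (e.g., a KServe InferenceService)."""

    def __init__(self, url: str):
        self.url = url

    def predict(self, features):
        # Send the retrieved online features to the model endpoint and return its predictions.
        payload = {"instances": [features.to_dict()]}
        return requests.post(self.url, json=payload, timeout=5).json()


model_server = ModelServer(url="http://my-model-service/v1/models/ctr_model:predict")
```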

## 2. Precomputed (Batch) Model Predictions without Online Features
Typically, machine learning teams find serving precomputed model predictions to be the most straightforward approach to
implement. It simply treats the model predictions as another feature and serves them from the feature store using the
standard Feast SDK.
```python
model_predictions = store.get_online_features(
    feature_refs=[
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
```
Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and
materialized to the online store.

While this approach can lead to quick impact for different business use cases, it suffers from stale data and can only
serve users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be
tolerable.
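One way to land those precomputed predictions in the online store is a batch scoring job that writes the scores into
the same feature view the client reads from. A rough sketch follows, assuming a `user_data` feature view with a
`model_predictions` field; the entity IDs and scores are illustrative, and predictions could alternatively be written
to the feature view's batch source and brought online with `store.materialize`.
```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Illustrative batch scoring output; in practice this would come from your trained
# model scoring the full entity population in the offline store.
scored_users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "model_predictions": [0.12, 0.87, 0.45],
        "event_timestamp": [datetime.utcnow()] * 3,
    }
)

# Write the precomputed predictions to the online store so the get_online_features()
# call above can serve them like any other feature.
store.write_to_online_store(feature_view_name="user_data", df=scored_users)
```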

## 3. Online Model Inference with Online Features and Cached Predictions
This is the most sophisticated approach: inference is optimized for low latency by caching predictions and running
model inference when data producers write features to the online store. It is particularly useful for applications
where features come from multiple data sources, the model is computationally expensive to run, or latency is a
significant constraint.

```python
# Client Reads
import pandas as pd  # needed for the pd.DataFrame call below

features = store.get_online_features(
    feature_refs=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
if features.to_dict().get('user_data:model_predictions') is None:
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
Note that in this case a separate call to `write_to_online_store` is required when the underlying data changes and the
predictions change along with it.

```python
# Client Writes from the Data Producer
user_data = request.POST.get('user_data')
model_predictions = model_server.predict(user_data)  # assume this includes `user_data` in the DataFrame
store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
While this requires additional writes for every data producer, this approach will result in the lowest latency for
model inference.
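Putting the read path together, the check-then-compute logic above can be wrapped in a small helper so callers always
get a prediction whether or not one is already cached. A sketch reusing the `store` and `model_server` objects from the
earlier examples:
```python
import pandas as pd


def get_user_prediction(user_id: int):
    """Return the cached prediction if present; otherwise compute it, cache it, and return it."""
    features = store.get_online_features(
        feature_refs=[
            "user_data:click_through_rate",
            "user_data:number_of_clicks",
            "user_data:average_page_duration",
            "user_data:model_predictions",
        ],
        entity_rows=[{"user_id": user_id}],
    )
    cached = features.to_dict().get("user_data:model_predictions")
    if cached is not None:
        return cached

    # Cache miss: run inference and write the result back so subsequent reads hit the cache.
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
    return model_predictions
```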

## 4. Online Model Inference without Features
This approach does not require Feast. The model server can serve predictions directly without any features. It is
common for Large Language Models (LLMs) and other models that do not require features to make predictions.

Note that generative models using Retrieval-Augmented Generation (RAG) do require features, where the
[document embeddings](../../reference/alpha-vector-database.md) are treated as features, which Feast supports
(this would fall under "Online Model Inference with Online Features").
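For the RAG case, a rough sketch of retrieving document embeddings through Feast, assuming the alpha vector-database
support linked above exposes `retrieve_online_documents` and that a `documents` feature view with an `embedding` field
exists; the `embed` and `llm` calls are placeholders for your embedding model and generative model.
```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

question = "How do I rotate my API key?"

# Placeholder: embed the question with whatever embedding model you use.
query_embedding = embed(question)

# Retrieve the closest document embeddings from the online store (alpha vector search API).
context = store.retrieve_online_documents(
    feature="documents:embedding",
    query=query_embedding,
    top_k=5,
).to_dict()

# Placeholder: pass the retrieved context to the generative model alongside the prompt.
answer = llm.generate(prompt=question, context=context)
```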
7 changes: 4 additions & 3 deletions docs/getting-started/architecture/overview.md
@@ -8,9 +8,10 @@ Feast's architecture is designed to be flexible and scalable. It is composed of
online store.
This allows Feast to serve features in real-time with low latency.

* Feast supports On Demand and Streaming Transformations for [feature computation](feature-transformation.md) and
will support Batch transformations in the future. For Streaming and Batch, Feast requires a separate Feature Transformation
Engine (in the batch case, this is typically your Offline Store). We are exploring adding a default streaming engine to Feast.
* Feast supports [feature transformation](feature-transformation.md) for On Demand and Streaming data sources and
will support Batch transformations in the future. For Streaming and Batch data sources, Feast requires a separate
[Feature Transformation Engine](feature-transformation.md#feature-transformation-engines) (in the batch case, this is
typically your Offline Store). We are exploring adding a default streaming engine to Feast.

* Domain expertise is recommended when integrating a data source with Feast to understand the [tradeoffs from different
write patterns](write-patterns.md) for your application
