Project ideas for GSoC 2022

The FEAST project

Name: Feast - Feature Store for Machine Learning
Homepage: https://feast.dev/
GitHub: https://github.com/feast-dev/feast

Project idea #1: Real-time Observability

Time: 350 hr = 8 weeks

Potential Mentors:

Oleksii Moskalenko
Felix Wang
Willem Pienaar

Prerequisites:

Python
Go
Nice to have: familiarity with cloud infrastructure and monitoring tools (Kubernetes, StatsD, Grafana, Prometheus)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins. Feast also provides the ability for users to stream real-time features into an online store, from which production ML systems read feature values in order to make predictions.

Despite the fact that Feast operates as a production data system for operational machine learning, Feast is still relatively limited in the types of monitoring it allows. Teams that are running Feast at scale need a way to understand how the system is operating. This project seeks to add support for end-to-end monitoring to Feast, which will allow operators of the system to get visibility into component level metrics at ingestion, storage, compute, and retrieval time, as well as data level metrics to understand which features are available within the feature store.

This project aims to add real-time observability in the following milestones:

RFC for overall design (approved by community + core Feast maintainers)
Implementation (including tests + documentation + tutorial)
- Metric collection at the following points using StatsD or equivalent
  - Materialization library (writes to online store)
  - Python/Go Online Servers (reads from online store)
  - Push API (writes to online store)
  - Feature Transformation Server (computes features)
- Metric sink (StatsD)
- Configuration of metric collection through feature_store.yaml
- Create a Helm chart for deployment of the metric stack
(optional): Build a data visualization dashboard in Grafana to trend metrics in Prometheus
(optional): Add support for sinking not only performance metrics but also feature values to Prometheus. This allows operators to set alerts based on both metrics and feature values.
(optional): Add support for updating Prometheus/Grafana configuration based on the feature views in the registry.
(optional): Make it possible for users to detect anomalies in data by allowing them to configure alerts based on tags in feature views
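
As a rough illustration of the metric collection milestone, the sketch below emits counters and timings in the plain StatsD line format (`<name>:<value>|<type>`) over UDP. The `StatsdSink` class, the `feast` prefix, and the metric names mentioned afterwards are assumptions made for this example; they are not part of the Feast API, and the actual design is expected to come out of the RFC.

```python
import socket
import time
from contextlib import contextmanager


def format_statsd(name: str, value: float, metric_type: str) -> str:
    """Render one metric in the plain StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class StatsdSink:
    """Hypothetical fire-and-forget metric sink; not part of the Feast API."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125, prefix: str = "feast"):
        self._addr = (host, port)
        self._prefix = prefix
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, name: str, value: float, metric_type: str) -> str:
        line = format_statsd(f"{self._prefix}.{name}", value, metric_type)
        try:
            self._sock.sendto(line.encode("ascii"), self._addr)
        except OSError:
            pass  # a metrics failure should never break the serving path
        return line

    def incr(self, name: str, count: int = 1) -> str:
        """Emit a counter ("c") metric."""
        return self.send(name, count, "c")

    @contextmanager
    def timer(self, name: str):
        """Emit the wall-clock duration of the wrapped block as a timing ("ms") metric."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            self.send(name, elapsed_ms, "ms")
```

An online server could wrap a read in `with sink.timer("online_server.get_online_features"):`, and the Push API could call `sink.incr("push_api.rows_written", n)` after each write. Both metric names are made up for this sketch.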

Project idea #2: Batch transformations

Time: 350 hr = 8 weeks

Potential Mentors:

Achal Shah
Tsotne Tabidze
Felix Wang

Prerequisites:

Python
SQL
Nice to have: familiarity with MLOps / ML tooling (Spark, dbt, dagster, Dataflow, Flink, etc)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins.

One capability Feast is missing is the ability to support batch transformations (i.e. transforming large amounts of time series data into feature values for generating training data). Many users want a light-weight way to do feature engineering in Feast. Many other users have existing transformation pipelines (that data scientists use), but want to use Feast as a way of encouraging feature / data pipeline re-use.

This project aims to add feature engineering in several milestones:

RFC for overall design (approved by community + core Feast maintainers)
Implementation (including tests + documentation + tutorial)
- SQL based transformation (leveraging the existing data warehouses registered as offline stores) defined as part of a Feast BatchFeatureView that uses the SQL dialect of the underlying offline store
- Pandas based transformations (using Ray/Dask) for a pythonic transformation
  - e.g. data scientists can easily transition from notebook driven development
- Integration into Feast Web UI
(optional): using pandas based batch transformations to execute consistent transformations at training vs serving time (ODFV)
(optional): allow Spark based transformations (spark_sql vs pyspark)
(optional): allow dbt based transformations
(optional): integration with existing transformation pipelines (dbt, Spark, Dataflow)
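
To make the SQL-based milestone more concrete, here is a minimal sketch of a transformation that is defined as SQL and executed inside the offline store. It assumes a hypothetical `SqlBatchFeatureView` dataclass (not the real Feast `BatchFeatureView` API) and uses an in-memory SQLite database as a stand-in for a data warehouse; the table and column names are invented for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlBatchFeatureView:
    """Hypothetical definition: a named SQL transformation over a source table."""
    name: str
    entity_column: str
    sql: str  # written in the dialect of the offline store (SQLite here)


def run_transformation(conn: sqlite3.Connection, view: SqlBatchFeatureView) -> list:
    """Execute the view's SQL in the offline store and return the feature rows."""
    return conn.execute(view.sql).fetchall()


# SQLite stands in for the warehouse registered as the offline store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (driver_id INTEGER, event_date TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "2022-03-01", 10.0),
        (1, "2022-03-01", 20.0),
        (1, "2022-03-02", 5.0),
        (2, "2022-03-01", 8.0),
    ],
)

driver_daily_stats = SqlBatchFeatureView(
    name="driver_daily_stats",
    entity_column="driver_id",
    sql="""
        SELECT driver_id, event_date,
               COUNT(*) AS trip_count,
               AVG(fare) AS avg_fare
        FROM trips
        GROUP BY driver_id, event_date
        ORDER BY driver_id, event_date
    """,
)

rows = run_transformation(conn, driver_daily_stats)
```

Keeping the SQL in the store's own dialect means the heavy lifting stays in the warehouse, and Feast only needs to track the definition and where the result lands.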

Project idea #3: Streaming Compute

Time: 350 hr = 8 weeks

Potential Mentors:

Oleksii Moskalenko
Danny Chiao
Achal Shah

Prerequisites:

Java or Scala
Nice to have: familiarity with stream processing frameworks (Apache Spark, Apache Beam, Apache Flink)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins. Feast also provides the ability for users to stream real-time features into an online store, from which production ML systems read feature values in order to make predictions.

The “streaming” implementation that Feast has today is very limited. It’s simply a push-based API to which users of Feast should write their values. Essentially Feast requires users to launch and manage their own streaming jobs, compute their own feature values based on streaming events, and then push them to Feast through this push API.

The goal of this project is to build a real-time streaming compute stack that makes it possible for users to define streaming feature aggregations, and have this streaming system automatically launch jobs to compute features and write them to the feature store.

This project aims to add streaming compute in the following milestones:

RFC for overall design (approved by community + core Feast maintainers)
Implementation (including tests + documentation + tutorial)
- Develop a streaming job launcher for a single platform (Spark, Flink, Beam, etc)
- Develop a streaming job orchestrator to ensure jobs are started, stopped, or restarted.
- Develop feature definitions (feature views) that allow users to define aggregations and register them with Feast
- Add support for at least a single streaming source (like Kafka)
(optional) Integrate metric production with the streaming stack
(optional) Allow users to develop and test streaming jobs from within a local development environment
(optional) Add support for partial aggregations/tiling, which lets the streaming compute system aggregate feature values partially, while the online serving stack takes care of the complete aggregation
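
The partial aggregation/tiling idea in the last item can be sketched as follows: the streaming side folds each event into a fixed-size time tile, and the serving side merges the tiles that overlap the requested window. The `TiledSum` class and its method names are hypothetical and only meant to illustrate the split of work, not a proposed Feast interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tile:
    """Partial aggregate for one fixed-size time tile."""
    count: int = 0
    total: float = 0.0


class TiledSum:
    """Hypothetical tiled aggregation; names are illustrative, not a Feast API."""

    def __init__(self, tile_seconds: int):
        self.tile_seconds = tile_seconds
        self.tiles = defaultdict(Tile)  # (entity key, tile start) -> Tile

    def ingest(self, key: str, timestamp: int, value: float) -> None:
        """Streaming side: fold one event into the tile that covers its timestamp."""
        tile_start = timestamp - timestamp % self.tile_seconds
        tile = self.tiles[(key, tile_start)]
        tile.count += 1
        tile.total += value

    def read(self, key: str, now: int, window_seconds: int) -> Tile:
        """Serving side: merge every tile that overlaps [now - window_seconds, now].

        The oldest tile is included whole, so the result is approximate at
        tile granularity.
        """
        window_start = now - window_seconds
        first_tile = window_start - window_start % self.tile_seconds
        merged = Tile()
        for start in range(first_tile, now + 1, self.tile_seconds):
            tile = self.tiles.get((key, start))
            if tile is not None:
                merged.count += tile.count
                merged.total += tile.total
        return merged
```

Smaller tiles give tighter window boundaries at the cost of more reads per request, which is the main trade-off a real design would have to settle.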

The Milvus project

Name: Milvus Vector Database
Homepage: https://milvus.io/
GitHub: https://github.com/milvus-io/milvus

Milvus was created in 2019 with a singular goal: store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors at trillion scale. Unlike existing relational databases, which mainly deal with structured data following a pre-defined pattern, Milvus is designed from the bottom up to handle embedding vectors converted from unstructured data.

Project idea #1: Support Range Search in Milvus

Abstract

Range search is an operator returning all elements that are within a given radius of the query point. It is another widely used similarity search operator. This project will focus on integrating range search functionality into Milvus.
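
For readers new to the operator, a brute-force version makes the semantics concrete. The sketch below assumes L2 (Euclidean) distance and a plain Python list of vectors; it is a reference for the expected results, not how Knowhere or any index library implements range search.

```python
import math
from typing import List, Sequence, Tuple


def range_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    radius: float,
) -> List[Tuple[int, float]]:
    """Return (index, distance) for every vector within `radius` of `query` (L2 distance)."""
    hits = []
    for idx, vec in enumerate(vectors):
        dist = math.dist(vec, query)
        if dist <= radius:
            hits.append((idx, dist))
    hits.sort(key=lambda hit: hit[1])  # nearest first
    return hits
```

Unlike top-k search, the number of results is not fixed in advance: it depends on how many vectors fall inside the radius, which can be zero or the whole collection.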

Difficulty	Time	Priority	Mentors
Hard	350 hr	High	Nemo, YH Mo

Technical Details

Some of the underlying vector search libraries, such as Faiss, may already implement range search, while others, such as HNSW, do not. To support range search in Knowhere, we will need to:

Implement the range iteration logic
Add a range search API for Milvus
Implement range search unit tests and CI tests for Milvus's execution engine, Knowhere
Implement range search on HNSW (optional)

Helpful Experience

Proficient in Go and C++
Ability to read scientific publications
Nice to have: knowledge of vector indexes and distributed systems

First Steps

Learn what range search is, and the basic idea of Faiss and HNSW
Set up the development environment for Knowhere and Milvus
Read the HNSW code and figure out how to support range search
Discuss your proposal with the mentors and at the community tech meeting
Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #2: Distributed file system Support for Milvus

Abstract

Milvus adopts the design principle of computation-storage disaggregation. The primary storage we support is object storage such as Minio and S3. This project will focus on integrating Milvus with HDFS, and probably other distributed file systems such as Ceph and GlusterFS.

Difficulty	Time	Priority	Mentors
Easy	175 hr	Medium	Congqi Xia

Technical Details

Milvus already has a good storage abstraction, which supports both local disk and Minio as chunk storage. To support a distributed file system (DFS) as the storage backend, we will need to:

Implement DFSChunkManager
Finish all the unit tests and integration tests against the distributed file system.
Add monitoring metrics for IOPS, traffic and latency.
Design and run the performance benchmark
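
The shape of the work can be sketched as a storage interface with interchangeable backends, plus a wrapper that records operation counts, traffic and latency. The sketch below is in Python for brevity, although Milvus itself is written in Go, and the method names are assumptions rather than the actual Milvus `ChunkManager` interface. The in-memory backend stands in for a DFS client.

```python
import time
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict


class ChunkManager(ABC):
    """Illustrative storage interface; the method names are assumptions."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...

    @abstractmethod
    def remove(self, path: str) -> None: ...


class InMemoryChunkManager(ChunkManager):
    """Stand-in backend; a DFS-backed manager would call the file system client here."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._chunks[path] = bytes(data)

    def read(self, path: str) -> bytes:
        return self._chunks[path]

    def exists(self, path: str) -> bool:
        return path in self._chunks

    def remove(self, path: str) -> None:
        self._chunks.pop(path, None)


class MeteredChunkManager(ChunkManager):
    """Wraps any backend and records operation counts, bytes moved and latency."""

    def __init__(self, inner: ChunkManager) -> None:
        self._inner = inner
        self.op_count: Dict[str, int] = defaultdict(int)
        self.bytes_moved: Dict[str, int] = defaultdict(int)
        self.latency_seconds: Dict[str, float] = defaultdict(float)

    def _record(self, op: str, started: float, num_bytes: int = 0) -> None:
        self.op_count[op] += 1
        self.bytes_moved[op] += num_bytes
        self.latency_seconds[op] += time.perf_counter() - started

    def write(self, path: str, data: bytes) -> None:
        started = time.perf_counter()
        self._inner.write(path, data)
        self._record("write", started, len(data))

    def read(self, path: str) -> bytes:
        started = time.perf_counter()
        data = self._inner.read(path)
        self._record("read", started, len(data))
        return data

    def exists(self, path: str) -> bool:
        started = time.perf_counter()
        found = self._inner.exists(path)
        self._record("exists", started)
        return found

    def remove(self, path: str) -> None:
        started = time.perf_counter()
        self._inner.remove(path)
        self._record("remove", started)
```

Because the metering lives in a wrapper, the same counters work for any backend, which also makes it easier to compare backends in the performance benchmark.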

Helpful Experience

Proficient in Go
Nice to have: experience with distributed storage and familiarity with distributed systems.

First Steps

Read the HDFS documentation and API usage
Read the code in Milvus about MinioChunkManager
Discuss your proposal with the mentors.
Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #3: Implement an autoscaler for Milvus

Abstract

Milvus is a cloud-native vector database with good elasticity thanks to its storage-computation disaggregation design. In this project, we want to support a cluster autoscaler that can change the cluster size based on user traffic.

Difficulty	Time	Priority	Mentors
Moderate	175 hr	Medium	Edward Zeng, Xiaofan-luan

Technical Details

The key to this project is to evaluate the existing open-source autoscaler mechanisms and determine which one we could adopt in Milvus. The task owner needs a deep understanding of which metrics can serve as triggers for expanding and shrinking the cluster. To implement a stable autoscaler, follow these steps:

Follow the Milvus benchmark, run stress tests, and find the metrics that correctly indicate system load (new metrics may need to be added)
Implement the logic to collect all the metrics
Implement the core logic for making scaling decisions
Call the Milvus K8s operator to expand or shrink the cluster
Run adequate benchmark tests with various workloads to verify that the autoscaler works as expected.
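
One possible starting point for the decision logic is the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed metric to its target, and do nothing while the ratio stays within a tolerance band. The sketch below assumes a single load metric; the function name, default bounds and tolerance value are illustrative choices, not Milvus settings.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 16,
    tolerance: float = 0.1,
) -> int:
    """Proportional scaling decision with a dead band and replica bounds."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("current_replicas must be >= 1 and target_metric must be > 0")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        proposed = current_replicas  # close enough to target: avoid flapping
    else:
        proposed = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, proposed))
```

For example, 4 query nodes at 90% utilisation against a 60% target would scale to 6, while 62% utilisation stays inside the tolerance band and leaves the cluster unchanged. The tolerance band is what keeps the scaler stable under noisy metrics.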

Helpful Experience

Proficient in Go
Knowledge about K8s
Nice to have: basic knowledge of K8s autoscalers and Knative, familiarity with Prometheus.

First Steps

Investigate different autoscalers
Run the performance benchmark and interpret the metrics on the Prometheus cluster
Design the core logic of resource orchestration
Discuss your proposal with the mentors.
Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #4: Implement dynamic config storage

Abstract

Milvus supports YAML-style configuration, which is static. When a user wants to change the configuration, the whole cluster needs to be rebooted. This project aims to implement persistent storage, dynamic loading, and a notification mechanism for cluster configuration.

Difficulty	Time	Priority	Mentors
Easy	175 hr	Medium	Xuan Yang

Technical Details

The key part of the project is to implement a config table on meta storage such as Etcd. We'll need to define the storage format, design the notification mechanism, and update the param table of each component. To finish the task, we will need to:

Implement config storage on etcd
Design the notification based on Etcd watch
Reload logic for each of the configuration items
Design the proto to set and remove specified configs
Implement this API in the multi-language SDKs
Integrate the config with Attu - the milvus management GUI (Optional)
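
The interaction between config storage, watch-based notification and reload can be sketched with an in-memory stand-in for the meta store. In the sketch below, `InMemoryConfigStore` imitates only the put, delete and prefix-watch behaviour that the real design would get from Etcd; the `config/` key prefix and the `queryNode.cacheSize` item used in the usage note are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A watch callback receives the key and its new value (None means the key was deleted).
WatchCallback = Callable[[str, Optional[str]], None]


class InMemoryConfigStore:
    """Stand-in for a meta store such as Etcd: key/value pairs plus prefix watches."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: List[Tuple[str, WatchCallback]] = []

    def watch(self, prefix: str, callback: WatchCallback) -> None:
        self._watchers.append((prefix, callback))

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._notify(key, value)

    def delete(self, key: str) -> None:
        if self._data.pop(key, None) is not None:
            self._notify(key, None)

    def _notify(self, key: str, value: Optional[str]) -> None:
        for prefix, callback in self._watchers:
            if key.startswith(prefix):
                callback(key, value)


class ParamTable:
    """Effective configuration of one component: static defaults plus dynamic overrides."""

    def __init__(self, defaults: Dict[str, str], store: InMemoryConfigStore,
                 prefix: str = "config/") -> None:
        self._defaults = dict(defaults)
        self._overrides: Dict[str, str] = {}
        self._prefix = prefix
        store.watch(prefix, self._on_change)

    def _on_change(self, key: str, value: Optional[str]) -> None:
        name = key[len(self._prefix):]
        if value is None:
            self._overrides.pop(name, None)  # removed: fall back to the default
        else:
            self._overrides[name] = value

    def get(self, name: str) -> str:
        return self._overrides.get(name, self._defaults[name])
```

With `params = ParamTable({"queryNode.cacheSize": "32"}, store)`, a later `store.put("config/queryNode.cacheSize", "64")` changes what `params.get` returns without a restart, and deleting the key restores the default. Items that cannot safely change at runtime would need their own reload logic, as the list above notes.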

Helpful Experience

Proficient in Go
Knowledge about Etcd
Nice to have: knowledge of other languages, including C++, Java, and Python

First Steps

Run Milvus and try to tune some of the configs
Read the param table code and understand the current param loading mechanism
Discuss your proposal with the mentors.
Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #5: Support add/remove field in collection schema

Abstract

The collection schema in the current Milvus design is immutable. Users may want to add or drop fields, especially when their data is synced from other data sources. We should support DDL operations to manage collection fields.

Difficulty	Time	Priority	Mentors
Hard	350 hr	Medium	jaime0815

Technical Details

Adding a field should be fairly simple: the system needs to update the collection meta on etcd and notify all query nodes and proxies that the collection meta has changed. The only difficulty is that Milvus does not support incomplete entities, so we will need to implement a default value mechanism to make sure a newly added field always has a return value.

Removing a field is more complicated. Besides changing the collection meta and notifying all components, Rootcoord is also responsible for cleaning up the deleted column data, and the clean-up process has to be designed carefully because of time travel, snapshots, and other mechanisms.

To finish the task, we will need to:

Implement collection schema modification logic
Implement the logic to notify all meta caches
Implement default value logic for newly added fields
Implement the deleted field data purge
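
The default value mechanism can be illustrated at read time: entities stored before a field was added simply lack that column, so the reader fills in the field's default, and dropped fields are hidden even if their data has not been purged yet. The sketch below uses hypothetical `FieldSchema` and `CollectionSchema` classes that are unrelated to the Milvus SDK types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FieldSchema:
    name: str
    default: Any = None


@dataclass
class CollectionSchema:
    """Hypothetical mutable schema; not the Milvus SDK class of the same name."""
    fields: List[FieldSchema] = field(default_factory=list)

    def add_field(self, name: str, default: Any) -> None:
        if any(f.name == name for f in self.fields):
            raise ValueError(f"field {name!r} already exists")
        self.fields.append(FieldSchema(name, default))

    def drop_field(self, name: str) -> None:
        # Logical removal only; purging the stored column data is a separate step.
        self.fields = [f for f in self.fields if f.name != name]

    def project(self, stored: Dict[str, Any]) -> Dict[str, Any]:
        """Shape a stored entity to the current schema, filling gaps with defaults."""
        return {f.name: stored.get(f.name, f.default) for f in self.fields}
```

Projecting at read time means old segments never have to be rewritten when a field is added, which keeps the DDL operation cheap. The physical purge of dropped columns can then run in the background.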

Helpful Experience

Proficient in Go
Knowledge about database schema change
Nice to have: familiarity with the DDL part of any open-source database

First Steps

Set up the development environment for Milvus
Read the current code for collection creation and drop
Find all usages of collection meta in the proxy, data node and query node
Discuss your proposal with the mentors.
Join our Slack channel or mail our mentors to get in touch with them for more information

The Adversarial Robustness Toolbox (ART) project

Name: Adversarial Robustness Toolbox (ART)
Homepage: https://adversarial-robustness-toolbox.org/
GitHub: https://github.com/Trusted-AI/adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).

Project idea #1: Firewalls for AI in ART

Abstract

We would like to extend ART with tools and APIs to provide firewalls for ML models to mitigate adversarial attacks on AI models. The new tools can be based on stateful detectors, use reusable detection rules, monitor interactions with the model and determine if the model is currently under attack or applied outside its intended input distribution.

Difficulty	Time	Priority	Mentors
Hard	350 hr	High	Beat Buesser

Technical Details

Creating tools and APIs for firewalls for ML models will include:

Define APIs for attack detectors and reusable detection rules
Implement stateful detectors for evasion and extraction attacks in a black-box scenario for classification
Implement unit tests for the new tools
(Optional) Implement detection rules distinguishing natural shifts, ongoing attacks, or repeated application of adversarial examples
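
To give a flavour of what a stateful detector might look like, the sketch below keeps a bounded history of queries and flags a new query when its mean distance to its k nearest earlier queries falls below a threshold. Many near-duplicate queries are typical of black-box attacks that probe a model iteratively. The `StatefulQueryDetector` class, its parameters and the raw-input distance are assumptions for this example and not an existing ART API. A practical detector would more likely compare learned embeddings than raw inputs.

```python
import math
from collections import deque
from typing import Deque, Sequence, Tuple


class StatefulQueryDetector:
    """Hypothetical detector; the class and its parameters are not part of ART."""

    def __init__(self, k: int = 3, threshold: float = 0.5, history: int = 1000):
        self.k = k
        self.threshold = threshold
        self._history: Deque[Tuple[float, ...]] = deque(maxlen=history)

    def observe(self, query: Sequence[float]) -> bool:
        """Record `query`; return True if it is suspiciously close to earlier queries."""
        distances = sorted(math.dist(query, past) for past in self._history)
        self._history.append(tuple(query))
        if len(distances) < self.k:
            return False  # not enough history to judge yet
        mean_knn = sum(distances[: self.k]) / self.k
        return mean_knn < self.threshold
```

The state (the query history) is what distinguishes this from a per-input detector: a single query is never flagged on its own, only in relation to what the same client has sent before.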

Helpful Experience

Proficient in Python and machine learning
Ability to read scientific publications
Nice to have: knowledge of adversarial machine learning

First Steps

Familiarise yourself with current literature on stateful detection of black-box evasion and extraction attacks
Set up the development environment for ART
Read the code of ART and create a plan for the new firewall APIs
Discuss your proposal with the mentors and at the community tech meeting
Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #2: Extend set of Expectation over Transformation tools in ART

Abstract

We would like to extend ART's Expectation over Transformation (EoT) tools to cover the most common data augmentation transformation operations and add support beyond classification to object detection, object tracking, and speech recognition. Furthermore, we would like to automate and simplify the support for non-framework-specific transformations in framework-specific attacks.

Difficulty	Time	Priority	Mentors
Easy	175 hr	Medium	Beat Buesser

Technical Details

ART already has a basic set of EoT tools, mainly for classification and some support for object detection. To extend this initial set of tools and facilitate non-framework-specific transformations the project will:

Define a comprehensive list of transformations including standard and adversarial augmentation operations
Implement transformations as EoT Preprocessors with support for object detection, object tracking, and speech recognition
Implement a generalised and automated Preprocessor for non-framework-specific transformations
Implement unit tests for the new tools
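
For orientation, the core of Expectation over Transformation is an average over randomly sampled transformations: instead of evaluating a loss on the input itself, it is evaluated on many transformed copies and the results are averaged. The sketch below is a framework-free Monte Carlo estimate of that expectation. The function names and the brightness transformation are illustrative assumptions and do not mirror ART's EoT preprocessor classes.

```python
import random
from typing import Callable, List

Vector = List[float]
Transform = Callable[[Vector], Vector]


def expectation_over_transformation(
    loss: Callable[[Vector], float],
    sample_transform: Callable[[random.Random], Transform],
    x: Vector,
    num_samples: int = 32,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of E_t[loss(t(x))] over randomly sampled transformations t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        transform = sample_transform(rng)
        total += loss(transform(x))
    return total / num_samples


def random_brightness(rng: random.Random) -> Transform:
    """Sample a brightness shift in [-0.1, 0.1] and clip pixel values to [0, 1]."""
    shift = rng.uniform(-0.1, 0.1)
    return lambda x: [min(1.0, max(0.0, v + shift)) for v in x]
```

In an attack, the same averaging would be applied to the loss gradient, so that the resulting adversarial example stays effective across the sampled transformations. Extending the idea to object detection or speech recognition mainly means transforming the labels (boxes, transcripts) consistently with the inputs.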

Helpful Experience

Proficient in Python and machine learning
Ability to read scientific publications
Nice to have: knowledge of adversarial machine learning

First Steps

Familiarise yourself with literature on Expectation over Transformation in adversarial machine learning
Set up the development environment for ART
Read the code of ART and create a plan for the new EoTs
Discuss your proposal with the mentors and at the community tech meeting
Join our Slack channel or mail our mentors to get in touch with them for more information
