Skip to content

Latest commit

 

History

History
382 lines (253 loc) · 19.5 KB

GSoC2022.md

File metadata and controls

382 lines (253 loc) · 19.5 KB

Project ideas for GSoC 2022

The FEAST project

Project idea #1: Real-time Observability

Time: 350 hr = 8 weeks

Potential Mentors:

Prerequisites:

  • Python
  • Go
  • Nice to have: familiarity with cloud infrastructure and monitoring tools (Kubernetes, Statsd, Grafana, Prometheus)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins. Feast also provides the ability for users to stream real-time features into an online store, from which production ML systems read feature values in order to make predictions.

Despite the fact that Feast operates as a production data system for operational machine learning, Feast is still relatively limited in the types of monitoring it allows. Teams that are running Feast at scale need a way to understand how the system is operating. This project seeks to add support for end-to-end monitoring to Feast, which will allow operators of the system to get visibility into component level metrics at ingestion, storage, compute, and retrieval time, as well as data level metrics to understand which features are available within the feature store.

This project aims to add real-time observability into the following milestones:

  • RFC for overall design (approved by community + core Feast maintainers)
  • Implementation (including tests + documentation + tutorial)
    • Metric collection at the following points using Statsd or equivalent
      • Materialization library (writes to online store)
      • Python/Go Online Servers (reads from online store)
      • Push API (writes to online store)
      • Feature Transformation Server (computes features)
    • Metric sink (StatsD)
    • Configuration of metric collection through feature_store.yaml
    • Create a Helm chart for deployment of the metric stack
  • (optional): Build a data visualization dashboard in Grafana to trend metrics in Prometheus
  • (optional): Add support for sinking not only performance metrics but also feature values to Prometheus. This allows operators to set alerts based on both metrics and feature values.
  • (optional): Add support for updating Prometheus/Grafana configuration based on the feature views in the registry.
  • (optional): Make it possible for users to detect anomalies in data by allowing them to configure alerts based on tags in feature views

Project idea #2: Batch transformations

Time: 350 hr = 8 weeks

Potential Mentors:

Prerequisites:

  • Python
  • SQL
  • Nice to have: familiarity with MLOps / ML tooling (Spark, dbt, dagster, Dataflow, Flink, etc)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins.

One capability Feast is missing is the ability to support batch transformations (i.e. transforming large amounts of time series data into feature values for generating training data). Many users want a light-weight way to do feature engineering in Feast. Many other users have existing transformation pipelines (that data scientists use), but want to use Feast as a way of encouraging feature / data pipeline re-use.

This project aims to add feature engineering in several milestones:

  • RFC for overall design (approved by community + core Feast maintainers)
  • Implementation (including tests + documentation + tutorial)
    • SQL based transformation (leveraging the existing data warehouses registered as offline stores) defined as part of a Feast BatchFeatureView that uses the SQL dialect of the underlying offline store
    • Pandas based transformations (using Ray/Dask) for a pythonic transformation
      • e.g. data scientists can easily transition from notebook driven development
    • Integration into Feast Web UI
  • (optional): using pandas based batch transformations to execute consistent transformations at training vs serving time (ODFV)
  • (optional): allow Spark based transformations (spark_sql vs pyspark)
  • (optional): allow dbt based transformations
  • (optional): integration with existing transformation pipelines (dbt, Spark, Dataflow)

Project idea #3: Streaming Compute

Time: 350 hr = 8 weeks

Potential Mentors:

Prerequisites:

  • Java or Scala
  • Nice to have: familiarity stream processing frameworks (Apache Spark, Apache Beam, Apache Flink)

Difficulty level: moderate

Ideal candidate: advanced or student level both OK

Description

Feast is an open source feature store that enables users to register data sources and features, discover features at training time, and reliably serve features in online systems. Feast uses compute systems (primarily data warehouses) to do large scale joins. Feast also provides the ability for users to stream real-time features into an online store, from which production ML systems read feature values in order to make predictions.

The “streaming” implementation that Feast has today is very limited. It’s simply a push-based API to which users of Feast should write their values. Essentially Feast requires users to launch and manage their own streaming jobs, compute their own feature values based on streaming events, and then push them to Feast through this push API.

The goal of this project is to build a real-time streaming compute stack that makes it possible for users to define streaming feature aggregations, and have this streaming system automatically launch jobs to compute features and write them to the feature store.

This project aims to add real-time observability into the following milestones:

  • RFC for overall design (approved by community + core Feast maintainers)
  • Implementation (including tests + documentation + tutorial)
    • Develop a streaming job launcher for a single platform (Spark, Flink, Beam, etc)
    • Develop a streaming job orchestrator to ensure jobs are started, stopped, or restarted.
    • Develop feature definitions (feature views) that allow users to define aggregations and register the with Feast
    • Add support for at least a single streaming source (like Kafka)
  • (optional) Integrate metric production with the streaming stack
  • (optional) Allow users to develop and test streaming jobs from within a local development environment
  • (optional) Add support for partial aggregations/tiling, which lets the streaming compute system aggregate feature values partially, while the online serving stack takes care of the complete aggregation

The Milvus project

Milvus was created in 2019 with a singular goal: store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors on a trillion scale. Unlike existing relational databases which mainly deal with structured data following a pre-defined pattern, Milvus is designed from the bottom-up to handle embedding vectors converted from unstructured data.

Project idea #1: Support Range Search in Milvus

Abstract

Range search is an operator returning all elements that are within a given radius of the query point. It is another widely used similarity search operator. This project will focus on integrating range search functionality into Milvus.

Difficulty Time Priority Mentors
Hard 350 hr High Nemo, YH Mo

Technical Details

The underlying vector search library, such as Faiss, may implement range search functionalities already, while others like HNSW are not. To support range search in knowhere, we will need to:

  • Implement the range iteration logic
  • Add a range search API for Milvus
  • Implement range search unit test and CI test for Milvus's execution engine - Knowhere
  • Implement range search on HNSW (optional)

Helpful Experience

  • Proficient in Go and C++
  • Ability to read scientific publications
  • Nice to have: knowledge of vector index, knowledge of distributed system

First Steps

  • Learn what range search is, and the basic idea of Faiss and HNSW
  • Setup development environment of Knowhere and Milvus
  • Read the code of HNSW, figure out how to support range search
  • Discuss your proposal with the mentors and on community tech meeting
  • Join our slack channel or mail our mentors to get in touch with them for more information

Project idea #2: Distributed file system Support for Milvus

Abstract

Milvus adopt the design principle of computation storage disaggregation. The primary storage we support is object storage such as Minio && S3. This project will focus on integrating Milvus with HDFS, and probably other distributed file systems such as Ceph and GlusterFS

Difficulty Time Priority Mentors
Easy 175 hr Medium Congqi Xia

Technical Details

Milvus already has a good storage abstraction. We can support both local disk and Minio as chunk storage. To support DFS as the storage backend, we will need to:

  • Implement DFSChunkManager
  • Finish all the unit test and integration test together with distributed file system.
  • Add monitoring metrics about IOPs, traffic and latency.
  • Design and run the performance benchmark

Helpful Experience

  • Proficient in Go
  • Nice to have: have the experience of using distributed storage, familiar with a distributed system.

First Steps

  • Read the document and API usage of HDFS
  • Read the code in Milvus about MinioChunkManager
  • Discuss your proposal with the mentors.
  • Join our slack channel or mail our mentors to get in touch with them for more information

Project idea #3: Implement auto scaler for milvus

Abstract

Milvus is a cloud-native vector database and we have good elasticity thanks to the design of storage computation disaggregation. In this project, we want to support a cluster auto scalar which could change the cluster size based on user traffic.

Difficulty Time Priority Mentors
Moderate 175 hr Medium Edward Zeng, Xiaofan-luan

Technical Details

The key to this project is to evaluate the existing open-source auto scalar mechanism and determine which one we could adopt into Milvus. The task owner must have deep knowledge about what metrics can be used as the trigger judgment to expand and shrink the cluster. To implement a stable auto scalar, follow the steps:

  • Follow the benchmark of milvus, run stress tests and find the correct metrics that indicate the system load (May need to add actual metrics)
  • Implement the logic to collect all the metrics
  • Implement the core logic about making operation decision
  • Call Milvus K8s operator to expand or shrink the cluster
  • Run adequate benchmark tests with various workloads, test if the auto scalar works as expected.

Helpful Experience

  • Proficient in Go
  • Knowledge about K8s
  • Nice to have: Basic knowledge K8s auto scalar and Knative, familiar with prometheus.

First Steps

  • Investigate different auto scalars
  • Run performance benchmark and interpret all metrics on prometheus cluster
  • Design the core logic of resource orchestration
  • Discuss your proposal with the mentors.
  • Join our slack channel or mail our mentors to get in touch with them for more information

Project idea #4: Implement dynamic config storage

Abstract

Milvus supports yaml style config and it is static. When a user wants to change configuration, the whole cluster needs to be rebooted. This project aims to implement a persistent, dynamic load, and notification mechanism for cluster configuration.

Difficulty Time Priority Mentors
Easy 175 hr Medium Xuan Yang

Technical Details

The key part of the project is to implement a config table on meta storage such as Etcd. We'll need to define storage format, design the notification mechanism and update param table of each component. To finish the task, we will need to:

  • Implement config storage on etcd
  • Design the notification based on Etcd watch
  • Reload logic for each of the configuration items
  • Design the proto to set and remove specified configs
  • Implementing this API on multi language SDKs
  • Integrate the config with Attu - the milvus management GUI (Optional)

Helpful Experience

  • Proficient in Go
  • Knowledge about Etcd
  • Nice to have: Knowledge of other language , including Cpp, Java, Python

First Steps

  • Run Milvus and try to tune some of the configs
  • Read the code of param table, understand current param load mechanism
  • Discuss your proposal with the mentors.
  • Join our slack channel or mail our mentors to get in touch with them for more information

Project idea #5: Support add/remove field in collection schema

Abstract

The collection schema in current Milvus design is immutable. User may want to add or drop one of the fields especially when their data is synced from other data sources. We should support DDL operations to manage collection fields.

Difficulty Time Priority Mentors
Hard 350 hr Medium jaime0815

Technical Details

To add one field should be fairly simple, the system need to update collection meta on etcd and notify all the query node and proxy that collection meta has been changed. The only trouble is Milvus does not support incomplete entities, thus we will need to implement some default value mechanism to make sure new added field has a return value.

To remove a field is a bit complicated, except for changing collection meta and notify all components, Rootcoord is also responsible for cleaning up deleted column data, and one has to carefully design the clean up process because of time travel, snapshot and other mechanisms.

To finish the task, we will need to:

  • Implement collection schema modification logic
  • Implement the logic to notify all meta caches
  • Implement default value logic for new added field
  • Implement the deleted field data purge

Helpful Experience

  • Proficient in Go
  • Knowledge about database schema change
  • Nice to have: Familiar of the DDL part of any opensource database

First Steps

  • Setup development environment of Milvus
  • Read the code of current collection creation and drop
  • Find all usage of collection meta in proxy, data node and query node
  • Discuss your proposal with the mentors.
  • Join our slack channel or mail our mentors to get in touch with them for more information

The Adversarial Robustness Toolbox (ART) project

Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).

Project idea #1: Firewalls for AI in ART

Abstract

We would like to extend ART with tools and APIs to provide firewalls for ML models to mitigate adversarial attacks on AI models. The new tools can be based on stateful detectors, use reusable detection rules, monitor interactions with the model and determine if the model is currently under attack or applied outside its intended input distribution.

Difficulty Time Priority Mentors
Hard 350 hr High Beat Buesser

Technical Details

Creating tools and APIs for firewalls for ML models will include:

  • Define APIs for attack detectors and reusable detection rules
  • Implement stateful detectors for evasion and extraction attacks in a black-box scenario for classification
  • Implement unit tests for the new tools
  • (Optional) Implement detection rules distinguishing natural shifts, ongoing attacks, or repeated application of adversarial examples

Helpful Experience

  • Proficient in Python and machine learning
  • Ability to read scientific publications
  • Nice to have: knowledge of adversarial machine learning

First Steps

  • Familiarise with current literature on stateful detection of black-box evasion and extraction attacks
  • Setup development environment for ART
  • Read the code of ART and create plan for new firewall APIs
  • Discuss your proposal with the mentors and on community tech meeting
  • Join our Slack channel or mail our mentors to get in touch with them for more information

Project idea #2: Extend set of Expectation over Transformation tools in ART

Abstract

We would like to extend ART's Expectation over Transformation (EoT) tools to cover the most common data augmentation transformation operations and add support beyond classification to object detection, object tracking, and speech recognition. Furthermore, we would like to automate and simplify the support for non-framework-specific transformations in framework-specific attacks.

Difficulty Time Priority Mentors
Easy 175 hr Medium Beat Buesser

Technical Details

ART already has a basic set of EoT tools, mainly for classification and some support for object detection. To extend this initial set of tools and facilitate non-framework-specific transformations the project will:

  • Define a comprehensive list of transformations including standard and adversarial augmentation operations
  • Implement transformations as EoT Preprocessors with support for object detection, object tracking, and speech recognition
  • Implement a generalised and automated Preprocessor for non-framework-specific transformations
  • Implement unit tests for the new tools

Helpful Experience

  • Proficient in Python and machine learning
  • Ability to read scientific publications
  • Nice to have: knowledge of adversarial machine learning

First Steps

  • Familiarise with literature on Expectation over Transformation ina adversarial machine learning
  • Setup development environment for ART
  • Read the code of ART and create plan for new EoTs
  • Discuss your proposal with the mentors and on community tech meeting
  • Join our Slack channel or mail our mentors to get in touch with them for more information