From 165f44f4deb8e8149ea415646380b52bd879f737 Mon Sep 17 00:00:00 2001
From: Hari John Kuriakose
Date: Mon, 23 Sep 2024 14:37:51 +0530
Subject: [PATCH] Add CONTRIBUTING.md (#725)

* * Add contributing

* Update run platform script user guide in README

* Update encryption key backup reminder in README

* Fix contributing md doc link in readme

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
---
 CONTRIBUTE.md   | 255 ------------------------------------------------
 CONTRIBUTING.md |   3 +
 README.md       |   9 +-
 3 files changed, 11 insertions(+), 256 deletions(-)
 delete mode 100644 CONTRIBUTE.md
 create mode 100644 CONTRIBUTING.md

diff --git a/CONTRIBUTE.md b/CONTRIBUTE.md
deleted file mode 100644
index 762700414..000000000
--- a/CONTRIBUTE.md
+++ /dev/null
@@ -1,255 +0,0 @@
-# Unstract
-
-[![pdm-managed](https://img.shields.io/badge/pdm-managed-blueviolet)](https://pdm-project.org)
-
-Use LLMs to eliminate manual processes involving unstructured data.
-
-## System Requirements
-
-- `docker` (see [instructions](https://docs.docker.com/engine/install/))
-- `git`
-- `pdm` (see below)
-- `pyenv` (recommended to manage multiple Python versions)
-
-## Quick Start
-
-Just run the `run-platform.sh` launch script to get started in few minutes.
-
-The launch script configures the env with sane defaults, pulls public docker images or builds them locally and finally runs them in containers.
-
-```bash
-# Pull and run entire Unstract platform with default env config.
-./run-platform.sh
-
-# Pull and run docker containers with a specific version tag.
-./run-platform.sh -v v0.1.0
-
-# Build docker images locally and run with a specific version tag.
-./run-platform.sh -b -v v0.1.0
-
-# Display the help information.
-./run-platform.sh -h
-
-# Only do setup of environment files.
-./run-platform.sh -e
-
-# Only do docker images pull with a specific version tag.
-./run-platform.sh -p -v v0.1.0
-
-# Only do docker images pull by building locally with a specific version tag.
-./run-platform.sh -p -b -v v0.1.0
-
-# Pull and run docker containers in detached mode.
-./run-platform.sh -d -v v0.1.0
-```
-
-Now visit [http://frontend.unstract.localhost](http://frontend.unstract.localhost) in your browser.
-
-NOTE: Modify the `.env` files present in each service folder to update its runtime behaviour. Run docker compose up again for the changes to take effect.```
-That's all. Enjoy!
-
-## Authentication
-
-The default username is `unstract` and the default password is `unstract`.
-More details on configuring this can be found in [backend's README.md](/backend/README.md#authentication)
-
-## Configuring a Text Extractor
-
-Unstract predominantly works with PDF documents and it requires a `Text Extractor` to be configured in the application which helps retrieve text from the documents. Currently supported text extractors include
-
-- [LLMWhisperer](https://unstract-api-resource.developer.azure-api.net/) (works best)
-- Unstructured Community
-- Unstructured Enterprise
-
-### Steps to use LLMWhisperer Service
-
-[LLMWhisperer](https://unstract-api-resource.developer.azure-api.net/) is our text extraction service which provides best results with Unstract.
-- Create an account in the [developer portal](https://unstract-api-resource.developer.azure-api.net/signup)
-- Create a `Subscription` under [your profile](https://unstract-api-resource.developer.azure-api.net/profile) and copy the `Primary Key`
-- Try the APIs from the portal by passing the copied key in the request header `unstract-key`
-- This key needs to be passed in our application while creating an `LLM Whisperer Text Extractor`
-
-## Running with docker compose
-
-See [Docker README.md](docker/README.md).
-
-## Setup Unstract for local development
-
-### Installation
-
-- Install the below libraries which are needed to run Unstract
-  - Linux
-
-    ```bash
-    apt install build-essential libmagic-dev pandoc pkg-config tesseract-ocr
-    ```
-
-  - Mac
-
-    ```bash
-    brew install freetds libmagic pkg-config poppler
-    ```
-
-### Create your virtual env
-
-All commands assumes that you have activated your `venv`.
-
-```bash
-cd
-
-# Create venv
-pdm venv create -w virtualenv --with-pip
-eval "$(pdm venv activate in-project)"
-
-# Remove venv
-pdm venv remove in-project
-```
-
-### Install dependencies with PDM
-
-[PDM](https://github.com/pdm-project/pdm) is used for dependency management.
-
-```bash
-# Install via script
-curl -sSL https://pdm.fming.dev/install-pdm.py | python3 -
-
-# Install via pip
-pip install pdm
-```
-
-Go to service dir and install dependencies listed in corresponding `pyproject.toml`.
-
-```bash
-# Install dependencies
-pdm install
-
-# Install specific dev dependency group
-pdm install --dev -G lint
-
-# Install production dependencies only
-pdm install --prod --no-editable
-```
-
-PDM allows you to run scripts applicable within the service dir.
-
-```bash
-# List the possible scripts that can be executed
-pdm run -l
-```
-
-Add dependencies as follows.
-
-```bash
-# Add a new service dependency to ts pyproject.toml.
-pdm add
-# Add a relative path as an editable install.
-pdm add -e
-# List all dependencies.
-pdm list
-```
-
-After modifying `pyproject.toml`, the lock file can be updated as below.
-
-```
-pdm lock
-```
-
-See [PDM's documentation](https://pdm.fming.dev/latest/reference/cli/) for further details.
-
-### Configuring Postgres
-
-- Create a Postgres user and DB for the BE and configure it like so
-
-```
-POSTGRES_USER: unstract_dev
-POSTGRES_PASSWORD: unstract_pass
-POSTGRES_DB: unstract_db
-```
-
-If you require a different config, make sure the necessary envs from [backend/sample.env](/backend/sample.env) are exported.
-
-### Pre-commit hooks
-
-- We use `pre-commit` to run some hooks whenever code is pushed to perform linting and static code analysis among other checks.
-- Ensure dev dependencies are installed and you're in the virtual env
-- Install hooks with `pre-commit install` or `pdm run pre-commit install`
-- Manually trigger pre-commit hooks in following ways:
-  ```bash
-  #
-  # Using the tool directly
-  #
-  # Run all pre-commit hooks
-  pre-commit run
-  # Run specific pre-commit hook
-  pre-commit run flake8
-  # Run mypy pre-commit hook for selected folder
-  pre-commit run mypy --files prompt-service/**/*.py
-  # Run mypy for selected folder
-  mypy prompt-service/**/*.py
-
-  #
-  # Using pdm to run the scripts
-  #
-  # Run all pre-commit hooks
-  pdm run pre-commit run
-  # Run specific pre-commit hook
-  pdm run pre-commit run flake8
-  # Run mypy pre-commit hook for selected folder
-  pdm run pre-commit run mypy --files prompt-service/**/*.py
-  # Run mypy for selected folder
-  pdm run mypy prompt-service/**/*.py
-  ```
-
-### Backend
-
-- Check [backend/README.md](backend/README.md) for running the backend.
-
-### Frontend
-
-- Install dependencies with `npm install`
-- Start the server with `npm start`
-
-### Traefik Proxy Overrides for Local + Docker Runs
-
-It is possible to simultaneously run few services directly on docker host while others are run as docker containers via docker compose.
-This enables seamless development without worrying about deployment of other services which you are not concerned with.
-
-We just need to override default Traefik proxy routing to allow this, that's all.
-
-1. Copy `docker/sample.proxy_overrides.yaml` to `docker/proxy_overrides.yaml`.
-   Modify to update Traefik proxy routes for services running directly on docker host (`host.docker.internal:`).
-
-2. Update host name of dependency components in config of services running directly on docker host:
-   - Replace as `*.localhost` IF container port is exposed on docker host
-   - **OR** use container IPs obtained via `docker network inspect unstract-network`
-   - **OR** run `dockers/scripts/resolve_container_svc_from_host.sh` IF container port is NOT exposed on docker host or if you want to keep dependency host names unchanged
-
-Run the services.
-
-#### Conflicting Host Names
-
-When same host name environment variables are used by both the service running locally and a service
-running in a container (for example, running in from a tool), host name resolution conflicts can arise for the following:
-
-- `localhost` -> Using this inside a container points to the container itself, and not the host.
-- `host.docker.internal` -> Meant to be used inside containers only, to get host IP.
-Does not make sense to use in services running locally.
-
-*In such cases, use another host name and point the same to host IP in `/etc/hosts`.*
-
-For example, the backend uses the PROMPT_HOST environment variable, which is also supplied
-in the Tool configuration when spawning Tool containers. If the backend is running
-locally and the Tools are in containers, we could set the value to
-`prompt-service` and add it to `/etc/hosts` as shown below.
-```
- prompt-service
-```
-
-## Generate Encryption key to be used in `backend` and `platform-service`
-
-An encryption key is used to securely encrypt and store data, for example credentials of connectors or adapters.
-We make use of [cryptography's](https://pypi.org/project/cryptography/) Fernet to perform this encryption. Use this snippet to generate a key that can be set in your respective `backend` and `platform-service` `.env` files.
-
-```bash
-ENCRYPTION_KEY=$(python -c "import secrets, base64; print(base64.urlsafe_b64encode(secrets.token_bytes(32)).decode())")
-```
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 000000000..a33e14be8
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,3 @@
+# Contributing
+
+See [docs.unstract.com](https://docs.unstract.com/contributing/unstract).
diff --git a/README.md b/README.md
index 8887fe97e..2d92f98be 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,7 @@ Next, either download a release or clone this repo and do the following:
 
 That's all there is to it!
 
+See [user guide](https://docs.unstract.com/unstract_platform/user_guides/run_platform) for more details on managing the platform.
 Another really quick way to experience Unstract is by signing up for our [hosted version](https://us-central.unstract.com/).
 
 ## ⏩ Quick Start Guide
@@ -137,7 +138,7 @@ Unstract comes well documented. You can get introduced to the [basics of Unstrac
 
 ## 🙌 Contributing
 
-Contributions are welcome! Please read [CONTRIBUTE.md](CONTRIBUTE.md) for further details on setting up the development environment, etc. It also points you to other detailed documents as needed.
+Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for further details on setting up the development environment, etc. It also points you to other detailed documents as needed.
 
 ## 👋 Join the LLM-powered automation community
 
@@ -145,6 +146,12 @@ Contributions are welcome! Please read [CONTRIBUTE.md](CONTRIBUTE.md) for furthe
 - [Follow us on X/Twitter](https://twitter.com/GetUnstract)
 - [Follow us on LinkedIn](https://www.linkedin.com/showcase/unstract/)
 
+## 🚨 Backup encryption key
+
+Do copy the value of `ENCRYPTION_KEY` config in either `backend/.env` or `platform-service/.env` file to a secure location.
+
+Adapter credentials are encrypted by the platform using this key. Its loss or change will make all existing adapters inaccessible!
+
 ## 📊 A note on analytics
 
 In full disclosure, Unstract integrates Posthog to track usage analytics. As you can inspect the relevant code here, we collect the minimum possible metrics. Posthog can be disabled if desired by setting `REACT_APP_ENABLE_POSTHOG` to `false` in the frontend's .env file.
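
Review note on the new "Backup encryption key" README section: the backup it asks for could be scripted roughly as below. This is only a sketch, assuming the key is stored as a plain `ENCRYPTION_KEY=...` line in the `.env` file and that it has the 32-byte urlsafe base64 shape produced by the one-liner in the removed CONTRIBUTE.md. The helper names `read_env_value` and `backup_key` are hypothetical and are not part of Unstract.

```python
import base64
import secrets
from pathlib import Path
from typing import Optional


def generate_key() -> str:
    """Build a key the same way as the one-liner in the removed CONTRIBUTE.md."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).decode()


def read_env_value(env_file: Path, name: str) -> Optional[str]:
    """Return the value of NAME from a simple KEY=VALUE .env file (hypothetical helper)."""
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")
    return None


def backup_key(env_file: Path, backup_file: Path) -> str:
    """Copy ENCRYPTION_KEY to backup_file, readable by the owner only (hypothetical helper)."""
    key = read_env_value(env_file, "ENCRYPTION_KEY")
    if not key:
        raise ValueError(f"ENCRYPTION_KEY not found in {env_file}")
    # Assumption: the key decodes to exactly 32 bytes, as the generator above produces.
    if len(base64.urlsafe_b64decode(key)) != 32:
        raise ValueError("ENCRYPTION_KEY does not decode to 32 bytes")
    backup_file.write_text(key + "\n")
    backup_file.chmod(0o600)  # Restrict permissions; effective on POSIX systems.
    return key
```

Usage would look like `backup_key(Path("backend/.env"), Path("/secure/location/unstract_encryption_key.bak"))`, where the destination path is a placeholder for wherever backups are actually kept.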