components llm_rag_crack_chunk_embed_index_and_register

Crack, Chunk, Embed, Index, and Register Data

llm_rag_crack_chunk_embed_index_and_register

Overview

Creates chunks no larger than chunk_size from input_data. Extracted document titles are prepended to each chunk.

LLMs have token limits for the prompts passed to them. This is a limiting factor at embedding time and even more so at prompt completion time, as only so much context can be passed along with the instructions to the LLM and the user's query. Chunking splits source data of various formats into small but coherent snippets of information which can be 'packed' into LLM prompts when asking for answers to user queries related to the source documents.
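
The interaction between chunk_size and chunk_overlap can be illustrated with a minimal sketch. It is not the component's implementation: it assumes whitespace-separated words stand in for model tokens, the function name `chunk_text` and the `Title:` prefix format are hypothetical, and for simplicity the prepended title is not counted against chunk_size.

```python
def chunk_text(text, chunk_size=768, chunk_overlap=0, title=None):
    """Split text into windows of at most chunk_size tokens.

    Assumption: whitespace-separated words approximate tokens; a real
    tokenizer would generally produce different counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        body = " ".join(tokens[start:start + chunk_size])
        # Hypothetical title format; the title is prepended to every chunk.
        chunks.append(f"Title: {title}\n\n{body}" if title else body)
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"t{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, chunk_overlap=1, title="Doc")
```

With a chunk_overlap of 1, the last token of each chunk is repeated as the first token of the next one, which helps keep context that would otherwise be cut at a chunk boundary.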

Supported formats: md, txt, html/htm, pdf, ppt(x), doc(x), xls(x), py

Also generates embedding vectors for data chunks if configured.

If embeddings_container is supplied, input chunks are compared to existing chunks in the Embeddings Container and only changed or new chunks are embedded. Existing chunks are reused.
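
One way to picture this reuse is a comparison of content hashes. The sketch below is an assumption about how such a comparison could work, not a description of the component's internals: the function name `plan_embedding_work`, the use of SHA-256, and the in-memory dictionary layout are all hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_embedding_work(chunks, existing):
    """Decide which chunks can reuse a stored vector and which need embedding.

    chunks:   dict of chunk_id -> chunk text from the current input data
    existing: dict of chunk_id -> (content hash, embedding vector) from a
              previous run (hypothetical layout)
    """
    reused, to_embed = {}, {}
    for chunk_id, text in chunks.items():
        prior = existing.get(chunk_id)
        if prior is not None and prior[0] == content_hash(text):
            reused[chunk_id] = prior[1]   # unchanged: keep the old vector
        else:
            to_embed[chunk_id] = text     # new or changed: embed again
    return reused, to_embed


existing = {
    "doc1#0": (content_hash("hello world"), [0.1, 0.2]),
    "doc1#1": (content_hash("old text"), [0.3, 0.4]),
}
current = {"doc1#0": "hello world", "doc1#1": "edited text", "doc2#0": "brand new"}
reused, to_embed = plan_embedding_work(current, existing)
```

Only the entries in `to_embed` would be sent to the embeddings model, which is where the savings come from when most of the source data is unchanged between runs.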

After indexing completes, an MLIndex YAML file and supporting files are registered as an AzureML data asset.

Version: 0.0.31

Inputs

Name	Description	Type	Default	Optional
input_data		uri_folder		False
embeddings_container	Folder containing previously generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data is compared to the existing embeddings and only changed or new data is embedded, reusing existing chunks.	uri_folder		True
asset_uri	Where to save MLIndex	uri_folder		True
input_glob	Limit files opened from `input_data`, defaults to '*/'	string	*/	False
chunk_size	Maximum number of tokens per chunk.	integer	768	False
chunk_overlap	Number of tokens to overlap between chunks.	integer	0	False
use_rcts	Use the langchain RecursiveCharacterTextSplitter to split chunks.	boolean	True	False
citation_url	Base URL to join with file paths to create full source file URL for chunk metadata.	string		True
citation_replacement_regex	A JSON string with two fields, 'match_pattern' and 'replacement_pattern' to be used with re.sub on the source url. e.g. '{"match_pattern": "(.)/articles/(.)", "replacement_pattern": "\1/\2"}' would remove '/articles' from the middle of the url.	string		True
doc_intel_connection_id	AzureML Connection ID for Custom Workspace Connection containing the `endpoint` key and `api_key` secret for an Azure AI Document Intelligence Service.	string		True
embeddings_model	The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}'	string		True
embeddings_connection_id	The connection id of the Embeddings Model provider to use.	string		False
batch_size	Batch size to use when embedding data.	integer	100	False
num_workers	Number of workers to use when embedding data.	integer	-1	False
asset_name	Name of the asset to register.	string		False
acs_config	JSON string containing the Azure Cognitive Search (ACS) configuration. e.g. {"index_name": "my-index"}	string		False
index_connection_id	The connection id of the Azure Cognitive Search (ACS) provider to use.	string		True
validate_deployments	Enables validation of model and index deployments.	string		True
llm_config	JSON string containing the LLM configuration.	string		True
llm_connection_id	The connection id of the LLM provider to use.	string		True
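
Several inputs above (citation_replacement_regex, acs_config) are JSON strings, and the regex input contains backslashes that are easy to get wrong when escaped by hand. The sketch below builds both strings with `json.dumps` and shows the effect of the documented `re.sub` call on a sample URL. It assumes the values are parsed with a standard JSON parser; the helper name `apply_citation_replacement` and the example URL are hypothetical.

```python
import json
import re

# Serializing from a dict lets json.dumps escape the backslashes in the
# replacement pattern instead of escaping them by hand.
citation_replacement_regex = json.dumps({
    "match_pattern": "(.)/articles/(.)",
    "replacement_pattern": r"\1/\2",
})

# Same approach for the index configuration from the table above.
acs_config = json.dumps({"index_name": "my-index"})


def apply_citation_replacement(url, config_json):
    """Apply the match/replacement pair to a source URL with re.sub."""
    cfg = json.loads(config_json)
    return re.sub(cfg["match_pattern"], cfg["replacement_pattern"], url)


rewritten = apply_citation_replacement(
    "https://example.com/docs/articles/intro.md",  # hypothetical URL
    citation_replacement_regex,
)
```

Here `rewritten` is the same URL with the '/articles' segment removed, matching the behaviour described for citation_replacement_regex.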

Environment

azureml:llm-rag-embeddings@latest
