Multi-modal support for vision models such as GPT-4 vision #331

Open · cmungall opened this issue Nov 7, 2023 · 44 comments
Labels
enhancement New feature or request

Comments

@cmungall (Contributor) commented Nov 7, 2023

https://platform.openai.com/docs/guides/vision

I think this is best handled by command-line options --image and --image-urls, to either encode a local file and pass it as base64, or to pass a URL.

@tomviner commented Nov 8, 2023

Indeed this would be awesome. Does it require changes to llm or can it be done in a plugin?

@cmungall (Contributor, Author) commented Nov 8, 2023

I suspect we'll be seeing more multimodal models so inclusion in core makes sense, but I defer to @simonw on this!

@simonw added the enhancement label on Nov 8, 2023
@simonw changed the title from "support gpt-4-vision-preview" to "Multi-model support for vision models such as GPT-4 vision" on Nov 8, 2023
@simonw (Owner) commented Nov 8, 2023

I've been thinking about this a lot.

The challenge here is that we need to be able to mix both text and images together in the same prompt - because you can call GPT-4 vision with this kind of thing:

Take a look at this image:

<image 1>

Now compare it to this:

<image 2>

My first instinct was to support syntax like this:

llm -m gpt-4-vision \
  "Take a look at this image:" \
  -i image1.jpeg \
  "Now compare it to this:" \
  -i https://example.com/image2.png

Note that the -i/--image option here takes a filename or a URL, distinguishing the two by checking whether the value corresponds to a file on disk.

But... I don't think I can implement this, because Click really, really doesn't want to provide a mechanism for storing and retrieving the order of different arguments and parameters relative to each other:

I spent some time trying to get this to work with a custom Click command class and parse_args() but determined that I'd effectively have to re-implement the whole Click argument parser from scratch to handle cases like --enable-logging boolean flags and -p key value multi-value parameters. This doesn't feel worthwhile to me.

So now I'm considering the following instead:

llm "look at this image" -i image.jpeg --tbc
llm -c "and compare it with" -i https://example.com/image.png

The trick here is that new --tbc flag, which stands for "to be continued". It causes the prompt to be stored but NOT executed against the model yet - instead, any following llm -c calls can be used to stack up more context in the prompt, which will be executed the first time --tbc is NOT used.

On a related note: llm chat could also support this - maybe letting you do this kind of thing:

llm chat -m gpt-4-vision
look at this image
!image image.jpeg

For multi-lined chats you would use the existing !multi command:

llm chat -m gpt-4-vision
!multi
look at this image
!image image.jpeg
and compare it with
!image https://example.com/image.png
!end
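If !image were allowed inside !multi, the chat loop would need to split the block into interleaved text and image parts. A minimal sketch of that parsing - the ("text" | "image", value) tuples are an assumed intermediate format, not anything llm currently defines:

def parse_multi_block(lines):
    # Split a !multi block into interleaved ("text", ...) and ("image", ...) parts
    parts = []
    for line in lines:
        if line.startswith("!image "):
            parts.append(("image", line[len("!image "):].strip()))
        elif line.strip():
            parts.append(("text", line))
    return parts


parts = parse_multi_block([
    "look at this image",
    "!image image.jpeg",
    "and compare it with",
    "!image https://example.com/image.png",
])
# [('text', 'look at this image'), ('image', 'image.jpeg'),
#  ('text', 'and compare it with'), ('image', 'https://example.com/image.png')]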

@simonw (Owner) commented Nov 8, 2023

Crucially, I want to leave the door open for other LLM models provided by plugins - like maybe https://github.com/SkunkworksAI/BakLLaVA - to also support multi-modal inputs like this.

So I think the model class would have a supports_images = True property it could set to tell LLM that images are supported - otherwise using -i/--image would return an error.
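A rough sketch of how a plugin might advertise that, and how core could check it before accepting -i/--image. The supports_images attribute comes from the comment above; treating llm.Model as the plugin base class is an assumption about the eventual shape of the API:

import llm  # assumes the existing llm plugin base class


class BakllavaModel(llm.Model):
    model_id = "bakllava"
    supports_images = True  # opt in to image attachments


def check_images_allowed(model, images):
    # Core could raise before making any API call if the model has not opted in
    if images and not getattr(model, "supports_images", False):
        raise ValueError(f"Model {model.model_id} does not support -i/--image")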

@simonw (Owner) commented Nov 8, 2023

One note about the --tbc thing is that we can get basic image support working without it - we could implement this and say that support for multiple images in the same prompt is coming later:

llm -m gpt-4-vision "Caption for this image" -i image.jpeg

@simonw (Owner) commented Nov 8, 2023

This work is blocked on:

@simonw (Owner) commented Nov 8, 2023

Would be amazing to get this working with a BakLLaVA local model - relevant example code using llama.cpp here: https://github.com/cocktailpeanut/mirror/blob/main/app.py

@cmungall changed the title from "Multi-model support for vision models such as GPT-4 vision" to "Multi-modal support for vision models such as GPT-4 vision" on Nov 9, 2023
@psychemedia commented Nov 13, 2023

Another claimed BakLLaVA example (not tried it yet), this one using llama-cpp-python: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html

(It actually uses from llm_core.llm import LLaVACPPModel. Trying to run the example code on my MacBook Pro M2 16GB, it just falls over; other chat models of a similar size seem to work okay.)

@neomanic commented Dec 4, 2023

@simonw how about f-strings/templating style?

llm "look at this image {src_image} and compare it to {compare_image}" \
    --infile src_image=sample.jpeg --infile compare_image=known.jpeg

import click


def _infiles_to_dict(
    ctx: click.Context, attribute: click.Option, infiles: tuple[str, ...]
) -> dict[str, str]:
    # Turn ("src_image=sample.jpeg", ...) into {"src_image": "sample.jpeg", ...}
    return dict(f.split("=", 1) for f in infiles)


@click.command()
@click.option(
    "-i",
    "--infile",
    multiple=True,
    callback=_infiles_to_dict,
    help="Input files in the form key=filename. Multiple files can be included.",
)
def prompt(infile):
    ...

Misc thoughts:

  • I do like the --tbc idea as well.
  • --image makes sense for now, but might later change to --infile once models can take audio, video, or arbitrary multi-modal documents? The model would have to specify what formats it accepts? Then the prompt might have to be something like llm --infile {video.mp4:v} unless some auto-detection of the file format is done.

@NightMachinery commented:

https://github.com/tbckr/sgpt

SGPT additionally facilitates the utilization of the GPT-4 Vision API. Include input images using the -i or --input flag, supporting both URLs and local images.

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" "what can you see on the picture?"
The image shows a figure resembling a robot with a humanoid form. It has a
$ sgpt -m "gpt-4-vision-preview" -i pkg/fs/testdata/marvin.jpg "what can you see on the picture?"
The image shows a figure resembling a robot with a sleek, metallic surface. It

It is also possible to combine URLs and local images:

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" -i pkg/fs/testdata/marvin.jpg "what is the difference between those two pictures"
The two images provided appear to be identical. Both show the same depiction of a

@simonw (Owner) commented Mar 4, 2024

I built a prototype of this today, in the image-experimental branch - just for OpenAI so far using docs on https://platform.openai.com/docs/guides/vision but I want to also ship support for Gemini and Claude (and eventually local models like LLaVA).

I gave it this image:

[image: a young pig held by a person]

And ran this:

llm -m 4v 'describe this image' -i image.jpg -o max_tokens 200

And got back:

This image shows a young pig being held by a person. The pig has a light brown coat with some bristle-like hair and a prominent snout that is characteristic of pigs. It appears to be a juvenile, given its size. The pig's snout is a bit dirty, suggesting it may have been rooting around in the ground, which is common pig behavior. The person is out of frame with only their arm visible, dressed in a red garment with a seemingly soft texture. They are holding the pig securely against their body. The background indicates that this is an indoor setting with wooden structures, possibly inside a barn or a similar animal enclosure.

@simonw (Owner) commented Mar 4, 2024

Lots still to do on this - I want it to support URLs, file paths, or - (stdin) as input. Those should then be made available to the model such that models like GPT-4 that support URL images can pass the URL in directly, while models like Claude 3 that only support base64 fetch that URL and then send it base64 encoded instead.

Maybe have a thing with Pillow as an optional dependency which can resize the images before sending them?

Have to decide what to do about logs. I think I need to log the images to the SQLite database (maybe in a new BLOB table) because I need them in conversations so I can send follow-up prompts - but that could take a lot of space. So I need to add tooling that helps users clean up old images from their database if it gets too big.

@simonw (Owner) commented Mar 4, 2024

I am going to pass around an image object that has a .url property that may or may not return a URL string (otherwise None), plus .bytes and .base64 properties that ALWAYS return the binary data or that data base64-encoded.

That way plugins like OpenAI that can be sent URLs can use .url first and fall back to .base64 if the URL is not available, and plugins like Claude 3 can use base64 every time.

I'm tempted to offer a .resized(max_width, max_height) method which returns a Pillow resized image for models that know there is a maximum or recommended size limit and want to send a smaller request.
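A minimal sketch of that object, assuming Pillow is available for the optional .resized() helper - the class name and lazy fetching behaviour are illustrative, not the final design:

import base64
import io
from pathlib import Path
from urllib.request import urlopen

from PIL import Image  # optional dependency, only needed for resized()


class ImageAttachment:
    def __init__(self, url=None, path=None, content=None):
        self._url = url
        self._path = path
        self._content = content

    @property
    def url(self):
        # Only set when the user supplied a URL; otherwise None
        return self._url

    @property
    def bytes(self):
        # Always returns binary data, reading the file or fetching the URL lazily
        if self._content is None:
            if self._path:
                self._content = Path(self._path).read_bytes()
            elif self._url:
                self._content = urlopen(self._url).read()
        return self._content

    @property
    def base64(self):
        return base64.b64encode(self.bytes).decode("ascii")

    def resized(self, max_width, max_height):
        # Returns a Pillow image scaled down to fit within the given bounds
        image = Image.open(io.BytesIO(self.bytes))
        image.thumbnail((max_width, max_height))
        return image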

@simonw pinned this issue Mar 6, 2024
@simonw (Owner) commented Mar 6, 2024

Idea: rather than store the images in the database, I'll store the path to the files on disk.

If you attempt to continue a conversation where the file paths no longer resolve to existing images, you'll get an error.

@tomviner commented Mar 6, 2024

Would be nice if the API server gave you a reference for every uploaded image that you could just refer back to.

@anarcat commented Mar 7, 2024

Came here looking for non-text API endpoints... I was hoping to get a direct view into the audio and text-to-speech API endpoints in particular.

So while it would be nice for llm to have a chat-like interface that interleaves images, maybe an easier first step would be simple "prompt-to-image", "prompt-to-audio", and "audio-to-text" style commands?

@simonw (Owner) commented Mar 15, 2024

Quick survey on Twitter: https://twitter.com/simonw/status/1768445876274635155

Consensus is loosely to do image and then text, rather than text then image:

[{"type":"image_url","image_url":{"url":"..."}}, [{"type":"text","text":"Describe image"}]

@simonw (Owner) commented Mar 15, 2024

Claude 3 Haiku is cheaper than GPT-3.5 Turbo and supports image inputs - a great incentive to finally get this feature shipped!

@simonw (Owner) commented Mar 15, 2024

https://twitter.com/invisiblecomma/status/1768561708090417603

The Claude Vision docs recommend image first

https://docs.anthropic.com/claude/docs/vision#image-best-practices

Image placement: Just as with document-query placement, Claude works best when images come before text. Images placed after text or interpolated with text will still perform well, but if your use case allows it, we recommend image-then-text structure. See vision prompting tips for more details.

@simonw (Owner) commented Mar 15, 2024

the maximum allowed image file size is 5MB per image

Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.

I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.

Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?
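If Pillow were pulled in, the enforcement could be a simple pre-flight check that downsizes anything over the limit. A sketch under that assumption - the halving strategy and JPEG re-encode are illustrative choices, not decided behaviour:

import io

from PIL import Image

MAX_CLAUDE_IMAGE_BYTES = 5 * 1024 * 1024  # documented 5MB per-image limit


def shrink_if_needed(image_bytes, max_bytes=MAX_CLAUDE_IMAGE_BYTES):
    if len(image_bytes) <= max_bytes:
        return image_bytes
    image = Image.open(io.BytesIO(image_bytes))
    # Halve the dimensions and re-encode until the image fits under the limit
    while True:
        image = image.resize((image.width // 2, image.height // 2))
        buffer = io.BytesIO()
        image.convert("RGB").save(buffer, format="JPEG", quality=85)
        if buffer.tell() <= max_bytes or min(image.size) < 64:
            return buffer.getvalue()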

@simonw (Owner) commented Mar 16, 2024

https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/design-multimodal-prompts#prompt-design-fundamentals

Put your image first for single-image prompts: While Gemini can handle image and text inputs in any order, for prompts containing a single image, it might perform better if that image (or video) is placed before the text prompt. However, for prompts that require images to be highly interleaved with texts to make sense, use whatever order is most natural.

@NightMachinery commented:

> the maximum allowed image file size is 5MB per image
>
> Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.
>
> I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.
>
> Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?

IMO llm should compress/resize images to avoid errors and make things easy to use. You can add an option --no-image-resize which disables this behavior, and people who care will disable it. The average user (myself included) just wants the image to go to the model, and the error is unhelpful.

BTW, OpenAI supports both low and high detail levels for processing images. Does Anthropic have something similar? Is this exposed in llm?

simonw added a commit to simonw/llm-claude-3 that referenced this issue Mar 28, 2024
simonw added a commit to simonw/llm-gemini that referenced this issue Mar 28, 2024
@irthomasthomas commented Mar 29, 2024

I made a simple CLI for vision, if anyone needs it before llm-vision is ready. Only supports GPT-4 for now. :(
https://github.com/irthomasthomas/llm-vision

It supports specifying an output format that prompts the model to generate Markdown or JSON in addition to plain text. One thing odd about gpt-4-vision is that it doesn't know you have given it an image, and sometimes doesn't believe it has vision capabilities unless you give it a phrase like 'describe the image'. But if you want to extract an image to JSON, then a text description isn't very useful. So I prompt it with 'describe the image in your head, then write the JSON document'.

There's also a work-in-progress gpt4-vision-screen-compare.py - this takes a screenshot every few seconds, compares its similarity with the last screenshot, and if it is different enough, sends it to the model asking it to explain the changes between them.

And here's a demo of what you can do with it: https://twitter.com/xundecidability/status/1763219017160867840
Problem: I want to import a blocked-domains list from Kagi into Bing Custom Search.

  • Discovered that Bing Custom Search requires manual data entry of blocked domains.

Solution: a little bash script that:

  • screenshots the Kagi blocked-domains list,
  • has GPT-4 Vision stream back a text list of the domains,
  • uses xdotool to type the domains into the Bing web page as they stream in.

@simonw (Owner) commented Apr 4, 2024

Current status:

  • Branch has -i support
  • I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism.

It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.

Some ways that could work:

  • For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file no longer exists.
  • Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing
  • Images fed in to standard input could be stored in the database, maybe as BLOB columns
  • But since being able to compare prompts and responses is so useful, maybe I should store images from disk as BLOBs too? The cost in terms of SQLite space taken up may be worth it (see the sketch after this list).
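A sketch of the BLOB option using sqlite-utils, which llm already uses for its log database - the llm_images table name matches the later design comment, but the columns here are assumptions:

import sqlite_utils

db = sqlite_utils.Database("logs.db")  # path simplified; llm keeps its logs in a user data directory


def log_image(response_id, image_bytes, source=None):
    # bytes values are stored as BLOB columns by sqlite-utils
    db["llm_images"].insert(
        {
            "response_id": response_id,
            "source": source,      # original path or URL, if there was one
            "content": image_bytes,
        }
    )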

@irthomasthomas commented Apr 4, 2024 via email

@NightMachinery commented:

@simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them; they would take up disk space for what are probably junk files.

@simonw (Owner) commented Apr 4, 2024

Another open question: how should this work in chat?

I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

But then should it be submitted the moment you hit enter, or should you get the opportunity to add a prompt afterwards? I think adding a prompt afterwards makes sense.

Also, should !image be allowed inside !multi? I'm not sure. If it IS, then how would you send that raw text to a model, e.g. as part of a longer code sample you are pasting in?

@simonw (Owner) commented Apr 4, 2024

> @simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them, it will take disk space for probably junk files.

Yeah, I'm beginning to think I may need to add a whole settings/preferences mechanism to help solve this. Something like llm settings set image_log_mode blob.

@NightMachinery commented:

@simonw

> I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

Perhaps you can use a TUI hotkey? E.g., Ctrl-i for inserting images.
Though this will quickly spiral out of control ... E.g., should the TUI present a dialogue for selecting files?

The ideal case is to be able to just paste, and detect images from the clipboard. But this seems impossible to do using native paste. Perhaps you can add a custom hotkey for pasting that checks the clipboard.

I have some functions for macOS that paste images, e.g.,

# assumes $dir and $name are already set to the destination folder and file name
class='«class PNGf»'
osascript -e "tell application \"System Events\" to ¬
                  write (the clipboard as ${class}) to ¬
                          (make new file at folder \"${dir}\" with properties ¬
                                  {name:\"${name}\"})"

@simonw (Owner) commented Apr 4, 2024

For pasting I think I'll hold off until I have a web UI working - much easier to handle paste there (e.g. https://tools.simonwillison.net/ocr does that) than figure it out for the terminal.

It would be good to get this working though:

pbpaste | llm -m claude-3-opus 'describe this image' -i -

Oh, that's frustrating: it looks like pbpaste only works for text content - I tried pbpaste > /tmp/image.png and got a 0-byte file.

ChatGPT did come up with this recipe which seems to work:

osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e 'set theFile to open for access POSIX file "/tmp/clipboard.png" with write permission' \
  -e 'write theImage to theFile' \
  -e 'close access theFile' \
  && cat /tmp/clipboard.png && rm /tmp/clipboard.png

I imagine there are cleaner implementations than that. Would be easy to wrap one into a little zsh script or similar.

@simonw (Owner) commented Apr 4, 2024

I saved this in ~/.local/bin (on my path) as impaste, ran chmod 755 ~/.local/bin/impaste, and it seems to work:

#!/bin/zsh

# Generate a unique temporary filename
tempfile=$(mktemp -t clipboard.XXXXXXXXXX.png)

# Save the clipboard image to the temporary file
osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e "set theFile to open for access POSIX file \"$tempfile\" with write permission" \
  -e 'write theImage to theFile' \
  -e 'close access theFile'

# Output the image data to stdout
cat "$tempfile"

# Delete the temporary file
rm "$tempfile"

Opus conversation here: https://gist.github.com/simonw/736bcc9bcfaef40a55deaa959fd57ca8
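With that script on the PATH, the earlier pipe idea becomes the following - assuming -i - ends up meaning "read the image bytes from standard input", which is still being designed:

impaste | llm -m claude-3-opus 'describe this image' -i -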

simonw added a commit to simonw/til that referenced this issue Apr 4, 2024
@simonw (Owner) commented Apr 5, 2024

Turned that into a TIL: https://til.simonwillison.net/macos/impaste

@paulsmith commented:

@simonw I was inspired by your TIL to try a little Swift. Here's an executable that does roughly the same thing: https://github.com/paulsmith/pbimg

Also used Claude Opus to help get started.

@simonw (Owner) commented Apr 6, 2024

OK, design decision regarding logging of images.

All models will support URL input. If the model can handle URLs directly, those will be passed to the model; for models that can't retrieve URLs themselves, LLM will fetch the content and pass it to the model.

If you provide a URL, then just that URL string will be logged to the database.

If you provide a path to a file on disk, the full resolved path will be stored.

If you pipe an image into the tool (with -i -) the image will be stored as a BLOB in an llm_images table.

You can also pass image file names and use a --save-images option to write them to that table too. This is mainly useful if you are building a research database of prompts and responses and want to pass that around.
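A rough sketch of that resolution logic - the function name and the (kind, logged_value, content) return shape are illustrative only:

import sys
from pathlib import Path


def resolve_image_argument(value):
    # Decide what gets logged for each input style described above
    if value == "-":
        data = sys.stdin.buffer.read()
        return "blob", None, data          # stored in the llm_images table
    if value.startswith("http://") or value.startswith("https://"):
        return "url", value, None          # only the URL string is logged
    resolved = Path(value).resolve()
    return "path", str(resolved), None     # full resolved path is logged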

@NightMachinery commented:

@simonw I guess you should add a command to clean image blobs out of the database, and automatically purge blobs older than LLM_CLEAN_OLDER_THAN, which should default to 90 days.

@simonw (Owner) commented Apr 11, 2024

The option for storing the images should be --store, for consistency with the llm embed-multi command, which already has the ability to store images in BLOB columns:

"content_blob": value if (store and isinstance(value, bytes)) else None,

@cmungall (Contributor, Author) commented:

I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks whether the linked image (a local path or remote URL) exists, and if it does, it gets incorporated into the multimodal prompt.

This has the advantage that a lot of existing markdown files in the wild could be passed in without modification, and the LLM would "see" what the human sees.
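A minimal sketch of that scan, assuming standard Markdown image syntax - the regex and the (text, images) split are illustrative, not a proposed llm API:

import re

# Matches ![alt](target) and captures the target, which may be a path or a URL
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")


def extract_images(markdown_text):
    # Return the prompt text with image references stripped, plus the image targets
    images = IMAGE_PATTERN.findall(markdown_text)
    text = IMAGE_PATTERN.sub("", markdown_text)
    return text, images


text, images = extract_images(
    "Compare ![pig](photo1.jpg) with ![pig2](https://example.com/pig2.png)"
)
# images == ['photo1.jpg', 'https://example.com/pig2.png']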

@tomviner commented:

> I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks to see if the linked images (local or remote URL) is there, and if it is, it gets incorporated into the multimodal prompt.
>
> This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.

@cmungall this is the simplest approach for sure. It could even support tags or raw URLs/file paths.

The downside is that you're now making network calls based on the input text. You also need a way of turning the feature off, and of escaping whatever syntax is used.

@simonw (Owner) commented May 13, 2024

I'm going to change this to -a/--attachment instead of -i/--image because models that accept things like video or audio are rapidly starting to emerge.

@simonw (Owner) commented May 14, 2024

... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as --attachment might not make sense.

So maybe I do -i/--image and -v/--video and -a/--audio instead?

@codecrack3 commented:

> ... or maybe not. I don't actually know how all of the model that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio, then bundling everything together as --attachment might not make sense.
>
> So maybe I do -i/--image and -v/--video and -a/--audio instead?

Yes, I think we need to use several type-specific variants of the attachment option instead. --attachment on its own is very good, but I think we need to implement a filter that chooses the right processor for each file extension.

@NightMachinery commented:

I also like --attachment, though perhaps a better name is simply --file. We can use either the extension or libmagic to detect the file type. Perhaps flags such as --image can also be added to force a particular format. (I.e., --attachment would auto-detect, while --image always assumes an image input.)
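A sketch of that auto-detection using only the standard library's mimetypes module (libmagic would sniff the bytes instead); the coarse category names are assumptions:

import mimetypes


def detect_attachment_type(filename):
    # Guess a coarse category ("image", "audio", "video", ...) from the extension
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    return mime.split("/")[0]


print(detect_attachment_type("photo.jpeg"))  # image
print(detect_attachment_type("speech.wav"))  # audio
print(detect_attachment_type("clip.mp4"))    # video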

@thiswillbeyourgithub commented:

No update on this? The lack of multimodality is really a major reason I'm not using llm as much anymore :/

@cjcarroll012 commented:

Really want multi-modal as well
