GPT-4 Vision is an amazing model, and if you know its capabilities and limits, you can use it to unlock new applications and new user experiences.
However, GPT-4V has a few drawbacks. It is expensive (when it comes out of preview) and it is not transparent. It runs on OpenAI’s servers, which means your data has to constantly travel to a third party. This raises privacy concerns as well as technical hurdles, since you’re sending images in addition to text.
Fortunately, there are open-source models that provide similar capabilities, including LLaVA, Fuyu, and CogVLM. All of them can be downloaded and run on your own servers. They score well on relevant benchmarks, and they are small and cost-efficient.
However, these models come with caveats. They are not as capable as GPT-4V and you have to know their limits. In my latest article, I explore these open-source models, their capabilities, and their architecture.
I also do a quick comparison of GPT-4V and LLaVA 1.5, which is reportedly the best open-source multi-modal LLM that is currently available. Unsurprisingly, GPT-4V is much better at responding to visual prompts, especially when it must extract and process structured data from the image.
My key finding is that, beyond benchmarks, you should gather a small test set for your specific application and run it on different models to compare their capabilities.
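One way to set up such a comparison is sketched below in Python. It is a minimal illustration, not the method used in the article: the `VisualCase` class, the `evaluate` function, and the substring-match scoring are all assumptions made for the example. Each model is represented as a plain function that takes an image path and a prompt and returns text, so you would wrap your own GPT-4V or LLaVA calls in functions of that shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VisualCase:
    """One test item (hypothetical structure for this sketch)."""
    image_path: str
    prompt: str
    expected: str  # text that a correct answer should contain


# A "model" here is any callable: (image_path, prompt) -> answer text.
ModelFn = Callable[[str, str], str]


def evaluate(models: Dict[str, ModelFn], cases: List[VisualCase]) -> Dict[str, float]:
    """Return, per model, the fraction of cases whose answer contains the expected text."""
    scores: Dict[str, float] = {}
    for name, model in models.items():
        hits = 0
        for case in cases:
            answer = model(case.image_path, case.prompt)
            # Crude scoring assumption: case-insensitive substring match.
            if case.expected.lower() in answer.lower():
                hits += 1
        scores[name] = hits / len(cases) if cases else 0.0
    return scores
```

Substring matching is a rough proxy. For structured-data extraction, you would probably want to parse each answer and compare fields instead, but even a crude score over a few dozen application-specific images can show differences that public benchmarks hide.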
Read all about open-source alternatives to GPT-4V on TechTalks.
Recommendations:
My go-to platform for working with GPT-4 and Claude is ForeFront.ai, which has a flexible pricing plan, a user-friendly interface, and plenty of good features for writing and coding. I use it for all kinds of tasks, including writing, coding, and testing new prompting techniques.
I found CogVLM to be the most performant multimodal model out of the existing ones.
Also, did you get the chance to try Apple's recently open-sourced Ferret? https://github.com/apple/ml-ferret