The debate over whether LLMs like GPT-4 possess reasoning abilities is ongoing. What is evident is that, with scale, these models seem to manifest emergent capabilities that resemble human reasoning.
However, there are also plenty of experiments that show LLMs can’t generalize beyond their training data.
A new study by scientists at the Santa Fe Institute evaluates GPT-4 and GPT-4V on ConceptARC, a series of visual puzzles designed to assess abstract reasoning in both humans and AI.
What makes this test interesting is that it provides a framework for looking past the surface of state-of-the-art LLMs' impressive capabilities and evaluating their reasoning more rigorously.
Perhaps unsurprisingly, the results of the study show that GPT-4 falls short of human abstract reasoning capabilities. But surprisingly, GPT-4V, the multimodal version of GPT-4, performs worse than the text-only version, which suggests that multimodality does not necessarily improve the reasoning capabilities of LLMs.
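If you want to try this kind of evaluation yourself, here is a minimal sketch of how a ConceptARC-style grid puzzle could be encoded as text and sent to GPT-4. It assumes the OpenAI Python client (openai>=1.0) and an API key in your environment; the toy puzzle and its flood-fill rule are my own illustration, not an actual ConceptARC task or the study's protocol.

```python
# A minimal sketch (not from the study): presenting a ConceptARC-style puzzle
# to GPT-4 as text. The grids below are a made-up "flood the grid with the
# nonzero color" example, not an actual ConceptARC task.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

puzzle = """Each puzzle shows input grids transformed into output grids.
Infer the rule and produce the output for the test input.

Example 1 input:   0 0 0 / 0 1 0 / 0 0 0
Example 1 output:  1 1 1 / 1 1 1 / 1 1 1

Example 2 input:   0 2 0 / 0 0 0 / 0 0 0
Example 2 output:  2 2 2 / 2 2 2 / 2 2 2

Test input:        0 0 0 / 0 0 3 / 0 0 0
Test output:"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": puzzle}],
    temperature=0,  # deterministic answers make scoring easier
)
print(response.choices[0].message.content)
```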
Read the full article on TechTalks.
Recommendations:
You can run your own ConceptARC tests on GPT-4 with ForeFront, my favorite platform for using GPT-4 and Claude. ForeFront has a very flexible pricing plan and useful features for creating your own assistants for various tasks. I've been using it for months in my writing and coding, and it has noticeably improved my productivity. Find out more here.
Join my new project!
I’ve just launched Tales from the Valley of Sand, a series of short stories set in a universe parallel to ours that reflect the events taking place in the tech world. The stories are hilarious and entertaining, and there will be one or two per week. Read the first episode: “The Coup at the House of Flowers.”
More on AI research: