A peer-reviewed article of this preprint also exists.
This version is not peer-reviewed
Submitted: 22 June 2024
Posted: 24 June 2024
Model | Total Questions | Correct Answers | Accuracy |
---|---|---|---|
GPT-3.5 | 389 | 201 | 51.67% |
GPT-4 | 80 | 72 | 90.00% |
GPT-4o | 118 | 98 | 83.05% |
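
The Accuracy column above is the ratio of correct answers to total questions. The following minimal sketch recomputes it from the counts in the table; the names `rows` and `accuracy_percent` are illustrative and do not come from the source.

```python
# Rows copied from the table above: (model, total questions, correct answers).
rows = [
    ("GPT-3.5", 389, 201),
    ("GPT-4", 80, 72),
    ("GPT-4o", 118, 98),
]


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


accuracies = {model: accuracy_percent(correct, total) for model, total, correct in rows}
for model, acc in accuracies.items():
    print(f"{model}: {acc:.2f}%")
```

The recomputed values agree with the reported 51.67%, 90.00%, and 83.05%. Because the three models were scored on different numbers of questions (389, 80, and 118), the percentages may not be directly comparable.
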
Model | Accuracy |
---|---|
GPT-3.5 | 58.80% |
GPT-4 | 73.20% |
GPT-4o | 85.39% |
Test | Total Questions | Correct Answers | Accuracy |
---|---|---|---|
Reading & Writing M1 | 33 | 31 | 93.94% |
Reading & Writing M2 | 33 | 29 | 87.88% |
Math M1 | 27 | 25 | 92.59% |
Math M2 | 27 | 22 | 81.48% |
Model | Reading & Writing | Math |
---|---|---|
GPT-3.5 | 83.75% | 73.75% |
GPT-4 | 88.75% | 87.50% |
GPT-4 (no vision) | 88.75% | 86.25% |
GPT-4o | 90.91% | 87.04% |
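
The module-level counts reported earlier (33 + 33 Reading & Writing questions, 27 + 27 Math questions) appear to pool into the GPT-4o row of the section-level table. The sketch below checks this under the assumption that the module-level table describes GPT-4o; the extracted text does not label it, and the helper name `section_accuracy` is illustrative.

```python
# Per-module results copied from the module-level table: (module, total, correct).
modules = [
    ("Reading & Writing M1", 33, 31),
    ("Reading & Writing M2", 33, 29),
    ("Math M1", 27, 25),
    ("Math M2", 27, 22),
]


def section_accuracy(prefix: str) -> float:
    """Pool every module whose name starts with `prefix` and return accuracy in percent."""
    total = sum(t for name, t, _ in modules if name.startswith(prefix))
    correct = sum(c for name, _, c in modules if name.startswith(prefix))
    return round(100.0 * correct / total, 2)


reading_writing = section_accuracy("Reading & Writing")
math_section = section_accuracy("Math")
print(f"Reading & Writing: {reading_writing:.2f}%  Math: {math_section:.2f}%")
```

Pooling gives 60/66 and 47/54, which round to the 90.91% and 87.04% listed for GPT-4o.
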
Model | Accuracy |
---|---|
GPT-3.5 | 45.10% |
GPT-4 | 75.70% |
GPT-4o | 75.00% |
Model | Deductive: Entailment Bank | Deductive: bAbI (task 15) | Inductive: CLUTRR | Inductive: bAbI (task 15) | Abductive: αNLI |
---|---|---|---|---|---|
GPT 3.5 | 25/30 | 26/30 | 2/30 | 14/30 | 19/30 |
GPT 4 | 27/30 | 30/30 | 11/30 | 28/30 | 25/30 |
GPT 4o | 29/30 | 30/30 | 17/30 | 30/30 | 27/30 |
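
Each cell in the reasoning table is a count out of 30 items. As a rough summary, the counts can be pooled per model across the five columns. The pooled figure is computed here for illustration and is not a value reported in the source; the name `pooled_percent` is hypothetical.

```python
# Cells copied from the reasoning table, in column order.
scores = {
    "GPT 3.5": ["25/30", "26/30", "2/30", "14/30", "19/30"],
    "GPT 4": ["27/30", "30/30", "11/30", "28/30", "25/30"],
    "GPT 4o": ["29/30", "30/30", "17/30", "30/30", "27/30"],
}


def pooled_percent(cells: list[str]) -> float:
    """Sum numerators and denominators of 'correct/total' cells; return percent."""
    pairs = [cell.split("/") for cell in cells]
    correct = sum(int(c) for c, _ in pairs)
    total = sum(int(t) for _, t in pairs)
    return round(100.0 * correct / total, 2)


pooled = {model: pooled_percent(cells) for model, cells in scores.items()}
for model, pct in pooled.items():
    print(f"{model}: {pct:.2f}% of 150 items")
```

With only 30 items per column, a single answer shifts a column score by about 3.3 percentage points, so small gaps between models should be read cautiously.
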
Language | Translation Accuracy (%) |
---|---|
Spanish | 88 |
Arabic | 78 |
Hindi | 82 |
French | 75 |
Portuguese | 86 |
Russian | 80 |
Class | Precision | Recall | F1-Score |
---|---|---|---|
Banana | 0.91 | 1.00 | 0.95 |
Papaya | 1.00 | 1.00 | 1.00 |
Apple | 1.00 | 1.00 | 1.00 |
Litchi | 1.00 | 1.00 | 1.00 |
Jackfruit | 1.00 | 0.90 | 0.95 |
Hog Plum | 1.00 | 1.00 | 1.00 |
Grapes | 1.00 | 1.00 | 1.00 |
Guava | 1.00 | 1.00 | 1.00 |
Mango | 0.91 | 1.00 | 0.95 |
Orange | 1.00 | 0.90 | 0.95 |
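
The F1-Score column in this and the following per-class tables is the harmonic mean of precision and recall. A minimal sketch, using the Banana and Jackfruit rows as a check; the function name `f1_score` is illustrative, and the zero-division convention (F1 = 0 when both inputs are 0) is a common choice assumed here rather than one stated in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


banana = round(f1_score(0.91, 1.00), 2)     # Banana row
jackfruit = round(f1_score(1.00, 0.90), 2)  # Jackfruit row
print(banana, jackfruit)
```

Both rows round to the reported 0.95. The range check is useful when auditing transcribed tables, since any precision or recall above 1 signals a transcription error.
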
Class | Precision | Recall | F1-Score |
---|---|---|---|
Drowsy | 0.8 | 0.8 | 0.8 |
Natural | 0.8 | 0.8 | 0.8 |
Class | Precision | Recall | F1-Score |
---|---|---|---|
Anthracnose | 0.60 | 0.60 | 0.60 |
Apple Scab | 1.00 | 0.80 | 0.89 |
Black Spot | 0.67 | 1.00 | 0.80 |
Blight | 0.38 | 0.75 | 0.50 |
Blossom End Rot | 1.00 | 1.00 | 1.00 |
Botrytis | 1.00 | 0.20 | 0.33 |
Brown Rot | 1.00 | 0.20 | 0.33 |
Canker | 0.25 | 0.25 | 0.25 |
Cedar Apple Rust | 0.83 | 1.00 | 0.91 |
Clubroot | 1.00 | 1.00 | 1.00 |
Crown Gall | 1.00 | 1.00 | 1.00 |
Downy Mildew | 1.00 | 0.20 | 0.33 |
Fire Blight | 0.80 | 0.80 | 0.80 |
Fusarium | 1.00 | 0.60 | 0.75 |
Gray Mold | 0.43 | 0.75 | 0.55 |
Leaf Spots | 0.40 | 0.80 | 0.53 |
Mosaic Virus | 0.67 | 0.80 | 0.72 |
Nematodes | 1.00 | 1.00 | 1.00 |
Powdery Mildew | 0.71 | 1.00 | 0.83 |
Verticillium | 0.67 | 0.40 | 0.50 |
Class | Precision | Recall | F1-Score |
---|---|---|---|
Glaucoma | 0.58 | 0.86 | 0.69 |
Normal | 0.73 | 0.38 | 0.50 |
Class | Precision | Recall | F1-Score |
---|---|---|---|
Aneurysm | 0.3725 | 0.5758 | 0.4524 |
Cancer | 0.0000 | 0.0000 | 0.0000 |
Tumor | 0.2708 | 0.3939 | 0.3210 |
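
A single summary figure for the three-class table above can be obtained by macro-averaging, that is, taking the unweighted mean of each column. The averages below are computed here for illustration and are not reported in the source.

```python
from statistics import mean

# Per-class (precision, recall, F1) copied from the table above.
per_class = {
    "Aneurysm": (0.3725, 0.5758, 0.4524),
    "Cancer": (0.0000, 0.0000, 0.0000),
    "Tumor": (0.2708, 0.3939, 0.3210),
}

# zip(*...) regroups the per-class triples into one tuple per metric.
macro_precision, macro_recall, macro_f1 = (
    round(mean(values), 4) for values in zip(*per_class.values())
)
print(macro_precision, macro_recall, macro_f1)
```

The Cancer row contributes zero to every average, which suggests that the class was never predicted correctly and pulls the macro F1 down to roughly 0.26.
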
Metric | Score |
---|---|
BLEU-1 | 0.193 |
BLEU-2 | 0.095 |
BLEU-3 | 0.058 |
BLEU-4 | 0.031 |
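
BLEU-n combines clipped n-gram precisions of orders 1 to n through a geometric mean and multiplies the result by a brevity penalty, which is why the scores above fall as n grows. The sketch below implements the standard single-reference, unsmoothed definition; whether the authors used smoothing, multiple references, or corpus-level aggregation is not stated, and the two example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate: list[str], reference: list[str], max_n: int) -> float:
    """Cumulative BLEU with uniform weights, one reference, and no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


# Hypothetical caption pair, for illustration only.
reference = "a dog runs across the green grass".split()
candidate = "a dog runs on the grass".split()
scores = {n: bleu(candidate, reference, n) for n in range(1, 5)}
for n, score in scores.items():
    print(f"BLEU-{n}: {score:.3f}")
```

In this toy pair no 4-gram matches, so BLEU-4 is zero while BLEU-1 stays high, mirroring the downward trend from BLEU-1 to BLEU-4 in the table.
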
Model | Recognition | OCR | Knowledge | Language Generation | Spatial Awareness | Math | Total |
---|---|---|---|---|---|---|---|
GPT-3.5 | 22.3 | 31.4 | 15.6 | 16.6 | 32.9 | 24.0 | 27.6 |
GPT-4 | 34.3 | 66.3 | 25.6 | 36.6 | 60.6 | 72.0 | 48.1 |
GPT-4V | 67.5 | 68.3 | 56.2 | 60.7 | 69.4 | 58.6 | 67.7 |
GPT-4o | 80.6 | 87.5 | 75.7 | 77.1 | 84.7 | 88.1 | 83.9 |
Question | Required capabilities | GPT-4o response |
---|---|---|
a) How many gallons of supreme gasoline can I get with $50? | OCR, Math | With $50, you can get approximately 13.70 gallons of supreme gasoline. |
b) What is the price difference between diesel and regular gasoline? | OCR, Math | The price difference between diesel and regular gasoline is $0.11 per gallon. |
a) Based on the image, what is the total international awareness of Tina Fey? | OCR, Spatial awareness | Based on the image, the total international awareness of Tina Fey is 35%. This percentage is located at the bottom row under the column labeled "Total" for Tina Fey. |
b) Based on the image, in which country does Tina Fey have the highest level of total awareness? | OCR, Spatial awareness | Based on the image, Tina Fey has the highest level of total awareness in Australia, with an awareness level of 62%. |
a) What is the price for tomatoes? | Recognition, Spatial awareness | The price for tomatoes is 8.00, as indicated by the sign placed above the pile of tomatoes in the image. |
b) What fruit is to the right of plums? | Recognition, Spatial awareness | The fruit to the right of the plums in the image is pears. |
a) Which car is on the parking spot 33? | OCR, Recognition, Spatial awareness | The parking spot labeled 33 is empty. |
b) Which spot is the red car parking on? | OCR, Recognition, Spatial awareness | The red car is parked in spot number 30. |
a) Which category did the worst job in plastic recycling, considering the amount of plastic made and thrown out? | OCR, Spatial awareness | The category that did the worst job in plastic recycling, considering the amount of plastic made and thrown out, is "Packaging." In 2015, 146 million metric tonnes of plastic were made for packaging, and 141 million metric tonnes were thrown out. This shows that almost all the plastic made for packaging was discarded as waste. |
b) What is the difference in metric tonnes between the amount of plastic produced and the amount discarded? | OCR, Spatial awareness | The difference between the amount of plastic produced (407 million metric tonnes) and the amount discarded (302 million metric tonnes) is: 407 million metric tonnes – 302 million metric tonnes = 105 million metric tonnes. So, the difference is 105 million metric tonnes. |
What does this factory make? | OCR | The factory in the image belongs to Leica, which is a company known for manufacturing cameras, lenses, and other optical equipment. |
In which country was this photo taken? | Recognition, Knowledge | The photo shows Novak Djokovic kissing a trophy that appears to be the Australian Open trophy. Therefore, this photo was likely taken in Australia. |
Which are producers in this food web? | OCR, Knowledge, Spatial awareness | In this food web, the producers are phytoplankton and seaweed. These are the organisms that produce energy through photosynthesis and serve as the base of the food web. |
Is it summertime in the scene? Answer it and give the rationale. | Recognition, Knowledge, Language generation | No, it is not summertime in the scene. The presence of large pumpkins suggests that it is likely fall, as pumpkins are typically associated with the autumn harvest season. Additionally, the characters are wearing long-sleeved clothing, which is more common in cooler weather. |
Can you explain this meme? | Recognition, OCR, Knowledge, Language generation | This meme humorously illustrates the common tendency to procrastinate. The text shows an intention to start a task at a specific time (8:00), but when the time is slightly missed (8:05), the person decides to delay the task further, waiting until the next hour (9:00) to start. The image of Kermit the Frog relaxing on a bed adds to the humor by depicting the relaxed attitude towards procrastination. The meme pokes fun at the irrational yet relatable habit of postponing tasks due to missing an arbitrary starting time by a few minutes. |
What is funny about this image? | Recognition, Knowledge, Language generation | This image is a popular meme that combines two unrelated photos to create a humorous juxtaposition. On the left side, there is an image of a woman (Taylor Armstrong from "The Real Housewives of Beverly Hills") angrily pointing and shouting, while another woman (Kyle Richards) tries to calm her down. On the right side, there is a photo of a white cat (known as "Smudge the Cat") sitting at a dinner table with a confused and displeased expression in front of a plate of vegetables. The humor comes from the contrast between the woman's intense emotion and the cat's indifferent and bemused reaction, as if the cat is unbothered by the woman's outburst. This combination creates a relatable and funny depiction of exaggerated reactions and misunderstandings. |
How to cook this dish? | Recognition, Knowledge, Language generation | This image shows a pie with a lattice crust. Here is a basic recipe to make a lattice-top pie, such as a strawberry or mixed berry pie. Ingredients: For the Filling: 4 cups fresh strawberries or mixed berries, 3/4 cup granulated sugar, 1/4 cup cornstarch, 1 tablespoon lemon juice, 1 teaspoon vanilla extract, a pinch of salt. For the Crust: 2 1/2 cups all-purpose flour, 1 teaspoon salt, 1 teaspoon sugar, 1 cup unsalted butter, chilled and cut into small pieces, 1/4 to 1/2 cup ice water. Instructions: In a large bowl, combine the flour, salt, and sugar. Add the chilled butter pieces and cut them into the flour mixture using a pastry blender or your fingers until the mixture resembles coarse crumbs. Gradually add ice water, a tablespoon at a time, mixing until the dough begins to come together. Do not overwork the dough. Divide the dough into two equal parts, shape them into disks, wrap in plastic wrap, and refrigerate for at least 1 hour. In a large bowl, combine the berries, sugar, cornstarch, lemon juice, vanilla extract, and salt. Stir gently to combine. On a lightly floured surface, roll out one disk of dough to fit a 9-inch pie pan. Place the rolled-out dough into the pie pan and trim the edges. Pour the berry mixture into the crust-lined pie pan. Roll out the second disk of dough and cut it into 1-inch-wide strips. Lay 4 to 5 strips over the filling, parallel to each other. Lay another 4 to 5 strips perpendicular to the first set, weaving them to create a lattice pattern. Trim and crimp the edges to seal. Preheat the oven to 375°F (190°C). Bake: Place the pie on a baking sheet to catch any drips and bake for 45-55 minutes, or until the crust is golden brown and the filling is bubbly. Cool: Let the pie cool completely before serving to allow the filling to set. Enjoy your homemade lattice-top pie! |
Can you give a short introduction to this painting? | Recognition, Knowledge, Language generation | This painting is a historical Chinese artwork titled "Emperor Taizong Receiving the Tibetan Envoy," created by Yan Liben during the Tang Dynasty. The piece illustrates a significant diplomatic event where Emperor Taizong of the Tang Dynasty meets a Tibetan envoy. This event highlights the political and cultural exchanges between the Tang Dynasty and Tibet. The painting is notable for its detailed depiction of the figures, the hierarchical arrangement, and the ceremonial aspects of the meeting. Yan Liben, a renowned court artist of the Tang Dynasty, is celebrated for his ability to capture the grandeur and solemnity of imperial events. This artwork not only serves as a historical record but also exemplifies the artistic sophistication of the Tang Dynasty. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 MDPI (Basel, Switzerland) unless otherwise stated