Introduction
Back in 2022, Scott Alexander made a bet about the capability of AI text-to-image models (TTIMs, because I can’t be bothered typing that out again) in 'composition'. That is, how well these models can represent multiple entities in the requested positions relative to each other, and apply the correct adjectives to each of them.
I would characterise TTIMs' capabilities along three dimensions:
Coherence: Does the picture make sense- perspective, correct number of limbs/fingers, coherent objects.
Composition: As above- does the model follow the instructions correctly, or just pull a few 'vibes' out of the prompt?
Beauty: Does the model produce nice-looking pictures?
Regarding progress over the last four years or so, and before doing any tests for this article, my view was that these models have made huge strides in coherence, some material progress in composition, and that beauty has improved a little, though it seems to fluctuate from model to model more than the other two dimensions do. Some models improve coherence or composition at a considerable cost to beauty, though the trend is not strong. Modern models don't produce substantially prettier images than models of a couple of years ago.
Some of the views above changed in the course of writing this, though, so read on before challenging any of them.
The Models
The first TTIM I ever played with was Dream by Wombo (the site is here; I've no idea what its current state is). It produced surprisingly pretty images, but it was terrible on the other two dimensions.
It picked up a few basic "vibes" from the prompt- the core subject matter and maybe a location- but it couldn't follow instructions for toffee. It couldn't even really pick up on an art style (something almost all modern models do very well); instead it had a very consistent style of its own.
Coherence was, if anything, worse- the images were invariably bizarre dreamscapes, and any people in them were body-horror abominations.
But it made very pretty, rather haunting images.
At some point Dream added some new options that created much more coherent images- they would have people in them that looked pretty much like people. But composition somehow got worse; at best the model would pick out a single word from the prompt and draw that. And the pictures were dreadful- really boring and flat.
Others I have played with include DALL-E 2 and DALL-E 3 (I am very impressed in general with the latter), GPT 4o (also extremely good, and its multimodal interface allows it to do some things that you can't do with most models), StarryAI (I haven't used this one for a long time, but it was OK in its day), ImageFX, Flux, and SDXL.
Flux is slightly interesting. It has largely solved coherence problems (one of the few open-weights models to have done so), and is definitely much better than 2022-era Dream at both composition and beauty. But it isn't great at composition, and its pictures are not nearly as pretty or interesting as those of the best closed models.
I did play with MidJourney briefly at one point. But at the time I tried it, it didn't seem notably better than the free models, and it was in fact quite a bit worse at composition.
Back to Scott and his bet.
Scott’s List
I was curious what the current state of play is regarding Scott's original suggested prompts:
A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth
An oil painting of a man in a factory looking at a cat wearing a top hat
A digital art picture of a child riding a llama with a bell on its tail through a desert
A 3D render of an astronaut in space holding a fox wearing lipstick
Pixel art of a farmer in a cathedral holding a red basketball
Now I will disappear into a few digressions before getting into this. Firstly, I will say that I don't think these prompts are perfectly crafted to measure TTIMs' increasing compositional capabilities; that's not a criticism of Scott- that was never the point. He thought composition would improve over its 2022 state, and it unarguably has.
All the prompts ask for art in a specific style (stained glass, oil painting, digital art...), and a specific setting (library, factory, desert...).
Generally TTIMs are very good at settings, and pretty good at styles. I think Scott had got a bit frustrated with trying to create his stained-glass window designs, with the model occasionally getting muddled with "stained glass" and putting stained-glass windows in the picture instead of making the whole picture in a stained-glass style.
Even modern models can occasionally get into difficulties with "bleeding" elements of the content and style into one another, though- some are better than others in this respect.
Of the specific prompts:
The first one (which came out of a genuine task Scott was trying to get a TTIM to accomplish) may actually be the hardest of the lot. This is strange. Aside from the style and setting, both of which "match" the content quite well, there are only three elements, and nothing about the request is complicated or unexpected. But TTIMs really seem to struggle with this one. Incorporating a woman, raven, and key is fine: getting both the key in the raven's mouth and the raven on the woman's shoulder seems to be borderline impossible.
The second one, which has the most obvious "gotcha", actually seems to be the easiest by far for modern models. The unexpected bit is that the cat is wearing the top hat, and indeed many models (MidJourney, for example, unless it's got better at this recently) relentlessly put the top hat on the man. But many models have now largely solved this one.
This at first blush seems weird. Like the first prompt, it has a style, a setting, and three entities in specific relation to each other. My guess is that "looking at" and "wearing" are more basic, elemental relations than "on her shoulder" and "in its mouth", and the latter just slightly overwhelm the model.
The third prompt causes an odd problem. Basically every model gets every aspect of it correct, except the bell on the tail, which seems very difficult.
The tricky bit of the fourth prompt is mostly the "wearing lipstick" part. Unlike the cat wearing a hat, which seems easy, a fox wearing lipstick causes models problems.
The fifth I would say is another easy one, and indeed it is almost never failed. Several people have nit-picked results for it, though, and seem to disagree.
There's been a lot of focus on the basketball being "red". I have to admit that until I read such a discussion, I never even clocked that Scott had asked for an unusual basketball colour- I would happily have accepted a claim that basketballs were usually red. Generally one gets a dark reddish-orange colour out of a model with this prompt. I think nit-picking this is unfair, because it's asking the model to do something you wouldn't necessarily expect a human artist to do. If I asked a human artist for this prompt, and got a darkish orange basketball, I wouldn't think they were an idiot; I'd think they'd done what I asked.
Now to be fair, if I modified the prompt a bit for my human artist, and said
Pixel art of a farmer in a cathedral holding a red (note: not orange!) basketball
I would expect them to get it right. And I doubt this would help most TTIMs much, if at all. But let's find out!
Scott claimed, three months after making his bet, that he'd won it with prompts 2, 3, and 5. Looking at his images, I essentially agree, although I think he got very lucky with the llama's bell.
Edwin Chen thinks 2024 DALL-E manages 2 and 5, and comes very close with 4 (it probably would have managed 4 given enough tries). I think it had a higher "hit rate" on 5 than Edwin gives it credit for, but again, I essentially agree.
Edwin also tried this with MidJourney. He thinks it fails on everything. Again, I broadly agree, although I think it does a better job on prompt 5's "farmer" and "red basketball" than Edwin allows; oddly, it fails on "pixel art"- most of its images are not pixel art. I suspect it would have succeeded on 5 given enough tries.
Okay, let's give it a go.
The Tests
Unless noted otherwise, I'm looking at 12 images per prompt. The original bet specified 10, but most of the model interfaces output results in batches of four.
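I did all of this through the free web interfaces, but if you wanted to script a run like this instead, a rough sketch against the OpenAI image API would look something like the following. This assumes the openai Python client and the DALL-E 3 endpoint (which only returns one image per call, hence the loop); the prompt list and file names are just illustrative.

```python
# Rough sketch (not my actual setup- I used the free web interfaces):
# generate 12 images per prompt via the OpenAI API, one call per image,
# since the DALL-E 3 endpoint returns a single image per request.
import urllib.request
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = [
    "A stained glass picture of a woman in a library with a raven "
    "on her shoulder with a key in its mouth",
    "An oil painting of a man in a factory looking at a cat wearing a top hat",
    # ...and the other three
]

for p_idx, prompt in enumerate(prompts, start=1):
    for i in range(12):
        result = client.images.generate(
            model="dall-e-3", prompt=prompt, n=1, size="1024x1024"
        )
        # Save each returned image for later eyeballing.
        urllib.request.urlretrieve(result.data[0].url, f"prompt{p_idx}_{i}.png")
```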
DALL-E 3
Prompt 1 (The Raven):
As expected, DALL-E 3 does not perform particularly well here. All images are recognisably stained glass, and plausibly in a library. All contain a woman and a raven, and all but one contain a key. The raven is on the woman's shoulder about half the time. The key is in the raven's mouth only once, and in that image the raven is not on the shoulder and the key appears to have been almost fully swallowed by the raven (so technically correct, but really not what you'd expect a human artist to do).
Prompt 2 (The Cat in the Hat)
This is surprisingly poor! On my first attempt, only 1 of 12 images was correct (all the others put a hat only on the man).
Responding to some criticism that the prompt is technically ambiguous, I edited it to
An oil painting of a man in a factory looking at a cat. The cat is wearing a top hat.
To my surprise, this improved things dramatically. Now about half of the results are correct (the images above are from the edited prompt). Still poorer than Edwin found, mind you.
All images depicted a hat on the man as well as the cat. This isn't *wrong* per se, but I suspect we'd have trouble if we tried to insist on the man being bare-headed.
Prompt 3 (The Benighted Bell)
Every single output was in the correct style, showed a child riding a llama in a desert, depicted a bell on the llama, and was absolutely gorgeous. Every single output put the bell round the llama's neck, not on its tail. Why is this so much harder than putting a hat on a cat?
Prompt 4 (The Voluptuous Vulpine)
Here the (lip)sticking point is getting lippy on the fox. And this prompt seems much more prone to horrible mangling than the others- see the third image above. By which I mean, sometimes the fox and astronaut get merged together. Either we just get a fox astronaut, or something even weirder happens.
Also, all the astronauts are women. Strange that the prior on “astronaut” (usually drawn as a man) appears to be much weaker than the prior on “lipstick” (usually worn by women), even though we didn’t want the lipstick on the astronaut in the first place.
Prompt 5 (We can take it as red)
As expected, DALL-E 3 definitely passes this one- the third image above qualifies on all counts. But again, performance was poorer than I expected.
Here there is much more variation in how it fails. All outputs are pixel art, and all depict a farmer, a cathedral, and a basketball. But sometimes the farmer visibly isn't holding the basketball- he's bouncing or throwing it. Sometimes the farmer is outside the cathedral instead of inside- see the second image. Sometimes the basketball isn't red. Overall I would say around a third of outputs are correct, although if you're being fussy about the colour of the ball you might downgrade that a bit.
Also, many of the results include wicker baskets. I suppose the combination of “farmer” and the word “basketball” was too much to resist. This reminds me of the way any prompt to a TTIM involving the word “witch” will result in an image full of pumpkins.
GPT 4o
GPT 4o produces images very slowly, one at a time, and with a very low daily limit for freeloaders like me. So I've not produced 12, or even 10, per prompt.
You prompt GPT 4o very differently from most of these models, and it sometimes asks follow-up questions if it thinks you haven't been sufficiently clear. For example, on the llama/bell prompt, it wanted to know the time of day, further details on the style (cartoon, realistic, etc.- "digital art" is pretty vague), and the age/sex/ethnicity of the child.
If GPT 4o doesn't do what you asked for first time, you can ask it to try again, explicitly telling it what it got wrong, and it usually fixes it.
Prompt 1 (The Raven):
Fully compliant result in a single attempt. I would note, though, that some beauty seems to have been sacrificed for compositional accuracy. Although this result is clearly what I asked for, and DALL-E 3's results are clearly not, I can't help but think the DALL-E 3 images are much prettier.
You could also complain that the raven is holding the key in an odd way; it looks like it's held by a string that enters the raven’s mouth “from behind”, which is an odd way to draw it. But I’d call that a problem of coherence, not composition- the model is following instructions; it just doesn’t draw something complicated that well.
Prompt 2 (The Cat in the Hat)
OK, this is pretty amazing. Not only is everything correct, but the man is bare-headed and looks like a factory worker.
Also, this is clearly an oil painting. Many of the images produced by other models could plausibly be oil-paintings, just about; but actually look more like digital art.
You could complain that the cat is maybe a bit large relative to the man? That’s true, but there are cats approaching that size, and old oil paintings often had issues with odd relative sizes anyway. And again, that’s a coherence problem, not a composition issue.
Yes, the aspect ratio has changed; GPT 4o decides what aspect ratio to use on the fly- you can tell it what size image you want if you care.
Prompt 3 (The Benighted Bell)
Perfect result in a single attempt.
Well, OK, you could complain that the bell, while clearly on the tail, isn’t visibly attached. But again, that’s a problem of coherence, and I’m only judging composition here.
Prompt 4 (The Voluptuous Vulpine)
Again, GPT 4o basically solves this one in one go.
Prompt 5 (We can take it as red)
Unsurprisingly, no issues here.
ImageFX
I didn’t really know what to expect here, as I haven’t used this model much.
Prompt 1 (The Raven):
Now this is odd. Every single raven has a key in its mouth, as required- something no other model besides GPT 4o was able to cope with. But there are other problems. For a start, none of the ravens is on the woman’s shoulder. In fact most aren’t perched on the woman at all- there are three on her hand above, but none of the others is even touching her.
Also, I’m really not sure I would fully pass these as stained glass. They’re a bit stained-glass-like, but not clearly so. Many of them have windows in them (none of them stained glass, though), suggesting the model is confusing style and content to a degree.
Prompt 2 (The Cat in the Hat):
Bingo! 12 out of 12. The odd thing, though, is that the images above are all very similar- the man has the same pose in all four. Come to think of it, the women and ravens were pretty similar as well. I wonder if ImageFX is doing something slightly different here- essentially generating one image and then minor variations of it, rather than four images from scratch?
I tried a few more times, and sure enough, each time it generated four images similar to each other, but not so similar to other attempts at the same prompt. This is not how DALL-E 3 (in Bing Image Creator) behaves- there the four images you get will be just as different as two images from two different runs of the same prompt.
The cat is even the right size. But it’s not quite so “oil-painting”y as the GPT 4o offering.
Prompt 3 (The Benighted Bell):
Yep! About half of the results had a bell on the tail. Not all, though- see the bottom right. Also the top right is a bit weird- I’d pass it as “on the tail”, but I can see others objecting.
This one is not on the tail. It has other issues too:
Prompt 4 (The Voluptuous Vulpine):
Pretty good! I think that’s 8 out of 12- although it’s really more like 2 out of 3, given the way ImageFX seems to work (each batch of four is essentially one result).
Prompt 5 (We can take it as Red):
Ummm….
Well I guess it passes. Top marks for composition.
But these are awful images- although both are pixel-art, the farmer is clearly in a different style to the background. And it’s not just bad luck- other attempts at the same prompt were similarly poor. No clue what is going on here.
Flux
I didn't have high hopes for this one, but I thought I'd give it a go. Flux is an open-weights model, so I'm running it locally on my own GPU.
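For anyone wanting to reproduce this locally, a minimal sketch using the Hugging Face diffusers library and the openly available FLUX.1-schnell weights looks something like the following- I won't claim it's exactly my setup, just the general shape of it (the dev variant would need more inference steps and non-zero guidance).

```python
# Minimal local Flux sketch, assuming the diffusers library and the
# FLUX.1-schnell open weights.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps if the model doesn't fit entirely in VRAM

prompt = ("A stained glass picture of a woman in a library with a raven "
          "on her shoulder with a key in its mouth")
# schnell is distilled for few-step generation with no classifier-free guidance.
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("raven.png")
```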
Prompt 1 (The Raven):
Okay, these are actually a lot better than I expected. Stained glass, uh, mostly. Library, absolutely. Woman, raven, key: check (a few don’t contain keys). In a couple the raven is even on the woman’s shoulder. But no keys in ravens’ mouths. And what the hell is going on in image two?
Prompt 2 (The Cat in the Hat):
Now this is odd. Flux manages to very consistently put a hat on the cat (it failed only once). And some of the men are bare-headed too, as a little bonus. A few of the images are only questionably of factories, but that’s a pretty minor issue.
No, the problem is that these are not oil paintings! I’ve cut models a lot of slack here, but this goes too far. These are basically all photo-realistic. Odd for it to fail on style- that should be the easy bit.
Prompt 3 (The Benighted Bell):
I’m again surprised how good these are. And here, we have consistently managed to get the style right. But no bells on tails, sorry.
Prompt 4 (The Voluptuous Vulpine):
I’m less impressed here, although honestly these aren’t much worse in general prettiness than any of the other models’ outputs. But none of them manages to fulfil the prompt. The two on the left don’t even feature lipstick; there are a few with a fox wearing lipstick, but it isn’t being held by anyone, and the third one isn’t even in space.
Prompt 5 (We can take it as Red):
Finally, a pass! Not every image passed- the one on the right I would say doesn’t qualify as “in a cathedral”. And if you’re going to be fussy about colour the second one isn’t great either.
Also, it seems to be only oil paintings that Flux struggles with. My guess is that an oil painting is just a bit closer to photorealism than the other four styles, and the model gets “pulled” into the photorealistic style it is most familiar with.
Conclusions
So…
DALL-E 3 is very pretty, but manages only two of the prompts- the farmer and the cat- and those quite inconsistently.
GPT 4o blows all the competition out of the water on composition, but is generally less pretty.
ImageFX manages four out of five- it fails only on the raven. It sometimes has some weird coherence issues, which none of the others have.
Flux, despite a better showing than I expected, manages only the farmer.
I am genuinely puzzled why the raven (the original problem Scott wanted solved) turns out to be the hardest of the lot, while the other four (which were created with deliberate “gotchas”) are all much more doable.