I believe most image-generation models are too small (often only around 4GB of weights). DeepSeek R1 needs roughly 1.5TB of RAM (or half or a quarter of that at reduced precision) to get some semblance of “general knowledge”. So to get the “semantics” of an image right, not just the “syntax”, you’d need bigger models and probably more data describing images. Of course, do we really want that?