From a single viewpoint it can, but just like a human artist doing the same with a still photo, it may make some incorrect assumptions about size and depth. For example, I just used AI to generate an image of our two dogs as two of the reindeer pulling Santa's sleigh, as a fun little thing to print on the back of our Christmas cards. This is something that'd have taken me an hour or three with Photoshop a few years ago, but it took me all of 5 seconds to upload a photo and type what I wanted into ChatGPT. In any case, it did a great job, except that the photo I uploaded had only one view/perspective, with our 13 lb. dog in the foreground and our 32 lb. dog in the background, so ChatGPT assumed they must be the same size.
Uploading multiple images would have allowed it to see the animals from more sides and make better assumptions about relative size, etc. Likewise with video, but my God... the processing power to synchronize every frame of a few minutes of video! Just 5 minutes of video at 60 fps is 18,000 frames per view, so with two views that'd be 36,000 images uploaded and synced into pairs to generate another 18,000 frames.
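For anyone who wants to check my math, here's a quick back-of-the-envelope sketch. It assumes two synced views (that's where the "pairs" come from) and one generated frame per pair; the function name and the `views` parameter are just made up for illustration.

```python
def frame_counts(minutes, fps, views=2):
    """Rough frame tally for synced multi-view video.

    Assumes every view runs at the same frame rate and that each
    synced set of frames yields exactly one generated output frame.
    """
    frames_per_view = minutes * 60 * fps   # seconds of video times frames per second
    uploaded = frames_per_view * views     # total images sent up across all views
    generated = frames_per_view            # one output frame per synced set
    return uploaded, generated


uploaded, generated = frame_counts(minutes=5, fps=60)
print(uploaded, generated)  # 36000 18000
```

Dropping to 30 fps would halve both numbers, but it's still a huge pile of images for a short clip.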

People complain about data center power usage?