Why do the rendered video sequences always end in the same starting position as the control image?

#104
by bbcreativo - opened

Prompt: a person with her hands on her lap should wave her hands.
The video doesn't end with the waving, but with her hands back on the lap.
So many frames are lost on turning back to the starting position.

WHY?

Maybe you put end frame(s) in your workflow? The generated video will aim to end on those frame(s) every time.


this is just a workaround suggestion, but you can pick a frame from the middle of your generated video where the hands are actually in the air waving and use that frame as end_image ;-)
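If you want to script the frame grab, here's a minimal sketch with OpenCV (the filenames are just placeholders, not anything from the workflow):

```python
# Grab a frame from the middle of the first-pass video to reuse as end_image.
import cv2

video = cv2.VideoCapture("first_pass.mp4")        # placeholder filename
total = int(video.get(cv2.CAP_PROP_FRAME_COUNT))

video.set(cv2.CAP_PROP_POS_FRAMES, total // 2)    # seek to the middle frame
ok, frame = video.read()
if ok:
    cv2.imwrite("end_image.png", frame)           # load this as end_image
video.release()
```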

Maybe you put end frame(s) in your workflow? The generated video will aim to end on those frame(s) every time.


No, I don't...

this is just a workaround suggestion, but you can pick a frame from the middle of your generated video where the hands are actually in the air waving and use that frame as end_image ;-)

If I render a complete video just to take a frame out of the middle, isn't that counterproductive?

this is just a workaround suggestion, but you can pick a frame from the middle of your generated video where the hands are actually in the air waving and use that frame as end_image ;-)

If I render a complete video just to take a frame out of the middle, isn't that counterproductive?

nope, just a bad explanation from my side -> I meant: in the 1st generation you make the video you are not satisfied with (i.e. it ends with the hands in the lap), and from that video you extract the frame, so you can use it in the 2nd generation, now with the end frame. So you will get your desired result at the end... but at the end of the 2nd generation, I know... Do you understand me better now?

It's like the model runs the frames in a sine wave: from start to end, back to start.
E.g. PROMPT: THE WOMAN SITS HERSELF DOWN. RESULT: THE WOMAN SITS DOWN, AND STANDS UP AGAIN. I just don't understand the reason for that behaviour...
A lot of frames are also lost that way. 🤷‍♂️

One problem could be that your prompts are too short. Everything you describe in the prompt may already be achieved within the first half of the video, so the model has no further information on what to do next and drifts back towards the starting position.

Try something like: THE WOMAN SITS HERSELF DOWN and does not move anymore.

I came across this with my video extend workflow, where for example you prompt something like "He spills a glass of milk on the table." and toward the end the milk would magically evaporate. After some poking around, I tracked it down to something funky with the Wan Video VACE Start To End Frame node. I suspect internally it might be doing something like defaulting the end image to the start image if no ending image is provided. I was able to get around it by copying this bit out of the default VACE workflow in Comfy:

[screenshot: node group copied from the default VACE workflow]

@OrionFL79 Wan Video VACE is VERY complicated to understand. And when custom nodes are put in front of it, it gets even more complicated. It is not just start and end frame(s) + reference image. There is so much more involved, like real and false masking, and how VACE reacts to what is given to it in which format. This Reddit post is a good starting point for understanding what you need to do to get which kind of results: https://www.reddit.com/r/StableDiffusion/comments/1m04uv6/wan_21_vace_howto_guide_for_masked_inpaint_and/

I completely agree, and without any sort of authoritative documentation it leads to a lot of detective work, trial, error, and comparison to debug when things glitch out. Or, if you're really brave and have a high tolerance for pain (at least in my case as a .NET dev), looking through some of the source code.

I do, however, stand corrected on my assumption: after looking through the source for that node (https://github.com/kijai/ComfyUI-WanVideoWrapper/blob/cfdae3b49f10561138f60fb1759c4675c2537d0a/nodes_utility.py#L94) I couldn't immediately spot anything where it's setting a default if an input value is empty. But trial, error, and comparison still seem to point to something in there as the culprit.

It's only about 100 lines of code, but Python... Yuck. Overall, it looks like it's doing the following:

1 - If no start or end image is provided but control images are, it fills the latent with the control images.
2 - It adds in a bunch of empty masks.
3 - It resizes / upscales things if the height / width of the provided images don't match the inputs.
4 - If a start image is provided, it plops that in at the first frame.
5 - If an end image is provided, it plops that in at the last frame.
6 - If control images are provided, it goes through and fills in any blank frames with those.
7 - If there's an inpaint mask provided, it does some kind of masking magick with that.
8 - It shovels the whole thing off through the outputs to be fed into the sampler.

In all, it seems to be sort of a Swiss Army Knife / convenience node that handles multiple use cases when prepping things to go into the sampler.
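To make that flow concrete, here is a rough Python sketch of those steps. All the names, shapes, and mask conventions in it are my assumptions for illustration; it is not the actual WanVideoWrapper code.

```python
import torch

def prep_vace_input(num_frames, start_image=None, end_image=None,
                    control_images=None, h=480, w=832):
    frames = [None] * num_frames
    if start_image is not None:
        frames[0] = start_image            # step 4: start image at frame 0
    if end_image is not None:
        frames[-1] = end_image             # step 5: end image at the last frame
    for i in range(num_frames):
        if frames[i] is None:
            if control_images is not None:
                frames[i] = control_images[i]   # steps 1/6: fill with control
            else:
                frames[i] = torch.full((3, h, w), 0.5)  # plain gray filler
    # step 2: one "empty" mask per frame (assuming 1.0 means "generate here")
    masks = torch.ones(num_frames, 1, h, w)
    # steps 3, 7, and 8 (resizing, inpaint masking, output plumbing) omitted
    return torch.stack(frames), masks
```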

However, with it in the chain just feeding in my source image, I get the evaporating-liquids / reset-to-the-source issue at the end of the clip. Taking that out and replacing it with this glob of spaghetti from the Wan VACE Reference to Video template in Comfy (only spread out so I could tell what the heck it was doing) resolves the issue.

[screenshot: the spread-out node graph from the Wan VACE Reference to Video template]

For the spaghetti blob:
On the mask side it's just shoveling a transparent mask into the first frame, then filling in the rest with a bunch of solid empty masks.
On the image side, it's setting the first frame to the source image, then reusing the solid empty masks as filler for the rest of the frames.

Which is pretty much all the convenience node was doing, only it's relying on a bunch of nodes instead of Python code. ^_^
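For anyone who'd rather read it as code than as a node graph, here's a minimal sketch of what I just described. It assumes "transparent" means keep (0.0) and "solid" means regenerate (1.0); the shapes and fill values are illustrative, not the template's exact settings.

```python
import torch

def build_inputs(source_image, num_frames, h=480, w=832):
    # image side: frame 0 is the source image, the rest are flat gray filler
    filler = torch.full((3, h, w), 0.5)
    images = torch.stack([source_image] + [filler] * (num_frames - 1))

    # mask side: frame 0 is kept as-is, every other frame is left to the model
    keep = torch.zeros(1, h, w)
    regen = torch.ones(1, h, w)
    masks = torch.stack([keep] + [regen] * (num_frames - 1))
    return images, masks
```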

As for masking in general, this isn't really something I use or need to play with, so it's one of those things I leave at the default. With my use case I'm like: here, take this image and this prompt and make it do something. ^_^

I did, however, spot one little oops when I spliced that in: I'm not setting the height / width of the masks to match my input image ... which may actually be the cause of another glitch I was running into, where all of a sudden toward the end of the extend chain the whole thing sometimes starts to turn into an anime video. @_@
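The fix is presumably just resizing the mask batch to the input image's dimensions before wiring it in. A quick sketch (the tensor layouts here are assumptions):

```python
import torch.nn.functional as F

def match_mask_size(masks, image):
    # masks: (num_frames, 1, H, W); image: (C, H_img, W_img)
    target = image.shape[-2:]          # height / width of the input image
    return F.interpolate(masks, size=target, mode="nearest")
```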
