1
00:00:00,000 --> 00:00:04,480
Stable Diffusion 3 is released. I know you are busy, so I'm gonna give you a summary right here.
2
00:00:05,120 --> 00:00:11,840
Yes, you should start using SD3. Can I use it day one? Yes, I'll show you where to download it,
3
00:00:11,840 --> 00:00:16,080
how to get started. Will it give me better results day one? Probably not. Needs fine
4
00:00:16,080 --> 00:00:21,680
tuning. Is it still worth using? Yes. Someone told me this model is bad, is it? Well no, it's a medium
5
00:00:21,680 --> 00:00:28,240
size 2B model compared to the 8B model. You will be fine and you probably won't need the 8B model
6
00:00:28,240 --> 00:00:33,760
until you get a better GPU and I think most fine tunes will be made on the 2B model. Anyway,
7
00:00:33,760 --> 00:00:40,000
why is it better? Text, prompt understanding, 16-channel VAE. If you want to know more, keep watching.
8
00:00:40,000 --> 00:00:46,080
oh and you know, some more stuff. I'm clearly the superior model. People can't get enough of my
9
00:00:46,080 --> 00:00:52,240
anime art and all that amazing ControlNet setup. Please, ControlNet? I got that too, and also much
10
00:00:52,240 --> 00:00:57,760
higher resolution. Don't be silly, those ControlNets and LoRAs can't match my 1.5 ControlNets, and I can
11
00:00:57,760 --> 00:01:03,440
do higher resolution too, with hires fix and tiled upscales, just to name a few. Step aside kids,
12
00:01:03,440 --> 00:01:09,760
you're outdated. I can do text, you know, letters that actually go together and make words. Let me
13
00:01:09,760 --> 00:01:16,400
spell it out for you. Bye bye. Who does words anyway? Exactly, can you even animate? Well I'm
14
00:01:16,400 --> 00:01:26,240
not sure yet but I got prompt understanding. Better faces and hands, well at least sometimes.
15
00:01:27,840 --> 00:01:37,680
No one can do hands. Yikes. Are you even fine tuned? Well no, not yet but I mean the community
16
00:01:37,680 --> 00:01:42,800
will surely fix that. I have thousands of models and LoRAs, you'll never match that. I'm sure
17
00:01:42,800 --> 00:01:49,040
people will fine tune me. I'm also safe to use, unlike some of you guys. What does that even mean?
18
00:01:49,040 --> 00:01:53,600
I'm all about unlimited control. Hi guys, I can do pretty images.
19
00:01:53,920 --> 00:02:06,080
So jokes aside, SD3 will probably outperform both 1.5 and SDXL. Will it do it day one? Well
20
00:02:06,720 --> 00:02:13,040
probably not, maybe not. Maybe. It's a starting point. I think we will need some fine tunes from
21
00:02:13,040 --> 00:02:19,680
the community to see Stable Diffusion 3 excel and shine as much as possible. But there are some key
22
00:02:19,680 --> 00:02:26,800
architectural features where Stable Diffusion 3 really outshines other models. First, let's talk about
23
00:02:26,800 --> 00:02:35,600
the VAE. So previous models used a 4-channel VAE. Now you can go 4 channel, 8 channel,
24
00:02:35,600 --> 00:02:45,280
16, 32, etc. You get the point. And Stable Diffusion 3 uses a 16-channel VAE. You can clearly see it
25
00:02:45,280 --> 00:02:51,520
in this comparison image here, which I think is a great reference for showing the difference.
26
00:02:51,520 --> 00:03:01,040
And this is not just about outputting an image. It's actually incorporated when you train the model.
27
00:03:01,040 --> 00:03:08,400
So as you are training the model, you will retain more detail and will be able to output more detail.
28
00:03:08,400 --> 00:03:18,400
Now SD3 is a 1024x1024 pixel model not to be confused with the previous one. So SD1.5 was trained
29
00:03:18,400 --> 00:03:27,760
for 512x512. SDXL was trained for 1024x1024. The difference here, though, is that SD3 works well
30
00:03:27,760 --> 00:03:36,480
when doing 512x512 images. This does not mean that it's a 512 model. This is a 1024 model
31
00:03:36,480 --> 00:03:43,760
that can be used at more sizes than previous models. So this is good, especially for people that
32
00:03:43,760 --> 00:03:52,800
are not on a massive GPU machine. I mean I'm on a 4090 so I can generate a lot, not everything,
33
00:03:52,800 --> 00:04:00,240
but a lot. But many people are not able to run SDXL. So having a model, a better model that can
34
00:04:00,240 --> 00:04:08,320
run 512x512, and do that faster and with fewer resources than SDXL, that's a huge deal
35
00:04:08,320 --> 00:04:16,160
for many people. And one of the reasons for that is that we are getting the medium model or the 2B
36
00:04:16,160 --> 00:04:25,040
model. So the full-weight one is the 8B. For the majority of people, 2B will be fine. It's
37
00:04:25,040 --> 00:04:31,200
going to be great. Requirements are going to be much lower than for SDXL. You're going to have a
38
00:04:31,200 --> 00:04:37,840
great time with it on most machines. The diminishing returns you get from the increased quality in
39
00:04:37,840 --> 00:04:45,760
8B compared to the resources you require. I mean the curve goes like and then it flattens out.
40
00:04:45,760 --> 00:04:53,280
There's the diminishing returns, right? So is 8B better? Well, yes. Yes. Yes, it is. Will you need
41
00:04:53,280 --> 00:05:00,640
it? Probably not. At least right now, the majority of users will want to use the 2B one because it's
42
00:05:00,640 --> 00:05:06,640
going to be faster to run. And if you eventually get the 8B one and you want to run it, sure, go for it.
43
00:05:06,640 --> 00:05:12,560
It's going to be slower. You're going to get a slight increase in quality where some areas are
44
00:05:12,560 --> 00:05:19,280
going to be higher quality than others. So it depends on what your end goal is. But right now,
45
00:05:19,280 --> 00:05:26,560
go for the 2B one. That's the one we have. And start fine-tuning it because it's a great base.
46
00:05:26,560 --> 00:05:31,920
Just jumping back to the VAE and the 16-channel thing. That's going to be a huge improvement.
47
00:05:31,920 --> 00:05:35,920
And if you start digging in the research paper of Stable Diffusion 3, under the chapter
48
00:05:35,920 --> 00:05:42,240
Improved Autoencoders here, you can see: similar to this research paper from 2023,
49
00:05:42,240 --> 00:05:46,240
we find that increasing the number of latent channels significantly boosts
50
00:05:46,320 --> 00:05:54,080
performance. See Table 3. And Table 3 is this one here. Now it's not as pretty as the image we
51
00:05:54,080 --> 00:06:02,560
looked at before. These are, this is a comparison, 4 channel, 8 channel, 16 channel and the FID score
52
00:06:02,560 --> 00:06:12,320
here, which is something called Fréchet inception distance. The arrow indicates which is better.
53
00:06:12,320 --> 00:06:16,560
So a lower score is better. 16 channel clearly outperforms the other ones.
54
00:06:16,560 --> 00:06:23,200
Perceptual similarity: also lower is better. It's almost twice as good as the 4 channel one.
55
00:06:23,840 --> 00:06:29,440
And in all the other data, it outperforms the other channels. You can keep reading here, but
56
00:06:29,440 --> 00:06:34,240
basically, you know, yadda yadda yadda. Best models with increased capacity should be able to
57
00:06:34,240 --> 00:06:39,600
perform better for a larger yadda yadda, ultimately achieving higher image quality.
58
00:06:39,600 --> 00:06:45,600
We confirm this hypothesis in figure 10. Let's look at figure 10. And here's a channel,
59
00:06:45,600 --> 00:06:52,000
16 channel is the yellow one here. So here are a couple of images that I pulled from the research
60
00:06:52,000 --> 00:06:58,080
paper. I'm going to compare these to Midjourney, Stable Diffusion XL and some DALL-E 3
61
00:06:58,080 --> 00:07:04,000
generations. Now it's not a fair comparison. Okay, I'll give you that because not everyone
62
00:07:04,000 --> 00:07:10,800
handles the prompting in the same way. So using a prompt in Midjourney, the same prompt in DALL-E
63
00:07:10,800 --> 00:07:15,680
and the same prompt in Stable Diffusion, it's not a fair comparison. I get it. It's apples, oranges,
64
00:07:15,680 --> 00:07:20,640
and all that. I get it. I get it. I get it. But I think it's, it's, you know, people want to see
65
00:07:20,640 --> 00:07:25,360
that kind of comparison. So I figured I'll do that. And we're going to use the images again,
66
00:07:25,360 --> 00:07:32,240
like in the research paper, we're also going to render them with the actual SD3 model and see
67
00:07:32,240 --> 00:07:38,720
how cherry-picked these are. So first we'll look at this. We have a beautiful pixel art of a wizard
68
00:07:38,720 --> 00:07:44,000
with hovering text: achievement unlocked, diffusion models can spell now. This one looks, well,
69
00:07:44,000 --> 00:07:49,280
looks kind of good. On the right here, we have this frog. Frog sitting in a 1950s diner wearing
70
00:07:49,280 --> 00:07:53,280
a leather jacket and a top hat. On the table is a giant burger and a small sign that says Froggy
71
00:07:53,280 --> 00:07:59,680
Fridays. And I mean, it looks pretty good. We have the frog. It's in a diner. He's wearing a
72
00:07:59,680 --> 00:08:05,200
leather jacket and a top hat. You have the burger and the sign is spelled correctly. Now,
73
00:08:05,200 --> 00:08:10,800
I don't know how many generations it took to get that or if they got it like 90% plus of the time,
74
00:08:10,800 --> 00:08:15,840
but I'm sure we'll see that once we generate it ourselves. And here we have something I think
75
00:08:15,840 --> 00:08:21,920
is kind of cool. This translucent pig, inside is a smaller pig. So spoiler, I did this in some of
76
00:08:21,920 --> 00:08:26,880
the other tools and this was a hard prompt to get right. And the last one here, a massive alien
77
00:08:26,880 --> 00:08:32,080
spaceship that is shaped like a pretzel. So all of them kind of cool. And let's look at the first
78
00:08:32,080 --> 00:08:38,480
comparisons. So in this one here, we have old SDXL on the left, we have Midjourney in the middle,
79
00:08:38,480 --> 00:08:46,000
and we have DALL-E on the right. Clearly, in the SDXL one, while it looks kind of good as an image,
80
00:08:46,000 --> 00:08:51,760
the text is not working at all. We get some froggies, some fargoggy, frog roggies, Somali
81
00:08:51,760 --> 00:08:58,960
bird eyes. It just doesn't make any sense. The frog, however, is wearing a leather jacket and a
82
00:08:58,960 --> 00:09:04,880
top hat in all three shoulders of bowling and bowler hat bowling. I don't know. It's kind of
83
00:09:04,880 --> 00:09:10,480
getting it right. Not in all of them. The burger is in front of the frog in three of the images.
84
00:09:10,480 --> 00:09:15,600
You know, we're halfway there. It's okay. The text is not okay. In the Midjourney one,
85
00:09:15,600 --> 00:09:23,200
we have this very Midjourney, cinematic, artsy kind of style. Froggy Fridays is actually working
86
00:09:23,200 --> 00:09:30,560
in most of the images. The first one here is a slight misspelling. That one just says froggy,
87
00:09:30,560 --> 00:09:38,240
but in general, I think it's okay. This one said Fridays here on the little sign down here. This
88
00:09:38,240 --> 00:09:44,080
one's pretty good. It's actually straight and everything. Froggy Fridays. We did get some
89
00:09:44,160 --> 00:09:50,160
Fridays and froggy down here too. In the DALL-E one, the images look kind of DALL-E. We got froggy
90
00:09:50,160 --> 00:09:56,960
Fridays, but it's hardly working. In the third image here, we actually get a properly spelled
91
00:09:56,960 --> 00:10:04,160
Fridays. Now bear in mind, I've not cherry picked any of these. It's just the first four generations
92
00:10:04,160 --> 00:10:10,800
out. I can be lucky. I can be unlucky. Again, this is an apples to oranges kind of comparison. We'll
93
00:10:10,800 --> 00:10:19,440
look at the SD3 ones in a bit. Pig here. Translucent pig, inside is a smaller pig. SDXL: it's just one
94
00:10:19,440 --> 00:10:24,720
pig. It is sort of translucent. Midjourney: while you're getting two pigs in some of them,
95
00:10:25,440 --> 00:10:32,880
none is actually inside the other pig. DALL-E actually has all of the smaller pigs inside
96
00:10:32,880 --> 00:10:40,000
the big pig. They're not really translucent. You can like, there's a part where you can see inside.
97
00:10:40,000 --> 00:10:46,400
So, you know, I would say, as far as understanding what the task was, DALL-E kind of, you know,
98
00:10:46,400 --> 00:10:53,040
solved it in the best way, but also didn't. You know, I think Midjourney in general, in this
99
00:10:53,040 --> 00:10:58,720
instance, had the most realistic looking images. Looking at the wizard, we clearly have
100
00:10:58,720 --> 00:11:06,240
different styles. SDXL pixel art is very minimalistic. The Midjourney one is very artsy and the DALL-E one
101
00:11:06,240 --> 00:11:12,960
is somewhere kind of in between. And in this instance, the Midjourney ones actually have the
102
00:11:12,960 --> 00:11:18,240
better text. We have achievement unlock diffusion models now, achievement unlock diffusion can
103
00:11:18,240 --> 00:11:26,320
spell now. It's a long sentence. And it's, you know, models now diffusion can spell. It's getting
104
00:11:26,320 --> 00:11:32,160
there, right? Whereas in the DALL-E one, the words are kind of in the right place, but everything's
105
00:11:32,720 --> 00:11:39,120
misspelled and all over the place. I should aim the lock diffusion your models can spell now. It's
106
00:11:39,120 --> 00:11:45,600
like, yeah, you can see that it tried, but it is not useful for anything. The good part is the words
107
00:11:45,600 --> 00:11:49,760
kind of get, you know, in the proper positions. You could go in after the fact in Photoshop and,
108
00:11:49,760 --> 00:11:56,320
you know, fix that. And SDXL is all over the place. This isn't working. So compared to that, SD3,
109
00:11:56,320 --> 00:12:01,840
you know, that's a huge step. And here again, SDXL on the left, Midjourney in the center and
110
00:12:01,840 --> 00:12:10,080
DALL-E on the right. Starting with the SDXL one, we got spaceships. And they are really pretzels,
111
00:12:10,080 --> 00:12:15,360
like they're super pretzels. They're actual pretzels, probably not what we're looking for.
112
00:12:15,360 --> 00:12:22,000
In the Midjourney ones, they are getting there, like you're getting a spaceship in weird kind of
113
00:12:22,000 --> 00:12:28,400
dimensions. And it's trying to like circle them up or, you know, piece them together like a pretzel,
114
00:12:28,400 --> 00:12:34,880
not really getting the shape right, but you know, pretty okay, just as an image. And for the DALL-E
115
00:12:34,880 --> 00:12:42,400
ones, same thing, right? You're getting a spaceship that goes into the weird shapes. And this one
116
00:12:42,400 --> 00:12:47,440
actually has a pretzel shape, but that's above the spaceship or maybe it's a separate spaceship.
117
00:12:47,440 --> 00:12:53,760
So as far as these examples go, that are cherry-picked in the research paper, I will see what
118
00:12:53,760 --> 00:13:02,000
we actually get. Okay, guys, we are live. Hugging Face, Stability AI, Stable Diffusion 3 Medium.
119
00:13:02,000 --> 00:13:11,040
We need to agree. This is me. Here's an email. I'm in this country. No, yes, agree. Once we have
120
00:13:11,040 --> 00:13:16,080
agreed to this, we're going to Files and versions, where you have some options here. Depending
121
00:13:16,080 --> 00:13:21,120
on what you want to do, you can download the medium one that doesn't include the CLIPs. If
122
00:13:21,120 --> 00:13:25,520
you're using StableSwarm, those are going to download automatically. So those are the text
123
00:13:25,520 --> 00:13:32,160
encoders, which are the CLIP-G, the CLIP-L and the T5. Those go in the models clip folder.
124
00:13:32,160 --> 00:13:37,680
You can also download medium including the CLIPs, or the one including CLIPs with the T5,
125
00:13:37,680 --> 00:13:42,080
which go in your models checkpoints or models Stable Diffusion folder, depending on which one you
126
00:13:42,080 --> 00:13:48,320
use. There's also some example workflows that you can use here. So if you drop one of them in
127
00:13:48,320 --> 00:13:53,280
the basic one here, for example, this is the workflow that you get. You have a Load Checkpoint node. So
128
00:13:53,280 --> 00:14:00,640
this is SD3, and we have the CLIPs here. If you are using the SD3 file that includes the CLIPs,
129
00:14:00,640 --> 00:14:05,920
I would assume that you can just drag that and put that in there instead. And if you
130
00:14:05,920 --> 00:14:12,160
generate now, you will get a female character with long flowing hair. And if we get an
131
00:14:12,160 --> 00:14:16,880
output here in a second, that is exactly what we're getting. By default, we have a resolution of
132
00:14:16,880 --> 00:14:21,840
1024 by 1024. A note here says resolution should be around one megapixel, and width and height
133
00:14:21,840 --> 00:14:28,480
must be a multiple of 64. Got a basic prompt. So this is interesting. The KSampler is by
134
00:14:28,480 --> 00:14:37,440
default set to 28 steps at a CFG of 4.5 with DPM++ 2M, which, well, I like that one.
135
00:14:37,440 --> 00:14:43,440
I've used Karras previously. But let's go with SGM uniform and see how that does, because
136
00:14:43,440 --> 00:14:49,760
that's what they preset for us. And if you are using StableSwarm, you select the model
137
00:14:49,760 --> 00:14:54,240
here, and when we press generate, we'll get some downloads in the background, which are the text
138
00:14:54,240 --> 00:15:00,480
encoders. And then you get this cute little SD3 text encoder thing here. So you can select
139
00:15:00,480 --> 00:15:07,840
CLIP only, T5 only, or CLIP plus T5. T5 will get you a slight increase in quality
140
00:15:07,840 --> 00:15:14,240
at a massive resource cost. So CLIP only will get you, you know, pretty far. Now I don't know what
141
00:15:14,240 --> 00:15:20,720
negatives they used in the comparisons. So we're just going to leave that
142
00:15:20,720 --> 00:15:26,400
blank for now, which is probably, again, not a fair comparison. But it is what it is. I have
143
00:15:26,480 --> 00:15:33,200
changed this to four images that we're going to generate. We have a 1024 by 1024 resolution. I just started
144
00:15:33,200 --> 00:15:37,920
using this. So I might be doing something wrong, obviously. So we're getting first, we're actually
145
00:15:37,920 --> 00:15:42,880
not getting any pigs inside of the other pigs, as you can see here. So get generating, and tell me in
146
00:15:42,880 --> 00:15:47,680
the comments what you think about it so far. If you're still watching this day one, you can use
147
00:15:47,680 --> 00:15:52,560
it on any Comfy back-end system. That doesn't only include Comfy and Swarm; I think some of the
148
00:15:52,560 --> 00:15:59,040
Fooocus variants you can use as well. I'm going to keep playing with this in the coming days. So
149
00:15:59,040 --> 00:16:04,880
expect more videos on the topic, but I'm going to stop it here for now, just so I can get this out
150
00:16:04,880 --> 00:16:07,840
and you can start playing with it. Have fun.