Stable Diffusion 3 IS FINALLY HERE!

Stable Diffusion 3 is released. I know you are busy, so I'm gonna give you a summary right here. Yes, you should start using SD3. Can I use it day one? Yes, I'll show you where to download it and how to get started. Will it give me better results day one? Probably not. It needs fine-tuning. Is it still worth using? Yes. Someone told me this model is bad, is it? Well no, it's a medium-size 2B model compared to the 8B model. You will be fine, and you probably won't need the 8B model until you get a better GPU, and I think most fine-tunes will be made on the 2B model. Anyway, why is it better? Text, prompt understanding, 16-channel VAE. If you want to know more, keep watching. Oh, and you know, some more stuff.

I'm clearly the superior model. People can't get enough of my anime art and all that amazing ControlNet setup. Please. ControlNet? I got that too, and also much higher resolution. Don't be silly, those ControlNets and LoRAs can't match my 1.5 ControlNets, and I can do higher resolution too, with hires fix and tiled upscales, just to name a few. Step aside kids, you're outdated. I can do text, you know, letters that actually go together and make words. Let me spell it out for you. Bye bye. Who does words anyway? Exactly, can you even animate? Well, I'm not sure yet, but I got prompt understanding. Better faces and hands, well, at least sometimes. No one can do hands. Yikes.
Are you even fine-tuned? Well no, not yet, but I mean the community will surely fix that. I have thousands of models and LoRAs, you'll never match that. I'm sure people will fine-tune me. I'm also safe to use, unlike some of you guys. What does that even mean? I'm all about unlimited control. Hi guys, I can do pretty images.

So jokes aside, SD3 will probably outperform both 1.5 and SDXL. Will it do it day one? Well, probably not, maybe not. Maybe. It's a starting point. I think we will need some fine-tunes from the community to see Stable Diffusion 3 excel and shine as much as possible. But there are some key architectural features where Stable Diffusion 3 really outshines other models.

First let's talk about the VAE. The previous models used a 4-channel VAE. Now you can go 4 channel, 8 channel, 16, 32, etc. You get the point. And Stable Diffusion 3 uses a 16-channel VAE. You can clearly see that in this comparison image here, which I think is a great reference for showing the difference. And this is not just about outputting an image. It's actually incorporated when you train the model. So as you are training the model, you will retain more detail and will be able to output more detail.

Now SD3 is a 1024x1024 pixel model, not to be confused with the previous ones. So SD1.5 was trained for 512x512. SDXL was trained at 1024x1024. The difference here though is that SD3 also works well when generating 512x512 images. This does not mean that it's a 512 model.
This is a 1024 model that can be used at more sizes than the previous ones. So this is good, especially for people that are not on a massive GPU machine. I mean, I'm on a 4090 so I can generate a lot, not everything, but a lot. But many people are not able to run SDXL. So having a better model that can run 512x512, and do that faster and with fewer resources than SDXL, that's a huge deal for many people.

And one of the reasons for that is that we are getting the medium model, or the 2B model. So there is the full-weight one, the 8B. For the majority of people, 2B will be fine. It's going to be great. Requirements are going to be much lower than for SDXL. You're going to have a great time with it on most machines. Then there are the diminishing returns you get from the increased quality in 8B compared to the resources you require. I mean, the curve goes up like this and then it flattens out. Those are the diminishing returns, right? So is 8B better? Well, yes. Yes. Yes, it is. Will you need it? Probably not. At least right now, the majority of users will want to use the 2B one because it's going to be faster to run. And if you eventually get the 8B one and you want to run it, sure, go for it. It's going to be slower. You're going to get a slight increase in quality, where some areas are going to be higher quality than others. So it depends on what your end goal is. But right now, go for the 2B one. That's the one we have. And start fine-tuning it, because it's a great base.
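
To make the 4-channel versus 16-channel VAE point and the 512 versus 1024 sizes a bit more concrete, here is a minimal sketch of how large the latent is in each case. It assumes the VAE shrinks each image side by a factor of 8, which is common for Stable Diffusion VAEs but is not stated in the video; the function names are made up for this illustration.

```python
# Assumption: the VAE downsamples each spatial side by a factor of 8.
# The video only states the channel counts (4 vs. 16) and the image sizes.
DOWNSAMPLE = 8


def latent_shape(width: int, height: int, channels: int) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for one image."""
    if width % DOWNSAMPLE or height % DOWNSAMPLE:
        raise ValueError("width and height must be divisible by the downsample factor")
    return (channels, height // DOWNSAMPLE, width // DOWNSAMPLE)


def latent_values(width: int, height: int, channels: int) -> int:
    """Total number of values stored in the latent for one image."""
    c, h, w = latent_shape(width, height, channels)
    return c * h * w


if __name__ == "__main__":
    # Compare the older 4-channel layout with the 16-channel one at both sizes.
    for name, channels in (("4-channel VAE", 4), ("16-channel VAE", 16)):
        for side in (512, 1024):
            shape = latent_shape(side, side, channels)
            total = latent_values(side, side, channels)
            print(f"{name}, {side}x{side}: latent {shape}, {total} values")
```

Under that assumption, a 16-channel latent carries four times as many values per image as a 4-channel one at the same resolution, which is one way to read the "retain more detail" remark.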

Just jumping back to the VAE and the 16-channel thing. That's going to be a huge improvement. And if you start digging in the research paper of Stable Diffusion 3, under the chapter "Improved Autoencoders" here, you can see: similar to this research paper from 2023, "we find that increasing the number of latent channels significantly boosts performance. See Table 3." And Table 3 is this one here. Now, it's not as pretty as the image we looked at before. This is a comparison of 4 channel, 8 channel and 16 channel, and the FID score here, which is something called the Fréchet inception distance. The arrow indicates which is better. So a lower score is better. 16 channel clearly outperforms the other ones. Perceptual similarity: also lower is better. It's almost twice as good as the 4-channel one. And in all the other data, it outperforms the other channels. You can keep reading here, but basically, you know, yadda yadda yadda, models with increased capacity should be able to perform better for a larger, yadda yadda, "ultimately achieving higher image quality. We confirm this hypothesis in Figure 10." Let's look at Figure 10. And here's 8 channel; 16 channel is the yellow one here.

So here are a couple of images that I pulled from the research paper. I'm going to compare these to Midjourney, Stable Diffusion XL and some DALL-E 3 generations. Now, it's not a fair comparison.
Okay, I'll give you that, because not everyone handles the prompting in the same way. So using a prompt in Midjourney, the same prompt in DALL-E and the same prompt in Stable Diffusion, it's not a fair comparison. I get it. It's apples, oranges, and all that. I get it. I get it. I get it. But I think, you know, people want to see that kind of comparison. So I figured I'll do that. And we're going to use the images from the research paper again; we're also going to render them with the actual SD3 model and see how cherry-picked these are.

So first we'll look at this. We have a "beautiful pixel art of a wizard with hovering text 'Achievement unlocked: diffusion models can spell now'". This one looks, well, looks kind of good. On the right here, we have this frog. "Frog sitting in a 1950s diner wearing a leather jacket and a top hat. On the table is a giant burger and a small sign that says 'Froggy Fridays'". And I mean, it looks pretty good. We have the frog. It's in a diner. He's wearing a leather jacket and a top hat. You have the burger, and the sign is spelled correctly. Now, I don't know how many generations it took to get that, or if they got it like 90% plus of the time, but I'm sure we'll see that once we generate it ourselves. And here we have something I think is kind of cool. This "translucent pig, inside is a smaller pig". So spoiler, I did this in some of the other tools and this was a hard prompt to get right.
And the last one here, "a massive alien spaceship that is shaped like a pretzel". So all of them are kind of cool.

And let's look at the first comparisons. So in this one here, we have old SDXL on the left, we have Midjourney in the middle, and we have DALL-E on the right. Clearly, in the SDXL one, while it looks kind of good as an image, the text is not working at all. We get some "froggies", some "fargoggy", "frog roggies", "Somali bird eyes". It just doesn't make any sense. The frog, however, is wearing a leather jacket and a top hat in all three. Or sort of a bowler hat. I don't know. It's kind of getting it right. Not in all of them. The burger is in front of the frog in three of the images. You know, we're halfway there. It's okay. The text is not okay.

In the Midjourney one, we have this very Midjourney cinematic, artsy kind of style. "Froggy Fridays" is actually working in most of the images. The first one here has a slight misspelling. That one just says "froggy", but in general, I think it's okay. This one says "Fridays" here on the little sign down here. This one's pretty good. It's actually straight and everything. "Froggy Fridays". We did get some "Fridays" and "froggy" down here too.

In the DALL-E one, the images look kind of DALL-E. We got "Froggy Fridays", but it's hardly working. In the third image here, we actually get a properly spelled "Fridays". Now bear in mind, I've not cherry-picked any of these.
It's just the first four generations out. I can be lucky. I can be unlucky. Again, this is an apples-to-oranges kind of comparison. We'll look at the SD3 ones in a bit.

Pig here. "Translucent pig, inside is a smaller pig". SDXL: it's just one pig. It is sort of translucent. Midjourney: while you're getting two pigs in some of them, none is actually inside the other pig. The DALL-E one actually has all of the smaller pigs inside the big pig. They're not really translucent. There's a part where you can see inside. So, you know, I would say, as far as understanding what the task was, DALL-E kind of, you know, solved it in the best way, but also didn't. I think Midjourney in general, in this instance, had the most realistic-looking images.

Looking at the wizard, we clearly have different styles. The SDXL pixel art is very minimalistic. The Midjourney one is very artsy, and the DALL-E one is somewhere kind of in between. And in this instance, the Midjourney ones actually have the better text. We have "achievement unlock diffusion models now", "achievement unlock diffusion can spell now". It's a long sentence. And it's, you know, "models now diffusion can spell". It's getting there, right? Whereas in the DALL-E one, the words are kind of in the right place, but everything's misspelled and all over the place. "I should aim the lock diffusion your models can spell now". It's like, yeah, you can see that it tried. It is not useful for anything.
The good part is the words kind of get, you know, in the proper positions; you could go in after the fact in Photoshop and, you know, fix that. And SDXL is all over the place. This isn't working. So compared to this, SD3 is, you know, a huge step.

And here again, SDXL on the left, Midjourney in the center and DALL-E on the right. Starting with the SDXL one, we got spaceships. And they are really pretzels, like they're super pretzels. They're actual pretzels, probably not what we're looking for. In the Midjourney ones, they are getting there, like you're getting a spaceship with weird kind of dimensions. And it's trying to, like, circle them up or, you know, piece them together like a pretzel, not really getting the shape right, but, you know, pretty okay, just as an image. And for the DALL-E ones, same thing, right? You're getting a spaceship that goes into weird shapes. And this one actually has a pretzel shape, but that's above the spaceship, or maybe it's a separate spaceship. So as far as these examples go, the ones that are cherry-picked in the research paper, we'll see what we actually get.

Okay, guys, we are live. Hugging Face, Stability AI, Stable Diffusion 3 Medium. We need to agree. This is me. Here's an email. I'm in this country. No, yes, agree. Once we have agreed to this, we're going to "Files and versions", where you have some options. Depending on what you want to do, you can download the medium one. That doesn't include the CLIPs.
If you're using StableSwarm, those are going to download automatically. So there are the text encoders, which are the CLIP-G, the CLIP-L and the T5; those go in the models/clip folder. You can also download medium including the CLIPs, or including the CLIPs with the T5, which go in your models/checkpoints or models/Stable-Diffusion folder, depending on which UI you use.

There are also some example workflows that you can use here. So if you drop one of them in, the basic one here, for example, this is the workflow that you get. You have a Load Checkpoint. So this is SD3. We have the CLIPs here. If you are using the SD3 including CLIPs, I would assume that you can just drag that and put that in there instead. And if you generate now, you will get a "female character with long flowing hair". And if we can get an output here in a second, that is exactly what we're getting by default. We have a resolution of 1024 by 1024. There's a note here that says the resolution should be around one megapixel and width/height must be a multiple of 64. Got a basic prompt. So this is interesting. The KSampler is by default set to 28 steps at a CFG of 4.5 with DPM++ 2M, which, well, I like that one. I've used Karras previously. But let's go with the SGM uniform and see how that does, because that's what they preset for us.
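
The workflow note and the default sampler values mentioned above can be written down as a small sketch. The dictionary keys, the function name, the 25% tolerance for "around one megapixel" and the choice of 1024 x 1024 pixels as "one megapixel" are all assumptions made for this illustration; none of it is an actual ComfyUI or StableSwarm API.

```python
# Default values quoted in the video for the example workflow.
# The keys are hypothetical labels for this sketch, not real node parameters.
DEFAULT_SAMPLER_SETTINGS = {
    "steps": 28,
    "cfg": 4.5,
    "sampler": "DPM++ 2M",
    "scheduler": "SGM uniform",
}

# Assumption: "one megapixel" is taken as 1024 * 1024 pixels.
MEGAPIXEL = 1024 * 1024


def check_resolution(width: int, height: int, tolerance: float = 0.25) -> list[str]:
    """Return a list of problems with a resolution; an empty list means it passes.

    The rules follow the workflow note: width and height must be multiples
    of 64, and the pixel count should be around one megapixel. The tolerance
    is an assumed value, since the note does not define "around".
    """
    problems = []
    if width % 64 != 0:
        problems.append(f"width {width} is not a multiple of 64")
    if height % 64 != 0:
        problems.append(f"height {height} is not a multiple of 64")
    ratio = (width * height) / MEGAPIXEL
    if abs(ratio - 1.0) > tolerance:
        problems.append(f"{width}x{height} is {ratio:.2f} megapixels, not around 1")
    return problems


if __name__ == "__main__":
    for size in ((1024, 1024), (832, 1216), (1000, 1024), (512, 512)):
        print(size, check_resolution(*size) or "ok")
```

Note that 512x512 is flagged by this check only because it follows the workflow note literally; the video itself says SD3 also works well at 512x512.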

And if you are using StableSwarm, you select the model here, and when we press Generate, we'll get some downloads in the background, which are the text encoders. And then you get this cute little SD3 text encoder thing here. So you can select CLIP only, T5 only or CLIP plus T5. T5 only will get you a slight increase in quality at a massive resource cost. So CLIP only will get you, you know, pretty far. Now, I don't know what negatives they used in the comparisons. So we're just going to leave that blank for now, which is probably, again, not a fair comparison. But it is what it is. It is what I have. Change this to four images. We're going to generate; we have a 1024 by 1024 resolution. I just started using this, so I might be doing something wrong, obviously. So first, we're actually not getting any pigs inside of the other pigs, as you can see here.

So get generating. Tell me in the comments what you think about it so far, if you're still watching this. Day one, you can use it on any Comfy backend system; that doesn't only include Comfy and Swarm, I think you can use some of the Fooocus variants as well. I'm going to keep playing with this in the coming days. So expect more videos on the topic, but I'm going to stop it here for now, just so I can get this out and you can start playing with it. Have fun.
