007 Term Frequency Inverse Document Frequency (TF-IDF)

Now, of course, indices aren't quite that simple. An index is actually what's called an inverted index, and this is basically the mechanism by which pretty much all search engines work. As an example, imagine I have a couple of documents in my index that contain text data. Let's say I have one document that contains "Space: the final frontier. These are the voyages," and maybe I have another document that says "He's bad, he's number one, he's the space cowboy with the laser gun." If you understand what both of those are references to, then you and I have a lot in common.

Now, an inverted index wouldn't store those strings directly. Instead, it sort of flips things on their head. A search engine such as Elasticsearch actually splits each document up into its individual search terms. In this example, we'll just split it up by word and convert each word to lowercase to normalize things. Then what it does is map each search term to the documents that the term occurs within.
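The splitting, lowercasing, and mapping just described can be sketched in a few lines of Python. This is only a minimal illustration, not how Elasticsearch is actually implemented: the names `documents`, `tokenize`, and `build_inverted_index` are hypothetical, the punctuation stripping is a simplification, and the second quote is assumed to be worded with "the".

```python
from collections import defaultdict

# The two example documents, keyed by document ID.
# Assumption: the second quote is worded with "the" rather than "a".
documents = {
    1: "Space: the final frontier. These are the voyages",
    2: "He's bad, he's number one, he's the space cowboy with the laser gun",
}


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip edge punctuation."""
    words = (word.strip(".,:;!?\"") for word in text.lower().split())
    return [word for word in words if word]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


inverted_index = build_inverted_index(documents)
print(inverted_index["space"])  # [1, 2]
print(inverted_index["the"])    # [1, 2]
print(inverted_index["final"])  # [1]
```

Looking up a term is then a single dictionary access, which is why the structure is stored "flipped" from term to documents rather than from document to text.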
So in this example, the word "space" actually occurs in both documents, meaning the inverted index would indicate that the word "space" occurs in both documents one and two. The word "the" also appears in both documents, so that will also map to both documents one and two. And the word "final" only appears in the first document, so the inverted index would map the word "final," as a search term, to document one.

Now, it's a little bit more complicated than that in practice. In reality, it actually stores not only which documents a term is in, but also the position within the document that it's in. But at a high conceptual level, this is the basic idea. An inverted index is what you're actually getting with a search index, where it's mapping things that you're searching for to the documents those things live within. And of course, it's not even quite that simple.

So how do I actually deal with the concept of relevance? Let's take, for example, the word "the." How do I deal with that? The word "the" is going to be a very common word in every single document, so how do I make sure that only documents where "the" is a special word are the ones that I get back if I actually search for the word "the"? Well, that's where TF-IDF comes in. That stands for term frequency times inverse document frequency. It's a very fancy-sounding term, but it's actually a very simple concept. So let's break it down.
Term frequency is just how often a given search term appears within a given document. So if the word "space" occurs very frequently in a given document, it would have a high term frequency. The same applies to the word "the": if it appears frequently in the document, it would also have a high term frequency.

Now, document frequency is just how often a term appears in all of the documents in your entire index. So here's where things get interesting. The word "space" probably doesn't occur very often across the entire index, so it would have a low document frequency. However, the word "the" does appear in all documents pretty frequently, so it would have a very high document frequency.

So if we divide term frequency by document frequency, which is mathematically the same as multiplying by the inverse document frequency, we get a measure of relevance. We see how special this term is to the document. It measures not only how often this term occurs within the document, but how that compares to how often this term occurs in documents across the entire index. So with that example, the word "space" in an article about space would rank very highly. However, the word "the" wouldn't necessarily rank very highly at all, since that's a common term found in every other document as well. And this is the basic idea of how search engines work.
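As a rough sketch of that division, the Python below computes term frequency over document frequency for a tiny made-up corpus. Several things here are assumptions rather than anything stated in the lecture: the corpus and the function names are invented for illustration, and document frequency is taken to mean the number of documents that contain the term. Production search engines typically use a log-scaled variant of this idea rather than the raw ratio.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
corpus = {
    "space_article": "the space probe left the space station and entered deep space",
    "cooking_article": "the chef seasoned the soup and served the bread",
    "sports_article": "the team won the match in the final minute",
}


def term_frequency(term, text):
    """How many times the term occurs in one document."""
    return Counter(text.lower().split())[term]


def document_frequency(term, docs):
    """How many documents in the index contain the term at least once."""
    return sum(1 for text in docs.values() if term in text.lower().split())


def tf_idf(term, doc_id, docs):
    """Term frequency divided by document frequency (the raw ratio)."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0  # the term is not in the index at all
    return term_frequency(term, docs[doc_id]) / df


print(tf_idf("space", "space_article", corpus))  # 3 occurrences / 1 document = 3.0
print(tf_idf("the", "space_article", corpus))    # 2 occurrences / 3 documents
```

Here "space" scores far higher than "the" for the space article, because "the" shows up in every document and so is divided by a larger document frequency.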
If you're searching for a given term, it will try to give you back results in order of their relevancy. Relevancy is, at least loosely, based on the concept of TF-IDF. It's not really that complicated.
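To make the idea of ordering results by relevancy concrete, here is a small Python sketch that scores each document for a query and sorts the results. It is an illustration under stated assumptions, not the actual scoring used by any search engine: the `docs` corpus and the function names are hypothetical, and summing the per-term scores for a multi-word query is a simplification chosen for this example.

```python
from collections import Counter

# Hypothetical documents, invented for illustration.
docs = {
    "doc_a": "the telescope mapped deep space and the space station",
    "doc_b": "the cat sat on the mat by the door",
    "doc_c": "the recipe needs the flour and the sugar and the salt",
}


def document_frequencies(documents):
    """Count, for every term, how many documents contain it."""
    freq = Counter()
    for text in documents.values():
        freq.update(set(text.lower().split()))
    return freq


def search(query, documents):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    doc_freq = document_frequencies(documents)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        # Assumption: a document's score is the sum of tf / df over query terms.
        score = sum(
            counts[term] / doc_freq[term]
            for term in query.lower().split()
            if doc_freq[term] > 0
        )
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


results = search("the space", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Even though `doc_c` contains the word "the" most often, `doc_a` comes back first, because "space" is rare across the index and so contributes much more to the score.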
