Part 1
[MUSIC] The following content has been provided by the University of Erlangen-Nürnberg.

So it is 1 plus d, then 2 plus d minus 1, which is again d plus 1, and how often do we get that? We have d terms, so d over 2 such pairs, which gives d times (d + 1) over 2. That is the number of degrees of freedom, and that is a huge number: if we have 100-dimensional features, you can work out that you need tons of parameters, which you have to estimate from the training data. And the pairing idea, that was the idea of little Gauss, third year of school or something like that. You remember the famous story: the teacher of Gauss came to school after carnival, still with a little bit of alcohol in his blood, and said, okay, can you please sum up all the numbers from 1 to 1000, because he wanted to have a nice day, and Gauss came back with the solution after three minutes. You don't remember that? The story comes back over and over again; not the alcohol in the blood, but the story and the problem itself. For me this is always a nice example when I go to schools and explain to people, or to potential students, what computer science is and why it is important to know how to think about problems.

You know the second nice problem I always explain to students, to check whether they are able to study computer science? You are filming, right? I'll tell you anyway, it's a nice one. Imagine a glass of red wine and a glass of white wine; it has nothing to do with Unix and read and write permissions, just red wine and white wine. You take a spoon of the red wine, put it into the white wine, mix things up, and then you take one spoon of the mixture back into the red wine. The question is: is there now more red wine in the white wine, or more white wine in the red wine? It is a very nice problem, and I explain to students that computer scientists think about it the following way. Think of a spoon that is so small it cannot even carry one molecule. What happens? You bring it over, you mix, you bring it back: the ratio stays the same, right? Now take a spoon that has exactly the size of one of the glasses. You take the whole glass over, you mix, and then you bring half back: the ratio is still the same. Why should it be different for anything in between, right? That is the argument. It is a legitimate argument, but it is not pattern recognition, so forget about it; it is, however, the way of thinking that we expect from our students.

So how do we bring down the number of parameters? And now we are in the middle of the topics we considered last week: we assume that the elements of the feature vector are mutually independent. That is a standard trick in pattern recognition: if the problem grows in terms of its dimensionality, we usually start to bring in independence assumptions. One was to say that the components of x are mutually independent, so p(x | y) is nothing else but the product over i from 1 to d of p(x_i | y). We multiply the probabilities for each component and assume independence.
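To make the counting concrete, here is the bookkeeping as a short worked equation; the d = 100 plug-in is my own illustrative number, not a figure from the slides.

```latex
% Full multivariate Gaussian per class:
%   mean vector: d parameters; symmetric covariance: 1 + 2 + \dots + d = d(d+1)/2 parameters
\[
d + \frac{d(d+1)}{2}\,, \qquad \text{e.g. } d = 100:\; 100 + 5050 = 5150 \text{ parameters per class.}
\]
% Naive Bayes (diagonal covariance): d means and d variances
\[
2d\,, \qquad \text{e.g. } d = 100:\; 200 \text{ parameters per class.}
\]
```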
If you look at how the number of unknowns behaves as the feature dimension increases, you find that it is well behaved, and the implication is that the parameters of these densities can be estimated rather robustly. You get a much more robust estimate, and that is also the reason why these algorithms quite often work pretty well even if the independence assumption is not justified by physics or by any other intuition coming from the real-world scenario. The number of unknowns is reduced, so the estimators that give us the parameters are more robust, and that is why the so-called Naive Bayes works so well. As I told you, this classifier is called Naive Bayes, or by its more charming name, Idiot Bayes, and you can still say: I built a Bayesian classifier that minimizes the 0-1 cost function.

What you can do now is start to study the decision boundary of a Naive Bayes classifier. What do we expect if the classifier uses one-dimensional Gaussians as its components; what geometric structure of the decision boundary do you expect? Quadratic? Quadratic if the covariances are different, and indeed we end up with a quadratic decision boundary; and if the covariances are the same, then we know we get a straight line, a linear function. What does it mean for the covariance matrix if the components of the feature vector are mutually independent; can you say something about its structure? Yes please, your name? Georg? It is a diagonal matrix. Exactly: if the components are mutually independent, you end up with a diagonal matrix, so you can read off the covariance matrix whether you have mutually independent components or not. And if the diagonal matrices of two different classes are the same, we get a linear decision boundary; if we end up with diagonal matrices that are very different, we have a quadratic decision boundary.
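This is not code from the lecture; it is a minimal numpy sketch, under the diagonal-covariance assumption just discussed, of a Gaussian Naive Bayes classifier. The function names, the toy data and the small variance floor are my own choices for illustration.

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Estimate per-class priors and per-feature means/variances (diagonal covariance)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # tiny floor so we never divide by zero
        }
    return params

def predict(params, x):
    """Return argmax_y [ log p(y) + sum_i log N(x_i | mu_{y,i}, var_{y,i}) ]."""
    best_c, best_score = None, -np.inf
    for c, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        score = np.log(p["prior"]) + log_lik
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# toy usage with two well-separated classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_naive_bayes(X, y)
print(predict(model, np.array([2.5, 2.8])))   # should print 1
```

Per class this sketch stores d means and d variances, which is exactly the reduced parameter count from above.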
It is now interesting to study these things in more detail. I also have all of this on the slides, so you do not need to copy it from the blackboard, but it is easier to explain and develop these things on the blackboard. So let us assume the following example: a Naive Bayes classifier where the priors are the same for both classes, p(y = +1) = p(y = -1) = 1/2, and where the covariance matrices are equal and are the identity matrix, Sigma_{-1} = Sigma_{+1} = I. Georg, can you tell me, in principle, the decision rule we have to apply? A feature vector comes in and we have to decide which class it belongs to. Well, we have a statistical framework, so we take the Bayesian decision rule: y* = argmax over y of p(y | x), which is argmax over y of p(y) p(x | y). Why can I then drop the prior? Because the priors are the same, so forget about this prefactor. And how does this look for two Gaussians whose covariance matrices are identity matrices? Any ideas? I can of course say argmax over y of the logarithm of p, which does not change the maximizer, and p is the Gaussian, so we get 1 over the square root of the determinant of 2 pi times the identity matrix; this is a constant, so forget about it. Then the logarithm of the exponential function just gives us the exponent: minus one half times (x - mu_{+/-1}) transposed, times the inverse of the identity matrix, which is, Georg, the identity matrix, very cooperative, times (x - mu_{+/-1}). The factor one half is a constant, so forget about it. We have to maximize minus this term, and maximizing the negative is the same as minimizing the positive, so y* = argmin over y of (x - mu_y) transposed times (x - mu_y).

Can somebody give me a geometric interpretation of this term? Think in two dimensions: here is mu_{+1}, here is mu_{-1}, and here the feature vector x comes in. What do we compute to decide which class it belongs to? This is just the squared Euclidean length of the difference vector: you compute x - mu, take the inner product with itself, and that is the squared Euclidean distance to one mean, and likewise to the other. So Naive Bayes with identity matrices as covariances ends up being a nearest neighbor classifier where the prototypes are the mean vectors.

Conclusion: you can see that we can produce scientific results just by massaging these formulas a little bit, and we will get more and more used to this. Thirty-five years ago this was sufficient to get a PhD in pattern recognition; that was a contribution. You are laughing; trust me, I did not do my PhD 35 years ago, I had a much harder time. Okay, let me write the result down somewhere. Conclusion: we end up with the nearest neighbor classifier where the reference vectors are the mean vectors of the classes. So now we know: if we construct, by accident, feature vectors whose components are mutually independent and normally distributed with variance one, then the nearest neighbor classifier using the mean vectors as references is the best we can get with respect to the 0-1 loss function and the associated average loss. That is what we have shown so far. This is a very cool result with a high excitement factor and a high probability of being asked in the oral exam. Tanja, I am going to ask you exactly this question; write it down.
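As a small sanity check of this equivalence, here is a tiny numeric sketch under the blackboard assumptions (equal priors, identity covariances); the class means and the test points are made up for illustration.

```python
import numpy as np

# class means (illustrative values, not from the lecture)
mu = {+1: np.array([1.0, 2.0]), -1: np.array([-1.5, 0.5])}

def bayes_decision(x):
    # argmax_y log N(x | mu_y, I); equal priors and the Gaussian constant drop out
    return max(mu, key=lambda c: -0.5 * np.sum((x - mu[c]) ** 2))

def nearest_mean_decision(x):
    # argmin_y ||x - mu_y||^2
    return min(mu, key=lambda c: np.sum((x - mu[c]) ** 2))

for x in [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-3.0, 1.0])]:
    assert bayes_decision(x) == nearest_mean_decision(x)
    print(x, "->", nearest_mean_decision(x))
```

Both rules pick the same class for every test point, which is the statement written on the blackboard.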
Another interesting thing: let us look at the decision boundary. How do we usually analyze a decision boundary? We look at the ratio of the posteriors, take the logarithm, and require it to be zero; so we are looking for the zero level set of that function, and this zero level set defines the decision boundary. Once again, these are the posteriors for the two classes; whether you call them 0 and 1 or -1 and +1 is a matter of taste and depends on the situation, it makes no big difference. We factorize the posterior using the Bayes formula, so inside the ratio we have the prior and the class-conditional; then we break things up using the homomorphic property of the logarithm, which turns multiplication into summation, so we get the logarithm of the ratio of the priors plus the logarithm of the ratio of the class-conditionals. Now we say, okay, we have a Naive Bayes classifier, so we have a nice factorization because all the components are mutually independent, and the interesting thing is that we now have a product inside the argument of the logarithm, so we can write it as a sum of logarithms; and the same for the denominator, which becomes minus the sum of the logarithms of its factors. We end up with a so-called generalized linear model: a constant plus a summation of functions, here logarithms of probabilities, where each term in the sum depends on only one entry of the feature vector. We can build such generalized linear models with other functions as well: polynomials, sine and cosine functions, or other families of functions, and we can build this type of decision boundary out of them. So we have a good understanding of the structure of the decision boundary of the Naive Bayes classifier.

Now the question is: is there anything in between Bayes and Naive Bayes? We have one extreme where all the components are mutually dependent, and the other extreme where all the components are mutually independent; is there something in between? This is also a trick we apply rather often in pattern recognition: we take statistical dependencies of limited order. Let me explain. We have considered the two extremes. One is the full chain, p(x | y) = p(x1 | y) times p(x2 | y, x1) times p(x3 | y, x1, x2) and so on; the other extreme is that we assume independence and end up with the factorization above. In between we can think of dependency of limited order. For order 1, for instance, I start with p(x1 | y). Is it coffee time? Pardon me, can you imagine how I suffer when you consume coffee and I am running out of coffee in my blood? Then p(x2 | y, x1), then p(x3 given... and now you would expect y, x1, x2, but dependency of limited order, of length one, means I only remember the predecessor and not the predecessor of the predecessor, so I use only x2 and forget about x1: p(x3 | y, x2), then p(x4 | y, x3), and so on. That is first order. Second order is p(x | y) = p(x1 | y) times p(x2 | y, x1) times p(x3 | y, x1, x2) times p(x4 | y, x2, x3), and I hope I am not making a mistake here; you take the two predecessors. Now you know what n-th order means: you take the n predecessors, and by using longer and longer histories you move from the one extreme to the full model. When we design probabilistic models in pattern recognition, we look at the two extremes and at things in between.
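Written out compactly, the factorizations just described read as follows (this is only the blackboard content in formula form, with d the feature dimension):

```latex
\begin{align*}
&\text{full dependency (chain rule):} &
p(x \mid y) &= p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_d \mid y, x_1, \dots, x_{d-1})\\
&\text{naive Bayes (independence):} &
p(x \mid y) &= \prod_{i=1}^{d} p(x_i \mid y)\\
&\text{first order:} &
p(x \mid y) &= p(x_1 \mid y) \prod_{i=2}^{d} p(x_i \mid y, x_{i-1})\\
&\text{second order:} &
p(x \mid y) &= p(x_1 \mid y)\, p(x_2 \mid y, x_1) \prod_{i=3}^{d} p(x_i \mid y, x_{i-1}, x_{i-2})
\end{align*}
```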
For instance, if you want to study speech recognition: if Georg gets into his BMW and uses the speech control of his advanced navigation system, that navigation system is basically using statistical models with a first-order dependency structure, and that is totally sufficient for solving trivial things like speech recognition systems or speech recognition algorithms. Okay, go and tell them that I was claiming things like that.

And how do we see these dependency structures for normally distributed features? We see them just by looking at the covariance matrices, at the side diagonals: if we have three diagonals, we know we have a dependency of first order; if we have five diagonals, second order; and so on. This is the example for the first-order dependency I was describing, and at the end of the day you can write it in a compact manner by saying that the i-th component depends on the class and on the (i-1)-th component. Here you see the structure of the covariance matrix, which tells us that the first and the second component are statistically dependent, while the first and the third, the fourth, the fifth are statistically independent, because there is a zero there. Here is another example of how you can also do parameter tying: if the variances of all the components are the same, so you have the same element everywhere on the main diagonal, then you reduce the number of degrees of freedom further by parameter tying, where you say that these variances, or standard deviations, are the same. It is not nicely written here, this entry should end up over there, but you see what is meant. And we also know that these two are the same and these two are the same, because it is a symmetric matrix. Yes? Is it a Toeplitz structure? Well, that depends on the dependency assumptions you make; this one here is of the special form you mention, but if you only assume a dependency of degree d - 1, then you have the zero here and the rest is full, and you can find everything in between.
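As an illustration of the structure being described, here is a made-up 5 x 5 example (not the matrix from the slide): first-order dependency leaves only the main diagonal and the first side diagonals non-zero, and parameter tying forces one shared variance on the main diagonal.

```latex
\[
\Sigma \;=\;
\begin{pmatrix}
\sigma^2 & c_1      & 0        & 0        & 0        \\
c_1      & \sigma^2 & c_2      & 0        & 0        \\
0        & c_2      & \sigma^2 & c_3      & 0        \\
0        & 0        & c_3      & \sigma^2 & c_4      \\
0        & 0        & 0        & c_4      & \sigma^2
\end{pmatrix}
\]
% tied main diagonal: all components share the variance \sigma^2
% zeros: components more than one position apart are uncorrelated
```

The zeros say that components more than one position apart are uncorrelated, which for jointly Gaussian features means independent, matching the argument above.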
So what did we learn? We still have six minutes left, so we will also start a little bit with the subsequent section. Keep in mind that Naive Bayes is rather successful: if you are required to build a classifier, in industry, for your master's thesis, for a project, whenever, it is always worth giving the Naive Bayes classifier a try, and if it works with a proper recognition rate, why not use it? The nice side effect is that Naive Bayes, due to the independence assumptions, implies a very low-dimensional parameter estimation problem, a maximum likelihood estimation problem, so you are not required to have a huge amount of training data. You always have to deal with the trade-off between the number of degrees of freedom in your model and the amount of training data. To make this plausible, think about the following situation: you want to estimate a straight line and you get only one point. You cannot estimate the straight line; you need at least two points. If you have very complex models, you also need as many samples as possible to get a reliable estimate of the function you want to fit, and the more data you have, the better it will actually be; but you have to make sure you have a sufficient amount of training data. If your parametric model has 100,000 unknowns, you need, depending on the model, roughly 100,000 samples or even ten times more in the training set. That is a real issue. So we have the trade-off between statistical dependency and the size of the search space, and Naive Bayes, Idiot Bayes, give it a try; if it works, it works, and then you should use it.

Two books I still recommend: the book of Ripley, which is already 15 or 16 years old but always worth having a look into, and the book of Christopher Bishop, who is one of the most famous pattern recognition people worldwide, is also worth checking regarding the Bayesian and Naive Bayes classifier.

These are typical questions you will be faced with in the oral exam. What is the assumption of Naive Bayes? Very simple: mutually independent features. How does the assumption affect the class-conditional PDF? You can factorize the class-conditional PDF into a product of the PDFs of its components. What is the structure of the covariance matrix of normally distributed classes under Naive Bayes? It is a diagonal matrix. And how can Naive Bayes be extended to first-order statistical dependencies? It is just the type of factorization we discussed, and if you use Gaussians, you get just two more diagonals in the covariance matrix. That is the understanding I hopefully grew in your mind during the lecture today; burn it into the outer shell of your brain, you will need it, it is important.

Now, what we are going to discuss next is a chapter that is very crucial for pattern recognition: discriminant analysis. To motivate you a little bit: the classifier and the transformation we are going to introduce by tomorrow are implemented in the Adidas One shoe. If you buy the Adidas One shoe in the Adidas outlet center or in any Adidas store worldwide, you should know two things: it has a discriminant analysis scheme implemented in the shoe, and the software was written by us in the lab. If you go to Brazil and buy it there, you get our software, and it is really cool to know things like that. What does the Adidas One shoe do? The Adidas One is an intelligent shoe that adapts the stiffness of the heel to the surface the runner is running on: on a harder surface the sole gets softer, and on a softer surface the sole gets harder. This is measured by a sensor that captures how large the deformation of the heel is, and the control circuit, which was implemented by Björn Eskofier at the time, adjusts the stiffness of the heel to the surface such that the deformation properties of the shoe remain constant during a run. That is cool marketing. I like these shoes, I also have a pair; it is fun trying these things. Whether it is good or not, I do not know, but it is funny. It is also interesting that all the Adidas athletes in Beijing four years ago had these shoes: once they got their medal, they always wore the Adidas One shoe.

To understand how the Adidas One shoe works, what do we have to do? We have to consider the a posteriori probability, rewritten in terms of the generative model: we factorize the joint PDF as p(y) times p(x | y), and we rewrite the evidence in terms of a marginal. You should understand what I am saying here: the evidence was p(x), and p(x) is obtained by taking the joint density and summing over y, and the joint density can be factorized the same way, so p(x) equals the sum over y of p(y) times p(x | y). These three formulations are something that you really should know by heart from now on; if you look at these things, you should see right away what is going on, and you will get used to it. We will now look at the class-conditionals, and we have also seen, on one of the previous blackboards, that it is rather comfortable if we have covariances that are just the identity matrix. The question that we are going to consider tomorrow morning is: are there feature transforms such that we end up with new features whose covariance matrix is the identity matrix? That is what we are going to look into by tomorrow. Thank you very much, and see you tomorrow morning. [MUSIC]