Part 1
[MUSIC] ...the sum over all the feature vectors x_i times x_i transposed, if you think in terms of column vectors. And what is the eigenvector that corresponds to the largest eigenvalue? That is the principal axis, the direction in which the one-dimensional projections onto this straight line have maximum variance, so the interval covered by the 1-D projections is maximal. The second eigenvector, the one that belongs to the second largest eigenvalue, is the second principal axis, where we have the second highest variation, and so on. So what you can do is a so-called spectral decomposition — we have written it here — a spectral decomposition of the covariance matrix. That is, you can rewrite the whole covariance matrix as a linear combination of rank-one matrices. These are the projection matrices that project onto the eigenvectors, and these rank-one matrices are weighted by their eigenvalues. This decomposition holds for positive semi-definite covariance matrices. And now you can think of the following: you have this covariance matrix, and you can characterize it by the e_i's, these matrices, and the lambda_i's. Now you can change the lambda_i's a little bit, for instance, so that you get a larger or a smaller scaling in a certain principal direction, and that is how you can build different kidney models out of this type of decomposition. That is actually what we did. We have used here the mean vector — if it is zero, it is gone — and then we say our vectors x are characterized by linear combinations of our eigenvectors, and these eigenvectors are weighted differently by the a_i's. By manipulating these a_i's, which in the previous examples were the lambda_i's, I can build different shapes. If I just change a_1, the weight belonging to the largest eigenvalue, and if I plot the point set, the high-dimensional feature vector, and the associated volume, I see within the kidney the deformation with the highest variation. That is very interesting. We had, for instance, one project with the Adidas company, where we tried to measure the three-dimensional surface of human feet. So we measured the human foot, then we computed this PCA decomposition of the covariance matrix, and we get the eigenvector that belongs to the largest eigenvalue. Then we rewrote the point set using this type of representation and looked at variations along the largest eigenvalue. And what do you think is the change of a human foot in terms of its variability over humans? The length, right. That is the length, and it is expressed here by the largest eigenvalue. That is very interesting: without using any domain knowledge, we captured 5000 feet of people, we did the PCA decomposition using these surface data, we computed the PCA of the covariance matrix, and then we analyzed where the highest variation is — in its length, of course. The second highest variation is the width, and with the third highest variation things start to become interesting, because then shoemakers can think about which parameters they can work on to build shoes that fit very well. Very interesting, right? And this is brute-force pattern recognition that was applied. This type of decomposition has been used in pattern recognition for, let's say, 10 or 15 years, but not longer; before that, people had not thought about these things. But take the eigenvectors, modify the weights, and you get different types of shapes — the principal changes are basically captured by that.
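As a side note, here is a minimal NumPy sketch of the spectral decomposition and the linear shape model just described. The data matrix `shapes` is random toy data standing in for flattened 3-D point sets (kidneys, feet), so all names and numbers are illustrative only.

```python
# Minimal sketch of the spectral decomposition and linear shape model described
# above; `shapes` is hypothetical toy data standing in for flattened 3-D surfaces.
import numpy as np

rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 30))          # 50 training shapes, 30 coordinates each

mu = shapes.mean(axis=0)                    # mean shape
X = shapes - mu                             # centered data
Sigma = X.T @ X / X.shape[0]                # covariance: sum over x_i x_i^T

# Spectral decomposition: Sigma = sum_i lambda_i * e_i e_i^T
lambdas, E = np.linalg.eigh(Sigma)
order = np.argsort(lambdas)[::-1]           # sort eigenvalues descending
lambdas, E = lambdas[order], E[:, order]

# Verify the rank-one reconstruction of the covariance matrix
Sigma_rebuilt = sum(l * np.outer(e, e) for l, e in zip(lambdas, E.T))
assert np.allclose(Sigma, Sigma_rebuilt)

# Shape model: x = mu + sum_i a_i * e_i.  Varying a_1 (scaled by sqrt(lambda_1))
# shows the deformation with the highest variation, e.g. foot length.
for a1 in (-2.0, 0.0, 2.0):
    new_shape = mu + a1 * np.sqrt(lambdas[0]) * E[:, 0]
    print(a1, np.round(new_shape[:3], 3))   # print the first few coordinates
```

Varying a_1 by a few multiples of the square root of lambda_1 is exactly the "slider" view of the dominant deformation.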
If you scan humans and represent them this way, you can manipulate with two prefactors basically the height and the width — or the degree of fatness — of people. With two parameters. That's cool: you can adjust the shape with just two parameters. If you eat too much, you just change the second principal component — that's a very nice explanation, right? What changed since we met last time? You can say the second principal component has doubled, and you know what that means. So that gives us a very good understanding. And here you see: this is the kidney of — your name is Niels? — and this is the kidney of — I don't know your name — Wayne. Here you have variations of the mean kidney shape along the first and the second eigenvector, and you can play around with it. It's fantastic, and it's easy to implement in MATLAB: an SVD computation, then you implement two sliders, you play, and you see how things change. You can also analyze faces, for instance. If you take faces and compute principal faces — you can compute eigenfaces — you can see what basically makes the difference between faces. It's fascinating, as everything in pattern recognition. Actually, we also use these models: once we have done a segmentation of all the kidneys and computed the model using the eigenvalues, we can use this type of model and estimate the a_i's given a certain CT scan, for instance. If we want to find the kidneys, we just estimate three or four a_i's, the weight factors of the eigenvectors, and then we can do kidney segmentation robustly, not just by looking at gradient information: we use the model and try to fit it into the image such that a certain optimality criterion is optimized. Here you see one example, a very nice 3-D view of the human kidney in a CT scan, where we did this segmentation using these eigenmodels. We call these eigenkidneys — this is an eigenkidney: eigenface, eigenkidney. Okay, any questions so far? No? That's cool, right? And if you browse the web — CVPR is one of the most prestigious conferences in computer vision and pattern recognition — three or four years ago people used this type of modeling for human shape modeling and generated very exciting and motivating videos. Okay, good. So that was an exciting example; let's come back to the boring pattern recognition technology stuff. All of this can be used for many cool applications, but we have to set the base for all these fancy things. So let's look at one problem we have faced several times — by now you should just look at these formulas and immediately know what's going on. This alpha transposed x is the projection of x onto alpha, and that is the intercept, so that is a linear manifold. x is our feature vector, and depending on which side the new point ends up, in terms of the multiplication with the normal vector of this manifold, we are on the upper side or the lower side of the manifold. So we have a two-class problem, plus 1 and minus 1, and this is decided just by looking at the sign when you put an arbitrary feature vector into this equation. And if alpha has unit length, this term gives you the signed distance to the linear manifold; the sign tells you: if it's positive, the point is on the same side as the normal vector alpha, and if it's negative, it's on the opposite side. That is the way you have to read this.
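A tiny sketch of this decision rule, assuming arbitrary illustrative values for alpha, alpha_0 and the test points:

```python
# Sketch of the decision rule just described: the sign of alpha^T x + alpha_0
# picks the class (+1 / -1), and if alpha has unit length the same expression
# is the signed distance to the hyperplane.  All values here are made up.
import numpy as np

alpha = np.array([3.0, 4.0])
alpha0 = -2.0
alpha_unit = alpha / np.linalg.norm(alpha)        # normalize so the value is a distance
alpha0_unit = alpha0 / np.linalg.norm(alpha)

for x in (np.array([2.0, 1.0]), np.array([-1.0, 0.5])):
    signed_dist = alpha_unit @ x + alpha0_unit
    label = 1 if signed_dist >= 0 else -1
    print(x, "class", label, "signed distance", round(signed_dist, 3))
```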
We have already seen a way to estimate alpha and alpha_0 given a set of training samples. Which method have we learned already, which method have we studied, to estimate a linear decision boundary? The sigmoid function — you can use the sigmoid function. What else? The Gaussian — with which constraint? A shared covariance matrix, thank you very much, that's the second one. And I told you, in the exam I usually ask you to give me five methods: logistic regression, the sigmoid function — that's the same category. And this is now one example that people usually did not remember in the exam, so look at it. What we do is we solve this estimation of alpha and alpha_0 — the training of the classifier — as follows. We have a set of classified features, that's always the same: x_1 with the associated class y_1, x_2 with y_2 — if you cannot read my handwriting, blame it on the iPad — x_3 with y_3, and so on. So we can say: alpha transposed x plus alpha_0 is the signed distance, and we simply require alpha transposed x_1 plus alpha_0 to be y_1, which is either plus 1 or minus 1; alpha transposed x_2 plus alpha_0 to be y_2, and so on. Of course this system of equations cannot be fulfilled exactly, but we can set up a least-squares estimate, saying that at least the sign should come out correctly as plus or minus one. How do we solve this type of least-squares estimation problem? We can rewrite the whole thing as a matrix equation: I have x_1 and a one — the one is due to the fact that alpha_0 is multiplied by one — the y is just the class assignments, and our unknowns are just alpha and alpha_0. So we can write our problem this way: X — no, it's called X, sorry — times theta equals y. And how did we solve this in the past? SVD, okay, SVD. We compute the SVD of X and we get theta as the pseudoinverse of X times y. This is not a transpose, this is the pseudoinverse: that's V, times the inverse of the singular values in Sigma, times U transposed. Good, cool. So that's another way to compute the linear decision boundary, and you will end up with a completely different result than with the Gaussian, with logistic regression, or with the sigmoid function. We get another straight line for classification in the 2-D plane, and again the question is: which one is the better one, and how do we judge whether a decision boundary is doing a good job or not? We use our loss function — so please remember that. As I said, the first one who writes this up as a journey through five different methods will get a prize, will be awarded. So one option is to solve the linear regression problem, and that is what we actually did with the pseudoinverse. Okay. But what we have not considered here is a constraint requiring, for instance, that theta has to have unit length, or that alpha has to have unit length. Here anything can happen; we just looked for a theta, we have no constraint, and that is the solution. I have done this on the blackboard already — remember, in the materials science building I did all the computations and I showed you that the pseudoinverse looks like that. Niels, you remember that? So we remember. Or if you don't remember this formula, don't worry — SVD is totally sufficient for me. How do you solve that? SVD is fine. Okay, good.
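For reference, here is a small sketch of this least-squares classifier, assuming made-up 2-D training data; the pseudoinverse is written out via the SVD as described (and matches `np.linalg.pinv`).

```python
# Least-squares linear classifier from the lecture: stack each feature vector
# with a 1 (for alpha_0), set the targets to the labels +1/-1, and solve
# X theta = y with the pseudoinverse.  The training points are invented.
import numpy as np

rng = np.random.default_rng(1)
class_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
class_neg = rng.normal(loc=[-2, -1], scale=0.5, size=(20, 2))

features = np.vstack([class_pos, class_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

X = np.hstack([features, np.ones((features.shape[0], 1))])   # append the constant 1

# Pseudoinverse via SVD: X = U S V^T  =>  X^+ = V S^+ U^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = Vt.T @ np.diag(1.0 / s) @ U.T @ y                     # same as np.linalg.pinv(X) @ y

alpha, alpha0 = theta[:2], theta[2]
predictions = np.sign(features @ alpha + alpha0)
print("training accuracy:", np.mean(predictions == y))
```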
A few obvious questions show up here. Why should we prefer the Euclidean norm over any other norm? Which norms do you know from your math lectures besides the L2 norm, the Euclidean norm? Have you heard about the Manhattan norm, the Manhattan distance? No? Yes? There are different norms: the L1 norm, the L-infinity norm, the Lp norms — there are many norms. So who says that for pattern recognition applications the L2 norm is the right thing to do? I can give you an answer: it is the right thing to do if you can only solve quadratic optimization problems. If you are unable to solve any other optimization problem, you will stick to the L2 norm, and that is why we have to discuss optimization problems and optimization algorithms. Now we will switch to other norms and see that they have clear advantages. Will different norms lead to different results? Hopefully. And in which situations are certain norms better than others? For instance, with the L2 norm you measure the Euclidean length. So in a situation like this you have X theta minus y, and this is its L2 norm. Call this the residual — the difference vector — and if the fit is not perfect, the residual vector should at least be small. The L2 norm is built from the squares of its components, and if the entries are small, between 0 and 1, what does the square do? It makes them smaller. So for small deviations the quadratic penalty is doing a good job, but if a deviation is larger than one, the square increases it. Large deviations from a residual of zero are penalized much more strongly, and that causes a problem. Think about the following situation — do we have medical engineers here? No — at least one, sorry. Okay, so let's use the philosophy student; he is joining a class in pattern recognition. We do measurements — and I don't want to be offending, Reinhardt — so: the computer science student is measuring, the electrical engineering student is measuring, the — what else — multimedia student is measuring, the mechanical engineering student is measuring, and now the philosophy student is measuring. And you want to fit a straight line into that. The point is, these ones are doing pretty well, but here we have a large deviation, and due to this type of residual the large deviation will have a huge impact on the final result. This point will get way more attention than it is worth — this philosophy student gets way more attention than he is worth — and the line will move down a little bit, so you get a systematic error here. It would be nice to have a method that says: these ones are nice, cool, and this one I ignore. That would be the right penalty function, and that is something you have to study: the robustness with respect to outliers.
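To make the outlier argument concrete, here is a hedged sketch comparing a plain least-squares (L2) line fit with an approximate L1 fit. The L1 fit uses iteratively reweighted least squares, which is not from the lecture but one common way of dealing with the absolute values, and the data are invented.

```python
# One bad measurement pulls a least-squares (L2) line fit much more than an
# L1 fit.  The L1 fit is approximated with iteratively reweighted least
# squares (IRLS); data and numbers are illustrative toy values.
import numpy as np

x = np.arange(6, dtype=float)
y_true = 1.0 + 0.5 * x
y = y_true.copy()
y[5] -= 4.0                                  # the "philosophy student": a large outlier

A = np.column_stack([x, np.ones_like(x)])    # design matrix for slope and intercept

# L2 fit
theta_l2, *_ = np.linalg.lstsq(A, y, rcond=None)

# approximate L1 fit via IRLS with weights 1/|residual|
theta_l1 = theta_l2.copy()
for _ in range(50):
    r = A @ theta_l1 - y
    w = 1.0 / np.maximum(np.abs(r), 1e-6)
    theta_l1 = np.linalg.solve(A.T @ np.diag(w) @ A, A.T @ np.diag(w) @ y)

print("true slope/intercept :", 0.5, 1.0)
print("L2 fit               :", np.round(theta_l2, 3))   # dragged towards the outlier
print("approximate L1 fit   :", np.round(theta_l1, 3))   # stays close to the clean line
```

The L2 slope and intercept are pulled noticeably towards the outlier, while the L1 fit stays close to the clean line — exactly the systematic error described above.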
Good. I know I'm very offensive — I mean, I did not say these are the boys and that one was the girl; I didn't say that, right? I did that once and I won't do it again — I bet you will remember this. No, I didn't say it, not this time. The other point is, I want to have fun while teaching, so I do things like that. So: which linear decision boundary is the best one? Should we use logistic regression? Should we use this approach, and instead of the L2 norm the L1 norm? Should we use the idea with the Gaussians that share the covariance matrix? What should we do to get the best possible linear decision boundary? Interesting questions. And can we incorporate some prior knowledge into linear regression, can we put some prior knowledge into the regression problem? What is prior knowledge? We know that the Bayesian classifier uses prior knowledge, because we can decompose the posterior probability into a product of the prior with the class-conditional density, divided by the evidence — you remember. So can we put some prior knowledge in here? So many interesting questions. And now I have to talk about ridge regression and the Lasso. These are two methods that have been well known in pattern recognition for many years, and if you look into the computed tomography community you will find out that they invented compressed sensing like seven years ago, which is actually nothing else but the Lasso that pattern recognition people have been doing for many years already. So what is the idea of ridge regression? Ridge regression, or so-called regularized regression, does the following: it extends the objective function. The objective function before was basically X theta minus y — the residual — and we want to minimize that, and we don't throw it away. Do we have to share? I mean, friends? Or is it gone already? Okay, okay, so I don't want to share. So, is the medication meant to make you resist the boring lecture, or what is its purpose? The bad weather? Well, the sun is shining today. So we extend the objective function here — objective function, that's "Zielfunktion" in German — and we enforce a certain length of this parameter here. What we can do, for instance, is penalize the length of this vector. What is theta transposed theta? That is just the squared Euclidean length of the vector. We can also translate this into the language of statistics or probability theory, where we say that our theta has to fulfill a Gaussian property with zero mean and a covariance matrix with one over lambda on the diagonal, where lambda is a constant factor, a Lagrangian factor. So what do we do? We say, okay, our theta has to follow a Gaussian distribution here — well, this has to be shifted upwards a little bit, the area has to be one — but we expect our theta to be normally distributed; we require it to be normally distributed. And if you look at that, I can also rewrite this: this is (X theta minus y) transposed (X theta minus y), and in between I can put the identity matrix, which plays the role of a covariance matrix. So this is basically the exponent of a Gaussian density with the identity matrix as covariance. I have the probability of observing X and y — the residual is Gaussian distributed — plus, and then I can use a prior on the parameter I am looking for: plus the logarithm of the Gaussian that defines the theta. And you have basically a Bayesian estimator for theta, where the conditional density of the observations given theta is a Gaussian, and theta itself follows a normal distribution. So things are rather linked here, and you can write it this way: you want to minimize the residual plus lambda times the length of theta, which should be part of the optimization process. You minimize this function by using gradients — you can also write here, instead of theta, the term to the minus one and optimize that — so you compute the derivative with respect to theta. How do you do that? You look it up in the Matrix Cookbook, and then you basically end up with this estimator.
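Written out as formulas, the objective, its Bayesian reading, and the resulting estimator described above are (as a reconstruction of the blackboard, with lambda the Lagrangian weight):

```latex
% Reconstruction of the ridge regression objective, its MAP interpretation,
% and the closed-form estimator described in the lecture.
\min_{\theta}\; \|X\theta - y\|_2^2 + \lambda\,\theta^{\top}\theta
\quad\Longleftrightarrow\quad
\max_{\theta}\; \log \mathcal{N}\!\left(y \mid X\theta,\, I\right)
              + \log \mathcal{N}\!\left(\theta \mid 0,\, \tfrac{1}{\lambda} I\right),
\qquad
\hat{\theta} = \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y .
```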
So if you have a constraint on the length of theta, weighted by a lambda, this lambda basically affects the estimator — the pseudoinverse — through the Gram matrix that has to be inverted: you add an identity matrix, and the more weight you put on the length of theta, the more this matrix will look like a diagonal matrix. Very interesting: the more you increase the lambda, the weight on the length of this vector, the more this tends towards a diagonal matrix. So introducing a constraint into a least-squares estimator, where you use the L2 norm here and the L2 norm there, is just a modification of the X transposed X matrix: you add a diagonal matrix whose diagonal elements are weighted by the Lagrangian factor. And what a Lagrangian factor is, if you haven't seen that in engineering mathematics, we will do two weeks from now. Okay, so very cool, very cool. And at home — it's important, consider it homework — verify this formula. It's not difficult, it's a little massaging of equations. Okay, good.
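A small numerical sketch of that homework, using random toy data: the ridge estimator reduces to the pseudoinverse solution for lambda equal to zero and shrinks theta as lambda grows.

```python
# Verify numerically that (X^T X + lambda I)^{-1} X^T y matches the
# pseudoinverse solution for lambda = 0 and shrinks theta as lambda grows.
# The data are random toy values, not from the lecture.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("pseudoinverse:", np.round(np.linalg.pinv(X) @ y, 4))
for lam in (0.0, 1.0, 10.0, 1000.0):
    theta = ridge(X, y, lam)
    print(f"lambda={lam:7.1f}  theta={np.round(theta, 4)}  length={np.linalg.norm(theta):.4f}")
```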
So if I talk about ridge regression, what I am doing is minimizing the residual while enforcing the vector that I am estimating to have, basically, unit length — that is the message. How does that look in 2-D? Let's look at the following situation; I have told you several times that it is very important that you build up a mental model of what is going on. Here I have theta one and theta two — that's why I have two components — I search in a two-dimensional parameter space, and I want to find a theta such that the residual is minimal. Let's say the optimal theta is here. Can somebody tell me how the curve looks for the residual having Euclidean norm one? How does the level set look — which thetas have a residual of unit length here? Well, that depends on the covariance of this residual; basically you get some elliptic structure, right? So this here is for one, this here is for two, this here is for three. These are the level sets, level sets where I say the L2 norm of X theta minus y equals some constant c. Plot this: take a system of linear equations, let's take two equations for two unknowns, and plot these level sets — you will get these elliptic structures. Why is it an elliptic structure? Because of the covariance of the residual; we have seen this, we can associate it with a normal density, and you end up with these elliptic structures for the level sets. And if you now require the solution to have unit length, what does that mean? It means that the solution, the theta we are looking for, is a vector of length one, so the solution has to lie on the circle here. If I look for a theta that has unit length — if my pen here has length one — then all the candidate solutions are on this unit circle. And now, if I want to minimize the residual and enforce the solution to have unit length, the solution will be the intersection, the point where the ellipse — oh no, now it's getting worse, what do I do? Save. And now, let's see if I will be able to fix that — save. So I have this here, I have a circle, this has unit length, and this here is the solution: the point that has the minimum residual and length one. That's the one we are looking for. And now you ask me: why does he draw these figures? It becomes interesting when we replace the L2 norm by another norm, because then the vectors that have unit length look different. Let's look at this on the next slide. So, once again: is there anybody in the audience who cannot read this figure? It is a very important figure. Let's think about this system of equations: X theta minus y has to be zero. Let's think about the situation where we are in 2-D, so theta has two dimensions, theta one and theta two. Then for some theta this equation is fulfilled, or close to being fulfilled; that point is not necessarily zero — it's a point somewhere. Okay, and then we can draw the lines where the residual is constant. We can ask, for instance: for which thetas does the Euclidean length of X theta minus y equal 5? How does this function, this manifold, this level set look? Well, you draw the figure where the residual is five, then you draw the figure where it is six, and so you can draw these level sets for different values of c. What we are looking for is a point that lies on the unit ball of the Euclidean norm — all the vectors that have unit length — and that lies on the level set with the lowest possible residual. That is this point here, and that is exactly what we solve when we do ridge regression; that is the point we look for. Okay, got that? Roughly? Roughly, okay, good. What is the Lasso doing? And now it starts to become rather interesting. The Lasso uses the L1 norm. What is the L1 norm of a vector x? For instance, Niels? No? No idea? It's the sum — the sum over the x_i. Who agrees with that? Who is unhappy with this notation? Okay, how can we fix it? The absolute values, yes. And now you know why people hate to use the L1 norm: if you have to optimize this and you have absolute values in there — you remember the fifth year in high school, you basically had a mental breakdown when you saw absolute values, right? And if you think about absolute values, you think about functions with a kink, for instance. And then we know: if you want to compute the gradient or the derivative there, your math teacher was telling you no, impossible, and blah-blah-blah. So these optimization problems appear hard, and I have to say that 15 years ago people tried to avoid optimization problems with this type of kinked function. Nowadays we take it easy, because we know how to deal with these things; there was a lot of innovation in the field of optimization to deal with this type of problem. Now let's draw the figure again — that's the last thing I want to do for today. Let's draw the figure for this situation. We have here theta one and theta two, and the question is: if the L1 norm of theta is required to be one, what is the geometry? Before, we had theta one squared plus theta two squared equal to one, so it was a circle. Now we have the absolute value of theta one plus the absolute value of theta two equal to one. How does the figure look? It is no longer a circle but a diamond, right? In this quadrant you get theta one plus theta two equals one, so theta two is minus theta one plus one, and now you have to think about what happens in this, this, this and that quadrant and resolve the absolute values; but basically you end up with this type of diamond structure. That is the set of all vectors whose L1 norm is one. Okay? Okay. Klemens — the maximum norm? Homework. It's rotated by 90 degrees — 45, yes; well, you can rotate it by 90 or backwards — it's interesting, if you watch the videos, I think I made exactly the same mistake 15 years ago; that's my limited brain power.
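For reference, the unit balls mentioned here, written out in 2-D (a reconstruction of what is drawn on the slide, not a verbatim formula from it):

```latex
% The norms mentioned above and their unit balls in 2-D.
\|\theta\|_2 = \sqrt{\theta_1^2 + \theta_2^2} \;(\text{circle}), \qquad
\|\theta\|_1 = |\theta_1| + |\theta_2| \;(\text{diamond}), \qquad
\|\theta\|_\infty = \max\!\left(|\theta_1|, |\theta_2|\right) \;(\text{axis-aligned square}).
```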
But now the interesting thing happens. Let's look at the ellipses — that's the solution we're looking for — and here are our level sets: residual is one, residual is two, residual is five. Now let's see where this first touches. It first hits here, and it is most probable that you hit one of the four corners; it rarely happens that you hit one of the connecting edges. It is most probable that you hit one of the corners, and as a consequence the resulting estimate you get for theta one and theta two is one where one component is non-zero and the other one is zero. So you get sparse results — sparse results, meaning you get a lot of zero entries. So if you use the Lasso, with the L1 regularization, you most probably get solutions that, in high-dimensional spaces, have a lot of zeros. And that is what CT people do in computed tomography: they try to do a reconstruction with as many zeros as possible, if you just want to reconstruct vessels or something like that, and they call it compressed sensing. Now, if you talk to medical physicists, they have compressed sensing — and what is compressed sensing? It's just an L1-regularized least-squares estimator, which is the Lasso, which pattern recognition people have used for a long time. But that also shows that if you are in between fields, and you cross — or you think you cross — the borderline of certain fields, you can come up with quite smart solutions. So, I have to admit that this level set is not so convincing, but I think you trust me and you are eager to show up again tomorrow. Thank you very much, and we quit here. [MUSIC]
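As a closing sketch, assuming toy data with only two non-zero true coefficients, the Lasso's sparsity can be reproduced with iterative soft-thresholding (ISTA), a standard proximal method for the "kink" mentioned earlier; it is not derived in the lecture, just one common way to solve the L1-regularized least-squares problem.

```python
# Lasso sparsity via iterative soft-thresholding (ISTA) on toy data:
# only 2 of 10 true coefficients are non-zero, and the L1 penalty
# recovers mostly exact zeros, unlike the plain least-squares solution.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
theta_true = np.zeros(10)
theta_true[[2, 7]] = [3.0, -2.0]
y = X @ theta_true + 0.1 * rng.normal(size=80)

lam = 10.0
step = 1.0 / np.linalg.norm(X, ord=2) ** 2        # 1/L, L = largest eigenvalue of X^T X

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(10)
for _ in range(2000):
    grad = X.T @ (X @ theta - y)                  # gradient of the smooth residual term
    theta = soft_threshold(theta - step * grad, step * lam)

print("lasso estimate:", np.round(theta, 3))      # mostly exact zeros: a sparse result
print("least squares :", np.round(np.linalg.pinv(X) @ y, 3))
```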