12 - Learning to benchmark (A. Hero, University of Michigan, USA) [ID:35829]

That will become clear in a minute.

So, you can also see down here.

There is a Python script that implements the talk I'm giving today, so if you get bored with my talk, just click on that link. Assuming that you have a Google Colaboratory login, you'll be able to run it, and if not, it takes about five minutes to get the free registration for Google Colaboratory.

So, I need to acknowledge students, former students, and postdocs, many of whom have been critical in the development of the theory and the applications of this work. Probably Kumar Sricharan, who was a student of mine and graduated in the 2000s, was the originator of the central idea: ensemble averaging of very poor approximations to obtain a very strong approximation by virtue of averaging.

And that has been developed in many different directions, including the one I'm talking about today.

So I'm going to first talk about what I mean by benchmark.

And why we might be interested in a benchmark, and then I'm going to talk about learning information divergence. "Learning" here, all it means is approximation.

So we're going to use a norm, a metric on a Hilbert space of random variables, and then learning ensembles for accelerating the process of information divergence approximation.
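For reference, since an information divergence here just means a measure of dissimilarity between probability distributions that we want to approximate from data, most such measures take the standard f-divergence form (a generic textbook definition given for context, not necessarily the exact family used later in the talk):

D_f(p \,\|\, q) = \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, dx, \qquad f \text{ convex},\ f(1) = 0,

which recovers, for example, the Kullback-Leibler divergence when f(t) = t \log t.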

And then I'll discuss some applications including a deep network application, which I hope you'll find interesting.

There are lots and lots of methods that have been applied to classify labeled data. Here we have three labels, the blue, the red, and the green labels of the data, in this case in a two-dimensional feature space.

So here d equals 2; d is going to denote the dimension.

That will be a variable in the talk. The k-nearest-neighbor classifier takes a new point, x_i, and maps it through a function, and that function is basically a k-nearest-neighbor majority vote function.

It assigns to the new point x_i the majority label among its k nearest neighbors in the training set. Right, that's the k-nearest-neighbor classifier: a very, very simple mapping.
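As a concrete illustration, here is a minimal from-scratch sketch of that majority vote rule in Python (a toy example of mine, not the Colaboratory script that accompanies the talk):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distances from the new point x_new to every training point
    dists = np.linalg.norm(X_train - np.asarray(x_new), axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two-dimensional features (d = 2) with three labels
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2)) + np.repeat([[0, 0], [3, 0], [0, 3]], 50, axis=0)
y = np.repeat(np.array(["blue", "red", "green"]), 50)
print(knn_predict(X, y, x_new=[2.5, 0.5], k=7))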

There are also more complicated mappings, such as the support vector machine, which embeds a discrimination function in a linear space; the discrimination function is made nonlinear by using a kernelization method.

And that was a very standard, very popular method until probably 15 years ago, when methods like this came up: the convolutional neural network and other types of neural networks that have even more complicated mappings.

So the question is: okay, we have all of these different algorithms for assigning a label to a new point, a new data sample in a feature space, based on a training set.

How do we know which one to use?

And how do we know when we should keep on searching for a better mapping?

So, that sort of illustrates my point here.

A couple of years ago yet another deep neural network was proposed, called MothNet, by Charles (Charlie) Delahunt at the University of Washington.

There was a fair amount of interest in this MothNet. Basically, instead of working on the classic neural architecture of the perceptron, it introduces the sensory system of a moth into the architecture. Then, as a function of the number of training samples, or what are called mini-batches in machine learning, that is, the total number of training examples available, they plot accuracy, the accuracy being the average rate of correct labeling.

And they achieved this red line up here. And there are all these other methods, including nearest neighbors, convolutional neural nets, SVMs, and so forth.

And so the question is: well, we have this red line, and it's good, it's above 70%, and when we train to the limit we can get somewhere around 85 to 90%, but is that the best we can do?

What is the best achievable accuracy of any of these mappings, without having to actually build the mapping in advance? We have an answer to that.

And that's what benchmarking is: benchmarking is to learn what's called the Bayes optimal achievable rate of accuracy directly from the data. That way you can compare your particular classification algorithm to a benchmark,

and have an absolute sense of when you're close enough, when you still have to work hard, and when you might want to change the architecture to something more complicated or simpler or what have you.
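To pin down the quantity being benchmarked: the Bayes optimal error, one minus the Bayes optimal accuracy, can be written in the standard textbook form (stated here just for context):

E^* = 1 - \mathbb{E}_X\!\left[\max_{c} P(Y = c \mid X)\right],

the error of the rule that assigns each feature vector X to its most probable class; no mapping, however complicated, can achieve a lower error.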

So, this sounds like a pretty ambitious problem.

Maybe even an impossible problem.

And in one sense, you're right, and I'll tell you what that sense is in a minute. But it has been at least 40 or 50 years since the theory of k-nearest-neighbor performance established that in fact the Bayes error is bracketed, sandwiched. That's E*,

the Bayes error, the minimal probability of error of any classifier structure. It is sandwiched between two functions of the empirical error rate of the k-nearest-neighbor classifier.

And if that empirical error rate is equal to one half, and here we have two classes in this case, that means that the Bayes error rate is one half too. So, if the k-nearest-neighbor classifier can't do it, no one can do it.

That's basically what it says. The caveat is that this is asymptotic, as n goes to infinity.
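To make the sandwich concrete, here is the classical Cover-Hart form of such a bracket, stated for the 1-nearest-neighbor rule with two classes (the bracket on the slide is for the k-NN rule, so the exact functions may differ, but the structure is the same):

E^* \le R_{\mathrm{NN}} \le 2 E^* (1 - E^*), \quad\text{which rearranges to}\quad \tfrac{1}{2}\left(1 - \sqrt{1 - 2 R_{\mathrm{NN}}}\right) \le E^* \le R_{\mathrm{NN}},

where R_NN is the asymptotic nearest-neighbor error rate. In particular, R_NN = 1/2 forces E* = 1/2, which is exactly the "if the nearest neighbor can't do it, no one can" statement.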

It is possible. Here's one approximation, if you like, a bound approximation to that error rate, that benchmark.

But there are some negative results, from the 90s. Yes, question? At this point, where you say k-nearest neighbor, what is k? I mean, is there a number k? I don't see the k in the formula here.

Yeah, so k goes into this error.

Yeah, so the empirical rate: if you change k you will get a different rate, and you'll find that as k varies, the rates will vary in this particular theory.

So any k is fixed. And this holds for any k, asymptotically: as you let the number of training samples get large, the k-nearest-neighbor classifier has an error which arbitrarily closely approximates the Bayes error when that empirical error converges to one half. Now, when it's not one half, this just gives you a bracket, a bound, right.

So you don't know what the Bayes error is; you know it's in between these two numbers.

No, I mean, is there a rule for choosing this optimal k? Does it depend on the data set we're considering, or is there a general rule?

Usually it's chosen by cross-validation. So, you go through the data set, you build a bunch of these mappings by leaving out different random instances, and then you look at the classification performance on the instances that you left out;

you do that over and over again, and then you choose the k that gives you the best average error.
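A minimal sketch of that kind of selection, here using scikit-learn with synthetic data of my own choosing (any cross-validation routine would do):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, just to keep the example self-contained
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Estimate the held-out error for each candidate k and keep the best one
candidate_ks = range(1, 26, 2)
cv_error = {k: 1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                     X, y, cv=10).mean()
            for k in candidate_ks}
best_k = min(cv_error, key=cv_error.get)
print(best_k, cv_error[best_k])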

So there are many approaches. I mean, that's an open problem mathematically: there doesn't really exist a theory for choosing the optimal k that is general and applies to a wide class of distributions.

The theorem that I'm giving here is a negative result. It applies when the class of distributions is not constrained at all, so there's no smoothness, there's nothing.

No properties of the class: there can be an arbitrary number of atoms in the distribution, so it need not be continuous.

It could be a fractal distribution. Then, this theorem is a very strong result that says that there always exists some distribution, no matter what type of mapping you choose.

Part of a video series
Access: Open access
Duration: 01:14:44 min
Recording date: 2021-07-14
Uploaded: 2021-07-16 13:36:04
Language: en-US
