
Fundamentals of Machine Learning

Support Vector Machines Made Easy

0713
2020
978-3-8385-5251-4
978-3-8252-5251-9
UTB 
Floris Ernst
Achim Schweikard

Artificial intelligence will change our lives forever - both at work and in our private lives. But how exactly does machine learning work? Two professors from Lübeck explore this question. In their English-language textbook they teach the basics needed to work with Support Vector Machines, for example by explaining linear programming, Lagrange multipliers, kernels and the SMO algorithm. They also cover neural networks, evolutionary algorithms and Bayesian networks. Definitions are highlighted in the book, and exercises invite readers to think along. The textbook is aimed at students of computer science, engineering and the natural sciences, especially in the fields of robotics, artificial intelligence and mathematics.

<?page no="0"?> Floris Ernst | Achim Schweikard Fundamentals of Machine Learning Support Vector Machines Made Easy <?page no="1"?> Eine Arbeitsgemeinschaft der Verlage Böhlau Verlag · Wien · Köln · Weimar Verlag Barbara Budrich · Opladen · Toronto facultas · Wien Wilhelm Fink · Paderborn Narr Francke Attempto Verlag / expert verlag · Tübingen Haupt Verlag · Bern Verlag Julius Klinkhardt · Bad Heilbrunn Mohr Siebeck · Tübingen Ernst Reinhardt Verlag · München Ferdinand Schöningh · Paderborn transcript Verlag · Bielefeld Eugen Ulmer Verlag · Stuttgart UVK Verlag · München Vandenhoeck & Ruprecht · Göttingen Waxmann · Münster · New York wbv Publikation · Bielefeld utb 5251 <?page no="3"?> Floris Ernst | Achim Schweikard Fundamentals of Machine Learning Support Vector Machines Made Easy UVK Verlag · München <?page no="4"?> Autoren Prof. Dr. Floris Ernst und Prof. Dr. Achim Schweikard lehren KI (Künstliche Intelligenz) und Robotik an der Universität Lübeck. Authors Prof. Dr. Floris Ernst and Prof. Dr. Achim Schweikard teach AI (artificial intelligence) and robotics at the University of Lübeck. Umschlagabbildung: © 4X-image - iStock Bibliografische Information der Deutschen Bibliothek Die Deutsche Bibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über http: / / dnb.ddb.de abrufbar. 1. Auflage 2020 © UVK Verlag 2020 - ein Unternehmen der Narr Francke Attempto Verlag GmbH + Co. KG Dischingerweg 5 · D-72070 Tübingen Das Werk einschließlich aller seiner Teile ist urheberrechtlich geschützt. Jede Verwertung außerhalb der engen Grenzen des Urheberrechtsgesetzes ist ohne Zustimmung des Verlags unzulässig und strafbar. Das gilt insbesondere für Vervielfältigungen, Übersetzungen, Mikroverfilmungen und die Einspeicherung und Verarbeitung in elektronischen Systemen. Internet: www.narr.de eMail: info@narr.de Einbandgestaltung: Atelier Reichert, Stuttgart CPI books GmbH, Leck utb-Nr. 
ISBN 978-3-8252-5251-9 (Print)
ISBN 978-3-8385-5251-4 (ePDF)

Contents

Preface 7

Part I | Support Vector Machines
1 Symbolic Classification and Nearest Neighbour Classification 11
1.1 Symbolic Classification 11
1.2 Nearest Neighbour Classification 11
2 Separating Planes and Linear Programming 15
2.1 Finding a Separating Hyperplane 16
2.2 Testing for feasibility of linear constraints 16
2.3 Linear Programming 19
MATLAB example 22
2.4 Conclusion 24
3 Separating Margins and Quadratic Programming 25
3.1 Quadratic Programming 25
3.2 Maximum Margin Separator Planes 30
3.3 Slack Variables 32
4 Dualization and Support Vectors 37
4.1 Duals of Linear Programs 37
4.2 Duals of Quadratic Programs 38
4.3 Support Vectors 40
5 Lagrange Multipliers and Duality 43
5.1 Multidimensional functions 46
5.2 Support Vector Expansion 47
5.3 Support Vector Expansion with Slack Variables 48
6 Kernel Functions 53
6.1 Feature Spaces 53
6.2 Feature Spaces and Quadratic Programming 54
6.3 Kernel Matrix and Mercer's Theorem 58
6.4 Proof of Mercer's Theorem 59
Step 1 - Definitions and Prerequisites 59
Step 2 - Designing the right Hilbert Space 61
Step 3 - The reproducing property 63
7 The SMO Algorithm 67
7.1 Overview and Principles 67
7.2 Optimisation Step 68
7.3 Simplified SMO 69
8 Regression 79
8.1 Slack Variables 80
8.2 Duality, Kernels and Regression 81
8.3 Deriving the Dual form of the QP for Regression 82

Part II | Beyond Support Vectors
9 Perceptrons, Neural Networks and Genetic Algorithms 89
9.1 Perceptrons 89
Perceptron-Algorithm 91
Perceptron-Lemma and Convergence 92
Perceptrons and Linear Feasibility Testing 93
9.2 Neural Networks 94
Forward Propagation 95
Training and Error Backpropagation 96
9.3 Genetic Algorithms 98
9.4 Conclusion 99
10 Bayesian Regression 101
10.1 Bayesian Learning 101
10.2 Probabilistic Linear Regression 103
10.3 Gaussian Process Models 108
10.4 GP model with measurement noise 111
Optimization of hyperparameters 112
Covariance functions 113
10.5 Multi-Task Gaussian Process (MTGP) Models 114
11 Bayesian Networks 121
Propagation of probabilities in causal networks 131
Appendix - Linear Programming 139
A.1 Solving LP0 problems 140
A.2 Schematic representation of the iteration steps 146
A.3 Transition from LP0 to LP 148
A.4 Computing time and complexity issues 150
References 151
Index 153

Preface

When the first methods for machine learning emerged in the 1970s, it quickly became apparent that the algorithms depended very much on the target field of the application. Examples of such fields were robotics, medical diagnosis, speech recognition, image understanding, chess systems and geometric reasoning. For each application, a new set of methods emerged. This gave rise to the question of whether one could find more general principles, potentially applicable to all domains, and yet powerful enough to solve relevant problems in end-user applications. More recently, when faster computers and large data sets became available, such principles were found. As a first example, classification and regression with support vector machines was successfully applied to a large range of applications. However, the basic methods still must be customized to an application. To do so, users must understand the basic mathematical principles underlying machine learning. The goal of our book is to introduce the methods for machine learning precisely for this purpose: to provide an accessible, yet sufficiently complete basis for machine learning, which will enable users to develop their own methods for new fields. Our book was written for a one-semester course in machine learning at TU München and the University of Lübeck. Students are graduates in Computer Science, Electrical Engineering, Robotics, or Applied Mathematics.

Part I | Support Vector Machines

1 Symbolic Classification and Nearest Neighbour Classification

1.1 Symbolic Classification

Suppose we wish to classify unknown data by computer. Thus, we present data strings describing molecules to the computer and ask: does this molecule have a specific function in biology? Likewise, the data could also be attributes describing symptoms of diseases instead of molecules. The attributes could then be fever, headache, nausea, etc., and we ask whether a specific disease is present or not. In both cases, the goal is to classify positive and negative samples. There are two phases: we first present data strings which have already received a classification label from a human. This first phase will be called the training phase. We then have a production phase, where we present unclassified data to the system. If the training algorithm works, the system should then be able to decide whether the new data string is a positive or a negative sample. Thus, we have the following situation:

Training Phase
Input: Sample objects, marked as positive or negative examples.
Goal: To learn which properties or attributes distinguish positive from negative examples. Production Phase Input: New sample object, without label. Output: Label of the new object (states whether it is a positive or negative sample). A second goal could be: Find rules from facts. We consider again training samples. Each sample is given as a vector with attributes. The samples are either positive or negative, and each attribute can take attribute values. 1.2 Nearest Neighbour Classification Suppose now, each of our samples has two values: the size and the weight of an object. Clearly, the size and the weight of an object are correlated. Suppose further, we have two types of objects. Our objects are either metal or wooden. Marking metal with a ‘ + ’, and wooden with a ‘ − ’, we can plot our data. The graph may look like this: <?page no="12"?> 12 Fundamentals of Machine Learning Figure 1 - Example of two classes of objects, plotted by size and weight The idea in nearest neighbor classification is the following: Given an unknown point, classify it in the same way as its nearest (known) neighbor point in space. To do so, we must find the nearest neighbor of a point 𝑈𝑈 . Here, 𝑈𝑈 is the new, unclassified sample. To do this, we have to perform the following steps: ➊ Let 𝑈𝑈 be the new, unclassified sample ➋ Define a metric 𝑑𝑑(𝑈𝑈, 𝑋𝑋) , the distance of the points 𝑈𝑈 and 𝑋𝑋 in some space ➌ Find the nearest neighbor 𝑁𝑁 of a point 𝑈𝑈 using 𝑑𝑑(𝑈𝑈,⋅) ➍ Assign the class of 𝑁𝑁 to 𝑈𝑈 If we again look at Figure 1 and add an unknown sample, marked ∗ , we will arrive at the following: Figure 2 - The same samples as in Figure 1, with a new, unknown sample (marked ∗ ) + + + + + + − − − − − − size weight + + + + + − + − weight − − − − size ∗ <?page no="13"?> Support Vector Machines Made Easy 13 Now, using the metric 𝑑𝑑 as the Euclidean distance, we will find - looking at the dashed region from Figure 2 - the following: Figure 3 - Subset of Figure 2, showing the distances 𝑑𝑑 between the new sample and the nearest old samples This gives rise to computing the distances 𝑑𝑑 1 through 𝑑𝑑 5 as: 𝑑𝑑 1 = 3.12 𝑑𝑑 2 = 1.77 𝑑𝑑 3 = 4.94 𝑑𝑑 4 = 4.06 𝑑𝑑 5 = 3.43 Since the value 𝑑𝑑 2 is lowest, the corresponding class ( + ) is assigned to the new sample ∗ . 𝑑𝑑 1 𝑑𝑑 2 𝑑𝑑 3 𝑑𝑑 4 𝑑𝑑 5 + + − + − − − ∗ + <?page no="15"?> 2 Separating Planes and Linear Programming In the previous chapter, we looked at sample data represented as data points in space. As one way of classifying an unknown sample, we suggested to find its nearest neighbor. But there is another way to classify unknown data: Suppose again the sample (training) data are a cloud of points. For each point, we know whether the point is a positive or negative sample. Now, find a plane in space which separates positive from negative samples. New samples are then classified according to the side of the plane they are on. The following figure illustrates the situation: Figure 4 - Classification example from Figure 1, including a separating line 𝑃𝑃 and an unknown sample 𝑈𝑈 Here, we have the same situation as before in Figure 1. Again, 𝑈𝑈 is an unknown sample. The training data points are marked by ‘ + ’ and ‘ − ’. In this new situation, we have added a line 𝑃𝑃 that separates the two classes. In general, we will not call this a line but rather a separating (hyper)plane to indicate that we do not care about the dimensionality of the underlying space. Then, new samples are classified according to the side of the separating hyperplane they lie on. 
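Once such a separating line is known, the classification step itself is a single sign test. The following MATLAB sketch is not taken from the book; the line parameters m and c and the coordinates of the new sample are made up purely for illustration:

% Classify a new 2D sample by the side of a known separating line y = m*x + c
m = -1; c = 9;              % assumed (made-up) parameters of the separating line
u = [4.5; 3.0];             % new, unclassified sample (size, weight)
if u(2) > m*u(1) + c        % sample lies above the line
    label = '+';
else                        % sample lies on or below the line
    label = '-';
end
disp(['Assigned class: ' label])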
Characteristic for methods of this type are the running times: ▶ The classification time for a new sample is constant ▶ Preprocessing takes polynomial time Now, however, the obvious question is: how do we compute this separating hyperplane? + + + + + + − − − − − weight − 𝑈𝑈 size 𝑃𝑃 <?page no="16"?> 16 Fundamentals of Machine Learning 2.1 Finding a Separating Hyperplane We consider linear inequalities. An example is the following inequality: 𝑥𝑥 1 + 𝑥𝑥 2 ≤ 10 There are two variables 𝑥𝑥 1 , and 𝑥𝑥 2 . More generally, we allow for variables to have coefficients. Thus, the general form of a linear inequality is: 𝑎𝑎 1 𝑥𝑥 1 + 𝑎𝑎 2 𝑥𝑥 2 + ⋯ + 𝑎𝑎 𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 The coefficients are the constants 𝑎𝑎 𝑖𝑖 and 𝑏𝑏 . We will call the inequalities constraints. Our goal is to test whether a given set of linear constraints admits a (feasible) solution. This problem is called Linear Programming. 2.2 Testing for feasibility of linear constraints Suppose we are given a system of inequalities: 𝑥𝑥 1 + 𝑥𝑥 2 ≤ 10 𝑥𝑥 1 ≥ 5 𝑥𝑥 2 ≥ 4 (1) There are two variables, hence this system can be visualized in 2D: Figure 5 - A set of three linear inequalities in 2D space It is clear that, in this example, the three hyperplanes overlap, i.e. there is a region in 2D space containing points which satisfy all the constraints given in Eq. (1) . We will call such a system of constraints feasible. 𝑥𝑥 2 ≥ 4 𝑥𝑥 1 ≥ 5 5 10 10 4 solutions <?page no="17"?> Support Vector Machines Made Easy 17 However, not every system is feasible. An example for an infeasible linear constraint system is the following: 𝑥𝑥 1 + 𝑥𝑥 2 ≤ 10 𝑥𝑥 1 ≥ 5 𝑥𝑥 2 ≥ 6 (2) This system is visualized in the following figure: Figure 6 - A system of infeasible linear equations In this example, we see that there is no region in space which satisfies all constraints from Eq. (2) because the individual hyperplanes do not overlap. This system is infeasible. Using our notation, a general system of inequalities with 𝑛𝑛 variables has the following form: 𝑎𝑎 11 𝑥𝑥 1 + 𝑎𝑎 12 𝑥𝑥 2 + ⋯ + 𝑎𝑎 1𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 1 ⋮ 𝑎𝑎 𝑚𝑚1 𝑥𝑥 1 + 𝑎𝑎 𝑚𝑚2 𝑥𝑥 2 + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 𝑚𝑚 This system can also be written in matrix form as: 𝐴𝐴𝑥𝑥 ≤ 𝑏𝑏 or (𝐴𝐴𝑥𝑥) 𝑖𝑖 ≤ 𝑏𝑏 𝑖𝑖 Notice that 𝑚𝑚 and 𝑛𝑛 typically are distinct! The problem of testing whether the above general system of inequalities admits a solution will be called linear feasibility test. We will use the abbreviation (F) to denote this test. There is a close connection between a test of the above form (F) and the problem of finding a separating plane. 5 10 10 6 𝑥𝑥 1 ≥ 5 𝑥𝑥 2 ≥ 6 <?page no="18"?> 18 Fundamentals of Machine Learning We consider the following 2D example: We are given two sample points: (𝑎𝑎 1 , 𝑏𝑏 1 ) is a positive sample (𝑎𝑎 2 , 𝑏𝑏 2 ) is a negative sample Our goal is to find a line with equation 𝑦𝑦 = 𝑢𝑢 ⋅ 𝑥𝑥 + 𝑣𝑣 , such that (𝑎𝑎 1 , 𝑏𝑏 1 ) is above the line and (𝑎𝑎 2 , 𝑏𝑏 2 ) is below. Hence, we are looking for values 𝑢𝑢 , 𝑣𝑣 such that 𝑏𝑏 1 ≥ 𝑢𝑢 ⋅ 𝑎𝑎 1 + 𝑣𝑣 𝑏𝑏 2 ≤ 𝑢𝑢 ⋅ 𝑎𝑎 2 + 𝑣𝑣. Notice that each sample point determines an inequality. Now, we will have to make use of a new concept, called dualization. Dualization is a process where points become lines and lines become points. Thus, a point (𝑎𝑎, 𝑏𝑏) becomes a line: ▶ insert (𝑎𝑎, 𝑏𝑏) into 𝑦𝑦 = 𝑚𝑚 ⋅ 𝑥𝑥 + 𝑛𝑛 , i.e. the line equation becomes 𝑏𝑏 = 𝑚𝑚 ⋅ 𝑎𝑎 + 𝑛𝑛 ▶ rearrange to −𝑛𝑛 = 𝑎𝑎 ⋅ 𝑚𝑚 − 𝑏𝑏 ▶ this line equation depends on the variable 𝑚𝑚 to compute 𝑛𝑛 Vice versa, a line with equation 𝑦𝑦 = 𝑎𝑎 ⋅ 𝑥𝑥 + 𝑏𝑏 becomes a point (𝑎𝑎, −𝑏𝑏) . The same can be done in higher dimensions. 
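To make this transform concrete, take a made-up point (2, 3), which is not one of the book's samples: inserting it into y = m ⋅ x + n gives 3 = 2 ⋅ m + n, i.e. the dual line n = -2 ⋅ m + 3. Conversely, the primal line y = 2 ⋅ x + 3 becomes the dual point (2, -3).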
Let us now call the 2D plane - where we began - the primal plane. This primal plane contains points and is transformed into the dual plane using the aforementioned concept, i.e. the points in the primal plane become lines in the dual plane and vice versa. This is visualized in the next figure. Figure 7 - Example of primal (left) and dual (right) planes. Points in the primal ( 𝑃𝑃 1 through 𝑃𝑃 6 ) become lines in the dual ( 𝑝𝑝 1 through 𝑝𝑝 6 ) whereas points in the dual ( 𝑄𝑄 1 through 𝑄𝑄 3 ) become lines in the primal ( 𝑞𝑞 1 through 𝑞𝑞 3 ). Then, for a positive sample point (𝑎𝑎, 𝑏𝑏) , we consider the half-plane below the line 𝑛𝑛 = 𝑎𝑎 ⋅ 𝑚𝑚 − 𝑏𝑏 . Likewise, for negative samples, we consider the half-plane above the corresponding dual line. The result is an intersection of half-planes. This intersection <?page no="19"?> Support Vector Machines Made Easy 19 of half planes is non-empty if and only if the original (primal) points can be separated by a line. Likewise, any point in the solution (or intersection) polyhedron in dual space gives rise to a separating line in primal space. Refer to Figure 7: the intersection polyhedron of the hyperplanes is non-empty, and 𝑄𝑄 1 through 𝑄𝑄 3 are points inside this solution space. The corresponding lines in the primal plane separate the classes + and ◻ . Obviously, if the solution polyhedron is non-empty, there is an infinite number of points in the dual, each giving rise to a separating hyperplane in the primal. The next question is: how can we find a point inside this polyhedron? 2.3 Linear Programming In the previous section it became clear that the feasibility test (F) for a set of inequalities is related to finding a separating hyperplane for the points described by this system of inequalities. This problem can be solved using a technique called linear programming. The standard form of ‘linear programming’ (LP) is the following. We have ▶ constraints and ▶ a target function. Both constraints and the target function are linear. In general, we can thus write: ▶ Constraints 𝑎𝑎 11 𝑥𝑥 1 + 𝑎𝑎 12 𝑥𝑥 2 + ⋯ + 𝑎𝑎 1𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 1 ⋮ 𝑎𝑎 𝑚𝑚1 𝑥𝑥 1 + 𝑎𝑎 𝑚𝑚2 𝑥𝑥 2 + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 𝑚𝑚 ▶ Target function Maximize 𝑐𝑐 1 𝑥𝑥 1 + ⋯ + 𝑐𝑐 𝑛𝑛 𝑥𝑥 𝑛𝑛 under the constraints, while 𝑥𝑥 𝑖𝑖 ≥ 0 for each 𝑖𝑖 . Here, 𝑚𝑚 is the number of inequality constraints, 𝑛𝑛 is the number of variables and, consequently, also the dimension of the problem space. The above standard form for (LP) can be visualized in the following way: <?page no="20"?> 20 Fundamentals of Machine Learning Figure 8 - Visualization of a linear program (LP) in standard form The target function to be maximized in the example in Figure 8 is: 1 ⋅ 𝑥𝑥 1 + 0 ⋅ 𝑥𝑥 2 In general, (𝑐𝑐 1 , … , 𝑐𝑐 𝑛𝑛 ) T is an 𝑛𝑛 -dimensional vector 𝒄𝒄 . This vector determines the direction of the maximization, as shown in Figure 9. Figure 9 - Visualization of a linear program (LP) in standard form, including the direction of maximization, feasible (light) and infeasible (dark) areas This figure also shows the so-called implicit constraints, i.e. we require 𝑥𝑥 𝑖𝑖 ≥ 0 . Thus, only the intersection of the solution polyhedron with the first quadrant is feasible (light), while the parts of the polyhedron in the other quadrants (dark regions) are infeasible. We will now compare the two problems (LP) and (F). (LP) is the classic linear programming, (F) is the linear feasibility test. 
Now, (LP) does more than (F), since feasible region maximum 𝑥𝑥 1 𝑥𝑥 2 𝒄𝒄 Maximum in 𝒄𝒄-direction 𝑥𝑥 1 𝑥𝑥 2 <?page no="21"?> Support Vector Machines Made Easy 21 (LP) not only computes a feasible point, but it also finds the one feasible point that is the furthest in 𝑐𝑐 -direction. This is usually a vertex of the feasible polyhedron. On the other hand: The feasibility test (F) computes any feasible point, whereas (LP) in standard form cannot return a point with negative coordinates, since we have the implicit constraints 𝑥𝑥 𝑖𝑖 ≥ 0 . But (LP) can be modified in such a way that it is able to return negative points as well! We replace the variables 𝑥𝑥 𝑖𝑖 in the LP-problem by the expression 𝑦𝑦 𝑖𝑖 − 𝑦𝑦 𝑖𝑖′ where 𝑦𝑦 𝑖𝑖 and 𝑦𝑦 𝑖𝑖′ are two separate new variables. Then this new LP-problem is solved just in the same way as if it was in standard form. The result is: Despite the fact that both 𝑦𝑦 𝑖𝑖 and 𝑦𝑦 𝑖𝑖′ are both positive in the end, the expression 𝑦𝑦 𝑖𝑖 − 𝑦𝑦 𝑖𝑖′ can very well be negative. This allows for negative values of 𝑥𝑥 𝑖𝑖 , since 𝑥𝑥 𝑖𝑖 = 𝑦𝑦 𝑖𝑖 − 𝑦𝑦 𝑖𝑖′ ! Thus, the Feasibility Problem (F) 𝑎𝑎 11 𝑥𝑥 1 + 𝑎𝑎 12 𝑥𝑥 2 + ⋯ + 𝑎𝑎 1𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 1 ⋮ 𝑎𝑎 𝑚𝑚1 𝑥𝑥 1 + 𝑎𝑎 𝑚𝑚2 𝑥𝑥 2 + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 𝑚𝑚 can now be written as (LP) with 2𝑛𝑛 variables 𝑎𝑎 11 (𝑦𝑦 1 − 𝑦𝑦 1′ ) + ⋯ + 𝑎𝑎 1𝑛𝑛 (𝑦𝑦 𝑛𝑛 − 𝑦𝑦 𝑛𝑛′ ) ≤ 𝑏𝑏 1 ⋮ 𝑎𝑎 𝑚𝑚1 (𝑦𝑦 1 − 𝑦𝑦 1′ ) + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 (𝑦𝑦 𝑛𝑛 − 𝑦𝑦 𝑛𝑛′ ) ≤ 𝑏𝑏 𝑚𝑚 . Maximize 𝑐𝑐 1 (𝑦𝑦 1 − 𝑦𝑦 1′ ) + ⋯ + 𝑐𝑐 𝑛𝑛 (𝑦𝑦 𝑛𝑛 − 𝑦𝑦 𝑛𝑛′ ) with the implicit inequalities 𝑦𝑦 𝑖𝑖 , 𝑦𝑦 𝑖𝑖′ ≥ 0 . Hence, we can find a separating plane for points with 𝑛𝑛 coordinates by calling an LP-solver! The above feasibility test (F), and (LP) are both closely related to solution methods for linear equation systems. A method for solving a system of 𝑛𝑛 linear equations with 𝑛𝑛 variables determines an intersection point of 𝑛𝑛 hyperplanes in 𝑛𝑛 -space. However, for (LP) there are typically more than 𝑛𝑛 inequalities. Why? The implicit inequalities must not be forgotten! For each variable 𝑥𝑥 𝑖𝑖 , there is an implicit inequality 𝑥𝑥 𝑖𝑖 ≥ 0 . Hence, typically the number of inequalities in linear programming is larger than the number of variables. This is not so for linear equation solvers. Now any 𝑛𝑛 -subset of these 𝑚𝑚 inequalities in a given LP-problem (here we can assume 𝑚𝑚 > 𝑛𝑛 ) determines an intersection point in 𝑛𝑛 -space (if in general position). But this intersection point may not be in the feasible polyhedron (see Figure 10 )! <?page no="22"?> 22 Fundamentals of Machine Learning Figure 10 - Intersection of 𝑛𝑛 hyperplanes in 𝑛𝑛 -space Thus, we cannot use linear equation solvers to solve an LP-problem. But we can construct a very simple (but naive) method for solving (F), given a linear equation solver: Take all the n-element subsets of the m inequalities, replacing ≤ or ≥ by = (i.e. just assume they are equations), and solve all the linear equation systems, until a feasible point has been found. There are � 𝑚𝑚𝑛𝑛 � such sets. Thus, this naïve method is exponential. But: if there is a feasible point, this naïve method will find it. The naïve method just sketched illustrates the fact that linear programming is generally a more complex problem than linear equation solving.  MATLAB example Find a separating plane for two sets of points. Let the ∗ points be at 𝑥𝑥 = 4 and 𝑦𝑦 = 1,2,3,4 and let the + points be at 𝑥𝑥 = 5 and 𝑦𝑦 = 1,2,3,4. Goal: find a separating line with smallest slope. Figure 11 - Two sets of points shall be separated by a line with minimal slope (left). The separating line is shown on the right. 
Feasible polyhedron Intersection point of 𝑛𝑛 planes <?page no="23"?> Support Vector Machines Made Easy 23  Code [II_linprog.m] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Finding a separating plane for two sets of points marked as % negative and positive samples %% Search method for finding the plane: linear programming %% Points have coordinates (u,v) %% Negative Samples are (uneg1,vneg1)... (uneg4,vneg4) % Positive Samples are (upos1,vpos1)... (upos4,vpos4) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Prepare data clear; upos = [5; 5; 5; 5]; vpos = [1; 2; 3; 4]; uneg = [4; 4; 4; 4]; vneg = [1; 2; 3; 4]; %% Set-up matrix A % Separating line has equation y = m*x + c % Hence variables of the LP are m and c %% Constraints have the form % m*uneg + c <= vneg % m*upos + c >= vpos [or equivalently -m*upos c <= -vpos] A = [[uneg; -upos], [ones(4,1); -ones(4,1)]]; %% Set-up vector b b = [vneg; -vpos]; %% Set-up function to be minimized f = [1; 0]; %% call linear programming module in MATLAB res = linprog(f,A,b); %% plot the separating line and the data x = 3.8: 0.1: 5.2; y = res(1)*x + res(2); plot (x,y, upos,vpos, 'b+', uneg,vneg, 'r*'); <?page no="24"?> 24 Fundamentals of Machine Learning 2.4 Conclusion A separating plane for positive and negative sample points (in 𝑛𝑛 dimensions) can be computed by linear programming. Specifically: linear programming does a feasibility test followed by a maximization. Only the feasibility test is needed here. The plane computed is a separating plane, if the original problem was feasible. It may not be a very good separating plane, due to poor margins. Notes In the appendix, a method for linear programming is described. The method is called the simplex method. For better understanding of this chapter, the understanding of the simplex method will help. Discussion Nearest Neighbor Classification is applicable even if there is no separating plane. But it requires logarithmic classification time. For the case of classification with a separating plane: a new sample can be classified in constant time. In learning, a separating line or plane can represent the information to be learned. Below we will look at more general functions (not just lines and planes). Then this aspect will be useful. Both methods require preprocessing. The choice of a method for classification requires an analysis of the time required for preprocessing, and the time required for classification. In addition, it may be relevant to know which type of misclassification is more severe (false positive or false negative). Variants Classification can also be done by finding the convex hull of positive samples. Exercise: compute a convex hull for a cloud of points in 𝑛𝑛 dimensions with an LP solver.  Exercise 1 Given the (LP) problem max 3𝑥𝑥 + 2𝑦𝑦 Subject to 2 𝑥𝑥 + 𝑦𝑦 ≤ 18 2 𝑥𝑥 + 3 𝑦𝑦 ≤ 42 3 𝑥𝑥 + 𝑦𝑦 ≤ 24 𝑥𝑥 , 𝑦𝑦 ≥ 0 Solve this 2D problem graphically, by drawing the constraint half-planes into a coordinate system and finding the maximizing vertex. 1 This exercise is taken from http: / / www.phpsimplex.com/ en/ graphical_method_example.htm <?page no="25"?> 3 Separating Margins and Quadratic Programming The separating line computed above (see Figure 11), e.g. with simplex linear programming, is not very good. When we compute a separating plane with linear programming (for example as implemented in the simplex algorithm, see appendix), the final solution will always be a vertex of the feasible polyhedron. A vertex is an extremal point (corner point) in this polyhedron. 
This extremal point arises as intersection of several hyperplanes bounding the feasible polyhedron. Thus, the separating plane computed by linear programming contains n of the sample points, positive or negative ones. This is the reason why there is typically no margin for a separating plane computed by linear programming. A better plane can be computed with quadratic programming. Figure 12 - Maximum margin classifier for the example from Figure 11 3.1 Quadratic Programming Informally, quadratic programming is the same as linear programming, the only difference being that the function to be maximized may now be quadratic. Formally, we will call the following Quadratic Programming (QP): Minimize the function 𝑓𝑓(𝒙𝒙) = 12 𝒙𝒙 T 𝐐𝐐𝒙𝒙 + 𝒄𝒄 T 𝒙𝒙 (3) under the constraints 𝐀𝐀𝒙𝒙 ≤ 𝒃𝒃, (4) <?page no="26"?> 26 Fundamentals of Machine Learning Where 𝐐𝐐 is symmetric and 𝑛𝑛 -by- 𝑛𝑛 , 𝒄𝒄, 𝒙𝒙 ∈ ℝ 𝑛𝑛 , 𝒃𝒃 ∈ ℝ 𝑚𝑚 . Notice that some references and text books include positivity constraints on the 𝒙𝒙 -variables, i.e. 𝒙𝒙 ≥ 0 . In this case, most of the time we require that 𝐐𝐐 be positive definite. A symmetric 𝑛𝑛 by- 𝑛𝑛 matrix 𝐌𝐌 ∈ ℝ 𝑛𝑛×𝑛𝑛 is called positive definite if and only if 𝒙𝒙 T 𝐌𝐌𝒙𝒙 > 0 ∀ 𝒙𝒙 ∈ ℝ 𝑛𝑛 ∖ {0} . It is called positive semidefinite if and only if 𝒙𝒙 T 𝐌𝐌𝒙𝒙 ≥ 0 ∀ 𝒙𝒙 ∈ ℝ 𝑛𝑛 . Lemma 2 If the system 𝐀𝐀𝒙𝒙 ≤ 𝒃𝒃 is feasible, 𝐐𝐐 is positive semidefinite and 𝑓𝑓 is not unbounded in the feasible region, then there is a unique vector 𝒙𝒙 0 minimizing the quadratic program given in equations (3) and (4) .  Example The identity matrix 𝐈𝐈 ∈ ℝ 𝑛𝑛×𝑛𝑛 is positive definite. To see why, consider the following: Let 𝑥𝑥 ∈ ℝ 𝑛𝑛 ∖ {0} . Then 𝑥𝑥 T 𝐈𝐈𝑥𝑥 = ∑ 𝑥𝑥 𝑖𝑖2 𝑛𝑛𝑖𝑖=1 > 0. Obviously, for 𝐐𝐐 = 0 , (QP) is just (LP). The question now is how to use this (QP) to find a separating line (or plane in 3D) with a margin? Recall that before, we used 𝑦𝑦 = 𝑚𝑚 ⋅ 𝑥𝑥 + 𝑐𝑐 to denote lines in the plane. We will now use 𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑 = 0 where 𝒘𝒘, 𝒙𝒙 ∈ ℝ 𝑛𝑛 and 𝑑𝑑 ∈ ℝ . In this case 𝒘𝒘 and 𝒙𝒙 are vectors! And 𝒘𝒘 T 𝒙𝒙 will denote the scalar product of 𝒘𝒘 and 𝒙𝒙 . Here, 𝒘𝒘 is the normal vector of the line, 𝒙𝒙 is a point on the line and 𝑑𝑑 denotes the distance from the origin, in case 𝒘𝒘 is a unit vector. The advantages of this representation are: ▶ This type of equation works in any dimension; hence one can look at lines, planes, hyperplanes in the same way. ▶ Additionally, vertical lines can also be represented. We will sometimes call the vector 𝒘𝒘 the weight vector. Of course, there is a mathematical connection between the two representations: Let 𝒑𝒑 = (𝑥𝑥, 𝑦𝑦) T a point on a line 𝑦𝑦 = 𝑚𝑚 ⋅ 𝑥𝑥 + 𝑐𝑐 . Then we can set 𝒘𝒘 = (−𝑚𝑚, 1) T and 𝑑𝑑 = −𝑐𝑐 : 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = (−𝑚𝑚, 1)(𝑥𝑥, 𝑦𝑦) T − c = −m ⋅ 𝑥𝑥 + 𝑦𝑦 − 𝑐𝑐 = −𝑚𝑚 ⋅ 𝑥𝑥 + 𝑚𝑚 ⋅ 𝑥𝑥 + 𝑐𝑐 − 𝑐𝑐 = 0 2 For a proof, see [4] <?page no="27"?> Support Vector Machines Made Easy 27 Similarly, if 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 0 , we find 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 𝑤𝑤 1 𝑥𝑥 + 𝑤𝑤 2 𝑦𝑦 + 𝑑𝑑 = 0 𝑦𝑦 = − 𝑤𝑤 1 𝑤𝑤 2 ⁄ ������� 𝑚𝑚 𝑥𝑥 − 𝑑𝑑 𝑤𝑤 2 ⁄ ����� 𝑐𝑐 With this new representation, we have the following properties: The point 𝒙𝒙 0 is ▶ below the hyperplane if 𝒘𝒘 T 𝒙𝒙 0 + 𝑑𝑑 < 0 ▶ above the hyperplane if 𝒘𝒘 T 𝒙𝒙 0 + 𝑑𝑑 > 0 Our goal is now to find a separating hyperplane with a maximum margin. Consider the line 𝐻𝐻 with the equation 𝐻𝐻: 𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑 = 0 . Then the constraint 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 ≥ 1 forces the point 𝒑𝒑 to be above 𝐻𝐻 , whereas 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 ≤ −1 forces the point 𝒒𝒒 to be below 𝐻𝐻 . Note If ‖𝒘𝒘‖ = 1 , |𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑| is the distance of 𝒑𝒑 from 𝐻𝐻 . 
Furthermore, the plane 𝜆𝜆𝐻𝐻 is the same as 𝐻𝐻 : 𝜆𝜆𝐻𝐻: (𝜆𝜆𝒘𝒘) T 𝒙𝒙 + (𝜆𝜆𝑑𝑑) = 0 for 𝜆𝜆 > 0 , i.e. there are infinitely many representations of a plane.  Example Let 𝒑𝒑 = �22� , 𝒒𝒒 = �00� and 𝐻𝐻: (0,1)𝒙𝒙 − 1 = 0 . Here, 𝒒𝒒 is below 𝐻𝐻 , 𝒑𝒑 is above, since (0,1)𝒒𝒒 − 1 = (0,1) �00� − 1 = −1 and (0,1)𝒑𝒑 − 1 = (0,1) �22� − 1 = 1. Both points have distance 1 from 𝐻𝐻 , as shown in Figure 13. <?page no="28"?> 28 Fundamentals of Machine Learning Figure 13 - Separating hyperplane for two points We now want to find the hyperplane 𝐻𝐻 ′ which maximises the distances to 𝒑𝒑 and 𝒒𝒒 . Intuitively, this means that we wish to rotate 𝐻𝐻 to a new plane 𝐻𝐻 ′ orthogonal to the line [𝒑𝒑, 𝒒𝒒] and intersecting it at its center, as shown in Figure 14. Figure 14 - Maximum margin separating hyperplane for two points Clearly, 𝐻𝐻 ′ : (1,1)𝒙𝒙 − 2 = 0. The problem now is that there are infinitely many hyperplanes separating the points and that each hyperplane has infinitely many representations. We would like to compare different planes with respect to the margins they give. To be able to compare them, we want to have a unique representation for each single plane. This unique representation we obtain by scaling. As one way of selecting a unique representation of a plane equation, we could require the normal vector 𝒘𝒘 be a unit vector. However, we do not do this. Instead we rescale the equation for 𝐻𝐻 ′ (by multiplying with the appropriate 𝜆𝜆 ) such that 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 . This is called the canonical representation of 𝐻𝐻 with respect to 𝒑𝒑 and 𝒒𝒒 . In our case, 𝐻𝐻 ′ : �1 2 � , 1 2 � �𝑥𝑥 − 1 = 0 𝑥𝑥 2 𝑥𝑥 1 𝐻𝐻 𝒒𝒒 = 𝟎𝟎𝟎𝟎 𝒑𝒑 = 𝟐𝟐𝟐𝟐 𝑥𝑥 2 𝑥𝑥 1 𝐻𝐻 ′ 𝒒𝒒 = 𝟎𝟎𝟎𝟎 𝒑𝒑 = 𝟐𝟐𝟐𝟐 <?page no="29"?> Support Vector Machines Made Easy 29 is the canonical version of the equation for the same plane 𝐻𝐻 ′ , since for this version, we indeed have 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 . Notice that ‖𝒘𝒘‖ = ��1 2 � , 1 2 � � T � = 1 √2 ≠ 1, so 𝒘𝒘 is not normalized. Now we have two planes, separating 𝒑𝒑 from 𝒒𝒒 in our example namely 𝐻𝐻 and 𝐻𝐻 ′ . Remember that 𝐻𝐻 is also in canonical representation! Since they both are in a canonical representation with respect to 𝒑𝒑 and 𝒒𝒒 , we are in a position to compare them. We do this by comparing the norms of the weight vectors 𝒘𝒘 for both planes. ▶ For 𝐻𝐻 : ‖𝒘𝒘‖ = ‖(0,1) T ‖ = 1 ▶ For 𝐻𝐻 ′ : ‖𝒘𝒘‖ = ��1 2 � , 1 2 � � T � = 1 √2 ≈ 0.707 Intuitively, keeping the constraints in place, we can obtain a plane with bigger margin by reducing the norm of the weight vector 𝒘𝒘 . We will now further illustrate this last point a little, since it is one of the two most basic observations in the entire field of support vector machines. Remember that we had a hyperplane 𝐻𝐻: 𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑 = 0 and points 𝒑𝒑, 𝒒𝒒 such that 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 . Also recall that the distance of a point 𝒙𝒙 from a hyper-plane 𝐻𝐻 is given by 1 ‖𝒘𝒘‖ ⋅ |𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑| . Our goal now is to modify 𝐻𝐻 in such a way that the distance to both 𝒑𝒑 and 𝒒𝒒 becomes maximal. But since 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 , this margin is always 1 ‖𝒘𝒘‖ ⁄ . Hence, to find a plane with maximal margin amongst all planes with equations in canonical form, we must select the one with maximum value for 1 ‖𝒘𝒘‖ ⁄ , and, of course, instead of maximizing 1 ‖𝒘𝒘‖ ⁄ , we minimize ‖𝒘𝒘‖ . One last remark is necessary: Not all planes 𝐻𝐻 admit a canonical representation with respect to points 𝒑𝒑 and 𝒒𝒒 . 
For example, if 𝐻𝐻 contains 𝒑𝒑 , we will never be able to obtain 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 by scaling the plane equation with a factor 𝜆𝜆 since 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 0 . But keep in mind, such a plane 𝐻𝐻 , containing 𝒑𝒑 will also never be a maximum margin separator plane for the given points 𝒑𝒑 and 𝒒𝒒 . Hence, we do not need to worry about such planes! In fact, most planes will not admit a canonical representation with respect to 𝒑𝒑 and 𝒒𝒒 . In order to be able to get such a representation, we would need the plane to be at equal distance from 𝒑𝒑 and 𝒒𝒒 . But notice: if a plane 𝐻𝐻 has a distinct distance from 𝒑𝒑 and 𝒒𝒒 , i.e. - without loss of generality - <?page no="30"?> 30 Fundamentals of Machine Learning 𝑑𝑑(𝒑𝒑, 𝐻𝐻) > 𝑑𝑑(𝒒𝒒, 𝐻𝐻) , then we can be sure that such a plane 𝐻𝐻 is not a maximum margin hyperplane for 𝒑𝒑 and 𝒒𝒒 anyway. Why? Well, there will be a hyperplane 𝐻𝐻 ′ slightly closer to 𝒑𝒑 such that 𝑑𝑑(𝒑𝒑, 𝐻𝐻) > 𝑑𝑑(𝒑𝒑, 𝐻𝐻 ′ ) > 𝑑𝑑(𝒒𝒒, 𝐻𝐻 ′ ) > 𝑑𝑑(𝒒𝒒, 𝐻𝐻) . This means that, in the end, we can continue this process until we find 𝐻𝐻 opt such that 𝑑𝑑(𝒑𝒑, 𝐻𝐻) > 𝑑𝑑�𝒑𝒑, 𝐻𝐻 opt � = 𝑑𝑑�𝒒𝒒, 𝐻𝐻 opt � > 𝑑𝑑(𝒒𝒒 , 𝐻𝐻) . This is illustrated in Figure 15. Figure 15 - Finding an optimal separator. Left: arbitrary hyperplane, center: first step of moving 𝐻𝐻 closer to 𝒑𝒑 . Right: optimal separating hyperplane. At first glance, it may seem strange that we select a plane with minimum ‖𝒘𝒘‖ , since the normal vector 𝒘𝒘 of a plane can always be scaled to become arbitrarily short. However, requiring the canonical scaling 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 = 1 and 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 = −1 will make the plane equations unique. Hence, we are able to talk about the plane with shortest normal vector. We summarize: To find a maximum margin separator plane, we select the plane with minimum value for ‖𝒘𝒘‖ , amongst those planes that admit a canonical scaling with respect to 𝒑𝒑 and 𝒒𝒒 . 3.2 Maximum Margin Separator Planes This important observation is now restated as follows: The maximum margin hyperplane is obtained by Minimize ‖𝒘𝒘‖ 2 Subject to 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 ≥ 1 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 ≤ −1 𝒑𝒑 𝒒𝒒 𝐻𝐻 𝑑𝑑 𝒒𝒒, 𝐻𝐻 𝑑𝑑 𝒑𝒑, 𝐻𝐻 𝒑𝒑 𝒒𝒒 𝐻𝐻 𝑑𝑑 𝒑𝒑, 𝐻𝐻 ′ 𝐻𝐻 ′ 𝑑𝑑 𝒒𝒒, 𝐻𝐻 ′ 𝒑𝒑 𝒒𝒒 𝐻𝐻 𝑑𝑑 𝒒𝒒, 𝐻𝐻 opt = 𝑑𝑑 𝒑𝒑, 𝐻𝐻 opt 𝑑𝑑 𝒑𝒑, 𝐻𝐻 opt = 𝑑𝑑 𝒒𝒒, 𝐻𝐻 opt 𝐻𝐻 ′ 𝐻𝐻 opt <?page no="31"?> Support Vector Machines Made Easy 31 for positive samples 𝒑𝒑 and negative samples 𝒒𝒒 Notice that, for simplification purposes, we minimize ‖𝒘𝒘‖ 2 instead of ‖𝒘𝒘‖ (no square root). Thus, we can also write The maximum margin hyperplane is obtained by Minimize 12 𝒘𝒘 T 𝐈𝐈𝒘𝒘 = 𝟏𝟏𝟐𝟐 𝒘𝒘 T 𝒘𝒘 Subject to 𝒘𝒘 T 𝒑𝒑 + 𝑑𝑑 ≥ 1 𝒘𝒘 T 𝒒𝒒 + 𝑑𝑑 ≤ −1 for positive samples 𝒑𝒑 and negative samples 𝒒𝒒 . (5) This is a Quadratic Program of type (QP) where the matrix 𝐐𝐐 is the 𝑛𝑛 × 𝑛𝑛 identity matrix 𝐈𝐈 ! For now, do not worry about the factor ½ in the target minimization function. Each sample gives rise to one inequality, i.e. for 𝑘𝑘 positive and 𝑙𝑙 negative samples, we get 𝒘𝒘 T 𝒑𝒑 1 + 𝑑𝑑 ≥ 1 ⋮ 𝒘𝒘 T 𝒑𝒑 𝑘𝑘 + 𝑑𝑑 ≥ 1 𝒘𝒘 T 𝒒𝒒 1 + 𝑑𝑑 ≤ −1 ⋮ 𝒘𝒘 T 𝒒𝒒 𝑙𝑙 + 𝑑𝑑 ≤ −1. To simplify the notation, we introduce 𝑦𝑦 𝑖𝑖 which is equal to +1 or −1 , depending on whether the sample with index 𝑖𝑖 is positive (i.e. 𝒑𝒑 𝑖𝑖 ) or negative (i.e. 𝒒𝒒 𝑖𝑖−𝑘𝑘 ). We will now call all training points 𝒙𝒙 𝑖𝑖 , with 𝑦𝑦 𝑖𝑖 indicating if it’s a positive or negative sample. Then the (QP) becomes Minimize 12 𝒘𝒘 T 𝒘𝒘 Subject to 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1. (6)  Programming Exercise Write a MATLAB Program that computes a maximum margin separating line for data points in two dimensions with the MATLAB routine quadprog (). 
Input: Positive sample points at (0,1) T and (1,2) T , negative sample points at (1,0) T and (2,1) T , which can be separated by a line. <?page no="32"?> 32 Fundamentals of Machine Learning Output: A separating line with maximum margin. The result should look like the plot in Figure 16. Figure 16 - Optimal Separator for the above exercise 3.3 Slack Variables The programming exercise above shows that we readily compute the maximum margin separator hyperplane for two sets of points which can be separated by a hyperplane. Now consider the following: ▶ Add a fifth point at (−1,1.5) T , labelled + , to the samples ▶ Repeat the calculation - the result is shown in Figure 17, top ▶ Change the label to ∗ ▶ Repeat the calculation - the result is shown in Figure 17, bottom <?page no="33"?> Support Vector Machines Made Easy 33 Figure 17 - Inseparable points Obviously, our (QP) cannot be solved if the points cannot be separated linearly. We want to overcome this issue, since - in our example - the fifth point can be considered an outlier. We thus have to modify our optimisation problem as stated in Eq. (6) to allow for violations of the constraints, i.e. we have to allow that points may be ▶ closer to the separating hyperplane than we want, i.e. the distance of a point 𝒙𝒙 to the plane 𝐻𝐻 is between 0 and 1 if the corresponding 𝑦𝑦 is 1 or it is between −1 and 0 if the corresponding 𝑦𝑦 is −1 ; ▶ on the wrong side of the hyperplane, i.e. its distance from the hyperplane is negative when it should be positive or vice versa. <?page no="34"?> 34 Fundamentals of Machine Learning This can be formulated by including so-called slack variables, i.e. for each sample 𝒙𝒙 𝑖𝑖 , we introduce a new variable 𝜉𝜉 𝑖𝑖 ≥ 0 that tells us how much the constraint for 𝒙𝒙 𝑖𝑖 is violated: 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) + 𝜉𝜉 𝑖𝑖 ≥ 1 𝜉𝜉 𝑖𝑖 ≥ 0 The larger 𝜉𝜉 𝑖𝑖 is, the stronger the violation of the constraint is. Of course, 𝜉𝜉 𝑖𝑖 should be as small as possible, which means that we also have to adapt our objective function! 12 𝒘𝒘 T 𝒘𝒘 + � 𝜉𝜉 𝑖𝑖 𝑛𝑛 𝑖𝑖=1 But unfortunately, this is not enough - we want to control the penalty for margin and/ or classification violation to have better control over the trade-off between the size of the margin between the classes and the amount of violations we tolerate. Thus, we introduce a new parameter 𝐶𝐶 > 0 to control this! 12 𝒘𝒘 T 𝒘𝒘 + 𝐶𝐶 ⋅ � 𝜉𝜉 𝑖𝑖 𝑛𝑛 𝑖𝑖=1 Finally, we can now state our new (QP) with slack variables: Minimize 12 𝒘𝒘 T 𝒘𝒘 + 𝐶𝐶 ⋅ � 𝜉𝜉 𝑖𝑖 𝑛𝑛 𝑖𝑖=1 Subject to 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) + 𝜉𝜉 𝑖𝑖 ≥ 1 𝜉𝜉 𝑖𝑖 ≥ 0 If 𝐶𝐶 = 0 , we don’t care about slack - violations are perfectly fine and the problem becomes undetermined. If 𝐶𝐶 = ∞ , violations are forbidden - we’re back at the original problem (only possible in theory). In fact, if 𝜉𝜉 𝑖𝑖 ∈ (0,1) , the corresponding point constitutes a margin violation, i.e. it is classified correctly but is too close to the separating hyperplane. If 𝜉𝜉 𝑖𝑖 > 1 , the point is mis-classified, i.e. it lies on the wrong side of the separating hyperplane. For 𝜉𝜉 𝑖𝑖 = 1 , it is directly on the hyperplane. Let’s get back to the example. We can see that the solution plane - shown in Figure 18 - depends very much on the value chosen for 𝐶𝐶 . <?page no="35"?> Support Vector Machines Made Easy 35  Here you can find a video: https: / / bit.ly/ 2W5NYtk [Zusatzmaterial] Figure 18 - Slack Variables. Solutions for different values of 𝐶𝐶 are shown. Falsely classified points (i.e. 𝜉𝜉 𝑖𝑖 > 1 ) are marked with squares, margin violations ( 0 < 𝜉𝜉 𝑖𝑖 < 1 ) with circles. 
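To reproduce an experiment like the one shown in Figure 18, the soft-margin (QP) above can be handed to MATLAB's quadprog directly by stacking all unknowns into one vector z = (w1, w2, d, xi_1, ..., xi_m)'. The sketch below is not from the book; it reuses the four points of the earlier programming exercise plus the outlier at (-1, 1.5)', and the value of C is chosen arbitrarily:

% Soft-margin primal QP solved with quadprog
% Unknowns: z = [w1; w2; d; xi_1; ...; xi_m]
X = [0 1; 1 2; 1 0; 2 1; -1 1.5];   % sample points (rows); last row is the outlier
y = [1; 1; -1; -1; 1];              % class labels (+1 / -1)
C = 1;                              % slack penalty, chosen arbitrarily
m = size(X,1);
H = blkdiag(eye(2), 0, zeros(m));   % quadratic term: only w contributes to 1/2*w'*w
f = [0; 0; 0; C*ones(m,1)];         % linear term: C * sum(xi)
% Constraint y_i*(w'*x_i + d) + xi_i >= 1, rewritten as -y_i*(w'*x_i + d) - xi_i <= -1
A = [-diag(y)*X, -y, -eye(m)];
b = -ones(m,1);
lb = [-inf(3,1); zeros(m,1)];       % w and d are free, xi_i >= 0
z = quadprog(H, f, A, b, [], [], lb);
w = z(1:2); d = z(3); xi = z(4:end);
fprintf('w = (%.3f, %.3f), d = %.3f\n', w(1), w(2), d)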
 Exercise Find a maximum separator plane for the two points (0,0,0) T and (1,1,1) T in 3D. <?page no="37"?> 4 Dualization and Support Vectors Let us look at linear programming again. We consider a standard (LP) Maximize 𝑧𝑧 𝑃𝑃 = 𝑐𝑐 1 𝑥𝑥 1 + ⋯ + 𝑐𝑐 𝑛𝑛 𝑥𝑥 𝑛𝑛 Subject to 𝑎𝑎 11 𝑥𝑥 1 + 𝑎𝑎 12 𝑥𝑥 2 + ⋯ + 𝑎𝑎 1𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 1 ⋮ 𝑎𝑎 𝑚𝑚1 𝑥𝑥 1 + 𝑎𝑎 𝑚𝑚2 𝑥𝑥 2 + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 𝑥𝑥 𝑛𝑛 ≤ 𝑏𝑏 𝑚𝑚 Clearly, there are 𝑛𝑛 variables and 𝑚𝑚 constraints. Now, similar to the concept of points and lines (as introduced in Chapter 2.2), we want to dualize this (LP), forming its (DLP). 4.1 Duals of Linear Programs Definition: Dual Linear Program (DLP) The dual (DLP) of the (LP) given before is: Minimize 𝑧𝑧 𝐷𝐷 = 𝑏𝑏 1 𝑦𝑦 1 + ⋯ + 𝑏𝑏 𝑚𝑚 𝑦𝑦 𝑚𝑚 Subject to 𝑎𝑎 11 𝑦𝑦 1 + 𝑎𝑎 21 𝑦𝑦 2 + ⋯ + 𝑎𝑎 𝑚𝑚1 𝑦𝑦 𝑚𝑚 ≥ 𝑐𝑐 1 ⋮ 𝑎𝑎 1𝑛𝑛 𝑦𝑦 1 + 𝑎𝑎 2𝑛𝑛 𝑦𝑦 2 + ⋯ + 𝑎𝑎 𝑚𝑚𝑛𝑛 𝑦𝑦 𝑚𝑚 ≥ 𝑐𝑐 𝑛𝑛 By comparing to the general (LP) before, we can see that: ▶ The dual LP has 𝑚𝑚 variables and 𝑛𝑛 equations - the primal problem has 𝑛𝑛 variables and 𝑚𝑚 equations ▶ The dual’s variables are 𝑦𝑦 1 , … , 𝑦𝑦 𝑚𝑚 - the primal’s variables are 𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ▶ The roles of 𝒃𝒃 and 𝒄𝒄 are exchanged ▶ The primal’s ‘ ≤ ’-inequalities become ≥ in the dual ▶ The primal maximizes 𝑧𝑧 𝑃𝑃 , the dual minimizes 𝑧𝑧 𝐷𝐷 ▶ The dual of the dual is the primal We can write a short form for both the primal problem and the dual problem: Primal: max( 𝒄𝒄 T 𝒙𝒙 subject to 𝑨𝑨𝒙𝒙 ≤ 𝒃𝒃 , 𝒙𝒙 ≥ 0 ) Dual: min( 𝒃𝒃 T 𝒚𝒚 subject to 𝑨𝑨 T 𝒚𝒚 ≥ 𝒄𝒄 , 𝒚𝒚 ≥ 0 ) There is an important connection between the solutions of the primal and the dual <?page no="38"?> 38 Fundamentals of Machine Learning problems! The connection between the primal and the dual of a linear program is expressed in the following standard theorem from linear programming theory: Theorem 3 If the primal problem has a unique maximum 𝒙𝒙 0 (i.e. is feasible, not unbounded, and the maximum is unique), then the dual also has a unique minimum 𝒚𝒚 0 and 𝒄𝒄 T 𝒙𝒙 0 = 𝒃𝒃 T 𝒚𝒚 0 . Thus, 𝑧𝑧 𝑃𝑃 = 𝑧𝑧 𝐷𝐷 . This tells us that it may be advantageous to solve the dual instead of the primal! 4.2 Duals of Quadratic Programs Similarly, we can also dualize a quadratic program (QP) given as Minimize 12 𝒘𝒘 T 𝒘𝒘 Subject to 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1 . Remember: 𝑦𝑦 𝑖𝑖 is the class value of the corresponding 𝒙𝒙 𝑖𝑖 ! The dual variables will be called 𝛼𝛼 1 , … , 𝛼𝛼 𝑚𝑚 (since 𝒚𝒚 is already taken) for the 𝑚𝑚 sample points 𝒙𝒙 1 , … , 𝒙𝒙 𝑚𝑚 . Then the Dual QP (DQP) is given as Maximize � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 𝒙𝒙 𝒊𝒊T 𝒙𝒙 𝑗𝑗 Subject to 𝜶𝜶 ≥ 0 � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 . Is this really a (QP), i.e. is it in the required form? To see why, we must set up a matrix 𝐐𝐐 as before and show that the second (i.e. quadratic) part of the objective function can be written as 𝜶𝜶 T 𝐐𝐐𝜶𝜶 for this matrix 𝐐𝐐 . To see this, we set 𝑞𝑞 𝑖𝑖𝑗𝑗 ≔ 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑐𝑐 ≔ (1, … ,1) 𝑇𝑇 ∈ ℝ 𝑚𝑚 . Then the matrix 𝐐𝐐 ≔ �𝑞𝑞 𝑖𝑖𝑗𝑗 � 𝑖𝑖,𝑗𝑗=1,…,𝑚𝑚 is symmetric since 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 = 𝑦𝑦 𝑗𝑗 𝑦𝑦 𝑖𝑖 𝒙𝒙 𝑗𝑗T 𝒙𝒙 𝑖𝑖 . Thus, the problem is posed as 3 For a proof, see [15] <?page no="39"?> Support Vector Machines Made Easy 39 Maximize − 12 𝜶𝜶 T 𝐐𝐐𝜶𝜶 + 𝜶𝜶 T 𝒄𝒄 Subject to 𝜶𝜶 ≥ 0 𝒚𝒚 T 𝜶𝜶 = 0. This is the general (QP) form, including an equality constraint. But how did we arrive at this form of the dual of the given (QP)? This is not obvious. The derivation is based on so-called Lagrange-Multipliers (see Chapter 5). However, for now, we do not need to worry about how the exact form of the above (DQP) is obtained. Rather, the following lemma states all we need here. 
In the exercises at the end of this chapter, duals of linear programs are analyzed in more detail. Lemma 4 Let 𝜶𝜶 ∗ be an optimal solution of the (DQP). Then 𝒘𝒘 ∗ = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝒙𝒙 𝑖𝑖 and 𝑑𝑑 ∗ = − 1 2 ⁄ � max 𝑦𝑦 𝑗𝑗 =−1 (𝒘𝒘 ∗ ) T 𝒙𝒙 𝑗𝑗 + min 𝑦𝑦 𝑗𝑗 =1 (𝒘𝒘 ∗ ) T 𝒙𝒙 𝑗𝑗 � define a maximal margin separating hyperplane for the samples 𝒙𝒙 𝑖𝑖 . The margin is 𝛾𝛾 = 1 ‖𝒘𝒘 ∗ ‖ ⁄ . Remarks ▶ The vector 𝜶𝜶 ∗ contains zero and non-zero elements. Exactly the non-zero elements 𝛼𝛼 𝑖𝑖 correspond to the samples 𝒙𝒙 𝑖𝑖 which support the maximum margin separating plane ▶ The value 𝑑𝑑 ∗ can also be computed using the Karush-Kuhn-Tucker conditions 3 , which state 𝛼𝛼 𝑖𝑖∗ (𝑦𝑦 𝑖𝑖 ((𝒘𝒘 ∗ ) T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑 ∗ ) − 1) = 0, i.e. the product of the 𝛼𝛼 𝑖𝑖∗ and the constraints is always zero. ▶ Hence, if 𝛼𝛼 𝑖𝑖∗ ≠ 0 , we have −(𝒘𝒘 ∗ ) T 𝒙𝒙 𝑖𝑖 + 1 𝑦𝑦 𝑖𝑖 = 𝑑𝑑 ∗ . Recall the programming exercise from Chapter 3.2, where we had positive sample points at (0,1) T and (1,2) T and negative sample points at (1,0) T and (2,1) T . 4 For a proof, see p. 96 of [4] <?page no="40"?> 40 Fundamentals of Machine Learning  Programming Exercise Write a MATLAB Program that computes a maximum margin separating line for a small number of data points in 2D with the dual quadratic programming based on the MATLAB routine quadprog() . Input: Positive and negative sample points in 2D, which can be separated by a line. Output: A separating line with maximum margin. If you use the points from above, the result will be the same as in Chapter 3.2, shown in Figure 16. 4.3 Support Vectors Why did we use the dual of the QP, if we get the same result as with (QP)? The reason is: the dual shows the so-called support vectors. They are the sample points 𝒙𝒙 𝑖𝑖 whose solution component with the same index 𝑖𝑖 (i.e. 𝜶𝜶 𝑖𝑖 ) is non-zero. They are the points which realize the minimum margin. Even more reasons will be given in chapters 5 and 6! Figure 19 - Support vectors realizing the minimum margin. Left: all points are support vectors | Right: there are only four support vectors (marked with circles) Remark The above formulation of the dual quadratic program is also called support vector expansion. A technique for deriving the dual from the primal based on Lagrange multipliers is described in the next chapter. <?page no="41"?> Support Vector Machines Made Easy 41  Programming Exercise Write an enhanced version of the above example, again based on dual quadratic programming from MATLAB. Input: Positive and negative sample points in 2D, which can be separated by a line. Output: A separating line with maximum margin, where support vectors are marked. The result should look like the plot in Figure 20. Figure 20 - Example of support vectors realizing the minimum margin  Exercise 1 Find the dual of the linear program Maximize 𝑥𝑥 2 Subject to 𝑥𝑥 1 + 𝑥𝑥 2 ≤ 2 𝑥𝑥 1 ≥ 1. Determine 𝑥𝑥 opt for this linear program, and 𝑦𝑦 opt for the dual. <?page no="42"?> 42 Fundamentals of Machine Learning  Exercise 2 a) Find the dual of the linear program Maximize 𝑥𝑥 2 Subject to 𝑥𝑥 1 + 𝑥𝑥 2 ≤ 2 𝑥𝑥 1 ≥ 1 𝑥𝑥 1 ≤ 3. b) Determine 𝑥𝑥 opt for this linear program. Call the implicit constraints 𝑥𝑥 1 ≥ 0 𝑥𝑥 2 ≥ 0. Determine the so-called active carrier constraints for 𝑥𝑥 opt , i.e. the constraints satisfied with equality at the optimal solution point 𝑥𝑥 opt . c) The equilibrium theorem 5 states that precisely those constraint lines in the primal problem, which are satisfied by 𝑥𝑥 opt with non-zero slack correspond to the 𝑦𝑦 𝑖𝑖 -variables of the dual problem with 𝑦𝑦 𝑖𝑖 = 0 . 
With other words, if the 𝑗𝑗 -th constraint of the primal is not satisfied with equality at 𝑥𝑥 opt , then 𝑦𝑦 𝑗𝑗 = 0 . The same holds vice versa for the solution of the dual. Using this fact, determine 𝑦𝑦 opt from 𝑥𝑥 opt in the above case. Notice the resemblance between the equilibrium theorem for linear programming and the Karush-Kuhn-Tucker conditions. 5 See also [22] <?page no="43"?> 5 Lagrange Multipliers and Duality Lagrange multipliers are a tool for mathematical optimization. In our case it is used to derive the dual of a quadratic program to compute a maximum margin separating plane. Let’s look at max 𝑓𝑓(𝑥𝑥, 𝑦𝑦) subject to 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 where 𝑓𝑓 and 𝑔𝑔 are functions ℝ 2 → ℝ . For example, f and g could be the functions 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑒𝑒 −𝑥𝑥 2 −𝑦𝑦 2 and 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑥𝑥 + 2𝑦𝑦. These functions are shown in Figure 21. Figure 21 - Sample function 𝑓𝑓(𝑥𝑥, 𝑦𝑦) and the maximization space given by 𝑔𝑔(𝑥𝑥, 𝑦𝑦) We can now also visualize the level contours of 𝑓𝑓 . These are curves given by the equations of the form 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 , for various constants 𝑐𝑐 , see Figure 22. <?page no="44"?> 44 Fundamentals of Machine Learning Figure 22 - Level contours of the function 𝑓𝑓(𝑥𝑥, 𝑦𝑦) Likewise, we can visualize the constraint 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 as a curve. Thus, the two curves 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 and 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 could cross each other or meet tangentially. The two cases (crossing or touching) are illustrated in the following figure. In point 𝑎𝑎 , the two curves meet as tangents, in point 𝑏𝑏 they cross. Figure 23 - Intersection of the level curves 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 1 and 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 2 with 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 The function 𝑓𝑓(𝑥𝑥, 𝑦𝑦) can only attain an extremum under the constraint 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 if they meet tangentially. Now, consider the gradient vectors 𝛻𝛻𝑓𝑓 = �𝜕𝜕𝑓𝑓 𝜕𝜕𝑥𝑥 , 𝜕𝜕𝑓𝑓 𝜕𝜕𝑦𝑦� 𝑓𝑓 𝑥𝑥, 𝑦𝑦 = 𝑐𝑐 1 𝑔𝑔 𝑥𝑥, 𝑦𝑦 = 𝑐𝑐 𝑓𝑓 𝑥𝑥, 𝑦𝑦 = 𝑐𝑐 2 𝑎𝑎 𝑏𝑏 <?page no="45"?> Support Vector Machines Made Easy 45 and 𝛻𝛻𝑔𝑔 = �𝜕𝜕𝑓𝑓 𝜕𝜕𝑥𝑥 , 𝜕𝜕𝑓𝑓 𝜕𝜕𝑦𝑦� . They are both orthogonal to their respective level curves, i.e. if 𝑓𝑓 and 𝑔𝑔 meet tangentially at 𝑎𝑎 , there is an 𝛼𝛼 ≠ 0 such that ∇ 𝑓𝑓(𝑎𝑎) = 𝛼𝛼 ⋅ ∇ 𝑔𝑔(𝑎𝑎) . This is illustrated in Figure 24. Figure 24 - Illustration of the gradients of the level curve and the minimization space The parameter 𝛼𝛼 is called the Lagrange Multiplier. Thus, there are three equations: (I) 𝜕𝜕𝑓𝑓 𝜕𝜕𝑥𝑥� 𝑎𝑎 = 𝛼𝛼 𝜕𝜕𝑔𝑔 𝜕𝜕𝑥𝑥� 𝑎𝑎 (II) 𝜕𝜕𝑓𝑓 𝜕𝜕𝑦𝑦� 𝑎𝑎 = 𝛼𝛼 𝜕𝜕𝑔𝑔 𝜕𝜕𝑦𝑦� 𝑎𝑎 (III) 𝑔𝑔(𝑎𝑎) = 𝑐𝑐 This is the so-called Lagrange system for the above optimization problem. A more compact form is obtained in the following way. Here the whole system (with now three equations) can be stated in a single equation. Set Λ(𝑥𝑥, 𝑦𝑦, 𝛼𝛼) ≔ 𝑓𝑓(𝑥𝑥, 𝑦𝑦) − 𝛼𝛼(𝑔𝑔(𝑥𝑥, 𝑦𝑦) − 𝑐𝑐). Then all three above equations are obtained by setting derivatives of Λ to 0. Specifically, the system ∇ 𝑥𝑥,𝑦𝑦,𝛼𝛼 Λ(𝑥𝑥, 𝑦𝑦, 𝑎𝑎) ≔ 0 condenses these three equations above into a single one. Note that ∇ 𝛼𝛼 Λ(𝑥𝑥, 𝑦𝑦, 𝛼𝛼) = 𝑐𝑐 − 𝑔𝑔(𝑥𝑥, 𝑦𝑦) , setting this to zero results in the original constraint 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 . Hence, ∇ 𝛼𝛼 Λ(𝑥𝑥, 𝑦𝑦, 𝑎𝑎) = 0 𝛻𝛻𝑔𝑔 𝑎𝑎 𝛻𝛻𝑓𝑓 𝑎𝑎 𝑓𝑓 𝑥𝑥, 𝑦𝑦 = 𝑐𝑐 1 𝑔𝑔 𝑥𝑥, 𝑦𝑦 = 𝑐𝑐 𝑎𝑎 <?page no="46"?> 46 Fundamentals of Machine Learning is equivalent to 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑐𝑐 . The compact version of this system, namely the function Λ , has a name. It is called the Lagrangian of the given optimization problem, or also the Lagrange function. 5.1 Multidimensional functions Above we have looked at two-dimensional functions 𝑓𝑓 and 𝑔𝑔 . For 𝑛𝑛 -dimensional functions, very similar methods apply. 
Specifically, the optimum of max 𝑓𝑓(𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) subject to 𝑔𝑔(𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) = 0 will be achieved at points where 𝑔𝑔(𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) = 0 and 𝜕𝜕𝑓𝑓 𝜕𝜕𝑥𝑥 𝑖𝑖 (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) − 𝛼𝛼 𝜕𝜕𝑔𝑔 𝜕𝜕𝑥𝑥 𝑖𝑖 (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) = 0 in the 𝑛𝑛 + 1 variables 𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 and 𝛼𝛼 . If there are multiple constraints, i.e. max 𝑓𝑓(𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) subject to 𝑔𝑔 1 (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) = 0 ⋮ 𝑔𝑔 𝑚𝑚 (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) = 0 we must use 𝑚𝑚 Lagrange multipliers 𝛼𝛼 1 , … , 𝛼𝛼 𝑚𝑚 . In this case, our intuitive tangent condition on the gradient vectors of 𝑓𝑓 and 𝑔𝑔 (as used above) is not as obvious. It turns out that we must have ∇𝑓𝑓 = � 𝛼𝛼 𝑘𝑘 ∇𝑔𝑔 𝑘𝑘 𝑚𝑚 𝑘𝑘=1 , i.e. that the gradient of 𝑓𝑓 is a linear combination of the gradients of the 𝑔𝑔 𝑘𝑘 . Specifically, in this case, the Lagrange function is Λ(𝒙𝒙, 𝜶𝜶) = 𝑓𝑓(𝒙𝒙) − � 𝛼𝛼 𝑘𝑘 𝑔𝑔 𝑘𝑘 (𝒙𝒙) 𝑚𝑚 𝑘𝑘=1 where 𝒙𝒙 = (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) T and 𝜶𝜶 = (𝛼𝛼 1 , … , 𝛼𝛼 𝑚𝑚 ) T . At the optimum, its derivatives must vanish, i.e. <?page no="47"?> Support Vector Machines Made Easy 47 𝜕𝜕Λ 𝜕𝜕𝑥𝑥 𝑖𝑖 = 0 𝑖𝑖 = 1, … , 𝑛𝑛 𝜕𝜕Λ 𝜕𝜕𝛼𝛼 𝑘𝑘 = 0 𝑘𝑘 = 1, … , 𝑚𝑚. Lagrange functions can also be used for the case of inequality constraints. However, this case is more difficult, since the optimum may occur at the boundary of a convex region specified by inequality constraints. The Lagrange function of max 𝑓𝑓(𝑥𝑥, 𝑦𝑦) subject to 𝑔𝑔(𝑥𝑥, 𝑦𝑦) ≥ 0 is very similar to the case of the equality constraint on g. In fact, the Lagrange function is Λ(𝑥𝑥, 𝑦𝑦, 𝛼𝛼) = 𝑓𝑓(𝑥𝑥, 𝑦𝑦) − 𝛼𝛼 ⋅ 𝑔𝑔(𝑥𝑥, 𝑦𝑦) . The fact that we are now dealing with an inequality constraint on 𝑔𝑔 is reflected in the restriction 𝛼𝛼 ≥ 0 . For further details, see [1]. 5.2 Support Vector Expansion We again look at the primal form of the (QP) for finding the maximum margin separating plane: minimize 12 𝒘𝒘 T 𝒘𝒘 subject to 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1 The Lagrange function of this problem is Λ(𝒘𝒘, 𝑑𝑑, 𝜶𝜶) = 12 𝒘𝒘 T 𝒘𝒘 − � 𝛼𝛼 𝑖𝑖 [𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) − 1] 𝑚𝑚 𝑖𝑖=1 . We differentiate with respect to 𝒘𝒘 and 𝒅𝒅 , and set the resulting expressions to zero and obtain: 𝜕𝜕Λ 𝜕𝜕𝒘𝒘 = 𝒘𝒘 − � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝒙𝒙 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 and 𝜕𝜕Λ 𝜕𝜕𝑑𝑑 = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 . We solve the first equation for 𝒘𝒘 and obtain 𝒘𝒘 = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 . <?page no="48"?> 48 Fundamentals of Machine Learning Resubstituting this and the second equation into Λ gives: Λ(𝒘𝒘, 𝑑𝑑, 𝜶𝜶) = 12 𝒘𝒘 T 𝒘𝒘 − � 𝛼𝛼 𝑖𝑖 [𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) − 1] 𝑚𝑚 𝑖𝑖=1 = 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 + � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 . This is the (DQP) optimization problem as given in the previous chapter! Maximize � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 under the constraints � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 and 𝛼𝛼 𝑖𝑖 ≥ 0 In summary, to find the (DQP) version of the maximum margin classifier, we have used the following strategy: We set up the Lagrange function of the primal (QP). We differentiate this function with respect to the primal variables 𝒘𝒘 and 𝑑𝑑 , and set to 0 . This gives equations in the variables 𝒘𝒘 and 𝑑𝑑 . We then re-substitute the resulting equations into the primal. The same strategy can be used in other (related) cases, as we shall see later. 5.3 Support Vector Expansion with Slack Variables Similarly, we would like to determine the (DQP) of the (QP) with slack variables. Luckily, we can use the exact same approach! 
So, recall that the (QP) was stated as minimize 12 𝒘𝒘 T 𝒘𝒘 + 𝐶𝐶 ⋅ ∑ 𝜉𝜉 𝑖𝑖 𝑚𝑚𝑖𝑖=1 subject to 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) + 𝜉𝜉 𝑖𝑖 ≥ 1 . 𝜉𝜉 𝑖𝑖 ≥ 0 First, we need to setup its Lagrangian, resulting in Λ(𝒘𝒘, 𝑑𝑑, 𝝃𝝃, 𝜶𝜶, 𝝁𝝁) = 12 𝒘𝒘 T 𝒘𝒘 − � 𝛼𝛼 𝑖𝑖 [𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) + 𝜉𝜉 𝑖𝑖 − 1] 𝑚𝑚 𝑖𝑖=1 + 𝐶𝐶 � 𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − � 𝜇𝜇 𝑖𝑖 𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 . Here, the 𝛼𝛼 𝑖𝑖 and 𝜇𝜇 𝑖𝑖 are the Lagrange multipliers. Again, positivity constraints on 𝜶𝜶 and 𝝁𝝁 apply, i.e. 𝜶𝜶 ≥ 0 and 𝝁𝝁 ≥ 0 . <?page no="49"?> Support Vector Machines Made Easy 49 Next, we have to take derivatives with respect to the primal’s parameters 𝒘𝒘 and 𝑑𝑑 ! These are then set to zero, forming 𝜕𝜕Λ 𝜕𝜕𝒘𝒘 = 𝒘𝒘 − � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = 0 𝜕𝜕Λ 𝜕𝜕𝑑𝑑 = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = 0 𝜕𝜕Λ 𝜕𝜕𝜉𝜉 𝑖𝑖 = −𝛼𝛼 𝑖𝑖 + 𝐶𝐶 − 𝜇𝜇 𝑖𝑖 = 0 . From the third equation we can deduce that 𝜇𝜇 𝑖𝑖 = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖 . This, in turn, is resubstituted into Λ , forming Λ(𝒘𝒘, 𝑑𝑑, 𝝃𝝃, 𝜶𝜶) = 12 𝒘𝒘 T 𝒘𝒘 − −� 𝛼𝛼 𝑖𝑖 [𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) + 𝜉𝜉 𝑖𝑖 − 1] 𝑚𝑚 𝑖𝑖=1 + 𝐶𝐶 � 𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − �(𝐶𝐶 − 𝛼𝛼 𝑖𝑖 )𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 . This becomes Λ(𝒘𝒘, 𝑑𝑑, 𝝃𝝃, 𝜶𝜶) = 12 𝒘𝒘 T 𝒘𝒘 − � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝑑𝑑 𝑚𝑚 𝑖𝑖=1 ������� =0 − � 𝛼𝛼 𝑖𝑖 𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 + � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 + � 𝛼𝛼 𝑖𝑖 𝜉𝜉 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = 12 𝒘𝒘 T 𝒘𝒘 − � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 + � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 . Now, as in the case without slack, we substitute 𝒘𝒘 = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 . Thus, Λ becomes Λ(𝒘𝒘, 𝑑𝑑, 𝜶𝜶) = 12 �� 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝒙𝒙 𝑖𝑖T 𝑚𝑚 𝑖𝑖=1 � �� 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑗𝑗=1 � − � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 �� 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑗𝑗T 𝑚𝑚 𝑗𝑗=1 � 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 + � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 . The result is maximize <?page no="50"?> 50 Fundamentals of Machine Learning � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 under the constraints � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 and 0 ≤ 𝛼𝛼 𝑖𝑖 ≤ 𝐶𝐶 . This is the same function as in the (DQP) without slack! Only difference: we have an upper bound on the 𝛼𝛼 𝑖𝑖 , from 𝜇𝜇 𝑖𝑖 = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖 ≥ 0 , i.e. 𝛼𝛼 𝑖𝑖 ≤ 𝐶𝐶 . Finally, we can now use the (DQP) to find a soft or hard margin separating hyperplane! One question remains, however: How do we know which 𝒙𝒙 𝑖𝑖 violate the margin or are falsely classified? Luckily, this is easy to answer: look at corresponding 𝛼𝛼 𝑖𝑖 (and 𝜉𝜉 )! ▶ 𝛼𝛼 𝑖𝑖 = 0 correctly classified, outside margin ▶ 0 < 𝛼𝛼 𝑖𝑖 < 𝐶𝐶 correctly classified, support vector ▶ 𝛼𝛼 𝑖𝑖 = 𝐶𝐶 margin violation/ classification error o 0 < 𝜉𝜉 𝑖𝑖 ≤ 1 margin violation o 𝜉𝜉 > 1 classification error Let’s go back to our example from Figure 20. It showed an example of a maximum margin classifier with three support vectors (repeated in Figure 25, top). If we now change two of the ∗ samples to + (shown in Figure 25, middle), regular classification will fail. Once we allow margin violations/ misclassifications, we will end up with a reasonable separation when using an adequate value for 𝐶𝐶 (see Figure 25, bottom). <?page no="51"?> Support Vector Machines Made Easy 51  Here you can find a video: https: / / bit.ly/ 2W5NYtk [Zusatzmaterial] Figure 25 - Dual Quadratic Programming with misclassifications and Support Vectors. Top: original (QP), support vectors marked with diamonds. Middle: added outliers, marked with squares. Bottom: (DQP) classification with slack, showing support vectors (diamonds), margin violations (circles) and misclassifications (squares). 
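The (DQP) with slack differs from the quadprog sketch of the previous section only by the upper bound C on the α_i. A minimal continuation (X and y as before; C = 10 is an assumed value), including the categorization of the samples listed above:

% Minimal sketch: soft-margin (DQP), i.e. the same QP with box constraints
% 0 <= alpha_i <= C.  X, y as in the earlier sketch; C is an assumed value.
C     = 10;
m     = size(X,1);
H     = (y*y') .* (X*X');
alpha = quadprog(H, -ones(m,1), [], [], y', 0, zeros(m,1), C*ones(m,1));
w     = X' * (alpha .* y);
onmargin = find(alpha > 1e-6 & alpha < C - 1e-6);   % support vectors on the margin
d     = mean(y(onmargin) - X(onmargin,:)*w);
xi    = max(0, 1 - y .* (X*w + d));                 % slack values
violations = find(xi > 1e-6 & xi <= 1);             % margin violations
errors     = find(xi > 1);                          % misclassified samples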
<?page no="52"?> 52 Fundamentals of Machine Learning  Exercise Consider the optimization problem min 𝑓𝑓(𝑥𝑥, 𝑦𝑦) = 𝑥𝑥 2 + 𝑦𝑦 2 subject to 𝑔𝑔(𝑥𝑥, 𝑦𝑦) = 𝑥𝑥 + 𝑦𝑦 = 2 a) Find the Lagrange function for this system b) Find the optimum with the help of the Lagrange function <?page no="53"?> 6 Kernel Functions 6.1 Feature Spaces Consider the mapping 𝐹𝐹 which maps points 𝒙𝒙 = (𝑥𝑥 1 , 𝑥𝑥 2 ) T in two dimensions to 𝐹𝐹: �𝑥𝑥 1 𝑥𝑥 2 � → � 𝑥𝑥 12 𝑥𝑥 22 √2 𝑥𝑥 1 𝑥𝑥 2 � Hence, 𝐹𝐹 maps from the input space (2D) to three-dimensional space. Here 3-space is called the feature space of the mapping. The reason why we would map samples from 2D space into 3D space is the following. In higher dimensions, it may be easier to separate the sample points by a plane. This is illustrated by the following example. Three points in 2D (two positive / one negative) can always be separated by a line. However, for four such points (two positive / two negative) this may not be the case. Try to find such a configuration! But in 3D, four points (two positive, two negative, not all on one plane) can always be separated by a plane. This generalizes to higher dimensions. In 𝐷𝐷 -space, 𝐷𝐷 + 1 points can always be separated. Thus, there is heuristic motivation to map the input sample points to a space of higher dimension, in order to simplify the separation. An example below will further illustrate this observation.  Example Take a large number of random points in [−1.5,1.5] 2 . Now, all points 𝒑𝒑 such that 𝑥𝑥 2 + 𝑦𝑦 2 < 1 are marked + , all others ∘ (see Figure 26). We can clearly see that it is not possible to linearly separate these points. Figure 26 - Points in 2D space which cannot be linearly separated <?page no="54"?> 54 Fundamentals of Machine Learning Now, we can apply the above mapping 𝐹𝐹 to all points shown in the previous figure. This mapping will transform a point from 2D to 3D space - and now the two classes of points can be separated by a plane, see Figure 27! Figure 27 - Feature-mapped points in 3D space (left), with separating hyperplane (right) This is an important finding: we can make points linearly separable by mapping into a higher dimensional feature space! Note It is important to realize the special mathematical relationship outlined below: 〈𝐹𝐹(𝑥𝑥), 𝐹𝐹(𝑦𝑦)〉 = 𝐹𝐹(𝑥𝑥) T 𝐹𝐹(𝑦𝑦) = �𝑥𝑥 12 , 𝑥𝑥 22 , √2𝑥𝑥 1 𝑥𝑥 2 � � 𝑦𝑦 12 𝑦𝑦 22 √2𝑦𝑦 1 𝑦𝑦 2 � = = 𝑥𝑥 12 𝑦𝑦 12 + 𝑥𝑥 22 𝑦𝑦 22 + 2𝑥𝑥 1 𝑦𝑦 1 𝑥𝑥 2 𝑦𝑦 2 = = (𝑥𝑥 1 𝑦𝑦 1 + 𝑥𝑥 2 𝑦𝑦 2 ) 2 = (𝑥𝑥 T 𝑦𝑦) 2 = 〈𝑥𝑥, 𝑦𝑦〉 2 This will become very important later in the chapter. 6.2 Feature Spaces and Quadratic Programming To find the separating plane for the example in the previous section, we could solve the (DQP) using 𝐹𝐹 , i.e.: Maximize with respect to 𝛼𝛼 𝑖𝑖 � 𝛼𝛼 𝑖𝑖 − 12 𝑚𝑚 𝑖𝑖=1 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) T 𝐹𝐹�𝒙𝒙 𝑗𝑗 � 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 <?page no="55"?> Support Vector Machines Made Easy 55 subject to 𝛼𝛼 𝑖𝑖 ≥ 0 � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = 0. We call this version I of our (DQP). But, using 𝐹𝐹(𝒙𝒙) T 𝐹𝐹(𝒚𝒚) = (𝒙𝒙 T 𝒚𝒚) 2 as we found before, we can write our (DQP) in form II: Maximize with respect to 𝛼𝛼 𝑖𝑖 � 𝛼𝛼 𝑖𝑖 − 12 𝑚𝑚 𝑖𝑖=1 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 � 2 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 subject to 𝛼𝛼 𝑖𝑖 ≥ 0 � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = 0. Now, why should we rather solve version II of the (DQP) instead of version I above? The reason is: We have 𝐹𝐹(𝒙𝒙) T 𝐹𝐹(𝒚𝒚) = (𝒙𝒙 T 𝒚𝒚) 2 Thus, the two programs are the same, and give the same solution. But: it can be hard to compute 𝐹𝐹(𝒙𝒙) and 𝐹𝐹(𝒚𝒚) explicitly, while it is often much easier to compute (𝒙𝒙 T 𝒚𝒚) 2 . 
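The circular example and the identity from the note above can be reproduced in a few lines of MATLAB; a minimal sketch (the random points and the radius-1 labeling follow Figure 26):

% Minimal sketch: the data of Figure 26, mapped into 3D with F (Figure 27),
% plus a numerical check of the identity F(x)'F(y) = (x'y)^2.
rng(0);                                      % fixed seed, for reproducibility
P = 3*rand(500,2) - 1.5;                     % random points in [-1.5, 1.5]^2
y = 2*double(sum(P.^2, 2) < 1) - 1;          % +1 inside the unit circle, -1 outside
Z = [P(:,1).^2, P(:,2).^2, sqrt(2)*P(:,1).*P(:,2)];   % Z(i,:) = F(p_i)
plot3(Z(y>0,1), Z(y>0,2), Z(y>0,3), '+'); hold on;
plot3(Z(y<0,1), Z(y<0,2), Z(y<0,3), 'o');    % the classes become separable by a plane
max(max(abs(Z*Z' - (P*P').^2)))              % should be (numerically) zero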
Specifically, the above function 𝐹𝐹 is only one example for such a function. For other functions, 𝐹𝐹(𝒙𝒙) is much harder to compute. In general, we can see that 𝐹𝐹(𝒙𝒙) ≔ (𝑥𝑥 1 𝑥𝑥 2 , … , 𝑥𝑥 1 𝑥𝑥 𝑛𝑛 , 𝑥𝑥 2 𝑥𝑥 1 , … , 𝑥𝑥 2 𝑥𝑥 𝑛𝑛 , … 𝑥𝑥 𝑛𝑛 𝑥𝑥 𝑛𝑛 ) T gives rise to a new function, called a kernel: 𝑘𝑘(𝒙𝒙, 𝒚𝒚) = (𝒙𝒙 T 𝒚𝒚) 2 So, why is this? Let’s look at the maths! (𝒙𝒙 T 𝒚𝒚) 2 = �� 𝑥𝑥 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝑛𝑛 𝑖𝑖 =1 � 2 = � 𝑥𝑥 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝑛𝑛 𝑖𝑖 =1 � 𝑥𝑥 𝑗𝑗 𝑦𝑦 𝑗𝑗 𝑛𝑛 𝑗𝑗 =1 = ��𝑥𝑥 𝑖𝑖 𝑥𝑥 𝑗𝑗 ��𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 � 𝑛𝑛 𝑖𝑖 , 𝑗𝑗 =1 = = 〈�𝑥𝑥 𝑖𝑖 𝑥𝑥 𝑗𝑗 � 𝑖𝑖,𝑗𝑗=1,…,𝑛𝑛 , �𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 � 𝑖𝑖,𝑗𝑗=1,…,𝑛𝑛 〉 So, the result is the same - but evaluating the kernel is easier: it only takes 𝒪𝒪(𝑛𝑛) time whereas evaluating 𝐹𝐹 takes 𝒪𝒪(𝑛𝑛 2 ) time. Now, consider the related kernel 𝑘𝑘 𝑐𝑐 (𝒙𝒙, 𝒚𝒚) ≔ (𝒙𝒙 T 𝒚𝒚 + 𝑐𝑐) 2 , 𝑐𝑐 ∈ ℝ <?page no="56"?> 56 Fundamentals of Machine Learning Here, 𝑘𝑘 𝑐𝑐 (𝒙𝒙, 𝒚𝒚) = (𝒙𝒙 T 𝒚𝒚) 2 + 2𝒙𝒙 T 𝒚𝒚𝑐𝑐 + 𝑐𝑐 2 = � �𝑥𝑥 𝑖𝑖 𝑥𝑥 𝑗𝑗 ��𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 � 𝑛𝑛 𝑖𝑖,𝑗𝑗=1 + ��√2𝑐𝑐𝑥𝑥 𝑖𝑖 ��√2𝑐𝑐𝑦𝑦 𝑖𝑖 � 𝑛𝑛 𝑖𝑖=1 + 𝑐𝑐 2 . Now, what is the corresponding function 𝐹𝐹 ? It turns out to be quite complicated! In fact, it will be 𝐹𝐹 𝑐𝑐 (𝒙𝒙) = �𝑥𝑥 1 𝑥𝑥 2 , … , 𝑥𝑥 1 𝑥𝑥 𝑛𝑛 , 𝑥𝑥 2 𝑥𝑥 1 , … , 𝑥𝑥 2 𝑥𝑥 𝑛𝑛 , √2𝑐𝑐𝑥𝑥 1 , … , √2𝑐𝑐𝑥𝑥 𝑛𝑛 , 𝑐𝑐� T i.e. all quadratic and linear terms of the 𝑥𝑥 𝑖𝑖 , weighted by 𝑐𝑐 . Generalizing, we find the kernel 𝑘𝑘 𝑐𝑐,𝑑𝑑 (𝒙𝒙, 𝒚𝒚) ≔ (𝒙𝒙 T 𝒚𝒚 + 𝑐𝑐) 𝑑𝑑 , 𝑐𝑐 ∈ ℝ, 𝑑𝑑 ∈ ℕ corresponds to a mapping 𝐹𝐹 𝑐𝑐,𝑑𝑑 to 𝑛𝑛 𝑑𝑑 -dimensional space where 𝐹𝐹 𝑐𝑐,𝑑𝑑 (𝒙𝒙) consists of all monomials of type 𝑥𝑥 𝑖𝑖 1 𝑥𝑥 𝑖𝑖 2 ⋯ 𝑥𝑥 𝑖𝑖 𝑘𝑘 of order up to 𝑑𝑑 . In consequence, the computing time for 𝑘𝑘 𝑐𝑐,𝑑𝑑 is still 𝒪𝒪(𝑛𝑛) , but for 𝐹𝐹 𝑐𝑐,𝑑𝑑 it is 𝒪𝒪(𝑛𝑛 𝑑𝑑 ) ! So, if we can find a kernel function 𝑘𝑘 for a mapping 𝐹𝐹 , it may very well be that we can save a lot of computing time! Remark Let’s go back to our (DQP) problem. We saw that if we have a solution 𝜶𝜶 ∗ , we can find 𝒘𝒘 ∗ = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 (7) and then 𝑑𝑑 = 1 𝑦𝑦 𝑗𝑗 + 𝒘𝒘 ∗T 𝒙𝒙 𝑗𝑗 for some 𝛼𝛼 𝑗𝑗∗ ≠ 0. Note that we can now insert the expression for 𝒘𝒘 ∗ into our classification formula, 𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑 , for unknown samples 𝒙𝒙 . This will result in 𝑓𝑓(𝒙𝒙) = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝒙𝒙 𝑖𝑖 𝒙𝒙 + 𝑑𝑑 and, under 𝐹𝐹 as before, we will have 𝑓𝑓(𝒙𝒙) = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) T 𝐹𝐹(𝒙𝒙) + 𝑑𝑑 = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 �𝒙𝒙 𝑖𝑖T 𝒙𝒙� 2 + 𝑑𝑑. <?page no="57"?> Support Vector Machines Made Easy 57 So, finally, we will have 𝑑𝑑 = 1 𝑦𝑦 𝑗𝑗 + � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖=1 and under 𝐹𝐹 , this becomes 𝑑𝑑 = 1 𝑦𝑦 𝑗𝑗 + � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) T 𝐹𝐹�𝒙𝒙 𝑗𝑗 � 𝑚𝑚 𝑖𝑖=1 for some 𝑗𝑗 such that 𝛼𝛼 𝑗𝑗∗ ≠ 0 . We now see that, if we find a kernel 𝑘𝑘(𝒙𝒙, 𝒚𝒚) = 𝐹𝐹(𝒙𝒙) T 𝐹𝐹(𝒚𝒚) , we can classify an unknown sample 𝒙𝒙 without ever having to compute 𝐹𝐹(𝒙𝒙, 𝒚𝒚) ! This turns out to be a significant advantage of the dual version. As we said, we do not have to evaluate F at any stage. In the following section, we will now show that we do not even have to know 𝐹𝐹 explicitly. This is rather remarkable mathematically, and it provides further advantages. Formal Definition We will now formalize the above observations. Consider the set Φ of maps 𝐹𝐹 of the form 𝐹𝐹: 𝑋𝑋 ↦ 𝑈𝑈 where 𝑋𝑋 is the input space, and 𝑈𝑈 is the feature space, and where 𝑈𝑈 is an inner product space. Then the function 𝑘𝑘: 𝑋𝑋 × 𝑋𝑋 ↦ ℝ is called a kernel if 𝑘𝑘 has a representation 𝑘𝑘(𝒙𝒙, 𝒛𝒛) = 〈𝐹𝐹(𝒙𝒙), 𝐹𝐹(𝒛𝒛)〉 (8) for some 𝐹𝐹 from Φ . Here, 〈𝐹𝐹(𝒙𝒙), 𝐹𝐹(𝒛𝒛)〉 is the inner product of 𝐹𝐹(𝐱𝐱) and 𝐹𝐹(𝐳𝐳) in 𝑈𝑈 . Hence, according to this definition, each kernel 𝑘𝑘 must stem from a function 𝐹𝐹 , which maps from the input space to feature space. 
The relation between 𝑘𝑘 and 𝐹𝐹 is then given by the expression in ( 8). In the above example, we have 𝑘𝑘(𝐱𝐱, 𝐳𝐳) = 〈𝒙𝒙, 𝒛𝒛〉 2 . And this function 𝐾𝐾 is a kernel function, since it stems from 𝐹𝐹: �𝑥𝑥 1 𝑥𝑥 2 � → � 𝑥𝑥 12 𝑥𝑥 22 √2 𝑥𝑥 1 𝑥𝑥 2 � i.e. it satisfies ( 8 ) for this specific function 𝐹𝐹 . <?page no="58"?> 58 Fundamentals of Machine Learning 6.3 Kernel Matrix and Mercer’s Theorem But how do we know that some 𝑘𝑘: 𝑋𝑋 × 𝑋𝑋 ↦ ℝ is a kernel? Do we need to find 𝐹𝐹 ? No, it is not necessary! It is interesting to note that we can work with a kernel 𝑘𝑘 without even knowing what 𝐹𝐹 looks like. We will now illustrate this fact. Indeed, we will see that it suffices to know that 𝑘𝑘 is a kernel function, without knowing its corresponding 𝐹𝐹 -function. That will simplify the computations considerably. Consider some kernel 𝑘𝑘 and its corresponding feature map 𝐹𝐹 . Now, let 𝑚𝑚 > 0 and let 𝒙𝒙 1 , … , 𝒙𝒙 𝑚𝑚 be a set of points from 𝑋𝑋 . Now, even if we don’t know 𝐹𝐹 , if we assume that 𝑈𝑈 has a finite dimension (let it be 𝑙𝑙 ), there are functions 𝑓𝑓 1 , … , 𝑓𝑓 𝑙𝑙 such that 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) = �𝑓𝑓 1 (𝒙𝒙 𝑖𝑖 ), … , 𝑓𝑓 𝑙𝑙 (𝒙𝒙 𝑖𝑖 )� T . Now, we define the kernel matrix (or Gram matrix) of 𝑘𝑘 as 𝐾𝐾 ≔ �𝐾𝐾 𝑖𝑖𝑗𝑗 � 𝑖𝑖,𝑗𝑗=1,…𝑚𝑚 where 𝐾𝐾 𝑖𝑖𝑗𝑗 = 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � on some sample set 𝒙𝒙 1 , … , 𝒙𝒙 𝑚𝑚 . Now, we will show that 𝐾𝐾 is always symmetric and positive semi-definite. For the first part (symmetry) simply note that 𝑘𝑘(𝐱𝐱, 𝐳𝐳) = 〈𝐹𝐹(𝒙𝒙), 𝐹𝐹(𝒛𝒛)〉 = 〈𝐹𝐹(𝒛𝒛), 𝐹𝐹(𝒙𝒙)〉 = 𝑘𝑘(𝒛𝒛, 𝒙𝒙) . This is enough to see that 𝐾𝐾 𝑖𝑖𝑗𝑗 = 𝐾𝐾 𝑗𝑗𝑖𝑖 , whence 𝐾𝐾 is symmetric. Now, for the second part. Let 𝒛𝒛 ∈ 𝑋𝑋 be any vector. We now have to show that 𝒛𝒛 T 𝐾𝐾𝒛𝒛 ≥ 0 . To see this, simply compute: 𝒛𝒛 T 𝐾𝐾𝒛𝒛 = �� 𝑧𝑧 𝑖𝑖 𝑘𝑘 𝑖𝑖𝑗𝑗 𝑧𝑧 𝑗𝑗 𝑗𝑗 𝒊𝒊 = �� 𝑧𝑧 𝑖𝑖 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) T 𝐹𝐹�𝒙𝒙 𝑗𝑗 �𝑧𝑧 𝑗𝑗 𝑗𝑗 𝑖𝑖 = �� 𝑧𝑧 𝑖𝑖 � 𝑓𝑓 𝑘𝑘 (𝒙𝒙 𝑖𝑖 )𝑓𝑓 𝑘𝑘 �𝒙𝒙 𝑗𝑗 �𝑧𝑧 𝑗𝑗 𝑘𝑘 𝑗𝑗 𝑖𝑖 = = �� 𝑧𝑧 𝑖𝑖 � 𝑓𝑓 𝑘𝑘 (𝒙𝒙 𝑖𝑖 )𝑓𝑓 𝑘𝑘 �𝒙𝒙 𝑗𝑗 �𝑧𝑧 𝑗𝑗 𝑘𝑘 𝑗𝑗 𝑖𝑖 = ��� 𝑧𝑧 𝑖𝑖 𝑓𝑓 𝑘𝑘 (𝒙𝒙 𝑖𝑖 )𝑓𝑓 𝑘𝑘 �𝒙𝒙 𝑗𝑗 �𝑧𝑧 𝑗𝑗 𝑗𝑗 𝑖𝑖 𝑘𝑘 = = � �� 𝑧𝑧 𝑖𝑖 𝑓𝑓 𝑘𝑘 (𝒙𝒙 𝑖𝑖 ) 𝑖𝑖 � 2 𝑘𝑘 ≥ 0 Since 𝑚𝑚 and 𝒛𝒛 were arbitrary, this shows that the kernel matrix 𝐾𝐾 is positive semidefinite for all 𝑚𝑚 . It turns out that this is not only a necessary but sufficient condition on 𝐾𝐾 ! Theorem (Mercer, simplified) Let 𝑘𝑘: ℝ 𝑛𝑛 × ℝ 𝑛𝑛 ↦ ℝ be given. Then for 𝑘𝑘 to be a valid (Mercer) kernel, it is necessary and sufficient that for any {𝒙𝒙 1 , … , 𝒙𝒙 𝑚𝑚 } , ( 𝑚𝑚 < ∞ , 𝒙𝒙 𝑖𝑖 ∈ ℝ 𝑛𝑛 ), the corresponding kernel matrix 𝐾𝐾 is symmetric positive semi-definite. <?page no="59"?> Support Vector Machines Made Easy 59 This is a typical existence proof - it does not give us 𝐹𝐹 , it merely tells us that there is one! Note, though: we don’t need to know 𝐹𝐹 , we saw that knowing 𝑘𝑘 is sufficient since we don’t need 𝐹𝐹 to find 𝒘𝒘 and/ or 𝑑𝑑 and for classifying new points 𝒛𝒛 . It turns out that we can also use this kernel trick for any machine learning algorithm that can be written in terms of inner products, i.e. we can also derive a “kernel perceptron” algorithm! It turns out that the function 𝑘𝑘(𝒙𝒙, 𝐲𝐲) ≔ exp �− ‖𝒙𝒙 − 𝒚𝒚‖ 2 2𝜎𝜎 2 � also is a valid kernel (called Gaussian Kernel) with a corresponding feature map 𝐹𝐹 that maps to a space of infinite dimension. For details, see [2]. 6.4 Proof of Mercer’s Theorem To illustrate the complexity of Mercer’s theorem, we will now briefly sketch a possible proof of the sufficiency of the condition on 𝐾𝐾 . The proof will follow [3] and [4]. 
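Before we start, the necessary direction of the theorem can at least be checked numerically. A minimal MATLAB sketch that builds the Gram matrix of the Gaussian kernel on random points and inspects its eigenvalues (σ = 1 and the sample size are assumed values; the distance computation uses implicit expansion, i.e. MATLAB R2016b or newer):

% Minimal sketch: a Gram matrix of the Gaussian kernel is symmetric
% positive semi-definite (sigma = 1 is an assumed value).
rng(1);
X  = randn(30, 2);                              % 30 random sample points
D2 = sum(X.^2,2) + sum(X.^2,2)' - 2*(X*X');     % squared pairwise distances
K  = exp(-D2 / 2);                              % Gaussian kernel matrix, sigma = 1
min(eig((K + K')/2))                            % >= 0 up to round-off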
Basically, there are three steps we will have to deal with: Steps of the proof ➊ Definitions and prerequisites ➋ Designing the right Hilbert space ➌ The Reproducing Property For step 1, we need to clearly define the mathematical concepts we’re using. Step 1 | Definitions and Prerequisites Inner Product An inner product is a function that takes two elements from a vector space and maps them to the underlying field. It could be the usual dot product, i.e. ⟨𝒖𝒖, 𝒗𝒗⟩ = � 𝑢𝑢 𝑖𝑖 𝑣𝑣 𝑖𝑖 𝑛𝑛 𝑖𝑖=1 for 𝑢𝑢, 𝑣𝑣 ∈ ℝ 𝑛𝑛 . It can also be something much fancier, but it must satisfy three conditions. First, let 𝒖𝒖, 𝒗𝒗, 𝒘𝒘 ∈ 𝑋𝑋 and 𝛼𝛼, 𝛽𝛽 ∈ ℝ . Now, we can state the three conditions: <?page no="60"?> 60 Fundamentals of Machine Learning ▶ symmetry, i.e. ⟨𝒖𝒖, 𝒗𝒗⟩ = ⟨𝒗𝒗, 𝒖𝒖 ⟩ ▶ bilinearity, i.e. ⟨𝛼𝛼𝒖𝒖 + 𝛽𝛽𝒗𝒗, 𝒘𝒘⟩ = 𝛼𝛼⟨𝒖𝒖, 𝒘𝒘 ⟩ + 𝛽𝛽⟨𝒗𝒗, 𝒘𝒘 ⟩ ▶ strict positive definiteness, i.e. ⟨𝒖𝒖, 𝒖𝒖⟩ ≥ 0 and ⟨𝒖𝒖, 𝒖𝒖⟩ = 0 ⇔ 𝒖𝒖 = 0 Based on this definition, we can now define some very special type of vector spaces: Inner Product Space (or pre-Hilbert Space) An inner product space (or pre-Hilbert Space) is nothing exceedingly fancy - it is just a vector space with an inner product (like our standard ℝ 𝑛𝑛 with the normal dot product). Hilbert Space A Hilbert Space is something more complex, it is a complete inner product space, i.e. convergent sequences in ℋ converge to an element in ℋ . Examples include ▶ ℝ 𝑛𝑛 with the regular vector dot product ▶ The space ℓ 2 of square-summable sequences over ℝ (i.e. 𝒖𝒖 ∈ ℓ 2 ⇒ ∑ 𝑢𝑢 𝑖𝑖2 ∞𝑖𝑖=1 ∈ ℝ ) with ⟨𝒖𝒖, 𝒗𝒗⟩ = ∑ 𝑢𝑢 𝑖𝑖 𝑣𝑣 𝑖𝑖 ∞𝑖𝑖=1 ▶ ℚ is not a Hilbert space! Also, the square-integrable functions on 𝑋𝑋 , i.e. 𝐿𝐿 2 (𝑋𝑋, 𝜇𝜇) for some measure 𝜇𝜇 are a Hilbert space. A square-integrable function is a function 𝑓𝑓 such that � 𝑓𝑓(𝒙𝒙) 2 𝑑𝑑𝜇𝜇(𝒙𝒙) 𝑋𝑋 < ∞ . Now, on this space 𝐿𝐿 2 (𝑋𝑋, 𝜇𝜇) the inner product is defined as ⟨𝑓𝑓, 𝑔𝑔⟩ 𝐿𝐿 2 (𝑋𝑋,𝜇𝜇) = � 𝑓𝑓(𝒙𝒙)𝑔𝑔(𝒙𝒙) 𝑋𝑋 𝑑𝑑𝜇𝜇(𝒙𝒙). Let us now consider a very simple finite input space 𝑋𝑋 = {𝒙𝒙 1 , … 𝒙𝒙 𝑚𝑚 } and let 𝑘𝑘 be a kernel function on 𝑋𝑋 × 𝑋𝑋 . We already know that 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � = 𝑘𝑘�𝒙𝒙 𝑗𝑗 , 𝒙𝒙 𝑖𝑖 � 𝑘𝑘(𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑖𝑖 ) ≥ 0 ∀ 𝒙𝒙 𝑖𝑖 Now, we can build its full Gram matrix 𝐾𝐾 using all 𝒙𝒙 𝑖𝑖 . From before, we know that 𝐾𝐾 is positive semi-definite, i.e. we can decompose 𝐾𝐾 into 𝐾𝐾 = 𝑉𝑉Λ𝑉𝑉 T where 𝑉𝑉 is an orthogonal matrix of the eigenvectors 𝒗𝒗 𝑡𝑡 of 𝐾𝐾 and Λ is a diagonal matrix with the corresponding eigenvalues 𝜆𝜆 𝑡𝑡 . Now, let’s assume that the 𝜆𝜆 𝑡𝑡 ≥ 0 for all 𝑡𝑡 . Using this, we can define a function 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) ≔ ��𝜆𝜆 1 (𝒗𝒗 1 ) 𝑖𝑖 , … , �𝜆𝜆 𝑡𝑡 (𝒗𝒗 𝑡𝑡 ) 𝑖𝑖 , … , �𝜆𝜆 𝑚𝑚 (𝒗𝒗 𝑚𝑚 ) 𝑖𝑖 � T . Then, 𝑘𝑘 is a dot product in 𝑋𝑋 : <?page no="61"?> Support Vector Machines Made Easy 61 �𝐹𝐹(𝒙𝒙 𝑖𝑖 ) , 𝐹𝐹�𝒙𝒙 𝑗𝑗 �� 𝑋𝑋 = � 𝜆𝜆 𝑡𝑡 (𝒗𝒗 𝑡𝑡 ) 𝑖𝑖 (𝒗𝒗 𝑡𝑡 ) 𝑗𝑗 𝑚𝑚 𝑡𝑡 =1 = (𝑉𝑉 Λ 𝑉𝑉 T ) 𝑖𝑖𝑗𝑗 = 𝐾𝐾 𝑖𝑖𝑗𝑗 = 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � So far, this is quite obvious. What is not clear, however, is why we required the 𝜆𝜆 𝑡𝑡 ≥ 0 ? Well, let’s assume that there is one 𝜆𝜆 𝑠𝑠 < 0 . We can thus define 𝒛𝒛 = �(𝒗𝒗 𝑠𝑠 ) 𝑖𝑖 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 . Clearly, this 𝒛𝒛 ∈ ℝ 𝑚𝑚 . So, we can compute its norm, which has to be non-negative! ‖𝒛𝒛‖ 22 = ⟨𝒛𝒛, 𝒛𝒛⟩ ℝ 𝑚𝑚 = ��(𝒗𝒗 𝑠𝑠 ) 𝑖𝑖 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) T 𝐹𝐹�𝒙𝒙 𝑗𝑗 �(𝒗𝒗 𝑠𝑠 ) 𝑗𝑗 𝑚𝑚 𝑗𝑗=1 𝑚𝑚 𝑖𝑖=1 = = ��(𝒗𝒗 𝑠𝑠 ) 𝑖𝑖 𝑀𝑀(𝐾𝐾) 𝑖𝑖𝑗𝑗 (𝒗𝒗 𝑠𝑠 ) 𝑗𝑗 𝑚𝑚 𝑗𝑗=1 𝑚𝑚 𝑖𝑖=1 = 𝒗𝒗 𝑠𝑠T 𝑀𝑀(𝐾𝐾)𝒗𝒗 𝑠𝑠 = = 𝜆𝜆 𝑠𝑠 < 0 So, this is a contradiction, showing that all the eigenvalues of 𝐾𝐾 have to be nonnegative. This again shows that - for any function trying to be a kernel - the positive semi-definiteness is necessary and that each such kernel will give rise to a dot product. 
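For a finite input space, the construction just described can be carried out directly; a minimal MATLAB sketch, reusing the Gram matrix K from the earlier check:

% Minimal sketch: build F from the eigendecomposition K = V*Lambda*V' and
% verify that F(x_i)' F(x_j) reproduces K (reuses K from the sketch above).
[V, L] = eig((K + K')/2);
L      = max(L, 0);                 % clip tiny negative round-off eigenvalues
Feat   = V * sqrt(L);               % row i of Feat is F(x_i)
max(max(abs(Feat*Feat' - K)))       % should be (numerically) zero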
Even more, we will see that our simple definition of a kernel automatically implies two things: ▶ 𝑘𝑘(𝒖𝒖, 𝒖𝒖) ≥ 0 for all 𝒖𝒖 ▶ 𝑘𝑘(𝒖𝒖, 𝒗𝒗) ≤ �𝑘𝑘(𝒖𝒖, 𝒖𝒖) ⋅ 𝑘𝑘(𝒗𝒗, 𝒗𝒗) (this is the Cauchy-Schwarz inequality) The first is easy to see, since the Gram matrix of 𝑚𝑚 = 1 would be 𝑘𝑘(𝒖𝒖, 𝒖𝒖) and must be ≥ 0 . The second is somewhat harder: Consider the 2D Gram matrix on 𝒖𝒖 and 𝒗𝒗 , i.e. 𝐾𝐾 = �𝑘𝑘(𝒖𝒖, 𝒖𝒖) 𝑘𝑘(𝒖𝒖, 𝒗𝒗) 𝑘𝑘(𝒗𝒗, 𝒖𝒖) 𝑘𝑘(𝒗𝒗, 𝒗𝒗)� Take its determinant, which has to be positive: 0 ≤ 𝑘𝑘(𝒖𝒖, 𝒖𝒖)𝑘𝑘(𝒗𝒗, 𝒗𝒗) − 𝑘𝑘(𝒗𝒗, 𝒖𝒖)𝑘𝑘(𝒖𝒖, 𝒗𝒗) = 𝑘𝑘(𝒖𝒖, 𝒖𝒖)𝑘𝑘(𝒗𝒗, 𝒗𝒗) − 𝑘𝑘(𝒖𝒖, 𝒗𝒗) 2 And this is equivalent to the Cauchy-Schwarz inequality above! Step 2 | Designing the right Hilbert Space As the second step of our proof, we have to design a special Hilbert Space - more specifically, such that the kernel function is an inner product of this space. To do so, first let <?page no="62"?> 62 Fundamentals of Machine Learning ℝ 𝑋𝑋 ≔ {𝑓𝑓: 𝑋𝑋 → ℝ} be the set of functions mapping from 𝑋𝑋 to ℝ . Next, we specify a function 𝐹𝐹 mapping from 𝑋𝑋 to this new space: 𝐹𝐹: � 𝑋𝑋 → ℝ 𝑋𝑋 𝒙𝒙 ↦ 𝑘𝑘(⋅, 𝒙𝒙) So, F(⋅) is a function that maps points from X onto functions! This concept is illustrated in Figure 28. Figure 28 - Function mapping X onto ℝ X Our goal - as stated before - now is to show that 𝑘𝑘 is an inner product in the feature space defined by 𝐹𝐹 . We thus need a) to turn the image of 𝐹𝐹 into a vector space b) define an inner product ⟨⋅,⋅⟩ 𝐻𝐻 𝑘𝑘 c) show that 𝑘𝑘(𝒙𝒙, 𝒛𝒛) = ⟨𝐹𝐹(𝒙𝒙), 𝐹𝐹(𝒛𝒛)⟩ 𝐻𝐻 𝑘𝑘 For step a), we see that any element 𝑓𝑓 in the vector space will be in the span of the 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) , i.e. it will look like 𝑓𝑓(⋅) = � 𝛼𝛼 𝑖𝑖 𝐹𝐹(𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 = � 𝛼𝛼 𝑖𝑖 𝑘𝑘(⋅, 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 for some 𝑚𝑚 , 𝛼𝛼 𝑖𝑖 and 𝒙𝒙 1 , … , 𝒙𝒙 𝑚𝑚 ∈ 𝑋𝑋 . Note that 𝑓𝑓 is an element of the vector space, i.e. it is a vector! Addition and scalar multiplication are clear, so we obviously have a vector space. It looks like this: span(𝐹𝐹(𝒙𝒙) : 𝒙𝒙 ∈ 𝑋𝑋) = �𝑓𝑓(⋅) = � 𝛼𝛼 𝑖𝑖 𝑘𝑘(⋅, 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 : 𝑚𝑚 ∈ ℕ, 𝒙𝒙 𝑖𝑖 ∈ 𝑋𝑋, 𝛼𝛼 𝑖𝑖 ∈ ℝ� For the second step, b), we need to take two elements from this vector space, let them be 𝑓𝑓(⋅) = � 𝛼𝛼 𝑖𝑖 𝑘𝑘(⋅, 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 and 𝑔𝑔(⋅) = � 𝛽𝛽 𝑗𝑗 𝑘𝑘�⋅, 𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 . Now, we define a function 𝑥𝑥 2 𝑥𝑥 1 𝐹𝐹 Φ 𝑥𝑥 1 Φ 𝑥𝑥 2 <?page no="63"?> Support Vector Machines Made Easy 63 〈𝑓𝑓, 𝑔𝑔〉 𝐻𝐻 𝑘𝑘 ≔ � 𝛽𝛽 𝑗𝑗 𝑓𝑓�𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 = �� 𝛼𝛼 𝑖𝑖 𝛽𝛽 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 𝑚𝑚 𝑖𝑖=1 mapping from the vector space to ℝ and claim that it is an inner product. We thus have to show ▶ symmetricity, i.e. ⟨𝑓𝑓, 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 = ⟨𝑔𝑔, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 ▶ bilinearity, i.e. ⟨𝑓𝑓 1 + 𝑓𝑓 2 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 = ⟨𝑓𝑓 1 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 + ⟨𝑓𝑓 2 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 ▶ positive definiteness, i.e. ⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 ≥ 0 The first step, symmetricity, is straightforward since 𝑘𝑘 is symmetric: ⟨𝑔𝑔, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 = �� 𝛽𝛽 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝑘𝑘�𝒙𝒙 𝑗𝑗′ , 𝒙𝒙 𝑖𝑖 � 𝑚𝑚 𝑖𝑖=1 𝑚𝑚 ′ 𝑗𝑗=1 = �� 𝛼𝛼 𝑖𝑖 𝛽𝛽 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 𝑚𝑚 𝑖𝑖=1 = ⟨𝑓𝑓, 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 Similarly, bilinearity is also easy to show. Since ⟨𝑓𝑓, 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 = �� 𝛼𝛼 𝑖𝑖 𝛽𝛽 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 𝑚𝑚 𝑖𝑖=1 = � 𝛽𝛽 𝑗𝑗 � 𝛼𝛼 𝑖𝑖 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 𝑖𝑖=1 𝑚𝑚 ′ 𝑗𝑗=1 = � 𝛽𝛽 𝑗𝑗 𝑓𝑓�𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 , we see that ⟨𝑓𝑓 1 + 𝑓𝑓 2 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 = � 𝛽𝛽 𝑗𝑗 �𝑓𝑓 1 �𝒙𝒙 𝑗𝑗′ � + 𝑓𝑓 2 �𝒙𝒙 𝑗𝑗′ �� 𝑚𝑚 ′ 𝑗𝑗=1 = � 𝛽𝛽 𝑗𝑗 𝑓𝑓 1 �𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 + � 𝛽𝛽 𝑗𝑗 𝑓𝑓 2 �𝒙𝒙 𝑗𝑗′ � 𝑚𝑚 ′ 𝑗𝑗=1 = ⟨𝑓𝑓 1 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 + ⟨𝑓𝑓 2 , 𝑔𝑔⟩ 𝐻𝐻 𝑘𝑘 , showing the bilinearity. Finally, we see that the function is positive definite, since ⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 = � 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 = 𝜶𝜶 T 𝐾𝐾𝜶𝜶 ≥ 0 . So, 𝑘𝑘 (almost) gives rise to an inner product! 
One thing is missing, namely that ⟨⋅,⋅⟩ 𝐻𝐻 𝑘𝑘 has to be strictly positive definite. This will be treated shortly! Step 3 | The reproducing property In the third step, we will realise that our new inner product has an intriguing property: it reproduces! Have a look at ⟨𝑘𝑘(⋅, 𝒙𝒙), 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 = � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 𝑘𝑘(𝒙𝒙, 𝒙𝒙 𝑖𝑖 ) = 𝑓𝑓(𝒙𝒙) and, as a special case, ⟨𝑘𝑘(⋅, 𝒙𝒙), 𝑘𝑘(⋅, 𝒙𝒙 ′ )⟩ 𝐻𝐻 𝑘𝑘 = 𝑘𝑘(𝒙𝒙, 𝒙𝒙 ′ ) . <?page no="64"?> 64 Fundamentals of Machine Learning This is the so-called reproducing property and 𝑘𝑘 is called a reproducing kernel, turning the Hilbert Space we’re working on into a Reproducing Kernel Hilbert Space (or RKHS). To wrap up this and the previous section, we have two more items remaining: ▶ we have to show that the inner product is strictly positive definite ▶ the space has to be complete The first item is easily shown by using the reproducing property. Consider |𝑓𝑓(𝒙𝒙)| 2 = �⟨𝑘𝑘(⋅, 𝒙𝒙), 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 � 2 ≤ ⟨𝑘𝑘(⋅, 𝒙𝒙), 𝑘𝑘(⋅, 𝒙𝒙)⟩ 𝐻𝐻 𝑘𝑘 ⋅ ⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 = 𝑘𝑘(𝒙𝒙, 𝒙𝒙)⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 So, if ⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 = 0 , 𝑓𝑓(𝒙𝒙) has to be zero for all 𝒙𝒙 . To make the space complete, we have to add all limit points of sequences that converge in the norm ‖𝑓𝑓‖ 𝐻𝐻 𝑘𝑘 ≔ �⟨𝑓𝑓, 𝑓𝑓⟩ 𝐻𝐻 𝑘𝑘 . As a result, 𝐻𝐻 𝑘𝑘 ≔ �𝑓𝑓 ∶ 𝑓𝑓 = � 𝛼𝛼 𝑖𝑖 𝑘𝑘( ⋅, 𝑥𝑥 𝑖𝑖 ) 𝑖𝑖 , 𝛼𝛼 𝑖𝑖 ∈ ℝ, 𝒙𝒙 𝑖𝑖 ∈ 𝑋𝑋� is a reproducing kernel Hilbert space (RKHS). So we constructed a space where 𝑘𝑘 can be represented as a dot product of feature maps, i.e. ⟨𝑘𝑘(⋅, 𝒙𝒙), 𝑘𝑘(⋅, 𝒙𝒙 ′ )⟩ 𝐻𝐻 𝑘𝑘 = 𝑘𝑘(𝒙𝒙, 𝒙𝒙 ′ )! This shows that, if 𝑘𝑘 has a positive semidefinite Gram matrix, there is a space - in our case, 𝐻𝐻 𝑘𝑘 - with an associated inner product - here, 〈⋅,⋅〉 𝐻𝐻 𝑘𝑘 - such that 𝑘𝑘 can be expressed as a dot product of functions in this space. That’s the thing we wanted to show, i.e. 𝑘𝑘 is a valid (Mercer) kernel!  Exercise 1 Consider (𝑥𝑥 1 , 𝑥𝑥 2 ) and (𝑧𝑧 1 , 𝑧𝑧 2 ) ∈ ℝ 2 . Now show that the function 𝑘𝑘: ℝ 2 × ℝ 2 ↦ ℝ with 𝑘𝑘�(𝑥𝑥 1 , 𝑥𝑥 2 ), (𝑧𝑧 1 , 𝑧𝑧 2 )� = 〈(𝑥𝑥 1 , 𝑥𝑥 2 ), (𝑧𝑧 1 , 𝑧𝑧 2 )〉 2 is a kernel, but the function 𝐹𝐹 associated with 𝑘𝑘 does not have to be unique. Hint: Use 𝐹𝐹: ℝ 2 ↦ ℝ 4 where (𝑥𝑥 1 , 𝑥𝑥 2 ) → (𝑥𝑥 1 𝑥𝑥 1 , 𝑥𝑥 1 𝑥𝑥 2 , 𝑥𝑥 1 𝑥𝑥 2 , 𝑥𝑥 2 𝑥𝑥 2 ) <?page no="65"?> Support Vector Machines Made Easy 65  Exercise 2 Write a MATLAB program that computes a maximum margin separating line for data points in 2D with the dual quadratic programming based on the MATLAB routine quadprog() , but now using a kernel function (you can use the kernel function used above). Input: Positive and negative sample points in 2D, which cannot (always) be separated by a line. Output: A maximum margin classifier, which allows for classifying new unknown samples. Remarks for Exercise 2 In the program, 𝑑𝑑 can be computed from the Karush-Kuhn-Tucker conditions in the following way: From the above equation (7) , we have: 𝒘𝒘 ∗ = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 This inserted into the KKT-conditions 𝛼𝛼 𝑖𝑖 �𝑦𝑦 𝑖𝑖 �𝒘𝒘 ∗T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑 ∗ � − 1� = 0 gives 𝑦𝑦 𝑗𝑗 �� 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝒙𝒙 𝑖𝑖𝑇𝑇 𝒙𝒙 𝑗𝑗 + 𝑑𝑑 ∗ 𝑖𝑖∈𝑠𝑠𝑠𝑠 � = 1, where 𝑠𝑠𝑣𝑣 is the index set of the support vectors. Since we are using the kernel 𝑘𝑘 , we must replace 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 by 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � in the above formula. But 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � = 〈𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 〉 2 ! For consistency checking, repeat the computation of 𝑑𝑑 for all 𝑗𝑗 in 𝑠𝑠𝑣𝑣 in the program. <?page no="67"?> 7 The SMO Algorithm As we saw above in the exercises of chapter 7, we can implement support vector machines on the basis of a quadratic programming (QP) method. For example, this method can be taken from a standard software package, i.e. MATLAB. 
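For example, a minimal sketch of this route uses quadprog together with the kernel k(x,z) = (x'z)² from chapter 6; the training data X (m-by-n), the labels y (entries ±1) and a constant C are assumed to be given, and d is computed over all support vectors as a consistency check, as in the remarks above:

% Minimal sketch: the kernelized (DQP) solved with a standard QP routine.
% X (m-by-n), y (m-by-1, entries +/-1) and C are assumed to be given.
m     = size(X,1);
K     = (X*X').^2;                    % kernel matrix for k(x,z) = (x'z)^2
H     = (y*y') .* K;
alpha = quadprog(H, -ones(m,1), [], [], y', 0, zeros(m,1), C*ones(m,1));
sv    = find(alpha > 1e-6 & alpha < C - 1e-6);
dvals = y(sv) - K(sv,:)*(alpha.*y);   % one value of d per support vector (consistency check)
d     = mean(dvals);
f     = @(x) sum(alpha .* y .* (X*x(:)).^2) + d;   % classify a new sample x by sign(f(x))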
However, this will not give very good results, for several reasons. Firstly, QP is too general, and can solve problems for any type of quadratic function over any type of convex region. The dual QP derived above is very specific. General QP solvers cannot always make use of such specifics. Secondly, the explicit evaluation of the objective function in the above QP can be quite slow, and, storing all scalar products of the form all 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 - or 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � if we use a kernel - can require far too much space for large data sets. 7.1 Overview and Principles The Sequential Minimal Optimization (SMO) algorithm was designed specifically for solving quadratic programs of the form arising in the context of support vector machines [5]. It has many advantages, and cases have been reported, where it provides speed-ups of several orders of magnitude. SMO stands for Sequential Minimal Optimization. Here minimal means that the number of Lagrange multipliers 𝛼𝛼 𝑖𝑖 , optimized at each step is minimal. Specifically, we could try to look at the objective function as a whole, (see chapters 5 and 6): 𝑊𝑊(𝜶𝜶) ≔ � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 (9) We could now try to optimize one multiplier 𝛼𝛼 𝑖𝑖 at a time. However, this would not work since we have the additional constraints � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 and 0 ≤ 𝛼𝛼 𝑖𝑖 ≤ 𝐶𝐶. Hence after changing a single 𝛼𝛼 𝑖𝑖 , this constraint would be violated, if the constraint held before the change. Thus, the minimum number of 𝛼𝛼 𝑖𝑖 ’s we can modify at a time is two. This is exactly the idea of the SMO method. It changes the 𝛼𝛼 𝑖𝑖 , sequentially rather than all of them at once, but changing the minimum number of 𝛼𝛼 𝑖𝑖 ’s at each step. This minimum number is two. Hence the name of the algorithm. The original paper on the SMO-algorithm [5] also gives a short listing of pseudocode for this method. Given that QP modules in standard packages are very big pieces of software, the fact that SMO is typically so much faster is impressive. <?page no="68"?> 68 Fundamentals of Machine Learning Another (simplified) version of SMO is described in [4], also giving pseudo-code. An early version is due to [6]. We follow the description in [5] in this chapter. 7.2 Optimisation Step The principle of the SMO method is to perform a gradient ascent starting with a feasible point, until a stopping criterion is satisfied. Since the objective function is convex over the feasible region, this strategy converges to the optimum. Hence, we must start with a point in the feasible region. In general, this part could be difficult. For example, in linear programming, finding a feasible point is half the rent, and in fact constitutes the first of two phases of the simplex LP algorithm. In our case it is very easy to find a feasible point. The origin is one! (Check this on the constraints) It now remains to successively select and change pairs of 𝛼𝛼 𝑖𝑖 ’s, until the optimum is reached. The beauty of SMO is that this change step can be done analytically and thus very fast. Specifically, we do not need a step length parameter. However, analytic does not mean that we need to look at each variable at most once. We may have to change each 𝛼𝛼 𝑖𝑖 several times. John Platt describes his SMO algorithm in a version where the 𝛼𝛼 𝑖𝑖 ’s are not only constrained to be greater than or equal to zero, but also below a positive constant 𝐶𝐶 . Since we can always set 𝐶𝐶 to infinity, this version of the dual QP can also be used to solve the case without a constant 𝐶𝐶 . 
Here is an overview of the SMO algorithm: SMO Algorithm Repeat ➊ Select 𝛼𝛼 𝑖𝑖 and 𝛼𝛼 𝑗𝑗 to optimize, according to a heuristic ➋ Optimize the objective function of the dual QP with respect to 𝛼𝛼 𝑖𝑖 and 𝛼𝛼 𝑗𝑗 , while holding all other 𝛼𝛼 ’s fixed. until convergence criterion met. The convergence criterion is that the Karush-Kuhn-Tucker conditions hold (product of primal constraint 𝑖𝑖 and 𝛼𝛼 𝑖𝑖 is zero), see below. The important step here is step number 2. Let us assume we have selected 𝛼𝛼 1 and 𝛼𝛼 2 . All the other 𝛼𝛼 ’s are then regarded as fixed. We thus have 𝑦𝑦 1 𝛼𝛼 1 + 𝑦𝑦 2 𝛼𝛼 2 = −� 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=3 ������� 𝜉𝜉 ′ . <?page no="69"?> Support Vector Machines Made Easy 69 Or, for some constant 𝜉𝜉 ′ , we have 𝑦𝑦 1 𝛼𝛼 1 + 𝑦𝑦 2 𝛼𝛼 2 = 𝜉𝜉 ′ . We know that 𝑦𝑦 1,2 = ±1 , thus 𝛼𝛼 1 = (𝜉𝜉 ′ − 𝑦𝑦 2 𝛼𝛼 2 )𝑦𝑦 1 . With this, the objective function 𝑊𝑊(𝜶𝜶) can be written as 𝑊𝑊(𝜶𝜶) = 𝑊𝑊�(𝜉𝜉 ′ − 𝑦𝑦 2 𝛼𝛼 2 )𝑦𝑦 1 , 𝛼𝛼 2 , … , 𝛼𝛼 𝑚𝑚 � . This is a quadratic function of 𝛼𝛼 2 alone, since 𝛼𝛼 3 , … , 𝛼𝛼 𝑚𝑚 are fixed! Thus, we can also write 𝑊𝑊(𝜶𝜶) = 𝑎𝑎𝛼𝛼 22 + 𝑏𝑏𝛼𝛼 2 + 𝑐𝑐 (10) for some 𝑎𝑎, 𝑏𝑏, 𝑐𝑐 ∈ ℝ . In the following subsection, we will calculate the values 𝑎𝑎 , 𝑏𝑏 and 𝑐𝑐 explicitly, to prepare for the implementation of SMO. Here it suffices to note that a quadratic function of a single parameter can easily be optimized (for now we do not consider the box constraints 𝛼𝛼 1 , 𝛼𝛼 2 ∈ [0, 𝐶𝐶] ) by taking the derivative and setting it to zero. Since 𝛼𝛼 1 and 𝛼𝛼 2 are related by the line equation 𝛼𝛼 1 = 𝑦𝑦 1 𝜉𝜉 ′ − 𝑦𝑦 1 𝑦𝑦 2 𝛼𝛼 2 , the constraints that 𝛼𝛼 1 and 𝛼𝛼 2 are in the box [0, 𝐶𝐶] × [0, 𝐶𝐶] can be expressed as a constraint on 𝛼𝛼 2 alone, we must compute values 𝐿𝐿 and 𝐻𝐻 in [0, 𝐶𝐶] in such a way that 𝛼𝛼 1 is in the box whenever 𝛼𝛼 2 is below 𝐻𝐻 and above 𝐿𝐿 . Then we clip 𝛼𝛼 2 to this new constraint and compute 𝛼𝛼 1 . Hence the overall strategy for finding the new 𝛼𝛼 2 in the current iteration step is to first optimize it using 𝑊𝑊(𝛼𝛼 1 , … , 𝛼𝛼 𝑚𝑚 ) = 𝑎𝑎𝛼𝛼 22 + 𝑏𝑏𝛼𝛼 2 + 𝑐𝑐 while not regarding the box constraints. This gives a value 𝛼𝛼 2new,unclipped . We then clip the new value thus obtained to ensure it remains within the lower and upper bounds 𝐿𝐿 and 𝐻𝐻 . From 𝛼𝛼 2 , we now compute 𝛼𝛼 1 through the above line equation. 7.3 Simplified SMO The Simplified SMO Algorithm highlights the basic intuition behind John Platt’s SMO algorithm, but does not handle some of the (typically rare) special cases which may arise for some data sets. However, it is considerably simpler than the original SMO and can be stated in ready-to-compile code in less than a page. The strategy to derive the Simplified SMO is the following: We will derive the constants a, b, c from equation (10 ) explicitly. Then we will take the derivative with respect to 𝛼𝛼 2 of this quadratic expression, set it to zero, and if the second derivative is negative, we have a maximum. This implements step two of the above algorithm. We do this explicitly. Our objective function - as above - is <?page no="70"?> 70 Fundamentals of Machine Learning 𝑊𝑊(𝜶𝜶) = � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 or, when using a kernel 𝑘𝑘 , 𝑊𝑊(𝜶𝜶) = � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 . 
Using 𝐾𝐾 𝑖𝑖𝑗𝑗 = 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � , we rewrite as 𝑊𝑊(𝜶𝜶) = 𝛼𝛼 1 + 𝛼𝛼 2 − 12 𝛼𝛼 12 𝐾𝐾 11 − 12 𝛼𝛼 22 𝐾𝐾 22 − 𝑦𝑦 1 𝑦𝑦 2 𝐾𝐾 12 𝛼𝛼 1 𝛼𝛼 2 − 𝑦𝑦 1 𝛼𝛼 1 � 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝐾𝐾 1𝑗𝑗 𝑚𝑚 𝑗𝑗=3 ������� 𝑠𝑠 1 − 𝑦𝑦 2 𝛼𝛼 2 � 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝐾𝐾 2𝑗𝑗 𝑚𝑚 𝑗𝑗=3 ������� 𝑠𝑠 2 + � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=3 − 12 � 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑖𝑖 𝛼𝛼 𝑗𝑗 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � 𝑚𝑚 𝑖𝑖,𝑗𝑗=3 ��������������������� 𝑊𝑊 const . (11) Here we have used 𝐾𝐾 𝑖𝑖𝑗𝑗 = 𝐾𝐾 𝑗𝑗𝑖𝑖 in several places. Set 𝑣𝑣 1 = � 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝐾𝐾 1𝑗𝑗 𝑚𝑚 𝑗𝑗=3 and 𝑣𝑣 2 = � 𝑦𝑦 𝑗𝑗 𝛼𝛼 𝑗𝑗 𝐾𝐾 2𝑗𝑗 𝑚𝑚 𝑗𝑗=3 . Now, from equation (11) we have 𝑊𝑊(𝜶𝜶) = 𝛼𝛼 1 + 𝛼𝛼 2 − 12 𝛼𝛼 12 𝐾𝐾 11 − 12 𝛼𝛼 22 𝐾𝐾 22 − 𝑦𝑦 1 𝑦𝑦 2 𝐾𝐾 12 𝛼𝛼 1 𝛼𝛼 2 − 𝑦𝑦 1 𝛼𝛼 1 𝑣𝑣 1 − 𝑦𝑦 2 𝛼𝛼 2 𝑣𝑣 2 + 𝑊𝑊 const . (12) Also, for the old values, we get from the constraints that 𝑦𝑦 1 𝛼𝛼 1old + 𝑦𝑦 2 𝛼𝛼 2old is fixed and 𝛼𝛼 1old = 𝜉𝜉 − 𝑠𝑠𝛼𝛼 2old 𝑎𝑎 1 = 𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 for 𝜉𝜉 = 𝑦𝑦 1 𝜉𝜉 ′ and 𝑠𝑠 = 𝑦𝑦 1 𝑦𝑦 2 . Thus, we replace 𝛼𝛼 1 in equation (12) to have an expression in variable 𝛼𝛼 2 alone: <?page no="71"?> Support Vector Machines Made Easy 71 𝑊𝑊(𝜶𝜶) = 𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 + 𝛼𝛼 2 − 12 (𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 ) 2 𝐾𝐾 11 − 12 𝛼𝛼 22 𝐾𝐾 22 − 𝑦𝑦 1 𝑦𝑦 2 𝐾𝐾 12 (𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 )𝛼𝛼 2 − 𝑦𝑦 1 (𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 )𝑣𝑣 1 − 𝑦𝑦 2 𝛼𝛼 2 𝑣𝑣 2 + 𝑊𝑊 const Taking the derivative with respect to 𝛼𝛼 2 and setting to zero we obtain 𝑠𝑠𝐾𝐾 11 (𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 ) − 𝐾𝐾 22 𝛼𝛼 2 + 𝐾𝐾 12 𝛼𝛼 2 − 𝑠𝑠𝐾𝐾 12 (𝜉𝜉 − 𝑠𝑠𝛼𝛼 2 ) + 𝑦𝑦 2 𝑣𝑣 1 − 𝑠𝑠 − 𝑦𝑦 2 𝑣𝑣 2 + 1 = 0. (13) Its second derivative is (note that 𝑠𝑠 2 = 1 ) 𝜂𝜂 ≔ − 𝐾𝐾 11 − 𝐾𝐾 22 + 2 𝐾𝐾 12 . (14) The maximum for 𝛼𝛼 2 is reached for a value 𝛼𝛼 2new , if the second derivative is negative, and we rewrite equation (13) at 𝛼𝛼 2new to 𝛼𝛼 2new (𝐾𝐾 11 + 𝐾𝐾 22 − 2𝐾𝐾 12 ) = 𝑠𝑠(𝐾𝐾 11 − 𝐾𝐾 12 )𝜉𝜉 + 𝑦𝑦 2 (𝑣𝑣 1 − 𝑣𝑣 2 ) + 1 − 𝑠𝑠 . (15) Remember from equation (7) that, at optimum, we have 𝒘𝒘 ∗ = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝒙𝒙 𝑖𝑖 . Hence, to classify an unknown sample 𝒙𝒙 after we have trained the SVM by computing the 𝛼𝛼 𝑖𝑖∗ and 𝑑𝑑 , we evaluate the function 𝑓𝑓(𝒙𝒙) = (𝒘𝒘 ∗ ) T 𝒙𝒙 + 𝑑𝑑 (16) to see whether 𝒙𝒙 is positive or negative. This is the same as computing 𝑓𝑓(𝒙𝒙) = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝒙𝒙 𝑖𝑖T 𝒙𝒙 + 𝑑𝑑 (17) after the training phase, i.e. once 𝛼𝛼 and 𝑑𝑑 are known. If 𝑓𝑓(𝒙𝒙) is positive, 𝒙𝒙 is a positive sample, and else negative sample. If we use a kernel 𝑘𝑘 , then we can classify according to 𝑓𝑓(𝒙𝒙) = � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖∗ 𝑚𝑚 𝑖𝑖=1 𝑘𝑘(𝒙𝒙 𝑖𝑖 , 𝒙𝒙) + 𝑑𝑑. (18) <?page no="72"?> 72 Fundamentals of Machine Learning With the above notation we have 𝑣𝑣 1 = 𝑓𝑓(𝒙𝒙 1 ) old − 𝑑𝑑 old − 𝑦𝑦 1 𝛼𝛼 1old 𝐾𝐾 11 − 𝑦𝑦 2 𝛼𝛼 2old 𝐾𝐾 21 and 𝑣𝑣 2 = 𝑓𝑓(𝒙𝒙 2 ) old − 𝑑𝑑 old − 𝑦𝑦 1 𝛼𝛼 1old 𝐾𝐾 12 − 𝑦𝑦 2 𝛼𝛼 2old 𝐾𝐾 22 . Note that the original paper by John Platt uses the convention 𝑓𝑓(𝒙𝒙) = (𝒘𝒘 ∗ ) T 𝒙𝒙 − 𝑑𝑑 instead of our convention in equation (16) . We now put this into the right side of equation (15) , using 𝛼𝛼 1old = 𝜉𝜉 − 𝑠𝑠𝛼𝛼 2old (note that 𝛼𝛼 2old occurs in the definition of 𝑣𝑣 1 and 𝑣𝑣 2 as well as 𝜉𝜉 , but 𝛼𝛼 2old is a known constant). Now, we get 𝛼𝛼 2new (𝐾𝐾 11 + 𝐾𝐾 22 − 2𝐾𝐾 12 ) = 𝛼𝛼 2old (𝐾𝐾 11 + 𝐾𝐾 22 − 2𝐾𝐾 12 ) + 𝑦𝑦 2 �𝑓𝑓(𝒙𝒙 1 ) old − 𝑓𝑓(𝒙𝒙 2 ) old + 𝑦𝑦 2 − 𝑦𝑦 1 �. In the next step, we set 𝐸𝐸 1 ≔ 𝑓𝑓(𝒙𝒙 1 ) old − 𝑦𝑦 1 𝐸𝐸 2 ≔ 𝑓𝑓(𝒙𝒙 2 ) old − 𝑦𝑦 2 to finally obtain 𝑎𝑎 2new = 𝑎𝑎 2old − 𝑦𝑦 2 (𝐸𝐸 1 − 𝐸𝐸 2 ) 2𝐾𝐾 12 − 𝐾𝐾 11 − 𝐾𝐾 22 (19) as the update for 𝛼𝛼 2 . Now, 𝛼𝛼 2new may be outside of [0, 𝐶𝐶] and needs to be clipped such that both 𝛼𝛼 1new , 𝛼𝛼 2new ∈ [0, 𝐶𝐶] . The line 𝑦𝑦 1 𝛼𝛼 1 + 𝑦𝑦 2 𝛼𝛼 2 = 𝜉𝜉 ′ is a diagonal (because 𝑦𝑦 1,2 = ±1 ) crossing the box [0, 𝐶𝐶] × [0, 𝐶𝐶] . This is shown in Figure 29. 
Figure 29 - Update step for 𝛼𝛼 1 and 𝛼𝛼 2 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝑦𝑦 1 𝑦𝑦 2 = −1 𝑦𝑦 1 𝑦𝑦 2 = 1 <?page no="73"?> Support Vector Machines Made Easy 73 In both cases, 𝜉𝜉 ′ is the offset of the line (distance to the origin). Now, 𝛼𝛼 2new can be admissible as it is, or it may have to be clipped, as shown in Figure 30. Figure 30 - Clipping of α 2new . Left: admissible, center: clipping at 𝐻𝐻 | Right: clipping at L If 𝛼𝛼 2new is in the position as shown in the left of Figure 30, then 𝛼𝛼 2new is admissible as it is. Otherwise, we must correct the value for 𝛼𝛼 2new by a process called clipping. This is shown in the center of Figure 30. In this case, we set 𝛼𝛼 2new to the value 𝐻𝐻 , as shown in the figure. Call the new value 𝛼𝛼 2new,clipped . We can hence calculate a maximum value 𝐻𝐻 , such that 𝛼𝛼 2 must remain below 𝐻𝐻 , to stay in this box. This is illustrated in Figure 31. Figure 31 - Clipping of α 2new , upper bound 𝐻𝐻 . Left: 𝜉𝜉 < 𝐶𝐶 | Right: 𝜉𝜉 > 𝐶𝐶 Let’s assume that 𝑦𝑦 1 = 𝑦𝑦 2 = 1 . Then 𝜉𝜉 = 𝜉𝜉 ′ and 𝛼𝛼 1old + 𝛼𝛼 2old = 𝜉𝜉 . In the first case, if 𝜉𝜉 < 𝐶𝐶 , we have 𝐻𝐻 = 𝜉𝜉 = 𝛼𝛼 1old + 𝛼𝛼 2old (left side of Figure 31). In the second case, if 𝜉𝜉 ≥ 𝐶𝐶 , we have 𝐻𝐻 = 𝐶𝐶 (right side of Figure 31). To sum up, we can set 𝐻𝐻 = min�𝐶𝐶, 𝛼𝛼 1old + 𝛼𝛼 2old �. Likewise, we can find a value 𝐿𝐿 which represents the maximum value that 𝛼𝛼 2 can take, in order to clip 𝛼𝛼 2 to be in the box [0, 𝐶𝐶] × [0, 𝐶𝐶] . We set 𝐿𝐿 = max(0, 𝛼𝛼 1 + 𝛼𝛼 2 − 𝐶𝐶) . 𝐶𝐶 𝐶𝐶 𝛼𝛼 2 𝛼𝛼 2new 𝛼𝛼 1new 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝛼𝛼 2new 𝐿𝐿 𝐶𝐶 𝐶𝐶 𝛼𝛼 2 𝛼𝛼 2new 𝐻𝐻 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝐻𝐻 𝜉𝜉 𝜉𝜉 𝛼𝛼 2 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝐻𝐻 𝜉𝜉 𝜉𝜉 𝛼𝛼 1 + 𝛼𝛼 2 = 𝜉𝜉 <?page no="74"?> 74 Fundamentals of Machine Learning The derivation of the bound 𝛼𝛼 1 + 𝛼𝛼 2 − 𝐶𝐶 for 𝐿𝐿 is illustrated in Figure 32. Figure 32 - Clipping of 𝛼𝛼 2new , lower bound 𝐿𝐿 . Left: 𝜉𝜉 < 𝐶𝐶 | Right: 𝜉𝜉 > 𝐶𝐶 Remember we had assumed 𝑦𝑦 1 = 𝑦𝑦 2 = 1 . Treating the case 𝑦𝑦 1 = 𝑦𝑦 2 = −1 in much the same way, we set 𝐿𝐿 = max�0, 𝛼𝛼 1old + 𝛼𝛼 2old − 𝐶𝐶� 𝐻𝐻 = min�𝐶𝐶, 𝛼𝛼 1old + 𝛼𝛼 2old � . (20) Similarly, if 𝑦𝑦 1 ≠ 𝑦𝑦 2 , we set 𝐿𝐿 = max�0, 𝛼𝛼 2old − 𝛼𝛼 1old �, 𝐻𝐻 = min�𝐶𝐶, 𝐶𝐶 + 𝛼𝛼 2old − 𝛼𝛼 1old �. (21) Now we can clip 𝛼𝛼 2new by 𝛼𝛼 2new,clipped = � 𝐻𝐻 𝛼𝛼 2new ≥ 𝐻𝐻 𝛼𝛼 2new 𝐿𝐿 < 𝛼𝛼 2new < 𝐻𝐻 𝐿𝐿 𝛼𝛼 2new ≤ 𝐿𝐿 (22) and, having done this, we update 𝛼𝛼 1 . To this end, simply observe that the original constraint � 𝑦𝑦 𝑖𝑖 𝛼𝛼 𝑖𝑖 = 0 𝑚𝑚 𝑖𝑖=1 must hold after the update of both values 𝛼𝛼 1 and 𝛼𝛼 2 . Hence, we can directly compute 𝛼𝛼 1new from this formula, since only 𝛼𝛼 2 has been changed. Platt in his original paper [5] uses the update rule 𝛼𝛼 1new = 𝛼𝛼 1old + 𝑠𝑠�𝛼𝛼 2old − 𝛼𝛼 2new,clipped � (23) 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝐿𝐿 𝜉𝜉 𝜉𝜉 𝛼𝛼 1 + 𝛼𝛼 2 = 𝜉𝜉 𝐶𝐶 𝐶𝐶 𝛼𝛼 1 𝛼𝛼 2 𝜉𝜉 𝜉𝜉 𝐿𝐿 𝜉𝜉 − 𝐶𝐶 <?page no="75"?> Support Vector Machines Made Easy 75 which is equivalent but optimized and obviously faster for large 𝑚𝑚 . To update 𝑑𝑑 , we recall the Karush-Kuhn-Tucker (KKT) conditions. Three cases can arise for 𝛼𝛼 𝑖𝑖 after the clipping: 𝛼𝛼 𝑖𝑖 = 0 , 0 < 𝛼𝛼 𝑖𝑖 < 𝐶𝐶 and 𝛼𝛼 𝑖𝑖 = 𝐶𝐶 . The KKT conditions state that 𝛼𝛼 𝑖𝑖 = 0 ⇒ 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1 𝛼𝛼 𝑖𝑖 = 𝐶𝐶 ⇒ 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 1 0 < 𝛼𝛼 𝑖𝑖 < 𝐶𝐶 ⇒ 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) = 1 . Equivalently, we have 𝛼𝛼 𝑖𝑖 = 0 ⇒ 𝑦𝑦 𝑖𝑖 𝑓𝑓(𝒙𝒙 𝑖𝑖 ) ≥ 1 𝛼𝛼 𝑖𝑖 = 𝐶𝐶 ⇒ 𝑦𝑦 𝑖𝑖 𝑓𝑓(𝒙𝒙 𝑖𝑖 ) ≤ 1 0 < 𝛼𝛼 𝑖𝑖 < 𝐶𝐶 ⇒ 𝑦𝑦 𝑖𝑖 𝑓𝑓(𝒙𝒙 𝑖𝑖 ) = 1. Thus, we have to change 𝑑𝑑 such that these conditions are satisfied. Note that the KKT conditions closely resemble the result stated in the Equilibrium Theorem (see exercises in chapter 4). 
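Before returning to this connection and to the computation of d, the update of the pair (α1, α2) derived in equations (14) and (19) to (23) can be collected into a few lines of MATLAB. This is only a minimal sketch of one such step; the kernel values K11, K22, K12, the errors E1 and E2, the current alpha1, alpha2, the labels y1, y2 and the constant C are assumed to be available:

% Minimal sketch of one SMO pair update, following eqs. (14) and (19)-(23).
% K11, K22, K12 are the kernel values k(x1,x1), k(x2,x2), k(x1,x2).
eta = 2*K12 - K11 - K22;                      % second derivative, eq. (14)
if eta < 0                                    % only then is the stationary point a maximum
    a2 = alpha2 - y2*(E1 - E2)/eta;           % unconstrained optimum, eq. (19)
    if y1 == y2                               % box bounds, eqs. (20) and (21)
        L = max(0, alpha1 + alpha2 - C);  H = min(C, alpha1 + alpha2);
    else
        L = max(0, alpha2 - alpha1);      H = min(C, C + alpha2 - alpha1);
    end
    a2 = min(max(a2, L), H);                  % clipping, eq. (22)
    a1 = alpha1 + y1*y2*(alpha2 - a2);        % eq. (23), with s = y1*y2
end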
The Equilibrium Theorem for linear programming states that the dual variable must be zero, whenever the corresponding primal constraint is not satisfied with equality, and vice versa. Likewise, the KKT conditions state that whenever the dual variable 𝛼𝛼 𝑖𝑖 is not at bounds ( 0 or 𝐶𝐶 ), then the corresponding primal constraint 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1 is satisfied with equality. Notice also that both the equilibrium theorem and the KKT conditions are theoretical results from optimization theory, and we did not prove either of them. In SMO the KKT conditions are used in several places. First, they will be used to compute 𝑑𝑑 . After each step, we re-compute 𝑑𝑑 in such a way that the KKT conditions are satisfied for the dual variables 𝛼𝛼 1 and 𝛼𝛼 2 . Thus, if the new 𝛼𝛼 2 is not at bounds ( 0 or 𝐶𝐶 ), we enforce 𝑑𝑑 to be changed such that 𝑦𝑦 2 𝑓𝑓(𝒙𝒙 2 ) = 1 . (24) Notice that 𝑑𝑑 is the only variable in this equation, now that we have computed 𝛼𝛼 1 and 𝛼𝛼 2 and all the other 𝛼𝛼 -values have remained fixed. So, we can easily compute 𝑑𝑑 from equation (24) . The original paper on SMO proposes a slightly different and more efficient way to compute 𝑑𝑑 . It proposes to put the values 𝐸𝐸 𝑖𝑖 into cache memory. Then 𝑑𝑑 is computed according to the following rules: If 𝛼𝛼 1new is not at bounds ( 0 < 𝛼𝛼 1new < 𝐶𝐶 ): 𝑑𝑑 1 = 𝑑𝑑 old − 𝐸𝐸 1 − 𝑦𝑦 1 �𝛼𝛼 1new − 𝛼𝛼 1old �𝐾𝐾 11 + 𝑦𝑦 2 �𝛼𝛼 2new,clipped − 𝛼𝛼 2old �𝐾𝐾 12 (25) If 𝛼𝛼 2new,clipped is not at bounds ( 0 < 𝛼𝛼 2new,clipped < 𝐶𝐶 ): <?page no="76"?> 76 Fundamentals of Machine Learning 𝑑𝑑 2 = 𝑑𝑑 old − 𝐸𝐸 2 − 𝑦𝑦 1 �𝛼𝛼 1new − 𝛼𝛼 1old �𝐾𝐾 12 + 𝑦𝑦 2 �𝛼𝛼 2new,clipped − 𝛼𝛼 2old �𝐾𝐾 22 . (26) If both 𝛼𝛼 1 and 𝛼𝛼 2 are not at bounds, then choose either one of 𝑑𝑑 1 and 𝑑𝑑 2 . If both 𝛼𝛼 1 and 𝛼𝛼 2 are at bounds, then choose the midpoint between 𝑑𝑑 1 and 𝑑𝑑 2 . Finally, the KKT conditions are used to check for termination of the algorithm. This will be discussed below, after we look at the pseudo code for SMO. Algorithm Simplified SMO (Training) Input: 𝐶𝐶 = 1000 𝜀𝜀 = 0.0001 𝑚𝑚𝑎𝑎𝑥𝑥 𝑖𝑖𝑡𝑡 = 1000 (𝒙𝒙 1 , 𝑦𝑦 1 ), … , (𝒙𝒙 𝑚𝑚 , 𝑦𝑦 𝑚𝑚 ) : training data Output: 𝜶𝜶 : dual variables (Lagrange multipliers) 𝑑𝑑: threshold Init = 0, 𝑑𝑑 = 0, 𝑛𝑛𝑢𝑢𝑚𝑚 𝑖𝑖𝑡𝑡 = 0; While 𝑛𝑛𝑢𝑢𝑚𝑚 𝑖𝑖𝑡𝑡 < 𝑚𝑚𝑎𝑎𝑥𝑥 𝑖𝑖𝑡𝑡 𝑛𝑛𝑢𝑢𝑚𝑚 𝑖𝑖𝑡𝑡 = 𝑛𝑛𝑢𝑢𝑚𝑚 𝑖𝑖𝑡𝑡 + 1 for 𝑖𝑖 = 1, … , 𝑚𝑚 if not 𝐾𝐾𝐾𝐾𝐾𝐾(𝑖𝑖, 𝐶𝐶, 𝜀𝜀) select 𝑗𝑗 with 0 < 𝑗𝑗 < 𝑚𝑚 + 1 at random ( 𝑗𝑗 ≠ 𝑖𝑖 ) save 𝛼𝛼 𝑖𝑖 and 𝛼𝛼 𝑗𝑗 compute 𝜂𝜂 (see eq. (14) ) if 𝜂𝜂 ≥ 0 skip to next 𝑖𝑖 compute 𝐿𝐿 and 𝐻𝐻 (see eqs. (20) and (21) ) if 𝐿𝐿 = 𝐻𝐻 skip to next 𝑖𝑖 compute 𝛼𝛼 𝑗𝑗new,clipped (see eqs. (19) and (22) ) compute 𝛼𝛼 𝑖𝑖new (see eq. (23) ) compute 𝑑𝑑 new (see eqs. (25) and (26) ) endif endfor endwhile <?page no="77"?> Support Vector Machines Made Easy 77 Thus, the algorithm checks at each iteration (while-loop), if the KKT conditions are satisfied for all 𝑖𝑖 to within a tolerance 𝜀𝜀 or if 𝑚𝑚𝑎𝑎𝑥𝑥 𝑖𝑖𝑡𝑡 has been reached. It will then use the KKT conditions to compute the new 𝑑𝑑 , after having computed 𝛼𝛼 𝑗𝑗 and updated 𝛼𝛼 𝑖𝑖 accordingly. To obtain a full Support Vector Machine, we not only need a training procedure (as above), but also a production procedure, namely a routine, that will classify unknown samples 𝒙𝒙 . This is done simply by evaluating 𝑓𝑓(𝒙𝒙) , and outputting ‘positive’ or ‘negative’ depending on the sign of the result. 
Algorithm: Simplified SMO (Production) Input: 𝒙𝒙 : unclassified (unknown) sample (𝒙𝒙 1 , 𝑦𝑦 1 ), … , (𝒙𝒙 𝑚𝑚 , 𝑦𝑦 𝑚𝑚 ) : training data 𝜶𝜶 : dual variables (Lagrange multipliers computed in the training phase by SMO above) 𝑑𝑑 : threshold (computed in the training phase by SMO) Output: (positive, negative) result of classification of input 𝒙𝒙 . Compute 𝑓𝑓(𝒙𝒙) from eq. (17) and/ or eq. (18) depending on whether a kernel was used. Output positive, if 𝑓𝑓(𝒙𝒙) > 0 , negative else. Remark Appendix A shows a well-known method for linear programming (Simplex Algorithm). The Simplex Algorithm consists of two phases, both of which involve matrix operations and book-keeping of substantial complexity and difficulty. However, implementing the Simplex Algorithm in the form stated here would not be very useful in most applications. Commercial packages contain highly optimized code for the Simplex Algorithm, which will usually be much better. Many such packages also contain QP-modules. It is clear that QP is even more involved than Simplex-LP. In the light of this complexity, you should appreciate the simplicity of the SMO Algorithm, as well as the fact that it most likely will be significantly faster in applications than an implementation based on a QP module, even a commercial one.  Programming Exercise Implement the Simplified SMO Algorithm, and apply it to pattern recognition, on the following type of data. Generate two 16 by 16 binary bitmap arrays, showing a rectangle on the first, and a circle on the second. Add noise to both images thus obtained, and thereby generate training data. Apply the SMO method to classify rectangles and circles for this case. <?page no="79"?> 8 Regression Above we have used the values 𝑦𝑦 𝑖𝑖 as object classes and we tried to map data points 𝒙𝒙 𝑖𝑖 to their respective classes. Hence, we assumed 𝑦𝑦 𝑖𝑖 = +1 or 𝑦𝑦 𝑖𝑖 = −1 . More generally, we will now allow 𝑦𝑦 𝑖𝑖 to be arbitrary real numbers. Thus, each sample point 𝒙𝒙 𝑖𝑖 produces a function output 𝑦𝑦 𝑖𝑖 . As opposed to classification, the goal of regression is to learn this function. Figure 33 - Example of regression. We want to find the linear function approximating the data points marked ∘ . Figure 33 shows an example where a regression line approximates the function relating size to weight. Here, size is the input ( 𝒙𝒙 𝑖𝑖 ), and weight is the output, i.e. 𝑦𝑦 𝑖𝑖 . Samples are now of the form (𝒙𝒙 𝑖𝑖 , 𝑦𝑦 𝑖𝑖 ) where the 𝑦𝑦 𝑖𝑖 are real numbers, instead of a value ±1 . The regression line (or plane) can also be found with (QP)! The corresponding (QP) is minimize 12 𝒘𝒘 T 𝒘𝒘 subject to 𝑦𝑦 𝑖𝑖 − (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 𝜀𝜀 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) − 𝑦𝑦 𝑖𝑖 ≤ 𝜀𝜀. The constraints mean that the distance between the value ( 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑 ) and the value 𝑦𝑦 𝑖𝑖 should not be bigger than 𝜀𝜀 . This means that the regression line must not have a distance more than 𝜀𝜀 from all samples (𝒙𝒙 𝑖𝑖 , 𝑦𝑦 𝑖𝑖 ) . This may not always be feasible, i.e. if ε is too small. This is illustrated in Figure 34. size weight <?page no="80"?> 80 Fundamentals of Machine Learning Figure 34 - The effect of ε for regression. Left: ε is sufficiently large | Right: one sample violates the constraint (marked ◻ ) Recall that the constraints for the classification case were similar: 𝑦𝑦 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≥ 1 8.1 Slack Variables To allow for finding a regression line, even if the points are far apart, we again introduce slack variables. One allows for a violation of the constraints. This is the same idea as for classification: we allow violation of the constraints but penalize it! 
Thus, instead of requiring 𝑦𝑦 𝑖𝑖 − (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 𝜀𝜀 we ask for a little less, namely 𝑦𝑦 𝑖𝑖 − (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 𝜀𝜀 + 𝜉𝜉 where 𝜉𝜉 is a variable ( 𝜀𝜀 is a constant). 𝜉𝜉 is treated just like any other QP-variable (or LP-variable), i.e. 𝜉𝜉 ≥ 0 . Clearly, we would like 𝜉𝜉 to be as small as possible, since it is the amount by which the original constraint is violated. Thus, we include the minimization of the 𝜉𝜉 -values into the above QP: minimize 12 𝒘𝒘 T 𝒘𝒘 + 𝜉𝜉 1 + 𝜉𝜉 1′ + 𝜉𝜉 2 + 𝜉𝜉 2′ + ⋯ subject to 𝑦𝑦 1 − (𝒘𝒘 T 𝒙𝒙 1 + 𝑑𝑑) ≤ 𝜀𝜀 + 𝜉𝜉 1 (𝒘𝒘 T 𝒙𝒙 1 + 𝑑𝑑) − 𝑦𝑦 1 ≤ 𝜀𝜀 + 𝜉𝜉 1′ 𝑦𝑦 2 − (𝒘𝒘 T 𝒙𝒙 2 + 𝑑𝑑) ≤ 𝜀𝜀 + 𝜉𝜉 2 (𝒘𝒘 T 𝒙𝒙 2 + 𝑑𝑑) − 𝑦𝑦 2 ≤ 𝜀𝜀 + 𝜉𝜉 2′ ⋮ 𝜉𝜉 𝑖𝑖 , 𝜉𝜉 𝑖𝑖′ ≥ 0 But soon we will combine slack variables, QPs, duals and kernels! We now - just as size weight 𝜀𝜀𝜀𝜀 size weight 𝜀𝜀𝜀𝜀 <?page no="81"?> Support Vector Machines Made Easy 81 for classification - introduce a constant 𝐶𝐶 , which represents the trade-off between the slack variables and the quadratic term in the minimization function. Thus, we rewrite the last QP: minimize 12 𝒘𝒘 T 𝒘𝒘 + 𝐶𝐶 ∑ (𝜉𝜉 𝑖𝑖 + 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚𝑖𝑖=1 subject to 𝑦𝑦 𝑖𝑖 − (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 𝜀𝜀 + 𝜉𝜉 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) − 𝑦𝑦 𝑖𝑖 ≤ 𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ This is the same idea we used for classification - with only one difference: we now have two slack variables 𝜉𝜉 𝑖𝑖 and 𝜉𝜉 𝑖𝑖′ for each sample. 8.2 Duality, Kernels and Regression As above in the case of classification, we can use kernel functions here for regression. Before doing that, we must dualize the above QP. In much the same way as for classification, we can derive the dual form of the regression problem as maximize − 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − 𝜀𝜀 �(𝛼𝛼 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 + � 𝑦𝑦 𝑖𝑖 (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 subject to ∑ (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚𝑖𝑖=1 = 0 𝛼𝛼 𝑖𝑖 , 𝛼𝛼 𝑖𝑖′ ∈ [0, 𝐶𝐶] where 𝛼𝛼 𝑖𝑖 , 𝛼𝛼 𝑖𝑖′ are the variables. To obtain this dual QP, Lagrange multipliers are used as in the case of classification (see below for the derivation of this dual form). Similar to the case of classification, we have 𝒘𝒘 = �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 and 𝑑𝑑 = 𝑦𝑦 𝑗𝑗 − �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖=1 ± 𝜀𝜀 where + holds for 𝛼𝛼 𝑗𝑗 ∈ (0, 𝐶𝐶) and − holds for 𝛼𝛼 𝑗𝑗′ ∈ (0, 𝐶𝐶) . And to predict the function value 𝑓𝑓(𝒙𝒙) of an unknown sample point 𝒙𝒙 after training, we have 𝑓𝑓(𝒙𝒙) = 𝒘𝒘 T 𝒙𝒙 + 𝑑𝑑 = �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑚𝑚 𝑖𝑖=1 + 𝑑𝑑 <?page no="82"?> 82 Fundamentals of Machine Learning Notice: whenever sample data 𝒙𝒙 𝑖𝑖 or unknown points 𝒙𝒙 appear in the above formulas for 𝑑𝑑 , for 𝑓𝑓(𝒙𝒙) and also in the actual QP, they appear as product 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝒋𝒋 or as product 𝒙𝒙 𝑖𝑖T 𝒙𝒙 . Thus, we can replace the 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝒋𝒋 by 𝑘𝑘�𝒙𝒙 𝑖𝑖 , 𝒙𝒙 𝑗𝑗 � everywhere, for an appropriate kernel 𝑘𝑘 . That means, we can train non-linear regression functions as well.  Programming Exercise Write a MATLAB Program that finds a regression line for sample points in 2D with dual quadratic programming. For the program, use the variables 𝜃𝜃 1 , … , 𝜃𝜃 2𝑚𝑚 by setting: 𝜃𝜃 𝑖𝑖 = 𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ 𝜃𝜃 𝑖𝑖+𝑚𝑚 = 𝛼𝛼 𝑖𝑖′ for 𝑖𝑖 = 1, … , 𝑚𝑚 . Then, 𝛼𝛼 𝑖𝑖 = 𝜃𝜃 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ Thus, for example, the constraint 𝛼𝛼 1 ≥ 0 is written as 𝜃𝜃 1 + 𝛼𝛼 1′ ≥ 0 . For the case 𝑚𝑚 = 4 , as a constraint line in the matrix inequality 𝐀𝐀𝒙𝒙 ≤ 𝒃𝒃 , this constraint gives the line −1 0 0 0 −1 0 0 0 as a line in the matrix 𝐀𝐀 and 0 as an entry in the vector 𝒃𝒃. We can now use a kernel function as in the case of classification. As an intermediate step before using a kernel function, we will look at Support Vector Regression with more sample points. 
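As a possible starting point for the exercise above, the dual can also be set up with a stacked variable z = [α; α'] instead of the θ-substitution; a minimal MATLAB sketch (X, y, epsilon and C are assumed to be given, and quadprog again minimizes the negated objective):

% Minimal sketch: the dual QP for epsilon-SV regression set up for quadprog,
% with the stacked variable z = [alpha; alpha'].  X (m-by-n), y (m-by-1),
% epsilon and C are assumed to be given.
m  = size(X,1);
G  = X*X';                                   % Gram matrix x_i' x_j
D  = [eye(m), -eye(m)];                      % D*z = alpha - alpha'
H  = D' * G * D;                             % quadratic term of the negated dual
f  = epsilon*ones(2*m,1) - D'*y;             % linear term of the negated dual
z  = quadprog(H, f, [], [], ones(1,m)*D, 0, zeros(2*m,1), C*ones(2*m,1));
w  = X' * (D*z);                             % w = sum_i (alpha_i - alpha_i') x_i
% d then follows from a sample whose multiplier lies strictly between 0 and C,
% as described in the text above.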
Regression with kernels allows for non-linear function modeling. As in the case of classification, we can use the function 𝐹𝐹: (𝑥𝑥 1 , 𝑥𝑥 2 ) → �𝑥𝑥 12 , 𝑥𝑥 22 , √2𝑥𝑥 1 𝑥𝑥 2 � .  Programming Exercise Write a MATLAB Program that computes a regression function (not necessarily a line), for a set of sample points in 2D, using the above function 𝐹𝐹 as a basis for a kernel. 8.3 Deriving the Dual form of the QP for Regression As in the case of classification, Lagrange multipliers are used to derive the dual form of the QP for regression. The primal QP is given by: <?page no="83"?> Support Vector Machines Made Easy 83 minimize 12 𝒘𝒘 T 𝒘𝒘 + 𝐶𝐶 ∑ (𝜉𝜉 𝑖𝑖 + 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚𝑖𝑖=1 subject to 𝑦𝑦 𝑖𝑖 − (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) ≤ 𝜀𝜀 + 𝜉𝜉 𝑖𝑖 (𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) − 𝑦𝑦 𝑖𝑖 ≤ 𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ 𝜉𝜉 𝑖𝑖 , 𝜉𝜉 𝑖𝑖′ ≥ 0 We can now build its Lagrangian using multipliers 𝜶𝜶, 𝜶𝜶 ′ , 𝜼𝜼, 𝜼𝜼′ (see also chapter 5)! This Lagrangian looks as follows: 𝐿𝐿 = ‖𝑤𝑤‖ 2 2 + 𝐶𝐶 �(𝜉𝜉 𝑖𝑖 + 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 − �(𝜂𝜂 𝑖𝑖 𝜉𝜉 𝑖𝑖 + 𝜂𝜂 𝑖𝑖′ 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 + 𝜉𝜉 𝑖𝑖 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ + 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 (27) For all dual variables, positivity constraints apply, since they refer to primal inequalities (see chapter 5). Thus, we have 𝛼𝛼 𝑖𝑖 , 𝛼𝛼 𝑖𝑖′ , 𝜂𝜂 𝑖𝑖 , 𝜂𝜂 𝑖𝑖′ ≥ 0. Taking derivatives with respect to 𝒘𝒘 , 𝑑𝑑 , 𝜉𝜉 𝑖𝑖 and 𝜉𝜉 𝑖𝑖′ and setting to zero we obtain the four equations: 𝜕𝜕 𝑑𝑑 𝐿𝐿 = �(𝛼𝛼 𝑖𝑖′ − 𝛼𝛼 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 ≔ 0 𝜕𝜕 𝒘𝒘 𝐿𝐿 = 𝒘𝒘 − �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 ≔ 0 𝜕𝜕 𝜉𝜉 𝑖𝑖 𝐿𝐿 = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖 − 𝜂𝜂 𝑖𝑖 ≔ 0 𝜕𝜕 𝜉𝜉 𝑖𝑖′ 𝐿𝐿 = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖′ − 𝜂𝜂 𝑖𝑖′ ≔ 0 (28) We rewrite the last two equations to 𝜂𝜂 𝑖𝑖 = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖 𝜂𝜂 𝑖𝑖′ = 𝐶𝐶 − 𝛼𝛼 𝑖𝑖′ (29) and replace all occurrences of 𝜂𝜂 𝑖𝑖 , 𝜂𝜂 𝑖𝑖′ in the Lagrange function (27) using (29) and obtain a new form of the Lagrange function 𝐿𝐿 : <?page no="84"?> 84 Fundamentals of Machine Learning 𝐿𝐿 = 12 ‖𝒘𝒘‖ 2 + 𝐶𝐶 �(𝜉𝜉 𝑖𝑖 + 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 − �(𝐶𝐶𝜉𝜉 𝑖𝑖 + 𝐶𝐶𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 + �(𝛼𝛼 𝑖𝑖 𝜉𝜉 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ 𝜉𝜉 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 + 𝜉𝜉 𝑖𝑖 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ + 𝑦𝑦 𝑖𝑖 − 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 − 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 Now, the first and second terms cancel out and we obtain 𝐿𝐿 = 12 ‖𝒘𝒘‖ 2 + �(𝛼𝛼 𝑖𝑖 𝜉𝜉 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ 𝜉𝜉 𝑖𝑖′ ) − 𝑚𝑚 𝑖𝑖=1 � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 + 𝜉𝜉 𝑖𝑖 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 + 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 −� 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ + 𝑦𝑦 𝑖𝑖 − 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 − 𝑑𝑑) 𝑚𝑚 𝑖𝑖=1 . Using the first equation from (28) , we obtain �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 = 0 and use this to further simplify our Lagrangian to 𝐿𝐿 = 12 ‖𝒘𝒘‖ 2 + �(𝛼𝛼 𝑖𝑖 𝜉𝜉 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ 𝜉𝜉 𝑖𝑖′ ) − 𝑚𝑚 𝑖𝑖=1 � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 + 𝜉𝜉 𝑖𝑖 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 −� 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝜉𝜉 𝑖𝑖′ + 𝑦𝑦 𝑖𝑖 − 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 . This, in turn, can be rewritten to 𝐿𝐿 = 12 ‖𝒘𝒘‖ 2 − � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) − 𝑚𝑚 𝑖𝑖=1 � 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝑦𝑦 𝑖𝑖 − 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 . 
Finally, using the second equation in (28) , we have 𝒘𝒘 = �(𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 and substitute this into the Lagrangian to obtain 𝐿𝐿 = 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 ′ 𝑖𝑖 )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 ′𝑗𝑗 � 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 − � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 − 𝑦𝑦 𝑖𝑖 + 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) − 𝑚𝑚 𝑖𝑖=1 � 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝑦𝑦 𝑖𝑖 − 𝒘𝒘 T 𝒙𝒙 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 from which, again using 𝒘𝒘 = ∑ (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )𝒙𝒙 𝑖𝑖 𝑚𝑚𝑖𝑖=1 , we get <?page no="85"?> Support Vector Machines Made Easy 85 𝐿𝐿 = 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − � 𝛼𝛼 𝑖𝑖 �𝜀𝜀 − 𝑦𝑦 𝑖𝑖 + ��𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑗𝑗T 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑗𝑗=1 � 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖′ �𝜀𝜀 + 𝑦𝑦 𝑖𝑖 − ��𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑗𝑗T 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑗𝑗=1 � 𝑚𝑚 𝑖𝑖=1 which can be rewritten to 𝐿𝐿 = 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − � 𝛼𝛼 𝑖𝑖 (𝜀𝜀 − 𝑦𝑦 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖′ (𝜀𝜀 + 𝑦𝑦 𝑖𝑖 ) 𝑚𝑚 𝑖𝑖=1 − � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑗𝑗T 𝒙𝒙 𝑖𝑖 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 = − 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − 𝜀𝜀 � 𝛼𝛼 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − 𝜀𝜀 � 𝛼𝛼 𝑖𝑖′ 𝑚𝑚 𝑖𝑖=1 + � 𝛼𝛼 𝑖𝑖 𝑦𝑦 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 − � 𝛼𝛼 𝑖𝑖′ 𝑦𝑦 𝑖𝑖 𝑚𝑚 𝑖𝑖=1 = = − 12 � (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ )�𝛼𝛼 𝑗𝑗 − 𝛼𝛼 𝑗𝑗′ �𝒙𝒙 𝑖𝑖T 𝒙𝒙 𝑗𝑗 𝑚𝑚 𝑖𝑖,𝑗𝑗=1 − 𝜀𝜀 �(𝛼𝛼 𝑖𝑖 + 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 + � 𝑦𝑦 𝑖𝑖 (𝛼𝛼 𝑖𝑖 − 𝛼𝛼 𝑖𝑖′ ) 𝑚𝑚 𝑖𝑖=1 which is the above final form of the dual objective function. The first equation in (28) is the dual constraint, and 𝛼𝛼 𝑖𝑖 , 𝛼𝛼 𝑖𝑖′ must both be positive. In summary, the overall strategy to derive the dual QP from the primal QP is the following: ▶ Start from the primal QP ▶ Set up the Lagrange function 𝐿𝐿 ▶ Take derivatives of 𝐿𝐿 with respect to the primal variables ▶ Set the resulting expressions to zero ▶ Re-substitute the resulting equations into the Lagrange function 𝐿𝐿 <?page no="87"?> Part II | Beyond Support Vectors <?page no="89"?> 9 Perceptrons, Neural Networks and Genetic Algorithms 9.1 Perceptrons Perceptrons were first presented in 1958 by Frank Rosenblatt, see [7]. The goal is to provide a model for information processing in the brain. The idea is to regard the brain as a network with nodes and edges. Nodes should represent neurons, while edges are models for synapses. Edges have adjustable weights, represented by real numbers. The entire network will first undergo a training phase in which such weight values are computed. After training, the network is then asked to classify unknown data. Some examples for classification tasks encountered in daily life will be presented next.  Example Decide which is fish and which is airplane Figure 35 - Example images for Classification  Example Classify patterns in order to minimize waste Figure 36 - Left: pattern fits well, small amount of waste. Right: patterns do not fit well <?page no="90"?> 90 Fundamentals of Machine Learning The perceptron was designed to allow for arbitrary classification tasks Figure 37 - Schematic of a Perceptron Input values are 𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 . The threshold value 𝑆𝑆 is a real number. The weights 𝑔𝑔 1 , … , 𝑔𝑔 𝑛𝑛 are real. The principle is: add the input values 𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 , weighted by 𝑔𝑔 1 , … , 𝑔𝑔 𝑛𝑛 , then check whether the sum is greater or equal to 𝑆𝑆 . If so, output 1 , else output 0 . Hence, the output is binary. Thus, the overall computation performed by a perceptron is to check whether the value ∑ 𝑥𝑥 𝑖𝑖 𝑔𝑔 𝑖𝑖 𝑖𝑖 is greater than or equal to S. One could think of a classification task in image processing. Pixel values provide input. The image is then classified according to the perceptron processing method outlined above. 
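The computation itself is a single thresholded, weighted sum; a minimal MATLAB sketch for a whole batch of input vectors (the data matrix X, the weight vector g and the threshold S are assumed to be given):

% Minimal sketch: the output of a perceptron (Figure 37) for a batch of inputs.
% X (m-by-n, one input vector per row), g (n-by-1) and S are assumed given.
out = double(X*g >= S);     % out(i) = 1 if sum_k x_ik * g_k >= S, else 0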
Figure 38 - Using a Perceptron to process an image It is important to notice that finding the values 𝑔𝑔 𝑖𝑖 and 𝑆𝑆 (training phase) is equivalent to finding a separating plane. To see why, let 𝑷𝑷 be the set of positive training samples and 𝑵𝑵 be the set of negative training samples. ≥ 𝑆𝑆 𝑥𝑥 1 + 𝑥𝑥 2 𝑥𝑥 𝑛𝑛 ⋮ 𝑔𝑔 1 𝑔𝑔 𝑛𝑛 yes no 10 𝑙𝑙 3 𝑙𝑙 2 𝑙𝑙 1 ≥ 𝑆𝑆 + 𝑔𝑔 1 𝑔𝑔 3 𝑔𝑔 2 <?page no="91"?> Support Vector Machines Made Easy 91 The goal is now to find values 𝑔𝑔 1 , … , 𝑔𝑔 𝑛𝑛 , 𝑆𝑆 , such that 𝑝𝑝 1 𝑔𝑔 1 + ⋯ + 𝑝𝑝 𝑛𝑛 𝑔𝑔 𝑛𝑛 ≥ 𝑆𝑆 for each 𝒑𝒑 in 𝑷𝑷 (here 𝒑𝒑 has the coordinates 𝒑𝒑 = (𝑝𝑝 1 , … , 𝑝𝑝 𝑛𝑛 ) T ) and 𝑞𝑞 1 𝑔𝑔 1 + ⋯ + 𝑞𝑞 𝑛𝑛 𝑔𝑔 𝑛𝑛 < 𝑆𝑆 for each 𝒒𝒒 in 𝑵𝑵 . Thus, training can be reduced to a feasibility test (F). We next show that we can assume that S = 0. To this end, we extend the perceptron by an extra input channel which always receives the input 1, and which has the weight −𝑆𝑆 : Figure 39 - Extended Perceptron with bias 𝑆𝑆 Hence, we are looking for a plane such that 𝑝𝑝 1 𝑔𝑔 1 + ⋯ + 𝑝𝑝 𝑛𝑛 𝑔𝑔 𝑛𝑛 ≥ 0 for each 𝒑𝒑 in 𝑷𝑷 and 𝑞𝑞 1 𝑔𝑔 1 + ⋯ + 𝑞𝑞 𝑛𝑛 𝑔𝑔 𝑛𝑛 < 0 for 𝒒𝒒 in 𝑵𝑵 . Now we admit the value −1 as input value, and negate all 𝑞𝑞 in 𝑵𝑵 . The set of all negated points in 𝑵𝑵 will be called 𝑵𝑵 ′ . The set M is defined as 𝑴𝑴 ≔ 𝑵𝑵 ′ ∪ 𝑷𝑷 . Thus, we are seeking a plane containing 0, which is below all elements in 𝑴𝑴 . We can now assume that all vectors in 𝑴𝑴 have norm 1. Why? Perceptron-Algorithm The following famous algorithm indeed can find a plane containing 0, that is below all elements in a given set of points, if such a plane exists. Given the utter simplicity of this algorithm, this is very surprising. ≥ 𝑆𝑆 𝑥𝑥 1 + 𝑥𝑥 2 𝑥𝑥 𝑛𝑛 ⋮ 𝑔𝑔 1 𝑔𝑔 𝑛𝑛 yes no 10 1 −𝑆𝑆 <?page no="92"?> 92 Fundamentals of Machine Learning Choose a vector 𝒈𝒈 with random coordinates Repeat Test: for all 𝒎𝒎 ∈ 𝑴𝑴 check whether 〈𝒈𝒈, 𝒎𝒎〉 ≤ 0 If 𝒎𝒎 0 is the first such point, set 𝒈𝒈 ≔ 𝒈𝒈 + 𝒎𝒎 0 , go to Test: if there is no such point in 𝑴𝑴 set 𝑒𝑒𝑛𝑛𝑑𝑑 ≔ 1 until 𝑒𝑒𝑛𝑛𝑑𝑑 = 1 To understand why this is so surprising, note that the above little algorithm indeed solves a feasibility test, but with one little caveat: in the case of infeasibility the algorithm will loop forever. Perceptron-Lemma and Convergence To prove that the perceptron-algorithm will always find a plane below M, if there is one, we set 𝜀𝜀 = min{⟨𝒈𝒈, 𝒎𝒎⟩|𝒎𝒎 ∈ 𝑴𝑴} . The value 𝜀𝜀 will be called the margin. It can be shown that the above simple algorithm will converge, if a separating plane with 𝜀𝜀 > 0 exists. Proof Let 𝐸𝐸 be the separating plane, and 𝒈𝒈 ∗ be the normal vector of 𝐸𝐸 . We assume: 𝜀𝜀 > 0 . Hence, ⟨𝒈𝒈 ∗ , 𝒎𝒎⟩ ≥ 0 for 𝒎𝒎 ∈ 𝑴𝑴 . Let 𝒈𝒈 𝑖𝑖 be the weight vector computed in the 𝑖𝑖 th step. Now, we will show that the angle between 𝒈𝒈 ∗ and 𝒈𝒈 𝑖𝑖 is reduced in each step. We can assume 𝒈𝒈 0 = (0, … ,0) T . Let 𝛼𝛼 be the angle between 𝒈𝒈 ∗ and 𝒈𝒈 𝑖𝑖+1 . Then, since 𝒈𝒈 ∗ has norm 1 , cos(𝛼𝛼) = ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖+1 ⟩ ‖𝒈𝒈 𝑖𝑖+1 ‖ . We thus have: ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖+1 ⟩ = ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖 + 𝒎𝒎 𝑖𝑖 ⟩ = ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖 ⟩ + ⟨𝒈𝒈 ∗ , 𝒎𝒎 𝑖𝑖 ⟩ ≥ ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖 ⟩ + 𝜀𝜀 Repeated application yields: ⟨𝒈𝒈 ∗ , 𝒈𝒈 𝑖𝑖+1 ⟩ ≥ ⟨𝒈𝒈 ∗ , 𝒈𝒈 0 ⟩ + (𝑖𝑖 + 1)𝜀𝜀 We look at the square of the denominator of the above expression for cos 𝛼𝛼 . It is ‖𝒈𝒈 𝑖𝑖+1 ‖ 2 = ⟨𝒈𝒈 𝑖𝑖 + 𝒎𝒎 𝑖𝑖 , 𝒈𝒈 𝑖𝑖 + 𝒎𝒎 𝑖𝑖 ⟩ = ‖𝒈𝒈 𝑖𝑖 ‖ 2 + 2⟨𝒈𝒈 𝑖𝑖 , 𝒎𝒎 𝑖𝑖 ⟩ + ‖𝒎𝒎 𝑖𝑖 ‖ 2 . <?page no="93"?> Support Vector Machines Made Easy 93 But 2⟨𝒈𝒈 𝑖𝑖 , 𝒎𝒎 𝑖𝑖 ⟩ ≤ 0 . Hence, ‖𝒈𝒈 𝑖𝑖+1 ‖ 2 ≤ ‖𝒈𝒈 𝑖𝑖 ‖ 2 + ‖𝒎𝒎 𝑖𝑖 ‖ 2 = ‖𝒈𝒈 𝑖𝑖 ‖ 2 + 1 . 
Repeating the last step yields: ‖𝒈𝒈 𝑖𝑖+1 ‖ 2 ≤ ‖𝒈𝒈 0 ‖ 2 + 𝑖𝑖 + 1 Inserting into the above expression for the cos : cos(𝛼𝛼) ≥ (⟨𝒈𝒈, 𝒈𝒈 0 ⟩ + (𝑖𝑖 + 1)𝜀𝜀) �‖𝒈𝒈 0 ‖ 2 + (𝑖𝑖 + 1) Since 𝒈𝒈 0 = (0, … ,0) T , the right-hand side becomes approximately equal to 𝜀𝜀√𝑖𝑖 for large values of 𝑖𝑖 . Thus, the right-hand side tends to infinity for large 𝑖𝑖 . But cos(𝛼𝛼) ≤ 1! This means: the assumption, that the procedure does not terminate although there is a separating plane with normal vector 𝒈𝒈 ∗ leads to a contradiction. End of proof. Perceptrons and Linear Feasibility Testing Thus, the perceptron-algorithm can solve the linear feasibility problem. But it cannot detect infeasibility. In the above version, the algorithm will not terminate, in case of infeasibility. To repair this, the bounds in the above proof can be used, to stop the algorithm in case of infeasibility. A question arising in this context is: can the above algorithm also solve the maximization involved in linear programming. The answer is yes! We choose the vector 𝑐𝑐 = (1,0, … ,0) T (possibly by rotating the entire configuration). Then we only have to find an extremal point in the 𝑥𝑥 1 -direction, by repeatedly applying feasibility checking. This is done by bisection over 𝑥𝑥 1 , where we check for feasibility of the (𝑛𝑛 − 1) dimensional polyhedron 𝑃𝑃 0 . <?page no="94"?> 94 Fundamentals of Machine Learning Figure 40 - The polyhedron 𝑃𝑃 0  Programming Exercise Write a MATLAB program that implements the Perceptron Algorithm. Input: Positive and negative sample points in 2D, which can be separated by a line. Output: A separating line. 9.2 Neural Networks Neural networks are an alternative to support vector machines. One can use them in the same way to classify unknown sample after a training phase. Training instances are pre-classified, i.e. each training instance is marked as a positive or a negative example for the feature to be learned. We will briefly review the basics, in order to be able to compare support vector approaches to neural networks. Neural networks generalize the perceptrons discussed above. A typical structure is shown in the next figure. There are three layers: an input layer, a hidden layer and an output layer. 𝑥𝑥 2 , … , 𝑥𝑥 𝑛𝑛 𝑥𝑥 1 𝐻𝐻 𝑃𝑃 0 𝑃𝑃 <?page no="95"?> Support Vector Machines Made Easy 95 Figure 41 - A neural network As for the case of perceptrons (see above), we can transform the threshold values 𝑆𝑆 into additional weights: Figure 42 - A neural network with bias inputs Forward Propagation The procedure for the calculation of an output for a given input is called forward propagation: ➊ The weights 𝑔𝑔 𝑖𝑖𝑗𝑗 are first multiplied with the input parameters. ➋ In the hidden layer, the sum of the values 𝑥𝑥 𝑖𝑖 ⋅ 𝑔𝑔 𝑖𝑖 is then compared to the corresponding threshold parameter 𝑆𝑆 . ➌ The result is then propagated into the output layer and is multiplied with the weights ℎ 𝑖𝑖𝑗𝑗 . ➍ A second threshold value comparison is performed in the output layer. ≥ 𝑆𝑆 1 ≥ 𝑆𝑆 2 ≥ 𝑆𝑆 𝑛𝑛 ≥ 𝑆𝑆 1′ ≥ 𝑆𝑆 2′ ≥ 𝑆𝑆 𝑛𝑛′ input hidden layer output ≥ 0 ≥ 0 ≥ 0 ≥ 0 ≥ 0 ≥ 0 1 1 𝑔𝑔 11 ℎ 11 <?page no="96"?> 96 Fundamentals of Machine Learning Training and Error Backpropagation Typically, we are given instances, or training samples. The goal is then to calculate the weights 𝑔𝑔 𝑖𝑖𝑗𝑗 and ℎ 𝑖𝑖𝑗𝑗 that best represent the properties to be learned from the training samples. The method to do this is called error back-propagation. Thus: In the beginning all weights are random values. 
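As a small illustration of forward propagation with the bias inputs of Figure 42 folded in, consider the following MATLAB sketch. The layer sizes and the random initial weights are placeholders; each node still applies the simple 0/1 threshold test described above.

% Sketch of forward propagation through the network of Figure 42: the bias
% is treated as an extra input fixed to 1, and every node outputs 1 if its
% weighted input sum is >= 0, else 0. Sizes and weights are placeholders.
rng(0);
nIn = 4; nHid = 3; nOut = 2;
G = randn(nHid, nIn + 1);               % weights g_ij, last column is the bias weight -S
H = randn(nOut, nHid + 1);              % weights h_ij, last column is the bias weight -S'

x      = rand(nIn, 1);                  % one input vector
hidden = double(G * [x; 1] >= 0);       % hidden layer: weighted sum vs. threshold
output = double(H * [hidden; 1] >= 0);  % output layer: second threshold comparison
disp(output');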
We begin with the first input vector 𝒙𝒙 = (𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ) T of the training set. This vector is propagated through the network in forward direction (i.e. from left to right). This gives an output vector 𝒚𝒚 = (𝑦𝑦 1 , … , 𝑦𝑦 𝑚𝑚 ) T . Now, the desired output corresponding to the input 𝒙𝒙 is given with the training data set! We compare the actual output 𝒚𝒚 to the desired output 𝝍𝝍 = (𝜓𝜓 1 , … , 𝜓𝜓 𝑚𝑚 ) T . The squared difference between actual and desired output will be called the error 𝐹𝐹 = (|𝑦𝑦 1 − 𝜓𝜓 1 | 2 , … , |𝑦𝑦 𝑚𝑚 − 𝜓𝜓 𝑚𝑚 | 2 ) T . We must adapt weights 𝑔𝑔 𝑖𝑖𝑗𝑗 and ℎ 𝑖𝑖𝑗𝑗 in such a way, that the error 𝐹𝐹 is minimized. Thus, we consider the value 𝐹𝐹 1 = |𝑦𝑦 1 − 𝜓𝜓 1 | 2 at the first output cell 𝑎𝑎 1 . We must adapt the weight value ℎ 11 such that the error value |𝑦𝑦 1 − 𝜓𝜓 1 | 2 is reduced. As a short cut, we write ℎ instead of ℎ 11 . We consider the function 𝐹𝐹(ℎ, 𝒙𝒙) in the rectangular sector shown in the next figure. Figure 43 - The objective function for backpropagation The error function 𝐹𝐹(𝒙𝒙, ℎ) can be transformed into a differentiable function by substituting the threshold value function at node 𝑎𝑎 1 . I.e. the previous simple 0-1-threshold value computation is substituted by a differentiable function in the following way: We replace the step function by 𝑦𝑦 = 1 1+𝑒𝑒 −𝑥𝑥 , i.e. ≥ 0 ≥ 0 ≥ 0 ≥ 0 ≥ 0 ≥ 0 1 1 𝑔𝑔 11 ℎ 𝑎𝑎 1 ≥ 0 𝒙𝒙 ℎ 𝑎𝑎 1 𝑦𝑦 1 − 𝜓𝜓 1 2 <?page no="97"?> Support Vector Machines Made Easy 97 Figure 44 - Activation functions for neural networks. Left: step function | Right: 𝑦𝑦 = 1 1+ 𝑒𝑒 −𝑥𝑥 Of course, different choices for the activation function are also possible. We must now adapt ℎ . We determine the (one-dimensional) derivative 𝐹𝐹 ′ (𝒙𝒙, ℎ) of 𝐹𝐹(𝒙𝒙, ℎ) with respect to ℎ . 𝒙𝒙 is regarded as a constant here, thus we can write 𝐹𝐹 𝒙𝒙 (ℎ) instead of 𝐹𝐹(𝒙𝒙, ℎ) . To reduce the error, simply subtract the value 𝐹𝐹 𝒙𝒙 (ℎ) from ℎ , i.e. set ℎ 𝑛𝑛𝑒𝑒𝑛𝑛 = ℎ − 𝐹𝐹 𝒙𝒙 (ℎ) . The calculation of the derivative of 𝐹𝐹 𝒙𝒙 (ℎ) can be simplified somewhat: Set 𝑟𝑟 𝑥𝑥 (ℎ) ≔ ℎ ⋅ 𝑥𝑥 𝑠𝑠(𝑢𝑢) ≔ 1 1 + 𝑒𝑒 −𝑢𝑢 𝑡𝑡 𝜓𝜓 (𝑦𝑦) ≔ (𝑦𝑦 − 𝜓𝜓) 2 . Then, 𝐹𝐹 𝑥𝑥 (ℎ) = 𝑡𝑡 𝜓𝜓 �𝑠𝑠�𝑟𝑟 𝑥𝑥 (ℎ)�� . Note that 𝜓𝜓 and 𝒙𝒙 are constant! Then 𝑠𝑠 ′ (𝑢𝑢) = 𝑠𝑠(𝑢𝑢)�1 − 𝑠𝑠(𝑢𝑢)� . Therefore, using 𝑢𝑢 = 𝑟𝑟 𝑥𝑥 (ℎ) , 𝐹𝐹 𝑥𝑥′ (ℎ) = 𝑡𝑡 ′ �𝑠𝑠�𝑟𝑟 𝒙𝒙 (ℎ)�� 𝑠𝑠 ′ �𝑟𝑟 𝒙𝒙 (ℎ)�𝑟𝑟 𝒙𝒙′ (ℎ) = = 2�𝑠𝑠�𝑟𝑟 𝒙𝒙 (ℎ)� − 𝜓𝜓�𝑠𝑠 ′ �𝑟𝑟 𝒙𝒙 (ℎ)�𝒙𝒙 = = 2(𝑠𝑠(𝑢𝑢) − 𝜓𝜓) ⋅ 𝑠𝑠(𝑢𝑢) ⋅ �1 − 𝑠𝑠(𝑢𝑢)� ⋅ 𝑥𝑥 <?page no="98"?> 98 Fundamentals of Machine Learning The extension to the overall network is straightforward: ▶ Extend the error function F to the entire network ▶ Take the (partial) derivative of 𝐹𝐹 𝒙𝒙 (𝑔𝑔 11 , … , 𝑔𝑔 𝑘𝑘𝑙𝑙 , ℎ 11 , … , ℎ 𝑚𝑚𝑛𝑛 ) with respect to each individual 𝑔𝑔 𝑖𝑖𝑗𝑗 (resp. ℎ 𝑖𝑖𝑗𝑗 ) ▶ Change the weight vector into direction of the negative gradient, i.e. ‘in direction of' 𝐺𝐺 new = 𝐺𝐺 − ∇𝐹𝐹 𝑥𝑥 where 𝐺𝐺 = (𝑔𝑔 11 , … , 𝑔𝑔 𝑘𝑘𝑙𝑙 , ℎ 11 , … , ℎ 𝑚𝑚𝑛𝑛 ) T . Notice the similarity to the case of the perceptron algorithm. The perceptron algorithm works as follows: check all samples, if you find one that is classified in the wrong way, add this sample to the normal vector of the current hyperplane. In the case of neural networks, we simply add the gradient vector of the error function to the current weight vector. 9.3 Genetic Algorithms Classification of unknown samples and learning of features can also be done with genetic algorithms. Learning is again regarded as an optimization process. The difference to other methods is that gradient information about the optimization function is not used in the process. We consider binary strings (i.e. 
strings containing only the letters ‘0’ and ‘1’). And we are given an evaluation function 𝑓𝑓 , which operates on such strings. Thus, for each such binary input string, we can evaluate its function value under 𝑓𝑓 . 𝑃𝑃(𝑖𝑖) is the set of current (binary) strings in iteration 𝑖𝑖 (also called generation 𝑖𝑖 ). The method is the following: Set 𝑖𝑖 = 1 Initialise 𝑃𝑃(1) as a set of random strings of fixed length 𝑛𝑛 . Repeat 𝐺𝐺 ≔ 𝑘𝑘 best elements (under 𝑓𝑓 ) in 𝑃𝑃(𝑖𝑖) . 𝐺𝐺 ≔ 𝐺𝐺 + set of 𝑚𝑚 elements chosen from 𝑃𝑃(𝑖𝑖) at random. 𝑖𝑖 ≔ 𝑖𝑖 + 1. Generate new 𝑃𝑃(𝑖𝑖) by ‘recombining’ elements from 𝐺𝐺 . Recombination Each element of 𝑃𝑃(𝑖𝑖) is represented as a binary string. Generate one new string from two old ones by cutting both of the old ones in the middle and joining two of the resulting pieces crosswise. Termination condition <?page no="99"?> Support Vector Machines Made Easy 99 The termination condition is: an element of 𝑃𝑃(𝑖𝑖) reaches a given threshold evaluation under 𝑓𝑓 . Mutation Flip some values in the strings at random. Conclusions The gradient information of 𝑓𝑓 is not used in this method. Instead it relies on randomization. Randomization can be useful in appropriate applications. However, genetic algorithms will not always be adequate. 9.4 Conclusion Support vector machines require more work to implement and understand than neural networks. But: support vector machines are based on quadratic programming, which is globally convergent, and finds a global optimum, while gradient search in neural networks only has local convergence. <?page no="101"?> 10 Bayesian Regression This chapter presents an introduction to probabilistic machine learning. The introduction is based on [8] and a more detailed description can be found in [9] and [10]. In general, probabilistic approaches are based on Bayes’ theorem named after Thomas Bayes (1701-1761), who first showed how to use new evidence (observations) to update a prior belief. Bayes’ rule and relevant terms are presented and illustrated on examples in section 10.1. In section 10.2, Bayes’ rule is used to solve a probabilistic linear regression task and comparisons are made to least mean square solutions and rigid linear regression. Sections 10.3 and 10.4 present one of the most frequently used probabilistic methods, Gaussian processes (GP), and its extension to multivariate signals (Multi-task Gaussian processes (MTGP). Even though these methods can be used for classification and regression tasks, we focus here on regression and prediction problems. 10.1 Bayesian Learning We illustrate basic probability terms on a simple example. We assume that a solid, dashed, and dotted box are given. Ref to figure 45. The set of boxes is specified by 𝑌𝑌 = {𝑠𝑠𝑠𝑠𝑙𝑙𝑖𝑖𝑑𝑑, 𝑑𝑑𝑎𝑎𝑠𝑠ℎ𝑒𝑒𝑑𝑑, 𝑑𝑑𝑠𝑠𝑡𝑡𝑡𝑡𝑒𝑒𝑑𝑑} with 𝑦𝑦 𝑗𝑗 𝜖𝜖 𝑌𝑌 and 𝑗𝑗 𝜖𝜖 {1, 2, 3} . All three boxes contain two different kinds of objects (cubes and balls) which define the set 𝑋𝑋 = {𝑐𝑐𝑢𝑢𝑏𝑏𝑒𝑒, 𝑏𝑏𝑎𝑎𝑙𝑙𝑙𝑙} with 𝑥𝑥 𝑖𝑖 𝜖𝜖 𝑋𝑋 and 𝑖𝑖 𝜖𝜖 {1, 2} . The setting is illustrated in the following figure. Figure 45 - Illustration of discrete probability example; a solid, dashed or dotted box contain a different number of cubes and balls. The probability 𝑝𝑝(𝑥𝑥) can be interpreted as the likeliness of a particular event 𝑥𝑥 with 𝑝𝑝 𝜖𝜖 [0, 1] . In this discrete example, the probability 𝑝𝑝 can be computed as a simple fraction. 
For example, the probability 𝑝𝑝 of taking out one specific object 𝑥𝑥 𝑖𝑖 from a specific box 𝑦𝑦 𝑗𝑗 is defined as 𝑝𝑝�𝑋𝑋 = 𝑥𝑥 𝑖𝑖 , 𝑌𝑌 = 𝑦𝑦 𝑗𝑗 � = 𝑝𝑝�𝑥𝑥 𝑖𝑖 , 𝑦𝑦 𝑗𝑗 � = 𝑛𝑛 𝑖𝑖𝑗𝑗 𝑁𝑁 with 𝑛𝑛 𝑖𝑖𝑗𝑗 being the number of objects 𝑥𝑥 𝑖𝑖 in box 𝑦𝑦 𝑗𝑗 and 𝑁𝑁 the total number of objects in all boxes. Note, we assume that the objects are chosen blindly with replacement. <?page no="102"?> 102 Fundamentals of Machine Learning The probability of 𝑝𝑝�𝑥𝑥 𝑖𝑖 , 𝑦𝑦 𝑗𝑗 � is known as the joint probability. The six possible joint probabilities of the example are highlighted in Figure 46. Figure 46 - Joint and marginal probabilities 𝑝𝑝�𝑥𝑥 𝑖𝑖 , 𝑦𝑦 𝑗𝑗 � (center), 𝑝𝑝(𝑥𝑥 𝑖𝑖 ) (right), and 𝑝𝑝�𝑦𝑦 𝑗𝑗 � (bottom) In the general case of a continuous variable 𝑥𝑥 , the probability 𝑝𝑝(𝑥𝑥) with 𝑥𝑥 𝜖𝜖 [𝑎𝑎, 𝑏𝑏] is defined as 𝑝𝑝(𝑎𝑎 ≤ 𝑥𝑥 ≤ 𝑏𝑏) = � 𝑝𝑝(𝑥𝑥) 𝑏𝑏 𝑎𝑎 𝑑𝑑𝑥𝑥. By definition, the probability function 𝑝𝑝(𝑥𝑥) must fulfill two conditions: 𝑝𝑝(𝑥𝑥) ≥ 0, � 𝑝𝑝(𝑥𝑥) +∞ −∞ 𝑑𝑑𝑥𝑥 = 1. The probability of taking out one specific object 𝑥𝑥 𝑖𝑖 irrespectively of the boxes is known as the marginal probability 𝑝𝑝(𝑥𝑥 𝑖𝑖 ) and can be computed by applying the “sum rule of probabilities” 𝑝𝑝(𝑥𝑥 𝑖𝑖 ) = � 𝑝𝑝(𝑥𝑥 𝑖𝑖 , 𝑦𝑦 𝑗𝑗 ) 3 𝑗𝑗=1 or, for continuous variables, as 𝑝𝑝(𝑥𝑥) = � 𝑝𝑝(𝑥𝑥, 𝑦𝑦) 𝑑𝑑𝑦𝑦. This means the influence of, i.e., variable 𝑦𝑦 is integrated out. The marginal probability of 𝑝𝑝(𝑥𝑥 𝑖𝑖 ) and 𝑝𝑝(𝑦𝑦 𝑗𝑗 ) is shown in Figure 46 bottom and right, respectively. If the box is known before ( 𝑌𝑌 = 𝑦𝑦 𝑗𝑗 ), the conditional probability 𝑝𝑝�𝑥𝑥 𝑖𝑖 �𝑦𝑦 𝑗𝑗 � can be computed as 𝑝𝑝�𝑥𝑥 𝑖𝑖 | 𝑦𝑦 𝑗𝑗 � = 𝑛𝑛 𝑖𝑖𝑗𝑗 𝑛𝑛 𝑗𝑗 with 𝑛𝑛 𝑗𝑗 being the number of objects in box 𝑦𝑦 𝑗𝑗 . The conditional probability can be derived by applying the “product rule of probabilities” 𝑝𝑝�𝑥𝑥 𝑖𝑖 , 𝑦𝑦 𝑗𝑗 � = 𝑝𝑝�𝑦𝑦 𝑗𝑗 | 𝑥𝑥 𝑖𝑖 �𝑝𝑝(𝑥𝑥 𝑖𝑖 ). Figure 47 shows the conditional probability 𝑝𝑝�𝑦𝑦 𝑗𝑗 | 𝑥𝑥 𝑖𝑖 �. 𝑌𝑌 𝑋𝑋 solid dashed dotted cube ball 1 12 ⁄ 1 4 ⁄ 1 12 ⁄ 5 12 ⁄ 1 12 ⁄ 1 6 ⁄ 5 3 ⁄ 1 3 ⁄ 1 4 ⁄ 1 3 ⁄ 7 12 ⁄ <?page no="103"?> Support Vector Machines Made Easy 103 Figure 47 - Conditional probability 𝑝𝑝�𝑦𝑦 𝑗𝑗 �𝑥𝑥 𝑖𝑖 � The product rule allows the derivation of the Bayes’ theorem which is defined for discrete variables as 𝑝𝑝�𝑥𝑥 𝑖𝑖 | 𝑦𝑦 𝑗𝑗 � = 𝑝𝑝�𝑦𝑦 𝑗𝑗 | 𝑥𝑥 𝑖𝑖 �𝑝𝑝(𝑥𝑥 𝑖𝑖 ) 𝑝𝑝�𝑦𝑦 𝑗𝑗 � . (30) The general definition for continuous variables is 𝑝𝑝(𝑥𝑥| 𝑦𝑦) = 𝑝𝑝(𝑦𝑦| 𝑥𝑥)𝑝𝑝(𝑥𝑥) 𝑝𝑝(𝑦𝑦) . The probability 𝑝𝑝(𝑥𝑥) is referred to as prior probability as it does specify the initial belief about the parameters without observing any data. If some training data are observed, the Bayes’ rule can be used to update our belief by computing 𝑝𝑝(𝑥𝑥| 𝑦𝑦) . In this context, the conditional probability 𝑝𝑝(𝑥𝑥| 𝑦𝑦) is referred to as posterior probability. The term 𝑝𝑝(𝑦𝑦| 𝑥𝑥) is referred to as the likelihood and specifies the probability of observing 𝑦𝑦 given 𝑥𝑥 . Note 𝑝𝑝(𝑦𝑦| 𝑥𝑥) is not a probability distribution over 𝑥𝑥 as the integral over 𝑥𝑥 does not need to be equal to one. This can be observed in Figure 47, where ∑ 𝑝𝑝(𝑦𝑦 = 𝑠𝑠𝑠𝑠𝑙𝑙𝑖𝑖𝑑𝑑| 𝑥𝑥 𝑖𝑖 ) = 2𝑖𝑖=1 0.77 . As a consequence, a normalization constant is required to ensure that the posterior distribution is a valid probability distribution. The normalization is realized by the marginal likelihood 𝑝𝑝(𝑦𝑦) . 10.2 Probabilistic Linear Regression We want to extend the previous framework to use it within a linear regression model. We assume that 𝑚𝑚 = 10 training points are given (see Figure 48 ). A linear model should be trained according to 𝑓𝑓(𝑥𝑥) = 𝑤𝑤 1 + 𝑤𝑤 2 𝑡𝑡 = 𝒙𝒙 T 𝒘𝒘, 𝒘𝒘 = [𝑤𝑤 1 𝑤𝑤 2 ] T , 𝒙𝒙 = [1 𝑡𝑡] T . 
A trivial solution is the least mean square (LMS) solution whose result is shown as a solid line in Figure 48. The drawback of this approach is that, e.g., measurement noise is not considered. One solution to overcome this problem was presented in chapter 8 by introducing slack variables within the support vector algorithm. Using Bayes’ rule, an alternative formulation is possible by considering the measurement noise as a Gaussian distribution, which is often the case in real world applications. The model can be extended to 𝑌𝑌 𝑋𝑋 solid dashed dotted cube ball 1 5 ⁄ 3 5 ⁄ 1 5 ⁄ 1 7 ⁄ 2 7 ⁄ 4 7 ⁄ <?page no="104"?> 104 Fundamentals of Machine Learning 𝑦𝑦 = 𝑓𝑓(𝑥𝑥) + 𝜀𝜀 = 𝒙𝒙 T 𝒘𝒘 + 𝜀𝜀 (31) with 𝜀𝜀 ~𝑁𝑁(0, 𝜎𝜎 𝑛𝑛2 ) . The measurement variance is set to 𝜎𝜎 𝑛𝑛2 = 1 in this example. Figure 48 - Training data and least mean square solution According to Bayes’ rule (see Eq. (30) ), the learning problem can be formulated as follows 𝑝𝑝(𝒘𝒘|𝒚𝒚, 𝑿𝑿) = 𝑝𝑝(𝒚𝒚|𝑿𝑿, 𝒘𝒘)𝑝𝑝(𝒘𝒘) 𝑝𝑝(𝒚𝒚|𝑿𝑿) , 𝑝𝑝𝑠𝑠𝑠𝑠𝑡𝑡𝑒𝑒𝑟𝑟𝑖𝑖𝑠𝑠𝑟𝑟 = 𝑙𝑙𝑖𝑖𝑘𝑘𝑒𝑒𝑙𝑙𝑖𝑖ℎ𝑠𝑠𝑠𝑠𝑑𝑑 × 𝑝𝑝𝑟𝑟𝑖𝑖𝑠𝑠𝑟𝑟 𝑚𝑚𝑎𝑎𝑟𝑟𝑔𝑔𝑖𝑖𝑛𝑛𝑎𝑎𝑙𝑙 with 𝒚𝒚 = [𝑦𝑦 1 … 𝑦𝑦 𝑚𝑚 ] T being a vector with the training data and 𝑿𝑿 = [𝒙𝒙 1 … 𝒙𝒙 𝑚𝑚 ] . The aim is to compute the posterior distribution over the weights 𝒘𝒘 given the training data. In a first step, a prior distribution over the weights must be specified. This distribution represents our initial belief on the weights. According to Occam's razor, one frequently used prior distribution is 𝒘𝒘~𝑁𝑁�𝟎𝟎, 𝚺𝚺 p � which defines a multivariate Gaussian distribution with a mean vector of zero and a covariance matrix 𝚺𝚺 p defined as 𝚺𝚺 p = �𝜎𝜎 12 0 0 𝜎𝜎 22 �. The prior distribution is shown as a contour plot in Figure 49 for 𝜎𝜎 12 = 𝜎𝜎 22 = 1. <?page no="105"?> Support Vector Machines Made Easy 105 Figure 49 - Prior distribution of the weights with a mean of zero for w 1 and w 2 and a variance 𝜎𝜎 12 = 𝜎𝜎 22 = 1 . After observing some (independently sampled) training data, the likelihood can be computed using the specified model and noise assumption (Eq. (31) ) as 𝑝𝑝(𝒚𝒚|𝑿𝑿, 𝒘𝒘) = �𝑝𝑝(𝑦𝑦 𝑖𝑖 |𝒙𝒙 𝑖𝑖 , 𝒘𝒘) = � 1 √2𝜋𝜋𝜎𝜎 𝑛𝑛 𝑒𝑒𝑥𝑥𝑝𝑝 �− (𝑦𝑦 𝑖𝑖 − 𝒙𝒙 𝑖𝑖T 𝒘𝒘) 2 2𝜎𝜎 𝑛𝑛2 � 𝑚𝑚 𝑖𝑖=1 𝑚𝑚 𝑖𝑖=1 , 𝑝𝑝(𝒚𝒚|𝑿𝑿, 𝒘𝒘) = 𝑁𝑁(𝑿𝑿 T 𝒘𝒘, 𝜎𝜎 𝑛𝑛2 𝑰𝑰), with 𝑰𝑰 ∈ ℝ 𝑛𝑛×𝑛𝑛 being the identity matrix. The likelihood distribution using the training data of Figure 48 is shown as a contour plot in F igure 50. The maximum of the weights 𝒘𝒘 of the mean vector is equivalent to the LMS solution. It can be observed that the slope parameter 𝑤𝑤 2 is better determined than the offset parameter 𝑤𝑤 1 . Figure 50 - Likelihood of the weights after observing the training data using a linear model with Gaussian noise 𝜀𝜀 = 𝑁𝑁(0,1) . <?page no="106"?> 106 Fundamentals of Machine Learning The posterior distribution is a combination of the likelihood and the prior distribution and captures everything we know about the parameters 𝒘𝒘 . The marginal distribution 𝑝𝑝(𝒚𝒚|𝑿𝑿) is independent of 𝒘𝒘 . At this point, this distribution can be ignored as it is only a normalization constant. The resulting posterior is defined as 𝑝𝑝(𝒘𝒘|𝑿𝑿, 𝒚𝒚) ∝ 𝑒𝑒𝑥𝑥𝑝𝑝 �− (𝒚𝒚 − 𝑿𝑿 T 𝒘𝒘) T (𝒚𝒚 − 𝑿𝑿 T 𝒘𝒘) 2𝜎𝜎 𝑛𝑛2 � 𝑒𝑒𝑥𝑥𝑝𝑝 �− 12 𝒘𝒘 𝑇𝑇 𝜮𝜮 𝑝𝑝−1 𝒘𝒘�, 𝑝𝑝(𝒘𝒘|𝑿𝑿, 𝒚𝒚)~𝑁𝑁 � 1 𝜎𝜎 𝑛𝑛2 𝑨𝑨 −1 𝑿𝑿𝒚𝒚, 𝑨𝑨 −1 �, with 𝑨𝑨 = 𝜎𝜎 𝑛𝑛−2 𝑿𝑿𝑿𝑿 T + 𝚺𝚺 𝑝𝑝 . The posterior distribution is shown in Figure 52. It can be observed that the posterior distribution is a mixture of the likelihood and the prior. Figure 51 - Posterior distribution The maximum of the posterior distribution is known as the maximum a posteriori (MAP) estimate of 𝒘𝒘 . 
In this example, the prior distribution can be interpreted as a penalty term, as it shifts the MAP solution towards zero depending on the specified weight variance 𝜎𝜎 12 and 𝜎𝜎 22 . Note, in a non-Bayesian setting, the solution is equivalent to a ridge regression model, where the cost function is optimized with a quadratic penalty term on the weights. Figure 52 shows the results using the weights 𝒘𝒘 estimated by MAP. The estimate function has a smaller slope compared to the least mean squares solution. Note, the MAP solution represents only the most likely solution of the weights 𝒘𝒘 . As the weights are normal distributed, alternative solutions are also possible, which have a lower probability compared to the MAP solution. Three alternative solutions are shown in Figure 53. <?page no="107"?> Support Vector Machines Made Easy 107 Figure 52 - Prediction result of the maximum a posterior (MAP) solution of a linear probabilistic regression model. Figure 53 - Example of alternative solutions (AS) which have a lower probability compared to the MAP solution. By averaging over all possible parameters 𝒘𝒘 , weighted by their posterior probability, it is possible to compute a probability distribution for each function value 𝑓𝑓(𝐱𝐱 ∗ ) depending on the test case 𝐱𝐱 ∗ = [1 𝑡𝑡 ∗ ] T . 𝑝𝑝(𝑓𝑓(𝐱𝐱 ∗ )|𝐱𝐱 ∗ , 𝑿𝑿, 𝒚𝒚) = � 𝑝𝑝(𝑓𝑓(𝐱𝐱 ∗ )|𝐱𝐱 ∗ , 𝒘𝒘) 𝑝𝑝(𝒘𝒘|𝑿𝑿, 𝒚𝒚)𝑑𝑑𝒘𝒘, 𝑝𝑝(𝑓𝑓(𝐱𝐱 ∗ )|𝐱𝐱 ∗ , 𝑿𝑿, 𝒚𝒚) = 𝑁𝑁 � 1 𝜎𝜎 𝑛𝑛2 𝐱𝐱 ∗T 𝑨𝑨 −1 𝑿𝑿𝒚𝒚, 𝐱𝐱 ∗T 𝑨𝑨 −1 𝐱𝐱 ∗ � (32) Figure 54 shows the results of the mean function values 𝑓𝑓(𝐱𝐱 ∗ ) for the test case 𝐱𝐱 ∗ ∈ (0,10) . The solution is equivalent to the MAP solution of Figure 52 . However, now it is possible to also plot the variance for each function value 𝑓𝑓(𝐱𝐱 ∗ ) . In Figure 54, the dashed lines represent the mean prediction plus / minus two times the standard <?page no="108"?> 108 Fundamentals of Machine Learning deviation for a noise-free model (without considering 𝜎𝜎 𝑛𝑛2 ). The dash-dotted lines represent the mean prediction plus/ minus two times the standard deviation for a model including measurement noise. Figure 54 - Predicted mean solution and plus/ minus two standard deviations of the noise-free (dashed) and noise (dashed-dotted) model. 10.3 Gaussian Process Models The above presented linear probabilistic model represents a simple linear Gaussian process model (GP). Similar to chapter 6, it is possible to extend the GP model from a parametric model (a model which is based on an explicitly specified function, e.g., 𝑓𝑓(𝒙𝒙) = 𝒙𝒙 T 𝒘𝒘 ) to a non-parametric model (a model where the function values are only specified by the training data) through a mapping function 𝜑𝜑: 𝐷𝐷 → 𝑈𝑈 of the input data from the input space D into a higher dimensional feature space U. 𝑓𝑓(𝒙𝒙) = 𝜑𝜑(𝒙𝒙) T 𝒘𝒘 Equivalent to chapter 6, a kernel function 𝑘𝑘(𝒙𝒙, 𝒙𝒙’) can be specified according to 𝑘𝑘(𝐱𝐱, 𝐱𝐱′) = 𝜑𝜑(𝐱𝐱) T 𝜑𝜑(𝐱𝐱′) with 𝑘𝑘(𝐱𝐱, 𝐱𝐱′) ∈ ℝ . In the context of Bayesian modelling, these functions are referred to as covariance functions. It can be shown that the input 𝐱𝐱 and test data 𝐱𝐱 ∗ of 𝑝𝑝(𝑓𝑓(𝐱𝐱 ∗ )|𝐱𝐱 ∗ , 𝑿𝑿, 𝒚𝒚) (Eq. (32) ) can be completely expressed as scalar products. Consequently, if a mapping function 𝜑𝜑 is used, 𝜑𝜑(𝐱𝐱) and 𝜑𝜑(𝐱𝐱 ∗ ) do not have to be computed explicitly. Instead, their scalar products can be replaced by a kernel function. More details can be found in [10]. Here, an alternative and more intuitive introduction to non-parametric GP models shall be presented, which leads to the exact same result by considering inference directly in function space. 
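Before moving on, the weight-space computations of this section can be condensed into a short MATLAB sketch. It is only an illustration: the training data are synthetic, the prior covariance is the identity used above, and the posterior precision is written in the standard form A = X*X'/sigma_n^2 + inv(Sigma_p) (with Sigma_p equal to the identity, the inverse makes no numerical difference here).

% Sketch of Bayesian linear regression in weight space (cf. Eqs. (31)-(32)).
% Training data are synthetic placeholders; the prior covariance is the
% identity, and A is written as X*X'/sigma_n^2 + inv(Sigma_p).
rng(0);
m       = 10;
t       = linspace(0, 10, m);
X       = [ones(1, m); t];                 % columns are the inputs x_i = [1; t_i]
w_true  = [1; 0.5];                        % hypothetical "true" weights
sigma_n = 1;                               % measurement noise std. deviation
y       = X' * w_true + sigma_n * randn(m, 1);

Sigma_p = eye(2);                          % prior covariance of the weights
A       = X * X' / sigma_n^2 + inv(Sigma_p);
w_map   = A \ (X * y) / sigma_n^2;         % posterior mean = MAP estimate

x_star  = [1; 5];                          % test input [1; t*]
f_mean  = x_star' * w_map;                 % predictive mean, Eq. (32)
f_var   = x_star' * (A \ x_star);          % predictive variance (noise-free)
fprintf('MAP weights: [%.2f %.2f], f(x*) = %.2f +/- %.2f\n', ...
        w_map(1), w_map(2), f_mean, 2 * sqrt(f_var));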
<?page no="109"?> Support Vector Machines Made Easy 109 Definition of GP “A Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.” [10] A GP model assumes that the training and test observations are drawn from a joint Gaussian distribution. As a consequence, each GP model is completely defined by its mean vector 𝝁𝝁 and covariance matrix 𝚺𝚺 . Predictions can be made by exploiting the marginal and conditional properties of Gaussian distributions. Let us assume a 2D Gaussian distribution is given by 𝑝𝑝 ��𝑦𝑦 1 𝑦𝑦 2 �� = 𝑁𝑁 ��𝜇𝜇 1 𝜇𝜇 2 � , �𝜎𝜎 11 𝜎𝜎 12 𝜎𝜎 21 𝜎𝜎 22 ��. The marginal probability of 𝑝𝑝(𝑦𝑦 1 ) and 𝑝𝑝(𝑦𝑦 2 ) is defined as: 𝑝𝑝(𝑦𝑦 1 ) = 𝑁𝑁(𝜇𝜇 1 , 𝜎𝜎 11 ), 𝑝𝑝(𝑦𝑦 2 ) = 𝑁𝑁(𝜇𝜇 2 , 𝜎𝜎 22 ). In the case that, e.g., variable 𝑦𝑦 2 has already been observed, the conditional probability 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) can be computed by 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) = 𝑁𝑁(𝜇𝜇 1 − 𝜎𝜎 12 𝜎𝜎 22−1 (𝑦𝑦 2 − 𝜇𝜇 2 ), 𝜎𝜎 11 − 𝜎𝜎 12 𝜎𝜎 22−1 𝜎𝜎 21 ). (33 Figure 55 shows as an example three two-dimensional Gaussian distributions. The distributions are specified by a) 𝑁𝑁 ��00� , �1 0 0 1��, b) 𝑁𝑁 ��00� , � 1 0.5 0.5 1 ��, c) 𝑁𝑁 ��−1 0 � , � 1 −0.8 −0.8 1 �� Figure 55 - Examples of three two-dimensional joint Gaussian distributions with the marginal distribution 𝑝𝑝(𝑥𝑥 1 ) and conditional distribution 𝑝𝑝(𝑥𝑥 1 |𝑥𝑥 2 ) . It can be observed that the off-diagonal elements 𝜎𝜎 12 and 𝜎𝜎 21 describe the correlation between 𝑦𝑦 1 and 𝑦𝑦 2 . Let us assume that we want to predict 𝑦𝑦 1 after observing 𝑦𝑦 2 . The solid line at the bottom of each figure represents the marginal distribution 𝑝𝑝(𝑦𝑦 1 ) of <?page no="110"?> 110 Fundamentals of Machine Learning the three joint distributions. These probabilities represent our prior belief of 𝑦𝑦 1 without observing 𝑦𝑦 2 a) 𝑝𝑝(𝑦𝑦 1 ) = 𝑁𝑁(0,1), b) 𝑝𝑝(𝑦𝑦 1 ) = 𝑁𝑁(0,0), c) 𝑝𝑝(𝑦𝑦 1 ) = 𝑁𝑁(−1,1) . Assume, we observe, e.g., 𝑦𝑦 2 = −1.5 , we can update our belief by computing 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) a) 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) = 𝑁𝑁(0,1), b) 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) = 𝑁𝑁(−0.75,0.75), c) 𝑝𝑝(𝑦𝑦 1 |𝑦𝑦 2 ) = 𝑁𝑁(0.2,0.36) . In case of GP models, this idea is further extended by assuming that the mean vector 𝝁𝝁 and the covariance matrix 𝚺𝚺 can be specified by a mean and covariance function 𝑚𝑚(𝒙𝒙) and 𝑘𝑘(𝒙𝒙, 𝒙𝒙’) . If we assume a noise-free model with 𝑦𝑦 = 𝑓𝑓(𝒙𝒙) , a Gaussian process can be defined as 𝑓𝑓(𝒙𝒙)~𝐺𝐺𝑃𝑃�𝑚𝑚(𝒙𝒙), 𝑘𝑘(𝒙𝒙, 𝒙𝒙 ′ )�. Note, equivalently to the previous section, a Gaussian process defines a probability distribution over functions, meaning that the resulting distribution takes all functions weighted by their probability into account (compare MAP solution to 𝑝𝑝(𝑓𝑓(𝐱𝐱 ∗ )|𝐱𝐱 ∗ , 𝑿𝑿, 𝒚𝒚) ). Often, the mean function is set to zero, as every mean function on can be expressed within the covariance functions. The covariance function describes the coupling between 𝒙𝒙 and 𝒙𝒙′ . A frequently used covariance function is the Gaussian kernel function as introduced in chapter 6, 𝑘𝑘(𝑥𝑥, 𝑥𝑥′) = 𝜃𝜃 𝑆𝑆2 𝑒𝑒 −�𝑥𝑥−𝑥𝑥 ′ � 2 2𝜃𝜃 𝐿𝐿2 , where 𝜃𝜃 𝑆𝑆 , 𝜃𝜃 𝐿𝐿 ∈ ℝ are hyperparameters which model the y-scaling and x-scaling, respectively. Considering a general learning problem with 𝒙𝒙 = [𝑥𝑥 1 , … , 𝑥𝑥 𝑛𝑛 ] T being the input training data, 𝒚𝒚 = [𝑦𝑦 1 , … , 𝑦𝑦 𝑛𝑛 ] T being the training observations, and 𝒙𝒙 ∗ a vector of input test data, the following GP model can be constructed: � 𝑓𝑓(𝒙𝒙) = 𝒚𝒚 𝑓𝑓(𝒙𝒙 ∗ ) = 𝒚𝒚 ∗ � ~𝑁𝑁 �𝟎𝟎, � 𝑲𝑲(𝒙𝒙, 𝒙𝒙) 𝑲𝑲(𝒙𝒙, 𝒙𝒙 ∗ ) 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙) 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙 ∗ )��. 
Here, 𝑲𝑲(∙,∙) refers to a covariance matrix, which is in general defined as: 𝑲𝑲(𝒗𝒗, 𝒘𝒘) = �𝑘𝑘(𝒗𝒗[1], 𝒘𝒘[1]) ⋯ 𝑘𝑘(𝒗𝒗[1], 𝒘𝒘[𝑟𝑟]) ⋮ ⋱ ⋮ 𝑘𝑘(𝒗𝒗[𝑞𝑞], 𝒘𝒘[1]) ⋯ 𝑘𝑘(𝒗𝒗[𝑞𝑞], 𝒘𝒘[𝑟𝑟])� , with 𝒗𝒗 ∈ ℝ 𝑞𝑞×1 and 𝒘𝒘 ∈ ℝ 𝑞𝑞×1 . <?page no="111"?> Support Vector Machines Made Easy 111 Using Eq. (30) , a Gaussian distribution can be computed for 𝑝𝑝(𝒚𝒚 ∗ |𝒚𝒚, 𝒙𝒙, 𝒙𝒙 ∗ ) according to: 𝑝𝑝(𝒚𝒚 ∗ |𝒚𝒚, 𝒙𝒙, 𝒙𝒙 ∗ ) = 𝑁𝑁(𝒚𝒚� ∗ , var[𝒚𝒚 ∗ ]). 𝒚𝒚� ∗ = 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙)𝑲𝑲(𝒙𝒙, 𝒙𝒙) −1 𝒚𝒚, var[𝒚𝒚 ∗ ] = 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙 ∗ ) − 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙)𝑲𝑲(𝒙𝒙, 𝒙𝒙) −1 𝑲𝑲(𝒙𝒙, 𝒙𝒙 ∗ ) Figure 56 shows the mean prediction and plus / minus two times the standard deviation of a GP model with a Gaussian covariance function applied to the data of the previous example. Figure 56 - Prediction result of a GP model using a Gaussian covariance function ( 𝜃𝜃 𝑆𝑆2 = 𝜃𝜃 𝐿𝐿2 = 2 ). 10.4 GP model with measurement noise So far, the model does not consider measurement noise. If a regression model such as 𝑦𝑦 = 𝑓𝑓(𝑥𝑥) + 𝜀𝜀, with 𝜀𝜀~𝑁𝑁(0, 𝜎𝜎 𝑛𝑛2 ) is assumed, the GP model can be extended to � 𝑓𝑓(𝒙𝒙) 𝑓𝑓(𝒙𝒙 ∗ )� ~𝑁𝑁 �𝟎𝟎, �𝑲𝑲(𝒙𝒙, 𝒙𝒙) + 𝜎𝜎 𝑛𝑛2 𝑰𝑰 𝑲𝑲(𝒙𝒙, 𝒙𝒙 ∗ ) 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙) 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙 ∗ )��. The mean and variance of 𝑝𝑝(𝒚𝒚 ∗ |𝒚𝒚, 𝒙𝒙, 𝒙𝒙 ∗ ) is specified by 𝒚𝒚� ∗ = 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙)[𝑲𝑲(𝒙𝒙, 𝒙𝒙) + 𝜎𝜎 𝑛𝑛2 𝑰𝑰] −1 𝒚𝒚, (34) 𝑣𝑣𝑎𝑎𝑟𝑟[𝒚𝒚 ∗ ] = 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙 ∗ ) − 𝑲𝑲(𝒙𝒙 ∗ , 𝒙𝒙)[𝑲𝑲(𝒙𝒙, 𝒙𝒙) + 𝜎𝜎 𝑛𝑛2 𝑰𝑰] −1 𝑲𝑲(𝒙𝒙, 𝒙𝒙 ∗ ). Figure 57 shows the mean prediction and plus / minus two times the standard deviation of a GP model considering measurement noise. <?page no="112"?> 112 Fundamentals of Machine Learning Figure 57 - Prediction result of a GP model using a Gaussian covariance function considering measurement noise ( 𝜎𝜎 𝑛𝑛2 = 1 ). Optimization of hyperparameters The prediction results depend on the selected covariance function 𝑘𝑘(∙,∙) and its hyperparameters 𝜽𝜽 . Beside standard optimization techniques (e.g., separation of the data into a training, cross validation, and test set), the Bayesian framework offers an alternative possibility to optimize the hyperparameters, by optimization of the marginal likelihood. The marginal distribution 𝑝𝑝(𝒚𝒚|𝑿𝑿) is specified by Bayes’ rule 𝑝𝑝(𝒇𝒇(𝒙𝒙 ∗ )|𝒚𝒚) = 𝑝𝑝(𝒚𝒚|𝒇𝒇(𝒙𝒙 ∗ ))𝑝𝑝(𝒇𝒇(𝒙𝒙 ∗ )) 𝑝𝑝(𝒚𝒚) , 𝑝𝑝𝑠𝑠𝑠𝑠𝑡𝑡𝑒𝑒𝑟𝑟𝑖𝑖𝑠𝑠𝑟𝑟 = 𝑙𝑙𝑖𝑖𝑘𝑘𝑒𝑒𝑙𝑙𝑖𝑖ℎ𝑠𝑠𝑠𝑠𝑑𝑑 × 𝑝𝑝𝑟𝑟𝑖𝑖𝑠𝑠𝑟𝑟 𝑚𝑚𝑎𝑎𝑟𝑟𝑔𝑔𝑖𝑖𝑛𝑛𝑎𝑎𝑙𝑙 , and represents the probability of observing 𝒚𝒚 given the input data 𝑿𝑿 . It can be computed by 𝑝𝑝(𝒚𝒚) = 𝑝𝑝(𝒚𝒚|𝑿𝑿, 𝜽𝜽) = 𝑁𝑁(𝒚𝒚|𝟎𝟎, 𝑲𝑲 𝒕𝒕 ) = 1 (2𝜋𝜋) 𝑛𝑛/ 2 |𝐾𝐾| exp �− 𝒚𝒚 T 𝑲𝑲 𝒕𝒕−𝟏𝟏 𝒚𝒚 2 � . Here, 𝑲𝑲 𝒕𝒕 represents either the training covariance matrix 𝑲𝑲 𝒕𝒕 = 𝑲𝑲(𝒙𝒙, 𝒙𝒙) in case of a noise-free model or 𝑲𝑲 𝒕𝒕 = 𝑲𝑲(𝒙𝒙, 𝒙𝒙) + 𝜎𝜎 𝑛𝑛2 𝑰𝑰 for a model with measurement noise. In general, the negative logarithmic marginal likelihood (NLML) is minimized instead of maximizing the marginal likelihood. 𝑁𝑁𝐿𝐿𝑀𝑀𝐿𝐿 = − log�𝑁𝑁(𝒚𝒚|𝟎𝟎, 𝑲𝑲 𝒕𝒕 )� = 12 log|𝑲𝑲(𝒙𝒙, 𝒙𝒙)| + 𝒚𝒚 T 𝑲𝑲 𝒕𝒕−𝟏𝟏 𝒚𝒚 2 + 𝑛𝑛2 log 2𝜋𝜋 Compared to a non-Bayesian setting, the NLML can be interpreted as a cost function which has to be minimized. The hyperparameters 𝜽𝜽 are hidden within the covariance matrix 𝑲𝑲 𝒕𝒕 and can be optimized using, e.g. gradient descent methods. The advantage of this “cost function” is that a “role” can be assigned to each term of the marginal likelihood: the model-complexity term, the data-fit term, and a <?page no="113"?> Support Vector Machines Made Easy 113 normalization constant. As a consequence through minimizing the NLML, a biasvariance tradeoff is performed. 
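As an illustration of these formulas, the following MATLAB sketch computes the predictive mean and variance of Eq. (34) and the NLML for a GP with a squared exponential covariance function. The data and the hyperparameter values are arbitrary placeholders, not the ones used for the figures.

% Sketch of GP regression with measurement noise (Eq. (34)) and of the NLML.
% Data and hyperparameters are arbitrary placeholders.
rng(0);
kSE = @(a, b, thS, thL) thS^2 * exp(-(a(:) - b(:)').^2 / (2 * thL^2));

x        = linspace(0, 10, 10)';           % training inputs
y        = sin(x) + 0.1 * randn(10, 1);    % noisy training observations
x_star   = linspace(0, 12, 50)';           % test inputs
thS = 1;  thL = 1;                         % hyperparameters theta_S, theta_L
sigma_n2 = 0.01;                           % noise variance sigma_n^2
n        = numel(x);

Kt   = kSE(x, x, thS, thL) + sigma_n2 * eye(n);   % training covariance K_t
L    = chol(Kt, 'lower');                         % Cholesky factor for numerical stability
alfa = L' \ (L \ y);                              % K_t^{-1} * y

Ks     = kSE(x_star, x, thS, thL);                % K(x*, x)
y_mean = Ks * alfa;                               % predicted mean, Eq. (34)
y_var  = diag(kSE(x_star, x_star, thS, thL) - Ks * (Kt \ Ks'));   % predicted variance

nlml = sum(log(diag(L))) + 0.5 * (y' * alfa) + n/2 * log(2*pi);
fprintf('NLML = %.3f\n', nlml);

Minimizing nlml over thS, thL and sigma_n2 (for example with fminsearch) corresponds to the hyperparameter optimization described above.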
This is illustrated in Figure 58, which shows the NLML (solid) and the values of the data-fit (dashed) and model complexity (dotted) term of a GP model for different hyperparameter values 𝜃𝜃 𝐿𝐿2 . Figure 58 - Negative logarithmic marginal likelihood depending on 𝜃𝜃 𝐿𝐿2 It can be observed that the minimum of the NLML does lead to the optimal tradeoff between underand overfitting of the data. Covariance functions Covariance functions have to fulfil the Mercer theorem. Consequently, the resulting covariance matrix 𝑲𝑲(𝒙𝒙, 𝒙𝒙 ′ ) has to be symmetric and semi-positive definite. One frequently used example is the Gaussian, or squared exponential, kernel function: 𝑘𝑘 𝑆𝑆𝑆𝑆 (𝑥𝑥, 𝑥𝑥′) = 𝜃𝜃 𝑆𝑆2 𝑒𝑒 −�𝑥𝑥−𝑥𝑥 ′ � 2 2𝜃𝜃 𝐿𝐿2 = 𝜃𝜃 𝑆𝑆2 𝑒𝑒 − 𝑟𝑟 2 2𝜃𝜃 𝐿𝐿2 , Alternative covariance functions are: Polynomial functions: 𝑘𝑘 𝑃𝑃𝑃𝑃 (𝑥𝑥, 𝑥𝑥′) = (𝑥𝑥 ∙ 𝑥𝑥 ′ + 𝜃𝜃 02 ) 𝑝𝑝 , Periodic functions: 𝑘𝑘 𝑃𝑃 (𝑥𝑥, 𝑥𝑥′) = 𝜃𝜃 𝑆𝑆2 𝑒𝑒 −2𝑠𝑠𝑖𝑖𝑛𝑛 2 [𝑟𝑟𝑟𝑟/ 𝜃𝜃 𝑃𝑃 ] , with 𝑟𝑟 = ‖𝑥𝑥 − 𝑥𝑥 ′ ‖ being the Euclidean distance, 𝑝𝑝 being the order of the polynomial, and 𝜃𝜃 𝐿𝐿 , 𝜃𝜃 𝑆𝑆 , 𝜃𝜃 0 and 𝜃𝜃 𝑃𝑃 being hyperparameters which model the 𝑥𝑥 -scaling, 𝑦𝑦 -scaling, a linear offset, and the period, respectively. Beside these “basic” covariance functions, more complex covariance functions can be designed by summation, multiplication and convolution of basic covariance functions. A frequently used example is the quasi-periodic covariance function. 𝑘𝑘 𝑄𝑄𝑃𝑃 (𝑥𝑥, 𝑥𝑥′) = 𝑘𝑘 𝑆𝑆𝑆𝑆 (𝑥𝑥, 𝑥𝑥′) ∙ 𝑘𝑘 𝑃𝑃 (𝑥𝑥, 𝑥𝑥′) = 𝜃𝜃 𝑆𝑆2 𝑒𝑒 − 𝑟𝑟 2 2𝜃𝜃 𝐿𝐿2 −2𝑠𝑠𝑖𝑖𝑛𝑛 2 [𝑟𝑟𝑟𝑟/ 𝜃𝜃 𝑃𝑃 ] <?page no="114"?> 114 Fundamentals of Machine Learning Further examples can be found in [9], chapter 4. The effect of the different covariance functions is illustrated in Figure 59 . Three GP models with either a squared exponential, periodic, or quasi-periodic covariance function were trained. In case of a squared exponential covariance function, the expressive power of the model decreases, as the Euclidean distance between the test and training points increases. For x > 5, the models predicts a mean of zero and a standard deviation of one, which corresponds to the prior distribution (equivalent to the case that the model has not observed any training data). The use of a periodic covariance function results in a constant periodic pattern, which will be repeated independently of the Euclidean distance between the training and test points. By combining both covariance functions (resulting in a quasi-periodic function), the characteristic behaviour of both functions can be combined. Figure 59 - Predicted mean and ±2 standard deviation of a GP model with either a squared exponential, periodic, or quasi-periodic covariance function. 10.5 Multi-Task Gaussian Process (MTGP) Models The following section presents one possible extension of GP models, called multitask Gaussian Process models (MTGP). The basic framework of this extension will be presented and illustrated on a simple example. Further details and examples can be found in [8] chapter 5 and in [11]. Further, the examples presented here are created with an open-source MATLAB toolbox 6 . MTGP models are motivated by the problem of simultaneously modelling multiple data inputs of a sensor network, assuming that 𝑠𝑠 sensors are observed. Data of the individual sensors might be sampled at different sampling rates or being affected by temporal breakdowns of the sensors. In this context, we refer to data of one particular sensor as ‘task’. A naïve approach to model the tasks might be to learn multiple individual GP models, which is illustrated in Figure 60, left. 
We refer to these individual GP models as single-task GP models (STGP), as each model uses 6 Available for download from https: / / www.rob.uni-luebeck.de/ index.php? id=410 <?page no="115"?> Support Vector Machines Made Easy 115 data of one particular sensor. Figure 60 - Schematic diagram of multiple single-task Gaussian Processes (STGP) models (left) and one multi-task Gaussian process (MTGP) model (right) to learn o tasks. The obvious disadvantage of this approach is that potential correlation between the tasks is ignored. If the tasks share a common set of input features such as the time 𝑡𝑡 , all tasks can be simulated within one MTGP model (see Figure 60, right). We assume that 𝑿𝑿 = {𝑥𝑥 𝑖𝑖𝑗𝑗 |𝑗𝑗 = 1, … , 𝑠𝑠; 𝑖𝑖 = 1, … , 𝑛𝑛 𝑗𝑗 } and 𝒀𝒀 = �𝑦𝑦 𝑖𝑖𝑗𝑗 �𝑗𝑗 = 1, … , 𝑠𝑠; 𝑖𝑖 = 1, … , 𝑛𝑛 𝑗𝑗 � are the training indices and observations for the 𝑠𝑠 tasks, where task 𝑗𝑗 has 𝑛𝑛 𝑗𝑗 items of training data. To specify the affiliation of index 𝑥𝑥 𝑖𝑖𝑗𝑗 and observation 𝑦𝑦 𝑖𝑖𝑗𝑗 to task 𝑗𝑗 , a label 𝑙𝑙 𝑗𝑗 has to be added as an additional input to the model (see Figure 60 right) with 𝑙𝑙 𝑗𝑗 = 𝑗𝑗 . The extension from an STGP to an MTGP model takes place within the specification of the covariance functions. The residual assumptions and equations such as the predicted mean and variance remain unchanged, Eq. (34). As discussed in the previous section, complex covariance functions can be designed by summation, multiplication, and convolution of individual covariance functions. It might be assumed that two covariance function are defined as 𝑘𝑘 𝑀𝑀𝑇𝑇𝑀𝑀𝑃𝑃 (𝑥𝑥, 𝑥𝑥 ′ , 𝑙𝑙, 𝑙𝑙′) = 𝑘𝑘 𝑐𝑐 (𝑙𝑙, 𝑙𝑙′) ∙ 𝑘𝑘 𝑡𝑡 (𝑥𝑥, 𝑥𝑥′) where 𝑘𝑘 𝑐𝑐 is a covariance function representing the correlation between tasks and 𝑘𝑘 𝑡𝑡 is a function representing the correlation within tasks. We refer to the latter as temporal covariance function. Examples of 𝑘𝑘 𝑡𝑡 are the covariance functions discussed in the previous section. Note that 𝑘𝑘 𝑡𝑡 only depends on the input indices 𝑥𝑥 and 𝑘𝑘 𝑐𝑐 only on the labels 𝑙𝑙 . To simplify the appearance of the next equations, we assume that 𝑛𝑛 𝑗𝑗 = 𝑛𝑛 for 𝑗𝑗 = 1, … , 𝑠𝑠 . However, the MTGP framework is not restricted to this. The covariance matrix of the training features can be written as 𝐊𝐊 𝑀𝑀𝑇𝑇𝑀𝑀𝑃𝑃 (𝑿𝑿, 𝒍𝒍) = 𝐊𝐊 𝑐𝑐 (𝒍𝒍, 𝜽𝜽 𝑐𝑐 )⨂ 𝑲𝑲 𝑡𝑡 (𝑿𝑿, 𝜽𝜽 𝑡𝑡 ), where ⨂ is the Kronecker product, 𝒍𝒍 = {𝑗𝑗|𝑗𝑗 = 1, … , 𝑠𝑠} , and 𝜽𝜽 𝒄𝒄 and 𝜽𝜽 𝑡𝑡 are vectors containing hyperparameters for 𝑲𝑲 𝒄𝒄 and 𝑲𝑲 𝑡𝑡 , respectively. This leads to a matrix of size 𝑠𝑠𝑛𝑛 × 𝑠𝑠𝑛𝑛 for 𝑲𝑲 𝑀𝑀𝑇𝑇𝑀𝑀𝑃𝑃 , as 𝑲𝑲 𝑐𝑐 has a size of 𝑠𝑠 × 𝑠𝑠 and 𝑲𝑲 𝑡𝑡 of 𝑛𝑛 × 𝑛𝑛 . 𝒙𝒙 𝑖𝑖∗ 𝑙𝑙 𝑖𝑖∗ 𝑦𝑦 𝑖𝑖𝑙𝑙 ∗ 𝑦𝑦 𝑖𝑖𝑃𝑃 ∗ ⋮ 𝑦𝑦 𝑖𝑖𝑙𝑙 ∗ 𝑦𝑦 𝑖𝑖𝑃𝑃 ∗ 𝑥𝑥 𝑖𝑖1 ∗ 𝑥𝑥 𝑖𝑖𝑃𝑃 ∗ 𝐺𝐺𝑃𝑃 𝑦𝑦 1 , 𝑥𝑥 1 , 𝜽𝜽 1 𝐺𝐺𝑃𝑃 𝑦𝑦 𝑃𝑃 , 𝑥𝑥 𝑃𝑃 , 𝜽𝜽 𝑃𝑃 𝑀𝑀𝐾𝐾𝐺𝐺𝑃𝑃 𝒀𝒀, 𝑿𝑿, 𝒍𝒍, 𝜽𝜽 𝑐𝑐 , 𝜽𝜽 𝑖𝑖 <?page no="116"?> 116 Fundamentals of Machine Learning The remaining problem is the parametrization of the matrix 𝑲𝑲 𝑐𝑐 . In order to make 𝑲𝑲 𝑀𝑀𝑇𝑇𝑀𝑀𝑃𝑃 a valid covariance matrix, it has to be guaranteed that the covariance matrix 𝑲𝑲 𝑐𝑐 is positive semi-definite (Mercers’s theorem). [12] presented a so-called “freeform” parametrisation, as it allows arbitrary correlations between the tasks. It is based on the Cholesky decomposition. The hyperparameters 𝜽𝜽 𝑐𝑐 specify the elements of the lower triangular matrix 𝑳𝑳 as 𝑲𝑲 𝑐𝑐 = 𝑳𝑳𝑳𝑳 T , 𝑳𝑳 = ⎣ ⎢⎢⎡ 𝜃𝜃 𝑐𝑐,1 0 𝜃𝜃 𝑐𝑐,2 𝜃𝜃 𝑐𝑐,2 … 0 0 ⋮ 𝜃𝜃 𝑐𝑐,𝑘𝑘−𝑃𝑃+1 𝜃𝜃 𝑐𝑐,𝑘𝑘−𝑃𝑃+2 ⋱ ⋮ … 𝜃𝜃 𝑐𝑐,𝑘𝑘 ⎦⎥⎥⎤ , where the number of correlation hyperparameters 𝜽𝜽 𝑐𝑐 is 𝑘𝑘 = 𝑠𝑠(𝑠𝑠 + 1)/ 2. By multiplication of 𝑳𝑳𝑳𝑳 T , the matrix 𝑲𝑲 𝒄𝒄 is guaranteed to be positive semi-definite. 
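A minimal MATLAB sketch of this construction is given below. It only shows how the correlation matrix K_c is assembled from the hyperparameters theta_c via the lower triangular matrix L and combined with a temporal covariance K_t through the Kronecker product; the number of tasks, the training indices, and the hyperparameter values are placeholders.

% Sketch of the MTGP covariance: K_MTGP = kron(K_c, K_t), with the free-form
% parametrisation K_c = L*L'. Task count, inputs, and hyperparameters are
% placeholders.
s       = 3;                               % number of tasks
n       = 5;                               % training points per task
x       = linspace(0, 10, n)';             % shared training indices (e.g. time)
theta_c = [1 0.8 1 -0.5 0.3 1];            % s(s+1)/2 = 6 correlation hyperparameters
thS = 1;  thL = 2;                         % temporal hyperparameters

L = zeros(s);                              % fill L row by row with theta_c
k = 1;
for i = 1:s
    L(i, 1:i) = theta_c(k:k+i-1);
    k = k + i;
end
Kc = L * L';                               % task correlation matrix, PSD by construction

Kt = thS^2 * exp(-(x - x').^2 / (2 * thL^2));   % temporal covariance (squared exponential)

K_MTGP = kron(Kc, Kt);                     % covariance of all tasks, size s*n x s*n
fprintf('K_MTGP is %d x %d, smallest eigenvalue %.2e\n', ...
        size(K_MTGP, 1), size(K_MTGP, 2), min(eig(K_MTGP)));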
As with STGPs, the hyperparameters for a MTGP may be optimized by minimizing the NLML, and predictions for test indices {𝑥𝑥 ∗ , 𝑙𝑙 ∗ } can be made by computing the conditional probability 𝑝𝑝(𝑦𝑦 ∗ |𝑥𝑥 ∗ , 𝒍𝒍, 𝑿𝑿, 𝒀𝒀) . This method has several useful properties: ▶ we may have task-specific training indices 𝑛𝑛 𝑗𝑗 (i.e., training data may be observed at task-specific times); ▶ automatic learning of the correlation within tasks occurs by fitting the covariance function; and ▶ the framework assumes that the tasks have similar temporal characteristics and hyperparameters 𝜽𝜽 𝑡𝑡 .  Example These advantages will be illustrated on one example. The example illustrates the scenario of modelling multiple tasks ( 𝑠𝑠 = 4 ) with MTGPs and how learning the correlation between tasks can improve the prediction outcome. The dataset consists of three optical markers (OM1-3) and one strain belt, attached to the chest of a subject. All sensors measure the respiratory motion of the subject, as can be observed in Figure 62.a. We assumed that the data of the OMs and strain sensor were acquired at a different sampling frequencies and that different training intervals 𝑑𝑑 𝑗𝑗 were known. Figure 61 lists acquisition parameters for each task. It can be observed that the sampling frequency of the OMs was five times higher than that of the strain sensor. OM1 and OM3 had no overlapping training interval. Furthermore, Pearson’s correlation coefficients with respect to OM1 are listed. They were computed on the complete motion fragments and indicate a high positive or negative correlation. <?page no="117"?> Support Vector Machines Made Easy 117 The aim of this experiment was to predict 𝑦𝑦 𝑂𝑂𝑀𝑀1∗ for task OM1 with the test range being 𝑥𝑥 𝑂𝑂𝑀𝑀1∗ ∈ (20𝑠𝑠, 70𝑠𝑠]. Four evaluation scenarios were considered. The first scenario (S1) assumed that only data of OM1 was known. This case is equivalent to an STGP. The training data of the OM2, the OM3, and the strain sensor were added successively into the MTGP model for scenarios S2 to S4. All tasks were considered in S4. A quasi-periodic covariance function was selected as temporal covariance function. The correlation hyperparameters were initialized assuming independent tasks (the correlation matrix 𝑲𝑲 𝑐𝑐 is equivalent to the identity matrix). All hyperparameters were optimized by minimizing the NLML. Figure 61 - Time interval 𝑑𝑑 𝑗𝑗 enclosing the training data for the 𝑗𝑗 th task, sampling frequency 𝑓𝑓 𝑠𝑠 , and Pearson’s correlation coefficient 𝑟𝑟 with respect to OM1. Figure 62 - (a) Motion fragments of three optical markers (OM1 - 3) and one respiration belt (RB); (b) - (e) Predicted position 𝑦𝑦 𝑂𝑂𝑀𝑀1∗ and confidence interval for OM1 of scenarios S1 to S4, respectively. The prediction results for 𝑦𝑦 𝑂𝑂𝑀𝑀1∗ and the 95 % confidence intervals for S1 to S4 are shown in Figure 62.b-e, respectively. The 95 % confidence intervals is defined as two times the square root of the predicted variance 𝜎𝜎 2∗ . The training data are highlighted by vertical lines. In case of S1, the difference between 𝑦𝑦 𝑂𝑂𝑀𝑀1∗ and 𝑦𝑦 𝑂𝑂𝑀𝑀1 is small OM1 OM2 OM3 Strain 𝑑𝑑 𝑗𝑗 [s] (0,20) (10,30) (25,40) (0,60) 𝑓𝑓 𝑠𝑠 [Hz] 2.6 2.6 2.6 0.52 𝑟𝑟 1 -0.96 -0.9 0.89 <?page no="118"?> 118 Fundamentals of Machine Learning within the training region ( 𝑡𝑡 < 20 𝑠𝑠 ) as these observations were part of the training set. For 𝑡𝑡 > 20 𝑠𝑠 , 𝑦𝑦 𝑂𝑂𝑀𝑀1∗ moves towards the mean function, which is the mean of the training observations. Consequently, the variance and the prediction error increase. 
As the distance 𝑟𝑟 between the test label and the training labels increases, the correlation 𝑘𝑘 𝑡𝑡 decreases and the model predicts the prior distribution. If we consider additional tasks (S2 to S4), it can be observed that the prediction accuracy increases. This observation is confirmed by the root mean square error (RMSE) in Figure 63. The RMSE decreases from 2.244 for S1 to 1.474 for S4. Figure 62.d and the RMSE value for S3 indicate that data of OM3 can be used to improve the prediction accuracy compared to S2. This shows that the MTGP model is able to learn the correlation between OM1 and OM3 even though they do not share an overlapping training region. The correlation is learned via the training data of OM2. Scenario S4 illustrates that also data acquired at a different sampling frequency can be used in the MTGP model to increase the prediction accuracy. These observations are confirmed by the decreased RMSE value. Figure 63 - Root mean square error (RMSE) for the different evaluation scenarios S1-S4 By using MTGP models, a spin-off product is that the hyperparameters 𝜽𝜽 𝑐𝑐 are learned automatically. Consequently, the correlation matrix 𝑲𝑲 𝑐𝑐 can be evaluated. Figure 64 shows the normalized correlation coefficients estimated by the MTGP model versus the correlation coefficients using Pearson’s correlation coefficient. Note that to compute Figure 64.b, the complete signals have been used. In contrast, the MTGP model used only the available training data. Note the computation of the Pearson’s correlation coefficient between OM1 and OM3 only on the training data would not be possible as OM1 and OM3 have no overlapping training region. Figure 64 - (a) Normalized correlation matrix 𝑲𝑲 𝑐𝑐 and (b) Pearson’s correlation coefficients for scenario S4 This example illustrated the basic framework of MTGP models. These models can be easily extended to multidimensional input data, phase shifts or varying signal characteristics between tasks. S1 S2 S3 S4 𝑅𝑀𝑀𝑆𝑆𝐸𝐸 2.244 2.005 1.805 1.474 <?page no="119"?> Support Vector Machines Made Easy 119 The main limitation of the MTGP is that the computational cost for evaluating MTGPs is 𝑂𝑂(𝑠𝑠 3 𝑛𝑛 3 ) compared with 𝑠𝑠 × 𝑂𝑂(𝑛𝑛 3 ) for STGPs. Additionally, the number of hyperparameters can increase rapidly for an increasing number of tasks which can lead to a multi-modal parameter space with no overall optimum. <?page no="121"?> 11 Bayesian Networks Bayesian networks, also called ‘causal networks’, were developed for probabilistic reasoning in expert systems. An expert system is a computer program. The program assists decision-making. It thereby helps human experts. The domain of the expert system is the area of its expertise. For example, the area of expertise could be diagnosis in oncology. In this case, inputs are laboratory and measurement data as well as image data taken from patients. The goal is to help diagnosis. Often the data are subject to measurement errors. But also, the diagnosis could be based on experience. Parts of the reasoning to find the diagnosis could rely on probabilistic assumptions. Thus, the goal of the expert system is to assist in the reasoning with probabilistic assumptions. There are surprising examples which show that human reasoning with probabilistic data is often mistaken. Bayesian networks are an active area of research. During the 1980s, the first expert systems for medical diagnosis were tested. The success of these systems was very limited, for two reasons. 
Firstly, the representation of (stochastic) independence turns out to be difficult. A correct (or better) representation of independence was found only after trying a number of other ways. This representation of independence became the basis of causal networks. Independence was represented in conjunction with causality, giving an elegant and intuitive way to work with independent and dependent information. As an example for dependent and independent information, we look at two symptoms of the flu (fever and headache). The two symptoms are not necessarily independent, if we look for these two symptoms in the general population. They would often occur together in the same person, meaning they are not independent. If they were, a person having one of the symptoms would have the same probability of having the other symptom as the rest of the population. The second problem in constructing probabilistic expert systems is a combinatorial one. Once a good representation of independence was found, it turned out that it would require exponential time to work with the data thus represented. Bayesian networks provide a way to overcome this second problem as well. We will illustrate the basic methods underlying Bayesian networks. We consider the following scenario. If someone breaks into your house, the burglar alarm system will sound. But the alarm system may also give false alarms. For example, an earthquake could set off the alarm system, even when no burglar is in your house. Hence, we have three separate types of information in this scenario. Alarm, Burglary, Earthquake <?page no="122"?> 122 Fundamentals of Machine Learning Assume, you are on your way home, and hear that your alarm system sounds, from outside the house. Before, on your way home, you had been listening to the radio and heard that a small earthquake has occurred. The question is now: how likely is it that you have a burglar in your house? In this case the observed facts are: alarm sounds, earthquake has occurred. The information we would like to infer is whether the burglary has occurred. To represent this information, one could construct a tree with the observable facts as roots, and the information to be inferred (probabilistically) as a leaf node. This is shown in the following picture. Figure 65 - Tree with observable facts (root nodes) and inferred leaf nodes This is the information representation in the expert system PROSPECTOR. PROSPECTOR was designed to infer geology under the earth’s surface. The input information in PROSPECTOR is the geology visible on the earth surface. Hence, information easily observable is used to compute the probability of information not immediately visible. In this way, PROSPECTOR processes the information in a diagnostic direction. A number of problems occurred with this type of information processing. For example, the occurrence of ‘earthquake’ should only reduce the probability of ‘burglary’ if the probability of burglary had previously been increased as a consequence of ‘alarm’. Since the earthquake could have set off the (false) alarm. Otherwise, the occurrence of ‘earthquake’ should not change the probability of burglary. We give names to the three events occurring in our story. 𝐸𝐸 for earthquake, 𝐵𝐵 for burglary, and 𝐴𝐴 for alarm. To represent this, we could proceed as follows: alarm burglary earthquake <?page no="123"?> Support Vector Machines Made Easy 123 Figure 66 - Combined events and their inferred leaf node Here, we must assign individual probabilities to each root node. 
To add a new symptom, we would have to double the number of root nodes. In this way, we would obtain a rapid growth of the number of nodes. Bayesian networks use a different way to represent the probabilistic information in this case: 𝐵𝐵 and 𝐸𝐸 are the possible causes for 𝐴𝐴 . Thus, an arc in the graph represents causality. Therefore, Bayesian networks are also called causal networks. Compare this graph to the above graph for the PROSPECTOR model. Thus, the arcs no longer represent the paths along which probability must be updated. Instead they represent causal dependency. Hence, we have the following interpretation: Earthquake and burglary are the causes for alarm. The above interpretation in PROSPECTOR was: the occurrence of the event alarm led to a modification of the probability for burglary. We introduce the following propositions. 𝑎𝑎 1 = 𝐴𝐴𝑙𝑙𝑎𝑎𝑟𝑟𝑚𝑚 𝑎𝑎 2 = - 𝐴𝐴𝑙𝑙𝑎𝑎𝑟𝑟𝑚𝑚 𝑏𝑏 1 = 𝐵𝐵𝑢𝑢𝑟𝑟𝑔𝑔𝑙𝑙𝑎𝑎𝑟𝑟𝑦𝑦 𝑏𝑏 2 = -𝐵𝐵𝑢𝑢𝑟𝑟𝑔𝑔𝑙𝑙𝑎𝑎𝑟𝑟𝑦𝑦 𝑒𝑒 1 = 𝐸𝐸𝑎𝑎𝑟𝑟𝑡𝑡ℎ𝑞𝑞𝑢𝑢𝑎𝑎𝑘𝑘𝑒𝑒 𝑒𝑒 2 = -𝐸𝐸𝑎𝑎𝑟𝑟𝑡𝑡ℎ𝑞𝑞𝑢𝑢𝑎𝑎𝑘𝑘𝑒𝑒 To process the information in the network, we store the following data: a. all a priori probabilities of root nodes, in our case only 𝑃𝑃(𝑏𝑏 1 ) 𝑃𝑃(𝑒𝑒 1 ) b. conditional probability of all other nodes under their parents, in our case 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ∧ 𝑒𝑒 1 ) 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ∧ 𝑒𝑒 2 ) 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 2 ∧ 𝑒𝑒 1 ) 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 2 ∧ 𝑒𝑒 2 ) A and E A and not E not A and E not A and not E B <?page no="124"?> 124 Fundamentals of Machine Learning Note that we can compute the value 𝑃𝑃(𝑎𝑎 2 |𝑏𝑏 2 ∧ 𝑒𝑒 1 ) directly from the given value 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ∧ 𝑒𝑒 1 ) , since 𝑃𝑃(𝑎𝑎 2 |𝑏𝑏 1 ∧ 𝑒𝑒 1 ) = 1 − 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ∧ 𝑒𝑒 1 ) Hence we can save space here. By convention, we will write 𝑃𝑃(𝑎𝑎, 𝑏𝑏) instead of 𝑃𝑃(𝑎𝑎 ∧ 𝑏𝑏) . The main idea of a Bayesian network is to store the data as above (a priori data and conditional probabilities under parents in a given non-cyclic directed graph). Then it will turn out that the entire joint distribution of all propositional variables can be captured with this comparatively sparse data, under appropriate preconditions. We will begin by defining the concept of a joint distribution of propositional variables. We are given propositions 𝑎𝑎 1 , 𝑎𝑎 2 , 𝑏𝑏 1 , 𝑏𝑏 2 , 𝑒𝑒 1 , 𝑒𝑒 2 as above. We now consider propositional variables 𝐴𝐴 , 𝐵𝐵 and 𝐸𝐸 . Here 𝐴𝐴 can take the values 𝑎𝑎 1 , 𝑎𝑎 2 . 𝐵𝐵 can take values 𝑏𝑏 1 , 𝑏𝑏 2 . And 𝐸𝐸 takes 𝑒𝑒 1 , 𝑒𝑒 2 as values. Generally, we require that the values, which can be taken by the propositional variables are mutally exclusive and exhaustive. Exhaustive means, the values cover all possibilities. For example, the above variables 𝑎𝑎 1 and 𝑎𝑎 2 are mutually exclusive and exhaustive. Furthermore, we require the number of values for each variable to be finite. The joint distribution of three propositional variables 𝐴𝐴 , 𝐵𝐵 , 𝐶𝐶 is a probability measure 𝑃𝑃 defined on events of the form: 𝑎𝑎 𝑖𝑖 and 𝑏𝑏 𝑗𝑗 and 𝑐𝑐 𝑘𝑘 for 𝑖𝑖 , 𝑗𝑗 , 𝑘𝑘 = 1,2 Hence, 𝑃𝑃 assigns values in the interval [0,1] to events of the form 𝑎𝑎 𝑖𝑖 and 𝑏𝑏 𝑗𝑗 and 𝑐𝑐 𝑘𝑘 . Hence, the joint distribution is defined solely on triplet events of the above form. We will next show that any probability involving one or more of the above three propositions can be computed, from a given joint distribution. Hence, we would like to convince ourselves that we can compute the value of 𝑃𝑃(𝑎𝑎 2 ) if we are given all values of the form 𝑃𝑃�𝑎𝑎 𝑖𝑖 , 𝑏𝑏 𝑗𝑗 , 𝑐𝑐 𝑘𝑘 � for 𝑖𝑖, 𝑗𝑗, 𝑘𝑘 = 1,2 . Likewise, we can also compute 𝑃𝑃(𝑎𝑎 2 , 𝑏𝑏 2 ) from the triplet values in the joint distribution. Or we can compute 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ) . 
We begin by showing that 𝑃𝑃(𝑎𝑎 1 ) can be computed from the triplet 𝑃𝑃 ’s. This simply follows from: 𝑃𝑃(𝑎𝑎 1 ) = 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 , 𝑐𝑐 1 ) + 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 2 , 𝑐𝑐 1 ) + 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 , 𝑐𝑐 2 ) + 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 2 , 𝑐𝑐 2 ) <?page no="125"?> Support Vector Machines Made Easy 125 Note that we have used the mutual exclusiveness here. Now 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 ) = 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 , 𝑐𝑐 1 ) + 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 , 𝑐𝑐 2 ) which shows that we also compute 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 ) from the triplets. But now 𝑃𝑃(𝑎𝑎 1 |𝑏𝑏 1 ) = 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 ) 𝑃𝑃(𝑎𝑎 1 ) . We have thus computed arbitrary probability values for propositions in the network from the joint distribution. But the other direction is even more important. Namely we would like to compute the joint distribution from the values stored in the net alone (see the above values, i.e. conditional probabilities under parents of nodes and a priori probabilities). Here, the joint distribution does not have to be computed explicitly for all triplets, since this would give rise to exponential computing times in the case of more than three propositional variables. We consider a node 𝐴𝐴 in a directed non-cyclic graph (dng) To define a Bayesian Network we will need the following definitions: 𝑝𝑝𝑎𝑎𝑟𝑟𝑒𝑒𝑛𝑛𝑡𝑡𝑠𝑠(𝐴𝐴) = set of direct predecessors of 𝐴𝐴 𝑠𝑠𝑢𝑢𝑐𝑐𝑐𝑐𝑒𝑒𝑠𝑠𝑠𝑠𝑠𝑠𝑟𝑟𝑠𝑠(𝐴𝐴) = all (including indirect) successors of 𝐴𝐴 𝑟𝑟𝑒𝑒𝑠𝑠𝑡𝑡(𝐴𝐴) = all nodes in the graph, except 𝐴𝐴 itself As abbreviations for the sets defined above, we use 𝑒𝑒(𝐴𝐴) , 𝑠𝑠(𝐴𝐴) , 𝑟𝑟(𝐴𝐴) .  Example Figure 67 - Sample network A B C E F D <?page no="126"?> 126 Fundamentals of Machine Learning Here we have: 𝑒𝑒(𝐸𝐸) = {𝐵𝐵, 𝐶𝐶} 𝑠𝑠(𝐵𝐵) = {𝐷𝐷 , 𝐸𝐸 , 𝐹𝐹} 𝑟𝑟(𝐶𝐶) = {𝐴𝐴 , 𝐵𝐵 , 𝐷𝐷} We can now define Bayesian networks (also called causal networks). Consider a directed non-cyclic graph with node set 𝐕𝐕 and edge set 𝐄𝐄 . Each node represents a propositional variable. Definition Given a joint distribution 𝑃𝑃 on 𝐕𝐕, t hen (𝐕𝐕, 𝐄𝐄, 𝑃𝑃) is a Bayesian network, if for each node 𝐴𝐴 , and for each subset 𝐖𝐖 of 𝑟𝑟(𝐴𝐴) : 𝑃𝑃�𝐖𝐖 ∧ 𝐴𝐴�𝑒𝑒(𝐴𝐴)� = 𝑃𝑃�𝐖𝐖�𝑒𝑒(𝐴𝐴)�𝑃𝑃�𝐴𝐴�𝑒𝑒(𝐴𝐴)� Hence, a Bayesian network is defined by independence properties. Each node 𝐴𝐴 must be conditionally independent from sets 𝐖𝐖 . The conditional independence must hold with respect to the parents of 𝐴𝐴 . And the sets 𝐖𝐖 range over all subsets of 𝑟𝑟(𝐴𝐴) . At first glance, this definition may sound complicated to verify for given graphs. Because not only independence may be difficult to show in given cases, but also 𝐖𝐖 ranges over a large number of sets. Typically, the set of all subsets of a given set has an exponential number of elements. Hence we will look at a number of examples next. It will turn out that these independence assumptions fit well to the representation of causality, which is the basis of causal networks. In addition it will turn out that the independence assumptions are realistic for practical cases. Unrealistic independence assumptions were the main problem in early probabilistic expert systems such as MYCIN and PROSPECTOR: Thus a causal network is given by a graph (𝐕𝐕, 𝐄𝐄) and a joint distribution. Here the above independence assumptions must hold. A different formulation of the independence assumptions is: 𝐴𝐴 and 𝐖𝐖 are conditionally independent under the parents of 𝐴𝐴 : 𝑃𝑃�𝐴𝐴�𝐖𝐖 ∧ 𝑒𝑒(𝐴𝐴)� = 𝑃𝑃�𝐴𝐴�𝑒𝑒(𝐴𝐴)�  Exercise Show the equivalence of the two definitions. Note the meaning of the conjunction ‘and’ (or ∧ ) in this context. 𝐖𝐖 is a set of propositional variables. 
Then <?page no="127"?> Support Vector Machines Made Easy 127 𝑃𝑃�𝐴𝐴�𝐖𝐖 ∧ 𝑒𝑒(𝐴𝐴)� is the probability of 𝐴𝐴 , if all variables in 𝐖𝐖 , and all Variables in 𝑒𝑒(𝐴𝐴) are instantiated. Hence, 𝐖𝐖 not only ranges over a number of subsets, but also over all possible instantiations. A typical scenario in an expert system is the following. Some variables are instantiated. The rest of the variables are not instantiated A request is now: What is the probability of a certain node, given these instantiations Figure 68 - Sample network Independence assumptions carry the information contained in a Bayesian network. We will now discuss the independence assumptions hidden in the above definition for simple cases. Consider the following network: Figure 69 - Independence assumption for simple case For node 𝐴𝐴 , we have 𝑟𝑟(𝐴𝐴) = {𝑋𝑋}. Hence, 𝐖𝐖 = ∅ or 𝐖𝐖 = {𝑋𝑋} . For 𝐖𝐖 = ∅ , there is nothing to show. For 𝐖𝐖 = {𝑋𝑋} , we must have 𝑃𝑃�𝐴𝐴�𝐖𝐖 ∧ 𝑒𝑒(𝐴𝐴)� = 𝑃𝑃�𝐴𝐴�𝑒𝑒(𝐴𝐴)� . Let 𝑎𝑎 and 𝑥𝑥 be possible values for 𝐴𝐴 und 𝑋𝑋 . Inserting 𝑃𝑃(𝑎𝑎|𝑥𝑥 ∧ 𝑥𝑥) = 𝑃𝑃(𝑎𝑎|𝑥𝑥) , ... 𝐴𝐴 𝐵𝐵 𝐶𝐶 𝐷𝐷 metastasis coma brain tumor increased Calcium in serum 𝑋𝑋𝐴𝐴 <?page no="128"?> 128 Fundamentals of Machine Learning this is always satisfied. But: 𝐖𝐖 ∧ 𝑒𝑒(𝐴𝐴) runs over all possible instantiations, including the values 𝑃𝑃(𝑎𝑎|𝑥𝑥 𝑖𝑖 ∧ 𝑥𝑥 𝑖𝑖 ) where 𝑥𝑥 𝑖𝑖 ≠ 𝑥𝑥 𝑖𝑖 , which is not possible. The independence assumptions for 𝑋𝑋 are: 𝑃𝑃�𝑋𝑋�𝑊𝑊 ∧ 𝑒𝑒(𝑋𝑋)� = 𝑃𝑃�𝑋𝑋�𝑒𝑒(𝑋𝑋)� They are always satisfied. We will now turn to a more complex example. Figure 70 - Independence assumption for a more complex case For 𝐴𝐴 , we have the independence assumptions: 𝑒𝑒(𝐴𝐴) = ∅ 𝑟𝑟(𝐴𝐴) = {𝐵𝐵} Hence, let 𝐖𝐖 be a subset of {𝐵𝐵} . We must show: 𝐖𝐖 is conditionally independent of A under 𝑒𝑒(𝐴𝐴) , i.e. under ∅ . a. If 𝐖𝐖 = ∅ , 𝑃𝑃(𝐴𝐴|∅) = 𝑃𝑃(𝐴𝐴) always holds b. If 𝐖𝐖 = {𝐵𝐵} , 𝑃𝑃(𝐴𝐴|{𝐵𝐵}) = 𝑃𝑃(𝐴𝐴) does not always hold, must be shown in each specific case. Independence assumptions for 𝐵𝐵 (likewise): 𝑃𝑃(𝐵𝐵|𝐴𝐴) = 𝑃𝑃(𝐵𝐵) Independence assumptions for 𝐶𝐶 : 𝑃𝑃�𝐶𝐶�𝐖𝐖 ∧ 𝑒𝑒(𝐶𝐶)� = 𝑃𝑃�𝐶𝐶�𝑒𝑒(𝐶𝐶)� for all 𝐖𝐖 of 𝑟𝑟(𝐶𝐶) , but 𝑟𝑟(𝐶𝐶) is a subset of 𝑒𝑒(𝐶𝐶) . Hence: the independence assumptions for 𝐶𝐶 are always satisfied. Summary: We obtain the following independence assumptions for the above network ➊ 𝑃𝑃(𝐴𝐴|∅) = 𝑃𝑃(𝐴𝐴) , likewise 𝑃𝑃(𝐵𝐵| ∅ ) = 𝑃𝑃(𝐵𝐵) ➋ 𝑃𝑃(𝐴𝐴|𝐵𝐵) = 𝑃𝑃(𝐴𝐴) ➌ 𝑃𝑃(𝐵𝐵|𝐴𝐴) = 𝑃𝑃(𝐵𝐵) ➍ 𝑃𝑃�𝐶𝐶�𝐖𝐖 ∧ 𝑒𝑒(𝐶𝐶)� = 𝑃𝑃�𝐶𝐶�𝑒𝑒(𝐶𝐶)� 𝐶𝐶 𝐴𝐴 𝐵𝐵 <?page no="129"?> Support Vector Machines Made Easy 129 ( ➊ and ➍ always hold) We now consider a specific experiment: We have two coins and a bell. We throw the coins, if at least one turns up heads, the bell will ring. Set 𝑎𝑎 = 𝐶𝐶𝑠𝑠𝑖𝑖𝑛𝑛 1 = 𝐻𝐻𝑒𝑒𝑎𝑎𝑑𝑑 𝑎𝑎 ′ = 𝐶𝐶𝑠𝑠𝑖𝑖𝑛𝑛 1 = -𝐻𝐻𝑒𝑒𝑎𝑎𝑑𝑑 𝑏𝑏 = 𝐶𝐶𝑠𝑠𝑖𝑖𝑛𝑛 2 = 𝐻𝐻𝑒𝑒𝑎𝑎𝑑𝑑 𝑏𝑏′ = 𝐶𝐶𝑠𝑠𝑖𝑖𝑛𝑛 2 = -𝐻𝐻𝑒𝑒𝑎𝑎𝑑𝑑 𝑐𝑐 = 𝐵𝐵𝑒𝑒𝑙𝑙𝑙𝑙 𝑐𝑐 ′ = -𝐵𝐵𝑒𝑒𝑙𝑙𝑙𝑙 This gives rise to propositional variables 𝐴𝐴 , 𝐵𝐵 , 𝐶𝐶 . 𝐴𝐴 and 𝐵𝐵 are causes for 𝐶𝐶 . Hence the network is the same as shown in Figure 70. We have the values: 𝑃𝑃(𝑏𝑏) = 12 𝑃𝑃(𝑎𝑎) = 12 𝑃𝑃(𝑐𝑐) = 34 𝑃𝑃(𝑐𝑐 ′ ) = 14 ⋮ Thus, the joint distribution is: 𝑃𝑃(𝑎𝑎, 𝑏𝑏, 𝑐𝑐) = 14 𝑃𝑃(𝑎𝑎, 𝑏𝑏, 𝑐𝑐 ′ ) = 0 𝑃𝑃(𝑎𝑎, 𝑏𝑏 ′ , 𝑐𝑐) = 14 𝑃𝑃(𝑎𝑎, 𝑏𝑏 ′ , 𝑐𝑐 ′ ) = 0 𝑃𝑃(𝑎𝑎 ′ , 𝑏𝑏, 𝑐𝑐) = 14 𝑃𝑃(𝑎𝑎 ′ , 𝑏𝑏, 𝑐𝑐 ′ ) = 0 𝑃𝑃(𝑎𝑎 ′ , 𝑏𝑏 ′ , 𝑐𝑐) = 0 𝑃𝑃(𝑎𝑎 ′ , 𝑏𝑏 ′ , 𝑐𝑐 ′ ) = 1 4 To check the independence assumptions resulting from the above definition, we only have to look at the cases ➋ and ➌ . <?page no="130"?> 130 Fundamentals of Machine Learning For ➋ , we must show 𝑃𝑃(𝐴𝐴|𝐵𝐵) = 𝑃𝑃(𝐴𝐴) Hence: 𝑃𝑃(𝑎𝑎|𝑏𝑏) = 𝑃𝑃(𝑎𝑎) and 𝑃𝑃(𝑎𝑎 ′ |𝑏𝑏) = 𝑃𝑃(𝑎𝑎 ′ ) (and likewise for 𝑏𝑏 ′ ). We have 𝑃𝑃(𝑎𝑎|𝑏𝑏) = 𝑃𝑃(𝑎𝑎,𝑏𝑏) 𝑃𝑃(𝑏𝑏) = � 14 � � 12 � � = 12 . 
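The same check can also be carried out numerically from the joint distribution table; here is a minimal MATLAB sketch, where index 1 stands for the unprimed value (a, b, c) and index 2 for the primed one.

% Numerical check of independence assumption (2), P(A|B) = P(A), for the
% coin-and-bell experiment, using the joint distribution listed above.
P = zeros(2, 2, 2);                       % P(i,j,k) = P(A=i, B=j, C=k)
P(1,1,1) = 1/4;  P(1,2,1) = 1/4;          % P(a,b,c), P(a,b',c)
P(2,1,1) = 1/4;  P(2,2,2) = 1/4;          % P(a',b,c), P(a',b',c'); all others are 0

P_a  = sum(sum(P(1, :, :)));              % P(a): sum over B and C
P_b  = sum(sum(P(:, 1, :)));              % P(b): sum over A and C
P_ab = sum(P(1, 1, :));                   % P(a,b): sum over C
fprintf('P(a|b) = %.2f, P(a) = %.2f\n', P_ab / P_b, P_a);   % both equal 1/2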
The remaining verifications are very similar.  Exercise Show that the set of independence assumptions will shrink if new edges are added to the graph. Hint: In the above example, add a new edge: Figure 71 - Independence assumption with added edge We illustrate the independence assumptions and the information carried in a causal network in yet another example: Figure 72 - 𝐴𝐴 as causal relation for both 𝐵𝐵 and 𝐶𝐶 Here, 𝐴𝐴 is a causal reason for both 𝐵𝐵 and 𝐶𝐶 . Then: 𝐵𝐵 , 𝐶𝐶 are a priori not necessarily independent, but independent, if 𝐴𝐴 has already been instantiated. As an example, let 𝐴𝐴 = flu 𝐵𝐵 = fever 𝐶𝐶 = headache 𝐶𝐶 𝐴𝐴 𝐵𝐵 𝐴𝐴 𝐶𝐶 𝐵𝐵 <?page no="131"?> Support Vector Machines Made Easy 131 The definition of a causal network is consistent with this situation: 𝐵𝐵 belongs to 𝑟𝑟(𝐶𝐶) , hence 𝐵𝐵 must be independent of 𝐶𝐶 under 𝑒𝑒(𝐶𝐶) = {𝐴𝐴} The independence does not necessarily hold if 𝐴𝐴 is not instantiated. Above we noted that there are two names for the types of networks we are discussing here: Bayesian networks and causal networks. Both names are used in the literature. From the above discussion, it may become clear that the correct representation and processing of independence is the key to the processing of probabilistic information. Therefore, yet another name has been proposed for causal networks. They are sometimes also called independence networks. The above situation is consistent with the causal interpretation. As an example, consider the following experiment: 𝐴𝐴 = throwing of a coin 𝐵𝐵 = bell 1 𝐶𝐶 = bell 2 Now we first throw a coin. If the result is ‘head’, then the bells are activated, where each activation is done with probability 12 . Then 𝐵𝐵 and 𝐶𝐶 are independent, as soon as 𝐴𝐴 has occurred. But if nothing is known about 𝐴𝐴 , then 𝐵𝐵 and 𝐶𝐶 will not necessarily be independent. Hence, if we know that 𝐵𝐵 has occurred, then the probability of 𝐴𝐴 changes (because 𝐴𝐴 must then have also happened). This change is only possible, if nothing was known about 𝐴𝐴 before. In summary, the independence assumptions carry the information of the application. Less edges in the graph mean more information in the network. The independence assumption corresponds well to the interpretations in the examples. I.e., if we draw the edges in all places where we have causal dependency, then the independence assumptions correspond to the interpretation in the application. 11.1Propagation of probabilities in causal networks The following theorem makes it possible to actually compute probabilities in causal networks. Main Theorem on Causal Networks 𝑃𝑃�𝑎𝑎 𝑖𝑖 , 𝑏𝑏 𝑗𝑗 , 𝑐𝑐 𝑘𝑘 � = 𝑃𝑃�𝑎𝑎 𝑖𝑖 �𝑒𝑒(𝐴𝐴)�𝑃𝑃 �𝑏𝑏 𝑗𝑗 �𝑒𝑒(𝐵𝐵)� 𝑃𝑃�𝑐𝑐 𝑘𝑘 �𝑒𝑒(𝐶𝐶)� with the convention 𝑃𝑃(𝑥𝑥 𝑖𝑖 |∅) = 𝑃𝑃(𝑥𝑥 𝑖𝑖 ) . <?page no="132"?> 132 Fundamentals of Machine Learning With this theorem it becomes possible to compute the entire joint distribution from the given conditional probabilities in the network. Notice The above theorem was stated for three variables 𝐴𝐴 , 𝐵𝐵 and 𝐶𝐶 . The same theorem holds for 𝑛𝑛 variables 𝐴𝐴 1 , … , 𝐴𝐴 𝑛𝑛 . We consider an example. We are given the propositional variables 𝐴𝐴 , 𝐵𝐵 , 𝐶𝐶 , 𝐷𝐷 , 𝐸𝐸 which are instantiated by 𝐴𝐴 = 𝑎𝑎 1 , 𝐵𝐵 = 𝑏𝑏 1 , 𝐶𝐶 = 𝑐𝑐 2 , 𝐷𝐷 = 𝑑𝑑 1 , 𝐸𝐸 = 𝑒𝑒 1 Figure 73 - Sample network Then we have (according to the above theorem) 𝑃𝑃(𝑎𝑎 1 , 𝑏𝑏 1 , 𝑐𝑐 2 , 𝑑𝑑 1 , 𝑒𝑒 1 ) = 𝑃𝑃(𝑎𝑎 1 )𝑃𝑃(𝑏𝑏 1 )𝑃𝑃(𝑐𝑐 2 |𝑎𝑎 1 )𝑃𝑃(𝑑𝑑 1 |𝑎𝑎 1 ∧ 𝑏𝑏 1 )𝑃𝑃(𝑒𝑒 1 |𝑑𝑑 1 ) . The general form of the above theorem is: Let (𝐕𝐕, 𝐄𝐄, 𝑃𝑃) be a causal network. Let 𝐕𝐕 = {𝐴𝐴 1 , … , 𝐴𝐴 𝑛𝑛 } and 𝑎𝑎 1 , … , 𝑎𝑎 𝑛𝑛 be values for 𝐴𝐴 1 , … , 𝐴𝐴 𝑛𝑛 . 
The general form of the above theorem is the following. Let (V, E, P) be a causal network, let V = {A_1, ..., A_n}, and let a_1, ..., a_n be values for A_1, ..., A_n; i.e., a_i is a single value of A_i. Then

P(a_1, ..., a_n) = ∏_{i=1}^{n} P(a_i | e(A_i)),

where the conditional probabilities on the right-hand side are defined, i.e. P(e(A_i)) > 0. Here again, the commas in P(a_1, ..., a_n) mean conjunction (i.e. 'and', ∧).

We consider a directed non-cyclic graph (V, E). A predecessor ordering on V is a list of all nodes in V such that each node appears in the list before any of its successors. Before proving the main theorem, we prove:

Lemma
V can always be written in a predecessor ordering.

Proof of the lemma:
➊ There is always a node without predecessors (otherwise there would be a cycle in the graph, which is not allowed in causal networks). Call such a node a root node.
➋ Construction of a predecessor ordering in (V, E): mark any root node X_1 in (V, E) with the value 1, then delete X_1 (and all edges emerging from X_1). This deletion gives a reduced graph (V_1, E_1), and this new graph is again directed and non-cyclic. Hence, it also has a root node; label this root node with 2. This process is repeated until all nodes have been marked. Notice that a node is only marked after all its predecessors have already been marked. This yields the desired predecessor ordering.

We will now turn to proving the theorem. In the following we assume that the variables A_1, ..., A_n are numbered according to a predecessor ordering, so that e(A_i) is a subset of {A_1, ..., A_{i-1}}.

Case 1: P(a_1, ..., a_n) ≠ 0.
Then

P(a_1, ..., a_i) = Σ P(a_1, ..., a_i, a_{i+1}', ..., a_n') > 0,

where the sum runs over all values a_j' of A_j for j > i; it is positive because it contains the term P(a_1, ..., a_n) ≠ 0. Thus we have P(a_1, ..., a_i) > 0 for every i. The chain rule states that for arbitrary a, b with P(a) ≠ 0 we have P(b ∧ a) = P(b | a) P(a). Iterating this rule, we obtain

P(a_1, ..., a_n) = P(a_n, ..., a_1)
                 = P(a_n | a_1, ..., a_{n-1}) · P(a_{n-1} | a_1, ..., a_{n-2}) · ... · P(a_2 | a_1) · P(a_1)

(all conditions have positive probability, since P(a_1, ..., a_i) ≠ 0). But we also have: e(A_i) is a subset of {A_1, ..., A_{i-1}}, and {A_1, ..., A_{i-1}} is a subset of r(A_i). Hence, we set W := {A_1, ..., A_{i-1}}. For this W we have

P(a_i | W) = P(a_i | W ∧ e(A_i))    (since e(A_i) is a subset of W)
           = P(a_i | e(A_i))        (according to the independence assumptions).

We substitute this into the chain rule and obtain

P(a_1, ..., a_n) = P(a_n | a_1, ..., a_{n-1}) · ... · P(a_2 | a_1) · P(a_1)
                 = P(a_n | e(A_n)) · ... · P(a_2 | e(A_2)) · P(a_1).

Since e(A_1) = ∅, we have P(a_1) = P(a_1 | e(A_1)). But this is what is claimed by the theorem for case 1.

Case 2: P(a_1, ..., a_n) = 0.
We can assume P(a_1) ≠ 0, since otherwise we would be done: then P(a_1 | e(A_1)) = P(a_1 | ∅) = P(a_1) = 0, so one factor of the product, and hence the product itself, is zero. Then there is an i with P(a_1, ..., a_{i-1}) ≠ 0 and P(a_1, ..., a_{i-1}, a_i) = 0. Then

P(a_i | a_1, ..., a_{i-1}) = P(a_1, ..., a_i) / P(a_1, ..., a_{i-1}) = 0.

As above we have P(a_i | a_1, ..., a_{i-1}) = P(a_i | e(A_i)). Hence, P(a_i | e(A_i)) = 0, so the product on the right-hand side vanishes as well, which is what is claimed by the theorem in case 2. This completes the proof of the theorem.
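The construction from the proof of the lemma can be sketched directly in MATLAB. The adjacency-matrix representation and the function name below are our own choices; the book does not fix a data structure.

MATLAB example (sketch)

    function order = predecessor_ordering(Adj)
    % PREDECESSOR_ORDERING  Construction from the lemma above.
    % Adj(i,j) = 1 means there is an edge from node i to node j.
      n = size(Adj, 1);
      remaining = true(1, n);               % nodes not yet marked
      order = zeros(1, n);
      for k = 1:n
        % A root node of the reduced graph has no incoming edge
        % from the nodes that are still remaining.
        indeg = sum(Adj(remaining, remaining), 1);
        idx = find(remaining);
        root = idx(find(indeg == 0, 1));    % exists because the graph is acyclic
        order(k) = root;                    % mark this root with label k
        remaining(root) = false;            % delete it and its outgoing edges
      end
    end

    % Example (Figure 73, nodes A..E numbered 1..5, edges A->C, A->D, B->D, D->E):
    %   Adj = zeros(5); Adj(1,3) = 1; Adj(1,4) = 1; Adj(2,4) = 1; Adj(4,5) = 1;
    %   predecessor_ordering(Adj)    % returns [1 2 3 4 5], i.e. A, B, C, D, E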
A typical question for an expert system is: given some variables which are instantiated and others which are not, how likely is a certain hypothesis? Thus, it is not sufficient to know the values P(a_1, ..., a_n) for all combinations of values a_1, ..., a_n. These values could be determined with the help of the above theorem from the values P(a_i | e(A_i)); note, however, that in a query only some, but not all, variables are instantiated.

Let us look at an example. The variables are A, B, C, D, E, connected as in Figure 73. We wish to compute P(c_2 | a_1, b_1). In principle, this computation is possible via the detour of the joint distribution. Thus, we can compute

P(c_2 | a_1, b_1) = P(a_1, b_1, c_2) / P(a_1, b_1).

Numerator:    P(a_1, b_1, c_2) = Σ_{u,v} P(a_1, b_1, c_2, d_u, e_v)
Denominator:  P(a_1, b_1) = Σ_{u,v,w} P(a_1, b_1, c_w, d_u, e_v)

The sums run over all combinations of u, v, w. We thus have exponential computing time.

We will consider a more specific example: computing the probability of Parkinson's disease, given symptoms. In this example the symptoms are:
➊ trembling hand / tremor
➋ arm held at an angle
➌ walking without sufficiently lifting the legs
➍ (other typical symptoms which can readily be observed externally)

To simplify matters, we will only look at symptoms 1 and 2. Hence the propositional variables are

A: Parkinson's disease
B: symptom 1
C: symptom 2

(The other symptoms can be processed in the same way.) Then we have the same network as shown in Figure 72.

From the statistics we know that 250,000 of 80,000,000 people (the total population of Germany) have Parkinson's disease. Hence P(a) ≈ 0.003 and P(¬a) ≈ 0.997. Likewise, we assume the following values have been determined from medical statistics:

P(b | a) = 0.9 and P(¬b | a) = 0.1
P(b | ¬a) = 0.2 and P(¬b | ¬a) = 0.8

Furthermore,

P(c | a) = 0.7 and P(¬c | a) = 0.3
P(c | ¬a) = 0.1 and P(¬c | ¬a) = 0.9.

Note that the numbers in this example are fictitious.

We want to show that P(a | b) ≠ P(a | b, ¬c). This would mean that we obtain a different probability of Parkinson's disease for a specific patient if we know that symptom C is not present, compared to the situation where nothing is known about symptom C. Hence, knowing that ¬c holds must give a different result than not knowing anything about c. Notice that human 'experts' often make mistakes here. Thus, 'no fingerprints from suspect Frank were found at the crime scene, but other people's fingerprints were found' is different from 'the fingerprints from the crime scene could not be analyzed'. The fact that the two cases differ by a factor of about three (as we shall soon see) may already surprise you.

We can compute P(a | b) from the triplet probabilities. Specifically,

P(a | b) = P(a, b) / P(b)
         = [P(a, b, c) + P(a, b, ¬c)] / [P(a, b, c) + P(¬a, b, c) + P(a, b, ¬c) + P(¬a, b, ¬c)].

Again, read the comma as 'and'. The triplet probabilities can be computed from the main theorem above:

P(a, b, c) = P(a) P(b | a) P(c | a).

The probabilities on the right-hand side are all known (see above). Then we have

P(a, b, c) = 0.003 · 0.9 · 0.7 = 0.00189.

Likewise, we have

P(¬a, b, c) = 0.01994,   P(a, b, ¬c) = 0.00081,   P(¬a, b, ¬c) = 0.17946.

Furthermore,

P(a, ¬b, c) = 0.00021,   P(¬a, ¬b, c) = 0.07976,   P(a, ¬b, ¬c) = 0.00009,   P(¬a, ¬b, ¬c) = 0.71784.

Check that the sum of all eight triplet probabilities is indeed 1; this is a useful consistency check showing that the factorization yields a proper joint distribution. But now we have

P(a | b) = (0.00189 + 0.00081) / (0.00189 + 0.01994 + 0.00081 + 0.17946) = 0.0027 / 0.2021 ≈ 0.0134

and

P(a | b, ¬c) = P(a, b, ¬c) / [P(a, b, ¬c) + P(¬a, b, ¬c)] = 0.00081 / 0.18027 ≈ 0.0045.

Hence, the probability of the disease differs by a factor of about three, depending on whether c is known to be false or simply not instantiated!
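These numbers are easy to reproduce. The following MATLAB sketch (our own code and variable layout) computes all eight triplet probabilities from the given conditional probabilities and then the two query probabilities:

MATLAB example (sketch)

    % Parkinson example (network of Figure 72): index 1 = true, 2 = false.
    Pa  = [0.003 0.997];              % P(a), P(~a)
    PbA = [0.9 0.1; 0.2 0.8];         % PbA(i,:) = P(B | A = i)
    PcA = [0.7 0.3; 0.1 0.9];         % PcA(i,:) = P(C | A = i)
    J = zeros(2, 2, 2);               % J(i,j,k) = P(A = i, B = j, C = k)
    for i = 1:2
      for j = 1:2
        for k = 1:2
          J(i, j, k) = Pa(i) * PbA(i, j) * PcA(i, k);   % main theorem
        end
      end
    end
    fprintf('sum of triplet probabilities: %.5f\n', sum(J(:)));    % 1.00000
    P_a_given_b    = sum(J(1, 1, :)) / sum(sum(J(:, 1, :)));
    P_a_given_b_nc = J(1, 1, 2) / sum(J(:, 1, 2));
    fprintf('P(a|b)    = %.4f\n', P_a_given_b);      % about 0.0134
    fprintf('P(a|b,~c) = %.4f\n', P_a_given_b_nc);   % about 0.0045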
The theory of causal networks is a very active area of research. Specifically, to make causal networks practical, one has to find ways to avoid the exponential computing time resulting from the above 'naïve' procedure of computing all joint probabilities. This is indeed possible, and much work has been dedicated to this goal.

Appendix - Linear Programming

Linear programming occurs in several places above. We will now describe a method for linear programming, called the simplex method. We will state it only in its simplest terms, but attempt to describe a complete, implementable version of it. Let us restate LP:

(LP)
Maximize z_P = c_1 x_1 + ... + c_n x_n
subject to
a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≤ b_1
⋮
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n ≤ b_m,
where x_i ≥ 0.

To describe the simplex method, we will first consider a simplified problem (LP0). LP0 is the same as LP, with the only difference that in LP0 all values b_i ≥ 0. Hence, LP0 has the following form:

(LP0)
Maximize z_P = c_1 x_1 + ... + c_n x_n
subject to
a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≤ b_1
⋮
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n ≤ b_m,
where x_i ≥ 0 and b_i ≥ 0.

Notice
The point (0, ..., 0)^T is always inside the solution polyhedron for LP0! This only holds for LP0, not for LP in general.

Visually (LP0):

Figure 74 - Simplified linear program LP0 (feasible region in the (x_1, x_2)-plane with objective direction c)

A.1 Solving LP0 problems

We consider a single inequality

a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≤ b_1.

Instead of this, we can also write

a_11 x_1 + a_12 x_2 + ... + a_1n x_n + s_1 = b_1,

where s_1 ≥ 0 is an additional variable, called a slack variable. Notice: to ensure that the two versions (inequality form and equation form) are really equivalent, it is necessary to require s_1 ≥ 0. In this way, inequalities can be written as equations, simply by introducing slack variables. We transform all inequalities in LP0 to equations by using variables s_i:

a_11 x_1 + a_12 x_2 + ... + a_1n x_n + s_1 = b_1
⋮
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n + s_m = b_m

The new variables s_i are processed in the same way as the variables x_i, i.e. s_i ≥ 0. Hence, all inequalities can be written in equation form.

Example
LP0 with one inequality and two variables:

x_1 + x_2 ≤ 7
Maximize x_1

Figure 75 - Simple example of an LP0 (feasible region in the (x_1, x_2)-plane)

The equation form has three variables and can be visualized in 3D:

x_1 + x_2 + s_1 = 7
Maximize x_1

Figure 76 - Simple example of an LP0, equation form (the feasible region embedded in (x_1, x_2, s_1)-space)

The feasible polyhedron (in 3D) is still two-dimensional. The first iteration step in the example will move from one vertex of the (now three-dimensional) solution polyhedron to the next. This is illustrated in the following figures.

Figure 77 - Start vertex to find the solution of an LP0 problem

Move to the next vertex:

Figure 78 - First step of the iteration to find the solution of an LP0 problem

Computationally, this iteration is done as follows:
➊ Step: start at the vertex given by x_1 = 0, x_2 = 0, s_1 = b_1.
➋ Step: move to the vertex x_1 = b_1, x_2 = 0, s_1 = 0.

We now transform the system by solving for x_1. Thus the equation

x_1 + x_2 + s_1 = 7

becomes

x_1 = 7 - x_2 - s_1.

We transform the target function in the same way: 'Maximize x_1' becomes 'Maximize 7 - x_2 - s_1'.
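The two vertices visited in steps ➊ and ➋ can be checked directly against the equation form. A small MATLAB illustration (values taken from the example above):

MATLAB example (sketch)

    % Equation form of the example: x1 + x2 + s1 = 7 with x1, x2, s1 >= 0.
    Aeq = [1 1 1];                 % coefficients of (x1, x2, s1)
    beq = 7;
    v_start = [0; 0; 7];           % step 1: x1 = 0, x2 = 0, s1 = b1
    v_next  = [7; 0; 0];           % step 2: x1 = b1, x2 = 0, s1 = 0
    fprintf('start vertex feasible: %d\n', Aeq * v_start == beq);   % 1 (true)
    fprintf('next vertex feasible:  %d\n', Aeq * v_next  == beq);   % 1 (true)
    % The objective x1 grows from 0 to 7 in this single step.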
The reason for transforming the system in this way will become clear below. Generally, the transformation works as follows. Given:

Maximize z_P = c_1 x_1 + ... + c_n x_n
subject to
a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≤ b_1
⋮
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n ≤ b_m,
where x_i ≥ 0, b_i ≥ 0.

We find a variable x_j with a positive coefficient in the target function. The value of all variables in the target function is zero in the first step. The target function will thus increase from 0 to a positive value if we increase x_j to a positive value. Now consider all values of the form b_k / a_kj for those equations in which a_kj > 0 (in particular, a_kj is not zero). We then increase x_j to the smallest of these values b_k / a_kj, and we instead reduce the value of s_k to 0. Then the k-th equation again holds (after adjusting the values of both x_j and s_k), and the k-th equation is now solved for x_j. The result is an expression for x_j. Now we replace x_j in all other equations and in the target function by this resulting expression. We change the values s_i accordingly, so that the other equations hold.

In summary, one iteration does the following:
➊ Find a variable (here x_j).
➋ Find an equation (here the k-th equation).
➌ Adapt all other equations accordingly.

Notice
The following two invariants continue to hold after each iteration step:
▶ each equation contains exactly one variable with a value unequal to 0;
▶ all variables appearing in the target function have the value zero.

Since x-variables and s-variables are treated in the same way, the iteration step can be repeated. To repeat the iteration step, we need the system to be in the same general form as before the step. But there is a third invariant, which holds after each step: we can ensure that all right-hand sides (i.e. the values b_i) always remain non-negative.

We will now show that the third invariant always holds. x_j is increased to the smallest of the values b_k / a_kj. If the k-th equation is solved for x_j, we obtain

x_j = b_k / a_kj - (a_k1 / a_kj) x_1 - ... - (a_kn / a_kj) x_n - (1 / a_kj) s_k,

where the sum on the right omits the term for x_j itself. The l-th equation looks like this:

a_l1 x_1 + a_l2 x_2 + ... + a_lj x_j + ... + a_ln x_n + s_l = b_l.

Inserting the expression for x_j into the l-th equation, we obtain a new right-hand side in the l-th equation:

b_l - a_lj (b_k / a_kj).

If a_lj > 0, this expression is ≥ 0, since b_k / a_kj ≤ b_l / a_lj: we had specifically chosen b_k / a_kj to be the smallest such value. If a_lj ≤ 0, the new right-hand side is b_l plus a non-negative number, hence also ≥ 0. This shows that the third invariant always holds.

At each step the target function grows, and we move to a better vertex of the solution polyhedron. The termination condition is: no variable in the target function has a positive coefficient any more. If this happens, we can stop the iteration. In the above example we have at the end

Maximize 7 - x_2 - s_1.

Here all coefficients are equal to -1, i.e. the termination condition has been reached.
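The choices made in steps ➊ and ➋ (the entering variable x_j and the equation k found by the ratio test) can be sketched as a small MATLAB function. The data layout, with the system stored as A·x + s = b and the current objective coefficients in a vector c, is our own assumption:

MATLAB example (sketch)

    function [j, k, t] = choose_pivot(A, b, c)
    % CHOOSE_PIVOT  Select the entering variable j and the leaving equation k.
    % A is m x n, b >= 0 is m x 1, c holds the current objective coefficients.
      j = find(c > 0, 1);              % a variable with positive coefficient
      if isempty(j)
        k = []; t = [];                % termination condition reached
        return
      end
      rows = find(A(:, j) > 0);        % only equations with a_kj > 0 qualify
      % (if 'rows' is empty, the problem is unbounded in direction x_j; not handled here)
      ratio = b(rows) ./ A(rows, j);   % candidate values b_k / a_kj
      [t, idx] = min(ratio);           % x_j is increased to the smallest ratio t
      k = rows(idx);                   % the equation that will be solved for x_j
    end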
We now look at an example for LP0; it is based on data taken from [25].

Inequalities:
x_1 ≤ 120
x_2 ≤ 70
x_1 + x_2 ≤ 140
x_1 + 2 x_2 ≤ 180
Maximize 150 x_1 + 450 x_2

The first step is to transform this system to equation form (four inequalities result in four new variables):

Equations:
x_1 + s_1 = 120
x_2 + s_2 = 70
x_1 + x_2 + s_3 = 140
x_1 + 2 x_2 + s_4 = 180
Maximize 150 x_1 + 450 x_2

Iteration step 1 (start)
x_1 = 0, x_2 = 0, s_1 = 120, s_2 = 70, s_3 = 140, s_4 = 180

Iteration step 2
x_2 has a positive coefficient in the maximization expression; thus the value of x_2 can be raised to 70. Then we move s_2 down to zero (s_2 = 0). After solving for x_2, the values s_3 and s_4 must be adapted in order to keep the whole equation system valid. This yields a new vertex:

x_1 = 0, x_2 = 70, s_1 = 120, s_2 = 0, s_3 = 70, s_4 = 40

The new equation system (in the second iteration step) becomes, after solving equation 2 for x_2 (giving x_2 = 70 - s_2) and inserting:

x_1 + s_1 = 120
x_2 + s_2 = 70
x_1 + (70 - s_2) + s_3 = 140
x_1 + 2(70 - s_2) + s_4 = 180
Maximize 150 x_1 + 450 (70 - s_2)

Rewriting (continuing the second iteration step) gives:

x_1 + s_1 = 120
x_2 + s_2 = 70
x_1 - s_2 + s_3 = 70
x_1 - 2 s_2 + s_4 = 40
Maximize 150 x_1 + 31500 - 450 s_2

Note that the invariants still hold.

Iteration step 3
The only remaining variable with a positive coefficient in the target expression is x_1. x_1 can now be raised from 0 to 40; instead, s_4 moves down to 0. We solve the fourth equation for x_1, yielding

x_1 = 2 s_2 - s_4 + 40,

and insert this into the target expression. The expression

Maximize 150 x_1 + 31500 - 450 s_2

thus becomes

Maximize 150 (2 s_2 - s_4 + 40) + 31500 - 450 s_2,

hence

Maximize 37500 - 150 s_2 - 150 s_4,

i.e. the termination condition is reached.

A.2 Schematic representation of the iteration steps

To obtain a schematic representation we introduce some notation. We will call
null variable (N-variable): a variable currently having the value 0 (initially the x_i),
non-null variable (NN-variable): all remaining variables (initially the s_i).

Then the starting solution contains n null variables and m non-null variables. In the following table, we represent the system schematically; we represent only coefficients, not values of variables:

 x_1   x_2   s_1   s_2   s_3   s_4 |    b
   1     0     1     0     0     0 |  120
   0     1     0     1     0     0 |   70
   1     1     0     0     1     0 |  140
   1     2     0     0     0     1 |  180
 150   450     0     0     0     0 |    0

The null variables are x_1, x_2. The non-null variables are s_1, s_2, s_3, s_4.

Note
▶ Columns with exactly one entry of value 1 and zeros elsewhere belong to non-null variables (NN-variables).
▶ The remaining columns belong to null variables (N-variables).
▶ The value of an NN-variable is always the value of b taken from the row containing the entry 1; e.g., the value of s_2 in the above table is 70.

An iteration step can now be summarized as follows, similar to the case of Gaussian elimination: subtract an appropriate multiple of one line from the remaining lines, with the goal of transforming one NN-column into an N-column; then one N-column will in turn become an NN-column. In the above example, we subtract a multiple of line 2 from the remaining lines, so that s_2 becomes an N-variable and x_2 becomes an NN-variable.
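This row operation can be written down compactly. The following MATLAB sketch packs the whole scheme into a single matrix T whose rows are the equations plus, as the last row, the target function (this packaging is our own choice):

MATLAB example (sketch)

    function T = pivot_step(T, r, j)
    % PIVOT_STEP  One schematic iteration step on the tableau T.
    % Row r is the pivot row, column j the pivot column (chosen e.g. as in
    % the choose_pivot sketch above).
      T(r, :) = T(r, :) / T(r, j);            % normalize the pivot row
      for i = 1:size(T, 1)
        if i ~= r
          % subtract a multiple of the pivot row so that column j
          % becomes a unit column (an NN-column)
          T(i, :) = T(i, :) - T(i, j) * T(r, :);
        end
      end
    end

    % Applied with r = 2, j = 2 to the starting table above (x_2 enters, s_2
    % leaves), this reproduces the table shown after the complete iteration
    % step below, except that the last entry of the target row appears as
    % -31500: the code tracks the negative of the current target value.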
We execute this for line 3:

 x_1   x_2   s_1   s_2   s_3   s_4 |    b
   1     0     1     0     0     0 |  120
   0     1     0     1     0     0 |   70
   1     0     0    -1     1     0 |   70
   1     2     0     0     0     1 |  180
 150   450     0     0     0     0 |    0

After execution of the complete iteration step we have the following (in the book the NN-columns are shaded; they are now x_2, s_1, s_3, s_4):

 x_1   x_2   s_1   s_2   s_3   s_4 |     b
   1     0     1     0     0     0 |   120
   0     1     0     1     0     0 |    70
   1     0     0    -1     1     0 |    70
   1     0     0    -2     0     1 |    40
 150     0     0  -450     0     0 | 31500

We compare this to the above computation for the same example: x_2 is now an NN-variable with value 70, and s_2 is now an N-variable (value 0).

A.3 Transition from LP0 to LP

Above we have described a method for solving LP0. For LP0 we had assumed all b-values to be non-negative. We will now consider the case of a general LP problem. For LP0, the start point of the iteration can always be found directly: we have used the point (0, ..., 0)^T. In a general LP problem, the point (0, ..., 0)^T might not be feasible. Hence, our goal is to find a feasible point for a general LP. Surprisingly, this can be done by applying the iteration steps introduced above for LP0: to find the start point, such iteration steps are applied to a modified LP problem.

The general LP problem has the following form:

Maximize c_1 x_1 + ... + c_n x_n
under the constraints
a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≤ b_1
⋮
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n ≤ b_m
a'_11 x_1 + a'_12 x_2 + ... + a'_1n x_n ≥ b'_1
⋮
a'_l1 x_1 + a'_l2 x_2 + ... + a'_ln x_n ≥ b'_l.

Thus, here there are ≥-inequalities as well (all the a'_ij-lines). But we can again assume that all right-hand sides b_i and b'_j are ≥ 0 (if necessary, multiply a constraint by -1, which turns a ≤-inequality into a ≥-inequality or vice versa).

As above, we first transform to equation form; for the ≥-inequalities the additional (surplus) variables are subtracted:

a_11 x_1 + ... + a_1n x_n + s_1 = b_1
⋮
a_m1 x_1 + ... + a_mn x_n + s_m = b_m
a'_11 x_1 + ... + a'_1n x_n - s'_1 = b'_1
⋮
a'_l1 x_1 + ... + a'_ln x_n - s'_l = b'_l

We now consider the extended system:

Maximize the expression -z_1 - ... - z_l
under the constraints
a_11 x_1 + ... + a_1n x_n + s_1 = b_1
⋮
a_m1 x_1 + ... + a_mn x_n + s_m = b_m
a'_11 x_1 + ... + a'_1n x_n - s'_1 = b'_1 - z_1
⋮
a'_l1 x_1 + ... + a'_ln x_n - s'_l = b'_l - z_l.

This extended system has new variables z_j ≥ 0.

Notice
The extended problem thus stated is in LP0-form: the point with all x_i = 0, all s'_j = 0, s_i = b_i and z_j = b'_j is an admissible start vertex. If this problem admits a solution in which all z_j = 0, then this solution can be used as the start vertex for solving the original problem. On the other hand, if the maximum of -z_1 - ... - z_l is not 0, then there is no feasible point for the original problem.

Summary (Method for LP)
The simplex algorithm has two phases:
Phase 1: find an arbitrary vertex of the solution polyhedron.
Phase 2: iteratively improve this vertex, until an extremal vertex in the c-direction is reached.
The process for phase 2 is illustrated in the following figure.

Figure 79 - Concept of phase 2 of the simplex algorithm (starting from the point found in phase 1, the method moves along the boundary of the feasible region in the direction of c)

Steps
▶ A subset of n inequalities (implicit or explicit) determines a vertex.
▶ At the end of phase 1 we have a vertex of the solution polyhedron, if one exists.
▶ The method always remains on the boundary of the solution polyhedron and moves from vertex to vertex.
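As a cross-check of the worked LP0 example above, the same problem can be handed to MATLAB's linprog (Optimization Toolbox). linprog minimizes, so the objective is negated:

MATLAB example (sketch)

    f  = [-150; -450];                 % maximize 150*x1 + 450*x2
    A  = [1 0; 0 1; 1 1; 1 2];
    b  = [120; 70; 140; 180];
    lb = [0; 0];                       % x1, x2 >= 0
    [x, fval] = linprog(f, A, b, [], [], lb, []);
    % Expected result: x = [40; 70] and -fval = 37500, matching the value
    % reached when the simplex iteration above terminates.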
A.4 Computing time and complexity issues

With commercial implementations, systems with well more than 10,000 to 20,000 inequalities can be solved. The observed growth of computing time with the number m of inequalities is roughly m³; the growth is roughly linear in the number of variables.

The Karmarkar method
The simplex method crawls on the surface of the feasible polyhedron. The Karmarkar method moves through the interior of the polyhedron; hence it is called an interior point method. The complexity of linear programming was determined by Karmarkar in 1984. It is O(L d^3.5), where L is the sum of the lengths of the input coefficients and d is the number of variables. If the length of each individual coefficient a_ij, b_i, c_i is constant (i.e. bounded by a constant), then L = O(m d), where m is the number of inequalities. Hence, the total complexity is O(m d^4.5) if the length of the coefficients is constant. Ongoing research on the combinatorial complexity of the LP problem tries to find a bound in m and d alone, where the above assumption (constant coefficient length) is not needed. Surprisingly, no such combinatorially polynomial bounds (in both m and d) are known to date, despite the above theorem by Karmarkar.

References

[1] G. Gordon, "Linear Programming, Lagrange Multipliers and Duality," Carnegie Mellon University, Pittsburgh, PA, 2009.
[2] M. G. Genton, "Classes of Kernels for Machine Learning: A Statistics Perspective," Journal of Machine Learning Research, vol. 2, pp. 299-312, 2001.
[3] B. Schölkopf, R. Herbrich and A. J. Smola, "A Generalized Representer Theorem," in International Conference on Computational Learning Theory, Amsterdam, 2001, vol. 2111 of LNCS, pp. 416-426.
[4] J. Shawe-Taylor and N. Cristianini, Support Vector Machines, Cambridge, MA: Cambridge University Press, 2000.
[5] J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," in Advances in Kernel Methods - Support Vector Learning, Cambridge, MA: MIT Press, 1998, pp. 185-208.
[6] L. M. Bregman, "The Relaxation Method for Finding the Common Point of Convex Sets and its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, iss. 3, pp. 200-217, 1967.
[7] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, pp. 386-408, 1958.
[8] R. Dürichen, From Univariate to Multivariate Respiratory Motion Compensation: A Bayesian Way to Increase Treatment Accuracy in Robotic Radiotherapy, Dissertation, University of Lübeck, 2015.
[9] C. M. Bishop, Pattern Recognition and Machine Learning, New York: Springer, 2006.
[10] C. E. Rasmussen and C. K. I. Williams, "Gaussian Processes in Machine Learning," in Advanced Lectures on Machine Learning, vol. 3176, New York: Springer, 2004, pp. 63-71.
[11] R. Dürichen, M. A. F. Pimentel, L. Clifton, A. Schweikard and D. A. Clifton, "Multi-task Gaussian Processes for Multivariate Physiological Time-Series Analysis," IEEE Transactions on Biomedical Engineering, vol. 62, iss. 1, pp. 314-322, 2014.
[12] E. V. Bonilla, K. M. A. Chai and C. K. I. Williams, "Multi-task Gaussian Process Prediction," in Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2008, pp. 153-160.
[13] P. H. Winston, Artificial Intelligence, 3rd ed., Reading, MA: Addison-Wesley, 1990.
[14] V. N. Vapnik, The Nature of Statistical Learning Theory, New York: Springer, 2000.
[15] R. J. Vanderbei, Linear Programming: Foundations and Extensions, 3rd ed., Berlin, Heidelberg, New York: Springer, 2010.
[16] A. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," 1998.
[17] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Upper Saddle River, NJ: Prentice Hall, 2003.
[18] R. Neapolitan, Probabilistic Reasoning in Expert Systems: Theory and Applications, New York: Wiley, 1990.
[19] J. Ma, J. Theiler and S. Perkins, "Accurate Online Support Vector Regression," Neural Computation, vol. 15, iss. 11, pp. 2683-2703, 2003.
[20] T. Ferguson, "Linear Programming - A Concise Introduction," University of California at Los Angeles, Los Angeles, 2015. Electronic resource: www.math.ucla.edu/~tom/LP.pdf
[21] F. Ernst and A. Schweikard, "Forecasting Respiratory Motion with Accurate Online Support Vector Regression (SVRpred)," International Journal of Computer Assisted Radiology and Surgery, vol. 4, iss. 5, pp. 439-447, 2009.
[22] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, iss. 3, pp. 273-297, 1995.
[23] G. Cauwenberghs and T. Poggio, "Incremental and Decremental Support Vector Machine Learning," in NIPS'00: Proceedings of the 13th International Conference on Neural Information Processing Systems, Denver, CO, 2000, pp. 388-394.
[24] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 144-152.
[25] F. Reinhardt and H. Soeder, dtv-Atlas zur Mathematik, 1974.

Index

A: attribute 11
B: Bayes' theorem 101, 103 · Bayesian network 121
C: Cauchy-Schwarz inequality 61 · causal networks 121
D: DLP (dual linear program) 37 · DQP 38 · dual 37 · dual LP 37
E: equilibrium theorem 42 · error back-propagation 96 · exhaustive 124 · exhaustive means 124 · expert system 121
F: F (abbreviation) 17 · feasible 16 · feature space 53 · forward propagation 95
G: Gaussian distributions 109 · Gaussian kernel 59 · Gaussian process model (GP) 108 · genetic algorithms 98 · GP 108 · Gram matrix 58
H: hidden layer 95 · Hilbert space 60 · hyperparameters 112
I: identity matrix 26 · independence assumptions 126 · inner product 59 · inner product space 60
J: joint distribution 124
K: Karmarkar 150 · Karmarkar method 150 · Karush-Kuhn-Tucker conditions 39 · kernel 57 · kernel matrix 58
L: Lagrange multiplier 45 · Lagrangian 46 · least mean square (LMS) 103 · likelihood 103 · linear feasibility test 17 · linear inequalities 16 · linear programming 16 · LMS 103 · LP 139 · LP (linear programming) 19 · LP0 139
M: MAP 106 · margin 92 · margin violation 34 · maximum a posteriori (MAP) 106 · maximum margin 27 · measurement noise 111 · Mercer's theorem 59 · MTGP 114, 115 · multi-task Gaussian process models (MTGP) 114 · mutation 99 · mutually exclusive 124
N: negative logarithmic marginal likelihood (NLML) 112 · neural networks 94 · NLML 112, 116 · non-null variable 146 · null variable 146
P: perceptron 89 · positive definite 26 · positive semidefinite 26 · pre-Hilbert space 60 · primal 37 · propositional variables 124 · PROSPECTOR 122
Q: QP (quadratic programming) 25
R: recombination 98 · reproducing kernel Hilbert space (RKHS) 64 · reproducing property 64 · RKHS 64 · RMSE 118 · root mean square error (RMSE) 118
S: sample 11 · sequential minimal optimization (SMO) 67 · simplex method 139 · single-task GP models (STGP) 114 · slack variables 34, 80, 81 · solution polyhedron 19, 139 · squared exponential kernel function 113 · STGP 114, 115 · support vector expansion 40 · support vectors 40, 65 · symmetric 26
T: termination condition 144 · training phase 11
V: vertex 21
W: weight vector 26