SVM and Complementary Slackness
David Rosenberg, New York University
DS-GA 1003, February 21, 2017
SVM Review: Primal and Dual Formulations
Support Vector Machine

Hypothesis space: $\mathcal{F} = \left\{ f(x) = w^T x + b \mid w \in \mathbb{R}^d,\, b \in \mathbb{R} \right\}$.
$\ell_2$ regularization (Tikhonov style).
Loss: $\ell(m) = \max\{1 - m,\, 0\} = (1 - m)_+$ (the hinge loss).
The SVM prediction function is the solution to
$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \max\left(0,\, 1 - y_i\left[w^T x_i + b\right]\right).$$
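To make the loss and objective concrete, here is a minimal NumPy sketch (not from the slides; `X`, `y`, `w`, `b`, and `c` are assumed to be a float data matrix, a label vector in {-1, +1}, weights, bias, and the regularization trade-off, respectively):

```python
import numpy as np

def hinge_loss(margins):
    """Hinge loss l(m) = max(1 - m, 0), applied elementwise."""
    return np.maximum(0.0, 1.0 - margins)

def svm_objective(w, b, X, y, c):
    """SVM objective: (1/2)||w||^2 + (c/n) * sum of hinge losses."""
    margins = y * (X @ w + b)  # y_i * f(x_i) for each example
    return 0.5 * w @ w + (c / len(y)) * hinge_loss(margins).sum()
```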
SVM as a Quadratic Program

The SVM optimization problem is equivalent to
$$\begin{array}{ll} \text{minimize} & \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i \\ \text{subject to} & -\xi_i \le 0 \quad \text{for } i = 1,\ldots,n \\ & \left(1 - y_i\left[w^T x_i + b\right]\right) - \xi_i \le 0 \quad \text{for } i = 1,\ldots,n \end{array}$$

Differentiable objective function, $2n$ affine constraints.
A quadratic program that can be solved by any off-the-shelf QP solver (see the sketch below).
Let's learn more by examining the dual.
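As a hedged illustration of "off-the-shelf QP solver", here is a minimal sketch using cvxopt; it is not part of the slides, and it assumes `X` is an `(n, d)` float array and `y` an `n`-vector with entries in {-1, +1}:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_primal_qp(X, y, c):
    """Solve the SVM primal QP over z = (w, b, xi) with cvxopt.

    cvxopt solves: min (1/2) z^T P z + q^T z  s.t.  G z <= h.
    """
    n, d = X.shape
    # Objective: (1/2)||w||^2 + (c/n) sum(xi); b and xi get no quadratic term.
    P = np.zeros((d + 1 + n, d + 1 + n))
    P[:d, :d] = np.eye(d)
    q = np.hstack([np.zeros(d + 1), (c / n) * np.ones(n)])
    # Constraint block 1:  -xi_i <= 0
    G1 = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
    h1 = np.zeros(n)
    # Constraint block 2:  1 - y_i (w^T x_i + b) - xi_i <= 0
    G2 = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    h2 = -np.ones(n)
    G, h = np.vstack([G1, G2]), np.hstack([h1, h2])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d], z[d + 1:]  # w, b, xi
```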
SVM Lagrange Multipliers

$$\begin{array}{ll} \text{minimize} & \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i \\ \text{subject to} & -\xi_i \le 0 \quad \text{for } i = 1,\ldots,n \\ & \left(1 - y_i\left[w^T x_i + b\right]\right) - \xi_i \le 0 \quad \text{for } i = 1,\ldots,n \end{array}$$

Constraint: $-\xi_i \le 0$, Lagrange multiplier $\lambda_i$.
Constraint: $\left(1 - y_i\left[w^T x_i + b\right]\right) - \xi_i \le 0$, Lagrange multiplier $\alpha_i$.

The Lagrangian:
$$L(w,b,\xi,\alpha,\lambda) = \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i\left(1 - y_i\left[w^T x_i + b\right] - \xi_i\right) + \sum_{i=1}^n \lambda_i\left(-\xi_i\right).$$
SVM Primal and Dual

Lagrangian:
$$L(w,b,\xi,\alpha,\lambda) = \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i\left(1 - y_i\left[w^T x_i + b\right] - \xi_i\right) + \sum_{i=1}^n \lambda_i\left(-\xi_i\right)$$

Primal and dual formulations:
$$p^* = \inf_{w,b,\xi}\; \underbrace{\sup_{\alpha,\lambda \succeq 0} L(w,b,\xi,\alpha,\lambda)}_{\text{primal objective}} \qquad d^* = \sup_{\alpha,\lambda \succeq 0}\; \underbrace{\inf_{w,b,\xi} L(w,b,\xi,\alpha,\lambda)}_{\text{dual objective } g(\alpha,\lambda)}$$

The constraints are satisfied by $w = 0$, $b = 0$, and $\xi_i = 1$ for $i = 1,\ldots,n$; since all constraints are affine, feasibility is enough.
So we have strong duality by Slater's conditions. That is: $p^* = d^*$.
First-Order Conditions (KKT)

Suppose $(w^*, b^*, \xi^*)$ is primal optimal and $(\alpha^*, \lambda^*)$ is dual optimal.
By strong duality, $p^* = d^*$, and by the sandwich proof, $p^* = d^* = L(w^*, b^*, \xi^*, \alpha^*, \lambda^*)$.
Since $L$ is differentiable, we must have the first-order condition
$$\partial_w L(w^*, b^*, \xi^*, \alpha^*, \lambda^*) = 0 \implies w^* = \sum_{i=1}^n \alpha_i^* y_i x_i \quad \text{(shown last week)}.$$
Conclude that $w^*$ is in the span of the data, i.e. a linear combination of $x_1, \ldots, x_n$.
The SVM Dual Problem

We found the SVM dual problem can be written as:
$$\begin{array}{ll} \sup_{\alpha} & \displaystyle\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\ \text{s.t.} & \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \\ & \alpha_i \in \left[0, \frac{c}{n}\right] \quad \text{for } i = 1,\ldots,n. \end{array}$$

Given a solution $\alpha^*$ to the dual, the primal solution is $w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$.
Note $\alpha_i^* \in \left[0, \frac{c}{n}\right]$. So $c$ controls the maximum weight on each example. (Robustness!)
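A hedged sketch of solving this dual with cvxopt (again assuming `X` is `(n, d)` and `y` has entries in {-1, +1}; all names are mine). cvxopt minimizes, so we negate the objective:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, y, c):
    """Solve the SVM dual as: min (1/2) a^T Q a - 1^T a
    s.t. 0 <= a_i <= c/n and y^T a = 0, where Q_ij = y_i y_j x_i^T x_j."""
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                           # Q = (y y^T) * (X X^T)
    q = -np.ones(n)
    G = np.vstack([-np.eye(n), np.eye(n)])  # -a <= 0  and  a <= c/n
    h = np.hstack([np.zeros(n), (c / n) * np.ones(n)])
    A = y[None, :].astype(float)            # equality constraint y^T a = 0
    b = np.zeros(1)
    sol = solvers.qp(matrix(Q), matrix(q), matrix(G), matrix(h),
                     matrix(A), matrix(b))
    alpha = np.array(sol["x"]).ravel()
    w = (alpha * y) @ X                     # recover w* = sum_i alpha_i y_i x_i
    return alpha, w
```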
Insights From Complementary Slackness: Margin and Support Vectors
The Margin and Some Terminology

For notational convenience, define $f(x) = x^T w + b$.
The margin is $y f(x)$.
Incorrect classification: $y f(x) \le 0$.
Margin error: $y f(x) < 1$.
"On the margin": $y f(x) = 1$.
"Good side of the margin": $y f(x) > 1$.
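A small illustrative sketch of this terminology (the function name and the tolerance handling are my own, not from the slides):

```python
import numpy as np

def margin_status(w, b, X, y):
    """Label each example by its margin y_i * f(x_i), where f(x) = x^T w + b."""
    m = y * (X @ w + b)
    return np.select(
        [m <= 0, np.isclose(m, 1.0), m < 1],
        ["incorrect", "on the margin", "margin error"],
        default="good side",
    )
```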
Support Vectors and the Margin

Recall the slack variable $\xi_i = \max\left(0,\, 1 - y_i f(x_i)\right)$ is the hinge loss on $(x_i, y_i)$.
Suppose $\xi_i = 0$. Then $y_i f(x_i) \ge 1$: the example is on the margin ($= 1$) or on the good side ($> 1$).
Complementary Slackness Conditions

Recall our primal constraints and Lagrange multipliers:
Constraint: $-\xi_i \le 0$, Lagrange multiplier $\lambda_i$.
Constraint: $\left(1 - y_i f(x_i)\right) - \xi_i \le 0$, Lagrange multiplier $\alpha_i$.

Recall the first-order condition $\partial_{\xi_i} L = 0$ gave us $\lambda_i^* = \frac{c}{n} - \alpha_i^*$.
By strong duality, we must have complementary slackness:
$$\alpha_i^*\left(1 - y_i f^*(x_i) - \xi_i^*\right) = 0$$
$$\lambda_i^* \xi_i^* = \left(\frac{c}{n} - \alpha_i^*\right)\xi_i^* = 0$$
Consequences of Complementary Slackness

By strong duality, we must have complementary slackness:
$$\alpha_i^*\left(1 - y_i f^*(x_i) - \xi_i^*\right) = 0 \qquad \left(\frac{c}{n} - \alpha_i^*\right)\xi_i^* = 0$$

If $y_i f^*(x_i) > 1$, then the margin loss is $\xi_i^* = 0$, and we get $\alpha_i^* = 0$.
If $y_i f^*(x_i) < 1$, then the margin loss is $\xi_i^* > 0$, so $\alpha_i^* = \frac{c}{n}$.
If $\alpha_i^* = 0$, then $\xi_i^* = 0$, which implies no loss, so $y_i f^*(x_i) \ge 1$.
If $\alpha_i^* \in \left(0, \frac{c}{n}\right)$, then $\xi_i^* = 0$, which implies $1 - y_i f^*(x_i) = 0$.
Complementary Slackness Results: Summary

$$\begin{aligned} \alpha_i^* = 0 &\implies y_i f^*(x_i) \ge 1 \\ \alpha_i^* \in \left(0, \tfrac{c}{n}\right) &\implies y_i f^*(x_i) = 1 \\ \alpha_i^* = \tfrac{c}{n} &\implies y_i f^*(x_i) \le 1 \end{aligned}$$

$$\begin{aligned} y_i f^*(x_i) < 1 &\implies \alpha_i^* = \tfrac{c}{n} \\ y_i f^*(x_i) = 1 &\implies \alpha_i^* \in \left[0, \tfrac{c}{n}\right] \\ y_i f^*(x_i) > 1 &\implies \alpha_i^* = 0 \end{aligned}$$
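These implications are easy to check numerically. A minimal sketch (my own naming, with a tolerance because solver output is inexact) that buckets training points by their dual variables, given `alpha` from a dual solver such as the one sketched earlier:

```python
import numpy as np

def dual_buckets(alpha, c, n, tol=1e-8):
    """Split examples by alpha_i: zero, strictly inside (0, c/n), or at c/n."""
    upper = c / n
    at_zero = alpha <= tol              # implies y_i f*(x_i) >= 1
    at_upper = alpha >= upper - tol     # implies y_i f*(x_i) <= 1
    interior = ~at_zero & ~at_upper     # implies y_i f*(x_i) = 1 exactly
    return at_zero, interior, at_upper
```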
Support Vectors

If $\alpha^*$ is a solution to the dual problem, then the primal solution is
$$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i \quad \text{with } \alpha_i^* \in \left[0, \tfrac{c}{n}\right].$$
The $x_i$'s corresponding to $\alpha_i^* > 0$ are called support vectors.
Few margin errors or "on the margin" examples $\implies$ sparsity in input examples.
Complementary Slackness To Get b
The Bias Term: b

For our SVM primal, the complementary slackness conditions are:
$$\alpha_i^*\left(1 - y_i\left[x_i^T w^* + b^*\right] - \xi_i^*\right) = 0 \tag{1}$$
$$\lambda_i^* \xi_i^* = \left(\tfrac{c}{n} - \alpha_i^*\right)\xi_i^* = 0 \tag{2}$$

Suppose there is an $i$ such that $\alpha_i^* \in \left(0, \tfrac{c}{n}\right)$.
(2) implies $\xi_i^* = 0$.
(1) implies
$$y_i\left[x_i^T w^* + b^*\right] = 1 \iff x_i^T w^* + b^* = y_i \quad \text{(use } y_i \in \{-1, 1\}\text{)} \iff b^* = y_i - x_i^T w^*$$
The Bias Term: b

The optimal $b$ is $b^* = y_i - x_i^T w^*$.
We get the same $b^*$ for any choice of $i$ with $\alpha_i^* \in \left(0, \tfrac{c}{n}\right)$, at least with exact calculations.
With numerical error, it is more robust to average over all eligible $i$'s:
$$b^* = \operatorname{mean}\left\{ y_i - x_i^T w^* \;\middle|\; \alpha_i^* \in \left(0, \tfrac{c}{n}\right) \right\}.$$
What if there are no $\alpha_i^* \in \left(0, \tfrac{c}{n}\right)$? Then we have a degenerate SVM training problem¹ ($w^* = 0$).

¹ See Rifkin et al.'s "A Note on Support Vector Machine Degeneracy", an MIT AI Lab Technical Report.
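A short sketch of this averaging rule, continuing the notation of the earlier snippets (the function name and tolerance are mine, not from the slides):

```python
import numpy as np

def intercept_from_dual(alpha, w, X, y, c, tol=1e-8):
    """b* = mean over i with 0 < alpha_i < c/n of (y_i - x_i^T w*)."""
    n = len(y)
    eligible = (alpha > tol) & (alpha < c / n - tol)
    if not eligible.any():
        raise ValueError("no alpha_i strictly inside (0, c/n): degenerate SVM")
    return np.mean(y[eligible] - X[eligible] @ w)
```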
Teaser for Kernelization
Dual Problem: Dependence on x Through Inner Products

SVM dual problem:
$$\begin{array}{ll} \sup_{\alpha} & \displaystyle\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\ \text{s.t.} & \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \\ & \alpha_i \in \left[0, \frac{c}{n}\right] \quad \text{for } i = 1,\ldots,n. \end{array}$$

Note that all dependence on the inputs $x_i$ and $x_j$ is through their inner product: $\langle x_j, x_i \rangle = x_j^T x_i$.
We can replace $x_j^T x_i$ by any other inner product...
This is a kernelized objective function.
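To make the teaser concrete, a hedged sketch: swap the Gram matrix $X X^T$ in the dual for a kernel matrix $K$, for instance an RBF kernel (the function names are mine, and `svm_dual_qp` refers to the earlier sketch):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X1_i - X2_j||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# In svm_dual_qp above, replace
#     Q = Yx @ Yx.T                       # (y y^T) * (X X^T)
# with
#     K = rbf_kernel(X, X)
#     Q = (y[:, None] * y[None, :]) * K   # (y y^T) * K
# Predictions then use f(x) = sum_i alpha_i y_i k(x_i, x) + b
# instead of w^T x + b, since w may live only in an implicit feature space.
```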