SVM from scratch and comparison with that of Sklearn #8

Open
wants to merge 2 commits into master
70 changes: 29 additions & 41 deletions assignment_3/README.md
@@ -1,46 +1,34 @@

Removed (original assignment brief):

## Problem Statement

Create a Linear SVM Classifier from scratch using NumPy or similar libraries.
Train your model on a toy dataset (say, 500 datapoints per class for binary classification) that is linearly separable.

Compare your implementation in terms of runtime and accuracy with the one in [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

You may refer to [this](https://datascience.stackexchange.com/questions/39071/create-a-binary-classification-dataset-python-sklearn-datasets-make-classifica) for building a dataset on similar grounds.

## Submission Guidelines

1. Make a proper README.md including:
   - Approach
   - Dataset
   - Model
   - Results (including comparison with sklearn)
   - Steps to run

2. Include the requirements.txt file.

3. Make a predict.py file so that a user may test the model on the dataset.

4. Strictly submit the project scripts as .py files. (You may use notebooks, but export them to .py files before submitting.)

5. Add docstrings to functions wherever possible.

   Example: a function which applies some Gaussian distance filter, taking two args.

       def foo(arg1, arg2):
           """
           Apply Gaussian distance filter to a numpy distance array

           Parameters
           ----------
           arg1 : np.array
               Description of arg1
           arg2 : int
               Description of arg2

           Returns
           -------
           expanded_distance : (n+1)-d array
               Expanded distance matrix with the last dimension of length
               len(self.filter)
           """

Added (new README):

# SVM From Scratch

## Approach

1. Generated a random dataset by defining two cluster centers and scattering random points at a fixed spread around each.

2. Assigned one cluster the label +1 and the other the label -1, so a separating hyperplane must satisfy:

       xi.w + b <= -1 if yi = -1 (xi belongs to the negative class)
       xi.w + b >= +1 if yi = +1 (xi belongs to the positive class)

3. Made a class named SupportVecMac with three methods:
   - fit: finds the required w and b from the training data
   - predict: predicts the labels of new data points
   - visualize: plots the data, the margins, and the decision boundary

## Algorithm

1. Started w at a large value, in this case 10 times the maximum feature value, and decreased it slowly.

2. Set the initial step size to 0.1 times the maximum feature value, refining it to 0.01 and then 0.001 times that value on later passes.

3. Swept b over the range -b0 < b < +b0 in steps of step * b_multiple, where b0 is b_range_multiple times the maximum feature value.

4. For each candidate, checked every point xi in the dataset against all four sign transformations of w, i.e. (w, w), (-w, w), (w, -w), (-w, -w):
   if yi(xi.w + b) >= 1 fails for any point, the candidate is discarded;
   otherwise |w| is stored as a dictionary key with (w, b) as its value, and the (w, b) with the smallest |w| is kept.

   Once w crosses zero the step size is reduced; there is no need to search negative w separately, since the transformations already cover that.

## Results

Comparing the measured runtimes shows that sklearn's SVM takes much less time than the one written from scratch.

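The grid search described in the new README is a brute-force approximation of the standard hard-margin SVM problem, which in the notation of the constraints above reads:

$$
\min_{w,\,b} \lVert w \rVert \quad \text{subject to} \quad y_i \, (x_i \cdot w + b) \ge 1 \;\; \text{for all } i
$$

Minimizing $\lVert w \rVert$ maximizes the margin width $2/\lVert w \rVert$, which is why the algorithm keys its candidate dictionary by $\lVert w \rVert$ and keeps the smallest entry.
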
147 changes: 147 additions & 0 deletions assignment_3/SVMassignment.py
@@ -0,0 +1,147 @@
import matplotlib.pyplot as plt
import numpy as np

# Two cluster centres and a spread; each cluster gets 50 points.
center1 = (70, 60)
center2 = (90, 20)
distance = 10

x1 = np.random.uniform(center1[0], center1[0] + distance, size=(50,))
y1 = np.random.normal(center1[1], distance, size=(50,))

x2 = np.random.uniform(center2[0], center2[0] + distance, size=(50,))
y2 = np.random.normal(center2[1], distance, size=(50,))
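
# The removed assignment brief pointed at sklearn's dataset helpers as an
# alternative way to build such data; a minimal sketch (Xb and yb are
# illustrative names and are not used below):
from sklearn.datasets import make_blobs
Xb, yb = make_blobs(n_samples=100, centers=[center1, center2],
                    cluster_std=distance / 2, random_state=0)
yb = np.where(yb == 0, -1, 1)  # map make_blobs labels {0, 1} to {-1, +1}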


class SupportVecMac:
    def __init__(self, visualization=True):
        self.visualization = visualization
        self.colors = {1: 'r', -1: 'b'}
        if self.visualization:
            self.fig = plt.figure()
            self.axis = self.fig.add_subplot(1, 1, 1)

    # Method for training: brute-force search for the smallest ||w||
    # that still classifies every training point correctly.
    def fit(self, data):
        self.data = data
        # opt_dict maps ||w|| -> [w, b] for every candidate that satisfies
        # the constraints; the smallest norm wins.
        opt_dict = {}

        transforms = [[1, 1], [-1, 1], [1, -1], [-1, -1]]

        # Flatten all feature values to find the search range.
        all_data = []
        for yi in self.data:
            for featureset in self.data[yi]:
                for feature in featureset:
                    all_data.append(feature)

        self.max_feature_value = max(all_data)
        self.min_feature_value = min(all_data)
        all_data = None

        # Progressively finer step sizes for the coarse-to-fine search.
        step_sizes = [self.max_feature_value * 0.1,
                      self.max_feature_value * 0.01,
                      self.max_feature_value * 0.001]
        b_range_multiple = 5
        b_multiple = 5
        latest_optimum = self.max_feature_value * 10

        for step in step_sizes:
            w = np.array([latest_optimum, latest_optimum])
            optimized = False
            while not optimized:
                for b in np.arange(-1 * (self.max_feature_value * b_range_multiple),
                                   self.max_feature_value * b_range_multiple,
                                   step * b_multiple):
                    for transform in transforms:
                        w_t = w * transform

                        # A candidate is kept only if every point satisfies
                        # yi * (xi . w + b) >= 1.
                        violated = False
                        for yi in self.data:
                            for xi in self.data[yi]:
                                if not yi * (np.dot(xi, w_t) + b) >= 1:
                                    violated = True
                        if not violated:
                            opt_dict[np.linalg.norm(w_t)] = [w_t, b]

                if w[0] < 0:
                    # The sweep has crossed zero; the sign transforms already
                    # cover negative w, so move on to a finer step size.
                    optimized = True
                    print('Reducing step size')
                else:
                    w = w - step

            norms = sorted(norm for norm in opt_dict)
            opt_choice = opt_dict[norms[0]]
            self.w = opt_choice[0]
            self.b = opt_choice[1]
            # Restart the next, finer pass just above the current optimum.
            latest_optimum = opt_choice[0][0] + step * 2


    # Method for predicting the class of new data points
    def predict(self, features):
        # The predicted class is the sign of (x . w + b).
        return np.sign(np.dot(np.array(features), self.w) + self.b)

    # Plot the data, the two margins (w.x + b = +/-1) and the decision boundary.
    def visualize(self, data):
        [[self.axis.scatter(x[0], x[1], s=100, c=self.colors[i]) for x in data[i]] for i in data]

        # Solve w.x + b = v for the second coordinate, given the first:
        # x2 = (-w1*x1 - b + v) / w2.
        def hyperplane(x, w, b, v):
            return (-w[0] * x - b + v) / w[1]

        hyp_x_min = self.min_feature_value * 0.9
        hyp_x_max = self.max_feature_value * 1.1

        # Positive support vector hyperplane: (w.x + b) = 1
        psv1 = hyperplane(hyp_x_min, self.w, self.b, 1)
        psv2 = hyperplane(hyp_x_max, self.w, self.b, 1)
        self.axis.plot([hyp_x_min, hyp_x_max], [psv1, psv2], 'k')

        # Negative support vector hyperplane: (w.x + b) = -1
        nsv1 = hyperplane(hyp_x_min, self.w, self.b, -1)
        nsv2 = hyperplane(hyp_x_max, self.w, self.b, -1)
        self.axis.plot([hyp_x_min, hyp_x_max], [nsv1, nsv2], 'k')

        # Decision boundary: (w.x + b) = 0
        db1 = hyperplane(hyp_x_min, self.w, self.b, 0)
        db2 = hyperplane(hyp_x_max, self.w, self.b, 0)
        self.axis.plot([hyp_x_min, hyp_x_max], [db1, db2], 'y--')

        plt.show()





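# Usage sketch for the class above (toy values for illustration; the actual
# run on the generated clusters follows below):
#
#   toy = {-1: np.array([[1, 7], [2, 8], [3, 8]]),
#           1: np.array([[5, 1], [6, -1], [7, 3]])}
#   clf = SupportVecMac(visualization=False)
#   clf.fit(data=toy)
#   clf.predict([4, 5])  # returns the sign of (x . w + b): -1.0 or 1.0
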
# Quick look at the raw clusters before training.
plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()

# Assemble the dict consumed by fit: {class label: array of points}.
c1 = np.column_stack((x1, y1))
c2 = np.column_stack((x2, y2))
data_dict = {-1: c1, 1: c2}


import time

start_time = time.time()
svm = SupportVecMac()  # linear kernel
svm.fit(data=data_dict)
end_time = time.time()
print("SVM from scratch: " + str(1000 * (end_time - start_time)) + " ms")

# Plot outside the timed section so drawing does not skew the comparison.
svm.visualize(data=data_dict)
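
# The brief also asks for an accuracy comparison; a minimal sketch of one,
# scoring the scratch model on its own training data:
correct = 0
total = 0
for label in data_dict:
    for point in data_dict[label]:
        correct += int(svm.predict(point) == label)
        total += 1
print("Scratch SVM training accuracy: " + str(correct / total))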



# Comparison with sklearn
from sklearn.svm import SVC

X = np.vstack((c1, c2))
# Labels must match data_dict: c1 is the -1 class, c2 is the +1 class.
y = np.vstack((np.ones((50, 1)) * -1, np.ones((50, 1))))

start_time = time.time()
classifier = SVC(kernel='linear')
classifier.fit(X, y.ravel())
end_time = time.time()
print("sklearn SVM: " + str(1000 * (end_time - start_time)) + " ms")
print("sklearn's SVM trains in far less time than the scratch implementation.")
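
# Matching accuracy check for sklearn, plus a point-by-point agreement count
# between the two models (a sketch; assumes the script above has run):
print("sklearn training accuracy: " + str(classifier.score(X, y.ravel())))
agree = sum(int(svm.predict(x) == p) for x, p in zip(X, classifier.predict(X)))
print("Models agree on " + str(agree) + "/" + str(len(X)) + " training points")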