This notebook will be used for the Assignment of week 5 of capstone course for Applied Data Science on Coursera

Author: Rajiv Ranjan Singh

1. Introduction

In this project for the capstone course for Applied Data Science, we shall cluster and compare neighbourhoods in Toronto and New York City on the basis of popular venues, major crime indicators and area of the neighbourhood. At the end of the exercise, a desirable output shall be a table of similar neighbourhoods across New York and Toronto. This information shall be useful for anyone who is doing business in any of these cities and wants to expand to the other city. It shall also be useful for professionals who are looking to change jobs within New York or Toronto or from one city to another.

1.1 Background

The City of New York, usually called either New York City (NYC) or simply New York (NY), is the most populous city in the United States. With an estimated 2018 population of 8,398,748 distributed over a land area of about 302.6 square miles (784 km2), New York is also the most densely populated major city in the United States. Located at the southern tip of the state of New York, the city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass and one of the world's most populous megacities. A global power city, New York City has been described as the cultural, financial and media capital of the world, and exerts a significant impact upon commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports.

Similarly, Toronto is the provincial capital of Ontario and the most populous city in Canada, with a population of 2,731,571 in 2016. The diverse population of Toronto reflects its current and historical role as an important destination for immigrants to Canada. Its varied cultural institutions, which include numerous museums and galleries, festivals and public events, entertainment districts, national historic sites, and sports activities, attract over 43 million tourists each year. Toronto is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.

1.2 Problem Description

Both Toronto and New York have large and diverse populations. Every year hundreds of thousands of immigrants, businesspeople and professionals visit, migrate to or settle in these cities for work, livelihood and tourism. Because of the large area, the many neighbourhoods, income differences, and variations in quality of life from one neighbourhood to another, it is often tedious to find neighbourhoods that suit one's preferences. Businesses often need neighbourhood information in order to relocate or open new branches. Further, anyone moving from New York to Toronto or vice versa would want to move to a broadly similar neighbourhood. Many people also prefer to avoid crime-prone areas of a city, whether for residence or business. Therefore the crime data for each neighbourhood is also relevant to the categorization of similar neighbourhoods.

Therefore the problem is to group neighbourhoods within and across New York and Toronto and categorize them based on popular venues, businesses, area and crime rate.

1.3 Target Audience

Businesses looking for expansion in New York and Toronto
Professionals looking for relocation
Students looking for relocation
House buyers

1.4 Success criteria

A good categorization of neighbourhoods between Toronto and New York and their prominent features.

2. Data

For our project we will need the following data for both Toronto and New York:

Major Crime Indicators for each neighbourhood/precinct
Area, Latitude and Longitude for each neighbourhood/precinct
List of popular venues for each neighbourhood/precinct

2.1 Data for Toronto

For crime statistics of Toronto we shall use the data provided here. This dataset provides a number of features, including neighbourhood name, neighbourhood id, major crime indicators from 2014-2018, the average of major crimes for the last 5 years, the area and boundary length of each neighbourhood, and population. For our analysis, we shall retain the data on
- Neighbourhood
- Assault_AVG
- AutoTheft_AVG
- BreakandEnter_AVG
- Robbery_AVG
- TheftOver_AVG
- Homicide_AVG
- Shape__Area
For latitude and longitude of each neighbourhood, we shall fetch the data for each neighbourhood using 'Here' geocoder from geopy library in Python.
The data of venues and venue categories for each neighbourhood will be fetched using Foursquare API.

2.2 Data for New York

For crime statistics of New York we shall use the data provided here. This dataset provides a number of features including precinct id, major crimes from 2001-2018. For our analysis, we shall retain the data on

Precinct id
Average of Crime figures from 2014 to 2018

Using dataset cleaning and feature engineering, we shall compute the average of major crimes for each precinct and rename the columns to align the dataset with the Toronto dataset. For further analysis, each precinct shall be treated as a neighbourhood.

For location data of New York Precincts, open data available at this link will be used. This dataset provides the boundary data and shape_area for each precinct. The centroid (latitude and longitude) of each precinct will be calculated by extracting the boundary data from above dataset and used as neighbourhood latitude and longitude. Shape_area feature will be used as it is.

During coding, the performance of several free geocoders in fetching location data for precincts was observed to be poor as many precincts could not be located by free geocoders available in geopy.

The data of venues and venue categories for each neighbourhood will be fetched using Foursquare API.

3. Methodology

3.1 Data Retrieval,Cleaning and Feature Engineering

3.1.1 Toronto

First of all, the crime data of Toronto was retrieved from here in json format. The json dataset was then flattened and saved as a pandas DataFrame.

Then column names were cleaned and only the following 8 columns were retained for further analysis.
- Neighbourhood
- Assault_AVG
- AutoTheft_AVG
- BreakandEnter_AVG
- Robbery_AVG
- TheftOver_AVG
- Homicide_AVG
- Shape__Area

It was observed that Homicide averages were 'NA' for some neighbourhoods where 0 homicides were recorded. Hence 35 such values were replaced with 0 (int).
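The column selection and 'NA' replacement described above can be sketched as follows. This is a minimal illustration, assuming the flattened data is a pandas DataFrame in which missing homicide averages appear as the string 'NA'; the toy rows, the `Hood_ID` column and all values are made up and do not come from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the flattened Toronto crime data (values are illustrative only).
raw = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Assault_AVG": [10.2, 55.0, 7.4],
    "Homicide_AVG": ["NA", 1.2, "NA"],
    "Hood_ID": [1, 2, 3],  # hypothetical example of a column that is not retained
})

# Retain only the columns needed for the analysis.
keep = ["Neighbourhood", "Assault_AVG", "Homicide_AVG"]
toronto = raw[keep].copy()

# Count the 'NA' entries, then coerce them to NaN and treat them as zero homicides.
n_replaced = int((raw["Homicide_AVG"] == "NA").sum())
toronto["Homicide_AVG"] = pd.to_numeric(toronto["Homicide_AVG"], errors="coerce").fillna(0)
```

Counting the replaced values first gives a quick check against the expected number of affected neighbourhoods.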

The cleaned dataframe had 140 rows and 8 columns.

Then the latitude and longitude of each neighbourhood were fetched using a geocoder from the geopy package. It was observed that the Nominatim geocoder was not able to give location data for all the neighbourhoods in the dataset. After trials, it was observed that the Here geocoder gave the best results among all the free geocoders available in geopy. Further, in order to prevent blocking by the "Here" api due to repeated calls, the RateLimiter utility from geopy was used. Using the 'tqdm' package and its progress_apply function, a progress bar was used to monitor the geocoding of each neighbourhood.

Finally, rows whose locations could not be retrieved were dropped, and the cleaned dataframe had a shape of 140 X 10.

3.1.2 New York

The crime data for New York was much harder to process. We used the data provided here. This dataset provides a number of features including precinct id and major crimes from 2001-2018, in Excel file format. As can be seen, columns contain yearwise crime figures from 2000 to 2018 while crime heads are listed in successive rows for each precinct. Further, only every 7th row has valid precinct ID data; the intermediate cells are 'NaN' because merged cells in the xls file could not be parsed.

Data Cleaning: A lot of data cleaning was used to remove garbage data and get AVG of major crime heads for 2014-2018 for each precinct.

Finally Dataframe.pivot and renaming of columns was used to get the dataset in same order as Toronto crime data.
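The reshaping steps above can be sketched as follows. This is a simplified illustration, assuming the spreadsheet has been loaded into a pandas DataFrame; the column names `PCT` and `CRIME`, the two year columns, the figures, and the mapping of crime heads onto Toronto-style column names are all hypothetical.

```python
import pandas as pd

# Toy layout mimicking the spreadsheet: the precinct ID appears only on the
# first row of each block, and years are spread across columns.
raw = pd.DataFrame({
    "PCT": [1, None, 5, None],
    "CRIME": ["ROBBERY", "BURGLARY", "ROBBERY", "BURGLARY"],
    2017: [10, 20, 30, 40],
    2018: [20, 40, 50, 60],
})

raw["PCT"] = raw["PCT"].ffill().astype(int)   # fill IDs lost to merged cells
raw["AVG"] = raw[[2017, 2018]].mean(axis=1)   # average over the chosen years

# One row per precinct, one column per crime head.
nyc_crime = raw.pivot(index="PCT", columns="CRIME", values="AVG")

# Rename to Toronto-style column names (this mapping is an assumption).
nyc_crime = nyc_crime.rename(
    columns={"ROBBERY": "Robbery_AVG", "BURGLARY": "BreakandEnter_AVG"}
)
```

Forward-filling the precinct ID before pivoting is what makes each crime row attributable to its precinct.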

Since neighbourhood-wise crime data for New York was not freely available, precincts were treated as neighbourhoods. Retrieving the location (lat, long) for each precinct turned out to be difficult. We used the open dataset available as a csv file here. It features precinct ID, Shape_area, Shape_length and all the locations forming the boundary geometry of each precinct. A custom function was defined to extract all the locations from the boundary geometry and return the mean of the latitudes and longitudes of all boundary points as the central latitude and longitude for the precinct.
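A centroid function of the kind described above might look like the sketch below. It assumes the boundary geometry is stored as a WKT-style string of "longitude latitude" pairs; the function name `boundary_centroid` and the sample geometry are hypothetical, and the actual column format in the csv file may differ.

```python
import re

def boundary_centroid(wkt):
    """Return (mean latitude, mean longitude) of all boundary points in a WKT string."""
    # WKT lists coordinates as "longitude latitude" pairs separated by commas.
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", wkt)
    lons = [float(lon) for lon, _ in pairs]
    lats = [float(lat) for _, lat in pairs]
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Illustrative four-corner square (closing point omitted), not a real precinct.
geom = "MULTIPOLYGON (((-74.0 40.0, -74.0 41.0, -73.0 41.0, -73.0 40.0)))"
lat, lon = boundary_centroid(geom)
```

Note that the mean of the boundary vertices is only an approximation of the geometric centroid: stretches of boundary with many closely spaced points pull the result towards them.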

Some columns were then renamed and reordered, and the result was merged with the crime data of New York so that the Toronto and New York datasets share the same feature names and layout.

Finally, the central latitude and longitude of each precinct area were converted to custom neighbourhood names by reverse geocoding with the Here geocoder in geopy. The custom neighbourhood names were cleaned using regex substitution.

3.2 Exploratory Data Analysis

3.2.1 Toronto

Plot Toronto neighbourhoods on map using Nominatim and Folium

The Toronto Neighbourhood Map

Similarly plot new York neighbourhoods using Folium and Nominatim

The New York Neighbourhood Map

Explore First neighbourhood in Toronto by fetching 100 top venues within a radius of 500 m using Foursquare api.

We can see that only 45 popular venues were returned by Foursquare for Yonge-St.Clair

Define a custom function get_nearby_venues to get Venue, Venue Latitude, Venue Longitude, Venue Category for each neighbourhood in Toronto Dataset. Apply the function to all rows of toronto dataset.

Check the number of venues returned for each neighbourhood and count the total number of categories of all venues in the Toronto dataset. We can see that a total of 280 categories were returned for popular venues in Toronto neighbourhoods.

Plot the number of venues for each neighbourhood. We can see that there is a huge variation in the number of venues across the neighbourhoods of Toronto.

Convert the venue category into numeric variables using One Hot Encoding (pandas dummy variables) and then group the category-wise total venues for each neighbourhood. This venue dataframe is then merged with the crime dataset to get the final dataframe ready for clustering and analysis.
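The encoding, grouping and merging steps above can be sketched as follows, assuming a venue table with one row per venue. The toy data and variable names are illustrative only.

```python
import pandas as pd

# One row per venue returned for a neighbourhood (toy data).
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One indicator column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Total venues per category in each neighbourhood.
venue_counts = onehot.groupby("Neighbourhood").sum().reset_index()

# Merge with the (toy) crime table to obtain one feature row per neighbourhood.
crime = pd.DataFrame({"Neighbourhood": ["A", "B"], "Assault_AVG": [12.0, 3.5]})
features = crime.merge(venue_counts, on="Neighbourhood", how="inner")
```

An inner merge keeps only neighbourhoods that appear in both tables, so a neighbourhood for which no venues were returned would be dropped at this point.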

3.2.2 New York

The entire process from steps 1 to 8 is repeated for the New York dataframe too, beginning with getting venues using Foursquare.

Final grouped dataset of crime and category wise venues for New York Neighbourhoods

Plot the number of venues for each neighbourhood. We can see that several neighbourhoods in New York have hit the ceiling of 100 popular venues per neighbourhood.

3.2.3 Dropping sparse venue category columns which are present in two or fewer neighbourhoods

Too many features in a dataset can affect machine learning and make proper clustering difficult. Hence features ('venue categories') which are present (non-zero) in two or fewer neighbourhoods were dropped to condense the dataset.

We can see that 101 (401 - 300) columns were dropped from the NYC dataset and 50 (290 - 240) columns were dropped from the Toronto dataset.
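A minimal sketch of the sparse-column filter, assuming the venue counts are held in a pandas DataFrame with one column per category. The helper name `drop_sparse_columns` and the toy counts are hypothetical.

```python
import pandas as pd

def drop_sparse_columns(df, min_nonzero=3):
    """Keep only columns that are non-zero in at least `min_nonzero` rows."""
    nonzero_counts = (df != 0).sum(axis=0)
    return df.loc[:, nonzero_counts >= min_nonzero]

venue_counts = pd.DataFrame({
    "Cafe": [1, 2, 0, 3],   # non-zero in 3 rows -> kept
    "Zoo":  [0, 0, 1, 0],   # non-zero in 1 row  -> dropped
    "Park": [1, 0, 0, 2],   # non-zero in 2 rows -> dropped
})
dense = drop_sparse_columns(venue_counts)
```

With `min_nonzero=3`, categories present in two or fewer neighbourhoods are removed, matching the rule stated above.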

3.3 Inferential Statistical Testing

3.3.1 Plotting distance matrices for Toronto and New York

We define a custom function that computes the distance matrix of neighbourhood observations using the cdist() function in the scipy.spatial.distance library and plots the resulting matrix using seaborn.heatmap(). Another subplot shows the heatmap of the distance matrix sorted along rows and columns.
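The computation behind such a plot can be sketched as follows; plotting is left out so the example stays self-contained. The function name `sorted_distance_matrix` is hypothetical, and sorting each column and then each row is only one possible reading of "sorted along rows and columns".

```python
import numpy as np
from scipy.spatial.distance import cdist

def sorted_distance_matrix(X, metric="euclidean"):
    """Return the pairwise distance matrix and a copy sorted along rows and columns."""
    dist = cdist(X, X, metric=metric)
    # Sort within each column first, then within each row.
    sorted_dist = np.sort(np.sort(dist, axis=0), axis=1)
    return dist, sorted_dist

# Three toy observations in a two-dimensional feature space.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist, sorted_dist = sorted_distance_matrix(X)
```

Either array can then be passed to seaborn.heatmap() to visualise how tightly the observations are packed.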

Plotting Toronto dataset using plot_dist_matrix

Plotting similar distance matrices for New York data

On comparing the distance matrix plots of New York and Toronto, we can clearly observe that pairwise distances between neighbourhoods in Toronto fall in a much smaller range (almost 70% in 0-3) than pairwise distances between New York neighbourhoods, which fall in a larger range (almost 70% at 4.5 and above). We can infer that neighbourhoods in Toronto are more similar to one another than those in New York.

3.3.2 Finding optimum value of n_cluster for kmeans algorithm

Before using kmeans to cluster similar neighbourhoods, we need to find the optimum number of clusters to get the best results. To find the ideal number of clusters we run kmeans over a range of cluster counts. For the purpose of this exercise, we use a range of 2-15. We then define a custom function to plot silhouette scores for each test value of K. Note that features are scaled prior to fitting so that no single feature biases the distance matrix.
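The silhouette search can be sketched as below. This is an illustration on synthetic data, assuming scikit-learn is available; the function name `silhouette_by_k`, the smaller range of K (2-6) and the three artificial groups are choices made for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(X, k_values, random_state=0):
    """Return {k: silhouette score} for KMeans fitted on standardised features."""
    X_scaled = StandardScaler().fit_transform(X)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    return scores

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
```

Plotting `scores` against K gives a curve of the kind discussed in this section; on real data the maximum is usually far less pronounced than in this toy case.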

As per the graph of silhouette scores, any value of K in the range of 2-8 will give a good score. We choose kclusters to be 6 for clustering Toronto neighbourhoods.

Same process of plotting silhouette scores was followed for finding optimum value of n_clusters for Kmeans for New York neighbourhoods. The resulting plot is

We can see that the best silhouette scores are obtained for k in the range of 2 to 5. However, the best silhouette scores are not more than 0.09, which is far lower than the corresponding scores for Toronto neighbourhoods (0.30 - 0.35). This shows that with the presently generated features, clustering for New York neighbourhoods may not be optimum. This can also be understood from the heatmap of the distance matrix in section 3.3.1 (1), where we can see that very few neighbourhoods are closely located (only a few dark shades) and most of the neighbourhoods are distant in feature space. However, for the purpose of this exercise, we will use a value of 5 for n_clusters.

3.4 Machine Learning

3.4.1 Clustering Toronto neighbourhoods

First of all, we cluster the neighbourhoods within the city of Toronto using kmeans. Initially, kmeans was run with parameters values:-

n_clusters=kclusters (6)
random_state=0,
n_init = N_INIT(20)
algorithm='auto'

During code execution it was observed that the cluster labels generated by kmeans changed from run to run. This is probably because neighbourhoods are tightly packed in the feature space, as observed in the heatmap plot of the distance matrix (section 3.3.1). In order to get consistent cluster labels, we increased the n_init parameter to 200 to allow more exploration through random initialization. The algorithm parameter was also reviewed. The documentation for sklearn.cluster.KMeans says the following:

algorithm: "auto", "full" or "elkan", default="auto". K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. "auto" chooses "elkan" for dense data and "full" for sparse data.

Since our dataset is highly sparse, the algorithm was forced to 'full'.

The final parameters used were as follows:-

n_clusters=kclusters (6)
random_state=0,
n_init = N_INIT(200)
algorithm='full'

After clustering, the cluster labels were merged in Toronto neighbourhood dataset and neighbourhoods were plotted using Folium with color-coding of markers as per cluster labels.

3.4.2 Clustering New York neighbourhoods

The KMeans clustering of New York neighbourhoods was run with the same parameters as for the Toronto neighbourhoods, except for the n_clusters parameter, which was set to 5 as determined during silhouette analysis.

The final parameters used were as follows:-

n_clusters=kclusters (5)
random_state=0,
n_init = N_INIT(200)
algorithm='full'

After clustering, the cluster labels were merged in New York neighbourhood dataset and neighbourhoods were plotted using Folium with color-coding of markers as per cluster labels.

3.4.3 Compare Neighbourhoods between New York and Toronto

In order to compare and find similar neighbourhoods between New York and Toronto, we ensure that both datasets have the same column names ("Venue Category") so that the two dataframes can be joined together. Therefore we find the columns (venue categories) present in the Toronto dataframe but not in the New York dataframe (using the numpy.setdiff1d function) and add the missing columns to the New York dataframe. The new columns are initialized to 0 (zero).

Same process was repeated for Toronto neighbourhood dataset too.

Some feature engineering was done. A column of city labels was added to each dataset before concatenation. Also, since the latitudes and longitudes of New York and Toronto lie in different ranges, the latitudes and longitudes of neighbourhoods in each dataset were centered prior to concatenation. Centering was done by subtracting the lat/long of the respective city from the lat and long of each neighbourhood. Subsequently, sparse columns with fewer than two non-zero values were dropped.
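The column alignment, city labelling and centering steps described above can be sketched as follows. The helper names, the toy rows and the reference coordinates used for each city are all hypothetical and chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

toronto = pd.DataFrame({"Neighbourhood": ["T1", "T2"], "Latitude": [43.6, 43.8],
                        "Longitude": [-79.4, -79.2], "Cafe": [2, 0], "Bakery": [1, 1]})
nyc = pd.DataFrame({"Neighbourhood": ["N1", "N2"], "Latitude": [40.7, 40.9],
                    "Longitude": [-74.0, -73.8], "Cafe": [1, 3], "Deli": [4, 0]})

def add_missing_columns(df, other):
    """Add, as zero-filled columns, every column of `other` that `df` lacks."""
    out = df.copy()
    for col in np.setdiff1d(other.columns, df.columns):
        out[col] = 0
    return out

def center_on(df, city_lat, city_lon):
    """Express coordinates as offsets from a reference point for the city."""
    out = df.copy()
    out["Latitude"] = out["Latitude"] - city_lat
    out["Longitude"] = out["Longitude"] - city_lon
    return out

toronto_aligned = add_missing_columns(toronto, nyc)
nyc_aligned = add_missing_columns(nyc, toronto)

# Reference coordinates below are illustrative, not official city centres.
toronto_c = center_on(toronto_aligned, 43.7, -79.3).assign(City="Toronto")
nyc_c = center_on(nyc_aligned, 40.8, -73.9).assign(City="NYC")

# Reorder Toronto's columns to match New York's before stacking the rows.
combined = pd.concat([nyc_c, toronto_c[nyc_c.columns]], ignore_index=True)
```

After centering, latitude and longitude describe a neighbourhood's position relative to its own city, so the absolute distance between the two cities no longer dominates the feature space.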

The final combined dataframe has 212 rows and 326 columns.

We can see that 36 sparse columns ( 362 -326) were dropped from the combined dataset.

Silhouette analysis for combined dataframe was done using plot_silhouette custom function with test values of n_clusters from 2-30.

We can see from the silhouette score plot that for k between 2 and 20 the silhouette score remains roughly between 0.28 and 0.32. However, there is a sharp drop at a k value of 8. We also know that the best values of K for the New York dataset ranged from 2-5 only. Hence, we decide to run KMeans clustering for the combined dataframe with an n_clusters value of 5.

After KMeans clustering, the cluster labels so obtained were merged with the combined dataframe, and neighbourhoods were grouped according to cluster labels and city labels into the following pivot table. IPython.html and Dataframe.style.set_properties() were used for pretty rendering of the pivot table.
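A pivot table of this shape can be built with groupby, an aggregation and unstack. The sketch below uses made-up neighbourhood names and labels and omits the styling step.

```python
import pandas as pd

combined = pd.DataFrame({
    "Neighbourhood": ["N1", "N2", "T1", "T2", "T3"],
    "City": ["NYC", "NYC", "Toronto", "Toronto", "Toronto"],
    "Cluster Labels": [0, 1, 0, 0, 1],
})

# Join the neighbourhood names within each (cluster, city) group, then move
# the city level of the index into columns.
pivot = (combined.groupby(["Cluster Labels", "City"])["Neighbourhood"]
         .agg(", ".join)
         .unstack("City"))
```

A (cluster, city) combination with no neighbourhoods shows up as NaN after unstack, which is how the `nan` cells in the table arise.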

City	NYC	Toronto
Cluster Labels
0	Hugh L Carey Tunnel, New York, NY 10004 744 Greenwich St, New York, NY 10014 119 Avenue B, New York, NY 10009 7 Peter Cooper Rd, New York, NY 10010 107 W 37th St, New York, NY 10018 489 E 41st St, New York, NY 10017 548 W 53rd St, New York, NY 10019 338 E 80th St, New York, NY 10075 307 W 71st St, New York, NY 10023 327 E 105th St, New York, NY 10029 824 West End Ave, New York, NY 10025 3096 Broadway, New York, NY 10027 141 W 118th St, New York, NY 10026 193 Fort Washington Ave, New York, NY 10032 608 W 204th St, New York, NY 10034 356 83rd St, Brooklyn, NY 11209 654 E 17th St, Brooklyn, NY 11230 Brooklyn, NY 11232 147 Richards St, Brooklyn, NY 11231 175 New York Ave, Brooklyn, NY 11216 541 1st St, Brooklyn, NY 11215 673 Lafayette Ave, Brooklyn, NY 11216 885 Gates Ave, Brooklyn, NY 11221 389 Central Ave, Brooklyn, NY 11221 225 Cadman Plz E, Brooklyn, NY 11201 8th St, Brooklyn, NY 11251 342 Devoe St, Brooklyn, NY 11211 230 Java St, Brooklyn, NY 11222 25-48 50th Ave, Long Island City, NY 11101 106-16 70th Ave, Forest Hills, NY 11375	Church-Yonge Corridor Kensington-Chinatown Bay Street Corridor
1	505 E 120th St, New York, NY 10035 350 Powers Ave, Bronx, NY 10454 1984 Randall Ave, Bronx, NY 10473 85 McClellan St, Bronx, NY 10452 1891 Harrison Ave, Bronx, NY 10453 1125 E 231st St, Bronx, NY 10466 710 E 180th St, Bronx, NY 10457 2053 Yates Ave, Bronx, NY 10461 Bronx, NY 10463 200 E 196th St, Bronx, NY 10458 2824 W 17th St, Brooklyn, NY 11224 2151 Bath Ave, Brooklyn, NY 11214 483 E 48th St, Brooklyn, NY 11203 1504 E New York Ave, Brooklyn, NY 11212 85-21 111th St, Richmond Hill, NY 11418 Jamaica, NY 11433 5873 57th St, Maspeth, NY 11378 139-02 233rd St, Rosedale, NY 11422 158-42 86th St, Howard Beach, NY 11414 164-30 73rd Ave, Fresh Meadows, NY 11366 144-22 11th Ave, Whitestone, NY 11357 39-14 112th St, Corona, NY 11368	nan
2	639 W 142nd St, New York, NY 10031 52 Macombs Pl, New York, NY 10039	nan
3	7 Catherine St, New York, NY 10038 271 Henry St, New York, NY 10002	nan
4	643 W 24th St, New York, NY 10011 New York, NY 10028 420 Tiffany St, Bronx, NY 10474 641 E 169th St, Bronx, NY 10456 1430 Outlook Ave, Bronx, NY 10465 5 Frank Ct, Brooklyn, NY 11229 11N, Brooklyn, NY 11234 4317 17th Ave, Brooklyn, NY 11204 430 Lefferts Ave, Brooklyn, NY 11225 Brooklyn, NY 11236 Rockaway Park, NY 11694 10-29 Bay 31st St, Far Rockaway, NY 11691 Cross Island Pkwy, Oakland Gardens, NY 11364 Perimeter Rd, Jamaica, NY 11430 Astoria, NY 11105 Flushing, NY 11371 Staten Island, NY 10301 Gulf Ave, Staten Island, NY 10314 391 Fairbanks Ave, Staten Island, NY 10306 170 Sharrotts Ln, Staten Island, NY 10309	Yonge-St.Clair York University Heights Lansing-Westgate Yorkdale-Glen Park Stonegate-Queensway Tam O'Shanter-Sullivan The Beaches Thistletown-Beaumond Heights Thorncliffe Park Danforth East York Humewood-Cedarvale Islington-City Centre West Scarborough Village South Parkdale South Riverdale St.Andrew-Windfields Taylor-Massey Humber Summit Humbermede Centennial Scarborough Clairlea-Birchmount Cliffcrest Flemingdon Park Corso Italia-Davenport Ionview Junction Area Broadview North Princess-Rosethorn North Riverdale Etobicoke West Mall Forest Hill North Glenfield-Jane Heights Greenwood-Coxwell Guildwood Trinity-Bellwoods Victoria Village Waterfront Communities-The Island West Hill West Humber-Clairville Westminster-Branson Kennedy Park Kingsview Village-The Westway Bayview Woods-Steeles Clanton Park Keelesdale-Eglinton West O'Connor-Parkview Old East York Casa Loma Kingsway South Runnymede-Bloor West Village Forest Hill South Henry Farm Annex Caledonia-Fairbank Humber Heights-Westmount Roncesvalles University Hillcrest Village Mount Dennis Dorset Park Edenbridge-Humber Valley Dovercourt-Wallace Emerson-Junction Newtonbrook West Niagara Beechborough-Greenbrook High Park North High Park-Swansea Highland Creek North St.James Town Oakridge Rosedale-Moore Park Oakwood Village Wexford/Maryvale Eglinton East Elms-Old Rexdale Agincourt North Agincourt South-Malvern West Englemount-Lawrence Eringate-Centennial-West Deane L'Amoreaux Banbury-Don Mills Bathurst Manor Regent Park Bendale Birchcliffe-Cliffside Weston-Pellam Park Downsview-Roding-CFB Lambton Baby Point Black Creek Willowdale East Willowdale West Rouge Mount Olive-Silverstone-Jamestown Cabbagetown-South St.James Town Mount Pleasant East Mount Pleasant West Blake-Jones Rexdale-Kipling East End-Danforth New Toronto Palmerston-Little Italy Parkwoods-Donalda Pelmo Park-Humberlea Playter Estates-Danforth Willowridge-Martingrove-Richview Woburn Woodbine-Lumsden Bayview Village Bedford Park-Nortown Rockcliffe-Smythe Bridle Path-Sunnybrook-York Mills Don Valley Village Weston Lawrence Park South Long Branch Malvern Dufferin Grove Maple Leaf Markland Wood Steeles Lawrence Park North Yonge-Eglinton Morningside Moss Park Little Portugal Woodbine Corridor Newtonbrook East Milliken Pleasant View Wychwood Leaside-Bennington Briar Hill-Belgravia Mimico

We can see from the table above that clusters 2 and 3 each contain only two neighbourhoods from New York and none from Toronto. In fact, all neighbourhoods of Toronto except three have been grouped in cluster 4. This could be because the neighbourhoods of Toronto are much more tightly packed in the n-dimensional feature space than the neighbourhoods of New York.

3.4.4 Comparing various clustering algorithms on combined dataframe

We saw in the previous section that the performance of the KMeans algorithm in finding similar neighbourhoods between New York and Toronto was not satisfactory. Hence, it was decided to do a test run of all the clustering algorithms available in the sklearn.cluster library on the combined dataframe and compare their performance in clustering similar neighbourhoods between the two cities.

The following algorithms were tried:-

MiniBatchKMeans
AffinityPropagation
MeanShift
SpectralClustering
Ward
AgglomerativeClustering
DBSCAN
OPTICS
Birch
GaussianMixture
SpectralBiclustering
SpectralCoclustering

The final pivot table output containing number of neighbourhoods in each cluster for two cities for each algorithm is displayed below.

RunningMiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
                init_size=None, max_iter=100, max_no_improvement=10,
                n_clusters=5, n_init=3, random_state=None,
                reassignment_ratio=0.01, tol=0.0, verbose=0)

City	NYC	Toronto
Cluster Labels
0	9	nan
1	4	nan
2	2	nan
3	19	133
4	42	3

RunningAffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.9, max_iter=200, preference=-1200, verbose=False)

City	NYC	Toronto
Cluster Labels
0	1	nan
1	1	nan
2	1	nan
3	nan	1
4	73	135

RunningMeanShift(bandwidth=16.97339757396255, bin_seeding=True, cluster_all=True,
          min_bin_freq=1, n_jobs=None, seeds=None)

City	NYC	Toronto
Cluster Labels
0	76	119
1	nan	1
2	nan	1
3	nan	1
4	nan	1
5	nan	1
6	nan	1
7	nan	1
8	nan	1
9	nan	1
10	nan	1
11	nan	1
12	nan	1
13	nan	1
14	nan	1
15	nan	1
16	nan	1
17	nan	1

RunningSpectralClustering(affinity='nearest_neighbors', assign_labels='kmeans',
                   coef0=1, degree=3, eigen_solver='arpack', eigen_tol=0.0,
                   gamma=1.0, kernel_params=None, n_clusters=5, n_init=10,
                   n_jobs=None, n_neighbors=10, random_state=None)

City	NYC	Toronto
Cluster Labels
0	7	9
1	15	7
2	24	59
3	28	52
4	2	9

RunningAgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
                        connectivity=<212x212 sparse matrix of type '<class 'numpy.float64'>'
	with 4076 stored elements in Compressed Sparse Row format>,
                        distance_threshold=None, linkage='ward', memory=None,
                        n_clusters=5, pooling_func='deprecated')

City	NYC	Toronto
Cluster Labels
0	70	135
1	2	nan
2	3	nan
3	nan	1
4	1	nan

RunningAgglomerativeClustering(affinity='cityblock', compute_full_tree='auto',
                        connectivity=<212x212 sparse matrix of type '<class 'numpy.float64'>'
	with 4076 stored elements in Compressed Sparse Row format>,
                        distance_threshold=None, linkage='average', memory=None,
                        n_clusters=5, pooling_func='deprecated')

City	NYC	Toronto
Cluster Labels
0	71	136
1	2	nan
2	1	nan
3	1	nan
4	1	nan

RunningDBSCAN(algorithm='auto', eps=0.005, leaf_size=30, metric='cosine',
       metric_params=None, min_samples=20, n_jobs=None, p=None)

City	NYC	Toronto
Cluster Labels
-1	76	136

RunningOPTICS(algorithm='auto', cluster_method='xi', eps=None, leaf_size=30,
       max_eps=inf, metric='minkowski', metric_params=None,
       min_cluster_size=0.1, min_samples=20, n_jobs=None, p=2,
       predecessor_correction=True, xi=0.05)

City	NYC	Toronto
Cluster Labels
-1	76	99
0	nan	37

RunningBirch(branching_factor=50, compute_labels=True, copy=True, n_clusters=5,
      threshold=0.5)

City	NYC	Toronto
Cluster Labels
0	27	5
1	11	1
2	5	nan
3	17	130
4	16	nan

RunningGaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=5, n_init=1, precisions_init=None,
                random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
                verbose_interval=10, warm_start=False, weights_init=None)

City	NYC	Toronto
Cluster Labels
0	3	nan
1	nan	3
2	72	132
3	1	nan
4	nan	1

RunningSpectralBiclustering(init='k-means++', method='bistochastic', mini_batch=False,
                     n_best=50, n_clusters=5, n_components=250, n_init=10,
                     n_jobs=None, n_svd_vecs=None, random_state=None,
                     svd_method='randomized')

City	NYC	Toronto
Cluster Labels
0	22	133
1	21	nan
2	nan	1
3	nan	2
4	33	nan

RunningSpectralCoclustering(init='k-means++', mini_batch=False, n_clusters=5,
                     n_init=10, n_jobs=None, n_svd_vecs=None, random_state=None,
                     svd_method='randomized')

City	NYC	Toronto
Cluster Labels
0	17	130
1	13	nan
2	8	6
3	21	nan
4	17	nan

. . .

On comparing the summary tables of the clustering done by the various algorithms, we can see that almost all of them place most neighbourhoods in one or two clusters, often with nearly all Toronto neighbourhoods falling in a single cluster. The DBSCAN algorithm, which does not take an initial value of n_clusters and tries to find the inherent cluster structure, assigns every neighbourhood of both New York and Toronto the label -1, which scikit-learn uses for noise points; with the parameters used, it finds no clusters at all.

However, SpectralClustering produces the most granular clustering, with every cluster containing some neighbourhoods from both New York and Toronto. The results of SpectralClustering are reproduced below.

Cluster Labels	NYC	Toronto
0	7	9
1	15	7
2	24	59
3	28	52
4	2	9

3.4.5 Inductive Clustering

When the result of clustering naturally induces functions for classification on the whole space of interest, the method is called that of inductive clustering. In contrast, a method is called non-inductive, if it does not induce such a function. Many clustering algorithms are not inductive and so cannot be directly applied to new data samples without recomputing the clustering, which may be intractable. Instead, we can use clustering to then learn an inductive model with a classifier.
For this exercise, we shall use inductive clustering built from AgglomerativeClustering and KNeighborsClassifier. The first step was to fit an AgglomerativeClustering model on the New York dataframe to generate cluster labels for the New York data. Then the sequence of agglomerative clustering followed by KNeighborsClassifier was used to predict the cluster labels for Toronto neighbourhoods.
The parameters finalised for Agglomerative Clustering after fine-tuning the models are as follows:-
- n_clusters=5
- affinity ='cosine'
- connectivity= kneighbors_graph
- linkage = 'complete'
- compute_full_tree=True
The significance of the above parameters is as follows:-
- complete linkage minimizes the maximum distance between observations of pairs of clusters
- connectivity constraints are useful to impose a certain local structure (only adjacent clusters can be merged together)
- cosine distance works because it is invariant to global scalings of the signal
- compute_full_tree was forced as True. This is because the documentation for AgglomerativeClustering says when varying the number of clusters and using caching, it may be advantageous to compute the full tree.
The cluster labels generated by Inductive Clustering were merged with combined dataframe. With Dataframe.groupby(), aggregate() and unstack() methods, a pivot table was generated classifying all the neighbourhoods of New York and Toronto.
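The inductive clustering idea can be sketched as below: cluster one dataset, train a classifier on the resulting labels, then use the classifier to label new points. This is an illustration on synthetic data, assuming scikit-learn is available; the connectivity constraint is omitted for brevity, the helper `make_clusterer` is hypothetical, and the two toy arrays merely stand in for the New York and Toronto features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

def make_clusterer(n_clusters):
    # The distance argument is named `metric` in recent scikit-learn releases
    # and `affinity` in older ones, so try both.
    try:
        return AgglomerativeClustering(n_clusters=n_clusters, metric="cosine",
                                       linkage="complete", compute_full_tree=True)
    except TypeError:
        return AgglomerativeClustering(n_clusters=n_clusters, affinity="cosine",
                                       linkage="complete", compute_full_tree=True)

# Three groups of points lying in clearly different directions from the origin.
rng = np.random.RandomState(0)
directions = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, -10.0]])
X_train = np.vstack([d + rng.normal(scale=0.5, size=(20, 2)) for d in directions])
X_new = directions + 0.1  # unseen points, one near each group

# Step 1: cluster the training data (stands in for New York).
train_labels = make_clusterer(3).fit_predict(X_train)

# Step 2: learn a classifier that reproduces those cluster labels.
classifier = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)

# Step 3: assign the unseen points (stand in for Toronto) to existing clusters.
new_labels = classifier.predict(X_new)
```

Because the classifier only assigns new points to clusters that already exist, this approach cannot discover a cluster that is present in the second dataset but absent from the first.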

4. Results

The final pivot table displaying the clustering of neighbourhoods between New York and Toronto can be seen below.

City	NYC	Toronto
Cluster Labels
0	2824 W 17th St, Brooklyn, NY 11224 Gulf Ave, Staten Island, NY 10314 391 Fairbanks Ave, Staten Island, NY 10306 170 Sharrotts Ln, Staten Island, NY 10309	Stonegate-Queensway Keelesdale-Eglinton West Casa Loma High Park-Swansea Downsview-Roding-CFB Willowdale West Don Valley Village Woodbine Corridor
1	Hugh L Carey Tunnel, New York, NY 10004 7 Catherine St, New York, NY 10038 744 Greenwich St, New York, NY 10014 271 Henry St, New York, NY 10002 119 Avenue B, New York, NY 10009 7 Peter Cooper Rd, New York, NY 10010 107 W 37th St, New York, NY 10018 107 W 37th St, New York, NY 10018 489 E 41st St, New York, NY 10017 338 E 80th St, New York, NY 10075 307 W 71st St, New York, NY 10023 New York, NY 10028 327 E 105th St, New York, NY 10029 824 West End Ave, New York, NY 10025 505 E 120th St, New York, NY 10035 3096 Broadway, New York, NY 10027 141 W 118th St, New York, NY 10026 639 W 142nd St, New York, NY 10031 52 Macombs Pl, New York, NY 10039 193 Fort Washington Ave, New York, NY 10032 608 W 204th St, New York, NY 10034 350 Powers Ave, Bronx, NY 10454 420 Tiffany St, Bronx, NY 10474 1984 Randall Ave, Bronx, NY 10473 85 McClellan St, Bronx, NY 10452 1891 Harrison Ave, Bronx, NY 10453 1125 E 231st St, Bronx, NY 10466 710 E 180th St, Bronx, NY 10457 2053 Yates Ave, Bronx, NY 10461 Bronx, NY 10463 200 E 196th St, Bronx, NY 10458 2151 Bath Ave, Brooklyn, NY 11214 483 E 48th St, Brooklyn, NY 11203 654 E 17th St, Brooklyn, NY 11230 430 Lefferts Ave, Brooklyn, NY 11225 Brooklyn, NY 11232 1504 E New York Ave, Brooklyn, NY 11212 147 Richards St, Brooklyn, NY 11231 175 New York Ave, Brooklyn, NY 11216 673 Lafayette Ave, Brooklyn, NY 11216 885 Gates Ave, Brooklyn, NY 11221 389 Central Ave, Brooklyn, NY 11221 342 Devoe St, Brooklyn, NY 11211 230 Java St, Brooklyn, NY 11222 85-21 111th St, Richmond Hill, NY 11418 Jamaica, NY 11433 5873 57th St, Maspeth, NY 11378 139-02 233rd St, Rosedale, NY 11422 158-42 86th St, Howard Beach, NY 11414 164-30 73rd Ave, Fresh Meadows, NY 11366 25-48 50th Ave, Long Island City, NY 11101 144-22 11th Ave, Whitestone, NY 11357 106-16 70th Ave, Forest Hills, NY 11375 Perimeter Rd, Jamaica, NY 11430 Flushing, NY 11371	Yonge-St.Clair The Beaches Thorncliffe Park Danforth East York Humewood-Cedarvale South Parkdale South Riverdale Church-Yonge Corridor Flemingdon Park Corso Italia-Davenport Junction Area Etobicoke West Mall Greenwood-Coxwell Trinity-Bellwoods West Humber-Clairville Kensington-Chinatown Annex University Hillcrest Village Highland Creek North St.James Town Wexford/Maryvale Elms-Old Rexdale Agincourt North Agincourt South-Malvern West Regent Park Weston-Pellam Park Bay Street Corridor Cabbagetown-South St.James Town Mount Pleasant West East End-Danforth Palmerston-Little Italy Playter Estates-Danforth Woburn Woodbine-Lumsden Bayview Village Weston Long Branch Dufferin Grove Lawrence Park North Yonge-Eglinton Moss Park Little Portugal Wychwood Briar Hill-Belgravia
2	643 W 24th St, New York, NY 10011 1430 Outlook Ave, Bronx, NY 10465 5 Frank Ct, Brooklyn, NY 11229 11N, Brooklyn, NY 11234 Rockaway Park, NY 11694 10-29 Bay 31st St, Far Rockaway, NY 11691 Cross Island Pkwy, Oakland Gardens, NY 11364	York University Heights Lansing-Westgate Yorkdale-Glen Park Tam O'Shanter-Sullivan Thistletown-Beaumond Heights Islington-City Centre West Scarborough Village St.Andrew-Windfields Taylor-Massey Humber Summit Humbermede Centennial Scarborough Clairlea-Birchmount Cliffcrest Ionview Broadview North Princess-Rosethorn North Riverdale Forest Hill North Glenfield-Jane Heights Guildwood Victoria Village Waterfront Communities-The Island West Hill Westminster-Branson Kennedy Park Kingsview Village-The Westway Bayview Woods-Steeles Clanton Park O'Connor-Parkview Old East York Kingsway South Runnymede-Bloor West Village Forest Hill South Henry Farm Caledonia-Fairbank Humber Heights-Westmount Roncesvalles Mount Dennis Dorset Park Edenbridge-Humber Valley Dovercourt-Wallace Emerson-Junction Newtonbrook West Niagara Beechborough-Greenbrook High Park North Oakridge Rosedale-Moore Park Oakwood Village Eglinton East Englemount-Lawrence Eringate-Centennial-West Deane L'Amoreaux Banbury-Don Mills Bathurst Manor Bendale Birchcliffe-Cliffside Lambton Baby Point Black Creek Willowdale East Mount Olive-Silverstone-Jamestown Mount Pleasant East Blake-Jones Rexdale-Kipling New Toronto Parkwoods-Donalda Pelmo Park-Humberlea Willowridge-Martingrove-Richview Bedford Park-Nortown Rockcliffe-Smythe Bridle Path-Sunnybrook-York Mills Lawrence Park South Maple Leaf Markland Wood Steeles Morningside Newtonbrook East Milliken Pleasant View Leaside-Bennington Mimico
3	548 W 53rd St, New York, NY 10019 641 E 169th St, Bronx, NY 10456 Brooklyn, NY 11236 225 Cadman Plz E, Brooklyn, NY 11201 8th St, Brooklyn, NY 11251 39-14 112th St, Corona, NY 11368 Astoria, NY 11105 Staten Island, NY 10301	Rouge Malvern
4	4317 17th Ave, Brooklyn, NY 11204 356 83rd St, Brooklyn, NY 11209 541 1st St, Brooklyn, NY 11215	(none)

To interpret the cluster labels, seaborn's catplot function was used to produce a bar plot for each Crime_type and City. The results can be seen below.

Based on crime rates, the clusters can be categorized as follows:

Cluster 3 - Very High Crime Rate
Cluster 1 - High Crime Rate
Cluster 2 - Moderate Crime Rate
Cluster 4 - Low Crime Rate
Cluster 0 - Very Low Crime Rate
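
The ranking above was read off the bar plots. As a rough sketch of how the same ordering could be derived programmatically, the snippet below ranks clusters by their mean crime rate and attaches the five category names. It assumes a single crime-rate value per neighbourhood and exactly five clusters; the `records` values and the `categorize_clusters` helper are hypothetical and are not taken from the notebook.

```python
from statistics import mean

# Hypothetical per-neighbourhood records: (cluster label, crime rate).
# The numbers are made up for illustration and are not the report's data.
records = [
    (0, 1.0), (0, 2.0),
    (1, 40.0), (1, 44.0),
    (2, 20.0), (2, 22.0),
    (3, 80.0), (3, 90.0),
    (4, 8.0), (4, 10.0),
]

# Category names from the list above, ordered from lowest to highest crime.
CATEGORIES = ["Very Low", "Low", "Moderate", "High", "Very High"]


def categorize_clusters(rows):
    """Rank clusters by mean crime rate and attach a category name to each."""
    by_cluster = {}
    for label, rate in rows:
        by_cluster.setdefault(label, []).append(rate)
    means = {label: mean(rates) for label, rates in by_cluster.items()}
    ranked = sorted(means, key=means.get)  # lowest mean first
    return {label: CATEGORIES[rank] for rank, label in enumerate(ranked)}


crime_category = categorize_clusters(records)
```

With several crime types per neighbourhood, as in the report, the rates would first have to be combined into one score (for example a sum or a mean), and that choice affects the ranking.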

To categorize the clusters by venue category, the following approach was used:

Fetch the full category list from Foursquare by calling requests.get() on this url.
The full category list has 10 master venue categories under which all venue categories are classified.
- Arts & Entertainment
- College & University
- Event
- Food
- Nightlife Spot
- Outdoors & Recreation
- Professional & Other Places
- Residence
- Shop & Service
- Travel & Transport
Create a dictionary mapping each venue category to its master category using a custom function.
Use groupby on the columns of the combined DataFrame to condense all venue category columns into 10 master category columns.
Use the seaborn.catplot() function to draw a box plot of the distribution of the venue master categories within each cluster and city.

The following plot was obtained.
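
The mapping and condensing steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's custom function: it assumes the category list is a nested structure in which each entry has a `name` and a `categories` list of children, and the trimmed `category_tree`, the helper names and the counts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, heavily trimmed category tree. The "name" / "categories"
# layout is an assumption about the shape of the fetched category list.
category_tree = [
    {"name": "Food", "categories": [
        {"name": "Asian Restaurant", "categories": [
            {"name": "Sushi Restaurant", "categories": []},
        ]},
        {"name": "Bakery", "categories": []},
    ]},
    {"name": "Nightlife Spot", "categories": [
        {"name": "Bar", "categories": []},
    ]},
]


def build_master_map(tree):
    """Map every category name, at any depth, to its top-level master category."""
    mapping = {}

    def walk(node, master):
        mapping[node["name"]] = master
        for child in node.get("categories", []):
            walk(child, master)

    for top in tree:
        walk(top, top["name"])
    return mapping


def condense(venue_counts, mapping):
    """Sum per-category venue counts into per-master-category counts."""
    totals = defaultdict(int)
    for category, count in venue_counts.items():
        totals[mapping[category]] += count
    return dict(totals)


master_of = build_master_map(category_tree)
row = {"Sushi Restaurant": 3, "Bakery": 2, "Bar": 5}  # one neighbourhood's counts
condensed = condense(row, master_of)
```

On a DataFrame with one column per venue category, the same dictionary could drive the groupby step described above, so that each neighbourhood ends up with one column per master category.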

Based on the above plots, the clusters can be categorized by venue type as follows:

Cluster Number	Category by Venue	Category by Crime
Cluster 0	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation	Very Low Crime Rate
Cluster 1	Food, Nightlife Spot, Outdoors & Recreation	High Crime Rate
Cluster 2	Travel & Transport, Outdoors & Recreation, Professional & Other Places	Moderate Crime Rate
Cluster 3	Arts & Entertainment, Residence, Travel & Transport	Very High Crime Rate
Cluster 4	Food, Shop & Service	Low Crime Rate

5. Discussion

5.1 Observations

New York and Toronto differ considerably in both crime data and venue data. While most neighbourhoods in New York have more than 100 popular venues within a 500 m radius, few neighbourhoods in Toronto have as many.
Analysis of the distance matrices (sorted and unsorted) of neighbourhoods in n-dimensional feature space showed that neighbourhoods in New York are highly dissimilar, while neighbourhoods in Toronto are highly similar and tightly packed in feature space.
Silhouette scores for the Toronto dataset indicated reasonable clustering, ranging from 0.3 to 0.35. Silhouette scores for the New York dataset, however, were quite low (~0.1), implying poor clustering in feature space.
When the combined Toronto and New York dataset was clustered to find similar neighbourhoods across the two cities, KMeans did not perform satisfactorily: 135 of the 136 Toronto neighbourhoods were placed in a single cluster.
Comparing various clustering algorithms on the combined dataset showed a similar bunching of most Toronto neighbourhoods into one cluster alongside New York neighbourhoods. Only the SpectralClustering algorithm spread the Toronto neighbourhoods reasonably across all clusters.
Inductive clustering, which first clusters the NYC neighbourhoods with AgglomerativeClustering and then uses KNeighborsClassifier to predict cluster labels for Toronto, gave much better results.
Bar plots of crime data across clusters and cities showed different crime rates between clusters. Further, in every cluster, crime rates of Toronto neighbourhoods were significantly lower than those of New York neighbourhoods.
Box plots of venue master categories across clusters and cities showed 2-3 prominent categories for each cluster. On the basis of this classification, the prominent characteristics of each cluster were identified.
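
The label-transfer step of the inductive approach can be illustrated with a small sketch. The report names scikit-learn's AgglomerativeClustering and KNeighborsClassifier; the snippet below deliberately swaps in a plain-Python k-nearest-neighbour vote and hard-coded cluster labels so that it runs without extra libraries. The feature vectors, the labels and the `knn_predict` helper are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors for NYC neighbourhoods that have
# already been clustered; the labels stand in for the clustering output.
nyc_features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
nyc_labels = [0, 0, 0, 1, 1, 1]

# Hypothetical Toronto neighbourhoods to be assigned to the NYC clusters.
toronto_features = [(0.3, 0.3), (4.8, 5.3)]
toronto_labels = [knn_predict(nyc_features, nyc_labels, q) for q in toronto_features]
```

Because each Toronto neighbourhood is assigned by its nearest NYC neighbours rather than clustered jointly, the tightly packed Toronto points cannot pull the cluster boundaries towards themselves, which may be why this approach spread them across clusters better than KMeans on the combined dataset.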

5.2 Recommendations

Cluster 0 is the best cluster as far as crime rates are concerned. The neighbourhoods in this cluster are:

NYC	Toronto	Crime Pattern	Venue Types
2824 W 17th St, Brooklyn, NY 11224	Stonegate-Queensway	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
Gulf Ave, Staten Island, NY 10314	Keelesdale-Eglinton West	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
391 Fairbanks Ave, Staten Island, NY 10306	Casa Loma	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
170 Sharrotts Ln, Staten Island, NY 10309	High Park-Swansea	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
	Downsview-Roding-CFB	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
	Willowdale West	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
	Don Valley Village	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation
	Woodbine Corridor	Very Low	Arts & Entertainment, Professional & Other Places, Outdoors & Recreation

More features, such as population, property rents and prices, traffic congestion data and pollution data, may be used to get better comparison and clustering results.
Instead of dropping columns with too many zeros, feature engineering to transform the data in such columns may be explored so that no data is lost.
Other techniques, such as the Minimum Spanning Tree, and different distance metrics with different algorithms over a range of parameters may be explored to find the best clustering strategy.

6. Conclusion

In this project, we compared neighbourhoods of New York and Toronto using crime data for recent years and the types of popular venues present in the vicinity of each neighbourhood. This involved retrieving data from openly available public datasets, scraping venue data from Foursquare and geocoding through free geocoders. A lot of data cleaning and feature engineering was then required before the data could be used for machine learning. We found that neighbourhoods in Toronto are much more similar to each other, while neighbourhoods in New York are quite dissimilar. We also compared the performance of various clustering algorithms on the combined dataset of New York and Toronto and found that the SpectralClustering algorithm gave the best results. We also established that InductiveClustering gave good clustering results comparable to SpectralClustering. We further identified the defining characteristics of each cluster on the basis of cluster-wise crime rates and the venue types present in the cluster. On the basis of this analysis, we identified 4 New York neighbourhoods and 8 Toronto neighbourhoods that are similar to each other, with the lowest crime rates and popular venues in categories such as Arts & Entertainment, Professional & Other Places, and Outdoors & Recreation. We also identified scope for further exploration and refinement of the data and clustering strategies that may improve the result.

Thank you for going through my notebook.