The location problem has long been a challenge for new businesses. In this post, we try to find a realistic location for a new business by applying machine learning methods to existing data on related businesses. First, we fetch all related business data via Foursquare and define the sufficiency and insufficiency of the business; then we use Isolation Forest from the Python library scikit-learn to find anomalous locations, since market anomalies can be great opportunities for investors. We also visualize the result on a map with the popular geo library Folium.
Contents
1. Introduction
1.1 Business Requests
The location problem has long been a challenge for new businesses, and many academic and industrial approaches focus on it. In this project we try to answer the question: where is a realistic location to start a new business, based on existing data? We use Chinese restaurants as the business category, applying machine learning to help investors make a better location choice in downtown Toronto.
We assume that an investor wants to start a new business serving Chinese food in downtown Toronto, due to its population density, higher average income, and cultural diversity.
You may jump directly to the All Venues Map or the Result Map below.
1.2 Analytic Approach
A good location should satisfy two criteria at the same time:
(1) Sufficient demand
(2) Insufficient support
To address the first criterion, we can assume that if a location already hosts many restaurants, there is likely strong demand for food service there.
As for the second criterion, if we can hardly find a Chinese restaurant in the area, we can say that the support is insufficient.
In summary, we need to obtain venue information for the food services in the area, along with their categories.
More specifically, we want to know how many restaurants operate in each area and how many of them serve Chinese or Asian food. Based on this information, we want to find the most interesting areas: those with sufficient demand and insufficient support for Chinese food.
2. Data Collection
We will mainly use data provided by Foursquare to perform our analysis.
Foursquare is a location technology platform that allows developers to fetch location data as well as venue information. With a free account, one can make 100K calls per day against its 105M+ points of interest.
To address the business request described above, we will collect information on all restaurants in downtown Toronto and examine their distribution, ratings, etc.
2.1 Scrape location info in Toronto
We use the pandas function read_html to get the list of postal codes in Toronto, along with their neighbourhoods.
```python
import pandas as pd
```
Step (1) Fetch postal codes in Toronto
We get the list of postal codes in Toronto from the Wikipedia page List_of_postal_codes_of_Canada, and perform some simple data cleaning.
```python
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
```
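The cell above is truncated; a minimal sketch of the full fetch, assuming the page's table order is stable, would be:

```python
# read_html returns every HTML table on the page as a DataFrame.
dfs = pd.read_html(url)
for i, d in enumerate(dfs):
    print(f'DataFrame[{i}]:{d.shape}')
dfs[0].head()
```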
DataFrame[0]:(180, 3)
DataFrame[1]:(4, 18)
DataFrame[2]:(2, 18)
Postal code | Borough | Neighborhood | |
---|---|---|---|
0 | M1A | Not assigned | NaN |
1 | M2A | Not assigned | NaN |
2 | M3A | North York | Parkwoods |
3 | M4A | North York | Victoria Village |
4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
Obviously, the first data frame is what we need.
Let's remove the 'Not assigned' rows based on the Borough column, and check for duplicates in the Postal code column.
```python
df = dfs[0]
```
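A sketch of the rest of the cleaning step, with column names following the Wikipedia table shown above:

```python
# Drop rows whose borough was never assigned, then verify uniqueness.
df = df[df['Borough'] != 'Not assigned']
print(df.shape)
print(df['Postal code'].nunique())
```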
(103, 3)
103
We are good: there is no duplication in the Postal code column.
Choose Postal code as the index.
```python
df.set_index('Postal code', inplace=True)
```
Borough | Neighborhood | |
---|---|---|
Postal code | ||
M3A | North York | Parkwoods |
M4A | North York | Victoria Village |
M5A | Downtown Toronto | Regent Park / Harbourfront |
M6A | North York | Lawrence Manor / Lawrence Heights |
M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
Step (2) Attach geo info to each postal code
We can get geospatial information for each postal code from the online CSV file Geospatial_data, then attach it to the existing data set.
```python
url = 'http://cocl.us/Geospatial_data'
```
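A sketch of the rest of this cell, assuming the CSV's columns match the head shown below:

```python
# Read the postal-code coordinates and index them the same way as df.
geo_info = pd.read_csv(url)
geo_info.set_index('Postal Code', inplace=True)
print(geo_info.shape)
geo_info.head()
```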
(103, 2)
Latitude | Longitude | |
---|---|---|
Postal Code | ||
M1B | 43.806686 | -79.194353 |
M1C | 43.784535 | -79.160497 |
M1E | 43.763573 | -79.188711 |
M1G | 43.770992 | -79.216917 |
M1H | 43.773136 | -79.239476 |
Now we can merge these two data sets into one.
```python
df = df.merge(geo_info, left_index=True, right_index=True)
```
Borough | Neighborhood | Latitude | Longitude | |
---|---|---|---|---|
Postal Code | ||||
M3A | North York | Parkwoods | 43.753259 | -79.329656 |
M4A | North York | Victoria Village | 43.725882 | -79.315572 |
M5A | Downtown Toronto | Regent Park / Harbourfront | 43.654260 | -79.360636 |
M6A | North York | Lawrence Manor / Lawrence Heights | 43.718518 | -79.464763 |
M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government | 43.662301 | -79.389494 |
Step (3) Visualize areas on the map
We can get a general idea of the area by visualizing these data on a map.
First we get the center point of the map:
```python
lat, lng = (df[['Latitude','Longitude']].max() + df[['Latitude','Longitude']].min()) / 2
```
(43.71926920000001, -79.38815804999999)
Then we can illustrate them on the map:
```python
# create map of Toronto using latitude and longitude values
```
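The plotting cell is truncated; a minimal Folium sketch (the marker styling is our choice) could look like this:

```python
import folium

# Centre the map on the midpoint computed above and add one marker per area.
map_toronto = folium.Map(location=[lat, lng], zoom_start=11)
for idx, row in df.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        popup=f"{idx}: {row['Neighborhood']}",
        fill=True,
    ).add_to(map_toronto)
map_toronto
```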
This map clearly shows the geographic scope of our research.
2.2 Fetch all ‘FOOD’ venues in Toronto
In this step we will employ the Foursquare API to fetch all venues under the category 'FOOD' in Toronto.
Step (1) Set API credentials
```python
#hide this cell while exporting#
```
{'CLIENT_ID': '********************',
'CLIENT_SECRET': '********************',
'VERSION': '20180605'}
Step (2) Fetch data via the Foursquare API
From the Foursquare API doc Venue Categories, we can identify the relevant part of the Foursquare venue category hierarchy, along with the category IDs.
Category hierarchy:
- Food: 4d4b7105d754a06374d81259
  - Asian Restaurant: 4bf58dd8d48988d142941735
    - Chinese Restaurant: 4bf58dd8d48988d145941735
```python
CATEGORYID = '4d4b7105d754a06374d81259'
```
Due to the restriction on the maximum number of API calls to Foursquare, we save the data for future usage.
```python
def dump2file( obj , name = None ):
```

```python
# search venues at specific location
```
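Both cells above are truncated. A sketch of what they plausibly contain, assuming the v2 venues/search endpoint and a CREDENTIALS dict like the one printed earlier (the helper names are illustrative):

```python
import pickle
import requests

def dump2file(obj, name=None):
    # Cache intermediate results locally to stay under the daily API quota.
    with open(f'{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)

def search_venues(lat, lng, radius=1000):
    # Search FOOD venues around one postal-code centre.
    params = dict(client_id=CREDENTIALS['CLIENT_ID'],
                  client_secret=CREDENTIALS['CLIENT_SECRET'],
                  v=CREDENTIALS['VERSION'],
                  ll=f'{lat},{lng}',
                  categoryId=CATEGORYID,
                  radius=radius,
                  limit=50)
    resp = requests.get('https://api.foursquare.com/v2/venues/search',
                        params=params).json()
    return pd.json_normalize(resp['response']['venues'])

# Query around every postal-code centre and stack the results.
frames = [search_venues(row['Latitude'], row['Longitude'])
          for _, row in df.iterrows()]
venues = pd.concat(frames, ignore_index=True)
dump2file(venues, 'venues')
```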
Removing duplicated venues by id
```python
venues.drop_duplicates('id', inplace=True)
```
(2073, 11)
Step (3) Review venues with 'Restaurant' in their category
Let's focus on the restaurants in the venues list, since the stakeholder/investor's purpose is to open a restaurant.
```python
len(venues['PrimaryCategory'].unique())
```
139
```python
# Find the most frequent categories by grouping by category and sorting by count, descending
categories_counts = venues.groupby('PrimaryCategory').count().sort_values('id', ascending=False)[['id']]
```
count | |
---|---|
PrimaryCategory | |
Coffee Shop | 317 |
Pizza Place | 130 |
Fast Food Restaurant | 111 |
Bakery | 97 |
Café | 91 |
Restaurant | 88 |
Chinese Restaurant | 81 |
Grocery Store | 65 |
Sandwich Place | 60 |
Caribbean Restaurant | 59 |
There are lots of categories under FOOD; most venues are coffee shops and pizza places, and even many grocery stores are included in the search result.
Let's focus on the real restaurants.
```python
restaurants = venues[venues['PrimaryCategory'].str.contains('Restaurant')]
```
(884, 11)
Now let's see how many Asian/Chinese restaurants are here:
```python
categories_counts.loc[['Restaurant','Asian Restaurant','Chinese Restaurant']]
```
count | |
---|---|
PrimaryCategory | |
Restaurant | 88 |
Asian Restaurant | 34 |
Chinese Restaurant | 81 |
It seems the category hierarchy is not well defined; for example, 'Chinese Restaurant' (81 venues) outnumbers its parent category 'Asian Restaurant' (34).
```python
restaurants_counts = restaurants.groupby('PrimaryCategory').count().sort_values('id', ascending=False)[['id']]
```
count | |
---|---|
PrimaryCategory | |
Fast Food Restaurant | 111 |
Restaurant | 88 |
Chinese Restaurant | 81 |
Caribbean Restaurant | 59 |
Italian Restaurant | 54 |
Middle Eastern Restaurant | 41 |
Indian Restaurant | 38 |
Asian Restaurant | 34 |
Sushi Restaurant | 34 |
Vietnamese Restaurant | 32 |
By reviewing the whole list, we set up a mapping on top of the current category hierarchy, which we will use for further analysis.
```python
AsianRestaurants = ['Asian Restaurant','Burmese Restaurant','Sushi Restaurant', 'Vietnamese Restaurant',
```
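The list above is truncated. A sketch of how such a mapping can be applied (the list contents here are illustrative, not the complete original):

```python
# Illustrative, partial groupings; the original lists are longer.
AsianRestaurants = ['Asian Restaurant', 'Burmese Restaurant',
                    'Sushi Restaurant', 'Vietnamese Restaurant']
ChineseRestaurants = ['Chinese Restaurant']

def map_category(cat):
    # Collapse each raw Foursquare category into one of three coarse groups.
    if cat in ChineseRestaurants:
        return 'ChineseR'
    if cat in AsianRestaurants:
        return 'AsianR_ExCN'
    return 'OtherR'

restaurants = restaurants.copy()
restaurants['Group'] = restaurants['PrimaryCategory'].apply(map_category)
```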
2.3 Show venues on map
Step (1) Find the center of the map
```python
lat, lng = (restaurants[['location.lat','location.lng']].max() + restaurants[['location.lat','location.lng']].min()) / 2
```
(43.706925113970044, -79.38951799620264)
Step (2) Mark the venues on the map
We use three colors in the visualization:
- Red: Chinese Restaurants
- Blue: Asian Restaurants, excluding Chinese Restaurants
- Green: All other restaurants
```python
# create map of Toronto using latitude and longitude values
```
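A sketch of the marker plotting, reusing the illustrative Group column from the mapping sketch above:

```python
# One circle marker per restaurant, coloured by category group.
colors = {'ChineseR': 'red', 'AsianR_ExCN': 'blue', 'OtherR': 'green'}
map_venues = folium.Map(location=[lat, lng], zoom_start=12)
for _, row in restaurants.iterrows():
    folium.CircleMarker(
        location=[row['location.lat'], row['location.lng']],
        radius=3,
        color=colors[row['Group']],
        fill=True,
    ).add_to(map_venues)
map_venues
```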
3. Methodology
In this part we will employ Isolation Forest to analyze the data and find the areas that satisfy:
(1) Sufficient demand
(2) Insufficient support
We define 'sufficient demand' as a higher average density of restaurants over a neighborhood's area, and 'insufficient support' as a lower average density of Chinese restaurants over that area.
Since business opportunities often arise at the edges of a system (much as an optimum often lies on the boundary of the feasible region), we employ an anomaly detection model, Isolation Forest, from the Python library scikit-learn.
3.1 Definition of Sufficiency and Insufficiency
Step (1) Get area info
From the City of Toronto Open Data Portal we can get all neighbourhood boundaries and areas, from which we can calculate the average density of food service providers over each area.
```python
url = 'https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=geojson&projection=4326'
```
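A sketch of the download, assuming the GeoJSON's per-feature properties can be reduced to the six attribute columns shown below:

```python
import requests

# Download the neighbourhood boundaries as GeoJSON and flatten the
# per-feature properties into a DataFrame.
geojson = requests.get(url).json()
areas = pd.json_normalize(geojson['features'])
```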
(140, 6)
AREA_ID | CODE | NAME | AREA | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|
0 | 4621 | 94 | Wychwood (94) | 3.217960e+06 | 43.676919 | -79.425515 |
1 | 4622 | 100 | Yonge-Eglinton (100) | 3.160334e+06 | 43.704689 | -79.403590 |
2 | 4623 | 97 | Yonge-St.Clair (97) | 2.222464e+06 | 43.687859 | -79.397871 |
3 | 4624 | 27 | York University Heights (27) | 2.541821e+07 | 43.765736 | -79.488883 |
4 | 4625 | 31 | Yorkdale-Glen Park (31) | 1.156669e+07 | 43.714672 | -79.457108 |
Step (2) Aggregate restaurants per area
Now we count the restaurants in each area, according to the predefined categories.
```python
restaurantsByArea = venuesByArea[venuesByArea['PrimaryCategory'].str.contains('Restaurant')]
```
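Neither the construction of venuesByArea nor the aggregation is shown in full. A sketch of one plausible approach, using a point-in-polygon test, reusing the illustrative Group column, and assuming the GeoJSON exposes an AREA_ID property:

```python
from shapely.geometry import Point, shape

def find_area(lat, lng, features):
    # Return the AREA_ID of the neighbourhood polygon containing the point.
    pt = Point(lng, lat)  # GeoJSON coordinate order is (lng, lat)
    for feat in features:
        if shape(feat['geometry']).contains(pt):
            return feat['properties']['AREA_ID']
    return None

venuesByArea = venues.copy()
venuesByArea['AREA_ID'] = venuesByArea.apply(
    lambda r: find_area(r['location.lat'], r['location.lng'],
                        geojson['features']), axis=1)
venuesByArea['Group'] = venuesByArea['PrimaryCategory'].apply(map_category)

# Count restaurants per area and group, then attach the area attributes.
counts = (venuesByArea[venuesByArea['PrimaryCategory'].str.contains('Restaurant')]
          .pivot_table(index='AREA_ID', columns='Group',
                       values='id', aggfunc='count', fill_value=0))
counts['AnyR'] = counts.sum(axis=1)
df_grp = counts.reset_index().merge(areas, on='AREA_ID', how='right').fillna(0)
```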
(140, 10)
ChineseR | AsianR_ExCN | OtherR | AnyR | AREA_ID | CODE | NAME | AREA | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 5.0 | 6.0 | 4621 | 94 | Wychwood (94) | 3.217960e+06 | 43.676919 | -79.425515 |
1 | 0.0 | 0.0 | 3.0 | 3.0 | 4622 | 100 | Yonge-Eglinton (100) | 3.160334e+06 | 43.704689 | -79.403590 |
2 | 0.0 | 1.0 | 11.0 | 12.0 | 4623 | 97 | Yonge-St.Clair (97) | 2.222464e+06 | 43.687859 | -79.397871 |
3 | 1.0 | 3.0 | 7.0 | 11.0 | 4624 | 27 | York University Heights (27) | 2.541821e+07 | 43.765736 | -79.488883 |
4 | 0.0 | 4.0 | 11.0 | 15.0 | 4625 | 31 | Yorkdale-Glen Park (31) | 1.156669e+07 | 43.714672 | -79.457108 |
Step (3) Calculate the average values
Now that we have the count of restaurants and the size of each area, we can calculate the density of the existing business.
To spread out the distribution, we apply a log transform to the mean values.
```python
import numpy as np

Avg_AsianR = df_grp.apply(lambda x: np.log(1 + x['AsianR_ExCN'] / (1 + x['AREA']) * 1e7), axis=1)
```
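A sketch applying the same transform to all four counts to build df_avg (the column-name mapping is inferred from the table below):

```python
# avg = log(1 + 1e7 * count / (1 + area_in_square_metres))
df_avg = df_grp[['AREA_ID']].copy()
for src, dst in [('ChineseR', 'Avg_ChineseR'), ('AsianR_ExCN', 'Avg_AsianR'),
                 ('OtherR', 'Avg_OtherR'), ('AnyR', 'Avg_AnyR')]:
    df_avg[dst] = np.log(1 + df_grp[src] / (1 + df_grp['AREA']) * 1e7)
print(df_avg.shape)
```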
(140, 5)
AREA_ID | Avg_ChineseR | Avg_AsianR | Avg_OtherR | Avg_AnyR | |
---|---|---|---|---|---|
0 | 4621 | 0.00000 | 1.412829 | 2.805648 | 2.977841 |
1 | 4622 | 0.00000 | 0.000000 | 2.350676 | 2.350676 |
2 | 4623 | 0.00000 | 1.704659 | 3.921866 | 4.007226 |
3 | 4624 | 0.33176 | 0.779442 | 1.322804 | 1.672902 |
4 | 4625 | 0.00000 | 1.494747 | 2.352334 | 2.636789 |
Step (4) Distribution of the mean values
```python
df_avg['Avg_AnyR'].hist();
```
3.2 Isolation Forest Anomaly Detection
As you might expect from the name, Isolation Forest works by explicitly isolating anomalous points in the dataset: anomalies need fewer random splits to be separated from the rest of the data, so they end up with shorter average path lengths across the trees.
Most business opportunities exist at those edge points, which drove us to apply the Isolation Forest anomaly detection model to find those specific opportunities.
Step (1) Define and Fit the model
We only consider the mean values in our model.
```python
IF_cols = ['Avg_ChineseR','Avg_AsianR','Avg_OtherR','Avg_AnyR']
```

```python
model = IsolationForest(n_estimators=50, max_samples='auto', contamination=0.1, max_features=4)
```
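The cell is truncated before the fit; a sketch of the complete step, assuming the features live in df_avg:

```python
from sklearn.ensemble import IsolationForest

# Fit the forest on the four log-density features; the repr echoed
# below is the notebook's display of the fitted estimator.
model = IsolationForest(n_estimators=50, max_samples='auto',
                        contamination=0.1, max_features=4)
model.fit(df_avg[IF_cols])
```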
IsolationForest(behaviour='deprecated', bootstrap=False, contamination=0.1,
max_features=4, max_samples='auto', n_estimators=50,
n_jobs=None, random_state=None, verbose=0, warm_start=False)
Now we have the model trained successfully.
Step (2) Attach scores and anomaly columns
Let's find the score and anomaly status for each sample. We can get the scores by calling decision_function() on the model above, passing the four mean values as input.
Likewise, we can get the anomaly status by calling the model's predict() function with the same four mean values.
```python
result_cols = ['AREA_ID', 'scores', 'anomaly', 'NAME', 'LATITUDE', 'LONGITUDE']
```
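A sketch of the scoring step; lower scores are more anomalous, and predict() labels anomalies as -1:

```python
# Attach anomaly scores and labels, then sort so the most anomalous
# areas appear first.
df_result = df_grp[['AREA_ID', 'NAME', 'LATITUDE', 'LONGITUDE']].copy()
df_result['scores'] = model.decision_function(df_avg[IF_cols])
df_result['anomaly'] = model.predict(df_avg[IF_cols])
df_result = df_result.sort_values('scores')[result_cols]
```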
```python
df_result
```
AREA_ID | scores | anomaly | NAME | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|
27 | 4649 | -0.118717 | -1 | North St.James Town (74) | 43.669623 | -79.375247 |
37 | 4660 | -0.115751 | -1 | Regent Park (72) | 43.659992 | -79.360509 |
2 | 4623 | -0.079746 | -1 | Yonge-St.Clair (97) | 43.687859 | -79.397871 |
42 | 4665 | -0.065879 | -1 | Rouge (131) | 43.821201 | -79.186343 |
40 | 4663 | -0.055065 | -1 | Roncesvalles (86) | 43.646123 | -79.442992 |
... | ... | ... | ... | ... | ... | ... |
24 | 4646 | 0.173633 | 1 | Newtonbrook West (36) | 43.785830 | -79.431422 |
113 | 4739 | 0.174701 | 1 | Forest Hill South (101) | 43.694526 | -79.414318 |
39 | 4662 | 0.177534 | 1 | Rockcliffe-Smythe (111) | 43.674790 | -79.494420 |
41 | 4664 | 0.180706 | 1 | Rosedale-Moore Park (98) | 43.682820 | -79.379669 |
9 | 4630 | 0.181015 | 1 | Leaside-Bennington (56) | 43.703797 | -79.366072 |
140 rows × 6 columns
Step (3) Visualize the result
We can use a histogram to visualize the scores from the model.
```python
df_result['scores'].hist();
```

```python
import seaborn as sns

sns.boxplot(df_result['scores']);
```
We can see that the anomalies lie on the far left side.
Step (4) Visualize on the map
For better impact, we put the model result on the map, which gives our stakeholders a better understanding of the data-driven approach.
We add three layers on top of the base map (a sketch follows the center-point calculation below):
- Choropleth map, to show the density of restaurants
- Circle markers, to plot all restaurants with their categories
- Leaflet markers, to flag the two most anomalous locations
```python
lat, lng = (venuesByArea[['location.lat','location.lng']].min() + venuesByArea[['location.lat','location.lng']].max()) / 2
```
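A sketch of the three layers, reusing the frames built above (the styling choices and the key_on path are our assumptions):

```python
m = folium.Map(location=[lat, lng], zoom_start=11)

# Layer 1: choropleth of overall restaurant density per neighbourhood.
folium.Choropleth(
    geo_data=geojson,
    data=df_avg,
    columns=['AREA_ID', 'Avg_AnyR'],
    key_on='feature.properties.AREA_ID',
    fill_color='YlGn',
    legend_name='Restaurant density (log scale)',
).add_to(m)

# Layer 2: one circle marker per restaurant, coloured by category group.
for _, row in restaurants.iterrows():
    folium.CircleMarker([row['location.lat'], row['location.lng']],
                        radius=2, color=colors[row['Group']]).add_to(m)

# Layer 3: leaflet markers flagging the two strongest anomalies.
for _, row in df_result.nsmallest(2, 'scores').iterrows():
    folium.Marker([row['LATITUDE'], row['LONGITUDE']],
                  popup=row['NAME'],
                  icon=folium.Icon(color='red')).add_to(m)
m
```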
4. Result
To open a new Chinese restaurant, we have two locations with potentially the highest opportunities, marked in red on the map above:
- North St.James Town
- Regent Park
In addition, three other locations may offer moderate opportunities, marked in orange on the map above:
- Yonge-St.Clair
- Rouge
- Roncesvalles
```python
df_grp[df_grp['AREA_ID'].isin([4649, 4660, 4623, 4665, 4663])]\
```
NAME | AREA | ChineseR | AsianR_ExCN | OtherR | AnyR | |
---|---|---|---|---|---|---|
2 | Yonge-St.Clair (97) | 2.222464e+06 | 0.0 | 1.0 | 11.0 | 12.0 |
28 | North St.James Town (74) | 8.113039e+05 | 1.0 | 1.0 | 3.0 | 5.0 |
39 | Regent Park (72) | 1.243326e+06 | 1.0 | 0.0 | 6.0 | 7.0 |
42 | Roncesvalles (86) | 2.875399e+06 | 0.0 | 1.0 | 0.0 | 1.0 |
44 | Rouge (131) | 7.214402e+07 | 0.0 | 1.0 | 5.0 | 6.0 |
5. Discussion
We could add more features to the model, such as venue ratings, business size, etc.
Also, it would help to fetch more data from sources other than Foursquare; combining sources may let us build a more accurate model.
We could also introduce other dimensions such as population, demographics, and income, since this information also shapes the consumer market.
6. Conclusion
Since many believe that business opportunities most likely arise in abnormal scenarios, we employed the Isolation Forest model to find outliers in the restaurant business, identifying locations that differ significantly from the majority of other locations.