This project studies the effects of regional weather data on motor vehicle accidents in New York City between October 12th, 2012 and December 31st, 2015. Correlations among the various weather features, and between each feature and the aggregate count of reported motor vehicle accidents, will be analyzed to determine which weather features can be excluded from analysis and how general weather trends impact the number of accidents to first order. A machine learning algorithm will then be selected, trained, and tested on the labelled data to determine its ability to predict the likelihood of motor vehicle accidents. The applications of such analysis are numerous, including but not limited to warning drivers of hazardous weather conditions, preparing emergency services for an influx of motor vehicle accidents based on weather forecasts, and estimating the monetary impact of weather conditions for insurance companies.
Regional weather data for New York City is gathered from Weather Underground for dates between July 1st, 1948 and December 31st, 2015 [1]. The data is taken from GitHub, where it has already been compiled and tabulated--that process is beyond the scope of this report; refer to the referenced source for more information. A tabulated list of all reported motor vehicle accidents in New York City is collected from Data.World and ranges from October 12th, 2012 to April 14th, 2020 [2].
The hypothesis for this project is (1) that days with poor weather conditions will see an increase in motor vehicle accidents, and (2) that the total number of motor vehicle accidents can be roughly predicted from a weather forecast. "Poor" weather conditions are defined as above-average precipitation, wind speed, and cloud coverage, as well as sufficiently high or low pressure and temperature for the day. The occurrence of weather events such as rain, fog, thunderstorms, and snow is also considered a poor weather condition. The numerical ranges for the poor-weather classification will be determined later. The second part of this report uses a classification algorithm to predict whether a given day's weather features (such as temperature, pressure, and precipitation) will result in a high, average, or low total of reported vehicle accidents for the day. The method for aggregating the number of reported accidents into discrete categories is discussed later.
All modules and methods used for this report are shown below. The raw data for the daily weather statistics as well as the complete motor vehicle accident logs are also shown below.
import numpy as np
from numpy.random import default_rng
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix, f1_score, balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.model_selection import GridSearchCV
weather_raw = pd.read_csv('https://raw.githubusercontent.com/zonination/weather-us/master/nyc.csv')
weather_raw
| Unnamed: 0 | Date | Max.TemperatureF | Mean.TemperatureF | Min.TemperatureF | Max.Dew.PointF | MeanDew.PointF | Min.DewpointF | Max.Humidity | Mean.Humidity | ... | Min.VisibilityMiles | Max.Wind.SpeedMPH | Mean.Wind.SpeedMPH | Max.Gust.SpeedMPH | PrecipitationIn | CloudCover | Events | WindDirDegrees.br... | city | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1948-07-01 | 84 | 78.0 | 72 | 71 | 65 | 58 | 93 | 65 | ... | 2.0 | 16 | 8 | NaN | 0.00 | 0.0 | Fog | 264<br /> | New York City (USA) | Summer |
| 1 | 2 | 1948-07-02 | 82 | 72.0 | 63 | 62 | 53 | 49 | 76 | 51 | ... | 10.0 | 16 | 10 | NaN | 0.00 | 0.0 | NaN | 315<br /> | New York City (USA) | Summer |
| 2 | 3 | 1948-07-03 | 78 | 71.0 | 64 | 66 | 58 | 53 | 84 | 62 | ... | 5.0 | 14 | 6 | NaN | 0.00 | 0.0 | NaN | 203<br /> | New York City (USA) | Summer |
| 3 | 4 | 1948-07-04 | 84 | 76.0 | 68 | 68 | 63 | 56 | 90 | 67 | ... | 2.0 | 12 | 5 | NaN | 0.00 | 0.0 | Fog | 198<br /> | New York City (USA) | Summer |
| 4 | 5 | 1948-07-05 | 93 | 82.0 | 70 | 74 | 69 | 65 | 93 | 71 | ... | 3.0 | 18 | 8 | NaN | 0.00 | 0.0 | Fog-Rain-Thunderstorm | 218<br /> | New York City (USA) | Summer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24555 | 24623 | 2015-12-27 | 63 | 56.0 | 48 | 58 | 51 | 37 | 96 | 82 | ... | 0.0 | 23 | 12 | 29.0 | 0.08 | 8.0 | Fog-Rain | 283<br /> | New York City (USA) | Winter |
| 24556 | 24624 | 2015-12-28 | 48 | 42.0 | 35 | 35 | 25 | 19 | 70 | 57 | ... | 6.0 | 23 | 14 | 37.0 | 0.05 | 7.0 | Rain-Snow | 39<br /> | New York City (USA) | Winter |
| 24557 | 24625 | 2015-12-29 | 49 | 42.0 | 34 | 46 | 40 | 30 | 93 | 88 | ... | 1.0 | 24 | 12 | 30.0 | 0.66 | 8.0 | Rain-Snow | 28<br /> | New York City (USA) | Winter |
| 24558 | 24626 | 2015-12-30 | 51 | 46.0 | 41 | 48 | 41 | 37 | 93 | 85 | ... | 2.0 | 22 | 7 | 26.0 | 0.37 | 8.0 | Rain | 65<br /> | New York City (USA) | Winter |
| 24559 | 24627 | 2015-12-31 | 52 | 48.0 | 43 | 46 | 38 | 27 | 93 | 71 | ... | 5.0 | 20 | 12 | 25.0 | 0.06 | 7.0 | Rain | 317<br /> | New York City (USA) | Winter |
24560 rows × 26 columns
accidents_raw = pd.read_csv('https://query.data.world/s/wyhfcc4evd3fydwgw3tdp2synck42u?dws=00000')
accidents_raw
C:\Users\andre\AppData\Local\Temp\ipykernel_21300\1668878680.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
accidents_raw = pd.read_csv('https://query.data.world/s/wyhfcc4evd3fydwgw3tdp2synck42u?dws=00000')
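The mixed-type warning above arises because pandas infers dtypes chunk by chunk on large files. It can be avoided by declaring the dtype of the offending column up front; a sketch on a tiny in-memory CSV (assuming the mixed column is ZIP CODE, which appears to hold both numeric and string codes):

```python
import io
import pandas as pd

# A miniature stand-in for the accidents CSV with a mixed-looking column
csv = io.StringIO(
    "CRASH DATE,ZIP CODE\n"
    "07/22/2019,11208\n"
    "08/11/2019,11234-0001\n"
)

# Forcing a dtype (or passing low_memory=False) removes the ambiguity
df = pd.read_csv(csv, dtype={"ZIP CODE": "string"}, low_memory=False)
print(df["ZIP CODE"].dtype)
```

The same `dtype=` argument can be passed to the real `pd.read_csv` call above.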
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/22/2019 | 19:00 | BROOKLYN | 11208.0 | 40.679080 | -73.877210 | POINT (-73.87721 40.67908) | NaN | NaN | 289 LOGAN STREET | ... | NaN | NaN | NaN | NaN | 4175832 | Sedan | NaN | NaN | NaN | NaN |
| 1 | 08/11/2019 | 15:35 | BROOKLYN | 11234.0 | 40.606000 | -73.928450 | POINT (-73.92845 40.606) | AVENUE U | EAST 35 STREET | NaN | ... | Driver Inattention/Distraction | NaN | NaN | NaN | 4190884 | Sedan | Sedan | NaN | NaN | NaN |
| 2 | 07/31/2019 | 15:10 | BROOKLYN | 11205.0 | 40.699192 | -73.955190 | POINT (-73.95519 40.699192) | FLUSHING AVENUE | WALWORTH STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4180546 | Box Truck | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 3 | 08/14/2019 | 17:00 | BROOKLYN | 11206.0 | 40.696354 | -73.940710 | POINT (-73.94071 40.696354) | MARCUS GARVEY BOULEVARD | MYRTLE AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4189386 | Sedan | Sedan | NaN | NaN | NaN |
| 4 | 07/22/2019 | 12:25 | QUEENS | 11374.0 | NaN | NaN | NaN | Woodhaven Boulevard | Metropolitan Avenue | NaN | ... | Unspecified | NaN | NaN | NaN | 4175283 | Box Truck | Sedan | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1635885 | 07/06/2012 | 23:45 | BRONX | 10454 | 40.809107 | -73.922887 | POINT (-73.9228869 40.8091071) | EAST 138 STREET | WILLIS AVENUE | NaN | ... | NaN | NaN | NaN | NaN | 72627 | PASSENGER VEHICLE | NaN | NaN | NaN | NaN |
| 1635886 | 07/05/2012 | 18:11 | QUEENS | 11361 | 40.766554 | -73.772545 | POINT (-73.7725446 40.7665544) | BELL BOULEVARD | 38 AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 259386 | SPORT UTILITY / STATION WAGON | SPORT UTILITY / STATION WAGON | NaN | NaN | NaN |
| 1635887 | 07/02/2012 | 22:56 | QUEENS | 11385 | 40.700593 | -73.854565 | POINT (-73.8545649 40.7005934) | WOODHAVEN BOULEVARD | FOREST PARK DRIVE | NaN | ... | Unspecified | NaN | NaN | NaN | 203707 | UNKNOWN | BICYCLE | NaN | NaN | NaN |
| 1635888 | 07/06/2012 | 12:00 | BROOKLYN | 11204 | 40.619963 | -73.982281 | POINT (-73.9822811 40.6199627) | 59 STREET | 20 AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 129942 | SPORT UTILITY / STATION WAGON | UNKNOWN | NaN | NaN | NaN |
| 1635889 | 07/03/2012 | 14:50 | MANHATTAN | 10012 | 40.725083 | -73.995328 | POINT (-73.9953279 40.7250834) | EAST HOUSTON STREET | LAFAYETTE STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 5309 | LARGE COM VEH(6 OR MORE TIRES) | PICK-UP TRUCK | NaN | NaN | NaN |
1635890 rows × 29 columns
The correlation between the weather statistics of a given day and the number of motor vehicle accidents must be handled on a day-by-day basis for accurate data aggregation and comparison. Therefore, only the range of dates covered by both data sets will be used for valid comparisons.
print("Weather Data")
weather_dates = pd.to_datetime(weather_raw["Date"]) # convert all listed dates as datetime values
print("Start Date: " + str(weather_dates.min()) )
print("End Date: " + str(weather_dates.max()) )
print('')
print("Collision Data")
accident_dates = pd.to_datetime(accidents_raw["CRASH DATE"], errors = "coerce") # force non-date entries to be NaT
print("Start Date: " + str(accident_dates.min()) )
print("End Date: " + str(accident_dates.max()) )
Weather Data
Start Date: 1948-07-01 00:00:00
End Date: 2015-12-31 00:00:00

Collision Data
Start Date: 2012-07-01 00:00:00
End Date: 2020-01-11 00:00:00
Therefore, this analysis will use the range of dates from October 12th, 2012 through December 31st, 2015.
The weather data frame is narrowed down to the valid date range as shown below.
is_past_start = (weather_dates >= '2012-10-12') # boolean expressions to check if within date range
is_before_end = (weather_dates <= '2015-12-31')
weather2 = weather_raw[is_past_start & is_before_end]
weather2.head(5)
| Unnamed: 0 | Date | Max.TemperatureF | Mean.TemperatureF | Min.TemperatureF | Max.Dew.PointF | MeanDew.PointF | Min.DewpointF | Max.Humidity | Mean.Humidity | ... | Min.VisibilityMiles | Max.Wind.SpeedMPH | Mean.Wind.SpeedMPH | Max.Gust.SpeedMPH | PrecipitationIn | CloudCover | Events | WindDirDegrees.br... | city | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23384 | 23449 | 2012-10-12 | 60 | 52.0 | 43 | 44 | 35 | 23 | 80 | 58 | ... | 10.0 | 32 | 15 | 38.0 | T | 3.0 | NaN | 298<br /> | New York City (USA) | Autumn |
| 23385 | 23450 | 2012-10-13 | 55 | 46.0 | 37 | 43 | 27 | 18 | 61 | 44 | ... | 10.0 | 24 | 9 | 29.0 | 0.00 | 2.0 | NaN | 201<br /> | New York City (USA) | Autumn |
| 23386 | 23451 | 2012-10-14 | 72 | 64.0 | 55 | 59 | 52 | 44 | 90 | 69 | ... | 10.0 | 24 | 17 | 29.0 | 0.00 | 6.0 | NaN | 210<br /> | New York City (USA) | Autumn |
| 23387 | 23452 | 2012-10-15 | 69 | 65.0 | 61 | 63 | 60 | 57 | 93 | 83 | ... | 2.0 | 23 | 14 | 28.0 | 0.18 | 7.0 | Rain | 208<br /> | New York City (USA) | Autumn |
| 23388 | 23453 | 2012-10-16 | 61 | 55.0 | 48 | 56 | 38 | 34 | 84 | 62 | ... | 10.0 | 29 | 15 | 36.0 | 0.00 | 3.0 | NaN | 320<br /> | New York City (USA) | Autumn |
5 rows × 26 columns
For analysis purposes, the data frame is simplified to a select few feature columns. The features used are mean temperature (in degrees Fahrenheit), mean humidity (as a percentage), mean barometric pressure (adjusted to sea level, in inches of mercury), mean wind speed (in MPH), precipitation (in inches), cloud cover (in oktas), and weather events (indicators for snow, rain, fog, etc.). The date column is also retained; it is later used as the index of the data frame.
These features serve as the major representatives of a typical day's weather conditions and are the ones most likely to correlate with vehicle accidents. Temperature and pressure describe the general weather pattern of the day, and their extremes can indicate hazardous conditions for drivers. Humidity combines with temperature to describe how the weather feels to the driver, as with "feels like" temperature gauges. Wind speed and cloud coverage capture adverse weather such as tropical storms, hurricanes, and thunderstorms. Precipitation is a driving impairment in its own right, and its source is specified by the weather events, each of which may affect collision probability differently. Average values are used instead of extrema (e.g., average rather than maximum temperature) because the reported collisions are aggregated by day: the weather at the moment of each collision is unknown, so the daily average is the best available representative.
wx_cols = pd.Series(weather_raw.columns)
used_features = [1, 3,9,12,18,20,21,22] # indices of features/columns used
wx_features = wx_cols[used_features]
print("All available weather features for analysis:")
print(wx_cols[2:-2])
print("")
print("Simplified weather data used:")
print(wx_features[1:])
All available weather features for analysis:
2              Max.TemperatureF
3             Mean.TemperatureF
4              Min.TemperatureF
5                Max.Dew.PointF
6                MeanDew.PointF
7                 Min.DewpointF
8                  Max.Humidity
9                 Mean.Humidity
10                 Min.Humidity
11     Max.Sea.Level.PressureIn
12    Mean.Sea.Level.PressureIn
13     Min.Sea.Level.PressureIn
14          Max.VisibilityMiles
15         Mean.VisibilityMiles
16          Min.VisibilityMiles
17            Max.Wind.SpeedMPH
18           Mean.Wind.SpeedMPH
19            Max.Gust.SpeedMPH
20              PrecipitationIn
21                   CloudCover
22                       Events
23         WindDirDegrees.br...
dtype: object

Simplified weather data used:
3             Mean.TemperatureF
9                 Mean.Humidity
12    Mean.Sea.Level.PressureIn
18           Mean.Wind.SpeedMPH
20              PrecipitationIn
21                   CloudCover
22                       Events
dtype: object
X2 = weather2[wx_features].copy()
X2["Date"] = pd.to_datetime(X2["Date"]) # convert date to datetime (assign back so the conversion sticks)
X3 = X2.set_index("Date")
X3.head()
| Mean.TemperatureF | Mean.Humidity | Mean.Sea.Level.PressureIn | Mean.Wind.SpeedMPH | PrecipitationIn | CloudCover | Events | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2012-10-12 | 52.0 | 58 | 30.20 | 15 | T | 3.0 | NaN |
| 2012-10-13 | 46.0 | 44 | 30.45 | 9 | 0.00 | 2.0 | NaN |
| 2012-10-14 | 64.0 | 69 | 30.15 | 17 | 0.00 | 6.0 | NaN |
| 2012-10-15 | 65.0 | 83 | 29.75 | 14 | 0.18 | 7.0 | Rain |
| 2012-10-16 | 55.0 | 62 | 29.79 | 15 | 0.00 | 3.0 | NaN |
The "events" feature is categorical data and must be converted into a dummy variable to be used. Each available weather event (e.g. snow, rain, fog) will be given a seperate feature and if that event was present that day, the feature will get assigned 1, else, the feature will get assigned 0. This allows for multiple weather events occuring on the same day to be accounted for, and for NaN values to be classified numerically as all 0 value entries. This has the downside of equally quantifying concurrent weather events, but such is a limitation of the data presented and it must be assumed that all weather events are equally present that day.
Each of the unique combinations of weather events presented in the data are shown below.
print(X3["Events"].unique()) # all possible weather combinations
event_type = ["Rain", "Snow", "Fog", "Thunderstorm"] # unique weather events
[nan 'Rain' 'Fog-Rain' 'Fog' 'Fog-Rain-Snow' 'Snow' 'Rain-Thunderstorm' 'Rain-Snow' 'Fog-Snow' 'Fog-Rain-Thunderstorm' 'Thunderstorm' 'Fog-Thunderstorm' 'Rain-Snow-Thunderstorm']
Therefore, the weather events of rain, snow, fog, and thunderstorms are used as dummy variables.
# boolean search if weather type is present that day
events = X3["Events"].fillna("") # treat days without events as empty strings
rain = events.str.contains("Rain").astype(int)            # rain present
snow = events.str.contains("Snow").astype(int)            # snow present
fog = events.str.contains("Fog").astype(int)              # fog present
tstorm = events.str.contains("Thunderstorm").astype(int)  # thunderstorm present
# add dummy features to dataframe
X3["Is_Rain"] = rain
X3["Is_Snow"] = snow
X3["Is_Fog"] = fog
X3["Is_Storm"] = tstorm
X4 = X3.drop(columns="Events") # event feature is obsolete now
X4 = X4.rename(columns={"Mean.TemperatureF" : "AvgTempF", "Mean.Humidity" : "AvgHumidity", "Mean.Sea.Level.PressureIn" : "AvgPressureIn", "Mean.Wind.SpeedMPH" : "AvgWindSpdMPH", "PrecipitationIn" : "PrecipIn"}) # rename columns to shorten text
X4.head()
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2012-10-12 | 52.0 | 58 | 30.20 | 15 | T | 3.0 | 0 | 0 | 0 | 0 |
| 2012-10-13 | 46.0 | 44 | 30.45 | 9 | 0.00 | 2.0 | 0 | 0 | 0 | 0 |
| 2012-10-14 | 64.0 | 69 | 30.15 | 17 | 0.00 | 6.0 | 0 | 0 | 0 | 0 |
| 2012-10-15 | 65.0 | 83 | 29.75 | 14 | 0.18 | 7.0 | 1 | 0 | 0 | 0 |
| 2012-10-16 | 55.0 | 62 | 29.79 | 15 | 0.00 | 3.0 | 0 | 0 | 0 | 0 |
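For reference, the same four indicator columns can be produced more compactly with pandas' built-in splitter; a sketch on toy data (the Events strings use "-" as a separator, and missing values come out as all-zero rows, matching the logic above):

```python
import pandas as pd

# Hyphen-separated event strings, with a missing value for an uneventful day
events = pd.Series(["Fog-Rain", None, "Snow", "Rain-Thunderstorm"])

# str.get_dummies splits on the separator and one-hot encodes each token;
# NaN entries become rows of zeros
dummies = events.str.get_dummies(sep="-")
print(dummies)
```

Columns come out in alphabetical order (Fog, Rain, Snow, Thunderstorm), so a rename would still be needed to match the `Is_*` naming used here.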
In the above table, it is observed that the precipitation is sometimes reported as "T" for trace. As trace amounts are too small to be measured yet non-zero, they are hereafter simplified and treated as zero-inch precipitation days.
X4["PrecipIn"] = X4["PrecipIn"].replace("T", 0) # replace trace amounts of precipitation
X4 = X4.astype("float") # convert stray values to floating point values
X = X4
X
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2012-10-12 | 52.0 | 58.0 | 30.20 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-13 | 46.0 | 44.0 | 30.45 | 9.0 | 0.00 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-14 | 64.0 | 69.0 | 30.15 | 17.0 | 0.00 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-15 | 65.0 | 83.0 | 29.75 | 14.0 | 0.18 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-16 | 55.0 | 62.0 | 29.79 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-12-27 | 56.0 | 82.0 | 29.97 | 12.0 | 0.08 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2015-12-28 | 42.0 | 57.0 | 30.41 | 14.0 | 0.05 | 7.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2015-12-29 | 42.0 | 88.0 | 30.12 | 12.0 | 0.66 | 8.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2015-12-30 | 46.0 | 85.0 | 30.23 | 7.0 | 0.37 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2015-12-31 | 48.0 | 71.0 | 30.06 | 12.0 | 0.06 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1176 rows × 10 columns
The target must be a single column for the correlation analysis, and later the learning algorithm, to apply to. The accident data is therefore greatly simplified to just the total number of collisions occurring each day.
daily_count = accident_dates.value_counts() # count occurrences of each date
daily_count = daily_count.to_frame()
daily_count2 = daily_count.reset_index(inplace=False) # make date a separate column
daily_count2.columns = ["Date", "No. of Collisions"]
daily_count3 = daily_count2.sort_values(by = "Date") # sort dataframe by date
daily_count3
| Date | No. of Collisions | |
|---|---|---|
| 1964 | 2012-07-01 | 538 |
| 1716 | 2012-07-02 | 564 |
| 653 | 2012-07-03 | 664 |
| 2601 | 2012-07-04 | 432 |
| 1438 | 2012-07-05 | 591 |
| ... | ... | ... |
| 2704 | 2020-01-07 | 381 |
| 2655 | 2020-01-08 | 410 |
| 2411 | 2020-01-09 | 479 |
| 2587 | 2020-01-10 | 438 |
| 2728 | 2020-01-11 | 350 |
2751 rows × 2 columns
The dates listed above are then filtered to the range specified earlier so they can be used alongside the feature data frame.
accident_dates_2 = daily_count3["Date"] # extract dates from list before
is_past_start = (accident_dates_2 >= '2012-10-12') # boolean expressions to check if within date range
is_before_end = (accident_dates_2 <= '2015-12-31')
collisions2 = daily_count3[is_past_start & is_before_end]
collisions3 = collisions2.set_index("Date") # convert date column back to index
collisions3
| No. of Collisions | |
|---|---|
| Date | |
| 2012-10-12 | 594 |
| 2012-10-13 | 547 |
| 2012-10-14 | 517 |
| 2012-10-15 | 607 |
| 2012-10-16 | 602 |
| ... | ... |
| 2015-12-27 | 518 |
| 2015-12-28 | 490 |
| 2015-12-29 | 533 |
| 2015-12-30 | 501 |
| 2015-12-31 | 527 |
1176 rows × 1 columns
We can now merge the feature and daily-count dataframes, ensuring the two contain data for exactly the same set of days. This combined dataframe is also used for the later correlation plots.
Xy = X.copy() # copy feature frame so X is not modified in place
Xy["Collisions"] = collisions3 # append collisions column
Xy
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | Collisions | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||
| 2012-10-12 | 52.0 | 58.0 | 30.20 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 594 |
| 2012-10-13 | 46.0 | 44.0 | 30.45 | 9.0 | 0.00 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 547 |
| 2012-10-14 | 64.0 | 69.0 | 30.15 | 17.0 | 0.00 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 517 |
| 2012-10-15 | 65.0 | 83.0 | 29.75 | 14.0 | 0.18 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 607 |
| 2012-10-16 | 55.0 | 62.0 | 29.79 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 602 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-12-27 | 56.0 | 82.0 | 29.97 | 12.0 | 0.08 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 | 518 |
| 2015-12-28 | 42.0 | 57.0 | 30.41 | 14.0 | 0.05 | 7.0 | 1.0 | 1.0 | 0.0 | 0.0 | 490 |
| 2015-12-29 | 42.0 | 88.0 | 30.12 | 12.0 | 0.66 | 8.0 | 1.0 | 1.0 | 0.0 | 0.0 | 533 |
| 2015-12-30 | 46.0 | 85.0 | 30.23 | 7.0 | 0.37 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 501 |
| 2015-12-31 | 48.0 | 71.0 | 30.06 | 12.0 | 0.06 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 527 |
1176 rows × 11 columns
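The assignment above relies on pandas aligning the two frames on their shared date index. An equivalent, more explicit construction (a sketch with toy data, not the frames above) is an inner join, which also drops any dates present in only one frame:

```python
import pandas as pd

# Two frames indexed by date: one with features, one with daily counts.
# The counts frame has one date the features frame lacks.
feat_idx = pd.to_datetime(["2012-10-12", "2012-10-13", "2012-10-14"])
features = pd.DataFrame({"AvgTempF": [52.0, 46.0, 64.0]}, index=feat_idx)

count_idx = pd.to_datetime(["2012-10-12", "2012-10-13", "2012-10-20"])
counts = pd.DataFrame({"Collisions": [594, 547, 430]}, index=count_idx)

# Inner join keeps only the dates common to both frames
merged = features.join(counts, how="inner")
print(merged)  # two rows: 2012-10-12 and 2012-10-13
```

Because the weather and collision frames here were both pre-filtered to the same 1176 dates, the two approaches give the same result; the join is simply more defensive.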
Finally, the complete dataframe is divided into the feature dataframe (X) and the label dataframe (y).
X = Xy[["AvgTempF", "AvgHumidity", "AvgPressureIn", "AvgWindSpdMPH", "PrecipIn", "CloudCover", "Is_Rain", "Is_Snow","Is_Fog","Is_Storm"]]
X
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2012-10-12 | 52.0 | 58.0 | 30.20 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-13 | 46.0 | 44.0 | 30.45 | 9.0 | 0.00 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-14 | 64.0 | 69.0 | 30.15 | 17.0 | 0.00 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-15 | 65.0 | 83.0 | 29.75 | 14.0 | 0.18 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-16 | 55.0 | 62.0 | 29.79 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-12-27 | 56.0 | 82.0 | 29.97 | 12.0 | 0.08 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2015-12-28 | 42.0 | 57.0 | 30.41 | 14.0 | 0.05 | 7.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2015-12-29 | 42.0 | 88.0 | 30.12 | 12.0 | 0.66 | 8.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2015-12-30 | 46.0 | 85.0 | 30.23 | 7.0 | 0.37 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2015-12-31 | 48.0 | 71.0 | 30.06 | 12.0 | 0.06 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1176 rows × 10 columns
y = Xy[["Collisions"]]
y
| Collisions | |
|---|---|
| Date | |
| 2012-10-12 | 594 |
| 2012-10-13 | 547 |
| 2012-10-14 | 517 |
| 2012-10-15 | 607 |
| 2012-10-16 | 602 |
| ... | ... |
| 2015-12-27 | 518 |
| 2015-12-28 | 490 |
| 2015-12-29 | 533 |
| 2015-12-30 | 501 |
| 2015-12-31 | 527 |
1176 rows × 1 columns
The summary statistics of the weather data for New York City from 2012-2015 are described below.
X.describe().iloc[[1,2,3,7,4,5,6]]
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| mean | 54.597789 | 63.653912 | 30.052738 | 11.244048 | 0.112355 | 5.136054 | 0.365646 | 0.092687 | 0.079932 | 0.051020 |
| std | 17.217773 | 13.884268 | 0.218680 | 4.300911 | 0.344442 | 2.053379 | 0.481816 | 0.290117 | 0.271303 | 0.220133 |
| min | 12.000000 | 27.000000 | 28.980000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 89.000000 | 97.000000 | 30.760000 | 36.000000 | 4.680000 | 8.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 41.000000 | 53.000000 | 29.910000 | 8.000000 | 0.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 55.000000 | 64.000000 | 30.050000 | 11.000000 | 0.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 70.000000 | 74.000000 | 30.200000 | 14.000000 | 0.030000 | 7.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
The statistical data can also be aggregated by calendar month across the full date range. The temperature and precipitation data for New York City are shown below.
X_monthly = X.reset_index(inplace=False) # copy dataframe
X_monthly["Date"] = pd.to_datetime(X_monthly["Date"]) # convert date to proper format
X_monthly["Month"] = pd.DatetimeIndex(X_monthly['Date']).month # add month column
# show summary statistics for each month, transpose to fit on screen
i = 0 # temperature data ranges from 0 to 8
X_monthly.groupby("Month").describe().transpose().iloc[[1+i*8,2+i*8,3+i*8,7+i*8,4+i*8,5+i*8,6+i*8]]
| Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AvgTempF | mean | 31.655914 | 30.130952 | 38.612903 | 51.322222 | 62.473118 | 71.377778 | 78.021505 | 75.892473 | 70.055556 | 59.194690 | 46.891667 | 43.169355 |
| std | 9.399924 | 7.708202 | 7.095670 | 6.003422 | 5.775871 | 5.529447 | 4.091291 | 3.402245 | 6.372979 | 6.355339 | 8.036467 | 7.926206 | |
| min | 13.000000 | 12.000000 | 19.000000 | 39.000000 | 51.000000 | 54.000000 | 67.000000 | 69.000000 | 58.000000 | 44.000000 | 28.000000 | 26.000000 | |
| max | 49.000000 | 46.000000 | 52.000000 | 65.000000 | 76.000000 | 83.000000 | 89.000000 | 83.000000 | 83.000000 | 73.000000 | 69.000000 | 63.000000 | |
| 25% | 25.000000 | 25.750000 | 35.000000 | 47.000000 | 59.000000 | 68.250000 | 75.000000 | 73.000000 | 64.000000 | 55.000000 | 41.750000 | 37.000000 | |
| 50% | 33.000000 | 29.000000 | 37.000000 | 52.000000 | 62.000000 | 72.000000 | 78.000000 | 76.000000 | 70.000000 | 60.000000 | 47.000000 | 43.000000 | |
| 75% | 39.000000 | 36.250000 | 44.000000 | 55.000000 | 66.000000 | 75.000000 | 80.000000 | 78.000000 | 74.750000 | 63.000000 | 52.250000 | 48.250000 |
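When only one feature is of interest, the positional `iloc` arithmetic above can be avoided by selecting the column before describing; a sketch on toy data (not the full frame):

```python
import pandas as pd

# Monthly summary for a single column, no index arithmetic needed
df = pd.DataFrame({"Month": [1, 1, 2, 2],
                   "AvgTempF": [30.0, 34.0, 28.0, 32.0]})
monthly_temp = df.groupby("Month")["AvgTempF"].describe()
print(monthly_temp[["mean", "std", "min", "max"]])
```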
i = 4
X_monthly.groupby("Month").describe().transpose().iloc[[1+i*8,2+i*8,3+i*8,7+i*8,4+i*8,5+i*8,6+i*8]] # precipitation
| Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PrecipIn | mean | 0.113441 | 0.11750 | 0.122688 | 0.115222 | 0.086989 | 0.172778 | 0.101075 | 0.113871 | 0.065889 | 0.082566 | 0.081500 | 0.171452 |
| std | 0.261187 | 0.28166 | 0.334333 | 0.517664 | 0.249703 | 0.501237 | 0.277212 | 0.427823 | 0.204131 | 0.293582 | 0.252605 | 0.384580 | |
| min | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
| max | 1.620000 | 1.27000 | 2.600000 | 4.680000 | 1.540000 | 4.010000 | 1.870000 | 2.920000 | 1.180000 | 1.810000 | 1.410000 | 3.040000 | |
| 25% | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
| 50% | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
| 75% | 0.060000 | 0.04750 | 0.040000 | 0.025000 | 0.020000 | 0.087500 | 0.030000 | 0.000000 | 0.000000 | 0.010000 | 0.000000 | 0.135000 |
The average temperature is around 54.6F with an average daily precipitation of 0.11". The monthly summary statistics show that the coldest month (January) averages around 31.7F and the warmest month (July) around 78.0F. The climate of New York City can be categorized using the Köppen climate classification [3]. The variable monthly temperatures and fairly low precipitation rule out tropical and arid climate classifications. As the average January temperature sits at the 32F (0°C) lower bound for the temperate group (and above the 26.6F/-3°C bound of the older convention), and the average July temperature is well above 50F, this is a temperate climate.
Slightly more precipitation is observed in the summer months (June-August, avg: 0.129") than the winter months (January-March, avg: 0.118"), so this region has a wet summer. No dry summer months are observed, and the warmest month (July) has a mean temperature above 71.6F (22°C). Therefore, the climate of New York City is classified as a humid subtropical climate (Cfa), which is corroborated by literature sources [4].
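The classification steps above can be sketched as explicit checks (an illustrative simplification, not a full Köppen classifier; thresholds follow the common convention of a coldest month between -3°C/26.6F, or 0°C/32F in the modern variant, and 18°C/64.4F, a warmest month above 10°C/50F, and for Cfa a warmest month of at least 22°C/71.6F with no dry summer):

```python
def koppen_group_c(coldest_f: float, warmest_f: float) -> bool:
    """Temperate (group C): coldest month between 26.6F (-3C) and
    64.4F (18C), warmest month above 50F (10C). Simplified."""
    return 26.6 <= coldest_f <= 64.4 and warmest_f > 50.0

def is_cfa(coldest_f: float, warmest_f: float, dry_summer: bool) -> bool:
    """Humid subtropical (Cfa): temperate, no dry summer, and
    warmest month at or above 71.6F (22C)."""
    return (koppen_group_c(coldest_f, warmest_f)
            and not dry_summer
            and warmest_f >= 71.6)

# NYC monthly means from the tables above: Jan ~31.7F, Jul ~78.0F, wet summer
print(is_cfa(31.7, 78.0, dry_summer=False))  # True
```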
Note that classifying the full data range from 1948 to 2015 would provide a more reliable result, though the classification may itself shift over such a span as the climate changes. For now, this is considered an accurate classification, so the weather patterns experienced in New York City should be typical of cities with climate classification Cfa. This suggests that the same environmental factors (weather features) may carry similar importance for collision likelihood in regions with similar climates, as inhabitants will likely have comparable experience with hazardous conditions. However, city-to-city comparisons of driving remain confounded, as collisions may also be attributed to higher traffic density, differing infrastructure investment, different municipal regulations, etc. This classification nonetheless provides a basis for future research and can be used to control for differing weather conditions when comparing collision rates between two cities--e.g., comparing the daily collision count per 1,000 residents in New York City versus Los Angeles would carry a slight bias because the two regions have different climates and thus different multi-year weather patterns.
The variation of each major weather classification category used in this analysis by day can be graphically shown.
cols = X.columns
for i in range(0,6): # range over each X column (exclude dummy variables)
sns.relplot(data=X_monthly, x="Month", y=cols[i], kind="line");
We observe some expected cyclical trends in the above data. The temperature distribution throughout the year is roughly symmetrical and peaks some time in July, as shown in the previous analysis. The average humidity follows a similar trend but appears to peak in June; more data is needed to establish the correlation between these features. Another notable pattern is that the average wind speed peaks around March. This is unexpected, as intuition would place the peak in late summer to early fall to coincide with hurricane season, but it could be an effect of the rapid temperature increases occurring from February through April.
While the precipitation was before said to be more in the summer than winter (and thus categorizing the city as a subtropical climate), there are large fluctuations in the data making it difficult to prove any correlation between the time of year and expected precipitation without further analysis.
We can also plot the temperature distribution as a box-and-whiskers plot to observe the data distribution within each month. The overall cyclical trend is reproduced, but a large IQR appears in the winter months, suggesting the region is prone to intense and longer-lasting periods of extreme hot or cold weather, likely brought by storm activity or the westerlies. This inference is based on the climate classification and the large wind speeds observed in those months above.
sns.catplot(data=X_monthly,
x="Month", y="AvgTempF",
kind="box"
);
The number of vehicle collisions can be plotted versus time to show the relationship to the time of year and vehicle collisions.
# use daily_count3 to get a longer range view
daily_count3['Month'] = pd.DatetimeIndex(daily_count3['Date']).month # add a month column
sns.relplot(data=daily_count3,
x="Month", y="No. of Collisions",
kind = "line",
);
It is observed in the plot above that the number of vehicle collisions peaks in the month of June and is at a minimum in January. This is most likely explained by the increased number of drivers on the road during warmer, summer months compared to colder, winter months. This is an important bias to the data as the time of year correlates very strongly to the temperature and thus it would suggest that the temperature would have a high correlation with the number of vehicle collisions.
It has yet to be established whether the features used for classification are independent of one another. This is vital to the efficiency of the classification algorithm, as increasing the number of input features drastically increases computation time and can produce misleading results. It is therefore imperative to determine how the features are correlated with one another.
print("Weather Features: ")
print(*np.array(X.columns), sep = ", ")
Weather Features: 
AvgTempF, AvgHumidity, AvgPressureIn, AvgWindSpdMPH, PrecipIn, CloudCover, Is_Rain, Is_Snow, Is_Fog, Is_Storm
For each feature listed above, a pair plot can be created. The first six are plotted together, as these are the features with meaningful graphical comparisons.
cols = X.columns[0:6]
sns.pairplot(data=X[cols], height=2);
As shown above, there appears to be no correlation between the (average) temperature, humidity, pressure, and wind speed, even though these are all characteristic parameters of a weather forecast. The cloud cover feature is a quantized value more accurately determined from qualitative data, and as such sees little correlation with the remaining data. There are some trends noted, such as the lack of low-pressure, low-cloud-coverage data points. As low pressure corresponds to increased winds and storm activity, cloud formation occurs, which explains the gap in the data. There is also a significant amount of low-to-zero precipitation data, which prevents any correlation from being drawn with the other data values.
There is no expected correlation between a weather event, such as fog or rain, and the other quantitative data, such as temperature or wind speed, as the event data are dummy variables classifying the weather events of the day and can only be 1 or 0. Some notable expected patterns are the lack of high-temperature snow days, low-pressure no-rain days, and low-humidity rain days. To verify this, the comparisons between the weather events and the remaining five features can be shown in pairwise plots. A total of 20 such plots would be created, so the code is commented out to avoid clutter.
'''
events = ['Is_Rain', 'Is_Snow', 'Is_Fog', 'Is_Storm']
other_weather = ['AvgTempF', 'AvgHumidity', 'AvgPressureIn', 'AvgWindSpdMPH', 'CloudCover']

for i in range(len(events)): # for each weather event
    for j in range(len(other_weather)): # for each weather feature
        sns.relplot(data=X, x=events[i], y=other_weather[j]);
'''
The correlation between weather events occurring simultaneously can also be established. The total number of occurrences for each possible weather event combination is shown below, sorted by occurrence.
event_count = weather2["Events"].value_counts()
print(event_count)
plt.scatter(event_count, event_count.index);
Rain                      275
Snow                       58
Fog-Rain                   52
Rain-Thunderstorm          52
Rain-Snow                  32
Fog                        18
Fog-Rain-Snow              13
Fog-Snow                    5
Fog-Rain-Thunderstorm       5
Thunderstorm                1
Fog-Thunderstorm            1
Rain-Snow-Thunderstorm      1
Name: Events, dtype: int64
In order to show two weather events are correlated, it must be concluded that (1) for most occurrences of X, Y occurs, and (2) for most occurrences of Y, X occurs. For example, almost all thunderstorms are accompanied by rain, as shown by the small number of thunderstorms occurring alone compared to the larger frequencies of the Rain-Thunderstorm and Fog-Rain-Thunderstorm pairs. However, thunderstorms cannot be removed as a feature, as the high frequencies of the Rain and Fog-Rain pairs make the dependence of thunderstorms on fog and rain hard to establish. No parameter is clearly obsolete from this analysis, so each weather event is retained for classification purposes.
A similar way to reach this conclusion is by creating a Venn diagram of the four events and comparing the areas of the overlapping regions to the area of each event's full section--for example, as there is little overlap between Rain days and other weather event days compared to the total number of Rain occurrences, this feature cannot be ignored.
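The overlap areas need not be drawn: since the events are stored as dummy columns, the same quantities can be computed directly as conditional frequencies. The sketch below uses synthetic dummy columns standing in for `X`; the co-occurrence pattern is made up to mimic the thunderstorm-rain relationship noted above.

```python
import itertools
import numpy as np
import pandas as pd

# Synthetic stand-in for the event dummy columns of X (frequencies are made up).
rng = np.random.default_rng(1)
n = 1000
X_ev = pd.DataFrame({
    "Is_Rain":  rng.random(n) < 0.35,
    "Is_Snow":  rng.random(n) < 0.08,
    "Is_Fog":   rng.random(n) < 0.07,
    "Is_Storm": rng.random(n) < 0.05,
}).astype(int)
# force thunderstorms to co-occur with rain, as observed in the report
X_ev.loc[X_ev["Is_Storm"] == 1, "Is_Rain"] = 1

# For each ordered pair (A, B): the fraction of A-days that are also B-days,
# i.e. the Venn overlap area relative to circle A.
events = ["Is_Rain", "Is_Snow", "Is_Fog", "Is_Storm"]
overlap = {}
for a, b in itertools.permutations(events, 2):
    both = ((X_ev[a] == 1) & (X_ev[b] == 1)).sum()
    overlap[(a, b)] = both / X_ev[a].sum()
    print(f"P({b}=1 | {a}=1) = {overlap[(a, b)]:.2f}")
```

Only when both directions of a pair are close to 1 would one event be redundant given the other, matching the two-condition criterion stated above.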
A table of the Spearman coefficients between each pair of weather features is calculated in order to verify these conclusions.
full_spearman = X.corr(method='spearman') # compute Spearman correlation
for i in range(len(X.columns)):
    for j in range(len(X.columns)):
        if i == j:
            full_spearman.iloc[i, j] = None # remove all values on the diagonal
full_spearman
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| AvgTempF | NaN | 0.276427 | -0.242008 | -0.251217 | -0.029748 | -0.054441 | 0.107183 | -0.411795 | -0.069609 | 0.236140 |
| AvgHumidity | 0.276427 | NaN | -0.221299 | -0.215239 | 0.583763 | 0.625396 | 0.566843 | -0.005344 | 0.343545 | 0.214482 |
| AvgPressureIn | -0.242008 | -0.221299 | NaN | -0.284828 | -0.284235 | -0.211834 | -0.302258 | -0.088026 | -0.112153 | -0.189865 |
| AvgWindSpdMPH | -0.251217 | -0.215239 | -0.284828 | NaN | 0.069912 | -0.045296 | 0.035806 | 0.151964 | -0.041808 | -0.034684 |
| PrecipIn | -0.029748 | 0.583763 | -0.284235 | 0.069912 | NaN | 0.621792 | 0.761792 | 0.211847 | 0.281316 | 0.330087 |
| CloudCover | -0.054441 | 0.625396 | -0.211834 | -0.045296 | 0.621792 | NaN | 0.615265 | 0.213627 | 0.245711 | 0.155206 |
| Is_Rain | 0.107183 | 0.566843 | -0.302258 | 0.035806 | 0.761792 | 0.615265 | NaN | 0.037411 | 0.231971 | 0.289359 |
| Is_Snow | -0.411795 | -0.005344 | -0.088026 | 0.151964 | 0.211847 | 0.213627 | 0.037411 | NaN | 0.100422 | -0.060783 |
| Is_Fog | -0.069609 | 0.343545 | -0.112153 | -0.041808 | 0.281316 | 0.245711 | 0.231971 | 0.100422 | NaN | 0.017158 |
| Is_Storm | 0.236140 | 0.214482 | -0.189865 | -0.034684 | 0.330087 | 0.155206 | 0.289359 | -0.060783 | 0.017158 | NaN |
# track current maximum value and position
max_val = 0
max_val_pos = [0, 0]
# loop over all values in the table to find the maximum (by absolute value)
for i in range(len(X.columns)):
    for j in range(len(X.columns)):
        if abs(full_spearman.iloc[i, j]) > abs(max_val):
            max_val = full_spearman.iloc[i, j] # save new max
            max_val_pos = [i, j]
print('Most correlated features: ' + str(X.columns[max_val_pos[0]]) + ' and ' + str(X.columns[max_val_pos[1]]))
print('Corresponding Spearman coefficient: ' + str(max_val))
Most correlated features: PrecipIn and Is_Rain
Corresponding Spearman coefficient: 0.7617920864237233
The most highly correlated features are the reported precipitation and the occurrence of rain, as expected. However, as described before, the "Is_Rain" feature is needed to specify the particular type of precipitation that creates a hazardous condition. Beyond this, the next most correlated features are the cloud coverage and the average humidity, with a correlation of only 0.625. Therefore, all features are retained for computation and are considered effectively uncorrelated.
It was established that none of the parameters depends directly on any of the others; e.g., rain cannot be expressed as a function of pressure. However, the above analysis did not consider whether multiple features can combine to correlate with another feature. For example, a high amount of precipitation on its own does not correspond to the presence of a storm, but high precipitation and low pressure might. It is computationally exhaustive to compute every pairwise relation and its effect on each weather event, but educated guesses can be employed to probe this two-factor correlation.
The occurrence of snow may be correlated with temperature and pressure, as certain conditions must be met for snow to accumulate. Storms are also correlated with higher wind speeds and lower pressures, so these features are examined together as well. As described before, low pressure and high precipitation may be indicative of a storm, so these are plotted too. Finally, rain is typically preceded by ample nimbus cloud coverage and followed by high humidity throughout the day, so these are plotted together. For each pairwise relation, a scatter plot is shown with the weather event as the hue, and the Spearman coefficient is computed separately for days with and without the event.
print("Pressure vs Temp for Snow Occurrence")
sns.relplot(data=X, x="AvgTempF", y="AvgPressureIn", hue="Is_Snow"); # plot pressure vs temperature with hues as snow
snow_occ = X[X["Is_Snow"] == 1] # days on which snow occurs
snow_no_occ = X[X["Is_Snow"] == 0] # days on which no snow occurs
print("spearman coeff. w/ snow: " + str(snow_occ["AvgTempF"].corr(snow_occ["AvgPressureIn"], "spearman")))
print("spearman coeff. w/o snow: " + str(snow_no_occ["AvgTempF"].corr(snow_no_occ["AvgPressureIn"], "spearman")))
plt.show()
print('')
print("Wind Speed vs Pressure for Storm Occurrence")
sns.relplot(data=X, x="AvgPressureIn", y="AvgWindSpdMPH", hue="Is_Storm"); # plot pressure vs wind speed with hues as storm
storm_occ = X[X["Is_Storm"] == 1] # days on which a storm occurs
storm_no_occ = X[X["Is_Storm"] == 0] # days on which no storm occurs
print("spearman coeff. w/ storm: " + str(storm_occ["AvgPressureIn"].corr(storm_occ["AvgWindSpdMPH"], "spearman")))
print("spearman coeff. w/o storm: " + str(storm_no_occ["AvgPressureIn"].corr(storm_no_occ["AvgWindSpdMPH"], "spearman")))
plt.show()
print('')
print("Precipitation vs Pressure for Storm Occurrence")
sns.relplot(data=X, x="AvgPressureIn", y="PrecipIn", hue="Is_Storm"); # plot pressure vs precipitation with hues as storm
print("spearman coeff. w/ storm: " + str(storm_occ["AvgPressureIn"].corr(storm_occ["PrecipIn"], "spearman")))
print("spearman coeff. w/o storm: " + str(storm_no_occ["AvgPressureIn"].corr(storm_no_occ["PrecipIn"], "spearman")))
plt.show()
print('')
print("Humidity vs Cloud Coverage for Rain Occurrence")
sns.relplot(data=X, x="CloudCover", y="AvgHumidity", hue="Is_Rain"); # plot cloud coverage vs humidity with hues as rain
rain_occ = X[X["Is_Rain"] == 1] # days on which rain occurs
rain_no_occ = X[X["Is_Rain"] == 0] # days on which no rain occurs
print("spearman coeff. w/ rain: " + str(rain_occ["CloudCover"].corr(rain_occ["AvgHumidity"], "spearman")))
print("spearman coeff. w/o rain: " + str(rain_no_occ["CloudCover"].corr(rain_no_occ["AvgHumidity"], "spearman")))
plt.show()
Pressure vs Temp for Snow Occurrence
spearman coeff. w/ snow: -0.17878996089695212
spearman coeff. w/o snow: -0.3144606791051107

Wind Speed vs Pressure for Storm Occurrence
spearman coeff. w/ storm: -0.3725827041242697
spearman coeff. w/o storm: -0.2932232506528725

Precipitation vs Pressure for Storm Occurrence
spearman coeff. w/ storm: -0.13826739455694717
spearman coeff. w/o storm: -0.23667109437839273

Humidity vs Cloud Coverage for Rain Occurrence
spearman coeff. w/ rain: 0.6470294494629876
spearman coeff. w/o rain: 0.35438050802803356
As observed in the above graphs, there are no strong correlations among any three connected features of the weather data. The most correlated relation found was between the occurrence of rain and a position in the average humidity & cloud coverage feature space (0.647 for days with rain, 0.354 for days without). However, this is still not a high enough correlation to disregard the rain weather event, so the rain event must be considered independently of the other features when creating a classification algorithm; the same holds for the storm event observed above. Note that this is a large simplification of the data but provides a first-order approximation of how strongly the features are intercorrelated. For higher-order correlation, and for more complex predictions, the classification algorithm must be used, as the calculations above suggest only weak correlations between all features.
The correlation between each weather parameter and the amount of reported vehicle collisions can be assessed. A correlation plot is given between the individual parameter (X-axis) and the number of total collisions (Y-axis)--for each plot, the Pearson and Spearman correlation coefficients are reported as well. For the specific weather events, the correlations are directly computed as these are categorical data quantified by binary dummy classifiers and thus plotted data would not be as representative.
X_var = pd.Series(X.columns, name="feature") # all weather features
pearsons = []
spearmans = []
for i in range(len(X_var)): # loop over each feature
    print("No. of collisions vs. " + str(X_var[i]))
    # print the coefficients
    print("Pearson Coeff.: " + str(X[X_var[i]].corr(Xy['Collisions'], "pearson")))
    print("Spearman Coeff.: " + str(X[X_var[i]].corr(Xy['Collisions'], "spearman")))
    # add to list of coefficients
    pearsons.append(X[X_var[i]].corr(Xy['Collisions'], "pearson"))
    spearmans.append(X[X_var[i]].corr(Xy['Collisions'], "spearman"))
    # display plot
    if i < len(X_var) - 4: # non-categorical data
        sns.relplot(data=Xy, x=X_var[i], y='Collisions')
        plt.show()
    print('')
No. of collisions vs. AvgTempF
Pearson Coeff.: 0.20467144652139668
Spearman Coeff.: 0.23901028163707264

No. of collisions vs. AvgHumidity
Pearson Coeff.: 0.08215998681606174
Spearman Coeff.: 0.08004977307082761

No. of collisions vs. AvgPressureIn
Pearson Coeff.: -0.028578100529787716
Spearman Coeff.: -0.04203623218258776

No. of collisions vs. AvgWindSpdMPH
Pearson Coeff.: -0.03889247226056687
Spearman Coeff.: -0.030474391389913476

No. of collisions vs. PrecipIn
Pearson Coeff.: 0.1732000911740921
Spearman Coeff.: 0.0940116119954689

No. of collisions vs. CloudCover
Pearson Coeff.: -0.0623599947930078
Spearman Coeff.: -0.07306310534018964

No. of collisions vs. Is_Rain
Pearson Coeff.: 0.07321402775491874
Spearman Coeff.: 0.06373261383027515

No. of collisions vs. Is_Snow
Pearson Coeff.: -0.07387382158194554
Spearman Coeff.: -0.10681662185406175

No. of collisions vs. Is_Fog
Pearson Coeff.: 0.021309140379378786
Spearman Coeff.: 0.0017503268421786807

No. of collisions vs. Is_Storm
Pearson Coeff.: 0.09826594937604892
Spearman Coeff.: 0.1060325779368204
The Pearson and Spearman coefficients for each metric are summarized in the table shown below.
# convert lists to series
pearsons_s = pd.Series(pearsons, name = "pearsons_s")
spearmans_s = pd.Series(spearmans, name = "spearmans_s")
coeff = pd.concat([X_var, pearsons_s, spearmans_s], axis=1) # convert to dataframe
coeffs = coeff.rename(columns={"pearsons_s" : "pearson", "spearmans_s" : "spearman"}) # rename columns
coeffs = coeffs.set_index("feature") # index by weather feature
coeffs.head(10)
| pearson | spearman | |
|---|---|---|
| feature | ||
| AvgTempF | 0.204671 | 0.239010 |
| AvgHumidity | 0.082160 | 0.080050 |
| AvgPressureIn | -0.028578 | -0.042036 |
| AvgWindSpdMPH | -0.038892 | -0.030474 |
| PrecipIn | 0.173200 | 0.094012 |
| CloudCover | -0.062360 | -0.073063 |
| Is_Rain | 0.073214 | 0.063733 |
| Is_Snow | -0.073874 | -0.106817 |
| Is_Fog | 0.021309 | 0.001750 |
| Is_Storm | 0.098266 | 0.106033 |
From the table of correlation coefficients, there is virtually no correlation between any individual weather feature and the number of vehicle collisions. A classification algorithm will be applied to see whether the collisions can be determined accurately using all of the features combined. The most important features of that algorithm can then be identified to suggest the effect of different weather features on vehicle collisions.
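One standard way to extract such feature importances from a trained classifier (a sketch, not necessarily the method used later in this report) is permutation importance: each feature column is shuffled in turn and the resulting drop in score is recorded. The data below is synthetic, with signal deliberately placed in the first feature only.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only feature 0 carries signal, features 1-2 are noise.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_demo, y_demo)

# Shuffle each column in turn and measure the drop in accuracy;
# informative features cause the largest drop.
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance = {imp:.3f}")
```

Applied to the real pipeline, the same call would rank the weather features by their contribution to predicting the collision label.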
For this classification algorithm, a k-nearest neighbors (kNN) algorithm is used, as the complexity of the feature matrix requires a more flexible algorithm. While the size of the data set favors the computationally simpler decision tree algorithm over kNN, the low correlation found between each feature and the number of collisions suggests that the algorithm will need to combine all features with different weights to determine the label, making kNN the better choice.
As the kNN algorithm uses a distance heuristic, standardization of the data is required to remove any bias toward features with larger values and instead fix each feature to a common scale. This is performed by using a pipeline to take the z-score of all values in the dataframe. The z-score is computed as $ z = (x - \mu)/\sigma $, where $x$ is the data value, $\mu$ is the mean of the feature series containing $x$, and $\sigma$ is the standard deviation of that series.
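This step can be checked directly: `StandardScaler` computes exactly this z-score, using the population standard deviation. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature column (values are arbitrary).
x = np.array([[52.0], [46.0], [64.0], [65.0], [55.0]])

# z = (x - mu) / sigma, computed manually ...
mu, sigma = x.mean(), x.std()   # note: population std (ddof=0), matching sklearn
z_manual = (x - mu) / sigma

# ... and via the scaler used in the pipeline.
z_scaler = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_scaler))  # True
```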
The label series is given as numerical data but must be converted to categorical data for classification. Outliers can be identified using the 1.5 IQR rule; this also suggests good positions at which to split the data into categories.
sns.catplot(data=Xy,
y="Collisions",
kind="box"
);
sns.displot(data=Xy,
x="Collisions"
);
print('Daily Collision Data')
y_desc = pd.Series(Xy["Collisions"].describe())
print(y_desc.head(10))
num_out = 0 # number of outliers
p25 = float(y_desc["25%"]) # 25th percentile
p75 = float(y_desc["75%"]) # 75th percentile
for i in range(len(y)): # loop over each label entry
    val = y.iloc[i][0]
    if val > (p75 + 1.5*(p75-p25)) or val < (p25 - 1.5*(p75-p25)):
        num_out += 1
print("")
print("Number of outliers: " + str(num_out))
Daily Collision Data
count    1176.000000
mean      570.761054
std        95.460790
min       188.000000
25%       510.750000
50%       571.000000
75%       629.250000
max      1161.000000
Name: Collisions, dtype: float64

Number of outliers: 17
As shown in the plots and calculations above, there are a few outliers which will be removed from the data set in order to create results more representative of the population at hand.
IQR = p75 - p25
min_outlier = p25 - 1.5*IQR # compute outlier bounds
max_outlier = p75 + 1.5*IQR
# boolean masks for filtering, True if within the allowable 1.5 IQR range
is_not_outlier_min = (Xy["Collisions"] > min_outlier)
is_not_outlier_max = (Xy["Collisions"] < max_outlier)
# apply filters
Xy = Xy[is_not_outlier_min & is_not_outlier_max]
X = Xy[["AvgTempF", "AvgHumidity", "AvgPressureIn", "AvgWindSpdMPH", "PrecipIn", "CloudCover", "Is_Rain", "Is_Snow", "Is_Fog", "Is_Storm"]]
X.head(5)
| AvgTempF | AvgHumidity | AvgPressureIn | AvgWindSpdMPH | PrecipIn | CloudCover | Is_Rain | Is_Snow | Is_Fog | Is_Storm | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2012-10-12 | 52.0 | 58.0 | 30.20 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-13 | 46.0 | 44.0 | 30.45 | 9.0 | 0.00 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-14 | 64.0 | 69.0 | 30.15 | 17.0 | 0.00 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-15 | 65.0 | 83.0 | 29.75 | 14.0 | 0.18 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-16 | 55.0 | 62.0 | 29.79 | 15.0 | 0.00 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
y = Xy[["Collisions"]]
y.head(5)
| Collisions | |
|---|---|
| Date | |
| 2012-10-12 | 594 |
| 2012-10-13 | 547 |
| 2012-10-14 | 517 |
| 2012-10-15 | 607 |
| 2012-10-16 | 602 |
The data now must be divided into categories for classification. Five categories are used to strike a balance between having few categorical labels for easier computation and still having ample distinction between data points. This is modelled after the common five-point Likert scale used in survey questionnaires [5], for example. Values left of the median (< 50th percentile of total daily collisions) will be referred to as negative collision days, in reference to having a negative z-score for the number of collisions. Values right of the median (> 50th percentile) will be referred to as positive collision days for similar reasoning. Values near the median (between the 40th and 60th percentiles) are referred to as median collision days for their relatively low z-scores. The nomenclature used for classification is shown below.
label_types = pd.DataFrame(data =
{'Symbol' : ['N', 'n', 'M', 'p', 'P'],
"Name" : ['Very negative', 'Moderately negative', 'Median', 'Moderately positive', 'Very positive'],
"Percentile Range" : [' 0 <= p < 20 ', '20 <= p < 40 ', '40 <= p <= 60 ', '60 < p <= 80 ', '80 < p <= 100'] }
)
label_types.head()
| Symbol | Name | Percentile Range | |
|---|---|---|---|
| 0 | N | Very negative | 0 <= p < 20 |
| 1 | n | Moderately negative | 20 <= p < 40 |
| 2 | M | Median | 40 <= p <= 60 |
| 3 | p | Moderately positive | 60 < p <= 80 |
| 4 | P | Very positive | 80 < p <= 100 |
Note that each label has a roughly equal population size by design (20-percentile ranges), giving a baseline classification algorithm (i.e., one that guesses blindly) an equal likelihood of predicting a given day's total number of collisions. Each value of the label series is categorized according to the scheme described above.
labels = pd.Series(y["Collisions"]) # convert to series
# compute percentiles for divisions
p20 = float(y.quantile(q=0.2))
p40 = float(y.quantile(q=0.4))
p60 = float(y.quantile(q=0.6))
p80 = float(y.quantile(q=0.8))
print("Percentiles: " + "20th = " + str(p20) + ", 40th = " + str(p40) + ", 60th = " + str(p60) + ", 80th = " + str(p80))
yc_list = [] # label list
for i in range(len(y)):
    val = y.iloc[i][0]
    if val < p20:
        yc_list.append('N')
    elif val < p40:
        yc_list.append('n')
    elif val <= p60:
        yc_list.append('M')
    elif val <= p80:
        yc_list.append('p')
    else:
        yc_list.append('P')
yc = pd.Series(data=yc_list, name="yc", index=y.index) # convert to series
# display labels to make sure they make sense
yc_frame = pd.DataFrame(data={"Collisions": y["Collisions"], "Label": yc}, index=Xy.index)
yc_frame
Percentiles: 20th = 496.0, 40th = 549.0, 60th = 595.0, 80th = 645.0
| Collisions | Label | |
|---|---|---|
| Date | ||
| 2012-10-12 | 594 | M |
| 2012-10-13 | 547 | n |
| 2012-10-14 | 517 | n |
| 2012-10-15 | 607 | p |
| 2012-10-16 | 602 | p |
| ... | ... | ... |
| 2015-12-27 | 518 | n |
| 2015-12-28 | 490 | N |
| 2015-12-29 | 533 | n |
| 2015-12-30 | 501 | n |
| 2015-12-31 | 527 | n |
1159 rows × 2 columns
For a starting point, an arbitrary $n = 3$ neighbors are used for the kNN algorithm. This value will be optimized in later sections but for now provides a basis for analysis. The data is divided into training and testing sets of data with the testing being 20% of the total dataset. A random shuffle state is also used to evaluate performance later.
X_train, X_test, y_train, y_test = train_test_split(
    X, yc,
    test_size=0.2,
    shuffle=True,
    random_state=43913,
); # split the features and categorical labels into testing and training data
# if kNN
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", p=2); # kNN algorithm, Euclidean distance
pipe = make_pipeline(StandardScaler(), knn); # standardize data
# if decision tree
#tree = DecisionTreeClassifier(max_depth=1)
#pipe = make_pipeline(StandardScaler(), tree); # standardize data
pipe.fit(X_train, y_train); # train pipeline
yhat = pipe.predict(X_test); # predict through the pipeline so the test features are standardized too
pd.Series(yhat, index=y_test.index).head(10) # display predicted data
Date
2013-02-24    P
2014-08-26    M
2015-06-03    M
2013-11-12    P
2014-02-11    n
2015-05-06    M
2014-06-26    M
2014-03-26    P
2014-01-11    P
2014-11-06    M
dtype: object
A confusion matrix is created from the predicted values. A one-versus-all method is used to compute the precision score for each label; these are averaged across labels and reported as the average precision value shown below.
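For illustration, this macro-averaged precision can be reproduced by hand from a confusion matrix. In the sketch below (with made-up labels, not the report's predictions), the one-versus-all precision of each label is its diagonal entry divided by its column sum, and the macro average is the mean of those values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up true/predicted labels over three classes.
y_true = ["N", "M", "P", "M", "N", "P", "M", "N", "P", "M"]
y_pred = ["N", "M", "M", "M", "P", "P", "M", "N", "P", "N"]
labels = ["N", "M", "P"]

C = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted

# One-vs-all precision for label j: true positives C[j, j]
# divided by everything predicted as j (the column sum).
per_label = C.diagonal() / C.sum(axis=0)
macro = per_label.mean()

print(np.isclose(macro, precision_score(y_true, y_pred, average="macro")))  # True
```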
label_symbols = ['N', 'n', 'M', 'p', 'P'] # all possible labels
# compute and display confusion matrix
C = confusion_matrix(y_test, yhat, labels=label_symbols)
ConfusionMatrixDisplay(C, display_labels=label_symbols).plot();
pr_all = precision_score(y_test, yhat, average="macro", zero_division = 0)
print('Average precision (single): ' + str(pr_all) )
Average precision (single): 0.14396147944762983
In the confusion matrix above, it is observed that the algorithm has difficulty predicting collision totals outside of the moderate range. Varying $n$ for the kNN algorithm makes this more apparent. This is likely caused by the large variation in the weather data, which makes classification hard: a median prediction is the best outcome to minimize the error on average, so this label sees increased predictions. A good number of predictions also occurs with the "very negative" label, which has the minimal number of crashes. This could suggest that certain weather data closely align with wintry weather, which has been observed to have far fewer collisions.
An ROC curve can be generated to describe how well the algorithm predicts the data without producing erroneous false positives. The AUC score corresponding to this plot is computed as well.
p_hat = pipe.predict_proba(X_test) # probabilistic interpretation (via the pipeline so the test features are standardized)
# calculate true and false positive rates for a variety of theta values
results = []
for i, label in enumerate(knn.classes_):
    actual = (y_test == label)
    fp, tp, theta = roc_curve(actual, p_hat[:, i])
    results.extend([(label, f, t) for f, t in zip(fp, tp)])
roc = pd.DataFrame(results, columns=["label", "FP rate", "TP rate"])
sns.relplot(data=roc,
    x="FP rate", y="TP rate",
    hue="label", kind="line", estimator=None
);
# compute auc score
auc = roc_auc_score(y_test, p_hat, average='macro', multi_class='ovr')
print(f"AUC score (single classifier) = {auc:.6f}")
AUC score (single classifier) = 0.498915
For the single classifier, the algorithm does a poor job in predicting the correct label on average across all labels.
An ensemble is created to remove the bias of any single classifier on the final precision or AUC score. Each estimator is trained on 80% of the allocated training dataset, with 50 estimators and sampling with replacement. A confusion matrix can be generated as before using the ensemble to predict the classification of each testing data point.
ensemble = BaggingClassifier(
pipe,
max_samples=0.8,
n_estimators=50,
random_state=4912
) # create ensemble
ensemble.fit(X_train, y_train); # train ensemble
yhat_e = ensemble.predict(X_test); # ensemble prediction
C = confusion_matrix(y_test, yhat_e, labels=label_symbols)
ConfusionMatrixDisplay(C, display_labels=label_symbols).plot();
pr_all_e = precision_score(y_test, yhat_e, average="macro", zero_division = 0)
print('Average precision (ensemble): ' + str(pr_all_e) )
Average precision (ensemble): 0.2066064887493459
A slight increase in precision is observed using this ensemble method, but the algorithm as a whole still does a very poor job of predicting the data. The ROC curve and associated AUC score can be generated to compare directly with the results from the single-classifier analysis.
phat_e = ensemble.predict_proba(X_test); # ensemble probabilistic prediction
# calculate true and false positive rates for a variety of theta values
results = []
for i, label in enumerate(ensemble.classes_):
    actual = (y_test == label)
    fp, tp, theta = roc_curve(actual, phat_e[:, i])
    results.extend([(label, f, t) for f, t in zip(fp, tp)])
roc = pd.DataFrame(results, columns=["label", "FP rate", "TP rate"])
sns.relplot(data=roc,
    x="FP rate", y="TP rate",
    hue="label", kind="line", estimator=None
);
auc = roc_auc_score(y_test, phat_e, average = 'macro', multi_class = 'ovr')
print(f"AUC score (ensemble classifier) = {auc:.6f}")
AUC score (ensemble classifier) = 0.511558
Again, the ensemble method improves the results, but only slightly, suggesting that the data is either too complex to analyze or far too noisy to give meaningful predictions. The distribution of predictions in the confusion matrix above suggests the algorithm is close to randomly guessing each label, which produces the poor results observed.
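This "randomly guessing" suspicion can be quantified against an explicit chance-level baseline. The sketch below (on synthetic, signal-free data, not the report's features) uses scikit-learn's `DummyClassifier` to estimate the macro precision of pure guessing over five balanced classes, which any useful classifier must beat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Synthetic stand-in: five balanced classes, features carrying no signal.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.choice(["N", "n", "M", "p", "P"], size=500)

# A stratified dummy classifier guesses labels at random in proportion
# to their training frequency -- its macro precision is the chance floor.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_demo, y_demo)
chance = precision_score(y_demo, baseline.predict(X_demo),
                         average="macro", zero_division=0)
print(f"chance-level macro precision ~ {chance:.2f}")
```

With five roughly equal classes the chance floor sits near 0.2, which is close to the ensemble precision reported above.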
While the kNN algorithm from before used $n=3$, this hyperparameter can be optimized using cross-validation. For a large range of $n$ for the kNN algorithm, the $n$ with the best balanced accuracy score is determined to be the best hyperparameter to use for the classification algorithm. A total of 6 stratified folds are used as shown below.
kf = StratifiedKFold(n_splits=6, shuffle=True, random_state=15321)
scores = cross_validate(
pipe,
X_train, y_train,
cv=kf,
scoring="balanced_accuracy"
)
print("Validation scores:")
print( scores["test_score"] )
print('')
print("Validation score st. dev: " + str(np.std(scores["test_score"]))) # display standard deviation
Validation scores:
[0.23125695 0.22614016 0.21190211 0.19554876 0.24601641 0.22170467]

Validation score st. dev: 0.01571246126302048
There is low variance in the validation scores, suggesting each fold is decently representative of the overall data set. However, the low values suggest the algorithm is largely inaccurate.
With the same folds as before, a new kNN algorithm is trained for a range of $n$ values. Each number of neighbors from 1 to 99 is checked, and the balanced accuracy score is determined via cross-validation.
neighbors = range(1, 100, 1) # check all neighbor counts from n=1 to n=99
results = []
for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n, weights="distance", p=2)
    cv = cross_validate(knn,
        X_train, y_train,
        cv=kf,
        scoring="balanced_accuracy",
        n_jobs=-1
    )
    for err in (1 - cv["test_score"]):
        results.append((n, err))
results = pd.DataFrame(results, columns=["neighbors", "error"])
sns.relplot(data=results,
x="neighbors", y="error",
kind="line"
);
The number of neighbors with the minimum associated error (measured by the balanced accuracy score) is used for the nearest neighbors algorithm. This value can be found by aggregating the data presented above to find the average error based on the number of nearest neighbors.
errors = results.set_index('neighbors') # use neighbors as index
error_agg = errors.groupby("neighbors")["error"].describe()['mean'] # group data by neighbors, find mean of aggregate
print(error_agg.head())
print('')
index_best_n = error_agg.argmin()
print('Optimal n: ' + str(neighbors[index_best_n]) )
print('Associated error: ' + str(error_agg.iloc[index_best_n]) )
neighbors
1    0.783605
2    0.783605
3    0.783928
4    0.781273
5    0.782426
Name: mean, dtype: float64

Optimal n: 65
Associated error: 0.7325304350512916
Therefore, a kNN algorithm with $n=65$ minimizes the classification error. This value of $n$ is now used to maximize the precision and AUC scores. As the ensemble method was superior to the single classifier on both metrics, it is used as well, with the same train-test split as before.
knn = KNeighborsClassifier(n_neighbors=65, weights="distance", p=2)  # kNN with Euclidean distance
pipe = make_pipeline(StandardScaler(), knn)  # standardize data
ensemble = BaggingClassifier(
    pipe,
    max_samples=0.8,
    n_estimators=50,
    random_state=4912
)  # create ensemble
ensemble.fit(X_train, y_train)  # train ensemble
yhat_e = ensemble.predict(X_test)  # ensemble prediction
C = confusion_matrix(y_test, yhat_e, labels=label_symbols)
ConfusionMatrixDisplay(C, display_labels=label_symbols).plot();
pr_all_e = precision_score(y_test, yhat_e, average="macro", zero_division = 0)
print('Average precision (ensemble): ' + str(pr_all_e) )
Average precision (ensemble): 0.2505324756544268
As before, the AUC score can be computed as an average over the performance in classifying each label.
p_hat_e = ensemble.predict_proba(X_test) # probabilistic characterization, ensemble
auc = roc_auc_score(y_test, p_hat_e, average = 'macro', multi_class = 'ovr')
print(f"AUC score (ensemble classifier) = {auc:.6f}")
AUC score (ensemble classifier) = 0.564489
There is only marginal improvement over the previous algorithm. This supports the earlier explanation: the data is likely too noisy, uncorrelated, or sparse to yield accurate classifications.
While the number of neighbors was optimized for algorithm performance, two other hyperparameters can also be varied: the weighting scheme (uniform or distance-based) and the Minkowski distance exponent p used by the distance metric. This creates a three-dimensional search space over n, the weighting scheme, and p, which is searched to maximize the overall balanced accuracy score.
# total grid search space
grid = {"n_neighbors": range(1, 150),
        "weights": ['uniform', 'distance'],
        "p": [1, 2]}
knn = KNeighborsClassifier()  # kNN algorithm tuned via grid search
kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=302)
grid_dt = GridSearchCV(
    knn, grid,
    scoring="balanced_accuracy",
    cv=kf,
    n_jobs=-1
)
grid_dt.fit(X_train, y_train)
print("Best parameters:")
print(grid_dt.best_params_)
print()
print("Best score:")
print(grid_dt.best_score_)
Best parameters:
{'n_neighbors': 111, 'p': 1, 'weights': 'uniform'}
Best score:
0.28887508727530886
These are thus the best-performing hyperparameters found for this data within the searched space. The averaged ensemble precision score and AUC score are computed once again.
# Confusion matrix
knn = KNeighborsClassifier(n_neighbors=111, weights="uniform", p=1)  # kNN with optimal hyperparameters
pipe = make_pipeline(StandardScaler(), knn)  # standardize data
ensemble = BaggingClassifier(
    pipe,
    max_samples=0.8,
    n_estimators=50,
    random_state=4912
)  # create ensemble
ensemble.fit(X_train, y_train)  # train ensemble
yhat_e = ensemble.predict(X_test)  # ensemble prediction
C = confusion_matrix(y_test, yhat_e, labels=label_symbols)
ConfusionMatrixDisplay(C, display_labels=label_symbols).plot();
pr_all_e = precision_score(y_test, yhat_e, average="macro", zero_division = 0)
print('Ideal average precision (ensemble): ' + str(pr_all_e) )
# AUC score
p_hat_e = ensemble.predict_proba(X_test) # probabilistic characterization, ensemble
auc = roc_auc_score(y_test, p_hat_e, average = 'macro', multi_class = 'ovr')
print(f"Ideal AUC score (ensemble) = {auc:.6f}")
Ideal average precision (ensemble): 0.19542002380148277
Ideal AUC score (ensemble) = 0.572560
While the AUC score increased again from before, the average precision of the ensemble decreased. Both values are still exceptionally low and thus the algorithm does a poor job predicting the classification labels.
The results of this study are largely inconclusive. First, correlations between each weather parameter and the total number of vehicle collisions were computed, and it was determined that temperature has the greatest correlation with vehicle collisions, with a Spearman coefficient of 0.239. While this is not a strong correlation, it is consistent with the fact that the number of vehicle collisions was strongly correlated with the time of year, following a cyclical trend in which collisions peaked in June and reached a minimum in January. Since the temperature in this region peaks in July and reaches its minimum in February, the correlation between temperature and vehicle collisions is likely indicative of a third-variable problem, with both being correlated with the time of year.
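The Spearman coefficient cited above can be reproduced on any pair of daily series; a minimal sketch, assuming SciPy is available and using illustrative column names and synthetic values rather than the project's actual data:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical monthly-average data: mean temperature (F) and total
# reported collisions. Column names and values are illustrative only.
df = pd.DataFrame({
    "mean_temp":  [28, 35, 41, 55, 62, 70, 75, 73, 64, 52, 40, 31],
    "collisions": [510, 530, 560, 600, 640, 680, 660, 650, 620, 590, 545, 515],
})

# Spearman rank correlation compares ranks rather than raw values, so it
# tolerates the nonlinear, cyclical trends discussed above.
rho, pval = spearmanr(df["mean_temp"], df["collisions"])
print(f"Spearman rho = {rho:.3f} (p = {pval:.4f})")
```

On the real dataset this computation would be applied to the per-day temperature and collision-count columns.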
The correlation between vehicle collisions and the weather events and features previously classified as "hazardous" conditions is minimal to nonexistent. The likely explanation is the large variability of the data itself: weather is notoriously unpredictable, with large day-to-day fluctuations, so the low correlation is unsurprising given this noise. Combined with the fairly small dataset (around 1,000 elements), the noise posed a notable challenge to producing meaningful correlations. Based on the current dataset, the original hypothesis that poor weather conditions impact vehicle collision totals is rejected.
Next, a supervised machine learning algorithm was developed to determine whether the weather features as a whole could be used to predict the collision count. The number of collisions was discretized for classification into five categories ranging from the lowest 20th percentile of the data to the highest; each category therefore contained roughly 20% of the data, giving the algorithm balanced classes. The correlations between the weather features were examined and, as mentioned before, the complexity and irregularity of these features prevented any from being excluded from analysis. An ensemble of k-nearest neighbor classifiers was created and cross-validated using stratified folding. A grid search was used to find the hyperparameters maximizing the balanced accuracy score of the algorithm; a uniform weighting scheme with n=111 neighbors was found to be ideal. An ensemble of 50 estimators was created using these parameters, and the average precision score across all labels, using a one-versus-rest scheme, was found to be 0.195, whereas the previous, non-optimized ensemble reached an average precision score of 0.251. The AUC score for this optimized ensemble learner was only 0.573.
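The quintile discretization described above could be implemented with pandas' `qcut`, which splits on the 20/40/60/80th percentiles; a minimal sketch on synthetic counts (the label names are illustrative, not the report's actual labels):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily collision totals standing in for the real counts.
counts = pd.Series(rng.integers(300, 900, size=1000), name="collisions")

# qcut bins on percentiles, so each of the five categories receives
# roughly 20% of the data, giving balanced classes as described above.
labels = ["very low", "low", "average", "high", "very high"]
categories = pd.qcut(counts, q=5, labels=labels)

print(categories.value_counts(normalize=True).round(2))
```

Balanced classes make the balanced accuracy score a meaningful optimization target, since no single label dominates.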
The confusion matrix created for this learner suggests that the algorithm is more or less predicting each label at random, occasionally reaching better scoring metrics by chance. This is observed in the algorithm initially predicting only the median classification label, which minimizes average error when guessing blindly, and later producing a confusion matrix of nearly uniform values, suggesting an equal likelihood of selecting any label. As stated before, a much larger dataset is needed to accurately train and evaluate such a complex algorithm. The second part of the hypothesis tested by this report is therefore inconclusive: while the algorithm could not predict the labels sufficiently accurately, the small dataset means more information is required for definitive conclusions.
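The near-random behavior described above can be made concrete with a chance baseline: with five balanced classes, a classifier guessing uniformly at random has an expected macro precision and balanced accuracy of about 0.20, close to the 0.195 observed. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic data (not the project's features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, balanced_accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))     # stand-in weather features
y = rng.integers(0, 5, size=1000)  # five roughly balanced labels

# "uniform" picks each of the five labels with equal probability,
# ignoring the features entirely -- a pure chance baseline.
dummy = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
yhat = dummy.predict(X)

# Both metrics should hover near the 1/5 = 0.20 chance level.
print("precision:", precision_score(y, yhat, average="macro", zero_division=0))
print("balanced accuracy:", balanced_accuracy_score(y, yhat))
```

A trained classifier that cannot clear this baseline is, for practical purposes, not learning from the features.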
For future research, the following improvements are suggested. First, a large correlation between the time of year and the number of collisions was observed. While this may be a true phenomenon, it does not imply that roads are inherently more dangerous during colder weather (barring weather events like snow and ice); rather, fewer drivers may be using the roads during wintry months. The number of collisions should therefore be normalized by the estimated number of drivers on the road each day. For example, using data from New York City tollbooths to estimate the number of cars on the road as a function of daily gross revenue, the number of vehicle collisions per 1,000 drivers could be computed, giving a more accurate measure of road conditions and aligning the data better with the scope of the project. Second, the dataset is limited in size, largely because of the drastic differences in date ranges between the two collected sources; incorporating earlier vehicle collision reports or more recent weather reports would enlarge it. Increasing the number of samples should also reduce the error caused by the inherent noise in the data.
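The proposed normalization is a simple rate computation; a sketch, assuming a hypothetical per-day estimate `est_vehicles` of vehicles on the road (column names and figures are illustrative):

```python
import pandas as pd

# Hypothetical daily figures: raw collision counts and an estimated
# number of vehicles on the road (e.g. inferred from tollbooth revenue).
df = pd.DataFrame({
    "collisions":   [480, 520, 610],
    "est_vehicles": [1_500_000, 1_800_000, 2_400_000],
})

# Collisions per 1,000 vehicles: a rate that controls for traffic volume,
# so busy summer days are not counted as inherently more dangerous.
df["per_1000"] = df["collisions"] / df["est_vehicles"] * 1000
print(df)
```

The resulting rate, rather than the raw count, would then serve as the quantity to correlate with weather features.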
[1] Zoni Nation. (2016). Weather for 24 US Cities. Weather Underground. GitHub. Retrieved April 5, 2023, from https://github.com/zonination/weather-us.
[2] City of New York. (2020, October 11). Motor Vehicle Collisions - Crashes. data.world. Retrieved April 5, 2023, from https://data.world/city-of-ny/h9gi-nx95
[3] Arnfield, A. John (2022, December 5). Köppen climate classification. Encyclopedia Britannica. https://www.britannica.com/science/Koppen-climate-classification
[4] Weatherbase. (n.d.). New York, New York. Weatherbase. Retrieved April 5, 2023, from https://www.weatherbase.com/weather/weather-summary.php3?s=330527&cityname=New%2BYork%2C%2BNew%2BYork%2C%2BUnited%2BStates%2Bof%2BAmerica
[5] Momentive. (2023). Likert scale: Examples and how to use it. SurveyMonkey. Retrieved April 5, 2023, from https://www.surveymonkey.com/mp/likert-scale/