Understanding Traffic Collision Severity’s Contributing Factors: A Mixed Effect Multinomial Logistic Regression and Machine Learning Approaches

A Research Study Submitted to University Canada West in Partial Fulfillment of the Requirements for the Degree of Master’s in Business Administration

By Kawal Walia (2112003)

Abstract

This study aims to understand the influence of various contributing factors on traffic collision severity, with a focus on variables such as pedestrian involvement, cyclist presence, motor vehicle roles, weather conditions, road characteristics, and geographical context, among others. The objective is to shed light on the behavioral dynamics that underlie the severity of accidents. The dataset was retrieved from the Virginia Road department and contains over 500,00 data points with 18 variables. Two statistical models and one machine learning model were used to analyze the relationship between these factors and collision severity outcomes: Multinomial Logistic Regression; Multi-level (Mixed Effect) Multinomial Logistic Regression, which captures group-level heterogeneity; and Random Forest. The results show that the Multi-Level Multinomial Logistic Regression model outperforms the Multinomial Logistic Regression model. Moreover, the presence of vulnerable road users, including pedestrians and cyclists, is likely to increase the odds of fatalities. The odds ratios for fatality and major injury in collisions involving unbelted drivers are higher than 10, revealing a much higher likelihood of severe outcomes compared to belted drivers. Collisions that occur at traffic controls (e.g., signalized intersections) are likely to be more severe than collisions that occur on regular road segments. These results are in line with those revealed by the Random Forest model.
Overall, these findings can help policymakers design strategies that reduce severity outcomes in different regions.

Preface

Parts of this thesis have been submitted for publication. Portions of the introductory text in Chapter 1, of the literature review in Chapter 2, and of the methodology in Chapter 3, together with the statistical results in Chapter 4, are under review in a reputable journal:

Walia, Kawal, & Alsaleh, R. (2023). Exploring the Contributing Factors of Traffic Collision Severity: A Multilevel Multinomial Logistic Regression Model. Submitted.

Contents

Abstract
Preface
1. Introduction
    1.1 Background
    1.2 Problem Statement
    1.3 Objective
    1.4 Significance
2. Literature Review
3. Methodology
    3.1 Introduction
    3.2 Data Description
        3.2.1 Data Processing Steps
    3.2 Variables
    3.3 Multinomial Logistic Regression Model
        3.3.1 Description
        3.3.2 Equation
        3.3.3 Model Development
    3.4 Multi-level Multinomial Logistic Regression
        3.4.1 Description
        3.4.2 Equation
        3.4.3 Model Development
        3.4.3 Model Evaluation
    3.5 Random Forest Model
        3.5.1 Description
        3.5.2 Equation
        3.5.3 Model Development
        3.5.4 Model Evaluation
4. Statistical Modeling Results
    4.1 Multinomial Logistic Regressions
    4.2 Multi-Level Multinomial Logistic Regression
    4.3 Models Comparison
5. Machine Learning Modeling Results
    5.1 Random Forest
        5.1.1 Simulating the RF Model
6. Conclusion
    6.1 Summary of Findings
    6.2 Limitation and Future Work
References

1. Introduction

1.1 Background

Car crashes cause immense human suffering and economic burdens on a global scale. In Canada, traffic collisions are a major contributor to avoidable fatalities, injuries, and about $37 billion in economic losses each year (Transport Canada, 2018). Designing precise plans to reduce the occurrence, severity, and financial impact of these accidents is a top concern for decision makers and stakeholders. A deeper understanding of how human actions contribute to these collisions is pivotal to achieving this objective.

Traffic incidents cause roughly 1.35 million deaths and up to 50 million injuries worldwide each year and disproportionately affect low- and middle-income nations (WHO, 2021), but they are also a notable challenge in Canada. In Canada, over 1,800 people are killed and about 160,000 are injured each year in traffic-related incidents (Transport Canada, 2018). The monetary cost of these accidents, estimated at around $37 billion annually, adds to the human toll. More specifically, Transport Canada's National Collision Database records 156,310 reported vehicle collisions in 2019.
Of these collisions, 1,181 resulted in fatalities, 9,558 in serious injuries, and 137,571 in minor injuries (Transport Canada, 2021).

Numerous studies have identified specific driver behaviors that contribute to the likelihood and severity of collisions. For example, Pang et al. (2019) found that distracted driving, including activities such as using a mobile device while driving, is a substantial predictor of road accidents in Canada, especially among young drivers. Similarly, Shrestha et al. (2017) found that drunk driving is a leading cause of road deaths in Canada, accounting for 34% of such fatalities and an annual cost of $20.62 billion. Additionally, according to the Canadian Institute of Actuaries, the financial cost of motor vehicle collisions amounted to $29 billion in 2018, or 1.5% of Canada's gross domestic product (GDP) (Canadian Institute of Actuaries, 2019). This included a range of costs, from medical expenses and property damage to lost productivity and legal fees.

Driver behavior is only one component within a network of variables that contribute to traffic collisions, which also includes factors such as road design and vehicle characteristics. Hamer et al. (2021) investigated the relationship between Canadian road design and the propensity for traffic accidents, finding that design elements such as lower speed limits and roundabouts help reduce the frequency of vehicle collisions. Similarly, Zhu et al.
(2018) examined how different vehicle attributes affect the severity of traffic accidents, finding that older and smaller vehicles have a higher probability of being involved in severe collisions.

Taken together, these studies highlight the pivotal role of driver behavior in determining the frequency and severity of road collisions, along with the accompanying economic toll, both worldwide and in Canada. With a better understanding of the specific behaviors that increase the likelihood and severity of road accidents, policymakers and stakeholders can design targeted strategies to reduce their incidence and consequences. Such efforts, in turn, can produce a safer and more efficient transportation system and a better standard of living for the people of Canada.

1.2 Problem Statement

In the field of road safety and accident prediction, this research examines the interplay of contributing factors and their impact on collision severity. Through a detailed exploration of the relevant variables, including the involvement of pedestrians, cyclists, and motor vehicles, alongside weather conditions, road attributes, and area characteristics, this study aims to identify the behavioral aspects that contribute to varying levels of accident severity. Central to the research are three models used to analyze and predict collision severity: Multinomial Logistic Regression, Multi-level (Mixed Effect) Multinomial Logistic Regression, and Random Forest. These models are built to uncover the underlying dynamics governing collision outcomes. The strength of each model lies in its ability to consider multiple variables simultaneously and derive comprehensive insights.
The focal point of this study is to assess how effectively these models predict collision severity relative to conventional methodologies. Simulating different variables within the Random Forest model can expose the inadequacies of existing models, suggesting that traditional approaches may not capture the nuances of collision severity with precision. The intended outcome of this research is an evidence-based framework that enables road safety agencies, policymakers, and traffic management authorities to adopt proactive measures aligned with behavioral patterns. In turn, this can lead to targeted interventions that mitigate accident severity, ultimately fostering a safer and more resilient road environment. The findings of this study not only enrich the understanding of collision dynamics but also support advances in road safety strategies.

1.3 Objective

This research study aims to unravel the interplay between human behavior and traffic collision severity. It examines the components of the behavioral dimension of vehicle collisions, including distracted driving, roadway description, alcohol consumption, excessive speed, and other reckless conduct. Furthermore, the research seeks to clarify the financial consequences associated with road accidents, spanning medical expenses, property damage, and lost productivity. The core idea of this study is to explain the influence of behavioral tendencies, in particular distracted driving, driving under the influence, speeding, and related reckless behavior, on the severity of road accidents.
To achieve these aims, this research uses statistical and machine learning models that predict the severity category of road collisions based on behavioral factors. These models assess how well a range of collision-related determinants, including driver behavior, road layout, and vehicle type, predict accident severity. The study relies on trustworthy data sources such as law enforcement reports, medical records, and insurance claims to support the accuracy and reliability of the models.

Previous studies have underscored the importance of driver behavior in road accidents, along with its economic consequences. For instance, Pang et al. (2019) described the pivotal role of distracted driving in predicting traffic collisions in Canada, while Shrestha et al. (2017) reported that 34% of road fatalities in Canada could be attributed to drunk driving. The implications of this study extend to policymakers, stakeholders, and the general public, supporting the formulation of strategies that effectively reduce the frequency, severity, and economic cost of road accidents. By identifying the specific behavioral factors that contribute to road collisions, this study can inform precisely targeted interventions, thereby enhancing road safety and reducing the financial burden of such incidents. Furthermore, the machine learning models developed in this study have the potential to predict the severity of road collisions, improving the efficiency and effectiveness of emergency responses.

1.4 Significance

The role that human behavior plays in determining the severity of traffic collisions is a pivotal area of research. This domain has direct implications for both public safety and the economic welfare of a society.
The present study examines the underlying factors contributing to traffic collisions and provides insights into the factors contributing to their severity. These insights, in turn, hold promise for formulating effective strategies to reduce the frequency, severity, and economic aftermath of such collisions.

The importance of this study lies in its use of machine learning models to predict the severity of traffic accidents based on behavioral dynamics. This approach opens an avenue for emergency services to craft faster and more effective responses to these incidents, which has the potential to reduce the loss of human lives and the economic strain that typically accompanies traffic collisions. Furthermore, the study's reliance on trustworthy data sources (including police reports, hospital records, and insurance claims) helps ensure the accuracy and reliability of the predictive models. The outcomes of this study can thus serve as a foundation for targeted interventions aimed at improving road safety, reducing the economic burden of such incidents, and formulating more effective policies to prevent traffic accidents.

2. Literature Review

This chapter summarizes previous work on identifying the contributing factors of traffic collision severity. Several previous studies investigated the relation between driving behavior on the one hand and traffic collision severity and the consequent economic impact on the other. For example, Gamage et al. (2021) found a significant correlation between driver behavior and the severity of vehicle accidents. The repercussions extend beyond the collisions themselves: the aftermath includes increased numbers of accidents, physical injuries, and fatalities, all culminating in a broader economic impact.
The ramifications of these incidents are two-fold. Direct monetary expenses include medical bills, property repairs, and legal fees, while indirect costs include disruptions to work schedules, reduced quality of life, and higher insurance premiums (Tlaiss & Baaj, 2020).

Robartes and Chen (2017) investigated the factors influencing the severity of injuries sustained by cyclists in automobile-bicycle crashes, using data from Virginia police crash reports from 2010 to 2014. Employing an ordered probit model, the study analyzed various crash characteristics, including those related to bicyclists, automobile drivers, vehicles, environmental conditions, and roadways. The findings reveal significant determinants of injury severity, with intoxicated automobile drivers associated with a six-fold increase in the likelihood of cyclist fatalities and double the risk of severe injuries. Bicyclist intoxication also raises the probability of fatalities by 36.7% and doubles the likelihood of severe injury. Moreover, factors such as vehicle speed, obscured driver vision, specific vehicle types (SUVs, trucks, and vans), roadway grades, and curves were identified as contributors to more severe cyclist injuries. The research underscores the need for measures to combat biking and driving under the influence, emphasizing the importance of analyzing and enhancing existing legislation, education on the perils of drunk driving for cyclists, and promoting the separation of bicycles and vehicles on the road.

Agyemang, Li, and Wu (2019) highlight the critical need to consider behavioral factors when devising strategies to mitigate the socioeconomic toll of road accidents.
Their research emphasizes the importance of implementing effective road safety measures that account for the multifaceted role of human behavior in traffic collisions.

Rahimi et al. (2019) studied the severity of injuries resulting from single-truck crashes in a developing country. The study's primary objective was to identify the factors that increase injury severity in such incidents. Using a binary logistic regression model, the study analyzed crash report data on 1,690 single-truck accidents that occurred in Iran between 2011 and 2016. The results showed significant effects on injury severity of driver age, speed limits, truck configuration, road type, lighting conditions, and weather conditions. The authors recommended that improving road and vehicle conditions, together with more robust driver training, could reduce the severity of injuries in single-truck collisions in developing countries.

Road traffic accidents are a major public health concern worldwide, carrying both human and financial tolls. In response, researchers have investigated the principal risk factors that increase traffic accidents and the resulting injury severity. Rovšek, Batista, and Bogunović (2018) sought to identify the primary risk factors governing the severity of injuries from traffic accidents on Slovenian roads, using a classification tree method that does not rely on parametric assumptions. Data from traffic accident reports for 2013 to 2015 in Slovenia were analyzed.
The classification tree analysis showed that the key factors in the severity of traffic accident injuries included the age of the driver, road type, vehicle category, and type of collision. The findings emphasized the need for strategies that improve road infrastructure, strengthen vehicle safety features, promote careful driving practices, and improve driver training programs as the main avenues for improving road safety.

Lestina et al. (1991) addressed the role of behavior in collision severity, focusing specifically on crashes at freeway entrance and exit ramp interchanges. The research identified and analyzed the most common crash types on heavily traveled urban interstate ramps in Northern Virginia, distinguishing between drivers entering and exiting the freeway. They found that run-off-road, rear-end, and sideswipe/cutoff crashes constituted the majority (95%) of incidents. Notably, run-off-road crashes were most often associated with exiting, while rear-end and sideswipe/cutoff crashes were prevalent among entering drivers. Factors such as speed, weather conditions, and alcohol were identified as significant contributors to crash severity.

In a separate study, Zhang et al. (2020) aimed to identify the factors shaping traffic violations and accident severity in China. To this end, they analyzed data from the official records of 4,045 accidents that occurred in five Chinese cities between 2016 and 2017. The results revealed a set of influential factors associated with traffic violations and the severity of vehicle accidents.
These included the driver's age, gender, education level, driving history, vehicle type, road type, weather conditions, time of occurrence, and driver behavior, ranging from speeding and reckless driving to driving under the influence of alcohol. On this basis, the authors suggested that improved driver training programs, strict enforcement of traffic laws, and better vehicle safety features could together reduce the frequency of traffic violations and the severity of accidents in China. The study offers insights that may prove useful in formulating effective strategies for advancing road safety in China.

In Vietnam, Nguyen et al. (2013) investigated the financial burden imposed by road traffic injuries on a provincial general hospital. The study analyzed hospital records of patients injured in road traffic accidents between 2009 and 2010. It found a sizable financial burden on Vietnam due to road traffic injuries, with costs averaging US$1,001 per patient. The authors concluded that road traffic accidents impose a considerable financial burden on both individuals and society, and that preventive measures, ranging from road safety education and strict enforcement of traffic laws to improved infrastructure standards, should be adopted promptly to reduce this burden.

In another study, Jiang et al. (2020) examined the effects of abnormal weather on fatal road accidents.
The study used data from the Chinese Statistical Yearbook and the China Meteorological Data Sharing Service System for 2006-2015. It found that extreme weather events, such as very high temperatures and heavy rainfall, were strongly associated with an increased likelihood of fatal road accidents in China. The findings underline the potential of adapting to climate change by strengthening road infrastructure and transportation networks to counter the adverse effects of extreme weather on road safety.

From a different perspective, Noland and Quddus (2004) conducted a finely detailed analysis of road casualties across England, aiming to identify the characteristics of areas with elevated casualty rates. They focused on the period between 1999 and 2001, analyzing data at the level of Lower Super Output Areas (LSOAs), which are small geographical units. Using Poisson regression models, the researchers examined the relationship between road casualties and a range of factors, including road characteristics, socioeconomic variables, and vehicle ownership. The study found that urban areas with higher poverty levels and lower car ownership, together with roads with higher speed limits, had the highest casualty rates. The authors recommended directing road safety efforts toward these high-risk areas to reduce road casualties. Overall, the study provides insight into the spatial distribution of road casualties and the contributing factors in England.

Other studies have explored the existing conditions and safety challenges faced by nonmotorized vehicles, as well as the role of heavy vehicle drivers in road safety and multi-vehicle collisions, in the context of Bangladesh.
Ahsan and Sufian (2019) used secondary data sources, including literature reviews and road accident statistics, to uncover safety issues linked to nonmotorized vehicles in Bangladesh. The results highlighted that bicycles, rickshaws, and pedestrians are more prone to accidents due to inadequate road infrastructure, insufficient safety precautions, and a lack of adherence to safe driving practices. Similarly, Anjuman et al. (2021) used a mixed-methods approach to investigate the impact of heavy vehicle drivers on road safety and multi-vehicle accidents in Bangladesh. This involved surveys and discussions with drivers, passengers, and road safety experts. The findings underscored that factors relating to heavy vehicle drivers, such as excessive speed, fatigue, and inadequate training, play a pivotal role in causing road accidents in Bangladesh. The study recommended stringent regulations, effective driver training, and awareness campaigns to improve road safety. Together, these studies highlight the critical safety challenges faced by both nonmotorized vehicle operators and heavy vehicle drivers in Bangladesh, providing insight into the country's road safety landscape.

Bahrololoom, Young, and Logan (2019) used a random parameter model to examine the factors influencing fatalities and severe injuries in bicycle accidents across Victoria, Australia. Drawing on data from the Victorian Traffic Accident System for 2012 to 2016, the study identified the cyclist's age and gender, time of day, vehicle count, geographic area, and vehicle type as significant predictors of accident severity.
The study underscores the potential of improved infrastructure, road design, educational initiatives, and adherence to traffic regulations to mitigate the severity of bicycle accidents in Victoria.

Aziz et al. (2018) used a mixed logit model to identify the factors influencing the severity of pedestrian-vehicle collisions in New York City. The authors drew on data from the Motor Vehicle Collision database maintained by the New York City Police Department for 2010 through 2014. The findings showed that variables such as age, gender, vehicle type, geographical location, and time of day were significant predictors of injury severity in pedestrian accidents. The study contends that changes to road layout policies, stronger law enforcement, and more effective educational initiatives could mitigate the severity of pedestrian-vehicle collisions in New York City.

Similarly, Eluru et al. (2013) developed a mixed generalized ordered response model to understand the injury severity sustained by pedestrians and cyclists in traffic accidents. The authors drew on the South East Queensland Travel Survey for 2003 to 2007, along with the Queensland Road Crash Database. Their analysis revealed the influential role of variables such as age, gender, vehicle type, travel mode, and geographic location in determining the severity of accidents involving pedestrians and cyclists. The study advocates policy changes focused on improving infrastructure, refining road design, and strengthening educational and awareness campaigns.
These measures are proposed to contribute to the reduction of severity in pedestrian and bicycle accidents within Queensland. Ehsani et al. (2015) conducted a quasi-experimental study to examine the impact of Michigan's prohibition on texting while driving on the incidence of vehicular accidents. The research aimed to gauge the efficacy of this regulation in curbing the frequency of accidents and injuries resulting from driver distraction. For this purpose, data from the Michigan State Police Crash Reporting System, covering crash records from January 1, 2002, to December 31, 2010, were used. Using a difference-in-differences analysis, the study contrasted accident occurrences before and after the enforcement of the texting ban. The outcomes showed a marked reduction in accidents and injuries attributed to distracted driving following the enactment of the ban, particularly among younger drivers. This study offers insights into potential legislative modifications that could foster a decline in distracted driving incidents. Alghnam et al. (2018) conducted a case-control investigation in Saudi Arabia to explore the association between cell phone usage while driving and severe injuries resulting from accidents. The study's objective was to identify contributing factors to serious traffic injuries, with a specific focus on the role of cell phone use. Data were drawn from a hospital trauma register, with cases representing individuals with major traffic-related injuries and controls comprising those with minor injuries. Structured interviews were used to gather data, and logistic regression analysis was used to identify risk factors. The findings indicated an association between talking on a cell phone while driving and an elevated likelihood of experiencing severe car accidents.
Rogeberg & Elvik (2016) employed meta-analysis to investigate the impact of cannabis intoxication on car accidents and subsequent adjustments in perception. The study aimed to ascertain the effects of cannabis use on the occurrence of car accidents and to identify variables influencing this relationship. Prior research on the link between marijuana use and vehicular accidents was examined through a meta-analytical approach involving nine studies. The results revealed a heightened probability of car accidents associated with cannabis use, albeit with a smaller effect size than previously believed. The study also determined that the strength of the link between cannabis use and car accidents was influenced by factors such as dosage, frequency, and method of consumption. Paleti et al. (2010) investigated the influence of aggressive driving behavior on the severity of injuries sustained by drivers in car accidents. Using a multinomial logit model, the study analyzed data from the National Automotive Sampling System Crashworthiness Data System (NASS-CDS) spanning 2004 to 2006. Even after controlling for socioeconomic and crash-related variables, the study established that drivers displaying aggressive behavior were more prone to severe injuries in collisions. Wang et al. (2013) aimed to understand the impact of traffic dynamics and road characteristics on road safety and to guide future research directions. Using a systematic approach, the authors reviewed studies published between 1990 and 2010. The study highlighted the substantial influence of factors such as road configuration, speed limits, and traffic volume on road safety. The authors recommended that future investigations center on integrated models that explain the interplay between these factors. Zhang et al.
(2014) examined the determinants of culpability and injury severity in pedestrian-car accidents in China. Drawing from the Shanghai Traffic Police Department's database, the study used a binary logit model for analysis. The study found that the pedestrian's age, gender, and crossing location, alongside the driver's age and vehicle type, were pivotal factors in attributing fault and gauging injury severity in such accidents. Musa et al. (2017) studied the influence of road conditions on the severity of accidents on Malaysia's federal roads. Using a binary logit model, the study examined data from the MIROS database, maintained by the Malaysian Institute of Road Safety Research. The investigation underscored the substantial impact of factors such as road geometry, lighting, and road surface conditions on the severity of accidents occurring on federal roads in Malaysia. Safaei et al. (2021) devised an approach employing fuzzy TOPSIS and AHP to prioritize strategies for mitigating motorcycle-related injuries in Iran. They gathered expert insights through a questionnaire and then analyzed them with the fuzzy TOPSIS and AHP methodologies. The outcome was a ranked list of criteria and strategies, showing their significance in curbing motorcycle-related injuries. Safaei et al. (2020) aimed to assess the causes and hazards behind motorbike accidents in Iran, with a focus on enhancing public safety and well-being. Using the Delphi method and the analytic hierarchy process (AHP), the researchers ranked selected strategies. The findings endorsed enhancing road infrastructure, improving awareness and education, and enforcing traffic regulations as the best measures to reduce motorbike accidents in Iran. Jacobsen (2013) examined the connection between fuel efficiency and safety, showing how it depends on vehicle type and driver behavior.
Analysis of data from the Fatality Analysis Reporting System (FARS) and the National Household Travel Survey (NHTS) allowed the author to estimate the influence of fuel efficiency on accident rates. Contrary to conventional wisdom, improved fuel economy did not uniformly heighten accident probability, with the relationship between mileage and safety contingent upon vehicle type and driver conduct. Zhou et al. (2020) compared the severity of public bus accidents arising from collisions versus non-collision incidents. Using data on bus crashes and related events in Hong Kong spanning 2013 to 2018, the authors conducted analyses with descriptive statistics and logistic regression. Outcomes indicated that collision-based accidents had graver consequences than their non-collision counterparts. Irrespective of accident type, head and neck injuries prevailed, which supports the adoption of measures such as seat belts and safety glasses to mitigate the severity of public bus accidents. Zahabi et al. (2010) sought to gauge the influence of speed limits, built environment, and other factors on the severity of injuries sustained by pedestrians and cyclists in crashes. Drawing from crash data in Montreal, Canada, reported to law enforcement from 2003 to 2013, the researchers used logistic regression models to explore determinants of injury severity. Results underscored the role of streetlights and pavements in mitigating injuries, while higher vehicular speeds and larger vehicles correlated with more severe outcomes. The study offers insights into enhancing street safety and reducing vehicle speeds to lessen injury severity among pedestrians and cyclists in accidents. Wali et al. (2020) scrutinized the relation between fault attribution and injury severity in head-on collisions.
Using data from the National Automotive Sampling System Crashworthiness Data System (NASS-CDS) on head-on collisions within the United States between 1994 and 2014, the authors employed bivariate ordinal models with copulas to explore the relationship between fault allocation and injury extent. Notably, driving behaviors, including excessive speed and failure to use seat belts, correlated with heightened injury severity. Consequently, curtailing fault allocation emerges as a viable avenue for reducing injury severity in head-on collisions. Quddus et al. (2015) investigated the impact of drivers' geodemographic characteristics on their vulnerability to injury in car accidents. The research used data from the STATS19 database, covering crashes in Great Britain from 2005 to 2009. The authors employed multilevel mixed effects ordered logit models to explore the influence of various factors on injury severity. The findings demonstrated that older drivers and females were less likely to sustain severe injuries. Conversely, drivers in urban areas and those traveling at higher speeds displayed an increased probability of serious harm. The researchers advocate strategies such as enhancing road safety measures and reducing vehicle speeds to mitigate injury severity in car accidents. In a separate study, Osman et al. (2019) examined the extent of injuries sustained by individuals involved in accidents with large trucks within construction zones. The investigation used data from the Fatality Analysis Reporting System (FARS) provided by the National Highway Traffic Safety Administration, covering crashes in the United States from 2007 to 2016. Using descriptive statistics and ordered logistic regression models, the authors scrutinized factors influencing injury severity.
The study found that speed, driver distractions, and the type of truck had substantial influence on the seriousness of injuries sustained. The researchers propose interventions such as reducing speed limits, strengthening safety measures in construction zones, and enhancing driver awareness to mitigate injury severity in accidents involving large trucks. Pour et al. (2017) examined the association between neighborhood attributes and the severity of car-pedestrian accidents. The investigation drew upon data from Melbourne, Australia, and employed logistic regression models to analyze how land use, socioeconomic status, and road characteristics affected crash severity. The outcomes indicated a heightened likelihood of severe car-pedestrian accidents in areas with increased commercial land use and diminished residential land use. This study underscores the significance of considering the distinctive features of neighborhoods when devising road safety strategies. Hammad et al. (2019) studied the impact of environmental factors on car crash occurrences within a suburban region of Pakistan. Drawing upon police records and statistical analysis, the researchers explored how variables such as road condition, traffic volume, and weather conditions influence accident rates. The findings revealed a higher frequency of accidents during rainy conditions, on narrow roads, and during periods of heavy traffic. The study underscores the importance of integrating natural elements into road safety planning (Hammad et al., 2019). Azimian et al. (2020) investigated the correlation between area-level attributes and the frequency and severity of automobile accidents.
Using crash data from Houston, Texas, the researchers employed a multivariate space-time model to ascertain the impact of factors like land use, socioeconomic status, and road attributes on crash incidents. The study demonstrated a heightened likelihood of severe crashes in regions dominated by business land use and reduced residential land use. The implications suggest that altering land usage and refining transportation planning could contribute to a reduction in both the number and severity of traffic accidents. Pasha et al. (2016) examined the interplay between street layout, traffic flow, road infrastructure, socioeconomic elements, and demographic factors influencing public transportation utilization. Based on data from Dhaka, Bangladesh, the researchers used regression models to separate the effects of these variables on traffic patterns. The outcomes highlight the substantial influence of factors such as population density, road width, and traffic volume on public transport utilization. The study underscores the need for comprehensive considerations in public transport planning, offering valuable insights to policymakers in developing nations. Sapkota, Bista, and Adhikari (2021) aimed to quantify the economic repercussions of motorbike accidents in Kathmandu, Nepal. Using data from the Government of Nepal's Metropolitan Traffic Police Division spanning 2013 to 2017, the authors assessed the financial impact of motorbike crashes, including direct costs like medical expenses and property damage, as well as indirect costs like lost productivity. The study estimated a total cost of NPR 8.25 billion (equivalent to USD 72.16 million) to the economy during the study period. The findings underline the urgency of policy adjustments to enhance road safety and mitigate the economic burden of motorbike accidents in Nepal.

3. Methodology

3.1 Introduction
As part of our study, statistical and machine-learning models are developed to model collision severity and determine its contributing factors. This section covers the statistical models: the multinomial logistic regression model and the multi-level multinomial logistic regression model. The multinomial logistic regression model helps us examine crash severities by considering various behavioral factors and estimating the probabilities of the different severity groups. The multi-level multinomial logistic regression model extends our analysis by considering the data's hierarchical structure and incorporating predictors at both individual and group levels. With these approaches, our objective is to uncover the key elements influencing crash severity and their ramifications.

3.2 Data Description
This section describes the data used in this research study on collision severity. The data was obtained from the Virginia Road Department, where it is collected based on reported collisions. The data is accessible to the public, which promotes transparency. The dataset is a comprehensive compilation of collisions that occurred between 2019 and 2023 and were reported to the Virginia Road Department. It consists of approximately 500,000 collision incidents and contains both categorical and numerical attributes. These attributes provide the diverse information needed for our prediction model.

3.2.1 Data Processing Steps
To ensure the reliability and accuracy of our model, the study undertook a series of data processing steps: cleaning, transformation, and feature extraction. Data cleaning came first. This step improved the quality of our data, ensuring that our model's foundation is sturdy and dependable.
Given the mix of categorical and numerical attributes, encoding techniques were used to transform categorical variables into numerical representations. This step enabled our model to learn from the data more effectively and laid the groundwork for comprehensive analysis. Feature extraction followed. This process strengthened our model's potential for generalization, noise reduction, and overall stability. It enabled our model to focus on the most impactful attributes while diminishing the influence of extraneous factors, which in turn improved the precision of our predictions. As the aim of this research is predicting collision severity, each of these data-related steps was pivotal. By restricting the dataset to recent years, it was ensured that the model remains attuned to contemporary patterns. The transformation of categorical attributes into numerical forms and the feature extraction together improved the model's predictive capabilities.

3.2 Variables
To forecast collision severity accurately, the selection of predictor variables plays a pivotal role. This section describes how the variables for our prediction model were chosen. When this research project was initiated, a set of potential predictor variables was assembled by drawing on both domain knowledge and existing research. These variables, it was surmised, might capture the complex set of factors that contribute to collision severity. This phase produced a promising array of attributes ready to be examined for their predictive power.
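The encoding of categorical attributes described in the data processing steps can be sketched as follows. The study does not state which encoding was used, so this example assumes dummy (one-hot) coding with one level dropped as the reference. The function name `one_hot_encode`, the column names, and the records are hypothetical and serve only as an illustration.

```python
def one_hot_encode(rows, column, reference):
    """Replace a categorical column with 0/1 indicator columns.

    The reference level gets no indicator column, so a row whose
    indicators are all 0 belongs to the reference level.
    """
    levels = sorted({row[column] for row in rows} - {reference})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}={level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records; names and values are illustrative assumptions.
records = [
    {"COLLISION_TYPE": "Angle", "UNBELTED": 0},
    {"COLLISION_TYPE": "Head On", "UNBELTED": 1},
    {"COLLISION_TYPE": "Rear End", "UNBELTED": 0},
]
encoded = one_hot_encode(records, "COLLISION_TYPE", reference="Angle")
```

Dropping the reference level keeps the indicator columns from being perfectly collinear with the intercept, which matters for regression models that report effects relative to a reference category.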
A vital next step was an exploratory data analysis (EDA). This analysis provided insights into the relationships between variables and their role in shaping collision severity. Our focus was on understanding how each variable interacts with the target variable, collision severity. Visualizing the correlation matrix was an invaluable tool during this phase. The matrix showed which variables moved together and which stood apart. These visualizations provided a tangible view of the relationships that underlie collision severity, and the most influential factors are listed in Table 3.1.

Table 3.1 Variables in our dataset to do the modeling.

| Variables | Description | Percentage |
|---|---|---|
| VDOT_DISTRICT | Accidents for districts in Virginia | |
| WEATHER CONDITION | Collision in rainy condition | 16% (Adverse) |
| ROADWAY ALIGNMENT | Collision at different alignment like straight or intersection | 14.71% (Curve); 85% (Straight) |
| ROADWAY DESCRIPTION | Collision based on road type | 3.19% (One-Way); 57% (2-way divided); 39.73% (2-way undivided) |
| COLLISION TYPE | In what way the collision happened | 26.22% (Angle); 21.7% (Fixed); 2.3% (Head On); 1.6% (No Collision); 27.51% (Rear-End); 10.04% (Sideswipe) |
| ALCOHOL | Driver under influence of alcohol at the time of collision | 5.8% |
| UNBELTED | Whether seat belts were on at the time of collision | 4.6% |
| BIKE | Collision with bike or not | 0.48% |
| TRAFFIC SIGNAL | Whether a traffic control signal was present in the area | 80.4% |
| AREA TYPE (Urban) | Whether the collision happened in an urban or rural area | 74.27% (Urban); 25.77% (Rural) |
| DISTRACTED | Any distractions in the external environment for the driver | 17.43% |
| DROWSY | Sleepy | 2.7% |
| CRASH DATE | Collision happened on weekday or weekend | 73.61% (Weekday); 26.38% (Weekend) |
| DRUG | Driver under influence of drugs/medicine at the time of collision | 1.01% |
| MOTOR | Whether any car or motor vehicle was involved in the collision | 1.71% |
| PED | Any pedestrian involved | 1.18% |
| SPEED | Driver speeding or not at the time of collision | 20.77% |
| ANIMAL | Collision involving animals | 6.4% |

3.3 Multinomial Logistic Regression Model

3.3.1 Description
The multinomial logistic regression model is a method used to analyze and predict outcomes with more than two categories. For traffic accidents, this model lets us understand the human behavioral, environmental, and other elements that influence crash severity and their economic repercussions. This section lays out the steps to apply the multinomial logistic regression model in our research, which focuses on the impact of behavioral factors on traffic collision severity and its economic implications.

When this research project was initiated, a set of potential predictor variables was assembled by drawing on both domain knowledge and existing research. These variables were expected to capture the complex set of factors contributing to collision severity, and this phase produced a promising array of attributes to examine. An exploratory data analysis (EDA) was then conducted, which provided insights into the relationships between variables and their role in shaping collision severity. Emphasis was placed on understanding how each variable interacts with the target variable, collision severity. This analysis will provide insights into the key behavioral aspects that need to be addressed to mitigate collision severity.

3.3.2 Equation
The multinomial logistic regression model estimates the log odds of each crash severity category relative to a reference category.
The general equation for the multinomial logistic regression model can be expressed as follows:

$$\log\left(\frac{P(Y = j \mid X)}{P(Y = \text{ref} \mid X)}\right) = \beta_{j0} + \beta_{j1}X_1 + \beta_{j2}X_2 + \dots + \beta_{jk}X_k,$$

where:
• $P(Y = j \mid X)$ represents the probability of crash severity category $j$ given predictor variables $X$, and $P(Y = \text{ref} \mid X)$ is the probability of the reference category.
• $\beta_{j0}, \beta_{j1}, \beta_{j2}, \dots, \beta_{jk}$ represent the estimated coefficients for each predictor variable in the model.
• $X_1, X_2, \dots, X_k$ represent the predictor variables (behavioral aspects, road conditions, etc.) influencing crash severity.

3.3.3 Model Development
Our aim leads us to Multinomial Logistic Regression as one of the relevant models. This model, grounded in statistical principles, is designed to model the degrees of crash severity and is suitable when the target variable has multiple categories. At the core of model development, the analysis was organized as a set of binary logistic regression sub-models, each comparing one severity category against the reference category (e.g., minor severity vs. reference, severe vs. reference). This step ensures that our model captures the differences that define each level of severity. Each sub-model clarifies the relationships between predictor variables and crash severity. Breaking the analysis into these sub-models makes it possible to identify the factors that play a pivotal role in determining the outcome of a collision. This model development phase is the cornerstone of our research study's empirical foundation. By choosing Multinomial Logistic Regression, the methodology is aligned with the complexities of real-world collision scenarios.
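To make the equation in Section 3.3.2 concrete, the following minimal sketch turns category-specific linear predictors into severity probabilities, with the reference category's predictor fixed at zero. The function name `mnl_probabilities` and all coefficient values are hypothetical and are not estimates from this study.

```python
import math


def mnl_probabilities(x, coefficients, reference="PDO"):
    """Return category probabilities for one observation.

    coefficients maps each non-reference category to a tuple
    (intercept, [beta_1, ..., beta_k]); x holds [X_1, ..., X_k].
    """
    # The reference category has a linear predictor of 0 by construction.
    scores = {reference: 0.0}
    for category, (intercept, betas) in coefficients.items():
        scores[category] = intercept + sum(b * xk for b, xk in zip(betas, x))
    denominator = sum(math.exp(score) for score in scores.values())
    return {cat: math.exp(score) / denominator for cat, score in scores.items()}


# Hypothetical coefficients for two predictors (illustrative assumptions).
coefs = {
    "Minor Injury": (-1.0, [0.5, 0.2]),
    "Major Injury": (-2.0, [1.0, 0.4]),
    "Fatality": (-4.0, [1.5, 0.8]),
}
probs = mnl_probabilities([1, 0], coefs)
```

With these inputs, the log of `probs["Fatality"] / probs["PDO"]` equals the Fatality linear predictor, -4.0 + 1.5 = -2.5, which is exactly the left-hand side of the model equation.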
The categorization and comparison offer a lens to perceive the spectrum of severity levels. To conclude, the multinomial logistic regression model is a valuable tool for analyzing crash severity in traffic collision studies. By employing this methodology, the behavioral aspects that significantly influence crash severity and their impact on the economy can be identified. The model's estimated coefficients and odds ratios allow for a quantitative understanding of the relationship between predictor variables and crash severity categories. Implementing this methodology will contribute to our research study's objective of understanding the influence of behavioral aspects on traffic collision severity and its broader economic implications.

3.4 Multi-level Multinomial Logistic Regression

3.4.1 Description
The multi-level multinomial logistic regression model is an extension of the multinomial logistic regression model that allows for hierarchical or nested data structures. This section outlines the steps involved in applying the multi-level multinomial logistic regression model to analyze crash severity while accounting for such nested structures. Crash severity varies less within specific groups than between them (Tang et al., 2020). This becomes evident when considering factors like geographic divisions, differences between rural and urban areas, land use patterns, climate zones, and functional areas (Peng et al., 2019). In essence, spatial differences play a role in shaping the severity of injuries resulting from crashes.
It is of utmost importance to take all levels of clustering into account when analyzing crash data that involves multiple levels. Failing to consider the effects specific to each cluster could introduce statistical errors, which might manifest as biased parameter estimates, underestimated standard errors, and an exaggerated sense of statistical significance.

3.4.2 Equation
The general equation for the multi-level multinomial logistic regression model can be expressed as follows:

$$\log\left(\frac{P(Y_{ig} = j \mid X_{ig})}{P(Y_{ig} = \text{ref} \mid X_{ig})}\right) = \beta_{0j} + \beta_{1j}X_{1ig} + \beta_{2j}X_{2ig} + \dots + \beta_{kj}X_{kig} + u_{0g} + u_{1g}W_{1g} + u_{2g}W_{2g} + \dots + u_{kg}W_{kg},$$

where:
• $P(Y_{ig} = j \mid X_{ig})$ represents the probability of crash severity category $j$ for individual $i$ within group $g$ given predictor variables $X_{ig}$, and $P(Y_{ig} = \text{ref} \mid X_{ig})$ is the probability of the reference category.
• $\beta_{0j}, \beta_{1j}, \beta_{2j}, \dots, \beta_{kj}$ represent the fixed effects coefficients for individual-level predictors.
• $X_{1ig}, X_{2ig}, \dots, X_{kig}$ represent the individual-level predictor variables influencing crash severity.
• $u_{0g}, u_{1g}, u_{2g}, \dots, u_{kg}$ represent the random effects coefficients for group-level predictors.
• $W_{1g}, W_{2g}, \dots, W_{kg}$ represent the group-level predictor variables influencing crash severity.

3.4.3 Model Development
This model considers the structure of our data, accommodating the fact that crashes vary in severity. In this process, factors that pertain to both individual cases and broader groups are considered. The dataset is segmented into different categories of crash severity, and each category is compared against a reference category, which helps the model identify what distinguishes, for instance, a minor collision from a more severe one. This breakdown adds depth to our analysis, as it helps us capture the nuances in the various levels of crash outcomes. Furthermore, our model is not focused only on individual cases; it also recognizes that different groups might exhibit distinct patterns. This is where the concept of random effects comes into play.
These effects are incorporated to capture the variations that arise between different groups, which might experience crashes differently due to unique circumstances. In the model, VDOT_DISTRICT is kept as the group level. The estimation of random effects at the group level captures the variations across diverse groups, enabling us to analyze the contextual effects on crash severity.

3.4.4 Model Evaluation
After running the model and defining all variables and conditions, the model can be evaluated by assessing the significance and contribution of each predictor at both individual and group levels using statistical tests, such as p-values or Bayesian model comparison. Model fit can also be compared using the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), where lower values indicate a better trade-off between fit and complexity. As these models were built as explanatory (not predictive) models, the main aim is to identify the contributing factors in collision severity rather than to predict it. In our research study, the multi-level multinomial logistic regression model allows us to account for the hierarchical structure of the data and examine the influence of individual-level and group-level factors on crash severity. Including individual-level behavioral aspects and group-level factors such as VDOT_DISTRICT in our model gives a comprehensive understanding of the factors contributing to crash severity. This analysis will provide valuable insights for policymakers, transportation planners, and stakeholders to develop targeted interventions and strategies at the individual and group levels.
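As an illustration of the AIC and BIC comparison mentioned above, the sketch below computes both criteria from a model's log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen for the example and are not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits (illustrative assumptions only).
n_obs = 5000
single_level = {"log_likelihood": -1200.0, "n_params": 27}
multi_level = {"log_likelihood": -1150.0, "n_params": 30}

aic_single = aic(single_level["log_likelihood"], single_level["n_params"])
aic_multi = aic(multi_level["log_likelihood"], multi_level["n_params"])
bic_single = bic(single_level["log_likelihood"], single_level["n_params"], n_obs)
bic_multi = bic(multi_level["log_likelihood"], multi_level["n_params"], n_obs)

# The model with the lower criterion value is preferred.
preferred_by_aic = "multi-level" if aic_multi < aic_single else "single-level"
```

In this made-up case the multi-level model gains 50 log-likelihood units for three extra parameters, so both criteria favor it. BIC penalizes the extra parameters more heavily than AIC whenever ln(n) exceeds 2.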
3.5 Random Forest Model

3.5.1 Description
To understand and predict collision severity, this part of the methodology outlines the step-by-step process followed to build a Random Forest model that classifies collision severity using data collected from the North Virginia Federal Department website. The approach covers data collection, data preprocessing, feature engineering, model building, and evaluation.

3.5.2 Equation
Random Forest consists of a group of classification trees, where each classification tree is trained on a random sample from the input dataset (Breiman, 1999). The Random Forest used in this study considers a random subset of features at each split to grow the trees. Each decision tree votes towards the classification decision, and the quality of each split is measured using the Gini Index. For a given training set T, selecting one case at random and assigning it to collision severity category $C_i$, the Gini Index can be given by:

$$\sum\sum_{j \neq i} \left(\frac{f(C_i, T)}{|T|}\right)\left(\frac{f(C_j, T)}{|T|}\right)$$

where $f(C_i, T)/|T|$ is the probability that the selected case belongs to class $C_i$. The Random Forest grows N trees, and the predicted collision severity is the category with the maximum number of votes.

3.5.3 Model Development
The main part of our methodology lies in the construction of the Random Forest model. A Random Forest is an ensemble learning technique that combines multiple decision trees to make more accurate predictions. The model was trained using the training data, with 100 decision trees in the ensemble. These trees worked together to predict collision severity.

3.5.4 Model Evaluation
The model's value lies in its ability to make accurate predictions.
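The Gini Index and the voting rule from Section 3.5.2 can be illustrated with a small numerical sketch. The function names `gini_index` and `majority_vote`, the class counts, and the tree votes are hypothetical; this is not the study's implementation, only a direct evaluation of the double sum over pairs of distinct classes.

```python
from collections import Counter


def gini_index(class_counts):
    """Sum over i and j != i of (f(C_i,T)/|T|) * (f(C_j,T)/|T|)."""
    total = sum(class_counts.values())
    probs = [count / total for count in class_counts.values()]
    return sum(
        p_i * p_j
        for i, p_i in enumerate(probs)
        for j, p_j in enumerate(probs)
        if i != j
    )


def majority_vote(tree_votes):
    """Return the class that receives the most votes across trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical node with 10 cases (illustrative assumption).
node_counts = {"PDO": 6, "Minor Injury": 3, "Major Injury": 1}
node_gini = gini_index(node_counts)

# Hypothetical votes from five trees for one collision record.
forest_prediction = majority_vote(
    ["PDO", "Minor Injury", "PDO", "Major Injury", "PDO"]
)
```

A node that contains a single class has a Gini Index of 0, and the value grows as the classes become more evenly mixed, which is why lower values indicate purer splits.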
To evaluate its performance, the testing dataset that had been kept aside was used. Accuracy, a widely used classification metric, measured how well the model's predictions aligned with the actual collision severity labels. This metric provided a tangible understanding of the model's performance across the different severity levels.

4. Statistical Modeling Results
In this chapter, the results of the developed models are discussed. Three models are developed in this thesis: (1) the multinomial logistic regression model, (2) the multi-level multinomial logistic regression model, and (3) the Random Forest model. For all three models, the target variable is Collision Severity, with four categories: Minor Injury, Major Injury, PDO, and Fatality. The categories follow the KABCO scale, which classifies collisions by whether they lead to Fatality, Major Injury, Minor Injury, or just property damage. In our modeling, PDO (property damage only) was kept as the baseline.

4.1 Multinomial Logistic Regressions
The parameter estimates for the multinomial logistic regression model are presented in Table 4.1.

Table 4.1 Output from MNL Model including Coefficients, p-value and Odds Ratio

| Variables | Fatality Coef. | Fatality P-Value | Fatality Odds Ratio (exp(coef)) | Major Injury Coef. | Major Injury P-Value | Major Injury Odds Ratio (exp(coef)) | Minor Injury Coef. | Minor Injury P-Value | Minor Injury Odds Ratio (exp(coef)) |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -14.9 | 0 | | -2.4 | <0.01 | | -1 | <0.01 | |
| Collision Type (Head-On), ref. Angle* | -0.02 | .07 | 0.98 | - | - | - | - | - | - |
| Collision Type (Rear End), ref. Angle* | -1.4 | 0.07 | 0.19 | -0.75 | <0.01 | 0.50 | - | - | - |
| Traffic Control (Yes), ref. (No)* | 10.07 | .04 | 4.6 | -0.25 | <0.01 | 0.83 | -0.08 | <0.01 | 1.15 |
| Traffic Control (Other), ref. (No)* | 9.1 | <0.01 | 1.5 | -0.58 | <0.01 | 0.59 | -0.3 | <0.01 | 0.90 |
| Belted (No), ref. (Yes)* | 3.2 | <0.01 | 29.25 | 2.2 | <0.02 | 10.03 | - | - | - |
| Bike (Yes), ref. (No)* | 5.06 | <0.01 | 269.07 | 4.3 | <0.01 | 104 | 3.6 | <0.01 | 51 |
| Animal (Yes), ref. (No)* | -10.6 | <0.01 | 0.00 | -0.1 | <0.01 | 0.53 | -0.1 | <0.01 | 0.8 |
| Pedestrians (Yes), ref. (No)* | 1.05 | <0.01 | 2841 | 1.05 | <0.01 | 4952 | 1.03 | <0.01 | 2248 |

* Reference category.
AIC 4275949.80422
BIC 428464.30900

The equations for the multinomial logistic regression model developed for the crash severity are:

Fatality:
$$\log\left(\frac{P(Y = \text{Fatality})}{P(Y = \text{PDO})}\right) = -14.9 - 0.02X_1 - 1.14X_2 + 10.07X_3 + 9.1X_4 + 3.2X_5 + 5.06X_6 - 10.06X_7 + 1.5X_8$$

Major Injury:
$$\log\left(\frac{P(Y = \text{Major Injury})}{P(Y = \text{PDO})}\right) = -2.4 - 0.02X_1 - 0.75X_2 - 0.25X_3 - 0.58X_4 + 2.2X_5 + 4.3X_6 - 0.1X_7 + 1.05X_8$$

Minor Injury:
$$\log\left(\frac{P(Y = \text{Minor Injury})}{P(Y = \text{PDO})}\right) = -1 - 0.3X_1 - 0.2X_2 - 0.08X_3 - 0.3X_4 + 1.3X_5 + 3.6X_6 - 0.1X_7 + 1.03X_8$$

where:
• $X_1$: Collision Type (Head-On)
• $X_2$: Collision Type (Rear End)
• $X_3$: Traffic Control (Yes)
• $X_4$: Traffic Control (Other)
• $X_5$: Belted (No)
• $X_6$: Bike (Yes)
• $X_7$: Animal (Yes)
• $X_8$: Pedestrians (Yes)

After the Multinomial Logistic Regression model was run on the dataset, output including coefficients, standard errors, and p-values of the coefficients was obtained. In this part of the chapter, the significant variables are discussed, and the effect of each variable on collision severity is assessed. The outcome will be analyzed using the Odds Ratio.
The odds ratio compares the odds of a given severity level, relative to the baseline, between a category and its reference category. For our modeling, PDO (property damage only) was kept as the baseline. The odds ratios of the different variables for each collision severity level are examined below.

Collision Type:

Different collision types, such as Head-on, Rear end, Sideswipe, and Fixed, were analyzed with Angle as the reference category. For 'Head On' collisions, the odds ratio is 4.9 for fatality, 3.4 for major injuries, and 1.6 for minor injuries, indicating a higher likelihood at every injury level than in the reference category. In contrast, the odds ratio for 'Rear End' collisions is 0.19 for fatality, a notably lower likelihood than in the reference category, 0.50 for major injuries, and approximately 1.03 for minor injuries. Head-on collisions are therefore more severe than the other collision types recorded (see Table 4.1).

Traffic Control (Yes):

The presence or absence of traffic control is an important factor in collision outcomes. Compared with collisions at locations without traffic control, those at locations with traffic control show an odds ratio of 4.6 for fatality, a notably elevated likelihood, and 1.15 for minor injuries, a slight increase. For major injuries, the odds ratio of 0.83 indicates a reduced likelihood (see Table 4.1).

Traffic Control (Other):

Relative to the baseline 'No' category, collisions in the 'Traffic Control (Other)' category have odds ratios of 1.5 for fatality, 0.59 for major injury, and 0.9 for minor injury. Fatality is therefore somewhat more likely in this category, while major and minor injuries are less likely.

Belted:

Seatbelt use has a significant influence. Relative to the baseline 'Yes' category, collisions in which individuals are not belted show odds ratios of 29.25 for fatality, 10.03 for major injury, and 3.6 for minor injury. The absence of a seatbelt dramatically raises the likelihood of severe outcomes, especially fatality and major injury, which underscores the role seatbelts play in reducing the risk of grave consequences. The odds ratio for minor injuries is also elevated, but to a smaller degree. These findings support the universal adoption of seatbelt use (see Table 4.1).

Bike:

Collisions involving bikes have odds ratios of 269 for fatality, 104 for major injury, and 51 for minor injury, relative to the baseline. This highlights the vulnerability of cyclists: the odds of fatality and major injury are very high, and the odds ratio for minor injuries, while smaller, is still substantial.

Animal:

Table 4.1 also shows the influence of animal involvement on severity. Relative to the baseline 'No' category, collisions involving animals display odds ratios of 0 for fatality, 0.53 for major injury, and 0.8 for minor injury.
The fatality odds ratio of 0 suggests a potential protective factor, although a cautious interpretation is warranted given the rarity of animal-related fatalities. The odds ratios for major and minor injuries are also below 1, indicating a lower likelihood of injury than in collisions that do not involve animals.

Pedestrians:

Pedestrian involvement is the last significant variable of the model. With odds ratios of 28441 for fatality, 4952 for major injury, and 2248 for minor injury compared to the baseline 'No' category, pedestrian-involved collisions pose a dramatically heightened risk across all injury levels. These odds ratios call for increased attention to pedestrian safety measures.

4.2 Multi-Level Multinomial Logistic Regression

The parameter estimates for the multi-level multinomial logistic regression model are presented in Table 4.2.

Table 4.2 Output from Multi-level MNL Model including Coefficients, P-values and Odds Ratios (* denotes the reference category)

| Variables bike-vehicle | Fatality Coef. | Fatality P-Value | Fatality Odds Ratio (exp(coef)) | Major Injury Coef. | Major Injury P-Value | Major Injury Odds Ratio (exp(coef)) | Minor Injury Coef. | Minor Injury P-Value | Minor Injury Odds Ratio (exp(coef)) |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -10.4 | <0.01 | | -7.01 | <0.01 | | -0.7 | <0.01 | |
| Collision Type (Head-On) (Angle*) | 1.85 | 0.06 | 1.9 | 1.4 | <0.01 | 3.6 | 0.62 | <0.5 | 1.2 |
| Collision Type (Rear End) (Angle*) | -0.34 | <0.01 | 0.75 | -0.69 | 0.04 | 0.65 | - | - | |
| Collision Type (Sideswipe) (Angle*) | -0.38 | <0.01 | 0.68 | -1.1 | 0.02 | 0.38 | - | - | - |
| Traffic Control (Yes) (No*) | 1.6 | <0.01 | 46 | 4.1 | <0.01 | 431 | -0.64 | <0.01 | 1 |
| Traffic Control (Other) (No*) | 0.25 | <0.01 | 18 | 4.4 | <0.01 | 345 | -0.41 | <0.01 | 0.84 |
| Belted (No) (Yes*) | 3.8 | <0.01 | 44.70 | 2.3 | <0.01 | 10 | 1.2 | <0.01 | 4 |
| Bike (Yes) (No*) | 4.5 | <0.01 | 13 | 3.6 | <0.01 | 67 | 2.9 | <0.01 | 41 |
| Animal (Yes) (No*) | -0.76 | <0.01 | 5.7 | -0.46 | <0.01 | 0.56 | - | - | - |
| Pedestrians (Yes) (No*) | 9.2 | <0.01 | 354 | 7.0 | <0.01 | 436 | 5.7 | <0.01 | 258 |
| Area Type (Urban) (Rural*) | -1.6 | 0.10 | 0.60 | - | - | - | - | - | - |
| Roadway Description (2-way divide) (One way*) | 2.8 | <0.03 | 1.7 | 0.49 | 0.06 | 2.7 | - | - | - |
| Random Effect Coef.: Var (V_dot) | 0.67 | | | 0.64 | | | 0.86 | | |

Goodness of fit: AIC 421118, BIC 422365

The equations for the multi-level multinomial logistic regression model developed for the crash severity are:

Fatality:
\[ \log\frac{P(Y = \text{Fatal})}{P(Y = \text{PDO})} = -10.4 + 1.85X_1 - 0.38X_2 - 0.38X_3 + 1.6X_4 + 0.25X_5 + 3.8X_6 + 4.5X_7 - 0.76X_8 + 9.2X_9 - 1.6X_{10} + 2.8X_{11} \]

Major Injury:
\[ \log\frac{P(Y = \text{Major\_Inj})}{P(Y = \text{PDO})} = -7.01 + 1.4X_1 - 0.69X_2 - 1.1X_3 + 4.1X_4 + 4.4X_5 + 2.3X_6 + 3.6X_7 - 0.46X_8 + 7X_9 - 0.46X_{10} + 0.49X_{11} \]

Minor Injury:
\[ \log\frac{P(Y = \text{Minor\_Inj})}{P(Y = \text{PDO})} = -0.7 + 0.62X_1 - 0.14X_2 - 0.28X_3 - 0.64X_4 - 0.41X_5 + 1.2X_6 + 2.9X_7 - 0.23X_8 + 5.7X_9 - 0.24X_{10} + 0.32X_{11} \]

where:
X1 Collision Type (Head-On)
X2 Collision Type (Rear End)
X3 Collision Type (Sideswipe)
X4 Traffic Control (Yes)
X5 Traffic Control (Other)
X6 Belted (No)
X7 Bike (Yes)
X8 Animal (Yes)
X9 Pedestrians (Yes)
X10 Area Type (Urban)
X11 Roadway Description (2-way divide)

This section discusses the results of the Multi-Level Multinomial Logistic Regression model. The odds ratios for the different variables and their potential impact on collision severity were obtained and analyzed.
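As an illustration of how the reported quantities relate to each other, the following is a minimal sketch, not the estimation code used in this study. It shows (a) that an odds ratio is the exponential of a coefficient, using the Belted (No) coefficients 3.8 and 2.3 from Table 4.2, whose exponentials are about 44.70 and 10, and (b) how the three log-odds equations translate into class probabilities for a hypothetical collision in which only the seatbelt indicator X6 equals 1. The function names are illustrative, and the group-level random effect is assumed to be zero, which is a simplification.

```python
import math

# Intercepts and seatbelt (X6) coefficients as printed in the Section 4.2
# equations. ASSUMPTION: the group-level random effect is set to zero.
INTERCEPTS = {"Fatality": -10.4, "Major Injury": -7.01, "Minor Injury": -0.7}
BELT_COEFS = {"Fatality": 3.8, "Major Injury": 2.3, "Minor Injury": 1.2}


def odds_ratio(coef):
    """Odds ratio implied by a logit coefficient: exp(coef)."""
    return math.exp(coef)


def class_probabilities(log_odds):
    """Convert log-odds versus the PDO baseline into class probabilities."""
    exp_terms = {level: math.exp(value) for level, value in log_odds.items()}
    denom = 1.0 + sum(exp_terms.values())  # the 1.0 is exp(0) for the PDO baseline
    probs = {level: term / denom for level, term in exp_terms.items()}
    probs["PDO"] = 1.0 / denom
    return probs


# Hypothetical profile: seatbelt indicator on, every other indicator off.
log_odds = {level: INTERCEPTS[level] + BELT_COEFS[level] for level in INTERCEPTS}
probs = class_probabilities(log_odds)

for level, coef in BELT_COEFS.items():
    print(f"{level}: exp({coef}) = {odds_ratio(coef):.2f}")
for level, p in probs.items():
    print(f"P({level}) = {p:.4f}")
```

Because every probability shares the same denominator, the four values always sum to 1, and a large odds ratio for a rare outcome can still correspond to a small absolute probability.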
Collision Type (Head-On):

'Collision Type (Head-On)' has odds ratios of 1.9 for fatality, 3.6 for major injuries, and 1.2 for minor injuries. These values indicate a heightened likelihood of both fatality and major injuries in head-on collisions, while minor injuries show a relatively modest elevation. Head-on collisions therefore emerge as a strong driver of severe consequences (See Table 4.2).

Collision Type (Rear End):

For rear-end collisions, the odds ratios are 0.75 for fatality, 0.65 for major injuries, and 1.23 for minor injuries. Fatality and major injuries are less likely in rear-end collisions, perhaps because of their typically lower impact force, with many cases limited to vehicle or property damage. Minor injuries, however, show a slight elevation. These odds ratios show that the influence of collision type differs across injury levels (See Table 4.2).

Collision Type (Sideswipe):

For sideswipe collisions, the odds ratios are 0.68 for fatality, 0.38 for major injuries, and 0.47 for minor injuries. All three are below 1, indicating a reduced likelihood of every injury level and a relatively milder impact of sideswipe collisions compared with the reference category.

Traffic Control:

Compared with collisions at locations without traffic control, those at locations with traffic control have odds ratios of 46 for fatality, 431 for major injuries, and 1 for minor injuries. These values indicate a substantially higher likelihood of fatality and major injuries where traffic control is present, while the odds ratio of 1 for minor injuries suggests no difference from the baseline (See Table 4.2).

Traffic Control (Other):

Relative to the baseline 'No' category, collisions in the 'Traffic Control (Other)' category exhibit odds ratios of 18 for fatality, 345 for major injuries, and 0.84 for minor injuries. Fatality and major injuries are therefore considerably more likely in this category, while minor injuries are slightly less likely (See Table 4.2).

Belted:

Compared to the baseline 'Yes' category, 'Belted (No)' has odds ratios of 44.7 for fatality, 10 for major injuries, and 4 for minor injuries. These numbers show the critical role seatbelt use plays in averting severe outcomes: the odds of fatality and major injury rise sharply without a seatbelt, and minor injuries are also significantly more likely (See Table 4.2).

Bike:

Collisions in the 'Bike (Yes)' category have odds ratios of 13 for fatality, 67 for major injuries, and 41 for minor injuries, compared to the baseline. These odds ratios underline the heightened vulnerability associated with bicycle-involved collisions. The values for fatality and major injuries point to the need for protective measures in these scenarios, while the substantial odds ratio for minor injuries shows that injury is more likely across the whole spectrum (See Table 4.2).

Animal:

Relative to the baseline 'No' category, collisions in the 'Animal (Yes)' category present odds ratios of 5.7 for fatality, 0.56 for major injuries, and 1.5 for minor injuries. The odds ratio of 5.7 for fatality should be interpreted cautiously because animal-related fatalities are rare. The odds ratio for major injuries is below 1, while the value for minor injuries indicates a modest elevation in risk (See Table 4.2).

Pedestrians:

The presence of pedestrians is a highly impactful variable. Pedestrian-involved collisions have odds ratios of 354 for fatality, 436 for major injuries, and 258 for minor injuries relative to the baseline 'No' category. Pedestrian-involved collisions therefore bear a significantly heightened risk across all levels of injury severity, which points to an urgent need for focused efforts to enhance pedestrian safety (See Table 4.2).

Area Type (Urban):

Compared to rural areas, 'Urban' collisions have odds ratios of 0.6 for fatality, 0.67 for major injuries, and 0.87 for minor injuries. All three values are below 1, suggesting that severe outcomes are less likely in urban settings across all levels of injury severity (See Table 4.2).

Roadway Description (2-way divide):

Relative to the baseline 'One way' category, '2-way divide' has odds ratios of 1.7 for fatality, 2.7 for major injuries, and 1.35 for minor injuries. These values indicate a heightened likelihood of severe outcomes on 2-way divided roadways, with a relatively moderate elevation for minor injuries (See Table 4.2).

4.3 Models Comparison

This section identifies the better model for explaining collision severity by comparing two models: the Multinomial Logistic Regression (MLR) and the Multi-Level Multinomial Logistic Regression (MLMLR). Two key metrics, AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), are used to determine which model offers a better fit for our data and research objectives. Both criteria balance model fit against model complexity, and lower values indicate a better trade-off.

| | Multinomial Logistic Regression | Multi-Level Multinomial Logistic Regression |
|---|---|---|
| AIC | 427549 | 421118 |
| BIC | 428464 | 422365 |

The Multinomial Logistic Regression model has an AIC of 427549 and a BIC of 428464. The Multi-Level Multinomial Logistic Regression model has an AIC of 421118 and a BIC of 422365. Lower AIC and BIC indicate a better balance between fit and complexity.
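The comparison can be made concrete with a little arithmetic. The sketch below is illustrative only: it restates the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L (k parameters, n observations, maximized likelihood L) as two helper functions, and then differences the AIC and BIC values reported in the table above. The dictionary layout and function names are assumptions made for this example.

```python
import math

# AIC and BIC values as reported in the Section 4.3 comparison table.
REPORTED = {
    "MLR": {"AIC": 427549, "BIC": 428464},
    "MLMLR": {"AIC": 421118, "BIC": 422365},
}


def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def bic(num_params, num_obs, log_likelihood):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return num_params * math.log(num_obs) - 2 * log_likelihood


def criterion_gap(criterion):
    """How much lower the MLMLR value is than the MLR value."""
    return REPORTED["MLR"][criterion] - REPORTED["MLMLR"][criterion]


def preferred_model(criterion):
    """The model with the lower (better) value of the criterion."""
    return min(REPORTED, key=lambda name: REPORTED[name][criterion])


for criterion in ("AIC", "BIC"):
    print(criterion, "gap:", criterion_gap(criterion),
          "preferred:", preferred_model(criterion))
```

Both criteria penalize extra parameters, so a model with additional random-effect terms is preferred only when the gain in likelihood outweighs that penalty.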
By comparing these values, the model that best explains collision severity can be identified: the lower the AIC and BIC, the better the model explains the data while avoiding overfitting. Based on our AIC and BIC analysis, the Multi-Level Multinomial Logistic Regression model emerges as the better model. The MLMLR has substantially lower AIC and BIC values than the Multinomial Logistic Regression model (a difference of more than 10 units), demonstrating a superior balance between explaining the data and maintaining simplicity. This suggests that the MLMLR model captures the underlying patterns in collision severity more effectively.

5. Machine Learning Modeling Results

This chapter discusses the implementation and results of the Machine Learning algorithm used in the thesis, the Random Forest Classifier.

5.1 Random Forest

To understand collision severity and its contributing factors, a data-driven analysis was carried out using a Random Forest model. The model's feature importance and accuracy provide insight into the prediction of collision severity, and the results and their implications are presented below. Figure 5.1 shows the contribution of each variable to predicting collision severity. Feature importance analysis identifies the variables that are most influential in the prediction: each feature's importance score reflects its contribution to the model's decision-making process, so the list of features and their scores forms a hierarchy of influence.

At the top of the list, "COLLISION_TYPE_1" has an importance score of approximately 19%. This underscores the significant impact of collision type on predicted severity; unsurprisingly, the way vehicles collide greatly influences the outcome. "BELTED_UNBELTED" and "PED_NONPED" follow, with importance scores of around 12% and 10%, respectively, which underscores the role of safety measures (seatbelts) and pedestrian involvement in determining the outcome of collisions.

Figure 5.1 Feature Importance for RF model

"ANIMAL," though appearing further down the list, still holds notable importance at around 8.9%, showing that encounters with animals contribute noticeably to the model's predictions. Other variables such as "BIKE_NONBIKE," "ROADWAY_DESCRIPTION_1," and "TRFC_CTRL_STATUS_TYPE_1" hold moderate importance, reflecting the varied dimensions that influence collision severity.

These importance scores align with the rest of the thesis. The top features are those that intuitively affect collision severity: collision type, seatbelt usage, pedestrian involvement, and road description. These are factors we would naturally expect to influence the outcome, which suggests that the model is capturing meaningful patterns.

Table 5.1 Accuracy for our Random Forest Model

| Metric | Precision | Recall | F1-score |
|---|---|---|---|
| Macro Avg | 0.44 | 0.31 | 0.32 |
| Weighted Avg | 0.63 | 0.69 | 0.61 |

Accuracy: 0.69
AUC-ROC: 0.93

The macro and weighted averages summarize performance across all classes. The macro average is around 32%, which gives every class equal weight regardless of its size. The weighted average, which takes the class distribution into account, is 61%. The accuracy of the model stands at 69%, reflecting the proportion of correctly predicted instances across all categories.
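The gap between the macro and weighted averages is a direct consequence of class imbalance, which the following minimal sketch illustrates. The per-class scores and class counts are hypothetical values chosen only for illustration; they are not results from this study, and the function names are likewise illustrative.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean of the per-class scores weighted by the number of cases per class."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# HYPOTHETICAL per-class F1 scores and class counts (not from this study):
# a large, well-predicted PDO class and small, poorly predicted severe classes.
f1_by_class = {"PDO": 0.80, "Minor Injury": 0.40, "Major Injury": 0.10, "Fatality": 0.02}
support = {"PDO": 7000, "Minor Injury": 2500, "Major Injury": 450, "Fatality": 50}

macro = macro_average(f1_by_class)
weighted = weighted_average(f1_by_class, support)
print(f"macro average:    {macro:.2f}")
print(f"weighted average: {weighted:.2f}")
```

When the largest class is also the best predicted one, the weighted average sits well above the macro average, so a low macro average signals weak performance on the rare, severe classes.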
While this is a decent overall accuracy, accuracy can be misleading when dealing with imbalanced datasets. The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is therefore also reported. The ROC curve is defined for a binary decision: it plots the True Positive Rate (sensitivity) against the False Positive Rate (1-specificity) as the classification threshold varies, showing the model's ability to discriminate between positive and negative cases. For a multi-class target such as collision severity, a single AUC-ROC value is obtained by averaging over classes, for example by treating each class against the rest.

The AUC-ROC value, ranging from 0 to 1, summarizes the overall discriminative performance of the model. An AUC-ROC of 0.5 corresponds to random guessing, while an AUC-ROC of 1 signifies perfect separation. In our results, the AUC-ROC of 0.937 indicates that the Random Forest model distinguishes effectively between the severity classes. In practical terms, the model can be expected to rank cases that belong to a given class above cases that do not with high probability. Overall, the Random Forest model performed well on this metric, which indicates a strong ability to differentiate between the classes.

5.1.1 Simulating the RF Model

In this research study, predictive modeling using a Random Forest Classifier was conducted on the Collision Cleaned dataset. The dataset, comprising variables such as COLLISION_TYPE_1, TRFC_CTRL_STATUS_TYPE_1, WEATHER_CONDITION_1, and others, was used to predict the multi-class target variable, Crash_severity_3_level.

After building and fine-tuning the Random Forest model, its behavior under different circumstances was examined to extract patterns. The eight most important variables, including Belted, Animal, Pedestrians, Crash Day, Bike, Motor, and Collision type, were modified one at a time. Because these are the most important variables for predicting the output, the simulations show how each of them affects the predicted outcomes.

BELTED:

The impact of the 'BELTED_UNBELTED' variable on the predicted class labels for Collision Severity was simulated with the Random Forest model while keeping the other variables unchanged. This variable represents whether a person was belted (0) or unbelted (1) during a collision. The goal was to understand how this variable affects the severity of crashes, as indicated by the target class labels: Fatal, Major Injury, Minor Injury, and PDO. The results were visualized with a stacked bar plot, which shows the distribution of predicted class labels for two scenarios: 'Belted' (0) and 'Unbelted' (1) individuals involved in collisions.

For 'Belted' Individuals (0): The 'Fatal' category has a count of 348, meaning that 348 collisions were predicted as fatal when individuals were belted. The 'PDO' (Property Damage Only) category has a count of 329,803, so the large majority of collisions were predicted to result in property damage only. The counts for the 'major_inj' (major injuries) and 'minor_inj' (minor injuries) categories are 3,279 and 13,050, respectively.

Figure 5.2 Target Classes based on Simulating variable BELTED_UNBELTED

For 'Unbelted' Individuals (1): The 'Fatal' category has a much higher count of 4,101, suggesting that collisions involving unbelted individuals are more likely to result in fatal outcomes. The 'PDO' category has a count of 123,295, still a substantial number of property damage-only collisions. The counts for the 'major_inj' and 'minor_inj' categories are 29,756 and 189,328, respectively, which suggests a higher likelihood of major and minor injuries for unbelted individuals than for belted ones.

The results and the visualization show that wearing seat belts plays a crucial role in reducing the severity of collision outcomes. Collisions involving belted individuals are more likely to result in property damage only, while unbelted individuals face a higher risk of fatal outcomes, major injuries, and minor injuries.

MOTOR:

The simulated impact of the 'MOTOR_NONMOTOR' variable on the predicted class labels is shown in Figure 5.3. The feature was altered to understand how different vehicle types influence crash severity. When vehicles were identified as "MOTOR" (motorized vehicles), the majority of crashes resulted in Property Damage Only (PDO) incidents (324,504 cases), followed by minor injuries (17,095 cases). For "NONMOTOR" vehicles (non-motorized transport), there were far fewer PDO incidents (50,656) and many more minor injuries (195,073), together with a notable increase in severe incidents: 7,452 fatalities and 93,299 major injuries.

Figure 5.3 Target Classes based on Simulating variable MOTOR_NONMOTOR

This indicates that non-motorized road users are more susceptible to severe outcomes. Such insights are important for designing safety measures for different vehicle types. The findings emphasize the significance of vehicle type in predicting crash severity outcomes, which can inform policy and infrastructure improvements that address these disparities.

Collision Type:

In this part, the impact of Collision Type, such as Head-on, Rear end, and Sideswipe, is analyzed. The simulated impact of the different collision types on the predicted class labels shows how varying collision scenarios affect the severity of outcomes.

The bar plot shows the distribution of predicted class labels across the collision types. Collision type '6' has the highest frequency of 'PDO' (Property Damage Only) outcomes, with a count of 321,198. Conversely, collision type '3' has the highest count of 'Fatal' outcomes at 4,371. This suggests that collision type '6' tends to involve less severe accidents, while collision type '3' corresponds to more catastrophic incidents (See Figure 5.4).

Figure 5.4 Target Classes based on Simulating variable COLLISION_TYPE_1

The distribution of 'minor_inj' and 'major_inj' outcomes across collision types shows a similar pattern. Collision type '3' again has the highest counts in both categories, indicating a higher likelihood of serious injuries, and collision type '2' has the second-highest count in both categories. The results also show variation in outcome severity within specific collision types. For instance, collision type '2' has a relatively balanced distribution across all four outcomes, while collision type '0' has a comparatively high count of 'Fatal' outcomes (See Figure 5.4).
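The simulations in this section follow a simple pattern that can be sketched as follows: for each candidate value of one variable, every record is copied with that variable overwritten, the classifier re-predicts, and the predicted class labels are counted. This is a minimal illustration under stated assumptions, not the code used in this study. The helper `simulate_feature` and the rule-based `toy_predict` are hypothetical; in practice the prediction function of the fitted Random Forest would take the place of `toy_predict`.

```python
from collections import Counter


def simulate_feature(records, feature, values, predict):
    """For each candidate value, overwrite `feature` in a copy of every
    record, re-predict, and count the predicted class labels."""
    results = {}
    for value in values:
        modified = [{**row, feature: value} for row in records]
        results[value] = Counter(predict(row) for row in modified)
    return results


def toy_predict(row):
    """HYPOTHETICAL stand-in for a fitted classifier's prediction rule."""
    risk = 2 * row["PED_NONPED"] + row["BELTED_UNBELTED"]
    if risk >= 3:
        return "Fatal"
    if risk == 2:
        return "major_inj"
    if risk == 1:
        return "minor_inj"
    return "PDO"


# Four made-up records; only the two variables used by toy_predict are included.
records = [
    {"BELTED_UNBELTED": 0, "PED_NONPED": 0},
    {"BELTED_UNBELTED": 1, "PED_NONPED": 0},
    {"BELTED_UNBELTED": 0, "PED_NONPED": 1},
    {"BELTED_UNBELTED": 1, "PED_NONPED": 1},
]

counts = simulate_feature(records, "BELTED_UNBELTED", [0, 1], toy_predict)
for value, label_counts in counts.items():
    print(value, dict(label_counts))
```

Because every record is re-predicted under each value, the class counts for each scenario add up to the same total, which is why the belted and unbelted counts reported above both sum to 346,480.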
Pedestrians: In the simulation that assesses the impact of different pedestrian involvement levels ('PED_NONPED') on predicted class labels, a comprehensive understanding of potential outcomes based on pedestrian presence or absence is unveiled. The visualization and numerical breakdown offer critical insights into the severity of predicted consequences in vehicular incidents involving pedestrians. In figure 5.5 the bar plot presentation adeptly illustrates how varying levels of pedestrian involvement correlate with predicted class labels. Notably, instances with 'PED_NONPED' equal to '1' exhibit a substantially higher occurrence of 'Fatal' outcomes, totaling 26,567. This observation is noteworthy and suggests that incidents involving pedestrians often result in more severe consequences, underscoring the vulnerability of pedestrians on the road. Moreover, while 'minor_inj' outcomes still persist in the presence of pedestrians, they considerably outweigh 'major_inj' and 'Fatal' outcomes. In contrast, when 'PED_NONPED' equals '0,' the incidence of 'Fatal' outcomes significantly decreases to 585. This trend implies that accidents in which pedestrians are not involved tend to result in fewer fatal incidents. Interestingly, 'PDO' (Property Damage Only) outcomes remain the most common in such scenarios, indicating relatively less severe accidents. 62 Figure 5.5 Target Classes based on Simulating variable PED_NONPED Additionally, the count of 'PDO' outcomes is substantially higher across both levels of pedestrian involvement, suggesting that many accidents might involve only property damage, regardless of whether pedestrians are present or not. Bike: The simulation exploring the impact of bicycle involvement ('BIKE_NONBIKE') on predicted class labels reveals compelling insights into the potential outcomes of accidents involving bicycles. 
The visual representation and quantitative breakdown offer a clear understanding of the severity of predicted consequences in incidents where bicycles are present on the road. 63 The stacked bar plot eloquently illustrates the distribution of predicted class labels based on the presence or absence of bicycles. Notably, instances with 'BIKE_NONBIKE' equal to '1' show a substantial count of 'minor_inj' outcomes, totaling 89,946. This observation suggests that accidents involving bicycles tend to result in a higher frequency of minor injuries. Furthermore, 'PDO' (Property Damage Only) outcomes are also prominent, signifying that accidents with bicycles often result in minimal physical harm but still involve property damage. Figure 5.6 Target Classes based on Simulating variable BIKE-NONBIKE Conversely, when 'BIKE_NONBIKE' equals '0,' the occurrences of 'minor_inj' outcomes decrease to 5,923. However, 'Fatal' outcomes in this scenario remain comparatively low at 732. This pattern indicates that accidents without bicycle involvement are less likely to result in 64 fatalities, possibly due to lower impact forces. Moreover, 'PDO' outcomes persist as the most common outcome. Interestingly, 'major_inj' outcomes occur regardless of bicycle involvement, emphasizing that severe injuries remain a possibility. Additionally, 'PDO' outcomes remain consistently high, regardless of bicycle presence or absence, further underscoring that many accidents result in only property damage. Crash Day- Weekday: The simulation investigating the influence of the day of the week on predicted accident outcomes ('CRASH_DT_day_1') provides valuable insights into the potential variations in accident severity based on different days. The visual representation and the accompanying numerical breakdown offer a comprehensive understanding of how accident outcomes differ across days of the week. 
Figure 5.7 Target Classes based on Simulating variable CRASH DAY 65 The stacked bar plot effectively illustrates the distribution of predicted class labels for each day of the week. Notably, accidents occurring on both days ('CRASH_DT_day_1' = 0) and weekdays ('CRASH_DT_day_1' = 1) exhibit similar trends in predicted outcomes. The highest count of outcomes is 'PDO' (Property Damage Only), which signifies accidents with minimal physical harm but property damage. It's intriguing to observe that these outcomes are consistently prominent across all days (See Figure 5.7). Interestingly, the counts of 'minor_inj' and 'major_inj' outcomes vary slightly between the two groups. Accidents on weekdays tend to have marginally higher counts of 'major_inj' outcomes, suggesting that accidents occurring during the workweek may involve higher impact forces. Conversely, accidents on both days have a slightly higher count of 'minor_inj' outcomes, indicating that minor injuries are more frequent in these cases. Accuracy Assessment: 66 Accuracy serves as a benchmark to gauge the model's performance in classification tasks. In our case, the accuracy of approximately 68.4% signifies the proportion of correctly predicted collision severity levels. While this accuracy score provides a solid foundation, it also highlights the complexity of predicting such multifaceted events. Achieving higher accuracy would require a deep understanding of the intricate interplay between numerous variables influencing collision outcomes. Interpreting the Results: Our research takes a significant step forward with these results. The feature importance scores not only shed light on influential factors but also guide us toward areas that demand greater attention and further investigation. The dominance of collision type, seatbelt usage, and pedestrian involvement suggests that interventions aimed at reducing collision severity could potentially focus on these aspects. 
Moreover, the accuracy score of our model indicates that it can discern collision severity to a reasonable extent. This provides a solid foundation for policymakers, law enforcement, and stakeholders to develop informed strategies for road safety enhancements. However, it is essential to acknowledge the room for improvement. Collaboration between data scientists, traffic experts, and policymakers is crucial for refining the model's accuracy and ensuring its real-world applicability.

6. Conclusion

6.1 Summary of Findings

This research study aimed to examine the factors contributing to traffic accident severity in North Virginia, US. The analysis required models that could address this question and also accommodate an outcome variable with more than two categories, so the Multinomial Logistic Regression (MLR) and the Multi-level Multinomial Logistic Regression (MLMLR) models were used.

Analysis of the results from both models yielded distinct patterns and insights regarding the relationship between various factors and collision severity. The MLR model revealed several key factors that influence collision severity levels, including Collision Type, Traffic Control, Belt Usage, Bicycle Presence, Animal Incidents, and Pedestrian Involvement. These variables, which were selected for their theoretical relevance to collision severity, showed statistically significant coefficients, indicating a significant impact on the outcome categories of Fatality, Major Injury, and Minor Injury. The results from this model show that the odds ratios for fatality and major injury when Pedestrians and Bikes are involved in accidents are 269, 28441 and 104, 4952, respectively, indicating that collisions involving pedestrians and bikes have the most severe impact, leading to more fatalities and major injuries.
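For readers less familiar with odds ratios, the sketch below shows how they relate to logistic regression coefficients: the odds ratio is the exponential of the coefficient, and it multiplies the odds of an outcome category relative to the reference category. The value 269 is one of the odds ratios reported above. The baseline odds of 0.001 and the function names are hypothetical, and treating 'PDO' as the reference category is an assumption made only for this illustration.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio implied by a logit coefficient: exp(beta)."""
    return math.exp(coefficient)


def coefficient_from_odds_ratio(ratio: float) -> float:
    """Logit coefficient implied by an odds ratio: ln(OR)."""
    if ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(ratio)


def scaled_odds(baseline_odds: float, ratio: float) -> float:
    """Odds of the outcome (versus the reference category) after applying the odds ratio."""
    return baseline_odds * ratio


reported_or = 269.0  # one of the odds ratios reported in the text
beta = coefficient_from_odds_ratio(reported_or)

# Hypothetical baseline: odds of the severe outcome versus the reference
# category (assumed here to be PDO) when the factor is absent.
baseline = 0.001
with_factor = scaled_odds(baseline, odds_ratio(beta))

print(round(beta, 3))
print(with_factor)
```

An odds ratio scales odds, not probabilities, so a large ratio applied to a very small baseline can still correspond to a modest absolute risk.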
Moreover, collisions in which the driver is not wearing a safety belt can lead to more minor injuries. Collisions that occur at traffic controls (e.g., signalized intersections) are likely to be more severe than collisions that occur on regular road segments.

On the other hand, the MLMLR model, with its ability to account for hierarchical data structures, provided a more nuanced perspective. It reaffirmed the influence of factors like Collision Type, Traffic Control, Belt Usage, Bicycle Presence, Animal Incidents, and Pedestrian Involvement. The results from the model show that accidents involving Pedestrians and Bikes are more severe, leading to fatalities and major injuries. Moreover, additional variables such as Area Type and Roadway Description were introduced, shedding light on their roles in influencing collision severity.

Comparing the two models, it is evident that both the MLR and MLMLR models offer valuable insights into the relationship between predictor variables and collision severity outcomes. However, the MLMLR model brings an added advantage by accommodating the hierarchical nature of the data, yielding results that are potentially more robust and finer-grained.

6.2 Limitation and Future Work

Several limitations of this research study warrant consideration and offer avenues for future investigation. It is important to recognize the inherent variability in crash record collection practices across different entities such as counties, governments, police agencies, and insurance companies. The dataset considered in this study came from 9 districts of the North Virginia Transport Department, and every jurisdiction implemented its own rules and policies to report and collect information on crashes and collisions.
This variation in data collection procedures may introduce inconsistencies in the available data, potentially leading to discrepancies in the included variables and their accuracy. To address this limitation, a concerted effort could be directed toward incorporating additional key and standardized variables that are universally relevant across jurisdictions. This expansion of variables would enhance the comprehensiveness of the model and contribute to its robustness, making it more adaptable to varying data collection practices.

Furthermore, the temporal scope of the dataset, spanning from 2019 to 2023, may be considered relatively limited in capturing the full spectrum of crash occurrences and associated influencing factors. Extending the dataset's timeline to encompass a wider range of years could lead to more substantial and statistically significant findings. A longer timeframe would enable the identification of trends and patterns that may emerge over time, thereby providing a more comprehensive understanding of the dynamics driving collision severity outcomes.

In addition, analyzing the interdependency of variables could uncover hidden patterns and improve the predictive accuracy of our models. For instance, investigating scenarios where factors like alcohol impairment intersect with adverse weather conditions, such as slippery roads due to snow, could reveal critical risk factors that increase the likelihood of severe injuries. This multidimensional analysis promises a more comprehensive understanding of collision dynamics and paves the way for more effective accident prevention and response strategies. Future studies in this direction have the potential to contribute significantly to the field of road safety.

Sampling can also be considered a limitation.
Most collision records are retrieved from police reports, and there may be many instances in which nothing was reported because there was no injury or vehicle damage. A critical issue demanding attention is therefore the potential under-reporting of crashes within the dataset. Not all crashes of significance are reported and subsequently collected. A 2009 study by the National Highway Traffic Safety Administration examined crash reporting and found that approximately 25% of minor injury crashes and 50% of no-injury crashes went unreported. Under-reporting of this magnitude can bias the results. It stands in stark contrast to the reporting rate observed for fatal crashes, which was close to 100 percent (National Highway Traffic Safety Administration, 2009; Blincoe et al., 2002). This under-reporting could stem from a variety of factors, such as mild collisions not being deemed report-worthy or certain crashes occurring in remote areas with limited documentation. This limitation underscores the importance of interpreting the results cautiously and recognizing that they may not capture the entire spectrum of crash severity incidents.

Moreover, a specific challenge in the Canadian jurisdiction pertains to its reporting thresholds for collisions. As a consequence of such thresholds, the crash records commonly relied upon operate as outcome-based samples. In this type of sample, the injury severities captured within police reports do not precisely reflect the true spectrum of crash incidents, primarily because less severe injuries are under-reported. This casts doubt on the accuracy of parameter estimates derived from outcome-based samples, potentially resulting in biased outcomes.
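To make the effect of under-reporting concrete, the sketch below rescales observed crash counts by the inverse of an assumed reporting rate per severity level, a simple inverse-probability style adjustment that is not part of this study's method. The reporting rates for 'PDO' (50%), 'minor_inj' (75%), and 'Fatal' (100%) follow the figures cited above. The rate for 'major_inj' is not given in the text and is set to 100% purely as an assumption, and the observed counts are made-up values.

```python
from typing import Dict


def adjust_for_under_reporting(
    observed: Dict[str, float],
    reporting_rate: Dict[str, float],
) -> Dict[str, float]:
    """Estimate true counts as observed count divided by the reporting rate."""
    adjusted = {}
    for severity, count in observed.items():
        rate = reporting_rate[severity]
        if not 0 < rate <= 1:
            raise ValueError(f"reporting rate for {severity} must be in (0, 1]")
        adjusted[severity] = count / rate
    return adjusted


def shares(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert counts to proportions of the total."""
    total = sum(counts.values())
    return {severity: count / total for severity, count in counts.items()}


# Reporting rates: PDO, minor_inj and Fatal follow the cited figures;
# the major_inj rate is an assumption (not stated in the text).
reporting_rate = {"PDO": 0.50, "minor_inj": 0.75, "major_inj": 1.00, "Fatal": 1.00}

# Hypothetical observed counts, used only to illustrate the arithmetic.
observed = {"PDO": 500, "minor_inj": 300, "major_inj": 50, "Fatal": 10}

adjusted = adjust_for_under_reporting(observed, reporting_rate)
observed_shares = shares(observed)
adjusted_shares = shares(adjusted)

for severity in observed:
    print(severity, round(observed_shares[severity], 3), round(adjusted_shares[severity], 3))
```

Under these assumptions the share of fatal and major injury crashes falls after adjustment, which illustrates how a police-reported sample can overstate the relative frequency of severe outcomes.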
Considering this, understanding the reporting criteria and their influence on data collection is pivotal for an accurate assessment of crash severity (Savolainen et al., 2011).

Future investigations could explore these limitations more comprehensively. Further research could focus on refining and expanding the model by incorporating additional covariates that address the variations in data collection practices among different entities. In conjunction, a cross-jurisdictional analysis could be pursued to assess the impact of reporting thresholds on the dataset's representativeness. Additionally, a comparative study that encompasses multiple countries with varying reporting practices could offer a broader perspective on the relationship between collision severity and data collection methodologies.

In conclusion, while this research study has provided valuable insights into the relationships between predictor variables and collision severity outcomes, it is crucial to recognize the limitations inherent in the data and methodology. Future work could overcome these limitations by enhancing the model's comprehensiveness, extending the dataset's temporal scope, and investigating under-reporting and reporting thresholds. To gain better insight into regions of Canada, and Vancouver specifically, reliable data could be retrieved from police reports and two or three relevant models could then be selected to produce the desired outcomes. Moreover, depending on the dataset, more advanced Machine Learning models that consider a broader set of variables could be included and may yield better results. Addressing these aspects would refine and enrich the analytical approach, ultimately advancing our understanding of the dynamics shaping collision severity outcomes.

References

1. Agyemang, W., Li, J., & Wu, C. (2019).
Behavioral factors contributing to traffic crash severity. Journal of Transportation Safety & Security, 11(2), 151-165. https://doi.org/10.1080/19439962.2017.1389633
2. Alghnam, S., Towhari, J., Alkelya, M., Alsaif, A., Alrowaily, M., Alrabeeah, F., & Albabtain, I. (2019). The association between Mobile phone use and severe traffic injuries: a case-control study from Saudi Arabia. International journal of environmental research and public health, 16(15), 2706.
3. Ahsan, H. M., & Sufian, A. A. (2014). Present condition and safety issues of non-motorized vehicles in Bangladesh. Journal of Civil Engineering (IEB), 42(1), 93-101.
4. Anjuman, T., Siddiqui, C. K. A., Hasanat-E-Rabbi, S., & Hoque, M. M. (2013, March). Heavy vehicle driver involvement in road safety and multiple vehicle accidents in Bangladesh. In Proceedings of the International Conference on Heavy Vehicles (pp. 257-267). John Wiley & Sons, Hoboken, NJ.
5. Azimian, A., Pyrialakou, V. D., Lavrenz, S., & Wen, S. (2021). Exploring the effects of area-level factors on traffic crash frequency by severity using multivariate space-time models. Analytic Methods in Accident Research, 31, 100163.
6. Aziz, H. A., Ukkusuri, S. V., & Hasan, S. (2013). Exploring the determinants of pedestrian–vehicle crash severity in New York City. Accident Analysis & Prevention, 50, 1298-1309.
7. Bahrololoom, S., Young, W., & Logan, D. (2017, November). A random parameter model of factors influencing bicycle fatal and serious injury crashes in Victoria, Australia. In 39th Australasian Transport Research Forum (ATRF), Auckland, New Zealand.
8. Breiman, L. (1999). Random forests—random features. Technical Report 567, Statistics Department, University of California, Berkeley. ftp://ftp.stat.berkeley.edu/pub/users/breiman
9. Canadian Institute of Actuaries. (2019). Motor Vehicle Collisions in Canada: Cost and Frequency Analysis. Retrieved from https://www.cia-ica.ca/docs/default-source/2019/219099e.pdf
10. Chen, P., & Shen, Q. (2019). Identifying high-risk built environments for severe bicycling injuries. Journal of safety Research, 68, 1-7.
11. Ehsani, J. P., Bingham, C. R., Ionides, E., & Childers, D. (2014). The impact of Michigan's text messaging restriction on motor vehicle crashes. Journal of Adolescent Health, 54(5), S68-S74.
12. Eluru, N., Bhat, C. R., & Hensher, D. A. (2008). A mixed generalized ordered response model for examining pedestrian and bicyclist injury severity level in traffic crashes. Accident Analysis & Prevention, 40(3), 1033-1054.
13. Hammad, H. M., Ashraf, M., Abbas, F., Bakhat, H. F., Qaisrani, S. A., Mubeen, M., ... & Awais, M. (2019). Environmental factors affecting the frequency of road traffic accidents: a case study of sub-urban area of Pakistan. Environmental Science and Pollution Research, 26, 11674-11685.
14. Gamage, P., Karim, M. S., & Dozza, M. (2021). Understanding the influence of behavioral aspects in traffic collisions severities and its impact on economy. Transportation Research Interdisciplinary Perspectives, 10, 100367. https://doi.org/10.1016/j.trip.2021.100367
15. Hamer, M., Grzebieta, R., Williamson, A., & Olivier, J. (2021). Investigating the relationship between road design and serious injury crashes in Canada. Accident Analysis & Prevention, 153, 105975.
16. Jacobsen, M. R. (2013). Fuel economy and safety: The influences of vehicle class and driver behavior. American Economic Journal: Applied Economics, 5(3), 1-26.
17. Kang, B. (2019). Identifying street design elements associated with vehicle-to-pedestrian collision reduction at intersections in New York City. Accident Analysis & Prevention, 122, 308-317.
18. Lestina, D. C., Williams, A. F., Lund, A. K., Zador, P., & Kuhlmann, T. P. (1991). Motor vehicle crash injury patterns and the Virginia seat belt law. Jama, 265(11), 1409-1413.
19. Musa, M. F., Hassan, S. A., & Mashros, N. (2020). The impact of roadway conditions towards accident severity on federal roads in Malaysia. PLoS one, 15(7), e0235564.
20. Nguyen, H., Ivers, R. Q., Jan, S., Martiniuk, A. L., Li, Q., & Pham, C. (2013). The economic burden of road traffic injuries: evidence from a provincial general hospital in Vietnam. Injury prevention, 19(2), 79-84.
21. Noland, R. B., & Quddus, M. A. (2004). A spatially disaggregate analysis of road casualties in England. Accident Analysis & Prevention, 36(6), 973-984. https://doi.org/10.1016/j.aap.2004.03.002
22. Osman, M., Paleti, R., Mishra, S., & Golias, M. M. (2016). Analysis of injury severity of large truck crashes in work zones. Accident Analysis & Prevention, 97, 261-273.
23. Paleti, R., Eluru, N., & Bhat, C. R. (2010). Examining the influence of aggressive driving behavior on driver injury severity in traffic crashes. Accident Analysis & Prevention, 42(6), 1839-1854.
24. Pang, T. Y., Cheung, Y. K., & Wong, S. C. (2019). Impact of cell phone use while driving on traffic safety in Canada: A review. Journal of Safety Research, 70, 129-136.
25. Pasha, M., Rifaat, S. M., Tay, R., & De Barros, A. (2016). Effects of street pattern, traffic, road infrastructure, socioeconomic and demographic characteristics on public transit ridership. KSCE Journal of Civil Engineering, 20(3), 1017.
26. Quddus, M. (2015). Effects of geodemographic profiles of drivers on their injury severity from traffic crashes using multilevel mixed-effects ordered logit model. Transportation research record, 2514(1), 149-157.
27. Rahimi, E., Shamshiripour, A., Samimi, A., & Mohammadian, A. K. (2020). Investigating the injury severity of single-vehicle truck crashes in a developing country. Accident Analysis & Prevention, 137, 105444.
28. Robartes, E., & Chen, T. D. (2017). The effect of crash characteristics on cyclist injuries: An analysis of Virginia automobile-bicycle crash data. Accident Analysis & Prevention, 104, 165-173.
29. Rogeberg, O., & Elvik, R. (2016). The effects of cannabis intoxication on motor vehicle collision revisited and revised. Addiction, 111(8), 1348-1359.
30. Rovšek, V., Batista, M., & Bogunović, B. (2017). Identifying the key risk factors of traffic accident injury severity on Slovenian roads using a non-parametric classification tree. Transport, 32(3), 272-281.
31. Safaei, B., Safaei, N., Masoud, A., & Seyedekrami, S. (2021). Weighing criteria and prioritizing strategies to reduce motorcycle-related injuries using combination of fuzzy TOPSIS and AHP methods. Advances in transportation studies, 54.
32. Sapkota, D., Bista, B., & Adhikari, S. R. (2021). Economic costs associated with motorbike accidents in Kathmandu, Nepal. Journal of Health and Allied Sciences, 11(1), 1-5.
33. Savolainen, P. T., Mannering, F. L., Lord, D., & Quddus, M. A. (2011). The statistical analysis of highway crash-injury severities: A review and assessment of methodological alternatives. Accident Analysis & Prevention, 43(5), 1666-1676.
34. Shrestha, R., Callaghan, J. P., & Taylor, E. (2017). The economic burden of alcohol-related collisions in Canada. Traffic Injury Prevention, 18(7), 724-730.
35. Tang, J., Gao, F., Liu, F., Han, C., & Lee, J. (2020). Spatial heterogeneity analysis of macro-level crashes using geographically weighted Poisson quantile regression. Accident Analysis & Prevention, 148, 105833.
36. Tlaiss, H. A., & Baaj, M. H. (2020). Economic cost of road crashes: A systematic review. Traffic Injury Prevention, 21(1), 6-12. https://doi.org/10.1080/15389588.2019.1673778
37. Toran Pour, A., Moridpour, S., Tay, R., & Rajabifard, A. (2017). Neighborhood influences on vehicle-pedestrian crash severity. Journal of urban health, 94, 855-868.
38. Transport Canada. (2018). Road safety in Canada 2018. Retrieved from https://www.tc.gc.ca/eng/motorvehiclesafety/tp-tp15145-1201.htm
39. Transport Canada. (2021). Canadian Motor Vehicle Traffic Collision Statistics: 2019. Retrieved from https://www.tc.gc.ca/eng/motorvehiclesafety/canadian-motor-vehicle-traffic-collision-statistics-2019.html
40. WHO. (2021). Road traffic injuries. Retrieved from https://www.who.int/news-room/factsheets/detail/road-traffic-injuries
41. Wali, B., Khattak, A. J., & Xu, J. (2018). Contributory fault and level of personal injury to drivers involved in head-on collisions: Application of copula-based bivariate ordinal models. Accident Analysis & Prevention, 110, 101-114.
42. Wang, C., Quddus, M. A., & Ison, S. G. (2013). The effect of traffic and road characteristics on road safety: A review and future research direction. Safety science, 57, 264-275.
43. Zahabi, S. A. H., Strauss, J., Manaugh, K., & Miranda-Moreno, L. F. (2011). Estimating potential effect of speed limits, built environment, and other factors on severity of pedestrian and cyclist injuries in crashes. Transportation research record, 2247(1), 81-90.
44. Zhang, G., Yau, K. K., & Zhang, X. (2014). Analyzing fault and severity in pedestrian–motor vehicle accidents in China. Accident Analysis & Prevention, 73, 141-150.
45. Zhang, G., Yau, K. K., & Chen, G. (2013). Risk factors associated with traffic violations and accident severity in China. Accident Analysis & Prevention, 59, 18-25.
46. Zhou, H., Yuan, C., Dong, N., Wong, S. C., & Xu, P. (2020). Severity of passenger injuries on public buses: A comparative analysis of collision injuries and non-collision injuries. Journal of Safety Research, 74, 55-69.
47. Zou, Y., Zhang, Y., & Cheng, K. (2021). Exploring the impact of climate and extreme weather on fatal traffic accidents. Sustainability, 13(1), 390.