Fig. 1. Examples of simulation-based tests of an SDC: (a) failing test, the SDC drives off-lane (unsafe); (b) passing test, the SDC drives in-lane (safe).

self-driving cars (SDCs) [35, 36, 60, 62, 64, 74] can experience failures that can harm humans or the environment [28].

Testing safety-critical autonomous systems is crucial to avoid harmful incidents in real environments [2, 10, 22, 37, 71]. To that end, simulation environments have been widely adopted to test cyber-physical systems (CPS) in general [37, 38, 49], and SDCs in particular [9, 21]. Simulation-based testing is easier to replicate and is more cost-efficient than field testing [30]. Figure 1 illustrates two test cases where an SDC model is deployed in a virtual environment, and the simulated car behaves according to the control algorithms. A test case is said to pass if the car's behavior can be considered safe, while unsafe behavior constitutes a failing test case. Figure 1a shows an unsafe behavior (failing test) as the SDC drives off the lane, while Figure 1b shows a passing test.

Current research on simulation-based test case generation for SDCs relies on an oracle that determines whether a system under test is safe or unsafe based on safety metrics [10, 24, 50], particularly the out-of-bound (OOB) metric. This metric is largely adopted for assessing the safety behavior of SDCs [24, 48, 50] and relates to the lateral position of the SDC within its lane [35]. Both test cases illustrated in Figure 1 are classified using the OOB metric [11] and align with the human perception of safety. However, it is yet unclear whether such safety metrics (e.g., OOB) serve as meaningful oracles for assessing the safety behavior of SDCs. For instance, the test cases in Figure 2 are marked as passing according to the OOB metric, as the SDC keeps the lane. On the contrary, from a human standpoint, the behavior of the SDC can hardly be considered safe. In the first test case, using the BeamNG.tech simulator [26] (see Figure 2a), the SDC approaches solid delineators after ignoring a speed bump. Although it maintains the lane at a speed of 50 km/h, there is a high risk of an accident; the OOB metric nevertheless classifies this test case as a pass. In the second test case, using the CARLA simulator [21] and shown in Figure 2b, the SDC ignores the red signal. Since the car stays in the lane, it meets the OOB criterion, leading to a false passing test case.

Inspecting the OOB metric reveals that it is measured at a single point in time in the simulation, which is insufficient to identify unsafe behaviors. For instance, Figure 2a shows the speed bumps on the right lane, and evaluating the SDC at a single point is insufficient to assess its safety over these speed bumps. Unlike real-world speed bumps, which are smooth and rounded, the test bumps have sharp edges that damage the SDC even at speeds that are reasonable from a human viewpoint. Similarly, Figure 2b shows another instance where we observe the red light signal, but the SDC ignores it. It is unclear whether the red signal was already there before the SDC drove past it or whether the signal turned red just after the SDC analyzed the simulation scene. We hypothesize that current simulation-based testing of SDCs does not always align with the human perception of safety [24, 48, 50] and realism [4, 47, 54, 70], which are relevant aspects for an effective assessment.
Hence, our primary goal is to understand and characterize this mismatch by answering the following question: When and why do safety metrics of simulation-based test cases of SDCs match human perception?
Fig. 2. Examples of unsafe tests with valid OOB criteria: (a) SDC in BeamNG.tech driving at 50 km/h close to obstacles; (b) SDC in CARLA crossing a red signal without stopping.

To address the problem of safety and realism of test cases described in our motivating examples, we conducted an empirical study involving 50 participants using our framework named SDC-Alabaster. The framework employs virtual reality (VR) technologies [59] (i) to immerse humans in virtual SDCs so that they can sense and experience the virtual environment similarly to the real world, and (ii) to enable SDC developers and researchers to analyze the human perception of safety and realism of SDC test cases. The participants in our study are asked to assess the level of safety and realism of multiple, diverse simulation-based test cases. Moreover, we give the participants the possibility to experience simulation-based test cases in which they can influence the behavior of (i.e., interact with) the SDC. We experimented with two representative SDC simulators as virtual environments, BeamNG.tech and CARLA, which are widely used in academia and industry.

The paper contributes to and complements previous research as follows:
• We propose a methodology, implemented in the SDC-Alabaster framework, a VR-based technological approach to examine how quality metrics align with the human perception of safety and realism in simulation-based testing. This addresses the Reality Gap problem [4, 36, 47, 54, 70], a significant concern in simulation-based testing (Section 7);
• We present the first empirical study that investigates the perception of realism and safety in SDC test cases with 50 participants using VR technology. We publicly share a replication package with the code to reproduce our results (Section 9);
• We share a first taxonomy of factors influencing the perceived realism of SDC simulators and discuss the confounding factors and implications of our work.

Our results show the impact of using VR in assessing SDCs, highlighting the dynamic nature of safety perception in test cases: "Safety perception of SDC test cases is not static." Such results emphasize the importance of human interaction with the vehicle when evaluating SDCs using VR.

The paper covers background (Section 2), study design (Section 3), our framework, experiments, and methodology. Section 4 presents our results, followed by discussions in Section 5 and threats to validity in Section 6. We discuss related work and conclusions in Section 7 and Section 8.

2 BACKGROUND

This section provides background on the existing technologies used in our study. Specifically, we briefly overview the simulators, test generators, and a test runner for SDCs, as well as the VR technology.

2.1 SDC simulators

To investigate when the safety metrics for SDCs match human perception, we use two state-of-the-art SDC simulators, namely BeamNG.tech and CARLA. The selection of these simulators is based on two criteria: (i) BeamNG.tech is mainly used in academia and is gaining attention in industrial contexts [10, 24, 46, 50], and (ii) CARLA is well known in industry and academia [21, 29, 75].
We did not use the Udacity, Apollo, SVL, and DeepDrive simulators since their active development has stopped or their release cycles are too long.

2.1.1 BeamNG.tech. BeamNG.tech is a well-known reference simulator used in recent years in several software engineering studies and SDC testing competitions [10, 11, 24, 27, 50]. The BeamNG.tech simulator comes with a soft-body physics engine that allows deformations and thus more realistic crashes and impact forces on objects.

2.1.2 CARLA. Another widely used simulator in academia and practice is CARLA [21, 29, 75, 77]. The differences between CARLA and BeamNG.tech are twofold. On the one hand, CARLA comes with a rigid-body physics engine that, unlike the soft-body physics engine of BeamNG.tech, does not deform objects; e.g., when a crash happens, the objects remain rigid. On the other hand, CARLA relies on built-in urban-like maps rather than roads built from arbitrary road points, which constrains how test cases are defined (see Section 3.3.2).

2.2 Test generators & Test Runner

We use existing test generators to automatically generate test cases for both simulators. We use test generators from the tool competition of the SBFT workshop [24, 50], where the actual road in the simulation environment is the result of interpolating the road points generated by the test generator. To run test cases in the simulation environments, we need a test runner that executes the test cases and reports the test outcomes. For this, we use the SDC-Scissor [10] tool, since it implements a test runner that monitors the OOB metric, which is suitable for our study.

2.3 Virtual reality

The notion of VR refers to the immersive experience of users being inside a virtual world. In our study, we want to provide the study participants with an immersive experience of the test cases to obtain more accurate feedback on their perception of the safety and realism of the SDC. We leverage VR headsets and tooling for the simulation environments to achieve this goal.

2.3.1 Headset & VR connection with simulation environments. We use the HTC Vive Pro 2 headset to provide the study participants with a 360° VR experience. The headset connects via wire to an external device with a dedicated GPU for high-resolution VR rendering. Most SDC simulators do not support VR out of the box; this is also the case for BeamNG.tech and CARLA. Therefore, for our study, we use third-party tools to enable the missing VR support for both simulators. For BeamNG.tech, we use vorpX, a specialized tool that transforms any visual output intended for the screen into compatible input for VR headsets, providing an immersive feeling for the user. The vorpX software gives a broader viewing angle when wearing a VR headset; the user can move the head and explore the virtual environment according to the head movement. In the case of the CARLA simulator, Silvera et al. [59] implemented an extension of CARLA that makes the simulator compatible with the HTC Vive Pro 2 VR headset. When launching the CARLA application, passing the -VR flag puts the simulator into VR mode so that it can be used with the headset.

3 METHODOLOGY

Using SDC-Alabaster (Section 3.3.2), we conducted an empirical study involving 50 participants (recruiting is explained in Section 3.4), with several steps (summarized in Figure 4) devised to collect different types of evidence and data to answer our main question: When and why do safety metrics of simulation-based test cases of self-driving cars match human perception? SDC-Alabaster immerses the study participants in virtual SDCs (Figure 3) by leveraging VR technologies (Section 3.3).
Fig. 3. One of our participants immersed in virtual SDCs with SDC-Alabaster: (a) participant with a VR headset; (b) outside view; (c) driver view.

3.1 Research questions

We structured our study around three main research questions (RQs), in which the participants were asked to assess the level of safety and realism of multiple simulation-based test cases for SDCs.

3.1.1 RQ1: Human-based assessment of safety. Our first research question focuses on the perception of safety compared to the OOB metric:

RQ1: To what extent does the OOB safety metric for simulation-based test cases of SDCs align with human safety assessment?

RQ1 explores participants' perceptions of safety levels for SDC test failures with/without VR technology. We hypothesize that the OOB safety metric may not align with human safety perception. We evaluate alignment through Likert-scale responses from participants, correlating them with test case outcomes (Section 4.1). Statistical tests on experimental and survey data are used to investigate the impact of simulators (BeamNG.tech vs. CARLA), driving views (outside and driver's view), and test case complexity (with/without obstacles/vehicles) on SDC safety perception.

3.1.2 RQ2: Impact of human interaction on the assessments of SDCs. Once we know how humans perceive the safety of SDC test cases and how this relates to the OOB metric (RQ1), we investigate whether human interactions with the virtual SDC affect the safety perception of the test case. We argue that the safety perception of an SDC can vary when one has the ability to interact, i.e., the possibility to manually accelerate/decelerate the vehicle, and previous VR research has shown that interactions can influence the perception of the environment positively or negatively [32, 33, 45, 52]. This aspect deserves investigation since it can help developers and researchers design better test cases and evaluation metrics, which leads us to our second research question:

RQ2: To what extent does the safety assessment of simulation-based SDC test cases vary when humans can interact with the SDC?

3.1.3 RQ3: Human-based assessment of realism. We argue that the level of realism of SDC simulation-based test cases is another important factor influencing the safety perception of SDCs. The notion of realism relates to the Reality Gap (discussed in Section 7), a critical concern regarding the oracle problem in simulation-based testing: "due to the different properties of simulated and real contexts, the former may not be a faithful mirroring of the latter". While recent studies provide solutions for addressing this problem, e.g., by leveraging domain randomization techniques or using data from real-world observations [16, 36, 39, 76], no prior study has characterized the perception of realism of SDC test cases by human participants when using VR technologies [32, 45, 52]. Hence, we complement RQ1 and RQ2 by addressing a third research question:

RQ3: What are the main reality-gap characteristics perceived by humans in SDC test cases?
Hence, after the experiments for RQ1 and RQ2, we ask the study participants to evaluate the level of realism of BeamNG.tech and CARLA. Then, we develop a taxonomy of aspects influencing these environments' realism to help improve simulation environments for effective testing of SDCs, so that the differences between the properties of simulated and real contexts are minimized.

3.2 Design overview

Fig. 4. Design overview with survey question IDs from Table 1: (1) introduction to the study; (2) BeamNG.tech, no VR (warm-up: no questions; no obstacles: Q1 and Q2; with obstacles: Q1 and Q2); (3) BeamNG.tech with VR, outside view; (4) BeamNG.tech with VR, driver view; (5) BeamNG.tech general feedback: Q3-Q8; (6) CARLA, no VR (warm-up: no questions; no obstacles: Q1 and Q2; with obstacles: Q1 and Q2); (7) CARLA with VR, outside view; (8) CARLA with VR, driver view; (9) interactive CARLA with VR, driver view; (10) final interactive CARLA with VR, driver view; (11) CARLA general feedback: Q3-Q9; (12) general feedback on the experiments: Q10 and Q11.

Figure 4 overviews the design of our study, involving 12 steps. In step 1, we welcome and introduce the study participant by explaining the context and the procedure of the experiments. In step 2, the participant sits in front of a computer screen and experiences three simulation-based test cases with the BeamNG.tech simulator. For the next steps, while sitting in front of the computer, the participant wears a VR headset. In step 3, the participant experiences three test cases with the BeamNG.tech simulator observing the SDC from an outside-view perspective, while in step 4, the participant experiences three test cases with the BeamNG.tech simulator from a driver-view perspective. Step 5 focuses on general feedback on the experiments with the BeamNG.tech simulator. Then, steps 2, 3, and 4 are repeated for the CARLA simulator in steps 6, 7, and 8. In step 9, with CARLA, the participant, wearing a VR headset and observing from the driver's view, experiences three test cases in which they can control the SDC speed with a keyboard. In addition to step 9, one group of participants experiences a crash with the SDC in step 10. Step 11 focuses on general feedback on the experiments involving CARLA, while step 12 focuses on general feedback on the study.

For steps 2-4 and 6-9, the participant experiences three test cases. The first test case is a warm-up so that the participant can familiarize themselves with the simulation environment. The second test case has no obstacles, and the third test case has obstacles (i.e., higher complexity). At step 10, the participant only experiences the complex test case with obstacles.

3.3 Design implementation

We implement our design by conducting experiments with our test runner called SDC-Alabaster. The test runner uses three distinct test cases created by a test generator (see Section 2.2). The participants give their responses to our survey questionnaires using Google Forms.

3.3.1 Test cases. We use three test cases generated by the Frenetic test generator, the top-ranked state-of-the-art tool in the SBST tool competition [15]. The first test case is a warm-up that lets the participant familiarize themselves with the simulation environment and view setting, e.g., the VR headset and the simulator. Hence, no survey question is asked for this first warm-up test case. The second test case does not have obstacles, while the third involves obstacles (higher complexity).
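As noted in Section 2.2, the road driven in each test case is obtained by interpolating the road points produced by the test generator. The following minimal Python sketch illustrates this step on a hypothetical test case; the point values, helper name, and spline choice are illustrative and may differ from the interpolation actually used by the competition pipeline.

import numpy as np
from scipy.interpolate import splprep, splev

def interpolate_road(road_points, samples_per_meter=1.0):
    """Interpolate sparse generated road points into a dense road center line.

    road_points: list of (x, y) tuples produced by a test generator.
    Returns an (N, 2) array approximating the road center line.
    """
    x, y = np.asarray(road_points, dtype=float).T
    # Fit a cubic B-spline through the generated road points.
    tck, _ = splprep([x, y], s=0, k=min(3, len(x) - 1))
    # Sample the spline densely, proportionally to the polyline length.
    length = np.sum(np.hypot(np.diff(x), np.diff(y)))
    u = np.linspace(0.0, 1.0, max(2, int(length * samples_per_meter)))
    xs, ys = splev(u, tck)
    return np.column_stack([xs, ys])

# Hypothetical road points of a generated test case (coordinates in meters).
center_line = interpolate_road([(10, 10), (40, 30), (70, 25), (100, 60)])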
3.3.2 SDC-Alabaster. We extend the existing test runner SDC-Scissor (see Section 2.2) by implementing SDC-Alabaster (SDC humAn-in-the-Loop simulAtion-BASed Testing sElf-driving caRs). SDC-Alabaster implements an interface to run test cases with the CARLA simulator in steps 6-10. As for BeamNG.tech, we add obstacles to the test cases in CARLA to achieve similar complexity levels. Additionally, for steps 9-10, the participants controlled the SDC speed with the keyboard.

Generated test cases are processed differently for BeamNG.tech and CARLA. Automatically generated test cases in BeamNG.tech (Section 2) consist of a sequence of XY-coordinates (i.e., the road points). The CARLA simulator, however, does not have all the road points defined in the test; SDC-Alabaster therefore segments the road definition of a CARLA test case using the start and end points of its segments. It facilitates user immersion and safety evaluation by adjusting the test cases for CARLA and by utilizing VR headsets to provide an immersive experience.
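To make this conversion concrete, the sketch below shows one way the generated XY road points could be turned into start/end segment pairs for a CARLA test case. The function name and data layout are illustrative assumptions and do not reflect the actual SDC-Alabaster API.

def road_points_to_segments(road_points):
    """Convert a generated sequence of (x, y) road points into
    (start, end) segment pairs for defining a road in CARLA.

    road_points: list of (x, y) tuples, e.g., from a BeamNG.tech test case.
    Returns a list of ((x1, y1), (x2, y2)) segments.
    """
    return [(road_points[i], road_points[i + 1])
            for i in range(len(road_points) - 1)]

# Hypothetical road points of a generated test case (coordinates in meters).
segments = road_points_to_segments([(10, 10), (40, 30), (70, 25), (100, 60)])
# -> [((10, 10), (40, 30)), ((40, 30), (70, 25)), ((70, 25), (100, 60))]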
Table 1. Survey questions with Likert-scale (LS), Open answer (OA), and Single-choice (SC) types

ID   Question                                                                                Type
Q1   What is the perceived safety of the Scenario?                                           LS
Q2   Justify the perceived safety of the Scenario.                                           OA
Q3   How would you scale the realism of scenarios generated by test cases in the simulator?  LS
Q4   Justify the level of realism of scenarios generated by test cases.                      OA
Q5   How would you scale the driving of AI of the simulator?                                 LS
Q6   Justify the driving of AI from the simulator.                                           OA
Q7   How would you scale overall experience with the simulator?                              LS
Q8   Justify overall experience with the simulator.                                          OA
Q9   How do you compare safety with and without interaction?                                 OA
Q10  Did this experiment change the way you thought about the safety of self-driving cars?   SC
Q11  Please write in a few words on your experience and suggestions.                         OA

3.3.3 Survey questionnaires. We employ Google Forms as a survey tool for our questionnaires. Table 1 summarizes the participant questions, comprising single-choice (SC), open-answer (OA), and Likert-scale (LS) questions (with values from 1 to 5, where 1 means very unsafe, 5 means very safe, and 3 is neutral) to address our research questions (RQs). Participants answered Q1 and Q2 after the second and third test cases, respectively, with the first test case serving as a warm-up without safety assessment. To limit biases, the participants took breaks between sessions. For Q3-Q8, participants provide responses after all three test executions with a simulator, i.e., at step 5 for BeamNG.tech and step 11 for CARLA. Note that at step 11, we include an additional question, Q9, for the experiments involving CARLA, which include interactive scenarios requiring keyboard inputs to control the SDC's speed.

3.3.4 Experimental setting. We conduct experiments in a dedicated, soundproof room to eliminate external distractions. Participants sit at a table equipped with a desktop computer, a laptop, and a VR headset. They use the laptop, running the Google Forms application, to complete the survey questionnaires and the desktop computer for the non-VR experiments. For the VR experiments, participants use the HTC Vive Pro 2 headset powered by an nVidia GeForce RTX 3080 and the Windows 10 operating system. Additional extensions, such as vorpX for BeamNG.tech's VR support and the DReyeVR extension for CARLA, are employed to give participants a full VR experience. We also integrate SDC-Alabaster to facilitate testing with both the BeamNG.tech and CARLA simulators. Participants can interact with specific SDC test cases, adjusting the speed using the keyboard. The duration of the experiments varies between 70 and 90 minutes.

3.4 Study participants

We recruit participants via email invitations sent to industrial partners, university students, and researchers across departments.
We target various mailing lists, including non-computer-science organizations, and leverage social media platforms (e.g., Twitter and LinkedIn). We use physical and digital flyers to attract diverse participants, ensuring a broad range of backgrounds and education levels.
3.4.1 Pre-survey. When participants sign up for our experiments, we email them a pre-survey created with Google Forms to collect demographic information. This survey includes an introduction to the topic, an overview of the experiment (including the expected time and location), and a recommendation to wear contact lenses; it also provides details about the simulator and VR headset used. Furthermore, the pre-survey includes a disclaimer regarding confidentiality and anonymity and a warning about potential VR-related accidents or fatalities that the participants could experience. Following this section, we gather background information on participants, as detailed in the Appendix (appx.) of our replication package (Section 9). These questions cover testing and driving experience, VR technology usage, age, and gender. This additional information helps us investigate potential confounding factors affecting safety and realism perception.

3.5 Data collection

We gather data from two primary sources: the survey (both pre-experiment and during the experiments) and the simulation logs collected during the participant experiments.

3.5.1 Survey data. For both the BeamNG.tech and CARLA simulators, participants evaluate test cases considering the questions reported in Table 1. Specifically, for steps 2-4 and 6-9, Likert-scale and text data are collected for each test case except the warm-up case. For step 10, Likert-scale and text data are collected only for the test cases with obstacles. Additionally, at steps 5 and 11, general feedback on the simulators is collected after the test executions with all viewpoints. Complementarily, participants rate the perceived safety and realism of each simulator using Likert-scale values based on their own driving experiences. Finally, general feedback on the experiments is collected at step 12. In total, we collected 21 Likert-scale, 23 open, and 1 single-choice responses per participant during the experiments. In addition to the experimental survey, we gather data from the pre-survey (Section 3.4.1) to obtain participant demographics.

3.5.2 Simulation data. For each test case in each participant's experiment, we collect relevant data, saving logs (see Section 9) in JSON files via SDC-Alabaster. These logs include timestamped vehicle position coordinates, sensor data (e.g., fuel, gear, wheel speed), and OOB metric violations (i.e., driving off the lane), categorizing the test as pass or fail based on this metric. Additionally, on CARLA, the log structure also includes weather condition details. It is important to note that, to further strengthen our findings, we also analyze participants' quantitative and qualitative insights both with and without VR headsets, as well as when experiencing different driving views.
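To illustrate how such a log translates into a test verdict, the following sketch parses a hypothetical SDC-Alabaster log and labels the test case as failing as soon as an OOB violation is recorded. The JSON field names are assumptions made for illustration; the actual log schema is part of the replication package (Section 9).

import json

def load_test_verdict(log_path):
    """Label a simulation run as PASS or FAIL based on recorded OOB violations.

    Assumed (illustrative) log structure:
    {
      "test_id": "...",
      "positions":  [{"t": 0.0, "x": 10.0, "y": 10.0}, ...],
      "sensors":    [{"t": 0.0, "wheel_speed": 0.0, "gear": 1, "fuel": 1.0}, ...],
      "oob_events": [{"t": 12.3, "x": 55.1, "y": 28.4}, ...],
      "weather": {...}   # present for CARLA runs only
    }
    """
    with open(log_path) as fp:
        log = json.load(fp)
    # A single out-of-bound event is enough to fail the test case; everything
    # else in the trace (obstacles, signals, speed) is ignored by this oracle,
    # which is exactly the limitation discussed in Section 1.
    verdict = "FAIL" if log.get("oob_events") else "PASS"
    return log.get("test_id"), verdict

# Example usage (hypothetical file path):
# test_id, verdict = load_test_verdict("logs/participant_01_case_02.json")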
3.6 Data analysis

3.6.1 RQ1 & RQ2: Perceived level of safety. We use various visualizations, including stacked bar plots and boxplots, to assess safety and realism perceptions. We apply statistical tests: the Wilcoxon rank-sum test, and Vargha-Delaney to determine the effect size. For RQ1, we mainly analyze responses from the test cases where the participant has no interaction with the SDC; for RQ2, we analyze the data where the participant has some direct interaction with the SDC by using a keyboard to control the vehicle's speed. In RQ2, we explore how SDC interactions affect the safety and realism perceptions of participants. For this, we analyze Likert-scale scores and qualitative feedback. We employ stacked bar plots to examine the data spread across the two categories in steps 8 and 9.
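As a concrete illustration of this analysis step, the sketch below compares two groups of Likert-scale ratings with the Wilcoxon rank-sum test (via SciPy) and computes the Vargha-Delaney A12 effect size. The sample ratings are invented for illustration and are not taken from the study data.

from scipy.stats import ranksums

def vargha_delaney_a12(x, y):
    """Vargha-Delaney A12: probability that a value drawn from x exceeds a
    value drawn from y, counting ties as 0.5 (0.5 means no effect)."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))

# Illustrative Likert-scale ratings (1 = very unsafe, 5 = very safe).
passing_ratings = [4, 5, 4, 3, 5, 4, 4]
failing_ratings = [2, 1, 3, 2, 2, 1, 3]

statistic, p_value = ranksums(passing_ratings, failing_ratings)
a12 = vargha_delaney_a12(passing_ratings, failing_ratings)
print(f"Wilcoxon rank-sum p = {p_value:.4f}, A12 = {a12:.2f}")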
3.6.2 RQ3: Taxonomy on realism. With RQ3, we examine the realism of SDC test cases and its correlation with human safety assessments. We identify and categorize factors affecting test case realism in a taxonomy based on the participant responses to question Q4 at steps 5 and 11.

We adopt a two-step approach for the initial taxonomy creation. Initially, two authors analyze responses grouped by the simulators; one author focuses on Q4 from step 5 with the BeamNG.tech simulator, and the other on Q4 from step 11 with the CARLA simulator. Each author proposes categories via an open-card sorting method [61]. In the second step, both authors collaboratively define a meta-taxonomy by discussing their proposed categories. Subsequently, this meta-taxonomy is employed to label all Q4 responses for BeamNG.tech and CARLA (steps 5 and 11). To do this, the two authors designing the meta-taxonomy and a third author conduct a hybrid card-sorting labeling process using online spreadsheets. They individually assign each response to the meta-taxonomy categories or create new categories when necessary. A collaborative approach is employed for validation, where each of the three co-authors reviews and addresses any disagreements in assignments during an online meeting.

Fig. 5. Perceived safety of failing and passing tests grouped by the scenario's complexity: Likert-scale distributions for (left) passing vs. failing test outcomes (OOB), (middle) passing test cases with and without obstacles, and (right) failing test cases with and without obstacles.

4 RESULTS

This section presents the results for RQ1, focusing on participants' safety perception of test cases, and RQ2, examining how this perception changes when participants can interact with the SDC. For RQ3, we discuss the taxonomy obtained by classifying participants' comments on test case realism.

4.1 RQ1: Human-based assessment of safety metrics

To address RQ1, we analyzed Likert-scale values across various data subgroups. These subgroups included comparisons between test outcomes (failures/successes based on OOB) and different test case complexities (with/without obstacles). This allowed us to identify factors influencing perceived safety among participants. We present boxplots and statistical tests (appx. B.1) for each subgroup.

4.1.1 Safety perception of failing vs. passing test cases. Figure 5 illustrates the perceived safety distributions for test cases grouped by test outcome (OOB metric). We found a significant difference (appx. B.1) in how participants rate safety for failing and passing test cases on a Likert scale.

Finding 1: The passing test cases (i.e., the cases where the OOB metric is not violated) have a higher perception of safety from the participants than the failing ones (OOB metric is violated).

The aforementioned Finding 1 is somewhat expected and is aligned with comments from study participants (appx. C.1). These comments pertain to the BeamNG.tech simulator, excluding VR and obstacles. We selected these comments for their exclusive focus on SDC lane-keeping, providing qualitative insights into the OOB metric without obstacle influence. Notably, among comments where the SDC violates the OOB metric (test case failure), safety concerns are recurrent: "As the car did not drive all the time on the street, I felt unsafe. [...]" - (P3/B1/S1); "When the car starts to go off the road when driving in a curve, it feels pretty unsafe." - (P31/B1/S1); "Not Very Safe since the car sometimes drove a bit from the road." - (P45/B1/S1).

On passing test cases where the OOB metric is not violated, we find that the participants gave consistent comments in terms of safety: "The car was driving in lane and at a safe speed considering
the road is empty." - (P16/B1/S1); "The car was following the path in a safe way and was not speeding up too much." - (P25/B1/S1). All comments that support Finding 1 are listed in appx. C.1.

4.1.2 Safety perception with and without obstacles. Participants assessed test cases with varying complexity, including additional obstacles. Figure 5 displays differences in perceived safety, with statistical significance reported in appx. B.1. Concretely, failing test cases are generally seen as less safe, but those with added obstacles are perceived as even less safe. For passing test cases, in contrast, perceived safety remains largely unaffected by an increasing complexity of the scenario (e.g., additional obstacles). As shown in appx. B.1, no significant statistical differences were observed in these samples, leading us to conclude:

Finding 2: There is no statistical difference in safety perception between scenarios with and without obstacles when the OOB metric is not violated. However, when the car goes out of bounds, the scenario is perceived as significantly less safe with obstacles (p = 3.52 × 10⁻¹⁶).

From participants, we received qualitative support for Finding 2. For those feeling unsafe with scene obstacles, here are representative answers: "The car crashed toward an obstacle, and even running over bumps was not so smooth as humans would do. Definitively more unsafe than the previous scenario." - (P1/B1/S2); "Ran off the road in a curve and hit obstacles without slowing down, which resulted in flat tires." - (P24/B1/S2).

Among participants who felt safe or neutral when obstacles were present, consistent comments were reported: "It car was running smooth with obstacles, there was a moment when it was too close to one of the obstacle" - (P16/B1/S2); "The vehicle does well to avoid obstacles while maintaining the safe speed" - (P18/B1/S2); "The driver accelerated over all the obstacles and did not have a perfect finish." - (P40/B1/S2); "Car was driving well. Only at the end, it went off the road, but there was no object it bumped into." - (P45/B1/S2). All comments that support Finding 2 are reported in appx. C.1.

Fig. 6. VR vs. no VR: distribution of the perceived safety with and without VR.

4.1.3 Safety perception with and without VR. To assess the impact of VR on safety perception, we categorized the data into with VR and without VR groups. Appx. B.1 shows no statistically significant difference. However, Figure 6 reveals that without VR there were more very safe and very unsafe responses. This is also evident from the smaller interquartile range with VR (compared to without VR).

Finding 3: The utilization of VR had a minor impact on safety perception. However, participants using VR tended to perceive scenarios as somewhat less safe, though this difference was not statistically significant (Wilcoxon rank-sum test, p = 0.16).

Certain participant comments support Finding 3. For instance, a neutral participant stated: "The prespective doesnt change much with the VR" - (P22/B2/S1). Another example is a comment from a participant who felt very unsafe: "The same as without the VR glasses. The car was not able to keep the middle of the lane and was driving badly compared to a human." - (P28/B2/S1).

In Figure 7, we note a decrease in test case safety perception across various viewpoints. Statistical differences are evident in appx. B.1, supporting the following general finding:
Fig. 7. Perceived safety for different VR-related views (no VR outside view, VR outside view, VR driver view), grouped by the scenario's complexity (with and without obstacles).

Finding 4: Overall, participants found the test cases less safe with obstacles.

Participants' general comments during the experiment for each simulator qualitatively support Finding 4. Representative comments on the BeamNG.tech driving behavior include: "It did not look at safety lines, which is very dangerous if other traffic is involved. It also ran off the road multiple times, which can easily lead to a loss of control. Also, the car crashed into easily avoidable obstacles." - (P24/B); "At least the AI seems to have an understanding of the general elements of the simulation, like the road. However, it seems to struggle with bumps in the middle of the road and also seems to drive too fast in curvy situations." - (P31/B).

4.1.4 Different views with different complexity. In the case of CARLA, we got the following representative comments on the driving behavior with regard to the different complexities of the scenario: "Except at the roundabouts, the car followed traffic rules, signals, and speed limits. However, it kept crashing and losing control in the roundabouts." - (P27/C); "In most scenarios, the AI did well. From what I have seen during the simulations, it is not able to drive around roundabouts and does not stop at stop signs." - (P31/C); "Very slow driving, unsmooth behavior, always too close to roundabout and abrupt stopping in front of obstacles." - (P41/C).

We observed that the perception of safety dropped when the complexity increased (i.e., when adding obstacles to the scenario). This observation is coherent for both simulators, BeamNG.tech and CARLA, as reported by the participants during the experiment.

4.2 RQ2: Impact of human interaction on the assessments of SDCs

To assess the safety perception of test cases with human interaction with the SDC, participants controlled the SDC's speed during the test execution. Figure 8 shows the Likert-scale distribution of the responses. We compared the responses where participants could and could not control the car when obstacles were present.

Fig. 8. Safety perception with and without interaction with the SDC (grouped by complexity): distributions for interaction vs. no interaction, and for interactive and non-interactive scenarios with and without obstacles.
4.2.1 Safety perception with and without interaction with the SDC. In general, interacting with the SDC enhances participants' perception of safety. From appx. B.2, we observe a statistically significant difference, leading to the following finding:

Finding 5: Safety perception of test cases is not static: when users can interact with the SDC, participants feel significantly safer (p = 0.013) compared to when they cannot.

The participants' justifications support Finding 5, e.g., controlling the SDC speed enhances safety perception, as P1 reported: "The fact I could control the car when needed gave me a safer perception of the driving experience. Moreover, I could speed up the car when I wanted to." - (P1). However, not all participants perceived interaction-based test cases as inherently safe. For instance, participant P4 commented: "With a bit of control, it feels safer, especially being able to adjust the speed in dangerous situations. However, it is still not safe since the car ends up going off-road at the end of the scenario." - (P4). While the SDC remains self-steering, it may still crash despite the participant having speed control capability.

4.2.2 Safety perception with and without obstacles. When interactive test cases involved obstacles, participants perceived them as less safe than obstacle-free scenarios. A statistically significant difference leads to the following finding:

Finding 6: Incorporating obstacles into the simulation, where participants interact with the SDC, leads to significantly lower perceived safety in test cases (p = 0.026) compared to obstacle-free interactive scenarios.

This finding is also coherent with the answers of the study participants, e.g., by P4: "It felt safer, especially since it was stopping the speed when it had another car in front. However, it still went to the footpath, making it not safe" - (P4). From the comment, we observe a safer perception through speed control. P20 also states: "it could have stopped before hitting the camion" - (P20).

However, as the study participants could not control the SDC's steering, some accidents remained unavoidable, as reported by P19: "Hit the bike driver" (P19). P40 gave a clearer comment: "Two matters: 1) driver keeps its distance to the can in the front, but with sharp breaks instead of slowing down the car. 2) unable to avoid strange behaviors and drove next to a car with unstable drive and had an accident" (P40). The participant could maintain the distance by adjusting the speed, but accidents could occur during lane changes.

In non-interactive test cases, obstacles induced insecurity among participants. However, the level of safety they felt when obstacles were included was higher in the case where the participants could interact with the SDC. This leads to the following finding:

Finding 7: In the simulation, obstacles in non-interactive SDC test cases reduce the safety perception (p = 0.013). Yet, the ability to interact with the car raises more discomfort (making participants feel less safe) when obstacles are present.

Besides the statistical tests, we also noted participant comments supporting Finding 7. Some expressed discomfort in obstacle scenarios without the ability to control the car, as evident in the following example: "The car was breaking and accelerating a lot while being behind the other car, and also the other car was not behaving safely on the road, ending the simulation with an accident between the two, so it felt quite unsafe overall." (P25).
Some participants also experienced the worst-case scenario without control, as reported by P28: "It drove extremely close up to the ambulance car and finally crashed into it. therefore, the worst case happens." (P28).
Fig. 9. Taxonomy of positive and negative factors impacting the perceived test cases' realism (number of positive and negative occurrences per category).

Table 2. Taxonomy description including the number of positive and negative comments on the perception of realism

• World Objects (32 positive, 14 negative, 46 total): comments of participants on the accuracy of the visual looks and design of all elements in the virtual environment, such as the weather, landscape, car design, traffic objects, etc., and how the graphical resolution is perceived.
• Dynamics (16 positive, 11 negative, 27 total): comments on the physical dynamics of the elements in the virtual environment, for example, whether the movement of the cars is physically realistic and reasonable or whether crashes are realistically simulated from a physical perspective.
• Road (9 positive, 5 negative, 14 total): comments on the road itself; to what extent the shape, surface, and structure are reasonably expected in the real world.
• Traffic Elements (11 positive, 14 negative, 25 total): comments on the placement of the elements in the virtual environment; this category also considers comments on the location and scale of the placed elements as well as their quantity.
• Rule System (4 positive, 6 negative, 10 total): comments on traffic laws and the common sense of humans for resolving certain issues in specific traffic situations; a car should, for example, stop at a red signal and at stop signs, and it should not drive recklessly and should avoid dangerous situations (e.g., driving too close to other vehicles).
• Immersion (16 positive, 2 negative, 18 total): comments on the immersive experience; it applies to comments where participants express their feelings on how they experience the virtual environment and how they acoustically, visually, physically, and haptically sense it.
• Others (0 positive, 1 negative, 1 total): comments that do not fit into the above categories.

4.3 RQ3: Taxonomy on realism

Realism is a crucial aspect to consider when evaluating test case safety. We created a taxonomy to gauge the realism perceived by the study participants. Two coders used open card sorting on 50 comments each to establish categories, which were later reviewed by a third coder. Table 2 presents the seven resulting categories with their descriptions.

Next, two coders independently classified 100 comments using the designed taxonomy. Disagreements were resolved by a third coder. Table 2 and Figure 9 show the classification of comments related to question Q4 in steps 5 and 11. We categorized comments as positives (increasing realism)
and negatives (decreasing realism) in the taxonomy. We observe that most classifications fall under World Objects, totaling 46, with 32 positives and 14 negatives.

Finding 8: Several factors (e.g., the surroundings, car design, and object scale) impact the participants' perceived realism. The World Objects category dominates, with 32 positive (e.g., car design) and 14 negative (e.g., traffic objects) aspects affecting realism perception.

Examples of positive comments with the BeamNG.tech simulator: "The realism is quite good, especially in the car design. The car structure was damaged after crashing; the wheels were getting broken, and there was smoke coming out. The inside view of the car was also pretty real, with the driver's hand moving the steering wheel and all the car panel commands. [...]" - (B/P4); "They respect the scale from the objects." - (B/P22). Examples of positive comments for the CARLA simulator: "The surroundings have more detail, which made it feel more realistic." - (C/P31); "The environment (lighting, obstacles) feels quite real." - (C/P17). An example of a negative comment: "The grass, the horizon as well, and the red vertical lines do not look very realistic." - (B/P3). Besides Finding 8, we noted that the Immersion category generally received positive comments about perceived realism.

Finding 9: The Immersion category primarily comprises comments on factors that affect realism (e.g., view, perspective). It includes 16 positive (e.g., the realism of the driver's seat) and 2 negative (e.g., low realism outside the vehicle) comments influencing participants' perceived realism.

This finding is reasonable since a driver sits in the driver's seat, unlike the perspective in a video game. The following quotes support this: "The driver seat simulator felt very realistic." - (B/P14); "It was different when I sat in the car than from outside, so it felt more real. But still looked like a game, so not that realistic." - (B/P21). In summary, comments on Immersion were positive, indicating that the driver-seat viewpoint and the use of VR enhanced perceived realism.

5 DISCUSSION

We first discuss safety considerations for simulation-based tests, including RQ1 and the interactive test cases of RQ2. Then, we delve into realism by discussing the taxonomy of influencing factors.

5.1 RQ1 & RQ2: Human-based safety assessment of simulation-based test cases

The study participants perceived passing test cases (OOB metric not violated) as safer than failing ones (Finding 1), aligning with the OOB metric-based test oracle. This observation is supported by [35], where participants' assessment of driving quality correlates with metrics related to the SDC's lateral position. The OOB metric generally reflects test case safety. However, the extent to which the safety perception varies depending on certain simulation factors (e.g., obstacle inclusion) remains unclear. Hence, we conducted experiments with test cases featuring additional obstacles. As reported in Finding 2, we found that adding obstacles to a passing test case does not significantly affect safety perception. However, participants perceive failing test cases as less safe with additional obstacles.
Therefore, human safety perception does not proportionally align with the OOB metric: the OOB metric can be violated, but it does not distinguish whether there are additional obstacles in the test case, whereas humans do and then perceive the test case as less safe.

We experimented with different immersion levels (i.e., various viewpoints), and as reported in Finding 3, participants using VR headsets perceived test cases as slightly less safe. This perception change is minimal when evaluating VR. Consequently, when using humans as oracles, outcomes vary based on immersion levels in virtual environments. Hence, similar human-based studies on simulation-based test cases for SDCs [35] may exhibit a slight bias if immersion is not considered. When grouping safety perceptions of test cases by their assessed viewpoints, cases with obstacles
were generally perceived as less safe than those without obstacles (Finding 4). Thus, using the OOB metric as an oracle may not always accurately represent safety perceptions from a human perspective. This observation aligns with the examples illustrated in Figure 2a and Figure 2b.

As shown in Finding 5, participants perceived test cases as safer when they could control the vehicle's speed (i.e., they express a higher trust level in the SDC behavior), which means that the safety perception of simulation-based test cases depends on the level of user interaction. Having control over the vehicle impacts safety perception, which may not align with the OOB metric. In test cases involving participant interaction, safety perception generally decreases when obstacles are present, as indicated by Finding 6. This aligns with the findings for non-interactive test cases, as highlighted in Finding 7.

5.2 RQ3: Taxonomy on test cases' realism

As shown in Finding 8, most participants' comments on question Q4 fall under the World Objects category. As discussed in Section 1, we conjecture that assessing test case safety should also consider realism. The importance of World Objects with respect to realism confirms that pure lane-keeping (the focus of OOB) is not enough for a realistic safety assessment. Given that most comments related to test case realism are categorized as World Objects, this category becomes essential to prioritize when evaluating test case safety. The Immersion category predominantly features comments expressing a positive or heightened sense of realism, as revealed in Finding 9. Participants' immersion, particularly their viewpoint, influences perceived realism. Notably, the driver-seat perspective yields a higher realism perception, as evident in the comments on Finding 9, consequently impacting safety perception. The importance of immersion with respect to realism confirms that a static 2D assessment (again, the focus of OOB) is not enough for a realistic safety assessment.

When we take a closer look at the participants' demographics and how they assess the level of realism, we observe that participants in the age range between 18 and 30 years tend to assess the test cases as 17% more realistic (on the Likert scale) than the older participants. Another insight is that we do not observe a different assessment of realism among the genders. Hence, there are confounding factors that influence the perception of realism, such as the age of the participant. This aspect suggests that the reality-gap characteristics are not deterministic measures, as they depend on human perception, which might vary, as in the case of the participants' age.

5.3 Implications & Lessons learned

The oracle definition for SDCs is many-fold, as safety has different aspects characterizing it. The OOB metric may not always reflect human safety perception in test cases due to various unaccounted factors. To enhance simulation-based testing, SDC testers and practitioners should consider devising alternative metrics that better align with human safety perception. Interacting with the car boosts perceived safety, potentially due to distrust in the AI driving the SDC. Future research should explore this further, ruling out other influencing factors.
If low trust in AI is the main issue, this suggests shaping the direction of autonomous driving research toward increasing the level of trustworthiness of SDCs, which represents an important limiting factor for the real-world adoption of SDCs.

As motivated in Section 1, realism significantly influences the safety perception of SDCs, as reflected in participants' comments on Q4. For this reason, we have created a taxonomy of factors that affect realism in simulation-based SDC testing to guide future research in the field. The taxonomy provides an overview of factors impacting the realism of SDC simulation-based testing. We argue that our taxonomy is instrumental in supporting future research on the perceived reality gap, which is critical for bridging the gap between the simulation-based outcome of a test case and
what eventually happens in the real world. Furthermore, we think the taxonomy provides a basis for investigating similar limitations in other CPS application domains that leverage simulation environments and aim to improve the human perception of the realism and safety of CPSs.

6 THREATS TO VALIDITY

6.1 Threats to internal validity

The study participants rated safety and realism based on their immersion into the scenario. To limit the risk of biased assessments, we employed modern VR technology (HTC Vive Pro 2) to enhance immersion. The simulators, BeamNG.tech and CARLA, utilize distinct predefined maps. BeamNG.tech employs a flat map from the SBST tool competition [50], while CARLA uses built-in urban-like maps, which impose some constraints on road definition. These differing maps may lead to varying perceptions of test case safety and realism due to their distinct natures. This is something we plan to investigate in future work.

The different personal interactions with the study participants might influence the participants' focus during the experiments. To limit this risk, we used a protocol sheet during the experiments to ensure that all steps of the experiments were performed equally, minimizing this threat.

6.2 Threats to external validity

We recruited study participants primarily from an academic computer science background, which may not represent the general population. To address this potential bias, we ensured diversity in terms of age, gender, and driving experience, reducing the influence of factors beyond professional background. Another concern is the focus on the OOB metric, which may introduce bias as there are various metrics for evaluating SDCs in simulation environments. We chose OOB due to its widespread use among researchers and practitioners, as documented in recent studies [11, 24, 27, 35, 50]. Our study's use of only two simulators, BeamNG.tech and CARLA, restricts the generalizability of our findings to these specific platforms. However, we selected them because they are widely adopted in academia and industry, ensuring the reproducibility of our results compared to less-maintained options such as Udacity (https://github.com/udacity/self-driving-car) and SVL [55].

7 RELATED WORK

In this section, we elaborate on related work on testing in virtual environments and on assessing the quality of oracles in the context of CPS. We group the recent and ongoing research concerning topics that are relevant to our investigation, such as (i) simulation-based testing, (ii) the adopted testing metrics and the oracle problem, and (iii) VR in software engineering.

7.1 Simulation-based testing

The automated testing of cyber-physical systems (CPSs) remains an ongoing research challenge [60, 74]. In this context, simulation-based testing emerges as a promising approach to enhance testing practices for safety-critical systems [10, 11, 13, 48, 53] and to support test automation [4, 5, 69, 70, 72]. Past research on testing CPS in simulation environments focused on monitoring CPS and predicting unsafe states [60, 64] of the systems using simulation environments [64, 73], as well as on generating scenarios programmatically [51] or based on real-world observations [23, 63]. Recent research also proposed cost-effective regression testing techniques, including test selection [10], prioritization [7, 11], and minimization techniques to expose CPS faults or bugs earlier in the development and testing process. This research effort fundamentally contributed toward more robust and reliable simulation-based testing practices. However, it remains challenging to replicate
However, it remains challenging to replicate1https://github.com/udacity/self-driving-car, Vol. 1, No. 1, Article . Publication date: February 2024.
the same bugs observed in physical tests within simulations [3, 70] and to generate representative simulated test cases that uncover realistic bugs [4]. Hence, previous research in the field was conducted on the premise that simulation environments represent, with sufficiently high fidelity, the safety-critical aspects of the real world according to human judgment. In our paper, we hypothesize that the current simulation-based testing of SDCs (and of CPSs in general) does not always align with the human perception of safety and realism, which heavily impacts the effectiveness of simulation-based testing in general. To that end, in our research, we investigated when and why the safety metrics of simulation-based test cases of SDCs match human perception.

7.2 Testing metrics & Oracle Problem

Automatically inferring the expected test outcome from a given input remains an unsolved challenge, known as the oracle problem. Many research papers propose techniques to address this problem in the context of traditional software systems, such as generating oracles [6] or improving already existing test oracles [34, 66-68]. In either case, previous research does not provide an approach that produces fully optimal and effective oracles. However, while the oracle problem still remains an open challenge that requires humans to define the oracle, several code coverage and mutation score metrics have been proposed, for the sake of test automation, to quantitatively assess the quality of traditional software systems.

Software engineering for CPS is increasingly explored, with recent efforts mainly focused on bug characterization [25], testing [1, 20, 78], and verification [17] of self-adaptive CPSs. Another emerging area of research is the automated generation of oracles for testing and localizing faults in CPSs based on simulation technologies. For instance, Menghi et al. [43] proposed SOCRaTes, an approach to automatically generate online test oracles in Simulink, able to handle CPS Simulink models featuring continuous behaviors and involving uncertainties. The oracles are generated from requirements specified in a signal logic-based language. In this context, for the sake of test automation, just like traditional software testing, simulation-based testing of SDCs relies on an oracle that determines whether the observed behavior of a system under test is safe or unsafe. To that end, current research on automated safety assessment focuses primarily on a limited set of temporal and non-temporal safety metrics for SDCs [10, 24, 50, 65]. In particular, the out-of-bound (OOB) non-temporal metric is largely adopted for assessing SDCs in simulation-based testing [24, 48, 50] to determine whether a test case fails or passes. However, it is yet unclear whether this metric serves as a meaningful oracle for assessing the safety behavior of SDCs in simulation-based testing in general.

This study is built on our hypothesis that the current simulation-based testing of SDCs does not always align with the human perception of safety and realism, and for this reason, we focus on understanding and characterizing this mismatch in our research. Close to our work, a recent study [35] conducted a human-based study and observed that correlations between the computed quality metrics and the quality perceived by humans are meaningful for assessing the test quality for SDCs.
However, such previous work did not investigate the factors that define the test quality and realism of the simulation environments from a human point of view with the use of virtual reality [59], as done in our work.

A critical concern regarding the oracle problem in simulation-based testing is represented by the Reality Gap [4, 36, 47, 54, 70]. Due to the different properties of simulated and real contexts, the former may not be a faithful mirroring of the latter. Simulations are necessarily simplified for computational feasibility, yet they reflect real-world phenomena at a given level of veracity, the extent of which is the result of a trade-off between accuracy and computational time [18]. Robotics simulations rely on the replication of phenomena that are difficult to replicate accurately, e.g., simulating actuators (i.e., torque characteristics, gear backlash), sensors (i.e., noise, latency), and rendered images (i.e., reflections, refraction, textures). This gap between reality and simulation
7.3 Immersion Technology in Software Engineering
The use of VR for software engineering was also considered in [31,42], although with a different focus: VR was used to regain design knowledge from legacy systems through different visualization approaches and immersion technologies. Moreover, most papers [41,57,58] refer to the potential use of VR and AR in the workspace of software development teams. In general, the use of VR and AR in software engineering is not yet well studied, and the papers available are mainly vision papers for future research [44]. In our work, in contrast, we present a practical application of VR for assessing test oracles with a human-in-the-loop approach.

8 CONCLUSION
In this study, we explored when and why safety metrics align with human perception in SDC testing. We conducted an empirical study with 50 participants from diverse backgrounds, evaluating their perception of test case safety and realism. We observed that the perceived safety of the SDC significantly decreases as test case complexity rises. Interestingly, safety perception improves when participants can control the SDC's speed, indicating that the OOB metric is not sufficient to match or model such (more subjective) human factors. Additionally, realism perception varies with the complexity of scenarios (i.e., object additions) and with the participants' viewpoints. These findings emphasize the need for more meaningful safety metrics that align with the human perception of safety and realism in order to bridge the reality gap in simulation-based testing. In future work, we plan to extend our study by varying weather and light conditions, adding more objects, and incorporating alternative safety metrics beyond the conventional single-objective OOB metric used in BeamNG.tech and CARLA [65].

9 DATA AVAILABILITY
A replication package with data, code, and appendices is publicly available on Zenodo [12].

ACKNOWLEDGMENTS
We thank the Horizon 2020 (EU Commission) support for the project COSMOS, Project No. 957254.

10 CREDIT AUTHORSHIP CONTRIBUTION STATEMENT
Christian Birchler: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Validation, Visualization, Writing – Original Draft Preparation. Tanzil Kombarabettu Mohammed: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Visualization. Pooja Rani: Methodology, Supervision, Writing – Review & Editing. Teodora Nechita: Data Curation, Methodology. Timo Kehrer: Methodology, Resources, Supervision, Writing – Review & Editing. Sebastiano Panichella: Conceptualization, Funding Acquisition, Methodology, Project Administration, Resources, Supervision, Writing – Review & Editing.
REFERENCES
[1] Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing vision-based control systems using learnable evolutionary algorithms. In International Conference on Software Engineering. 1016–1026. https://doi.org/10.1145/3180155.3180160
[2] Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2020. Automated repair of feature interaction failures in automated driving systems. In International Symposium on Software Testing and Analysis. ACM, 88–100. https://doi.org/10.1145/3395363.3397386
[3] Afsoon Afzal, Deborah S. Katz, Claire Le Goues, and Christopher Steven Timperley. 2020. A Study on the Challenges of Using Robotics Simulators for Testing. arXiv:2004.07368 https://arxiv.org/abs/2004.07368
[4] Afsoon Afzal, Deborah S. Katz, Claire Le Goues, and Christopher Steven Timperley. 2021. Simulation for Robotics Test Automation: Developer Perspectives. In Conference on Software Testing, Verification and Validation. IEEE, 263–274. https://doi.org/10.1109/ICST49551.2021.00036
[5] Miguel Alcon, Hamid Tabani, Jaume Abella, and Francisco J. Cazorla. 2021. Enabling Unit Testing of Already-Integrated AI Software Systems: The Case of Apollo for Autonomous Driving. In Conference on Digital System Design. IEEE, 426–433. https://doi.org/10.1109/DSD53832.2021.00071
[6] Aitor Arrieta, Maialen Otaegi, Liping Han, Goiuria Sagardui, Shaukat Ali, and Maite Arratibel. 2022. Automating Test Oracle Generation in DevOps for Industrial Elevators. In International Conference on Software Analysis, Evolution and Reengineering. IEEE, 284–288. https://doi.org/10.1109/SANER53432.2022.00044
[7] Aitor Arrieta, Shuai Wang, Goiuria Sagardui, and Leire Etxeberria. 2019. Search-based test case prioritization for simulation-based testing of cyber-physical system product lines. J. Syst. Softw. 149 (2019), 1–34. https://doi.org/10.1016/j.jss.2018.09.055
[8] BBC. 2023. Robots to do 39% of domestic chores by 2033, say experts. https://www.bbc.com/news/technology-64718842. Accessed: 2023-01-04.
[9] BeamNG.tech. [n. d.]. BeamNG.research. https://documentation.beamng.com/beamng_tech/. Accessed: 2022-07-31.
[10] Christian Birchler, Sajad Khatiri, Bill Bosshard, Alessio Gambi, and Sebastiano Panichella. 2023. Machine learning-based test selection for simulation-based testing of self-driving cars software. Empir. Softw. Eng. 28, 3 (2023), 71. https://doi.org/10.1007/s10664-023-10286-y
[11] Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella. 2023. Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments. ACM Trans. Softw. Eng. Methodol. 32, 2 (2023), 28:1–28:30. https://doi.org/10.1145/3533818
[12] Christian Birchler, Tanzil Kombarabettu Mohammed, Pooja Rani, Teodora Nechita, Timo Kehrer, and Sebastiano Panichella. 2024. Replication Package - "How does Simulation-based Testing for Self-driving Cars match Human Perception?". https://doi.org/10.5281/zenodo.10570960
[13] Christian Birchler, Cyrill Rohrbach, Hyeongkyun Kim, Alessio Gambi, Tianhai Liu, Jens Horneber, Timo Kehrer, and Sebastiano Panichella. 2023. TEASER: Simulation-Based CAN Bus Regression Testing for Self-Driving Cars Software. In International Conference on Automated Software Engineering. 2058–2061. https://doi.org/10.1109/ASE56229.2023.00154
[14] Tim Bohne, Gurunatraj Parthasarathy, and Benjamin Kisliuk. 2023. A systematic approach to the development of long-term autonomous robotic systems for agriculture. In 43. GIL-Jahrestagung, Resiliente Agri-Food-Systeme (LNI, Vol. P-330). Gesellschaft für Informatik e.V., 285–290. https://dl.gi.de/20.500.12116/40260
[15] Ezequiel Castellano, Ahmet Cetinkaya, Cédric Ho Thanh, Stefan Klikovits, Xiaoyi Zhang, and Paolo Arcaini. 2021. Frenetic at the SBST 2021 Tool Competition. In International Workshop on Search-Based Software Testing. IEEE, 36–37. https://doi.org/10.1109/SBST52555.2021.00016
[16] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan D. Ratliff, and Dieter Fox. 2019. Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience. In International Conference on Robotics and Automation. IEEE, 8973–8979. https://doi.org/10.1109/ICRA.2019.8793789
[17] Shafiul Azam Chowdhury, Sohil Lal Shrestha, Taylor T. Johnson, and Christoph Csallner. 2020. SLEMI: equivalence modulo input (EMI) based mutation of CPS models for finding compiler bugs in Simulink. In International Conference on Software Engineering. 335–346. https://doi.org/10.1145/3377811.3380381
[18] Jack Collins, Ross Brown, Jurgen Leitner, and David Howard. 2020. Traversing the reality gap via simulator tuning. arXiv preprint arXiv:2003.01369 (2020).
[19] Hugo Leonardo da Silva Araujo, Mohammad Reza Mousavi, and Mahsa Varshosaz. 2023. Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review. ACM Trans. Softw. Eng. Methodol. 32, 2 (2023), 51:1–51:61. https://doi.org/10.1145/3542945
[20] Jyotirmoy V. Deshmukh, Marko Horvat, Xiaoqing Jin, Rupak Majumdar, and Vinayak S. Prabhu. 2017. Testing Cyber-Physical Systems through Bayesian Optimization. ACM Trans. Embed. Comput. Syst. 16, 5s (2017), 170:1–170:18. https://doi.org/10.1145/3126521
[21] Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. 2017. CARLA: An Open Urban Driving Simulator. In Annual Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 78). PMLR, 1–16. http://proceedings.mlr.press/v78/dosovitskiy17a.html
[22] Alessio Gambi, Tri Huynh, and Gordon Fraser. 2019. Automatically reconstructing car crashes from police reports for testing self-driving cars. In International Conference on Software Engineering: Companion Proceedings. IEEE / ACM, 290–291. https://doi.org/10.1109/ICSE-Companion.2019.00119
[23] Alessio Gambi, Tri Huynh, and Gordon Fraser. 2019. Generating effective test cases for self-driving cars from police reports. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 257–267. https://doi.org/10.1145/3338906.3338942
[24] Alessio Gambi, Gunel Jahangirova, Vincenzo Riccio, and Fiorella Zampetti. 2022. SBST Tool Competition 2022. In International Workshop on Search-Based Software Testing. IEEE, 25–32. https://doi.org/10.1145/3526072.3527538
[25] Joshua Garcia, Yang Feng, Junjie Shen, Sumaya Almanee, Yuan Xia, and Qi Alfred Chen. 2020. A comprehensive study of autonomous vehicle bugs. In International Conference on Software Engineering. ACM, 385–396. https://doi.org/10.1145/3377811.3380397
[26] BeamNG GmbH. 2023. BeamNG.tech. https://beamng.tech/
[27] BeamNG GmbH. 2023. Publications based on BeamNG.tech. https://beamng.tech/research/
[28] The Guardian. 2018. Self-driving Uber kills Arizona woman in first fatal crash involving pedestrian. https://www.theguardian.com/technology/2018/mar/19/uber-self-driving-car-kills-woman-arizona-tempe
[29] Rodrigo Gutiérrez-Moreno, Rafael Barea, Elena López Guillén, Javier Araluce, and Luis Miguel Bergasa. 2022. Reinforcement Learning-Based Autonomous Driving at Intersections in CARLA Simulator. Sensors 22, 21 (2022), 8373. https://doi.org/10.3390/s22218373
[30] Carl Hildebrandt and Sebastian G. Elbaum. 2021. World-in-the-Loop Simulation for Autonomous Systems Validation. In International Conference on Robotics and Automation. IEEE, 10912–10919. https://doi.org/10.1109/ICRA48506.2021.9561240
[31] Adrian Hoff, Michael Nieke, and Christoph Seidl. 2021. Towards immersive software archaeology: regaining legacy systems' design knowledge via interactive exploration in virtual reality. In Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 1455–1458. https://doi.org/10.1145/3468264.3473128
[32] Jiawei Huang and Alexander Klippel. 2020. The Effects of Visual Realism on Spatial Memory and Exploration Patterns in Virtual Reality. In Symposium on Virtual Reality Software and Technology. ACM, 18:1–18:11. https://doi.org/10.1145/3385956.3418945
[33] Jiawei Huang, Melissa S. Lucash, Mark B. Simpson, Casey Helgeson, and Alexander Klippel. 2019. Visualizing Natural Environments from Data in Virtual Reality: Combining Realism and Uncertainty. In Conference on Virtual Reality and 3D User Interfaces. IEEE, 1485–1488. https://doi.org/10.1109/VR.2019.8797996
[34] Gunel Jahangirova, David Clark, Mark Harman, and Paolo Tonella. 2016. Test oracle assessment and improvement. In International Symposium on Software Testing and Analysis. ACM, 247–258. https://doi.org/10.1145/2931037.2931062
[35] Gunel Jahangirova, Andrea Stocco, and Paolo Tonella. 2021. Quality Metrics and Oracles for Autonomous Vehicles Testing. In Conference on Software Testing, Verification and Validation. IEEE, 194–204. https://doi.org/10.1109/ICST49551.2021.00030
[36] Sajad Khatiri, Sebastiano Panichella, and Paolo Tonella. 2023. Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Neighborhood of Real Flights. In International Conference on Software Testing, Verification and Validation. IEEE, 281–292. https://doi.org/10.1109/ICST57152.2023.00034
[37] Sajad Khatiri, Sebastiano Panichella, and Paolo Tonella. 2024. Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist. In International Conference on Software Engineering (ICSE).
[38] Sajad Khatiri, Prasun Saurabh, Timothy Zimmermann, Charith Munasinghe, Christian Birchler, and Sebastiano Panichella. 2024. SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track. In IEEE/ACM International Workshop on Search-Based and Fuzz Testing, SBFT@ICSE 2024.
[39] Sylvain Koos, Jean-Baptiste Mouret, and Stéphane Doncieux. 2013. The Transferability Approach: Crossing the Reality Gap in Evolutionary Robotics. IEEE Trans. Evol. Comput. 17, 1 (2013), 122–145. https://doi.org/10.1109/TEVC.2012.2185849
[40] Timothy E. Lee, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Oliver Kroemer, Dieter Fox, and Stan Birchfield. 2020. Camera-to-robot pose estimation from a single image. In International Conference on Robotics and Automation. IEEE, 9426–9432.
[41] Rohit Mehra, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, and Adam P. Burden. 2020. Immersive IDE: Towards Leveraging Virtual Reality for creating an Immersive Software Development Environment. In International Conference on Software Engineering, Workshops. ACM, 177–180. https://doi.org/10.1145/3387940.3392234
[42] Rohit Mehra, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, and Adam P. Burden. 2020. Towards Immersive Comprehension of Software Systems Using Augmented Reality - An Empirical Evaluation. In International Conference on Automated Software Engineering. IEEE, 1267–1269. https://doi.org/10.1145/3324884.3418907
[43] Claudio Menghi, Shiva Nejati, Khouloud Gaaloul, and Lionel C. Briand. 2019. Generating automated and online test oracles for Simulink models with continuous and uncertain behaviors. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 27–38. https://doi.org/10.1145/3338906.3338920
[44] Leonel Merino, Mircea Lungu, and Christoph Seidl. 2020. Unleashing the Potentials of Immersive Augmented Reality for Software Engineering. In International Conference on Software Analysis, Evolution and Reengineering. IEEE, 517–521. https://doi.org/10.1109/SANER48275.2020.9054812
[45] Elena Molina, Alejandro Ríos Jerez, and Núria Pelechano Gómez. 2020. Avatars rendering and its effect on perceived realism in Virtual Reality. In International Conference on Artificial Intelligence and Virtual Reality. IEEE, 222–225. https://doi.org/10.1109/AIVR50618.2020.00046
[46] Saasha Nair, Sina Shafaei, Daniel Auge, and Alois C. Knoll. 2021. An Evaluation of "Crash Prediction Networks" (CPN) for Autonomous Driving Scenarios in CARLA Simulator. In Workshop on Artificial Intelligence Safety (CEUR Workshop Proceedings, Vol. 2808). CEUR-WS.org. http://ceur-ws.org/Vol-2808/Paper_10.pdf
[47] Anthony Ngo, Max Paul Bauer, and Michael Resch. 2021. A Multi-Layered Approach for Measuring the Simulation-to-Reality Gap of Radar Perception for Autonomous Driving. In International Intelligent Transportation Systems Conference. IEEE, 4008–4014. https://doi.org/10.1109/ITSC48978.2021.9564521
[48] Vuong Nguyen, Stefan Huber, and Alessio Gambi. 2021. SALVO: Automated Generation of Diversified Tests for Self-driving Cars from Existing Maps. In International Conference on Artificial Intelligence Testing. IEEE, 128–135. https://doi.org/10.1109/AITEST52744.2021.00033
[49] Nvidia. 2020. NVIDIA DRIVE Constellation. https://developer.nvidia.com/drive/drive-constellation
[50] Sebastiano Panichella, Alessio Gambi, Fiorella Zampetti, and Vincenzo Riccio. 2021. SBST Tool Competition 2021. In International Workshop on Search-Based Software Testing. IEEE, 20–27. https://doi.org/10.1109/SBST52555.2021.00011
[51] Mingyu Park, Hoon Jang, Taejoon Byun, and Yunja Choi. 2020. Property-based testing for LG home appliances using accelerated software-in-the-loop simulation. In International Conference on Software Engineering. ACM, 120–129. https://doi.org/10.1145/3377813.3381346
[52] Yi-Hao Peng, Carolyn Yu, Shi-Hong Liu, Chung-Wei Wang, Paul Taele, Neng-Hao Yu, and Mike Y. Chen. 2020. WalkingVibe: Reducing Virtual Reality Sickness and Improving Realism while Walking in VR using Unobtrusive Head-mounted Vibrotactile Feedback. In Conference on Human Factors in Computing Systems. ACM, 1–12. https://doi.org/10.1145/3313831.3376847
[53] Andrea Piazzoni, Jim Cherian, Mohamed Azhar, Jing Yew Yap, James Lee Wei Shung, and Roshan Vijay. 2021. ViSTA: a Framework for Virtual Scenario-based Testing of Autonomous Vehicles. In International Conference on Artificial Intelligence Testing. IEEE, 143–150. https://doi.org/10.1109/AITEST52744.2021.00035
[54] Fabio Reway, Abdul Hoffmann, Diogo Wachtel, Werner Huber, Alois C. Knoll, and Eduardo Parente Ribeiro. 2020. Test Method for Measuring the Simulation-to-Reality Gap of Camera-based Object Detection Algorithms for Autonomous Driving. In Intelligent Vehicles Symposium. IEEE, 1249–1256. https://doi.org/10.1109/IV47402.2020.9304567
[55] Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, Qiang Lu, Steve Lemke, Martins Mozeiko, Eric Boise, Geehoon Uhm, Mark Gerow, Shalin Mehta, Eugene Agafonov, Tae Hyung Kim, Eric Sterner, Keunhae Ushiroda, Michael Reyes, Dmitry Zelenkovsky, and Seonman Kim. 2020. LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving. (2020), 1–6. https://doi.org/10.1109/ITSC45102.2020.9294422
[56] Erica Salvato, Gianfranco Fenu, Eric Medvet, and Felice Andrea Pellegrino. 2021. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. IEEE Access 9 (2021), 153171–153187. https://doi.org/10.1109/ACCESS.2021.3126658
[57] Vibhu Saujanya Sharma, Rohit Mehra, Vikrant Kaulgud, and Sanjay Podder. 2018. An immersive future for software engineering: avenues and approaches. In International Conference on Software Engineering: New Ideas and Emerging Results. ACM, 105–108. https://doi.org/10.1145/3183399.3183414
[58] Vibhu Saujanya Sharma, Rohit Mehra, Vikrant Kaulgud, and Sanjay Podder. 2019. An extended reality approach for creating immersive software project workspaces. In International Workshop on Cooperative and Human Aspects of Software Engineering. IEEE / ACM, 27–30. https://doi.org/10.1109/CHASE.2019.00013
[59] Gustavo Silvera, Abhijat Biswas, and Henny Admoni. 2022. DReyeVR: Democratizing Virtual Reality Driving Simulation for Behavioural & Interaction Research. In International Conference on Human-Robot Interaction, Daisuke Sakamoto, Astrid Weiss, Laura M. Hiatt, and Masahiro Shiomi (Eds.). IEEE / ACM, 639–643. https://doi.org/10.1109/HRI53351.2022.9889526
[60] Andrea Di Sorbo, Fiorella Zampetti, Aaron Visaggio, Massimiliano Di Penta, and Sebastiano Panichella. 2023. Automated Identification and Qualitative Characterization of Safety Concerns Reported in UAV Software Platforms. ACM Trans. Softw. Eng. Methodol. 32, 3 (2023), 67:1–67:37. https://doi.org/10.1145/3564821
[61] Donna Spencer. 2009. Card sorting: Designing usable categories. Rosenfeld Media.
[62] Jack Stilgoe. 2021. How can we know a self-driving car is safe? Ethics Inf. Technol. 23, 4 (2021), 635–647. https://doi.org/10.1007/s10676-021-09602-1
[63] Andrea Stocco, Brian Pulfer, and Paolo Tonella. 2023. Mind the Gap! A Study on the Transferability of Virtual Versus Physical-World Testing of Autonomous Driving Systems. IEEE Trans. Software Eng. 49, 4 (2023), 1928–1940. https://doi.org/10.1109/TSE.2022.3202311
[64] Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. 2020. Misbehaviour prediction for autonomous driving systems. In International Conference on Software Engineering. ACM, 359–371. https://doi.org/10.1145/3377811.3380353
[65] Shuncheng Tang, Zhenya Zhang, Yi Zhang, Jixiang Zhou, Yan Guo, Shuang Liu, Shengjian Guo, Yan-Fu Li, Lei Ma, Yinxing Xue, and Yang Liu. 2023. A Survey on Automated Driving System Testing: Landscapes and Trends. ACM Trans. Softw. Eng. Methodol. 32, 5 (2023), 124:1–124:62. https://doi.org/10.1145/3579642
[66] Valerio Terragni, Gunel Jahangirova, Mauro Pezzè, and Paolo Tonella. 2021. Improving assertion oracles with evolutionary computation. In Genetic and Evolutionary Computation Conference, Companion Volume. ACM, 45–46. https://doi.org/10.1145/3449726.3462722
[67] Valerio Terragni, Gunel Jahangirova, Paolo Tonella, and Mauro Pezzè. 2020. Evolutionary improvement of assertion oracles. In Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 1178–1189. https://doi.org/10.1145/3368089.3409758
[68] Valerio Terragni, Gunel Jahangirova, Paolo Tonella, and Mauro Pezzè. 2021. GAssert: A Fully Automated Tool to Improve Assertion Oracles. In International Conference on Software Engineering: Companion Proceedings. IEEE, 85–88. https://doi.org/10.1109/ICSE-Companion52605.2021.00042
[69] Christopher Steven Timperley, Afsoon Afzal, Deborah S. Katz, Jam Marcos Hernandez, and Claire Le Goues. 2018. Crashing simulated planes is cheap: Can simulation detect robotics bugs early?. In International Conference on Software Testing, Verification and Validation. IEEE, 331–342.
[70] Dinghua Wang, Shuqing Li, Guanping Xiao, Yepang Liu, and Yulei Sui. 2021. An exploratory study of autopilot software bugs in unmanned aerial vehicles. In Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 20–31. https://doi.org/10.1145/3468264.3468559
[71] Lingfeng Wang and K.C. Tan. 2005. Software testing for safety critical applications. IEEE Instrumentation & Measurement Magazine 8, 2 (2005), 38–47. https://doi.org/10.1109/MIM.2005.1438843
[72] Franz Wotawa. 2021. On the Use of Available Testing Methods for Verification & Validation of AI-based Software and Systems. In Workshop on Artificial Intelligence Safety (CEUR Workshop Proceedings, Vol. 2808). CEUR-WS.org. http://ceur-ws.org/Vol-2808/Paper_29.pdf
[73] Qinghua Xu, Shaukat Ali, and Tao Yue. 2021. Digital Twin-based Anomaly Detection in Cyber-physical Systems. In Conference on Software Testing, Verification and Validation. IEEE, 205–216. https://doi.org/10.1109/ICST49551.2021.00031
[74] Fiorella Zampetti, Ritu Kapur, Massimiliano Di Penta, and Sebastiano Panichella. 2022. An empirical characterization of software bugs in open-source Cyber–Physical Systems. Journal of Systems and Software 192 (2022), 111425. https://doi.org/10.1016/j.jss.2022.111425
[75] Eleni Zapridou, Ezio Bartocci, and Panagiotis Katsaros. 2020. Runtime Verification of Autonomous Driving Systems in CARLA. In Runtime Verification - International Conference (Lecture Notes in Computer Science, Vol. 12399). Springer, 172–183. https://doi.org/10.1007/978-3-030-60508-7_9
[76] Fangyi Zhang, Jürgen Leitner, Zongyuan Ge, Michael Milford, and Peter Corke. 2019. Adversarial discriminative sim-to-real transfer of visuo-motor policies. Int. J. Robotics Res. 38, 10-11 (2019). https://doi.org/10.1177/0278364919870227
[77] Wei Zhang, Siyu Fu, Zixu Cao, Zhiyuan Jiang, Shunqing Zhang, and Shugong Xu. 2020. An SDR-in-the-Loop Carla Simulator for C-V2X-Based Autonomous Driving. In Conference on Computer Communications. IEEE, 1270–1271. https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162743
[78] Husheng Zhou, Wei Li, Zelun Kong, Junfeng Guo, Yuqun Zhang, Bei Yu, Lingming Zhang, and Cong Liu. 2020. DeepBillboard: systematic physical-world testing of autonomous driving systems. In International Conference on Software Engineering. 347–358. https://doi.org/10.1145/3377811.3380422