The Autonomous Siemens Tram
Andrew W. Palmer, Albi Sema, Wolfram Martens, Peter Rudolph, Wolfgang Waizenegger
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version of this article can be found at https://doi.org/10.1109/ITSC45102.2020.9294699
Abstract — This paper presents the Autonomous Siemens Tram that was publicly demonstrated in Potsdam, Germany during the InnoTrans 2018 exhibition. The system was built on a Siemens Combino tram and used a multi-modal sensor suite to localize the vehicle, and to detect and respond to traffic signals and obstacles. An overview of the hardware and the developed localization, signal handling, and obstacle handling components is presented, along with a summary of their performance.
I. INTRODUCTION
Autonomous driving in the rail industry is in its infancy. The Autonomous Siemens Tram (AST) project was an opportunity to investigate the applicability of autonomous driving technologies developed by the automotive industry to the rail domain, and demonstrate what capabilities a future autonomous tram may offer. In many respects, autonomous driving for trams is very similar to autonomous driving for cars—they operate in similar environments where they interact with other road users such as cars, pedestrians, and cyclists, and they must obey similar traffic rules and signals. Some aspects of the problem are simplified by the rail-bound nature of the vehicles—there are limited areas that need to be mapped, path planning is not required, and the possible locations of the vehicle are heavily restricted. However, the rail-bound nature also makes the problem of avoiding obstacles considerably more challenging. Not only are trams unable to steer in order to avoid potential collisions, they also cannot decelerate as fast as cars, both due to physical limitations and the risk of injuring unsecured passengers.

The AST, shown in Figure 1, was developed during 2018 and publicly demonstrated in Potsdam, Germany during the InnoTrans 2018 exhibition, where it performed successful demonstration drives for hundreds of passengers. Note that a safety driver was always present during autonomous operation. It was demonstrated on a 6km long section of the Potsdam tram network shown in Figure 2. This track was comprised of a number of different environments that presented many of the scenarios that trams generally encounter. Between the stations ‘H Abzweig Betriebshof ViP’ and ‘H Turmstr.’, the track is surrounded by heavily wooded areas where it was not uncommon to see an animal crossing the tracks. Around ‘H Turmstr.’ is a suburban area with several unsignalised road crossings and low fences surrounding the track in between the crossings. The section of track around the stations ‘H Johannes-Kepler-Platz’ and ‘H Max-Born-Str.’ has higher density buildings, with businesses, schools, and apartment buildings on both sides of the track. There are a number of signalised and unsignalised crossings in this area, and, between the crossings, there is a fence in between the tracks to encourage pedestrians to cross at the designated crossings. After the station ‘H Gaußstr.’ there is a return loop through another wooded area where the tram began the return journey. Electrical power was supplied to the vehicle via overhead catenary lines, with the poles positioned in between the two parallel tracks.

This paper presents an overview of the hardware and software comprising the AST and a summary of their performance. The hardware is first introduced in Section II, with the software architecture, algorithms, and performance following in Section III. Section IV concludes and presents possible avenues of improvement.

The authors are with Siemens Mobility GmbH, Berlin, Germany. Corresponding author: [email protected]

Fig. 1: The Autonomous Siemens Tram

II. HARDWARE OVERVIEW

This section presents an overview of the vehicle, computing hardware, and sensors used in the project.

Fig. 2: Map of the route with station locations marked
A. Vehicle
The vehicle used, shown in Figure 1, was the prototype Siemens Combino NF100 low floor tram built in 1996. It is approximately 26m long, 3.5m high, and 2.3m wide, and runs on standard gauge track (1.435m gauge). It has an unladen weight of approximately 28 tonnes, and is rated for 150 passengers. This corresponds to an additional load of approximately 12 tonnes, or 40% of the unladen weight. The vehicle has a theoretical maximum speed of 70km/h, but on the Potsdam track network it is limited to 50km/h. Compared to regular road vehicles, its performance is very limited—it has a maximum acceleration of 1.3m/s², an average braking deceleration when using the service brake of approximately 1.2m/s², and with the emergency brake of up to 3m/s². The emergency brake, however, is rarely used, as unsecured passengers are likely to be injured during its operation. Consequently, the emergency brake was not used by the AST, and its operation was the responsibility of the safety driver.

B. Computers
A number of computers were used to perform the various parts of the autonomous operations of the vehicle. Several computers equipped with graphics cards were used to perform camera-based signal recognition and object detection tasks, and a number of railway-certified computers were used for localization, lidar- and radar-based object detection, and object fusion. All computers were interconnected via Ethernet. In addition to these computers, a tablet was used to display the current state of the system to the safety driver.
C. Sensors
The following subsections introduce the multi-modal suite of sensors that the vehicle was equipped with to perform the localization, signal handling, and obstacle handling tasks.
1) Localization:
A dual antenna GNSS-aided Inertial Navigation System (INS) was used to provide highly accurate pose and velocity information. Combined with a Real-Time Kinematic (RTK) correction data service, the INS was capable of providing measurements at 100Hz with up to 0.01m position accuracy, 0.05km/h velocity accuracy, 0.03° roll and pitch accuracy, and 0.1° heading accuracy. The two antennas were mounted above the roof of the vehicle with a separation of 1.6m.

Fig. 3: Sensor field of view: (a) Lidars, (b) Radars, (c) Object detection cameras, (d) Signal recognition cameras
2) Lidar:
Two separate types of lidars were used on the vehicle—a forward-facing 4-layer automotive laser scanner with integrated object tracking was used for long-range object perception, while two 16-layer lidars were mounted on the front corners of the vehicle to provide a wide field of view for near-range object perception. The field of view of the lidars is shown in Figure 3a.
3) Radar:
A set of 3 automotive radars were mounted at the front of the vehicle at various angles. The field of view of the radars is shown in Figure 3b. The radars have two separate fields of view, shown in yellow (up to 70m) and green (up to 250m) respectively. Similar to the forward-facing lidar, the radar has integrated object tracking and is capable of tracking up to 100 objects at a time.

Fig. 4: Software architecture
4) Camera:
The vehicle was equipped with a total of 9 cameras. A set of 6 cameras were positioned behind the windshield and side windows in the front cab of the vehicle to provide a wide field of perception for the purpose of object detection, as shown in Figure 3c. The lens on the camera facing directly forwards had a 60° Field of View (FoV), and the other 5 cameras each had a 120° FoV. The placement of the cameras on the right side of the vehicle was chosen in order to provide perception of pedestrians who may walk in front of the tram from the platform while the tram is stationary, and of vehicles travelling parallel on the road to the right of the track, which present a particularly high risk of collision at road crossings when they turn left across the track. A further set of 3 cameras were used to provide forward facing perception for signal recognition. They were positioned behind the windshield of the vehicle as shown in Figure 3d. The left and right cameras had a 60° FoV, while the centre camera had a 35° FoV.

III. SOFTWARE OVERVIEW
A high-level overview of the software architecture used in the AST is shown in Figure 4. It consisted of four main subsystems—vehicle control, localization, signal handling, and obstacle handling—each of which is summarised in the following sections.
A. Vehicle Control
Vehicle control was provided by a proprietary system. Using the position and speed of the vehicle on the map, it determines the throttle and brake commands necessary to drive from platform to platform while obeying speed limits. A Movement Authority Limit (MAL) input is additionally used to specify the distance that the vehicle is allowed to move. The MAL input is used to respond to the dynamic environment, and is calculated by taking the minimum of the MAL outputs of the signal handling and obstacle handling components.
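As a minimal illustration of how the two subsystem outputs could be combined before being passed to vehicle control (the function and variable names below are assumptions for illustration, not taken from the AST software), the combination is simply the more restrictive of the two limits:

```python
def combined_movement_authority_limit(signal_mal_m: float,
                                      obstacle_mal_m: float) -> float:
    """Combine the Movement Authority Limits (in metres of permitted travel)
    reported by the signal handling and obstacle handling subsystems.

    Vehicle control only ever receives the more restrictive value, so it can
    never drive further than either subsystem allows."""
    return min(signal_mal_m, obstacle_mal_m)

# Example: signals allow 120 m of travel, but an obstacle limits it to 35 m.
mal = combined_movement_authority_limit(120.0, 35.0)  # -> 35.0
```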
B. Localization Subsystem
The localization subsystem used the INS in combination with a map of the track and associated infrastructure to provide state information to the vehicle control, as well as a digital horizon to the signal and obstacle handling subsystems. The digital horizon provided information about the upcoming track including the track geometry, signals, platforms, and crossings. The following subsections present details on how the map was created, and how the localization was performed.
1) Maps:
The first step in creating the map was to convert position data collected using the INS into the track network. The raw INS data was post-processed to minimise any effects from poor GNSS signal. The number of points in the resultant trajectory was reduced primarily by automatically identifying straight sections of track and removing unnecessary points on these sections. Infrastructure was then mapped by hand, including the start and end of each platform, the stopping point of the vehicle at each platform, the signals (including their height and type), the start and end of each road and pedestrian crossing, and the electrical grid separators (locations where the vehicle is not allowed to stop, as there is a physical gap in the overhead power supply). Finally, any other additional information, such as speed limits, was added manually.
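The paper does not specify the simplification algorithm used to thin out straight sections; the sketch below shows one plausible way this could be done, using a perpendicular-distance criterion in the spirit of Ramer-Douglas-Peucker (the 2cm tolerance is an assumed value):

```python
import numpy as np

def simplify_track(points: np.ndarray, tol_m: float = 0.02) -> np.ndarray:
    """Thin out a mapped track polyline: points lying within tol_m of the
    chord between a segment's endpoints are treated as part of a straight
    section and removed. points: (N, 2) easting/northing in metres."""
    if len(points) <= 2:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.linalg.norm(chord)
    rel = points - start
    if chord_len == 0.0:
        dists = np.linalg.norm(rel, axis=1)
    else:
        # Perpendicular distance of each point to the start-end chord.
        dists = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0]) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] <= tol_m:
        # Effectively straight: keep only the endpoints.
        return np.vstack([start, end])
    # Otherwise split at the most distant point and recurse on both halves.
    left = simplify_track(points[:idx + 1], tol_m)
    right = simplify_track(points[idx:], tol_m)
    return np.vstack([left[:-1], right])
```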
2) Localization:
In many railway applications, localization can be very challenging [1], [2], and difficulties in particular arise from ambiguity over which track the vehicle is on. This can be reduced to a 1-dimensional problem using information from the operation centre about which track is allocated for the current trip. This, combined with the INS, was sufficient for providing the location of the vehicle on the track.
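As a rough sketch of this 1-dimensional reduction (the map representation and names below are assumptions), the INS fix can be projected onto the polyline of the allocated track to obtain a single along-track coordinate:

```python
import numpy as np

def track_position(ins_xy: np.ndarray, track_xy: np.ndarray) -> float:
    """Project an INS position onto the polyline of the allocated track and
    return the 1-D distance (metres) along that track.
    track_xy: (N, 2) ordered map points of the allocated track."""
    seg_vec = np.diff(track_xy, axis=0)            # segment direction vectors
    seg_len = np.linalg.norm(seg_vec, axis=1)
    cum_len = np.concatenate([[0.0], np.cumsum(seg_len)])
    best_dist, best_s = np.inf, 0.0
    for i, (p0, v, length) in enumerate(zip(track_xy[:-1], seg_vec, seg_len)):
        if length == 0.0:
            continue
        # Fraction along segment i of the point closest to the INS fix.
        t = np.clip(np.dot(ins_xy - p0, v) / (length * length), 0.0, 1.0)
        closest = p0 + t * v
        d = np.linalg.norm(ins_xy - closest)
        if d < best_dist:
            best_dist, best_s = d, cum_len[i] + t * length
    return best_s
```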
3) Performance:
The performance of the INS was excellent in open sky conditions, providing a reported position accuracy of a few centimetres. However, in certain sections of the track the view of the sky was partially obscured by overhanging vegetation and tall buildings. In these cases, any degradation in GNSS signal was overcome using dead reckoning based on the IMU and motion model. It was observed that, during longer periods without GNSS (for example, inside depot buildings), the drift of the INS was higher than would normally be expected. It is hypothesised that this was due to the INS using a sophisticated motion model based on car motion which does not accurately reflect the motion of a rail-bound vehicle.
C. Signal Handling Subsystem
The signal handling subsystem consisted of three components—signal state detection, signal state filtering, and a signal planner. The signal state detection component identified the state of a signal in a single camera image. As the aim of this project was to use only on-board technologies without modifying existing infrastructure, technologies such as V2X were not adopted for signal state identification. Detections from multiple cameras and frames were then combined and filtered in the signal state filtering component, with the resultant filtered signal state being used by the signal planner to determine the MAL necessary to obey the signal. The following subsections present an overview of each component, and a summary of the performance of the subsystem.
1) Signal State Detection:
German tram signals are quite different to regular traffic signals. There are two categories of signals—stop-go signals (which correspond in functionality to regular red-yellow-green traffic signals), and switch state signals (which indicate which direction a railway switch is set to). Stop-go signals, shown in Figures 5a-5f, will always contain an F0 (stop) and an F4 (get ready) signal, and an F1, F2, or F3 go signal which also indicates the track direction that it applies to (in the case of a branching track). In many cases, they will also include an A signal (which stands for Anforderung, the German word for request) which, when lit, means that the approaching tram has been registered by the signal and the upcoming light sequence will include a go signal for the tram. Switch signals, shown in Figures 5g-5i, contain a W0 signal, which indicates that the switch state is locked for an approaching vehicle, and W12 and W13 signals specifying the turning direction.

As there is no flow of information from the signal to the tram about its state, the signal state is detected using camera images. Multiple approaches in the automotive industry determine the state of a signal by detecting and classifying the whole signal housing [3]. In comparison to traffic signals, tram signals can have varying numbers of chambers, each indicating one of the states mentioned above, giving many possible combinations that need to be detected. As a result, methods that detect signals as a whole were not pursued. The approach that was developed detects the state of each chamber separately, and the combination of the detected chambers along with prior knowledge from the digital horizon enables the overall signal state to be resolved.

The family of Single Shot Detectors (SSD) provides a good trade-off between accuracy and speed, making them suitable for real-time systems. Through multiple experiments, it was determined that the best combination of good real-time performance and accurate detections of small signals at far distances could be achieved using a modified version of the original SSD architecture [4]. This modified network used a MobileNet v2 feature extractor [5] with an input size of 512x512. The default size of the anchors was set to a square ratio of 1:1 with smaller default dimensions than the original network, fitting the scope of the problem. The location of the signal in the digital horizon was used to identify a Region of Interest (ROI) in the image where the signal was expected to be. Using the ROI as the input to the network made the signal chambers a significant enough size to be accurately recognized by a fast SSD algorithm. An example result of the algorithm is shown in Figure 6.

The network was trained on the labels shown in Figure 5. One of the challenges that was encountered was that the shape of the signal in each chamber was partially visible, even if the chamber itself was not lit—this can be observed in the bottom chamber in Figure 6. To prevent these from being classified as a lit signal chamber, an extra “empty” label was added to the training set to represent the chambers that are not lit. Training of the detection network was done on a manually labelled dataset of 10,000 images, of which 1,000 random images were chosen for validation during training. Image augmentation was also applied during training, excluding rotation and mirroring. A further 700 hand-picked images from multiple challenging scenarios and weather situations were used as the test set.

TABLE I: Signal detection results
Dataset               | mAP  | mAR  | Max. detection distance
Cloudy                | 0.65 | 0.75 | 80-100m
Raining               | 0.6  | 0.73 | 80m
Sunny w/ reflections  | 0.49 | 0.48 | 70-80m
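As an illustration of the ROI step described before Table I, the sketch below projects the expected signal position from the digital horizon into the image and crops a region around it before running the chamber detector; the pinhole projection, crop size, and detector interface are assumptions made for the example, not details of the AST implementation:

```python
import numpy as np

def signal_roi(signal_xyz_cam: np.ndarray, K: np.ndarray,
               half_size_px: int = 256) -> tuple:
    """Project the expected 3-D signal position (in camera coordinates, taken
    from the digital horizon) through the intrinsic matrix K and return a
    square region of interest centred on the expected pixel location."""
    u, v, w = K @ signal_xyz_cam
    cx, cy = int(u / w), int(v / w)
    return (cx - half_size_px, cy - half_size_px,
            cx + half_size_px, cy + half_size_px)

def detect_chambers(image: np.ndarray, roi: tuple, detector):
    """Run a per-chamber detector (e.g. an SSD with a 512x512 input) on the
    ROI crop only, so that distant signal chambers cover enough pixels to be
    detected; returns the detector's (label, score, box) outputs."""
    x0, y0, x1, y1 = roi
    crop = image[max(y0, 0):y1, max(x0, 0):x1]
    return detector(crop)
```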
2) Signal State Filtering:
A Bayes filter was used to track the state of a signal. As each chamber of a signal was detected separately, a Hidden Markov Model (HMM) was used to describe which combinations of chambers could be active at any one time, and the possible transitions between states, in order to reduce misclassifications. Before integrating the detections into the filter, implausible detections were identified and discarded using the expected size and possible signal states as indicated in the digital horizon.
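The state space and transition probabilities used in the AST are not published; the sketch below only shows the general structure of such a discrete Bayes (HMM forward) filter, with an illustrative three-state model and an assumed transition matrix:

```python
import numpy as np

# Illustrative signal states (a real model would cover all chamber combinations).
STATES = ["F0", "F4", "F1"]          # stop, get ready, go straight

# Assumed transition probabilities between filter updates (rows sum to 1):
# signals mostly stay in their state and follow the F0 -> F4 -> F1 -> F0 cycle.
TRANSITION = np.array([[0.90, 0.10, 0.00],
                       [0.05, 0.85, 0.10],
                       [0.10, 0.00, 0.90]])

def bayes_update(belief: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """One predict/update step of a discrete Bayes filter over STATES.
    belief: current probability over STATES.
    likelihood: per-state likelihood of the latest per-chamber detections."""
    predicted = TRANSITION.T @ belief          # predict through the HMM
    posterior = predicted * likelihood         # weight by the camera evidence
    return posterior / posterior.sum()         # renormalise
```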
3) Signal Planning:
A stopping location was determined for each signal that ensured that the signal would always be in the field of view of at least one of the cameras when the vehicle came to a stop. When an upcoming signal was indicated in the digital horizon, the MAL was set to the stopping point of the signal until the necessary combination of activated chambers was identified in the filtered signal state to allow the vehicle to proceed past the signal. This meant that the vehicle would always stop in front of the signal if the signal detection or filtering failed to detect the signal state. A second location was also specified for each signal, which was the position at which the vehicle, if it could not stop by that point, would continue to drive past the signal. This was necessary to prevent the vehicle from braking in situations where the signal changed to a prepare to stop state (F4), but the braking distance of the vehicle was greater than the distance to the point where none of the cameras would be able to detect the signal state.
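A minimal sketch of the resulting MAL logic is given below; it ignores the second, pass-through location described above, and the function name, go-state set, and interface are assumptions for illustration:

```python
def signal_mal(dist_to_stop_point_m: float, filtered_state: str,
               go_states=("F1", "F2", "F3")) -> float:
    """Movement Authority Limit contributed by an upcoming signal.

    Until a go state is positively identified, the vehicle may only travel as
    far as the stopping point in front of the signal; once a go state is
    confirmed, the signal no longer restricts the movement authority."""
    if filtered_state in go_states:
        return float("inf")              # signal does not limit movement
    return max(dist_to_stop_point_m, 0.0)
```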
4) Performance:
The performance of the signal detection on the camera images was measured using object detection metrics such as mean Average Precision (mAP) and mean Average Recall (mAR), and custom metrics such as the maximum distance at which the state could be accurately detected. Table I shows the performance metrics for the developed network on three of the most challenging weather categories in the test set—cloudy, raining, and sunny with reflections.

During operation, the passive behaviour of always planning on stopping at a signal unless it was positively identified ensured that the vehicle never drove past a signal where it should stop. Missed and incorrect detections were filtered out, and the end behaviour of the system was robust, even in poor weather conditions.
D. Obstacle Handling Subsystem
The goal of the obstacle handling subsystem was to detect and respond to dynamic objects in the environment. Two different approaches were implemented for detecting possible obstacles—an object-based approach, in which objects were detected in multiple sensor modalities and combined using a late fusion approach, and a free space detection approach which worked purely with lidar sensors. An obstacle planner used the detected objects and free space, along with the digital horizon, to determine the MAL necessary to avoid obstacles on the track and, additionally, whether the warning bell should be activated. The following subsections detail the two obstacle detection approaches, along with the obstacle planner, and the overall performance of the obstacle handling subsystem.

Fig. 5: The possible signal states along the test track: (a) Approaching tram has been registered by the signal (b) Stop (c) Go straight (d) Go right (e) Go left (f) Get ready (g) Switch state locked (h) Right branch selected (i) Left branch selected

Fig. 6: Example detection of signal chambers. The expected signal location from the digital horizon is shown in green, the large red box is the ROI generated from the expected location, and the small boxes show the individual detections of lit chambers (blue) and empty chambers (red).
1) Object Detection and Filtering:
The object detection and filtering approach utilised the full multi-modal sensor suite consisting of cameras, lidars, and radars. A commercially available deep learning-based object detector developed for automotive applications was used for detecting and classifying objects from the camera data, while the automotive lidar and radars provided classified object outputs out of the box. These detections were combined using a Kalman filter, where measurements were associated to objects using Mahalanobis distance and nearest-neighbour assignment. The success of such an approach is heavily dependent on the quality of the object detections. In this project, sensors and object detection algorithms designed for the automotive industry were used with the hypothesis that, given trams operate in a similar environment to road vehicles, the performance should be similar. However, the tram environment introduced a number of challenges when using automotive sensors.

The largest challenge for using the object detections from the radars was that the strongest returns typically come from objects with lots of metal. As most of the infrastructure on and around the track has a very high metal content, the most commonly detected objects by the radar were the power poles and the railway sleepers. Returns from pedestrians were indistinguishable amongst the infrastructure detections. Another problem encountered was that, unlike for other sensor modalities, large objects were typically detected as multiple smaller objects—this was especially the case for fences and other trams.

Similar to the radar, the majority of detections from the lidar were also of the infrastructure. The high density of infrastructure objects (in particular, fences and power poles) again posed a significant challenge for the object fusion, as object detections that were close together would be associated with one another, leading to non-zero velocities. As the infrastructure is typically very close to the track, any imprecision in the position of an object generated from a piece of infrastructure can lead to a collision being predicted.

The object detections from the cameras were, in general, quite good, but were sensitive to the position of the camera—when using cameras that were positioned high on the windscreen, objects that were close to the tram were sometimes not detected. Despite this, cars and pedestrians were almost always detected. However, the false positive rate was also high. Integrating the object detections into the object fusion was particularly challenging. As the detections were made in image space, the projection of the object detection to 3D relied on very accurate calibration of the sensors and accurate bounding boxes—estimating the position of partially occluded objects was especially difficult.

Predicting the future trajectory of an object is a challenging topic, and reliable results are achievable for a few seconds [6]. In this case, the large braking distances of trams limit the usefulness of prediction, as they can necessitate a similarly long prediction horizon.
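The gating and assignment step described above follows a standard pattern; a simplified sketch is shown below, where the state layout, gate value, and greedy assignment order are assumptions rather than details of the AST fusion:

```python
import numpy as np

def associate(tracks, detections, gate=3.0):
    """Greedy nearest-neighbour association of detections to Kalman tracks
    using the Mahalanobis distance in measurement space.
    tracks: list of (z_pred, S) with predicted measurement and innovation
    covariance for each track. detections: list of measurement vectors.
    Returns a list of (track_index, detection_index) pairs."""
    pairs = []
    used = set()
    for ti, (z_pred, S) in enumerate(tracks):
        S_inv = np.linalg.inv(S)
        best_d, best_j = gate, None
        for j, z in enumerate(detections):
            if j in used:
                continue
            innov = z - z_pred
            d = np.sqrt(innov @ S_inv @ innov)   # Mahalanobis distance
            if d < best_d:
                best_d, best_j = d, j
        if best_j is not None:
            pairs.append((ti, best_j))
            used.add(best_j)
    return pairs
```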
2) Free Space Detection:
A free space-based approach was also implemented to provide redundancy in the obstacle handling. It combined the raw point clouds of the lidars in order to determine which areas of the environment were free or occupied space using the following process. The latest point clouds from each sensor were first spatially and temporally aligned. Then, irrelevant points were filtered out considering the clearance gauge of the vehicle. Following this, the remaining points were clustered, projected to the ground plane, and converted to polygons—these polygons represented the occupied space in the environment.

A more conservative approach was also implemented, in which areas that were not visible due to occlusions were explicitly modelled as occupied space. While such an approach is safer than using the convex hull of each cluster, in practice it led to undesirable obstacle detections when approaching curves, where pieces of infrastructure would partially occlude the track.

A challenging problem that was encountered was vegetation growing on the track. Unlike in the automotive domain, where the road surface is essentially flat, the ground surface between and around the rails was sometimes loose gravel and could have plants growing in it. Choosing the correct clearance gauge was a trade-off between being able to detect objects close to the ground and not detecting tall vegetation.
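A compressed sketch of the processing chain described at the start of this subsection is given below; it only applies a vertical clearance filter, and the parameter values, clustering method (DBSCAN), and convex hull routine are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

def occupied_polygons(points_xyz: np.ndarray,
                      z_min: float = 0.3, z_max: float = 4.0,
                      cluster_eps_m: float = 0.5, min_points: int = 5):
    """Turn an aligned lidar point cloud into occupied-space polygons.

    1. Keep only points within an (assumed) vertical clearance band, which
       discards the ground and points well above the vehicle.
    2. Cluster the remaining points in the ground plane.
    3. Take the convex hull of each cluster as an occupied-space polygon."""
    mask = (points_xyz[:, 2] > z_min) & (points_xyz[:, 2] < z_max)
    relevant = points_xyz[mask]
    if len(relevant) == 0:
        return []
    labels = DBSCAN(eps=cluster_eps_m,
                    min_samples=min_points).fit_predict(relevant[:, :2])
    polygons = []
    for label in set(labels) - {-1}:             # -1 marks noise points
        cluster_xy = relevant[labels == label][:, :2]
        if len(cluster_xy) >= 3:
            hull = ConvexHull(cluster_xy)
            polygons.append(cluster_xy[hull.vertices])
    return polygons
```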
3) Obstacle Planner:
The obstacle planner calculated whether collision and warning detection zones defined for each part of the track intersected with the objects and occupied space polygons in order to determine the actions to be taken. Default collision and warning detection zones were used for the majority of the track, with special zones defined for high-risk areas such as platforms and crossings. For platforms in particular, pedestrians tend to stand very close to the track and should be warned when the tram is approaching. The basic response strategy was to set the MAL to a specified offset in front of a detected collision. If the distance to the obstacle was less than the braking distance (i.e., the tram was predicted to collide with the obstacle) or the obstacle was within a predefined distance, then the warning bell would additionally be used. Activating the warning bell was the only response for obstacles detected in a warning zone.
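The planner logic described above can be sketched as follows; the zone representation, the distance lookup, and the parameter values are assumptions chosen for illustration, not the AST implementation:

```python
from shapely.geometry import Polygon

def plan_obstacle_response(collision_zone: Polygon, warning_zone: Polygon,
                           obstacles, dist_along_track,
                           braking_dist_m: float, bell_dist_m: float,
                           stop_offset_m: float = 5.0):
    """Determine the obstacle-handling MAL and whether to ring the warning bell.
    obstacles: occupied-space polygons in map coordinates.
    dist_along_track: callable mapping a polygon to its distance ahead of the tram."""
    mal = float("inf")
    ring_bell = False
    for obs in obstacles:
        if collision_zone.intersects(obs):
            d = dist_along_track(obs)
            # Stop a specified offset in front of the detected collision.
            mal = min(mal, max(d - stop_offset_m, 0.0))
            # Warn if the tram cannot stop in time or the obstacle is close.
            if d < braking_dist_m or d < bell_dist_m:
                ring_bell = True
        elif warning_zone.intersects(obs):
            ring_bell = True
    return mal, ring_bell
```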
4) Performance:
To evaluate the performance of the obstacle handling, a pram was used as a test obstacle—this was able to be reliably detected over 80m away. As the braking distance of the vehicle at the maximum velocity of 50km/h was approximately 80m, the velocity of the vehicle when approaching pedestrian crossings was limited during operation to 40km/h to provide a safety buffer. As this approach projects the detected objects and free space onto the map, it is very sensitive to errors in sensor-to-vehicle calibration, and errors in the estimated vehicle position and orientation. With poles and fences positioned very close to the track, even small angular errors can lead to false detections. Sensor-to-sensor calibration was performed using a set of calibration targets placed around the vehicle, with the sensor-to-vehicle calibration manually estimated using a straight section of track as the reference. In combination with the high accuracy of the localization, the alignment of the sensor data with the track map was sufficient, and false collision detections occurred very rarely.
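Note that the quoted braking distance is consistent with the service brake deceleration of approximately 1.2m/s² given in Section II-A: d ≈ v²/(2a) = (50/3.6 m/s)² / (2 × 1.2 m/s²) ≈ 80m.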
IV. CONCLUSION
This paper presented a summary of the Autonomous Siemens Tram, with a focus on the software developed and the challenges that were encountered. As was shown, the application of automotive sensors and technologies to the railway domain is not straightforward and requires significant adaptation for the unique situations that these vehicles encounter. Despite successful demonstrations of the system, there are a number of possible avenues for improving the robustness of the system. It is already clear that GNSS-based solutions for localization will not be sufficient for reliable operation in all environments. Research is underway on using visual odometry in the rail domain [7], and perception-based localization approaches are likely to be necessary to fully cover the situations where GNSS-based solutions fail. Object-based obstacle handling could also be improved by using a map of the infrastructure to determine whether a measurement is likely to be of a piece of infrastructure and can be discarded. However, the challenge with such an approach is ensuring that dynamic objects positioned next to infrastructure are not erroneously discarded.

ACKNOWLEDGMENT
Our thanks go to the large team of Siemens Mobility employees who contributed to the success of this project. We would also like to acknowledge the support of the Verkehrsbetrieb Potsdam GmbH, who provided the vehicle, safety drivers, and access to the track.
REFERENCES

[1] M. Lauer and D. Stein, “A Train Localization Algorithm for Train Protection Systems of the Future,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 970–979, 2014.
[2] A. W. Palmer and N. Nourani-Vatani, “Robust Odometry using Sensor Consensus Analysis,” IEEE International Conference on Intelligent Robots and Systems, pp. 3167–3173, 2018.
[3] M. B. Jensen, M. P. Philipsen, A. Møgelmose, T. B. Moeslund, and M. M. Trivedi, “Vision for looking at traffic lights: Issues, survey, and perspectives,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1800–1815, 2016.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” European Conference on Computer Vision, pp. 21–37, 2016.
[5] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[6] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling,” IEEE International Conference on Robotics and Automation, pp. 2056–2063, 2018.
[7] F. Tschopp, T. Schneider, A. W. Palmer, N. Nourani-Vatani, C. Cadena, R. Siegwart, and J. Nieto, “Experimental comparison of visual-aided odometry methods for rail vehicles,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1815–1822, 2019.