Modeling a Cache Coherence Protocol with the Guarded Action Language
John P. Gallagher, Rob van Glabbeek and Wendelin Serwe (Eds.): Models for Formal Analysis of Real Systems (MARS'18) and Verification and Program Transformation (VPT'18), EPTCS 268, 2018, pp. 88–103, doi:10.4204/EPTCS.268.3. © Q. Meunier, Y. Thierry-Mieg, E. Encrenaz. This work is licensed under the Creative Commons Attribution License.
Quentin L. Meunier Yann Thierry-Mieg Emmanuelle Encrenaz
Sorbonne Université, CNRS, Laboratoire d'Informatique de Paris 6, LIP6, F-75005 Paris, France
[email protected] [email protected] [email protected]
We present a formal model built for verification of the hardware Tera-Scale ARchitecture (TSAR), focusing on its Distributed Hybrid Cache Coherence Protocol (DHCCP). This protocol is by nature asynchronous, concurrent and distributed, which makes classical validation of the design (e.g. through testing) difficult. We therefore applied formal methods to prove essential properties of the protocol, such as absence of deadlocks, eventual consensus, and fairness.
Testing and simulation are unfortunately not sufficient to provide the strong correctness guarantees expected from hardware designs, where patching is not an option. However, proving a design correct remains a difficult process despite the improvement of verification tools, as the complexity of hardware designs has also grown along with Moore's law.

The TSAR (Tera-Scale ARchitecture) shared memory architecture [1] studied in this paper is a general-purpose multicore architecture, in which cache coherence is entirely supported by the hardware. The main technical challenge of this architecture is scalability, as it is intended to integrate up to 1024 cores. Its embedded cache-coherence protocol is a key architectural point, and has been designed to scale well as the number of cores grows.

To formally prove properties of the designed protocol, named Distributed Hybrid Cache Coherence Protocol (DHCCP), we built several formal models over a number of student internships, designed to investigate how different model-checking tools could address our needs. This case study is the result of a collaboration between the experts in systems-on-chip design who developed TSAR, and experts in formal verification who provide model-checking tools.

We first present the DHCCP protocol and the relevant characteristics of the hardware that must be captured by the semantics of the model. We then present the formal models that we built over time in Promela, Divine and GAL, together with the results we were able to obtain. All these models are made available as part of this submission process and will be accessible in the MARS repository as well as on GitHub.

Memory Layout.
The architecture is clustered and has a 2D-mesh topology, as represented in Figure 1. Each cluster typically contains 4 processor cores with their L1 caches, a local interconnect, one L2-cache bank, and some peripherals. (Models available at https://github.com/lip6/TSAR-DHCCP.)
Figure 1: TSAR Architecture Overview: Mesh Topology and Cluster Details

L1–L2 Communication.
For L1-L2 communication, TSAR uses a L1-L2 Interconnect composed of two hierarchical levels: a Local Interconnect for intra-cluster communication, and a Global Interconnect for inter-cluster communication, implemented as a network-on-chip with a 2D-mesh topology. For a L1 cache, accessing a L2 located in another local interconnect results in a longer latency, but both levels implement a logically flat address space. The L1-L2 interconnect provides a built-in broadcast service used by the coherence protocol to efficiently broadcast invalidation messages, and contains five independent networks.
Hardware Coherence Mechanism.
For scalability purposes, TSAR implements a directory-based cache-coherence policy. From a conceptual point of view, the coherence protocol is supported by a global directory located in the L2 controller: this global directory stores the status of each cache line replicated in at least one L1 cache of the architecture. The policy between L1 and L2 caches is write-through, meaning that the L2 cache always contains the most recent value of a cache line, and there is no exclusive ownership state for a L1 cache. This global directory is physically distributed in the corresponding L2 banks.

The basic coherence mechanism is the following: when the L2 controller receives a WRITE request for a given cache line, it sends an UPDATE or INVAL request to all L1 caches containing a copy, except the writer. The write request is acknowledged only when all UPDATE or INVAL transactions are completed.

The L2 cache is inclusive: a cache line present in at least one L1 cache must be present in the L2 cache. Thus, for any evicted line, the corresponding copies must be invalidated. When a shared piece of data is modified, the DHCCP protocol uses two different strategies depending on the number of copies:
• MULTICAST UPDATE: when the number of copies is smaller than a certain threshold, called the DHCCP threshold, the L2-cache controller registers the locations of all the copies and sends an UPDATE request to each concerned L1 cache.
• BROADCAST INVAL: when the number of copies is larger than the DHCCP threshold, or if there is no room left to register the copies' locations, the L2-cache controller registers only the number of copies without their locations, and sends an INVAL request to all L1 caches. Only the L1 caches which own a copy of the line must respond to this request, thus reducing the traffic.

The list of sharers of a given cache line is stored in the L2 directory with a per-bank hardware heap shared between lines. A counter of sharers is also maintained in the directory entry. When the threshold of copies is exceeded or when the heap is full, the sharers list is freed and only the counter of copies is used. In this case, broadcast invalidates are used to maintain cache coherence.
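To make the two strategies concrete, the decision logic described above can be sketched in executable form. This is our own Python illustration, not the hardware description: the names `DirectoryEntry`, `coherence_requests` and the constants are ours.

```python
# Sketch of the per-line directory decision described above; class,
# method and constant names are ours, not the hardware's.
DHCCP_THRESHOLD = 2       # the "DHCCP threshold"

class DirectoryEntry:
    def __init__(self):
        self.sharers = set()        # explicit copy locations (heap-backed)
        self.count = 0              # copy counter, always maintained
        self.counter_only = False   # True once the sharers list is freed

    def add_sharer(self, l1_id, heap_free_slots):
        self.count += 1
        if self.counter_only:
            return
        if self.count > DHCCP_THRESHOLD or heap_free_slots == 0:
            # threshold exceeded or heap full: free the list, keep the counter
            self.sharers.clear()
            self.counter_only = True
        else:
            self.sharers.add(l1_id)

def coherence_requests(entry, writer_id):
    """Messages the L2 sends when writer_id writes a replicated line."""
    if entry.counter_only:
        # one broadcast, replicated by the network to every L1 cache
        return [("broadcast_inval", "all")]
    # MULTICAST UPDATE: one request per registered copy, except the writer
    return [("multicast_update", l1)
            for l1 in sorted(entry.sharers) if l1 != writer_id]
```

In the broadcast case, only the L1 caches that actually hold a copy answer, and the L2 compares the number of responses to its counter to detect completion.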
Types of Transactions.
Two types of transactions are defined for L1-L2 communication:
Direct transactions, containing the messages read, write, ll (Load-Linked), sc (Store-Conditional), cas (Compare-and-Swap), along with their responses. These transactions use two separated networks for commands and responses. They are initiated by L1-cache controllers. For these transactions, the target can be any L2-cache controller.

Coherence transactions, containing the messages cleanup (eviction from L1 or invalidation acknowledgement), clack (CLeanup ACK), multicast_update, multicast_inval, broadcast_inval, and multi_ack. There are four types of coherence transactions, each requiring two or three steps.

Coherence Transactions.

• A local cleanup is a two-step transaction initiated by a L1-cache controller when it makes a cache line replacement, usually following a miss. The L1 cache sends a cleanup request indicating it is invalidating the cache line, and the L2-cache controller returns a clack response to acknowledge the line invalidation.
• A multicast update is a two-step transaction initiated by the L2-cache controller when it receives a write request to a replicated cache line for which the number of copies does not exceed the DHCCP threshold. It sends as many multi_updt requests as the number of registered copies, except to the writer. The expected response is a multi_ack sent by each involved L1 cache. The L2-cache controller counts the number of responses to detect the completion of the multicast update transaction.
• A multicast invalidate is a three-step transaction initiated by a L2-cache controller when it makes a cache line replacement (following a miss) and the victim line has a number of copies smaller than the DHCCP threshold. It sends as many multi_inval requests as the number of registered copies; each L1 cache returns a cleanup response, and the L2 cache acknowledges the invalidation with a clack. The L2-cache controller counts the number of responses to detect the completion of the multicast invalidate transaction.
• A broadcast invalidate is a three-step transaction initiated by a L2-cache controller when it either replaces a line, or receives a write request to a replicated cache line, and this cache line has a number of copies larger than the DHCCP threshold. The L2 cache sends a single message broadcast_inval, which is dynamically replicated by the network to all L1 caches. L1 caches which have a copy must respond with a cleanup message. All these cleanup responses are counted by the L2 cache to detect the completion of the transaction. Finally, the L2-cache controller acknowledges each invalidation with a clack.

The multi_updt, multi_inval, broadcast_inval and clack messages use the direction L2 → L1 cache, while the cleanup and multi_ack messages use the direction L1 → L2 cache.

A L1-cache controller must be sure that a sent cleanup request has arrived at the L2 cache before sending a read request for the same line; otherwise, there would be a risk of inconsistency if the latter request passes the cleanup. To enforce this, L2 caches must respond to each cleanup message with a clack. In order to avoid deadlocks, clack responses must use a physically separated network. Therefore, the coherence transactions require three separated networks:

• cleanup and multi_ack share the first network.
• multi_updt, multi_inval and broadcast_inval share the second network.
• clack messages are conveyed by the third network.

Direct transactions use two further networks: one for requests and one for responses, for a total of five independent networks.

Verification Objectives.
Even if intensive testing is mandatory, formal verification can help detect subtle bugs due to uncommon interleavings of messages on the different networks. In our case, one of the main challenges in the DHCCP protocol consists in counting the correct number of responses for each coherence transaction, since an incorrect count will eventually result in a deadlock.

Our goals were first to model the DHCCP protocol formally so as to prove the absence of deadlocks. We also wanted to prove simple functional properties, such as: every request receives a response, or: a shared copy modification eventually leads to an invalidation or an update of the other copies. Proving these properties would greatly strengthen our confidence in the protocol.

We did not expect verification to scale up to the full 1024-core design, but that is not truly necessary due to the symmetry of DHCCP. Our main goal was then to target a platform configuration that would exhibit all the characteristics of DHCCP, by working up progressively from smaller configurations. To exhibit all relevant behaviors, we determined that we needed a DHCCP threshold strictly less than the number of possible copies of a piece of data, so that all types of coherence transactions can occur; hence we need to scale to a design with at least three processors. Another parameter we identified is that at least two addresses should be represented, to show that no problem arises from the sharing of the channels between the different L1 caches and L2 banks.
Architecture Parameters.
In order to be able to verify properties, the components in the architecture need to be abstracted, and some restrictions have to be made. The modeled components are the processor, L1 cache, L2 cache, memory and interconnect channels. All of these components except channels are modeled as reactive communicating automata: they have a control location or state that defines which messages they can send or receive.

As seen in the section on the interconnect, the topology is logically "flat", and is modeled as such. Channels have the capability to serialize requests coming from different sources, and to route a request to different destinations.

The architecture parameters are the following:

• Number of processors and their L1 caches: NB_PROC.
• Number of cache lines in the L2 cache: NB_L2. Each L2-cache bank contains only one line in the model, and thus several cache lines translate into several L2-cache banks.
• DHCCP threshold: CACHE_TH.

This parametrized description should be preserved during the modelling, to be able to easily consider configurations of increasing complexity.

There are three significant restrictions in the models we built: 1. the single-line L2-cache bank, 2. the assumption that the L2 is large enough to contain all memory lines, and 3. the absence of data modelling.
The first restriction prevents a L2-cache bank from receiving on its port requests at different addresses. This is a reasonable simplification, as there is globally not much interaction between requests targeting different cache lines in the L2 cache. The second restriction prevents several memory addresses from being in the same L2 bank; lifting this restriction would require modifying the L2 state machine and adding several variables to it (thus yielding an even quicker combinatorial explosion). The third restriction means that we only model the control part of the protocol, and abstract away the content of the data in the cache lines. Adding data is easy from a modelling point of view, and would enable verification of some coherency properties, but these data values do not impact correctness of the protocol (e.g. deadlock freedom) and they would participate in state space explosion.

Table 1 describes the variables we considered relevant inside each component. The processor model is very simple and sends random read or write requests at a random address. The L1 cache contains only one line, and the line validity is defined by the cache state. Each L1 cache is defined with a unique identifier. Each L2 cache contains the line address it can hold, along with arrays for the explicit list of copies, and a variable to store the copies count. A channel interconnects two components and is modeled as a one-place buffer. There is one channel per network, i.e. 5 between L1 caches and L2 banks in the case of DHCCP.

Figure 2 shows a system instance with NB_PROC = 2, NB_L2 = 2, and CACHE_TH = 2, with some of the components' variable values. Appendices A and B show the finite state machines for the L1 and L2 caches. All states are represented, but some transition actions are omitted for clarity. Appendix C describes the different messages modeled and the channels on which they are conveyed. These message types are directly obtained from the DHCCP description in section 2. Figure 3 shows the dynamics, through a sequence diagram describing a cache miss.

Table 1: Components modeled with their variables
Processor:
• state: FSM state
• addr: address of the last request emitted

Channel:
• address: target address of the request
• id (optional): identifier of the request sender
• type: request type

L1 cache:
• state: FSM state
• id: identifier of the cache (different for all caches)
• v_addr: address contained in the cache when it is valid (validity is encoded in the state)
• addr_save: address of the last request sent

L2 cache:
• state: FSM state
• line_addr: address of the line mapped in this L2-cache bank
• n_copies: number of copies for this line
• dirty: true if the line has been modified w.r.t. the memory
• src_save and src_save_clnup: used to save a request source id when a coherence request is needed
• cpt, cpt_clnup and rsp_cpt: counters used for the sending of multicast or broadcast requests
• v_c_id[] and c_id[]: arrays of size CACHE_TH storing the explicit list of copies; if v_c_id[i] == 1, then entry i in c_id is valid and c_id[i] contains the corresponding cache identifier.

Figure 2: Example of an architectural state in the model, in which the line with address 1 has a copy in both L1 caches
Figure 3: Message sequence in case of misses in the L1 and L2 caches. The messages are the following:
(1) The processor connected to the L1 cache sends a read request (message DT_RD) for the line with address 0 (PL1DTREQ channel).
(2) The L1 cache does not own a valid copy of the line and sends a RD message for the line at address 0 (L1L2DTREQ channel).
(3) The L2 cache in charge of address 0 receives the request and, as it misses, sends a GET message for address 0 to the memory (L2MEMDTREQ channel).
(4) The memory responds with a RSP_GET message to the requesting L2 (MEML2DTRSP channel).
(5) The L2 cache updates its state for the line and responds to the L1 with a RSP_RD message (L2L1DTRSP channel).
(6) Upon receiving the response, the L1 cache updates its state and responds to the processor with a RSP_DT_RD message (L1PDTRSP channel).
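The miss sequence of Figure 3, together with the one-place-buffer channel abstraction used throughout the models, can be replayed in a few lines of Python. This is an illustrative sketch of ours: the `Channel` class and `step` helper are made up, while channel and message names follow the paper.

```python
# One-place-buffer channels and the six-step read-miss sequence of
# Figure 3; `Channel` and `step` are our names, channel and message
# names follow the paper.
class Channel:
    def __init__(self):
        self.slot = None                   # empty <=> None

    def write(self, msg):
        assert self.slot is None, "one-place buffer is full"
        self.slot = msg

    def read(self):
        msg, self.slot = self.slot, None   # reading flushes the buffer
        return msg

nets = {name: Channel() for name in
        ("PL1DTREQ", "L1L2DTREQ", "L2MEMDTREQ",
         "MEML2DTRSP", "L2L1DTRSP", "L1PDTRSP")}

def step(chan, msg, addr=0):
    """Deliver one message: the sender writes, the receiver consumes."""
    nets[chan].write((msg, addr))
    return nets[chan].read()

trace = [step("PL1DTREQ",  "DT_RD"),      # (1) processor -> L1
         step("L1L2DTREQ", "RD"),         # (2) L1 miss -> L2
         step("L2MEMDTREQ", "GET"),       # (3) L2 miss -> memory
         step("MEML2DTRSP", "RSP_GET"),   # (4) memory -> L2
         step("L2L1DTRSP", "RSP_RD"),     # (5) L2 -> L1
         step("L1PDTRSP",  "RSP_DT_RD")]  # (6) L1 -> processor
```

The `assert` in `write` mirrors the fact that a sender blocks until the single slot is free.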
Promela Models.
To model DHCCP we first used the Promela language and its supporting tool Spin [10]. It offers as modeling framework asynchronous processes communicating over channels. The language itself is relatively comfortable: each component is described using code-like control structures (case, loop). The tool can exhibit traces as sequence diagrams, which are particularly valuable to develop and debug the model.

We first built [8] automata of the behavior of the L1 cache and the memory controller, and abstracted the activity of processors to arbitrary read and write requests. These automata were then encoded into Promela using labels and goto.
To validate the models, we used properties encoded into "observation" automata, synchronized with the system. In some cases, adding these automata proved to be a problem, as they can incorrectly block the system if not well designed. Also, to observe channels we had to duplicate channels and messages (one for the true channel, one for the observer), which is quite intrusive. Overall, this observation mechanism was quite cumbersome, and participated in state space explosion.

We separately analyzed the components in simple configurations before assembling them. The simple configurations helped validate that the Promela code correctly reflected the automata. Full model-checking was however only possible on the simplest instances, with a single processor and two addresses. For the full system, we were only able to use simulation and bounded model-checking (up to roughly 10 states).

This Promela model was then extended and refined with the same goals in mind [7]. We simplified and abstracted the data manipulation, and removed the observers. We were still unable to explore the full state space for three processors, reaching 270 steps in depth for 10 states. We explored instead two configurations with two processors, while varying the threshold variable. We were also able to include the LL/SC (Load-Link/Store-Conditional) support of the TSAR architecture, unfortunately leading to much more complex automata, as more control messages were introduced.

While partial-order reduction was activated, we could not use d_step as every set of actions contains at least one channel interaction. The channels themselves are shared for both writing and reading (precluding use of the xr/xs keywords) since they model a bus.

In conclusion, despite some aggressive simplifications (e.g. full data abstraction), we were unable to fully verify deadlock freedom of the Promela models for even the smallest truly relevant configuration, i.e. at least three processors, two addresses, and a threshold of 2 so that both multicasts and broadcasts can occur. The language and simulator were however quite comfortable to use.

Divine Models.
We then built a second set of models [5], but this time using the Divine [2] language. Similarly to Promela, Divine is a language to express systems as (asynchronous) processes communicating over channels. However, it is a much simpler language with fewer features. Each process is described using an automaton with "local" variables; transitions have a source and target state, a guard or enabling condition over variables, and may send or receive messages from channels and update variables.

We chose to use Divine mostly because we wanted to try tools other than Spin. Thanks to its relatively simple syntax and semantics, and also due to the existence of a nice set of benchmark models [9], several tools provide support for the language. The Divine tool coming with the language handles LTL in multi-core and distributed settings, LTSmin [3] offers support for both explicit and symbolic exploration, and, building upon new results [4], a prototype path to input Divine models to our symbolic model-checker ITS-tools had recently been built.

Building the model in Divine was harder than in Promela however. On the positive side, we have good control over the atomicity of statements; in the Promela models, due to the interaction of send/receive actions with the atomic keyword, some interleavings of independent progressions of different processes were still observable. For instance, a (passive) component that consumes a message and immediately sends an answer on a different channel could not be modeled without the state between receive and send being materialized.

On the other hand, we ran into some difficulties modeling channels; Promela let us peek at the content of a channel without consuming it, enabling a routing mechanism where an entity only consumes messages which are addressed to it.
Without this mechanism, we were in fact unable to model the semantics we needed; we had to resort to explicit modeling of the channel as shared global variables, simulating read and write operations with instructions. Divine's support for parametric modeling relies on the m4 preprocessor, which is not comfortable.

During this internship, we again built separate components, then progressively assembled them to build more complex configurations. We used LTL (instead of observation automata) to express expected properties of the system, under fairness constraints that force all of the processes to progress. Without fairness, most of the properties were not valid, which was expected; unfortunately, fairness is not supported by all the tools.

With these models, we were able to reproduce a real deadlock found in a previous version of the DHCCP protocol. In the configuration with two processors, two addresses and a threshold set to 1, the scenario exhibited by the sequence diagram in Figure 4 could occur. It leads to an incorrect counter value, which ultimately leads to a deadlock by propagation, as the head message in the FIFO channel cannot be consumed and communications lock down.

This bug was detected in the cycle-accurate TSAR prototype by its designers, using testing and simulation, but it took one year to detect this issue and correctly diagnose it. Building a solution to correct the problem was non-trivial, and took another six months. In contrast, once the formal models were built, detecting it was easy, even for a relatively non-expert Master 2 student, and model-checking could exhibit readable diagnostic traces.
We were then able to modify our model to match the next (correct) version of the DHCCP protocol described in section 2, and we checked that the deadlock issue was indeed resolved.

However, we were still unable to fully verify larger configurations with three or more processors. The use of ITS-tools was possible only for deadlock detection, as fairness constraints were unavailable to check the more complex LTL properties. The input from Divine to ITS-tools also involves several automated transformation steps, which yielded a relatively complex model, due in part to channel communications modeled as shared variables, and to the loss of structural information in the transformation (i.e. from a set of processes to a single large specification). Experimentation with LTSmin was not extensive, but we measured performance similar to that of the Divine tool; again, chains of transformations and relatively poor tool integration made this path less comfortable than just using Divine natively.

In conclusion, using Divine models, we were able to successfully reproduce a critical bug in the TSAR coherence protocol, and to prove that the patch did solve the problem. However, we were still unable to perform verification for larger instances with three or more processors.

Guarded Action Language.

The third set of models was built using the Guarded Action Language (GAL) [6] during a Master's thesis [11]. GAL is a language offering very fine control over the expression of concurrent semantics, with no assumptions on the existence of higher-level constructs such as processes or channels, and very few keywords.
GAL is a formalism supporting hierarchical descriptions of components; terminal or leaf components are specified as GAL type declarations, while composite type definitions allow to instantiate existing types (of GAL or composite nature) and synchronize these instances. A GAL specification is then composed of a set of type declarations, and a specific instance main which is designated as the full system. These characteristics of the language, borrowed from architectural description languages, help reuse model elements in various configurations easily.

A GAL type declaration defines a set of integer variables and fixed-size arrays of integers as variables. A state of a GAL is then a complete assignment of an integer value to each of these variables. Transitions are defined as a triplet ⟨g, l, a⟩, where g is a boolean enabling condition or guard, l is a label chosen from a finite set, and a is a sequence of statements or actions that must be executed atomically. A statement can be an assignment to a variable of an expression computed over variables, or a call to a label. The call statement is resolved by finding a transition bearing the target label, whose guard is enabled, and executing it.

Figure 4: Part of a deadlock sequence in a previous version of the DHCCP protocol. Blue arrows represent direct messages while red arrows represent coherence messages. Messages can be received out of order due to the existence of several channels between the memory controller and the L1 cache. The RSP_B_INV message was exclusive to that version of DHCCP. At the end of the sequence, the line count in the L2 is one whereas there are no copies left.

A composite type declaration defines a set of instances and fixed-size arrays of instances as variables. A state of a composite is then a complete assignment of a subcomponent state to each of these instances. Synchronizations are defined as a pair ⟨l, a⟩, where l is a label chosen from a finite set, and a is a sequence of call statements that must be executed atomically. A call statement has a target, which can be either the enclosing instance itself, or any nested instance. The target of the call must then evolve through a transition (or synchronization in the composite case) that bears the target label.

GAL also offers parametric modeling features, in the form of parameters defined over a discrete range of values. These parameters let us define parametric transitions, which correspond to a set of non-parametric transitions, one per possible value of the parameters. Parameters can be used to define parametric labels, in order to model communications over discrete data types as calls to labels.

GAL is mostly designed to be the target in a model transformation process, where the specification is typically expressed using a domain-specific notation such as Promela or Divine, and automatically translated to GAL for analysis purposes.
Going for direct modeling in GAL, however, gives us proximity to the symbolic solution engine, enabling the use of advanced features that are not built into the general-purpose transformation, e.g. from Divine. In particular, the automatic transformation loses structural information, yielding a single GAL component instead of a composition of processes. The encoding of channels as global variables also prevents the transformation from automatically building a representation using synchronizations on labels. (In the GAL syntax, parameters are distinguished from variables by a $ sign.)

```
typedef addr_t = 0 .. $NB_L2 - 1;
typedef type_t = 0 .. 19;
typedef id_t   = 0 .. $NB_PROC - 1;

gal ChannelAddrType {
  int isFull = 0;
  int addr = 0;
  int type = 0;

  transition read (addr_t $addr, type_t $rtype)
      [ isFull == 1 && addr == $addr && type == $rtype ]
      label "read" ($addr, $rtype) {
    isFull = 0; addr = 0; type = 0;
  }

  transition write (addr_t $addr, type_t $wtype)
      [ isFull == 0 ]
      label "write" ($addr, $wtype) {
    isFull = 1; addr = $addr; type = $wtype;
  }
}

gal CacheL1 {
  int state = $INIT;   // state in the automaton
  int v_addr = 0;      // address in cache if VALID
  int addr_save = 0;   // saves the address of a sent request
  int id;              // fixed identifier of this L1 cache

  transition t_init (id_t $id) [ state == $INIT ]
      label "init" ($id) {
    state = $L1_EMPTY;
    id = $id;
  }

  transition t_Empty_WriteWaitEmpty (id_t $id, addr_t $addr)
      [ state == $L1_EMPTY && id == $id ]
      label "read_PL1DTREQ_write_L1L2DTREQ" ($id, $addr, $DT_WR, $WR) {
    state = $L1_WRITE_WAIT_EMPTY;
    addr_save = $addr;
  }
  ...
}
```

Figure 5: GAL encoding of a channel carrying messages, featuring a type and an address field (top), and part of the GAL model of the L1 cache (bottom).
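As a reading aid, the GAL transition semantics — guarded transitions ⟨g, l, a⟩ resolved by label — can be mimicked by a toy interpreter. This is our own Python sketch, in no way the ITS-tools engine; the `Gal` class and its methods are made-up names.

```python
import random

# A transition is a triple (guard, label, actions); a call statement
# resolves to some enabled transition bearing the target label.
class Gal:
    def __init__(self, variables):
        self.vars = dict(variables)
        self.transitions = []   # list of (guard, label, actions)

    def add(self, guard, label, actions):
        self.transitions.append((guard, label, actions))

    def call(self, label):
        """Resolve a call: execute one enabled transition with this label."""
        enabled = [(g, l, a) for (g, l, a) in self.transitions
                   if l == label and g(self.vars)]
        if not enabled:
            raise RuntimeError("no enabled transition for label " + label)
        _, _, actions = random.choice(enabled)
        for act in actions:     # the statement sequence runs atomically
            act(self.vars)

# A one-place channel re-encoded in this style, for a fixed message:
chan = Gal({"isFull": 0, "addr": 0})
chan.add(lambda v: v["isFull"] == 0, "write",
         [lambda v: v.update(isFull=1, addr=7)])
chan.add(lambda v: v["isFull"] == 1, "read",
         [lambda v: v.update(isFull=0, addr=0)])
```

A call on a label whose guards are all false raises an error, mirroring the fact that a GAL call can only proceed through an enabled transition.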
Elementary Components.
We separately developed the models of the communication channels and of the various components. For normal components, we model each automaton by defining a state variable, then adding a GAL transition for every transition of the automaton.

For channels, we built a GAL with enough variables to store the message, and a boolean flag to test whether the channel is full. For each possible message going through the channel (which must belong to a finite domain) we generate two transitions, read and write, with a distinct label for each of them. In this example (see Figure 5) the channel messages carry a target address (among the possible memory slots) and a message type (among 20 possible values, e.g. update request). The read operation also flushes the state of the channel, to prevent this unreadable state information from participating in the state space explosion.

The processor is modeled as a three-state automaton, alternating between an idle state and a state where a read (resp. write) request is sent to the L1 cache on an arbitrary address. Then, it awaits the reply to go back to idle. The memory cells are even simpler: since data values have been abstracted away, they feature a single state and two transitions that simply acknowledge and reply appropriately to PUT and GET requests.

The L1 and L2 caches are much larger, with respectively 14 and 16 states. They also feature variables that increase the state space size significantly. When transitions of an automaton read or write on channels, we add labels to those actions, indicating which channel is used. The message data is filtered by specifying values in the target label. For instance, the transition from EMPTY to WRITE_WAIT_EMPTY shown in Figure 5 reads a DT_WR request from the channel PL1DTREQ and writes a WR to the channel L1L2DTREQ. Some components such as the L1 cache have a set identifier defined at initialization, that is used to tag outgoing messages, or to filter messages according to their target address. Transitions without communications on channels are left unlabeled, and can thus occur at any time.

    composite ProcessorCacheL1 {
        Processor p ;
        CacheL1 c ;
        ChannelAddrType chan_PL1DTREQ ;
        ChannelAddrType chan_L1PDTACK ;

        synchronization init (id_t $id) label "init" ($id) {
            c."init" ($id) ;
        }

        // local communications are unlabeled
        synchronization s_write_PL1DTREQ (addr_t $addr, type_t $type) {
            p."write_PL1DTREQ" ($addr, $type) ;
            chan_PL1DTREQ."write" ($addr, $type) ;
        }

        // exposing the ports of the L1 cache
        synchronization s_read_L2L1CPREQ (id_t $id, addr_t $addr, type_t $type)
                label "c_read_L2L1CPREQ" ($id, $addr, $type) {
            c."read_L2L1CPREQ" ($id, $addr, $type) ;
        }
        ...
    }

Figure 6: Composite encoding a component nesting a Processor, an L1 cache and two channels.
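To make the channel encoding concrete, the following is a minimal GAL sketch of such a channel. The type names, domain sizes (a 4-slot address space and 20 message types) and variable names are illustrative assumptions, not necessarily those of the actual model:

```
typedef addr_t = 0..3 ;   // target addresses (memory slots), illustrative size
typedef type_t = 0..19 ;  // 20 message types, e.g. update request

gal ChannelAddrType {
    int full = 0 ;   // boolean flag: is a message present?
    int addr = 0 ;
    int mtype = 0 ;

    // one write per possible message, obtained through the parameters
    transition write (addr_t $addr, type_t $type) [ full == 0 ]
            label "write" ($addr, $type) {
        addr = $addr ;
        mtype = $type ;
        full = 1 ;
    }

    // read consumes the message and flushes the channel state,
    // so all empty channels share a single canonical state
    transition read (addr_t $addr, type_t $type)
            [ full == 1 && addr == $addr && mtype == $type ]
            label "read" ($addr, $type) {
        addr = 0 ;
        mtype = 0 ;
        full = 0 ;
    }
}
```

A read with parameters $addr and $type only succeeds when the stored message matches, which is how receivers filter messages by value; resetting all fields on read keeps the state space small as discussed above.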
Assembling a Configuration.
From these models of channels and components we hierarchically build a representation of the full system. A first composite type CompositeCacheL1 is defined, containing an instance of a Processor, an instance of an L1 cache, and two instances of channels connecting them together. Channels are connected to the appropriate endpoints using synchronizations. Unlabeled synchronizations are used for local communications within the composite. The synchronization s_write_PL1DTREQ shown in Figure 6 is an example of this, and models the processor p sending a write request. The contents of the message (address and type) can be anything at this synchronization level. The labels that correspond to communications between the L1 cache and the L2 cache are however re-exposed as labels of the composite, which simply forwards the request to the L1 cache instance.

We then build a top-level composite acting as main, which contains an array of instances of CompositeCacheL1, an array of instances of L2 caches and all the channel instances connecting them. The whole description is parametric and controlled by the three parameters set at the top of the file. Several other configurations were built to test each component in isolation.
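The top level can be sketched as below; instance names and the channel type are illustrative assumptions, and only two explicit instances are shown instead of the parametric arrays of the actual model. The sketch shows how a label re-exposed by the composite (such as c_read_L2L1CPREQ from Figure 6) is bound to a channel endpoint:

```
composite Main {
    ProcessorCacheL1 pc0 ;
    ProcessorCacheL1 pc1 ;
    CacheL2 l2 ;
    // channel carrying coherence requests from the L2 towards L1 number 0
    ChannelIdAddrType chan_L2L1CPREQ_0 ;

    // bind the channel output to the re-exposed L1 port:
    // the message is delivered atomically to the L1 cache
    synchronization read_L2L1CPREQ_0 (id_t $id, addr_t $addr, type_t $type) {
        chan_L2L1CPREQ_0."read" ($id, $addr, $type) ;
        pc0."c_read_L2L1CPREQ" ($id, $addr, $type) ;
    }
    ...
}
```

In the actual model these instances and synchronizations are generated for each index up to NB_PROC and NB_L2, which is where the parametric features of GAL pay off.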
Experiments and Measurements.
On these models we were able to prove absence of deadlocks and some logical properties expressed in CTL. Overall we considered all configurations with up to NB_PROC = 6 processors, up to NB_L2 = 3 L2 addresses, and a CACHE_TH between 1 and 3. For each configuration, we ran the verification on an Intel Xeon 2.6 GHz machine with a limit set to 8 hours to complete the simulation, and a maximum of 192 GB of memory. Table 2 reports on the experiments that finished within the time and memory constraints, and gives their time, the number of accessible states and the memory used.

The state space size is relatively modest, compared to the models crafted in other languages, thanks to the fine control over the atomicity of steps offered by GAL. We were indeed able to scale up to the configurations of interest, i.e. 3 processors, 2 L2 caches, and a threshold of 2.

A set of 16 properties were expressed in CTL, covering request-response scenarios such as "any time two processors share an address and one writes to it, the other one eventually gets an invalidate request".
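As an illustration, the quoted property could be phrased in CTL roughly as follows, where shared_i, shared_j, wr_i and inv_j are hypothetical atomic propositions over the model variables (caches i and j hold a copy of the address, i issues a write on it, j receives an invalidate request):

```
AG ( (shared_i && shared_j && wr_i) -> AF inv_j )
```

The remaining properties follow the same request-response pattern, instantiated for the various pairs of components and message types.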
Table 2: Results for the configurations (NB_PROC, NB_L2, CACHE_TH) which could be explored. Times are in seconds and memory in MB.

PROC  L2  TH    States      Time    Mem
  1    1   1        51      0.05      5
  1    1   2        52      0.05      5
  1    1   3        53      0.06      5
  1    2   1       555      0.13      7
  1    2   2       565      0.12      7
  1    2   3       575      0.18      8
  1    3   1      4503      0.47     15
  1    3   2      4572      0.52     15
  1    3   3      4641      0.44     15
  2    1   1      7070      0.54     14
  2    1   2      1892      0.46     15
  2    1   3      2160      0.47     15
  2    2   1    681471     14       232
  2    2   2     68401      5        98
  2    2   3     77449      6        99
  2    3   1  2.76e+07    640.14    348
  2    3   2  1.13e+06     44       568
  2    3   3  1.27e+06     46       599
  3    1   1    175234      3        57
  3    1   2    226329     35       326
  3    1   3    130450     51       625
  3    2   1  4.32e+07    860.58   3871
Conclusion. We presented a case study focusing on the modeling of a cache coherence protocol in GAL, and discussed previous implementations using other languages and tools for the same protocol. We noticed a certain difficulty, for the students who worked on the subject, in assimilating the underlying formalism of each tool. We also noticed how small changes in the language semantics can result in big changes, either in the model description or in the state space size. In particular, the communication and synchronisation primitives offered by a language are of high importance for obtaining clean and efficient models.

The parametric and compositional features of GAL proved adequate for writing models directly by hand. The parametric features are useful both for writing a model with several instances of a same module (e.g. a cache) and for exploiting internally the similarities between these modules to improve verification efficiency.

Overall, this case study showed that model-checking tools can highlight real bugs and can run on problems of substantial size, provided appropriate formal models can be built accurately.
Acknowledgements
This work would not have been possible without the contributions of the students who worked on the project: Mohamad Najem, Akli Mansour, Zahia Gharbi and Di Zhao.
References

[1] TSAR: Tera-Scale Multiprocessor ARchitecture home page. Available at .
[2] Jiří Barnat, Luboš Brim & Petr Ročkai (2009): DiVinE 2.0: High-Performance Model Checking. In: High Performance Computational Systems Biology (HiBi 2009), IEEE Computer Society Press, pp. 31–32, doi:10.1109/HiBi.2009.10.
[3] Stefan Blom, Jaco van de Pol & Michael Weber (2010): LTSmin: Distributed and symbolic reachability. In: Computer Aided Verification (CAV), Springer, pp. 354–359, doi:10.1007/978-3-642-14295-6_31.
[4] Maximilien Colange, Souheib Baarir, Fabrice Kordon & Yann Thierry-Mieg (2013): Towards Distributed Software Model-Checking using Decision Diagrams. In: Computer Aided Verification (CAV), LNCS 8044, Springer Verlag, pp. 830–845, doi:10.1007/978-3-642-39799-8_58.
[5] Zahia Gharbi (2013): Vérification compositionnelle du Protocole de Cohérence de Cache de la Machine Multiprocesseur TSAR (in French). Master's thesis, Université Pierre et Marie Curie.
[6] ITS-tools model checker and GAL language home page. Available at http://ddd.lip6.fr/.
[7] Akli Mansour (2012): Modélisation et Analyse du protocole de cohérence de caches de la machine multiprocesseur TSAR : Absence de deadlocks (in French). First Year Master Student Project, Université Pierre et Marie Curie.
[8] Mohamad Najem (2011): Modélisation et Analyse du protocole de cohérence de caches de la machine multiprocesseur TSAR (in French). First Year Master Student Project, Université Pierre et Marie Curie.
[9] Radek Pelánek (2007): BEEM: Benchmarks for Explicit Model Checkers. In: Model Checking Software, 14th Int'l SPIN Workshop, LNCS.
[10] Spin model checker home page. Available at http://spinroot.com/.
[11] Di Zhao (2015): Vérification de protocole de cohérence de cache hybride multicast/broadcast avec les techniques de model-checking (in French). Master's thesis, Université Pierre et Marie Curie.
A L1 Cache Finite State Machine
[Diagram: 14 states — EMPTY, MISS, MISS_RETRY_RD, MISS_TO_RETRY, MISS_M_UP, MISS_CLNUP, MISS_WAIT, MISS_RETRY, VALID_DATA, ZOMBIE, WRITE_WAIT_VALID, WRITE_WAIT_CLACK, WRITE_WAIT_CLACK2, WRITE_WAIT_EMPTY — with transitions guarded by message receptions (e.g. ?B_INV, ?RSP_RD, ?DT_WR, ?CLACK), address comparisons, and message emissions (e.g. !CLNUP, !RD, !WR).]
Figure 7: L1 cache finite state machine. '?' denotes the reception of a message; '!' denotes the sending of a message; → denotes actions associated to a guard. Some actions are omitted for readability.
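As a sketch of how one edge of this FSM maps to GAL, the ?DT_WR → !WR transition from EMPTY to WRITE_WAIT_EMPTY could be encoded as below. The integer state numbering and the label name are illustrative assumptions; the actual read on PL1DTREQ and write on L1L2DTREQ are bound to this label by a composite synchronization:

```
gal CacheL1 {
    // FSM state encoded as an integer: 0 = EMPTY, ..., 13 = WRITE_WAIT_EMPTY
    // (illustrative numbering)
    int state = 0 ;
    int addr_save = 0 ;

    // ?DT_WR --> !WR : only fires when the enclosing composite can
    // simultaneously read DT_WR on PL1DTREQ and write WR on L1L2DTREQ
    transition empty_dt_wr (addr_t $addr) [ state == 0 ]
            label "dt_wr_to_wr" ($addr) {
        addr_save = $addr ;
        state = 13 ;
    }
}
```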
B L2 Cache Finite State Machine
[Diagram: L2 cache states — EMPTY, READ_WAIT, GET_WRITE_WAIT, WRITE_WAIT, VALID_MULTICAST, VALID_MULTICAST_READ, VALID_MULTICAST_CLNUP, VALID_MULTICAST_UPDATE, VALID_MULTICAST_UPDATE_CLNUP, VALID_BROADCAST_INIT, VALID_BROADCAST, VALID_BROADCAST_INV, BROADCAST_INV_PUT, BROADCAST_INV_WAIT, UPDATE_WAIT, UPDATE_WAIT_CLNUP, PUT_WAIT — with transitions manipulating the copy counters N_COPIES, CPT and RSP_CPT, the copy vector V_C_ID, and the threshold CACHE_TH that switches between multicast and broadcast invalidation.]