An Order-Aware Dataflow Model for Parallel Unix Pipelines
Shivam Handa, Konstantinos Kallas, Nikos Vasilakis, Martin Rinard
Conference'17, July 2017, Washington, DC, USA
Shivam Handa
CSAIL, MIT, USA [email protected]
Konstantinos Kallas
University of Pennsylvania, USA [email protected]
Nikos Vasilakis
CSAIL, MIT, USA [email protected]
Martin Rinard
CSAIL, MIT, USA [email protected]
Abstract
We present a dataflow model for extracting data parallelism latent in Unix shell scripts. To accurately capture the semantics of Unix shell scripts, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We use a large number of real scripts to evaluate the parallel performance delivered by the dataflow transformations, including the contributions of individual transformations, achieving an average speedup of 6.14× and a maximum of 61.1× on a 64-core machine.

The Unix shell is an attractive choice for specifying succinct and simple scripts for data processing, system orchestration, and other automation tasks [38]. Consider, for example, the following script based on the original spell program by Johnson [3], lightly modified for modern environments:

    cat f1.md f2.md | tr A-Z a-z |                     # (Spell)
      tr -cs A-Za-z '\n' | tr -d '[:punct:]' |
      sort | uniq | comm -13 dict.txt - > out
    cat out | wc -l | sed 's/$/ mispelled words!/'

The script streams two markdown files into a pipeline that converts characters in the stream into lower case, removes punctuation, sorts the stream in alphabetical order, removes duplicate words, and filters out words from a dictionary file (lines 1-3). A second pipeline (line 4) counts the resulting lines to report the number of misspelled words to the user.

As this example illustrates, the Unix shell promotes a model of computation in which each command executes sequentially, with pipelined parallelism available between commands executing in the same pipeline.
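As a quick illustration (our own sketch, not part of the original program), pipelined concurrency can be observed directly: the shell starts both sides of a pipe at once, and the consumer processes lines as they arrive.

```shell
# Both stages run concurrently: tr is started immediately and uppercases
# each line as soon as the producer (which sleeps between lines) emits it.
( printf 'one\n'; sleep 0.2; printf 'two\n' ) | tr a-z A-Z > pipe.out
cat pipe.out
```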
This pipelined model of parallel computation leaves substantial data parallelism, i.e., parallelism achieved by splitting inputs into pieces and feeding the pieces to parallel instances of the script, unexploited. (Johnson's original program additionally used troff, prepare, and col -bx to clean up now-legacy formatting metadata that does not exist in markdown.)

This fact is known in the Unix community and has motivated the development of a variety of tools that attempt to automatically exploit data parallelism in shell scripts [22, 49]. Unfortunately, these tools do not accurately model the semantics of the shell and are therefore unsound, both in theory and in practice: they can easily generate highly performant parallelizations that produce incorrect results. (For an illustration of how one such tool [49] produces incorrect results on two scripts, one being Spell, see Appendix A.1.)

To support the ability to reason about and transform Unix shell scripts, we present a new order-aware dataflow model (ODFM). In contrast to standard dataflow models [24-26, 30, 31], our dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. Explicitly modeling this order is crucial for accurately capturing the semantics of Unix shell scripts, as observed independently by different authors [5, 43]. In the Spell script shown earlier, for example, while all commands consume elements from an input stream in order (a property of Unix streams, e.g., pipes and FIFOs), they differ in how they consume across streams: cat reads input streams in the order of its arguments, sort -m reads input streams in interleaved fashion, and comm -13 first reads dict.txt before reading from its standard input.

We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. The model and proofs of correctness can therefore provide the foundation for a source-to-source compiler that takes as input a POSIX shell script, translates it into a dataflow graph, applies a series of parallelizing transformations, then translates the resulting parallel dataflow graph back to a parallel POSIX shell script, with the parallelism guided explicitly via shell constructs.

We apply our techniques by building a prototype of such a parallelizing shell compiler. The experimental results show that our transformations can deliver significant performance improvements in this context, specifically an average speedup of 6.14× and up to 61.1× on a 64-core machine over 43 unmodified shell programs. The research presented in this paper provides a model and correctness proofs, resulting in a provably correct parallelizing compiler for Unix shell scripts.

This paper makes the following contributions:

• Order-Aware Dataflow Model: It introduces the order-aware dataflow model (ODFM), a dataflow model tailored to the Unix shell that captures information about the order in which nodes consume inputs from different input edges (§4).

• Translations: It formalizes the bidirectional translations between shell and ODFM required to translate shell scripts into ODFMs and ODFMs back to parallel scripts (§5).

• Transformations and Proofs of Correctness: It presents a series of ODFM transformations for extracting data parallelism. It also presents proofs of correctness for these transformations (§6).

• Results: It presents experimental results that evaluate the parallel performance delivered by the transformations, including an analysis of the contributions of individual transformations to the overall parallel performance (§7).

The paper starts with an informal development building the necessary background (§2) and expounding on Spell (§3). It then presents the three main contributions outlined above (§4-7). It then discusses comparison with prior work (§8), and closes with a short discussion on the broader applicability of this work (§9). The Appendix highlights the correctness issues faced in two use cases of semi-manual script parallelizations (§A.1), and contains the proofs of our correctness claims (§A.2).
This section reviews background on commands and abstractions in the Unix shell.
Basics: Commands and Streams: A key Unix abstraction is the data stream, operated upon by executing commands or processes. Streams are sequences of bytes, but most commands process them as higher-level sequences of lines, with the newline character delimiting each line and the EOF condition representing the end of a stream. Streams are often referenced by some identifier in a global name-space made available by the Unix file-system, such as /home/user/x. Some streams can persist as files beyond the execution of the process, whereas other streams are ephemeral in that they only exist to connect the output of one process to the input of another process during their execution.

Each process is an independent computation unit that reads one or more input streams, performs a computation, and produces one or more output streams. Several commands are streaming, in the sense that they read their inputs and produce their outputs incrementally, i.e., line by line. Streaming commands often maintain minimal, if any, state when processing their inputs; in many cases they operate on each element individually or on pairs of adjacent elements.

As processes may need to access multiple streams during their execution, the order of these accesses is important. In some cases, they read streams in the order of the stream identifiers provided. In other cases, the order is different; for example, an input stream may configure a process so that it must be read before all the others. In still other cases, reads from multiple streams are interleaved according to some command-specific semantics.
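The in-order consumption of cat, for example, can be observed with named pipes; a minimal sketch (file names are illustrative):

```shell
cd "$(mktemp -d)"
mkfifo a b
{ sleep 0.1; printf 'from-a\n' > a; } &   # the writer for a shows up late
printf 'from-b\n' > b &                   # the writer for b is ready first
cat a b > order.out                       # cat still reads a to EOF before b
wait
cat order.out
```

Even though b's data is available first, cat opens and drains a before touching b, so the output order follows the argument order.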
Composition: Unix Operators: Unix provides several primitives for program composition, each of which imposes different scheduling constraints on the program execution. Central among them is the pipe (|), a primitive that passes the output of one process as input to the next. The two processes form a pipeline, producing output and consuming input concurrently and possibly at different rates. The Unix kernel facilitates program scheduling, communication, and synchronization behind the scenes. For example, Spell's first tr transforms each character in the input stream to lower case, passing the stream to the second tr: the two trs form a parallel producer-consumer pair of processes.

Apart from pipes, the language of the Unix shell provides several other forms of program composition. These include the sequential composition operator (;) for executing one process after another has completed, the parallel composition operator (&) for executing a process in the background, concurrently with the command that follows, and the short-circuiting logical operators (&& and ||) for executing a process only if the previous process succeeds or fails, respectively.

Other details of the Unix shell: Apart from the semantics of composition operators and individual commands, the execution of a shell script may additionally depend on other details such as a command's flags, any arguments that have not been expanded by the shell, and the values of variables available in the broader environment.

Unix commands expose many flags that potentially affect their behavior, including their interaction with streams. Consider
Spell's comm command, which performs a join-like operation to identify common elements between its two input streams. Invoked with no flags, comm produces an output stream with three columns: one with lines unique to the first input, one with lines unique to the second input, and one with lines that exist in both inputs. Invoking comm with any combination of -1, -2, and -3 suppresses the corresponding output column(s). In Spell's comm -13 dict.txt -, the first and third columns are suppressed, and thus comm can be viewed as reading dict.txt as a static "configuration" input.

Given a script, our techniques identify its dataflow regions, translate them to DFGs (Shell → DFG), apply graph transformations that expose data parallelism on these DFGs, and replace the original dataflow regions with the now-parallel regions (DFG → Shell). We illustrate these phases using Spell (§1).
Shell → DFG: Provided a shell script, our techniques start by identifying subexpressions that are potentially parallelizable. The first step is to parse the script, creating an abstract syntax tree. [AST figure omitted; it elides non-stream flags and abbreviates the tr-sort stages as a dotted edge ending with cat.] We then identify dataflow barriers within the shell script: operators that enforce synchronization constraints, such as the sequential composition operator (;). We call any set of commands that does not include a dataflow barrier a dataflow region. In our example there are two dataflow regions, which correspond to the following dataflow graphs:

[Figure: DFG1 runs cat f1 f2 through the tr, sort, uniq, and comm stages into out; DFG2 runs cat out through wc and sed.]

For the rest of this section we focus on DFG1, since we expose parallelism in each DFG separately.
Parallelizable Commands: The nodes of the dataflow graphs are shell commands. We target divide-and-conquer parallelism available in parallelizable commands, splitting the input into pieces (at stream element boundaries), running the command on the pieces in parallel, then applying the command's corresponding aggregation function to produce the final output. We have developed aggregation functions for a range of shell commands:

    Command             Aggregation function
    cat                 cat $*
    tr A-Z a-z          cat $*
    tr -d α             cat $*
    sort                sort -m $*
    uniq                uniq $*
    comm -13 α -        cat $*
    wc -l               paste -d+ $* | bc
    sed 's/α/β/'        cat $*
The table above presents aggregation functions for the shell commands in our example (all of which are parallelizable). For example, applying tr over the entire input produces the same result as splitting the input into two, applying lower-case conversion via tr to the two partial inputs, and then merging the partial results with a cat aggregation function. Note that both split and cat are order-aware, i.e., split sends the first half of its input to the first tr and the rest to the second, while cat concatenates its inputs in order. This guarantees that the output of the DFG is the same as the one before the transformation.

Parallelization Transformations: The next step is to apply graph transformations to exploit parallelism present in the computation represented by the DFG. As each parallelizable Unix command comes with a corresponding aggregation function, our transformations first convert the DFG into one that exploits parallelism at each stage; applied to the two tr stages, this wraps each stage in a split node, two parallel tr instances, and a cat node.

After these transformations are applied to all DFG nodes, the next transformation pass is applied to pairs of cat and split nodes: whenever a cat is followed by a split of the same width, the transformation removes the pair and connects the parallel streams directly to each other. The goal is to push data parallelism transformations as far down the pipeline as possible, to expose the maximal amount of parallelism. A similar transformation is applied to the first cat. Applying this transformation to the first three stages of DFG1, i.e., cat, tr, and tr, produces a DFG in which the two input files flow through two parallel chains of tr nodes before being aggregated. The next node to parallelize is the sort node.
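Before turning to sort, the split/cat equivalence for tr can be checked end to end; a minimal sketch, assuming GNU split (whose -n l/2 splits the input into two chunks at line boundaries):

```shell
cd "$(mktemp -d)"
printf 'Foo\nBar\nBaz\nQux\n' > in.txt
tr A-Z a-z < in.txt > seq.out      # sequential baseline
split -n l/2 in.txt part.          # order-aware split: part.aa, part.ab
tr A-Z a-z < part.aa > p0 &        # parallel instance on the first half
tr A-Z a-z < part.ab > p1 &        # parallel instance on the second half
wait
cat p0 p1 > par.out                # order-aware aggregation
cmp seq.out par.out                # byte-identical outputs
```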
To merge the partial output of parallel sorts, we need to apply a sorted merge. (In GNU systems, this is available as sort -m, so we use this as the label of the merging node.) The transformation then removes cat, replicates sort, and merges the sort outputs with sort -m.
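The sort stage's parallelization, with sort -m as the aggregation function, can be sketched the same way (GNU split assumed again):

```shell
cd "$(mktemp -d)"
printf 'd\nb\nc\na\n' > in.txt
sort in.txt > seq.out              # sequential baseline
split -n l/2 in.txt part.          # split at line boundaries
sort part.aa > s0 &                # sort each half in parallel
sort part.ab > s1 &
wait
sort -m s0 s1 > par.out            # sorted merge of the partial outputs
cmp seq.out par.out
```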
It then continues pushing parallelism down the pipeline, after applying a split function to split sort -m's outputs.
[Figure: DFG1 after the cat, tr, sort, and uniq stages have been parallelized, with split and sort -m nodes connecting the parallel chains.]
As mentioned earlier, a similar pass of iterative transformations is applied to DFG2, but the two DFGs are not merged, to preserve the synchronization constraint of the dataflow barrier ';'.

Order Awareness: Data parallelism in dataflow-based systems (e.g., [9, 56]) is usually achieved using sharding, i.e., partitioning input based on some key, or using shuffling, i.e., arbitrary partitioning of inputs to parallel instances of an operator. For sharding to work, parallel instances must be stateless across different shards, i.e., keys; for shuffling to work, they must be commutative and associative. The Unix shell is different, because for most commands there is no obvious key to shard on and because parallelizable commands are often not commutative (e.g., uniq, cat -n), and thus data parallelism requires a more careful treatment of input partitioning.

As a first example, consider Spell's cat f1.md f2.md command, which starts reading from f2.md only after it has completed reading f1.md; note that either or both input streams may be pipes waiting for results from other processes. This order can be visualized as a label over each input edge. Correctly parallelizing this command requires ensuring that the parallel cat (and possibly follow-up stages) maintains this order. As a more interesting example, consider
Spell's comm, whose ODFG is shown on the right. Parallelizing comm without taking order into account is not trivial, because comm -13's set difference is not commutative: we cannot simply split its input streams into two pairs of partial inputs fed into two copies of comm without adding metadata capturing which subset partial results came from. Taking input ordering into account, however, highlights an important dependency between comm's inputs. The dict stream can be viewed as configuring comm, and thus comm can be modeled as consuming the entire dict stream before consuming partial inputs.

Armed with this insight, we can parallelize comm by passing the same dict.txt stream to both comm copies. This requires an intermediary tee for duplicating the dict.txt stream to both copies of comm, each of which consumes the stream in its entirety before consuming the results of the preceding uniq.

Ordering is also important for the translation of the DFG back to a shell script, as we do not interpret it directly. In the specific example, we need to know how to instantiate the arguments of each comm out of all possible options, e.g., comm -13 p1 p2, comm -13 p2 p1, cat p1 | comm -13 - p2, etc. Aggregators are Unix commands with their own ordering characteristics that need to be accounted for.

The order of input consumption in the examples of this section is statically known and can be represented as a sequence of the input edges of a node. To accurately capture the behavior of shell programs, however, our dataflow model is more expressive, allowing arbitrary orders of input consumption. The correctness of our parallelization transformations is predicated upon static but configurable orderings: a command reads a set of "configuration" streams to set up the consumption order of its input streams, which are then consumed in order, one after the other.
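The tee-based parallelization of comm -13 can be sketched concretely; a minimal example (file and FIFO names are illustrative; inputs must be sorted, as comm requires):

```shell
cd "$(mktemp -d)"
printf 'a\nan\nthe\n' > dict.txt   # sorted dictionary
printf 'a\nxyzzy\n' > part0        # first half of the sorted input stream
printf 'zzz\n' > part1             # second half
mkfifo d0 d1
tee d0 > d1 < dict.txt &           # duplicate the dictionary to both copies
comm -13 d0 part0 > r0 &           # words in part0 missing from the dictionary
comm -13 d1 part1 > r1 &           # words in part1 missing from the dictionary
wait
cat r0 r1 > misspelled.txt         # order-aware aggregation
cat misspelled.txt
```

Both comm copies receive the full dictionary stream, so each partial result is correct on its own half, and the in-order cat reproduces the sequential output.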
DFG → Shell: The transformed graph is finally compiled back to a script that uses POSIX shell primitives to drive parallelism explicitly. The resulting parallel script for Spell is shown below:

    mkfifo t{0,1,...}
    tr A-Z a-z < f1.md > t0 &
    tr A-Z a-z < f2.md > t1 &
    tr -d '[:punct:]' < t0 > t2 &
    tr -d '[:punct:]' < t1 > t3 &
    sort < t2 > t4 &
    sort < t3 > t5 &
    sort -m t4 t5 > t6 &
    split t7 t8 < t6 &
    ...
    comm -13 dict.txt - < t9 > t11 &
    comm -13 dict.txt - < t10 > t12 &
    cat t11 t12 > out &
    wait
    rm t{0,1,...}
    mkfifo t{0,1,...}
    split t0 t1 < out &
    wc -l < t0 > t2 &
    wc -l < t1 > t3 &
    paste -d+ t2 t3 | bc > t4 &
    split t5 t6 < t4 &
    sed 's/$/ mispelled words!/' < t5 > t7 &
    sed 's/$/ mispelled words!/' < t6 > t8 &
    cat t7 t8 &
    wait
    rm t{0,1,...}
The two ODFGs are compiled into the two fragments that start with mkfifo and end with rm. Each fragment uses a series of named pipes (FIFOs) to explicitly manipulate the input and output streams of each data-parallel instance, effectively laying out the ODFG. Aggregation functions are used to merge partial outputs from previous commands coming in through multiple FIFOs: for example, sort -m t4 t5 and cat t11 t12 for the first fragment, and paste -d+ t2 t3 | bc and cat t7 t8 for the second. A wait blocks until all commands executing in parallel complete.

The parallel script is simplified for clarity of exposition: it does not show the details of input splitting, the handling of SIGPIPE deadlocks, and other technical details that are handled by the current implementation.
In this section we describe the order-aware dataflow model (ODFM) and its semantics.
As discussed earlier (§2), the two main shell abstractions are (i) data streams and (ii) commands communicating via streams. We represent streams as named variables and commands as functions that read from and write to streams.

    p ::= I ; O ; E
    I ::= input x̄
    O ::= output x̄
    E ::= { x̄_O ← f(x̄_I) }

Figure 1.
Dataflow Description Language (DDL).

We first introduce some basic notation formalizing the data streams on which our dataflow description language works. For a set D, we write D* to denote the set of all finite words over D. For words x, y ∈ D*, we write x·y or xy to denote their concatenation. We write ε for the empty word and ⊥ for the end-of-file condition. We say that x is a prefix of y, and we write x ≤ y, if there is a word z such that y = xz. The ≤ order is reflexive, antisymmetric, and transitive (i.e., it is a partial order), and is often called the prefix order. We use the notation D*·⊥ to denote a closed stream, abstractly representing a file/pipe stream that has been closed, i.e., one which no process will open for writing. The notation D* is used to denote an open stream, abstractly representing an open pipe; later, other processes may append new elements at the end of this value.

Figure 1 presents the Dataflow Description Language (DDL) for defining dataflow graphs (DFGs). A program p in DDL is of the form I; O; E. I and O represent sets of edges, given as vectors of the form x̄ = ⟨x1, x2, ... xn⟩. Variables x1, x2, ... represent DFG edges, i.e., streams used as communication channels between DFG nodes and as the input and output of the entire DFG.

I is of the form input x̄, where x̄ is the set of input variables. Each variable x ∈ I represents a file file(x) that is read from the Unix filesystem. Note that more than one input variable may refer to the same file. O is of the form output x̄, where x̄ is the set of output variables. Each variable x ∈ O represents a file file(x) that is written to the Unix filesystem.

E represents the nodes of the DFG. A node x̄_O ← f(x̄_I) represents a function from a list of input variables (edges) x̄_I to output variables (edges) x̄_O.
We require that f is monotone with respect to a lifting of the prefix order to sequences of inputs: for all v, v', and fixed remaining inputs v̄, if v ≤ v', ⟨v1, ... vm⟩ = f(v, v̄), and ⟨v'1, ... v'm⟩ = f(v', v̄), then ∀i ∈ [1, m]. vi ≤ v'i. This captures the idea that a node cannot retract output that it has already produced.

We wrap all functions f with an execution wrapper ⟦·⟧ that ensures that all outputs of f are closed when its inputs are closed:

    ⟦f(v1·⊥, v2·⊥, ... vn·⊥)⟧ = ⟨v'1·⊥, v'2·⊥, ... v'm·⊥⟩

This is helpful for ensuring termination. From now on, we only refer to the wrapped function semantics. We also assume that commands do not produce output if they have not consumed any input, i.e., the following is true:

    ⟨ε, ..., ε⟩ = ⟦f(ε, ..., ε)⟧

A variable in DDL is assigned only once and consumed by only one node. DDL does not allow the dataflow graph to contain any cycles. This also holds for variables in I and O: variables in I are never assigned a different value in E, and variables in O are not read by any node in E. All variables not included in I and O abstractly represent temporary files/pipes created during the execution of a shell script. We assume that within a dataflow program, all variables are reachable from some input variables.

Execution Semantics: Figure 2 presents the small-step execution semantics for DDL. The map Γ associates variable names with the data contained in the streams they represent. The map σ associates variable names with the data in the stream that has already been processed, representing the read-once semantics of Unix pipes. Let ⟨x'1, ... x'm⟩ ← f(x1, ... xn) be a node in our DFG program.
The function choice_f represents the order in which a command consumes its inputs, by returning a set of input indexes on which the function blocks waiting to read. For example, the choice_cat function for the command cat always returns the next non-closed index, as cat reads its inputs in sequence, each one until depletion:

    {i+1} = choice_cat(v1·⊥, ... vi·⊥, v(i+1), ... vn)

For a choice_f function to be valid, it has to return input indexes that have not been closed yet. Formally:

    S = choice_f(v1, ... vi·⊥, ... vn)  ⟹  i ∉ S

We assume that the set returned by choice_f cannot be empty unless all input indexes are closed, meaning that all nodes consume all of their inputs until depletion, even if they do not need the rest for processing.

The small-step semantics non-deterministically picks a variable xi such that i ∈ choice_f(v1, ... vn), i.e., f is waiting to read some input from xi, and σ(xi) < Γ(xi), i.e., there is data on the stream represented by variable xi that has yet to be processed. The execution then retrieves the next message vx to process, and computes new messages vo1, ... vom to pass on to the output streams x'1, ... x'm. Note that any of these messages (input or output) might be ⊥. We pass vi ⋄ vx, which denotes that the previous data vi is now being combined with the new message vx, to the function f. The precise execution semantics of f defines how the new message will be processed, but formally, given ⟨v'1, v'2, ... v'm⟩ = ⟦f(v1, ..., vi, ... vn)⟧, the following constraint holds:

    ⟨vo1, ... vom⟩ = ⟦f(v1, ..., vi ⋄ vx, ... vn)⟧
      ⟹  ⟨v'1·vo1, ... v'm·vom⟩ = ⟦f(v1, ..., vi·vx, ... vn)⟧

The messages vo1, ... vom are passed on to their respective output streams (by updating Γ). Note that the sizes of the output messages can vary, and they can even be empty. Finally, σ is updated to denote that vx has been processed.

    ⟨x'1, ... x'm⟩ ← f(x1, ... xi, ... xn) ∈ E
    vi·vx ≤ Γ(xi)        |vx| = 1 ∨ vx = ⊥
    ∀j ∈ [1, i-1] ∪ [i+1, n]. vj = σ(xj)
    i ∈ choice_f(v1, ... vn)
    ⟨vo1, ... vom⟩ = ⟦f(v1, ... vi ⋄ vx, ... vn)⟧
    ─────────────────────────────────────────────────────────  (Step)
    I, O, E ⊢ Γ[x'1 ↦ v'1, ... x'm ↦ v'm], σ[xi ↦ vi]
            → Γ[x'1 ↦ v'1·vo1, ... x'm ↦ v'm·vom], σ[xi ↦ vi·vx]

Figure 2. Small-step execution semantics.

Execution: Let ⟨I, O, E⟩ be a dataflow program, where I = input x̄_I are the input variables and O = output x̄_O are the output variables. Let σ_I be the initial mapping from all variable names in the dataflow program ⟨I, O, E⟩ to the empty string ε. Let Γ_I be the initial mapping for variables in the dataflow program, such that all non-input variables x ∉ x̄_I map to the empty string, Γ_I(x) = ε, while all input variables x ∈ x̄_I, i.e., files already present in the file system, are mapped to the contents of the respective input file, Γ_I(x) = v·⊥.

When no more small-step transitions can take place (i.e., all commands have finished processing), the dataflow execution terminates and the contents of the output variables in O can be written to their respective output files. Figure 3 shows the constraint that has to be satisfied by Γ at the end of the execution, i.e., when all variables are processed. We now prove some auxiliary theorems and lemmas to show that dataflow programs always terminate and that, when they terminate, the constraint in Figure 3 holds.

Theorem 4.1.
A dataflow program always terminates, and when it terminates, the constraint of Figure 3 holds in the end state.

Proof. In the appendix (§A.4). ∎

This section formalizes the translation between shell programs and dataflow programs.
Given a shell script, the compiler starts by recursing on the AST, replacing subtrees in a bottom-up fashion with dataflow programs. Figure 4 shows a relevant subset of shell syntax, adapted from Smoosh [18]. Intuitively, some shell constructs (such as pipes |) allow for the composition of the dataflow programs of their components, while others (such as ;) prevent it. Figure 5 shows the translation rules for some interesting constructs, and Figure 6 shows several auxiliary relations that are part of this translation. We denote compilation from a shell AST to a shell AST as c ↪ c', and compilation to a dataflow program as c ↪ ⟨p, b⟩, where p is a dataflow program and b ∈ {bg, fg} denotes whether the program is to be executed in the foreground or background.

The first two rules, CommandTrans and CommandId, describe the compilation of commands. The bulk of the work is done in cmd2node, which, when possible, defines a correspondence between a command and a dataflow node. The predicate pure indicates whether the command is pure, i.e., whether it only interacts with its environment by reading and writing to files. All commands that we have seen until now (grep, sort, comm) satisfy this predicate. The relations ins and outs define a correspondence between a command's arguments and the node's inputs and outputs. We assume that a variable is uniquely identified by the file that it refers to; therefore, if two variables have the same name, then they also refer to the same files. Finally, the relation func extracts information about the execution of the command (such as its choice function and w) so that the command can be reconstructed later on. All four relations (pure, ins, outs, and func) can be constructed through analysis, developer annotations, or a combination thereof.

The rule BackgroundDfg sets the background flag for the underlying dataflow program; if the operand of a & is not compiled to a dataflow program, then it is simply left as is. The last part holds for all shell constructs: we currently only create dataflow nodes from single commands.

The next set of rules refers to the sequential composition operator ';'. This operator acts as a dataflow barrier, since it enforces an execution ordering between its two operands. Because of that, it forces the dataflow programs that are generated from its operands to be optimized (with opt) and then compiled back to shell scripts (with ↩). However, there is one case (SeqBothBg) where a dataflow region can propagate through a ';': when the first component is to be executed in the background. In this case ';' does not enforce an execution-order constraint between its two operands, and the generated dataflow programs can be safely composed into a bigger one. The rules for '&&' and '||' are similar (omitted).

The relation compose unifies two dataflow programs by combining the inputs of one with the outputs of the other and vice versa. Before doing that, it ensures that the composed dataflow graph will be valid, by checking that there is at most one reader and one writer for each internal and output variable, as well as all the rest of the dataflow program invariants, e.g., the absence of cycles (§4).

The remaining rules (not shown) introduce synchronization constraints and are not part of our parallelization effort; for example, we consider all branching operators as strict dataflow barriers.

    ⟨x'1, ... x'm⟩ ← f(x1, ... xn) ∈ E
    ⟨v'1·⊥, ... v'm·⊥⟩ = ⟦f(v1·⊥, ... vn·⊥)⟧
    ─────────────────────────────────────────────  (Completion)
    I, O, E ⊢ Γ[x'1 ↦ v'1·⊥, ... x'm ↦ v'm·⊥, x1 ↦ v1·⊥, ... xn ↦ vn·⊥]
Figure 3.
Execution constraints.

    Commands      c ::= (s=w)* w r* | pipe c+ &? | c r+ | c & | (c) | c ; c
                      | c && c | c || c | ! c | while c c | for s w c
                      | if c c c | case w cb* | s() c
    Redirections  r ::= ...
    Words         w ::= (s | k | c)*
    Control codes k ::= ...
    Strings       s ∈ Σ+

Figure 4.
A relevant subset of shell syntax presented inSmoosh [18].
Figure 7 presents the compilation of a dataflow program p = I; O; E to a shell program. The compilation can be separated into a prologue, the main body, and an epilogue.

The prologue creates a named pipe (i.e., a Unix FIFO) for every variable in the program. Named pipes are created in a temporary directory using the mkfifo command, and are similar in behavior to ephemeral pipes except that they are explicitly associated with a file-system identifier, i.e., they are a special file in the file system. Named pipes are used in place of ephemeral pipes (|) in the original script.

The epilogue inserts a wait to ensure that all the nodes in the dataflow have completed execution, and then removes all named pipes from the temporary directory using rm. The design of the prologue-epilogue pair mimics how Unix treats ephemeral pipes, which correspond to temporary identifiers in a hidden file-system.

The main body expresses the parallel computation and can also be separated into three components. For each of the input variables xi ∈ I, we add a command that copies the file f = file(xi) to its designated pipe. Similarly, for all output variables xo ∈ O we add a command that copies the designated pipe to the output file in the filesystem f = file(xo). Finally, we translate each node in E to a shell command that reads from the pipes corresponding to its input variables and writes to the pipes corresponding to its output variables. In order to correctly translate a node back to a command, we use the node-command correspondence functions, similar to the ones that were used for the translation of the command to a node. Since a translated command might get its input from (or send its output to) a named pipe, we need to also add those as new redirections with in_out.
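To make the prologue, body, and epilogue structure concrete, the following is a hand-written sketch of what such a compiled script could look like for a one-node DFG; the file and pipe names (in.txt, p1, p2, out.txt) are illustrative, not the compiler's actual output.

```shell
printf 'hello\n' > in.txt          # illustrative input file
d=$(mktemp -d)
mkfifo "$d/p1" "$d/p2"             # prologue: one named pipe per DFG variable
cat in.txt > "$d/p1" &             # input variable: copy file to its pipe
tr a-z A-Z < "$d/p1" > "$d/p2" &   # node: reads its input pipe, writes its output pipe
cat "$d/p2" > out.txt &            # output variable: copy pipe to output file
wait                               # epilogue: wait for all nodes to finish...
rm -r "$d"                         # ...and remove the named pipes
```

Every command runs in the background, so the nodes execute concurrently and communicate only through the FIFOs, just as they would through ephemeral pipes.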
For example, for a node x3 ← f(x1, x2) that first reads x2 and then reads x1, where f = comm -23, the following command would be produced:

comm -23 p2 p1 > p3 &

In this section we define a set of transformations that expose data parallelism on a dataflow graph. We start by defining a set of helper DFG nodes and a set of auxiliary transformations that simplify the graph and enable the parallelization transformations. Then we identify a property on dataflow nodes that indicates whether a node can be executed in a data parallel fashion. We then define the parallelization transformations, and we conclude with a proof that applying all of the transformations preserves the semantics of the original DFG.
Before we define the parallelization transformations, we introduce several helper functions that can be used as dataflow nodes. The first function is split. split takes a single input variable (a file or pipe) and sequentially splits it into multiple output variables. The exact size of the data written to each output variable is left abstract, since it does not affect correctness but only performance.

⟨x1, . . . , xn⟩ ← split(xs)
v = ⟨v1·⊥, v2·⊥, . . . , vn−1·⊥, vn, ε, . . . , ε⟩, vs = v1 · v2 · . . . · vn, ∃vn. ⟦split(vs)⟧ = v
v = ⟨v1·⊥, v2·⊥, . . . , vn·⊥⟩, vs = v1 · v2 · . . . · vn · ⊥, ∀vn. ⟦split(vs)⟧ = v

The second function is cat, which coincidentally behaves the same as the Unix command cat. cat, given a list of input variables, combines their values and assigns the result to a single output variable. Formally, cat is defined below:

xo ← cat(x1, . . . , xn)
v = ⟨v1·⊥, v2·⊥, . . . , vn−1·⊥, vn, . . . , vm⟩, vo = v1 · v2 · . . . · vn, ∀v. ⟦cat(v)⟧ = vo
v = ⟨v1·⊥, v2·⊥, . . . , vn·⊥⟩, vo = v1 · v2 · . . . · vn · ⊥, ∀v. ⟦cat(v)⟧ = vo
Figure 5. A subset of the compilation rules.

Figure 6. Auxiliary relations for translating commands to nodes and back.

Figure 7. DFG to shell transformations.
The third function is tee, which behaves the same way as the Unix command tee, i.e., it copies its input variable to several output variables. Formally, tee is defined below:

⟨x1, . . . , xn⟩ ← tee(xi), ∀vi. ⟦tee(vi)⟧ = ⟨vi, vi, . . . , vi⟩, ∀vi. ⟦tee(vi · ⊥)⟧ = ⟨vi · ⊥, vi · ⊥, . . . , vi · ⊥⟩

The final function is relay. relay works as an identity function. Formally, relay is defined below:

xo ← relay(xi), ∀v. ⟦relay(v)⟧ = v, ∀v. ⟦relay(v · ⊥)⟧ = v · ⊥

Using these helper nodes, our compiler performs a set of auxiliary transformations that are depicted in Figure 8. relay acts as an identity function, therefore any edge can be transformed to include a relay. Splitting in multiple stages to get n edges is the same as splitting in one step into n edges. Similarly, combining n edges in multiple stages is the same as combining n edges in a single stage. If we split an edge into n edges and then combine the n edges back, this behaves as an identity. A cat can be pushed after a tee by creating n copies of the tee function. The first five transformations can be performed both ways; the last three transformations are one-way. A split after a cat can be converted into relays if the input arity of the cat is the same as the output arity of the split. If a cat has a single incoming edge, we can convert it into a relay. If a split has a single outgoing edge, we can convert it into a relay. Note that the reverse transformations in these cases are not allowed.

The dataflow model exposes task parallelism, as each node can execute independently, only communicating with the other nodes through their communication channels. In addition to that, it is possible to achieve data parallelism by executing some nodes in parallel after partitioning part of their input.
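The helper-node identities can be observed with ordinary Unix commands. Below is an illustrative sketch (not the compiler's implementation, whose split helper is a separate program with internally chosen chunk sizes): a hand-written two-way split using head and tail, followed by cat, exercises the Split-Concat identity, and tee produces identical copies of its input.

```shell
seq 1 10 > x.txt                              # illustrative input
head -n 5 x.txt > x1.txt                      # split x into <x1, x2>
tail -n +6 x.txt > x2.txt
cat x1.txt x2.txt > y.txt                     # cat(split(x)): behaves as relay(x)
tee copy1.txt copy2.txt < x.txt > /dev/null   # tee: identical copies of x
```

After running this, y.txt is byte-identical to x.txt, and each tee output is a full copy of the input, matching the formal definitions above.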
Sequential Consumption Nodes: We are interested in nodes that produce a single output and consume their inputs in sequence (one after the other, as they are depleted), after having consumed the rest of their inputs as an initialization and configuration phase. Note that there are several examples of shell commands that correspond to such nodes, e.g., grep, sort, comm -13, and sha1sum. Let such a node be x′ ← f(x1, . . . , xn+m), where w.l.o.g. x1, x2, . . . , xn represent the configuration inputs and xn+1, . . . , xn+m represent the sequential consumption inputs. The consumption order of such a command is shown below:

choicef(v) = { i : i ≤ n ∧ ¬closed(vi) }   if ¬∀i ≤ n. closed(vi)
             { j : ∀i < j. closed(vi) }    otherwise

If we know that a command f satisfies the above property, we can safely transform it to a node xc ← cat(xn+1, . . . , xn+m) followed by a command x′ ← f′(xc, x1, . . . , xn), without altering the semantics of the graph.

Data Parallel Nodes: We now shift our focus to a subset of the sequential consumption nodes, namely those that can be executed in a data parallel fashion by splitting their inputs. These are nodes that can be broken down into a parallel map fm and an associative aggregate fr. Formally, these nodes have to satisfy the following equation:

∀i. ⟦f(vi1 · · · vik, v0)⟧ = ⟦fr(v1, . . . , vk, v0)⟧ where ∀j ∈ {1 . . . k}. vj = ⟦fm(vij, v0)⟧

We denote data parallel nodes as dp(f, fm, fr). An example of a node that satisfies this property is the sort command, where fm = sort and fr = sort -m.

In addition to the above equation, a map function fm should not produce any additional output when its input closes:

⟦fm(v, v0)⟧ = ⟦fm(v · ⊥, v0)⟧

Note that fm could have multiple outputs and be different from the original function f.
As has been noted in prior research [13], this is important, as some functions require auxiliary information in the map phase in order to be parallelized.

An important observation is that a subset of all data parallel nodes are completely stateless, meaning that fm = f and fr = cat, and are therefore embarrassingly parallel.

We can now define a transformation on any data parallel node f that replaces it with a map followed by an aggregate. This transformation is formally shown in Figure 9. Essentially, all the sequential consumption inputs (which are concatenated using cat) are given to different fm nodes, the outputs of which are then aggregated using fr while preserving the input order. Note that the configuration inputs have to be duplicated using tee to ensure that all parallel fm s and fr s will be able to read them in case they are pipes and not files on disk.

Using the auxiliary transformations to add a split followed by a cat before a data parallel node, we can always parallelize it using the parallelization transformation.

Correctness of Transformations: We present the proof of correctness for the transformations provided in this section in the appendix (§A.2).
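The parallelization transformation for dp(sort, sort, sort -m) can be sketched by hand in the shell. In this illustrative example (the split is written manually with head and tail; in the compiler it is a helper node), two sort instances act as the parallel maps and sort -m performs the order-preserving merge aggregate.

```shell
seq 10 -1 1 > in.txt               # illustrative unsorted input
head -n 5 in.txt > c1.txt          # split the input into two chunks
tail -n +6 in.txt > c2.txt
sort c1.txt > s1.txt &             # parallel map instances (fm = sort)
sort c2.txt > s2.txt &
wait
sort -m s1.txt s2.txt > out.txt    # aggregate (fr = sort -m) merges sorted chunks
```

The result matches running sort on the whole input, which is exactly the equation a data parallel node must satisfy.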
We next present an evaluation that characterizes the performance benefits that the different transformations deliver.
Compiler: The core of our compiler is written in Python and applies a series of transformations starting from a shell script AST. It takes the AST as JSON via the OCaml bindings of Smoosh [18], a mechanized shell. After translating the dataflow regions of the input script to DFGs (§5.1) and applying optimization passes (§6), the compiler translates
Figure 8. Auxiliary transformations.

Figure 9.
Parallelization transformation.

all DFGs back into a shell script (§5.2). This is achieved by feeding the modified AST back to Smoosh, which converts it back to a POSIX-compliant string. Nodes of the graph are instantiated with the commands and flags they represent, and edges are instantiated as named pipes (§5.2). The helper nodes described in Section 6.1 are instantiated as handcrafted commands in C.

The compiler accepts a --width flag for configuring the level of parallelism, i.e., the number of outputs that split nodes produce.
Methodology: We collected two sets of benchmark programs from various sources, including GitHub, Stack Overflow, and the Unix literature [3, 4, 7, 23, 38, 51]:

• Expert Scripts: The first set contains 9 scripts that are written by experts: NFA-regex, Sort, Top-N, WF, Spell, Difference, Bi-grams, Set-Difference, and Shortest-Scripts.

• Unix50 Scripts: The second set contains 34 scripts solving the Unix 50 game [28]. Due to their inefficiencies, we conjecture that these scripts were written by non-experts; however, we proceed to parallelize them without any modifications.

We use our compiler to parallelize all of the scripts in these two benchmark sets, working with three configurations:

• Baseline: Our compiler simply executes the script using a standard shell (in our case bash) without performing any optimizations. This configuration is used as our baseline. Note that it is not completely sequential, since the shell

Figure 10.
Execution times with configurations Baseline, Parallel, and No Cat-Split; 16×-parallelism on 10GB inputs.
Figure 11.
Speedup achieved for the "Unix50 Scripts" (y-axis) and their sequential execution time (x-axis). Points are color-coded to highlight important features affecting achieved parallelism. 16×-parallelism on 10GB inputs.
Figure 12.
Parallelization speedup for "Expert Scripts", with parallelism --width=2–64×.

already achieves pipeline and task parallelism based on | and &.

• No Cat-Split:
Our compiler performs all transformations except Concat-Split. This configuration achieves parallelism by splitting the input before each command and then merging it back. It is used as a baseline to measure the benefits achieved by the Concat-Split transformation.

• Parallel:
Our compiler performs all transformations. The Concat-Split transformation, which removes a cat with n inputs followed by a split with n outputs, ensures that data is not merged unnecessarily between parallel stages.

Experimental Setup: Experiments were run on a 2.1GHz Intel Xeon E5-2683 with 512GB of memory and 64 physical cores, running Debian 4.9.144-3.1, GNU Coreutils 8.30-3, GNU Bash 5.0.3(1), OCaml 4.05.0, and Python 3.7.3. All scripts are set to (initially) read from and (finally) write to the file system. For "Expert Scripts", we use 10GB collections of inputs from Project Gutenberg [20]; for "Unix50 Scripts", we gather their inputs from each level in the game [28] and multiply them up to 10GB.
Performance: Fig. 10 shows the execution times on all programs with parallelism --width=16 for all three configurations mentioned in the beginning of the evaluation. It shows that all programs achieve significant improvements with the addition of the Concat-Split transformation. The average speedup without Concat-Split over the bash baseline is 2.11×. The average speedup with the transformation is 6.14×.

Fig. 11 explains the differences in the effect of the transformation based on the commands involved in the pipelines. It offers a correlation between sequential time and speedup, and shows that different programs that involve commands with similar characteristics (color) see similar speedups (y-axis). Programs containing only parallelizable commands see the highest speedups (10.4–14.5×). Programs with limited speedup either (1) contain sort, which does not scale linearly, (2) are not CPU-intensive, resulting in pronounced IO and constant costs, or (3) are deep pipelines, already exploiting significant pipeline-based parallelism. Programs with non-parallelizable commands see no significant change in execution time (0.9–1.3×). Finally, programs containing head have a very small sequential execution time, typically under s, and thus their parallel equivalents see a slowdown due to constant costs, still remaining under s.

Fig. 12 explores how the --width parameter affects speedups for the programs in the first set by compiling them with all transformations and --width=2–64×. All programs achieve speedups for all --width configurations. Average speedups for {2, 4, 8, 16, 32, 64}×-parallelism are 1.83, 3.01, 4.76, 7.29, 9.64, and 11.69, respectively. We note that several programs reach their peak performance with --width={16, 32}, due to the fact that they already exhibit pipeline and task parallelism, and therefore a higher --width leads to a large number of parallel processes in the resulting script (up to 500).

Dataflow Graph Models: Graph models of computation where nodes represent units of computation and edges represent FIFO communication channels have been studied extensively [10, 24–26, 30, 31].
ODFM sits somewhere between Kahn Process Networks [24, 25] (KPN), the model of computation adopted by Unix pipes, and Synchronous Dataflow [30, 31] (SDF). A key difference between ODFM and SDF is that ODFM does not assume fixed item rates, a property used by SDF for efficient scheduling determined at compile time. Two differences between ODFM and KPNs are that (i) ODFM does not allow cycles, and (ii) ODFM exposes information about the input consumption order of each node. This order provides enough information at compile time to perform parallelizing transformations while also enabling translation of the dataflow back to a Unix shell script.

Systems for batch [9, 40, 56], stream [16, 34, 52], and signal processing [8, 30] provide dataflow-based abstractions. These abstractions are different from ODFM, which operates on the Unix shell, an existing language with its own peculiarities that have guided the design of the model.

Synchronous languages [6, 19, 29, 36] model stream graphs as circuits where nodes are state machines and edges are wires that carry a single value. Lustre [19] is based on a dataflow model that is similar to ours, but its focus is different, as it is not intended for exploiting data parallelism.
Semantics and Transformations: Prior work proposes semantics for streaming extensions to relational query languages based on dataflow [1, 32]. In contrast to our work, it focuses on transformations of time-varying relations.

More recently, there has been significant work on the correct parallelization of distributed streaming applications, proposing sound optimizations and compilation techniques [21, 45], as well as developing a type system for the correct parallelization of streaming programs [35]. These efforts aim at producing a parallel implementation of a dataflow streaming computation using techniques that do not require knowledge of the order of consumption of each node, a property that is very important in our setting.

Recent work proposes a semantic framework for stream processing that uses monoids to capture the type of data streams [33]. That work mostly focuses on generality of expression, showing that several already-proposed programming models can be expressed on top of it. It also touches upon soundness proofs of optimizations using algebraic reasoning, which is similar to our approach.
Divide and Conquer Decomposition: Prior work has shown the possibility of decomposing programs or program fragments using divide-and-conquer techniques [13, 14, 44, 47]. The majority of that work focuses on parallelizing special constructs, e.g., loops, matrices, and arrays, rather than stream-oriented primitives. Techniques for the automated synthesis of MapReduce-style distributed programs [47] can be of significant aid for individual commands. In some cases [13, 14], the map phase is augmented to maintain additional metadata used by the reducer phase. Our focus is different, as we transform entire programs composed of multiple such commands and prove the correctness of these transformations. The two classes of systems complement each other well: using these techniques, our compiler could generate its aggregator library automatically.
Parallel Shell Scripting: Tools exposing parallelism on modern Unixes, such as qsub [15], SLURM [55], and GNU parallel [49], are predicated upon explicit and careful orchestration from their users. Similarly, several shells [11, 37, 48] add primitives for non-linear pipe topologies, some of which target parallelism. Here too, however, users are expected to manually rewrite scripts to exploit these new primitives without jeopardizing correctness. While all these tools emphasize performance and configurability, our work emphasizes automation and correctness.

PaSh is a system for automatically parallelizing Unix shell commands and scripts [53]. The semantics and theorems presented in this paper can serve as formal foundations that establish the soundness of the basic PaSh parallelization approach.
POSIX Shell Semantics: Our work depends on Smoosh, an effort focused on formalizing the semantics of the POSIX shell [18]. Smoosh focuses on POSIX semantics, whereas our work introduces a novel dataflow graph representation in order to transform programs and prove the correctness of its transformation passes. A subset of the Smoosh authors have also argued for making concurrency explicit via shell constructs [17]. This is different from our work, which argues for the automated and correct parallelization of sequential scripts.
Parallel Userspace Environments: By focusing on simplifying the development of distributed programs, a plethora of environments inadvertently assist in the construction of parallel software. Such systems [2, 39, 41, 42] or languages [12, 27, 46, 54] hide many of the challenges of dealing with concurrency, as long as developers leverage the provided abstractions, which are strongly coupled to the underlying operating or runtime system. Even when these efforts are shell-oriented, such as Plan 9's rc, they are backward-incompatible with the Unix shell, and often focus primarily on hiding the existence of a network rather than automating parallel processing.

This paper presented an order-aware dataflow model for extracting data parallelism latent in Unix shell scripts. The model is used to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. Applying the dataflow transformations on a series of scripts delivers a speedup of 6.14× on average and up to 61.1× on a 64-core machine. We view our work as a stepping stone for further studies of the dataflow subset of the shell, as well as a playground for the development of more elaborate transformations and optimizations.

Acknowledgments
We thank Konstantinos Mamouras for preliminary discussions that helped spark an interest in this work. This research was funded in part by DARPA contracts HR00112020013 and HR001120C0191, and NSF award CCF 1763514. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DARPA or other agencies.
References

[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2006. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal 15, 2 (2006), 121–142.
[2] Amnon Barak and Oren La'adan. 1998. The MOSIX multicomputer operating system for high performance cluster computing. Future Generation Computer Systems 13, 4 (1998), 361–372.
[3] Jon Bentley. 1985. Programming Pearls: A Spelling Checker. Commun. ACM 28, 5 (May 1985), 456–462. https://doi.org/10.1145/3532.315102
[4] Jon Bentley, Don Knuth, and Doug McIlroy. 1986. Programming Pearls: A Literate Program. Commun. ACM 29, 6 (June 1986), 471–483. https://doi.org/10.1145/5948.315654
[5] Emery D. Berger. 2003. Optimizing Shell Scripting Languages. (2003).
[6] Gérard Berry and Georges Gonthier. 1992. The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming 19, 2 (1992), 87–152.
[7] Pawan Bhandari. 2020. Solutions to unixgame.io. https://git.io/Jf2dn Accessed: 2020-04-14.
[8] Timothy Bourke and Marc Pouzet. 2013. Zélus: A synchronous language with ODEs. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control. 113–118.
[9] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113. https://doi.org/10.1145/1327452.1327492
[10] Jack B. Dennis. 1974. First Version of a Data Flow Procedure Language. In Programming Symposium, B. Robinet (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 362–376. https://doi.org/10.1007/3-540-06859-7_145
[11] Tom Duff. 1990. Rc—A shell for Plan 9 and Unix systems. AUUGN.
[12] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. 2011. Towards Haskell in the Cloud. In Proceedings of the 4th ACM Symposium on Haskell (Haskell '11). ACM, New York, NY, USA, 118–129. https://doi.org/10.1145/2034675.2034690
[13] Azadeh Farzan and Victor Nicolet. 2017. Synthesis of Divide and Conquer Parallelism for Loops. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). Association for Computing Machinery, New York, NY, USA, 540–555. https://doi.org/10.1145/3062341.3062355
[14] Azadeh Farzan and Victor Nicolet. 2019. Modular Divide-and-Conquer Parallelization of Nested Loops. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 610–624. https://doi.org/10.1145/3314221.3314612
[15] Wolfgang Gentzsch. 2001. Sun Grid Engine: Towards creating a compute power grid. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE, 35–36.
[16] Michael I. Gordon, William Thies, and Saman Amarasinghe. 2006. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. ACM SIGPLAN Notices 41, 11 (2006), 151–162.
[17] Michael Greenberg. 2018. The POSIX shell is an interactive DSL for concurrency.
[18] Michael Greenberg and Austin J. Blatt. 2020. Executable Formal Semantics for the POSIX Shell: Smoosh: the Symbolic, Mechanized, Observable, Operational Shell. Proc. ACM Program. Lang. 4, POPL, Article 43 (Jan. 2020), 31 pages. https://doi.org/10.1145/3371111
[19] Nicholas Halbwachs, Paul Caspi, Pascal Raymond, and Daniel Pilaud. 1991. The synchronous data flow programming language LUSTRE. Proc. IEEE 79, 9 (1991), 1305–1320.
[20] Michael Hart. 1971. Project Gutenberg.
[21] Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A catalog of stream processing optimizations. ACM Computing Surveys (CSUR) 46, 4 (2014), 1–34.
[22] Lluis Batlle i Rossell. 2016. tsp(1) Linux User's Manual. https://vicerveza.homeunix.net/~viric/soft/ts/
[23] Dan Jurafsky. 2017. Unix for Poets. https://web.stanford.edu/class/cs124/lec/124-2018-UnixForPoets.pdf
[24] Gilles Kahn. 1974. The Semantics of a Simple Language for Parallel Programming. Information Processing 74 (1974), 471–475.
[25] Gilles Kahn and David B. MacQueen. 1977. Coroutines and Networks of Parallel Processes. Information Processing 77 (1977), 993–998.
[26] Richard M. Karp and Raymond E. Miller. 1966. Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing. SIAM J. Appl. Math. 14, 6 (1966), 1390–1411. https://doi.org/10.1137/0114108
[27] Charles Edwin Killian, James W. Anderson, Ryan Braud, Ranjit Jhala, and Amin M. Vahdat. 2007. Mace: Language Support for Building Distributed Systems. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07). ACM, New York, NY, USA, 179–188. https://doi.org/10.1145/1250734.1250755
[28] Nokia Bell Labs. 2019. The Unix Game—Solve puzzles using Unix pipes. https://unixgame.io/unix50 Accessed: 2020-03-05.
[29] Paul Le Guernic, Albert Benveniste, Patricia Bournai, and Thierry Gautier. 1986. Signal—A data flow-oriented language for signal processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 34, 2 (1986), 362–374.
[30] Edward Ashford Lee and David G. Messerschmitt. 1987. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers (1987).
[31] Edward A. Lee and David G. Messerschmitt. 1987. Synchronous data flow. Proc. IEEE 75, 9 (1987), 1235–1245.
[32] Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 311–322.
[33] Konstantinos Mamouras. 2020. Semantic Foundations for Deterministic Dataflow and Stream Processing. In European Symposium on Programming. Springer, Cham, 394–427.
[34] Konstantinos Mamouras, Mukund Raghothaman, Rajeev Alur, Zachary G. Ives, and Sanjeev Khanna. 2017. StreamQRE: Modular specification and efficient evaluation of quantitative queries over streaming data. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 693–708.
[35] Konstantinos Mamouras, Caleb Stanford, Rajeev Alur, Zachary G. Ives, and Val Tannen. 2019. Data-Trace Types for Distributed Stream Processing Systems. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). ACM, New York, NY, USA, 670–685. https://doi.org/10.1145/3314221.3314580
[36] Florence Maraninchi and Yann Rémond. 2001. Argos: an automaton-based synchronous language. Computer Languages 27, 1-3 (2001), 61–92.
[37] Chris McDonald and Trevor I. Dix. 1988. Support for graphs of processes in a command interpreter. Software: Practice and Experience (1988).
[38] M. D. McIlroy, E. N. Pinson, and B. A. Tague. 1978. UNIX Time-Sharing System: Foreword. Bell System Technical Journal 57, 6 (1978), 1899–1904.
[39] Sape J. Mullender, Guido Van Rossum, A. S. Tanenbaum, Robbert Van Renesse, and Hans Van Staveren. 1990. Amoeba: A distributed operating system for the 1990s. Computer 23, 5 (1990), 44–53.
[40] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In
Proceedings of the Twenty-Fourth ACM Symposium on OperatingSystems Principles (SOSP β13) . ACM, New York, NY, USA, 439β455. https://doi.org/10.1145/2517349.2522738 [41] John K Ousterhout, Andrew R. Cherenson, Fred Douglis, Michael N.Nelson, and Brent B. Welch. 1988. The Sprite network operatingsystem.
Computer
21, 2 (1988), 23β36. [42] Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, et al. 1990.Plan 9 from Bell Labs. In
Proceedings of the summer 1990 UKUUGConference . 1β9. http://css.csail.mit.edu/6.824/2014/papers/plan9.pdf [43] Deepti Raghavan, Sadjad Fouladi, Philip Levis, and Matei Zaharia. 2020. { POSH } : A Data-Aware Shell. In { USENIX } Annual TechnicalConference ( { USENIX }{ ATC } . 617β631.[44] Radu Rugina and Martin Rinard. 1999. Automatic Parallelization ofDivide and Conquer Algorithms. In Proceedings of the Seventh ACMSIGPLAN Symposium on Principles and Practice of Parallel Programming(PPoPP β99) . Association for Computing Machinery, New York, NY, USA,72β83. https://doi.org/10.1145/301104.301111 [45] Scott Schneider, Martin Hirzel, BuΔra Gedik, and Kun-Lung Wu. 2013.Safe data parallelism for general streaming.
IEEE transactions oncomputers
64, 2 (2013), 504β517.[46] Peter Sewell, James J. Leifer, Keith Wansbrough, Francesco ZappaNardelli, Mair Allen-Williams, Pierre Habouzit, and Viktor Vafeiadis.2005. Acute: High-level Programming Language Design for DistributedComputation. In
Proceedings of the Tenth ACM SIGPLAN InternationalConference on Functional Programming (ICFP β05) . ACM, New York,NY, USA, 15β26. https://doi.org/10.1145/1086365.1086370 [47] Calvin Smith and Aws Albarghouthi. 2016. MapReduce ProgramSynthesis. In
Proceedings of the 37th ACM SIGPLAN Conference onProgramming Language Design and Implementation (PLDI β16) . As-sociation for Computing Machinery, New York, NY, USA, 326β340. https://doi.org/10.1145/2908080.2908102 [48] Diomidis Spinellis and Marios Fragkoulis. 2017. Extending UnixPipelines to DAGs.
IEEE Trans. Comput.
66, 9 (2017), 1547β1561.[49] Ole Tange. 2011. GNU ParallelβThe Command-Line Power Tool. ;login:The USENIX Magazine
36, 1 (Feb 2011), 42β47. https://doi.org/10.5281/zenodo.16303 [50] Ole Tange. 2020. DIFFERENCES BETWEEN GNU Parallel ANDALTERNATIVES. [51] Dave Taylor. 2004.
Wicked Cool Shell Scripts: 101 Scripts for Linux, MacOS X, and Unix Systems . No Starch Press.[52] William Thies, Michal Karczmarek, and Saman Amarasinghe. 2002.StreamIt: A language for streaming applications. In
International Con-ference on Compiler Construction . Springer, 179β196.[53] Nikos Vasilakis, Konstantinos Kallas, Konstantinos Mamouras, Achil-leas Benetopoulos, and Lazar Cvetkovich. 2020. PaSh: Light-touchData-Parallel Shell Processing. arXiv preprint arXiv:2007.09436 (2020).[54] Robert Virding, Claes WikstrΓΆm, and Mike Williams. 1996.
ConcurrentProgramming in ERLANG (2Nd Ed.) . Prentice Hall International (UK)Ltd., Hertfordshire, UK, UK.[55] Andy B Yoo, Morris A Jette, and Mark Grondona. 2003. Slurm: Simplelinux utility for resource management. In
Workshop on Job SchedulingStrategies for Parallel Processing . Springer, 44β60.[56] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, andIon Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerantAbstraction for In-memory Cluster Computing. In
Proceedings ofthe 9th USENIX Conference on Networked Systems Design and Imple-mentation (NSDIβ12) . USENIX Association, Berkeley, CA, USA, 2β2. http://dl.acm.org/citation.cfm?id=2228298.2228301n Order-aware Dataflow Model for Extracting Shell Script Parallelism Conferenceβ17, July 2017, Washington, DC, USA
A Appendix
A.1 Two Case Studies with GNU Parallel
We describe an attempt to achieve data parallelism in two scripts using GNU parallel [49], a tool for running shell commands in parallel. We chose GNU parallel because it compares favorably to other alternatives in the literature [50], but note that GNU parallel sits somewhere between an automated compiler like our tool and a fully manual approach, illustrating only some of the issues that one might face while manually trying to parallelize their shell scripts.
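A recurring issue in what follows is output ordering: commands such as uniq are only correct when the outputs of parallel chunks are merged in input order. The Python sketch below is our own illustration (the chunking and the uniq model are hypothetical stand-ins, not part of the paper's formalism):

```python
def uniq(lines):
    """Model of `uniq`: drop *adjacent* duplicate lines only."""
    out = []
    for line in lines:
        if not out or out[-1] != line:
            out.append(line)
    return out

# A sorted stream, split into two chunks as `parallel --pipe` might.
stream = ["apple", "apple", "banana", "banana", "cherry"]
chunks = [stream[:3], stream[3:]]
processed = [uniq(c) for c in chunks]    # each chunk deduplicated in parallel

in_order = processed[0] + processed[1]   # merge order kept (what -k preserves)
reordered = processed[1] + processed[0]  # a possible unordered merge

# A final `uniq` pass repairs duplicates at chunk boundaries, but only
# if the merge preserved the original order.
assert uniq(in_order) == ["apple", "banana", "cherry"]
assert uniq(reordered) != ["apple", "banana", "cherry"]
```

Because uniq only removes adjacent duplicates, a scheduler-dependent merge order leaves stale duplicates in the stream that no later pass can safely repair.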
Spell: We first apply parallel to Spell's first pipeline (§1):

```shell
TEMP_C1="/tmp/{/}.out1"
TEMP1="/tmp/in{1..16}.out1"
mkfifo ${TEMP1}
parallel "cat {} | col -bx | \
    tr -cs A-Za-z '\n' | tr A-Z a-z | \
    tr -d '[:punct:]' | sort > $TEMP_C1" \
    ::: $IN &
sort -m ${TEMP1} |
    parallel -k --jobs ${JOBS} --pipe --block 250M "uniq" |
    uniq |
    parallel -k --jobs ${JOBS} --pipe --block 250M "comm -23 - $dict"
rm ${TEMP1}
```

It took us a few iterations to get the parallel version right, leading to a few observations. First, despite its automation benefits, parallel still requires manual placement of the intermediate FIFO pipes and aggregation functions. Additionally, achieving ideal performance requires some tweaking: three different settings of --block yield widely different execution times of 27, 4, and 3 minutes, respectively. Most importantly, omitting the -k flag in the last two fragments breaks correctness due to re-ordering caused by scheduling non-determinism. These fragments are fortunate cases in which the -k flag has the desired effect, because their output order follows the order of the arguments of the commands they parallelize. Other commands face problems in which the correct output order is neither the argument order nor an arbitrary interleaving.

Set-difference: We apply parallel to Set-diff, a script that compares two streams using sort and comm:

```shell
mkfifo s1 s2
TEMP_C1="/tmp/{/}.out1"
TEMP_C2="/tmp/{/}.out2"
TEMP1="/tmp/in{1..16}.out1"
TEMP2="/tmp/in{1..16}.out2"
mkfifo ${TEMP1} ${TEMP2}
parallel "cat {} | cut -d ' ' -f 1 | \
    sort > $TEMP_C1" ::: $IN &
sort -m ${TEMP1} > s1 &
parallel "cat {} | cut -d ' ' -f 1 | \
    sort > $TEMP_C2" ::: $IN2 &
sort -m ${TEMP2} > s2 &
cat s1 | parallel -k --pipe --jobs ${JOBS} \
    --block 250M "comm -23 - s2"
rm ${TEMP1} ${TEMP2}
rm s1 s2
```

In addition to the issues highlighted in Spell, this parallel implementation has a subtle bug.
GNU parallel spawns several instances of comm -23 - s2 that all read FIFO s2. When the first parallel instance exits, the kernel sends a SIGPIPE signal to the second sort -m. This forces sort to exit, in turn leaving the rest of the parallel comm instances blocked waiting for new input. The most straightforward way we have found to address this bug is to remove (1) the "&" operator after the second sort -m, and (2) s2 from mkfifo. This modification sacrifices pipeline parallelism, as the first stage of the pipeline completes before executing comm. The parallel pipeline modified for correctness completes in 4m54s. Our tool does not sacrifice pipeline parallelism, as it uses tee to replicate s2 for all parallel instances of comm (§2), completing in 4m7s.

A.2 Proof Sketches

Theorem A.1.
Let ⟨x′₁, x′₂, …, x′ₘ⟩ ← f(x₁, …, xₙ) ∈ E. The following holds for Γ and Σ at any point during the execution:

⟨Γ(x′₁), …, Γ(x′ₘ)⟩ = ⟦f(Σ(x₁), …, Σ(xₙ))⟧

Proof.
Proof by Induction.
Base Case:
We prove that the statement holds for the initial mappings Γᵢ and Σᵢ. For x ∈ ⟨x₁, …, xₙ⟩, Σᵢ(x) = ε. For x ∈ ⟨x′₁, …, x′ₘ⟩, Γᵢ(x) = ε (since x′₁, …, x′ₘ are not input variables of the DFG, they are initialized to ε). The following property holds for all functions f:

⟨ε, …, ε⟩ = ⟦f(ε, …, ε)⟧

Induction Case:
A step preserves the property. The following statement holds by the induction hypothesis:

⟨v′₁, …, v′ₘ⟩ = ⟦f(v₁, …, vᵢ, …, vₙ)⟧

Using the definition of ⟦·⟧, appending a value vₓ to the i-th input appends some value vₒⱼ to each output v′ⱼ:

⟨v′₁ · vₒ₁, …, v′ₘ · vₒₘ⟩ = ⟦f(v₁, …, vᵢ · vₓ, …, vₙ)⟧

Therefore, given that the statement initially holds for Γ and Σ, the small-step semantics preserves it. By induction, for all ⟨x′₁, x′₂, …, x′ₘ⟩ ← f(x₁, …, xₙ) ∈ E, the following holds for Γ and Σ at any point during the execution:

⟨Γ(x′₁), …, Γ(x′ₘ)⟩ = ⟦f(Σ(x₁), …, Σ(xₙ))⟧ ∎

Lemma A.2.
Let ⟨x′₁, x′₂, …, x′ₘ⟩ ← f(x₁, …, xₙ) ∈ E, and ∀i ∈ [1, n]. Γ(xᵢ) = vᵢ · ⊥. Then eventually ∀i ∈ [1, n]. Σ(xᵢ) = vᵢ · ⊥, and therefore eventually ∀j ∈ [1, m]. Γ(x′ⱼ) = v′ⱼ · ⊥.

Proof.
If some input Γ(xᵢ) is closed while Σ(xᵢ) is not, a step can be made to update Σ: the choice of next step is non-empty unless Σ(xᵢ) is closed for all i. Since all files considered are finite, eventually every Σ(xᵢ) is closed. When all inputs are closed, ⟦·⟧ dictates that all outputs are closed as well. Using Theorem A.1, every Γ(x′ⱼ) is then closed. ∎

Theorem A.3.
Eventually, for all variables x, ∃v. Γ(x) = v · ⊥, i.e., all variables are eventually closed.

Proof. Let C be the set of variables that are eventually closed. Initially, C contains all input variables in I. Using Lemma A.2, for any node ⟨x′₁, x′₂, …, x′ₘ⟩ ← f(x₁, …, xₙ) ∈ E, if x₁, …, xₙ ∈ C, then x′₁, …, x′ₘ ∈ C. Since the dataflow program contains no cycles, eventually all variables reachable from the input variables are in C. ∎

Theorem A.4.
The dataflow program always terminates, and when it terminates, Constraint 3 holds in the end state.

Proof.
Follows from Theorem A.3 and Theorem A.1. ∎

A.3 Correctness of Transformations
We now proceed to prove a series of statements regarding the semantics-preservation properties of dataflow programs.
Program Equivalence: Let p = ⟨I, O, E⟩ and p′ = ⟨I′, O′, E′⟩ be two dataflow programs, where I = ⟨xᵢ₁, …, xᵢₙ⟩, I′ = ⟨yᵢ₁, …, yᵢₙ⟩, O = ⟨xₒ₁, …, xₒₘ⟩, and O′ = ⟨yₒ₁, …, yₒₘ⟩. These programs are equivalent if and only if, given the same initial values of the input variables, the values of the output variables are the same at the time of their completion. Formally, for all values v₁, …, vₙ, if

∀j ∈ [1, n]. Γᵢ(xᵢⱼ) = Γ′ᵢ(yᵢⱼ) = vⱼ

then

∀j ∈ [1, m]. Γ_f(xₒⱼ) = Γ′_f(yₒⱼ)

where Γᵢ, Γ′ᵢ are the initial mappings for p and p′ respectively, and Γ_f, Γ′_f are the mappings when p and p′ have completed their execution.

Theorem A.5.
Let p = ⟨I, O, E ∪ E₁⟩ and p′ = ⟨I, O, E ∪ E₂⟩ be two dataflow programs. Let Sᵢ be the set of input variables of the node set E₁ (variables read in E₁ but not assigned inside E₁), and let Sₒ be the set of output variables of E₁ (variables assigned in E₁ but not read inside E₁). Let Sᵢ and Sₒ also be the input and output variables of E₂. If ⟨Sᵢ, Sₒ, E₁⟩ is equivalent to ⟨Sᵢ, Sₒ, E₂⟩, then the program ⟨I, O, E ∪ E₂⟩ is equivalent to ⟨I, O, E ∪ E₁⟩.

Proof. Since ⟨Sᵢ, Sₒ, E₁⟩ is equivalent to ⟨Sᵢ, Sₒ, E₂⟩, for all closed values of Sᵢ both programs eventually stop executing with equal values of the output variables Sₒ. Given any initial mapping Γᵢ, let Γ_f, Γ′_f be the mappings when p and p′ complete their execution. For all x ∈ Sᵢ, Γ_f(x) = Γ′_f(x), as there are no cycles in the dataflow graph and the subgraph that computes Sᵢ is the same in both p and p′. Since ⟨Sᵢ, Sₒ, E₁⟩ is equivalent to ⟨Sᵢ, Sₒ, E₂⟩, and for all x ∈ Sᵢ, Γ_f(x) = Γ′_f(x), it follows that for all x ∈ Sₒ, Γ_f(x) = Γ′_f(x). The variables in Sₒ are the only variables assigned in E₁ and E₂ that are used in computing the values of the output variables O. Since their values are the same in both programs, given the same input mapping Γᵢ we have, for all output variables x ∈ O, Γ_f(x) = Γ′_f(x). Hence, both programs are equivalent. ∎

Theorem A.6.
The transformations presented in Figure 8 and Figure 9 preserve program equivalence.

Proof.
The (Relay) transformation preserves program equivalence because, when the program terminates, the value of the relay's output variable equals the value of its input variable. The remaining transformations can be viewed as transforming an input program ⟨I, O, E ∪ E₁⟩ into ⟨I′, O′, E ∪ E₂⟩. For all transformations except (Concat-Split), the equivalence of the programs ⟨Sᵢ, Sₒ, E₁⟩ and ⟨Sᵢ, Sₒ, E₂⟩ follows from the properties of cat, relay, split, and tee, and from the properties of the parallel components of data-parallel commands f. The (Concat-Split) transformation relies on the additional property that the program produces the same output independently of how split breaks the input stream: the choice of a particular way of breaking the stream does not change the values of the program's output variables when it halts. Since ⟨Sᵢ, Sₒ, E₁⟩ is equivalent to ⟨Sᵢ, Sₒ, E₂⟩, these transformations preserve equivalence (Theorem A.5). ∎
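The split-invariance property behind (Concat-Split) can be illustrated with a small Python sketch. This is our own illustration, with a hypothetical per-line command standing in for a data-parallel f: for any command whose output on a concatenation is the concatenation of its outputs, every choice of split point yields the same final stream.

```python
def cmd(lines):
    """Hypothetical stand-in for a data-parallel command f (a per-line map)."""
    return [line.strip() for line in lines]

stream = [" a ", "b", "  c", "d "]
whole = cmd(stream)

# cmd(left) ++ cmd(right) equals cmd(stream) for every split point, so the
# transformed graph's output does not depend on how split cuts the input.
for cut in range(len(stream) + 1):
    left, right = stream[:cut], stream[cut:]
    assert cmd(left) + cmd(right) == whole

print(whole)  # ['a', 'b', 'c', 'd']
```

Commands like uniq do not satisfy this equation at chunk boundaries, which is why (Concat-Split) needs the stated additional property rather than holding for arbitrary commands.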