PaSh: Light-touch Data-Parallel Shell Processing
Nikos Vasilakis
MIT
Konstantinos Kallas
University of Pennsylvania
Konstantinos Mamouras
Rice University
Achilles Benetopoulos
Unaliated
Lazar Cvetković
University of Belgrade
Abstract
This paper presents PaSh, a system for parallelizing POSIX shell scripts. Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script—one that adds POSIX constructs to explicitly guide parallelism coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues. A lightweight annotation language allows command developers to express key parallelizability properties about their commands. An accompanying parallelizability study of POSIX and GNU commands—two large and commonly used groups—guides the annotation language and optimized aggregator library that PaSh uses. PaSh's extensive evaluation over 44 unmodified Unix scripts shows significant speedups (0.89–61.1×, avg: 6.7×) stemming from the combination of its program transformations and runtime primitives.
CCS Concepts • Software and its engineering → Compilers; Massively parallel systems; Scripting languages.
Keywords
Automatic Parallelization, Shell, Pipelines, Source-to-source compiler, POSIX, Unix
ACM Reference Format:
Nikos Vasilakis, Konstantinos Kallas, Konstantinos Mamouras, Achilles Benetopoulos, and Lazar Cvetković. 2021. PaSh: Light-touch Data-Parallel Shell Processing. In Sixteenth European Conference on Computer Systems (EuroSys '21), April 26–28, 2021, Online, United Kingdom. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3447786.3456228
The two marked authors contributed equally to the paper.
EuroSys ’21, April 26–28, 2021, Online, United Kingdom
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8334-9/21/04.
https://doi.org/10.1145/3447786.3456228
Fig. 1. PaSh overview. PaSh identifies dataflow regions (§4.1), converts them to dataflow graphs (§4.2), applies transformations (§4.3) based on the parallelizability properties of the commands in these regions (§3.1, §3.2), and emits a parallel script (§4.4) that uses custom primitives (§5).
1 Introduction
The Unix shell is an environment—often interactive—for composing programs written in a plethora of programming languages. This language-agnosticism, coupled with Unix's toolbox philosophy [33], makes the shell the primary choice for specifying succinct and simple pipelines for data processing, system orchestration, and other automation tasks. Unfortunately, parallelizing such pipelines requires significant effort shared between two different programmer groups:

Command developers, responsible for implementing individual commands such as sort, uniq, and jq. These developers usually work in a single programming language, leveraging its abstractions to provide parallelism whenever possible. As they have no visibility into the command's uses, they expose a plethora of ad-hoc command-specific flags such as -t, --parallel, -p, and -j [34, 42, 50].
Shell users, who use POSIX shell constructs to combine multiple such commands from many languages into their scripts and are thus left with only a few options for incorporating parallelism. One option is to use manual tools such as GNU parallel [52], ts [22], qsub [14], or SLURM [60]; these tools are either command-unaware, and thus at risk of breaking program semantics, or too coarse-grained, and thus only capable of exploiting parallelism at the level of entire scripts rather than individual components. Another option is to use shell primitives (such as & and wait) to explicitly induce parallelism, at a cost of manual effort to split inputs, rewrite scripts, and orchestrate execution—an expensive and error-prone process. To top it off, all these options assume a good understanding of parallelism; users whose domain of expertise lies outside computing—from hobbyists to data analysts—are left without options.
This paper presents PaSh, a system for parallelizing POSIX shell scripts, outlined in Fig. 1, that benefits both programmer groups, with emphasis on shell users. Command developers are given a set of abstractions, akin to lightweight type annotations, for expressing the parallelizability properties of their commands: rather than expressing a command's full observable behavior, these annotations focus primarily on its interaction with state. Shell users, on the other hand, are provided with full automation: PaSh analyzes their scripts and extracts latent parallelism. PaSh's transformations are conservative, in that they do not attempt to parallelize fragments that lack sufficient information—i.e., at worst, PaSh will choose to not improve performance rather than risk breakage.

To address cold-start issues, PaSh comes with a library of parallelizability annotations for commands in POSIX and GNU Coreutils. These large classes of commands serve as the shell's standard library, expected to be used pervasively. The study that led to their characterization also informed PaSh's annotation and transformation components.

These components are tied together with PaSh's runtime component. Aware of the Unix philosophy and abstractions, it packs a small library of highly optimized data aggregators as well as high-performance primitives for eager data splitting and merging. These address many practical challenges and were developed by uncovering several pathological situations, on a few of which we report.
We evaluate PaSh on 44 unmodified scripts including (i) a series of smaller scripts, ranging from classic Unix one-liners to modern data-processing pipelines, and (ii) two large and complex use cases for temperature analysis and web indexing. Speedups range between 0.89–61.1× (avg: 6.7×), with 39 out of 44 scripts seeing non-trivial speedups. PaSh's runtime primitives add to the base speedup extracted by PaSh's program transformations—e.g., 8.83× over a base 5.93× average for 10 representative Unix one-liners. PaSh accelerates a large program for temperature analysis by 2.52×, parallelizing both the computation (12.31×) and the preprocessing (2.04×) fragment (i.e., data download, extraction, and cleanup), the latter traditionally falling outside the focus of conventional parallelization systems—even though it takes 75% of the total execution time.
The paper is structured as follows. It starts by introducing the necessary background on shell scripting and presenting an overview of PaSh (§2). Sections 3–5 highlight key contributions:

§3 studies the parallelizability of shell commands, and introduces a lightweight annotation language for commands that are executable in a data-parallel manner.

§4 presents a dataflow model and associated transformations that expose data parallelism while preserving the semantics of the sequential program.

§5 details PaSh's runtime component, discussing the correctness and performance challenges it addresses.

After PaSh's evaluation (§6) and comparison with related work (§7), the paper concludes (§8).
2 Background and Overview
This section reviews Unix shell scripting through an exam-
ple (§2.1), later used to explore parallelization challenges (§2.2)
and how they are addressed by PaSh (§2.3).
2.1 Running Example: Weather Analysis
Suppose an environmental scientist wants to get a quick sense of trends in the maximum temperature across the U.S. over the past five years. As the National Oceanic and Atmospheric Administration (NOAA) has made historic temperature data publicly available [38], answering this question is only a matter of a simple data-processing pipeline.

Fig. 2's script starts by pulling the yearly index files and filtering out URLs that are not part of the compressed dataset. It then downloads and decompresses each file in the remaining set, extracts the values that indicate the temperature, and filters out bogus inputs marked as 999. It then calculates the maximum yearly temperature by sorting the values and picking the top element. Finally, it matches each maximum value with the appropriate year in order to print the result. The effort expended writing this script is low: its data-processing core amounts to 12 stages and, when expressed as a single line, is only 165 characters long. This program is no toy: a Java program implementing only the last four stages takes 137 LoC [59, §2.1]. To enable such a succinct program composition, Unix incorporates several features.
Unix Features
Composition in Unix is primarily achieved with pipes (|), a construct that allows for task-parallel execution of two commands by connecting them through a character stream. This stream is comprised of contiguous character lines separated by newline characters (NL) delineating individual stream elements. For example, Fig. 2's first grep outputs (file-name) elements containing gz, which are then consumed by tr. A special end-of-file (EOF) condition marks the end of a stream.

Different pipeline stages process data concurrently and possibly at different rates—e.g., the second curl produces output at a significantly slower pace than the grep commands before and after it. The Unix kernel facilitates scheduling, communication, and synchronization behind the scenes.

Command flags, used pervasively in Unix, are configuration options that the command's developer has decided to expose to its users to improve the command's general applicability. For example, by omitting sort's -r flag that enables reverse sorting, the user can easily get the minimum temperature. The shell does not have any visibility into these flags; after it expands special characters such as ~ and *, it leaves parsing and evaluation entirely up to individual commands.
base="ftp://ftp.ncdc.noaa.gov/pub/data/noaa";
for y in {2015..2019}; do
curl $base/$y | grep gz | tr -s " " | cut -d" " -f9 |
sed "s;^;$base/$y/;" | xargs -n 1 curl -s | gunzip |
cut -c 89-92 | grep -iv 999 | sort -rn | head -n 1 |
sed "s/^/Maximum temperature for $y is: /"
done
Fig. 2. Calculating maximum temperatures per year. The script downloads daily temperatures recorded across the U.S. for the years 2015–2019 and extracts the maximum for every year.
Finally, Unix provides an environment for composing commands written in any language. Many of these commands come with the system—e.g., ones defined by the POSIX standard or ones part of the GNU Coreutils—whereas others are available as add-ons. The fact that commands are developed in a variety of languages—including shell scripts—provides users with significant flexibility. For example, one could replace sort and head with ./avg.py to get the average rather than the maximum—the pipeline still works, as long as ./avg.py conforms to the interface outlined earlier.
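For illustration, such a drop-in replacement only needs to read newline-separated values from its standard input and write its result to standard output. A minimal sketch of this interface, written here as a shell script around awk rather than the Python script named in the example, could look as follows:

#!/bin/sh
# Hypothetical stand-in for ./avg.py: reads one numeric value per input line
# and prints the average, conforming to the same stream interface as the
# sort | head pair it replaces.
awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'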
2.2 Parallelization Challenges
While these features aid development-effort economy through powerful program composition, they complicate shell script parallelization, which even for simple scripts such as the one in Fig. 2 creates several challenges.
Commands
In contrast to restricted programming frameworks that enable parallelization by supporting a few carefully-designed primitives [6, 9, 16, 62], the Unix shell provides an unprecedented number and variety of composable commands. To be parallelized, each command may require special analysis and treatment—e.g., exposing data parallelism in Fig. 2's tr or sort would require splitting their inputs, running them on each partial input, and then merging the partial results.¹ Automating such an analysis is infeasible, as individual commands are black boxes written in a variety of programming languages and models. Manual analysis is also challenging, due to the sheer number of commands and the many flags that affect their behavior—e.g., Fig. 2's program invokes cut with two separate sets of flags.
Scripts
Another challenge is due to the language of the POSIX shell. First, the language contains constructs that enforce sequential execution: the sequential composition operator (;) in Fig. 2 indicates that the assignment to base must be completed before everything else. Moreover, the language semantics only exposes limited task-based parallelism in the form of constructs such as &. Even though Fig. 2's for focuses only on five years of data, curl still outputs thousands of lines per year; naive parallelization of each loop iteration will miss such opportunities. Any attempt to automate parallelization should be aware of the POSIX shell language, exposing latent data parallelism without modifying execution semantics.

¹ As explained earlier (§1), commands such as sort may have ad hoc flags such as --parallel, which do not compose across commands and may risk breaking correctness or not exploiting performance potential (§6.5).
Implementation
On top of command and shell semantics, the broader Unix environment has its own set of quirks. Any attempt to orchestrate parallel execution will hit challenges related to task parallelism, deadlock prevention, and runtime performance. For example, forked processes piping their combined results to Fig. 2's head may not receive a PIPE signal if head exits prior to opening all pipes. Moreover, several commands such as sort and uniq require specialized data aggregators in order to be correctly parallelized.
2.3 PaSh Design Overview
At a high level, PaSh takes as input a POSIX shell script like the one in Fig. 2 and outputs a new POSIX script that incorporates data parallelism. The degree of data parallelism sought by PaSh is configurable using a --width parameter, whose default value is system-specific. Fig. 3 highlights a few fragments of the parallel script resulting from applying PaSh with --width=2 to the script of Fig. 2—resulting in 2 copies of {grep, tr, cut, etc.}.

PaSh first identifies sections of the script that are potentially parallelizable, i.e., lack synchronization and scheduling constraints, and converts them to dataflow graphs (DFGs). It then performs a series of DFG transformations that expose parallelism without breaking semantics, by expanding the DFG to the desired width. Finally, PaSh converts these DFGs back to a shell script augmented with PaSh-provided commands. The script is handed off to the user's original shell interpreter for execution. PaSh addresses the aforementioned challenges (§2.2) as follows.
Commands
To understand standard commands available in any shell, PaSh groups POSIX and GNU commands into a small but well-defined set of parallelizability classes (§3.1). Rather than describing a command's full observable behavior, these classes focus on information that is important for data parallelism. To allow other commands to use its transformations, PaSh defines a light annotation language for describing a command's parallelizability class (§3.2). Annotations are expressed once per command rather than once per script and are aimed towards command developers rather than command users, so that they can quickly and easily capture the characteristics of the commands they develop.
Scripts
To maintain sequential semantics, PaSh first analyzes a script to identify dataflow regions containing commands that are candidates for parallelization (§4.1). This analysis is guided by the script structure: some constructs expose parallelism (e.g., &, |); others enforce synchronization (e.g., ;, ||). PaSh then converts each dataflow region to a dataflow graph (DFG) (§4.2), a flexible representation that enables a series of local transformations to expose data parallelism, converting the graph into its parallel equivalent (§4.3). Further transformations compile the DFG back to a shell script that uses POSIX constructs to guide parallelism explicitly while aiming at preserving the semantics of the sequential program (§4.4).
mkfifo $t{0,1...}
curl $base/$y > $t0 & cat $t0 | split $t1 $t2 &
cat $t1 | grep gz > $t3 &
cat $t2 | grep gz > $t4 &
...
cat $t9 | sort -rn > $t11 & cat $t10 | sort -rn > $t12 &
cat $t11 | eager > $t13 & cat $t12 | eager > $t14 &
sort -mrn $t13 $t14 > $t15 &
cat $t15 | head -n1 > $out1 &
wait $! && get-pids | xargs -n 1 kill -SIGPIPE
Fig. 3. Output of pash --width=2 for Fig. 2 (fragment). PaSh orchestrates the parallel execution through named pipes, parallel operators, and custom runtime primitives—e.g., eager, split, and get-pids.
Implementation
PaSh addresses several practical challenges through a set of constructs it provides—i.e., modular components for augmenting command composition (§5). It also provides a small and efficient aggregator library targeting a large set of parallelizable commands. All these commands live in the PATH and are addressable by name, which means they can be used like (and by) any other commands.
3 Parallelizability Classes
PaSh aims at parallelizing data-parallel commands, i.e., ones that can process their input in parallel, encoding their characteristics by assigning them to parallelizability classes. PaSh leans towards having a few coarse classes rather than many detailed ones—among other reasons, to simplify their understanding and use by command developers.

This section starts by defining these classes, along with a parallelizability study of the commands in POSIX and GNU Coreutils (§3.1). Building on this study, it develops a lightweight extensibility framework that enables light-touch parallelization of a command by its developers (§3.2). PaSh in turn uses this language to annotate POSIX and GNU commands and generate their wrappers, as presented in later sections.
3.1 Parallelizability of Standard Libraries
Broadly speaking, shell commands can be split into four major classes with respect to their parallelization characteristics, depending on what kind of state they mutate when processing their input (Tab. 1). These classes are ordered in ascending difficulty (or impossibility) of parallelization. In this order, some classes can be thought of as subsets of the next—e.g., all stateless commands are pure—meaning that the synchronization mechanisms required for any superclass would work with its subclass (but forego any performance improvements). Commands can change classes depending on their flags, which are discussed later (§3.2).
Tab. 1. Parallelizability Classes. Broadly, Unix commands can be grouped into four classes according to their parallelizability properties.

Class                     Key  Examples          Coreutils    POSIX
Stateless                 S    tr, cat, grep     13 (12.5%)   19 (12.7%)
Parallelizable Pure       P    sort, wc, head    17 (16.3%)   13 (8.7%)
Non-parallelizable Pure   N    sha1sum           13 (12.5%)   11 (7.3%)
Side-effectful            E    env, cp, whoami   61 (58.6%)   105 (70.4%)
Stateless Commands
The first class, S, contains commands that operate on individual line elements of their input, without maintaining state across invocations. These are commands that can be expressed as a purely functional map or filter—e.g., grep filters out individual lines and basename removes a path prefix from a string. They may produce multiple elements—e.g., tr may insert NL tokens—but always return empty output for empty input. Workloads that use only stateless commands are trivial to parallelize: they do not require any synchronization to maintain correctness, nor caution about where to split inputs.
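To make the intuition concrete, the following hand-written sketch (not PaSh output; file names are illustrative and the -n l/2 option is a GNU split extension) parallelizes a stateless grep over two halves of its input and concatenates the partial outputs in order:

# split the input into two chunks at line boundaries (files part.aa, part.ab)
split -n l/2 input.txt part.
# run the stateless command on each chunk in parallel
grep "foo" part.aa > out.aa &
grep "foo" part.ab > out.ab &
wait
# concatenating the partial outputs yields the same result as
# running: grep "foo" input.txt
cat out.aa out.ab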
The choice of line as the data element strikes a convenient balance between coarse-grained (files) and fine-grained (characters) separation while staying aligned with Unix's core abstractions. This choice can affect the allocation of commands in S, as many of its commands (about 1/3) are stateless within a stream element—e.g., tr transliterates characters within a line, one at a time—enabling further parallelization by splitting individual lines. This feature may seem of limited use, as these commands are computationally inexpensive, precisely due to their narrow focus. However, it may be useful for cases with very large stream elements (i.e., long lines) such as the .fastq format used in bioinformatics.
Parallelizable Pure Commands
The second class, P, contains commands that respect functional purity—i.e., same outputs for same inputs—but maintain internal state across their entire pass. The details of this state and its propagation during element processing affect their parallelizability characteristics. Some commands are easy to parallelize, because they maintain trivial state and are commutative—e.g., wc simply maintains a counter. Other commands, such as sort, maintain more complex invariants that have to be taken into account when merging partial results.

Often these commands do not operate in an online fashion, but need to block until the end of a stream. A typical example of this is sort, which cannot start emitting results before the last input element has been consumed. Such constraints affect task parallelism, but not data parallelism: sort can be parallelized significantly using divide-and-conquer techniques—i.e., by encoding it as a group of (parallel) map functions followed by an aggregate that merges the results.
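As a concrete sketch of this divide-and-conquer structure (hand-written, not PaSh's emitted code; file names are illustrative), sort's map phase is sort itself applied to each chunk, and its aggregate phase is the merge step that sort already exposes as sort -m:

# map phase: sort each chunk independently and in parallel
sort part.aa > sorted.aa &
sort part.ab > sorted.ab &
wait
# aggregate phase: merge the already-sorted partial results
sort -m sorted.aa sorted.ab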
Non-parallelizable Pure Commands
The third class, N, contains commands that, while purely functional, cannot
be parallelized within a single data stream.² This is because their internal state depends on prior state in non-trivial ways over the same pass. For example, hashing commands such as sha1sum maintain complex state that has to be updated sequentially. If parallelized on a single input, each stage would need to wait on the results of all previous stages, foregoing any parallelism benefits.

It is worth noting that while these commands are not parallelizable at the granularity of a single input, they are still parallelizable across different inputs. For example, a web crawler involving hashing to compare individual pages would allow sha1sum to proceed in parallel for different pages.
Side-eectful Commands
The last class,
E
, contains
commands that have side-eects across the system—for ex-
ample, updating environment variables, interacting with the
lesystem, and accessing the network. Such commands are
not parallelizable without ner-grained concurrency control
mechanisms that can detect side-eects across the system.
This is the largest class, for two main reasons. First, it
includes commands related to the le-system—a central ab-
straction of the Unix design and philosophy [
46
]. In fact,
Unix uses the le-system as a proxy to several le-unrelated
operations such as access control and device driving. Second,
this class contains commands that do not consume input
or do not produce output—and thus are not amenable to
data parallelism. For example,
date
,
uname
, and
finger
are all
commands interfacing with kernel- or hardware-generated
information and do not consume any input from user pro-
grams.
3.2 Extensibility Framework
To address the challenge of a language-agnostic environment (§2.2), PaSh allows command developers to communicate key details about their commands' parallelizability through a lightweight extensibility framework comprising two components: an annotation language, and an interface for developing parallel command aggregators. The framework can be used both by developers of new commands as well as by developers maintaining existing commands. The latter group can express additions or changes to a command's implementation or interface, which is important as commands are maintained or extended over long periods of time.

The extensibility framework is expected to be used by individuals who understand the commands and their parallelizability properties, and thus PaSh assumes their correctness. The framework could be used as a foundation for crowdsourcing the annotation effort, for testing annotation records, and for generating command aggregators. We use this extension framework in a separate work to synthesize command aggregators automatically [57].
² Note that these commands may still be parallelizable across different data streams, for example when applied to different input files.
Key Concerns
PaSh's annotations focus on three crucial concerns about a command: (C1) its parallelizability class, (C2) its inputs and outputs, and the characteristics of its input consumption, and (C3) how flags affect its class, inputs, and outputs. The first concern was discussed extensively in the previous section; we now turn to the latter two.

Manipulating a shell script in its original form to expose parallelism is challenging as each command has a different interface. Some commands read from standard input, while others read from input files. Ordering here is important, as a command may read several inputs in a predefined input order. For example, grep "foo" f1 - f2 first reads from f1, then shifts to its standard input, and finally reads f2.
Additionally, commands expose flags or options for allowing users to control their execution. Such flags may directly affect a command's parallelizability classification as well as the order in which it reads its inputs. For example, cat defaults to S, but with -n it jumps into P because it has to keep track of a counter and print it along with each line.
To address all these concerns, PaSh introduces an annotation language encoding first-order logic predicates. The language allows specifying the aforementioned information, i.e., correspondence of arguments to inputs and outputs and the effects of flags. Annotations assign one of the four parallelizability classes as a default class, subsequently refined by the set of flags the command exposes. Additionally, for commands in S and P, the language captures how a command's arguments, standard input, and standard output correspond to its inputs and outputs. Annotations in these classes can also express ordering information about these inputs—effectively lifting commands into a more convenient representation where they only communicate with their environment through a list of input and output files.

The complete annotation language currently contains 8 operators, one of which supports regular expressions. It was used to annotate 47 commands, totaling 708 lines of JSON—an effort that took about 3–4 hours. Annotation records are by default conservative so as to not jeopardize correctness, but can be incrementally refined to capture parallelizability when using increasingly complex combinations of flags. The language is extensible with more operators (as long as the developer defines their semantics); it also supports writing arbitrary Python code for commands whose properties are difficult to capture—e.g., higher-order xargs, whose parallelizability class depends on the class of the first-order command that it invokes.
Example Annotations
Two commands whose annotations sit at opposing ends of the complexity spectrum are chmod and cut. The fragment below shows the annotation for chmod.
{ "command": "chmod",
"cases": [ { "predicate": "default",
"class": "side-effectful" } ] }
This annotation is simple, but serves as an illustration of the annotation structure. Each annotation is a JSON record that contains the command name, and a sequence of cases. Each case contains a predicate that matches on the arguments of the command invocation. It assigns a parallelizability class (C1) to a specific command instance, i.e., the combination of its input-output consumption (C2) and its invocation arguments (C3). In this case, chmod is side-effectful, and thus the "default" predicate of its single cases value always matches—indicating the presence of side-effects.
The annotation for cut is significantly more complex, and is only shown in part (the full annotation is in Appendix B). This annotation has two cases, each of which consists of a predicate on cut's arguments and an assignment of its parallelizability class, inputs, and outputs as described above. We only show cut's first predicate, slightly simplified for clarity.
{ "predicate": {"operator": "exists", "operands": [ "-z" ]},
"class": "n-pure",
"inputs": [ "args[:]" ],
"outputs": [ "stdout" ] }
This predicate indicates that if cut is called with -z as an argument, then it is in N, i.e., it only interacts with the environment by writing to a file (its stdout) but cannot be parallelized. This is because -z forces cut to delimit lines with NUL instead of newline, meaning that we cannot parallelize it by splitting its input on line boundaries. The case also indicates that cut reads its inputs from its non-option arguments.

Experienced readers will notice that cut reads its input from its stdin if no file argument is present. This is expressed in the "options" part of cut's annotation, shown below:
{ "command": "cut",
"cases": [ ... ],
"options": [ "empty-args-stdin",
"stdin-hyphen" ] }
Option "empty-args-stdin" indicates that if non-option arguments are empty, then the command reads from its stdin. Furthermore, option "stdin-hyphen" indicates that a non-option argument that is just a dash (-) represents the stdin. The complete annotation (in Appendix B) shows the rest of the cases (including the default case for cut, which indicates that it is in S).
Custom Aggregators
For commands in S, the annotations are enough to enable parallelization: commands are applied to parts of their input in parallel, and their outputs are simply concatenated.

To support the parallelization of arbitrary commands in P, PaSh allows supplying custom map and aggregate functions. In line with the Unix philosophy, these functions can be written in any language as long as they conform to a few invariants: (i) map is in S and aggregate is in P, (ii) map can consume (or extend) the output of the original command and aggregate can consume (and combine) the results of multiple map invocations, and (iii) their composition produces the same output as the original command. PaSh can use the map and aggregate functions in its graph transformations (§4) to further expose parallelism.

Most commands only need an aggregate function, as the map function for many commands is the sequential command itself. PaSh defines a set of aggregators for many POSIX and GNU commands in P. This set doubles as both PaSh's standard library and an exemplar for community efforts tackling other commands. Below is the Python code for one of the simplest aggregate functions, the one for wc:
#!/usr/bin/python
import sys, os, functools, utils

def parseLine(s):
    # one wc output line -> list of integer counts
    return map(int, s.split())

def emitLine(t):
    f = lambda e: str(e).rjust(utils.PAD_LEN, ' ')
    return [" ".join(map(f, t))]

def agg(a, b):
    # add two partial wc results field by field
    if not a:
        return b
    az = parseLine(a[0])
    bz = parseLine(b[0])
    return emitLine([(i + j) for (i, j) in zip(az, bz)])

utils.help()
res = functools.reduce(agg, utils.read_all(), [])
utils.out("".join(res))
The core of the aggregator, function agg, takes two input streams as its arguments. The reduce function lifts the aggregator to arity n to support an arbitrary number of parallel map commands. This lifting allows developers to think of aggregators in terms of two inputs, but generalize them to operate on many inputs. Utility functions such as read and help, common across PaSh's aggregator library, deal with error handling when reading multiple file descriptors, and offer a -h invocation flag that demonstrates the use of each aggregator.

PaSh's library currently contains over 20 aggregators, many of which are usable by more than one command or flag. For example, the aggregator shown above is shared among wc, wc -lw, wc -lm, etc.
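Assuming the aggregator above is installed in the PATH under a name such as agg_wc and takes the partial results as file arguments (both the name and the exact interface are illustrative, not PaSh's own), parallelizing wc over two chunks would follow the usual map/aggregate shape:

# map phase: plain wc is its own map function, run once per chunk
wc < part.aa > count.aa &
wc < part.ab > count.ab &
wait
# aggregate phase: add the partial counts field by field, as agg() does above
agg_wc count.aa count.ab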
4 Dataow Graph Model
PaSh’s core is an abstract dataow graph (DFG) model (§4.2)
used as the intermediate representation on which PaSh per-
forms parallelism-exposing transformations. PaSh rst lifts
sections of the input script to the DFG representation (§4.1),
then performs transformations to expose parallelism (up to
the desired
--width
) (§4.3), and nally instantiates each DFG
back to a parallel shell script (§4.4). A fundamental charac-
teristic of PaSh’s DFG is that it encodes the order in which
a node reads its input streams (not just the order of input
6
54
Fig. 4. From a script AST to DFGs. The AST on the left has two dataflow regions, each not extending beyond && (Cf. §4.1). Identifiers f1, f2, and f3 sit at the boundary of the DFG.
To the extent possible, this section is kept informal and intuitive. The full formalization of the dataflow model, the bidirectional translations between the shell and the DFG, and the parallelizing transformations, as well as their proof of correctness with respect to the script's sequential output, are all presented in a separate work [20].
4.1 Frontend: From a Sequential Script to DFGs
Dataow Regions
In order to apply the graph transforma-
tions that expose data parallelism, PaSh rst has to identify
program sub-expressions that can be safely transformed to
a dataow graph, i.e., sub-expressions that (i) do not impose
any scheduling or synchronization constraints (e.g., by using
;
), and (ii) take a set of les as inputs and produce a set of
les as outputs. The search for these regions is guided by
the shell language and the structure of a particular program.
These contain information about (i) fragments that can be
executed independently, and (ii) barriers that are natural
synchronization points. Consider this code fragment (Fig. 4):
cat f1 f2 | grep "foo" > f3 && sort f3
The cat and grep commands execute independently (and concurrently) in the standard shell, but sort waits for their completion prior to starting. Intuitively, dataflow regions correspond to sub-expressions of the program that would be allowed to execute independently by different processes in the POSIX standard [18]. Larger dataflow regions can be composed from smaller ones using the pipe operator (|) and the parallel-composition operator (&). Conversely, all other operators, including sequential composition (;) and logical operators (&&, ||), represent barrier constructs that do not allow dataflow regions to expand beyond them.
Translation Pass
PaSh's front-end performs a depth-first search on the AST of the given shell program. During this pass, it extends the dataflow regions bottom-up, translating their independent components to DFG nodes until a barrier construct is reached. All AST subtrees not translatable to DFGs are kept as they are. The output of the translation pass is the original AST where dataflow regions have been replaced with DFGs.

To identify opportunities for parallelization, the translation pass extracts each command's parallelizability class together with its inputs and outputs. To achieve this for each command, it searches all its available annotations (§3.2) and resorts to conservative defaults if none is found. If the command is in S, P, or N, the translation pass initiates a dataflow region that is propagated up the tree.

Due to the highly dynamic nature of the shell, some information is not known to PaSh at translation time. Examples of such information include the values of environment variables, unexpanded strings, and sub-shell constructs. For the sake of correctness, PaSh takes a conservative approach and avoids parallelizing nodes for which it has incomplete information. It will not attempt to parallelize sub-expressions for which the translation pass cannot infer that, e.g., an environment variable passed as an argument to a command does not change its parallelizability class.
4.2 Dataow Mo del Denitions
The two main shell abstractions are (i) data streams, i.e., les
or pipes, and (ii) commands, communicating through these
streams.
Edges—Streams
Edges in the DFG represent streams, the
basic data abstraction of the shell. They are used as commu-
nication channels between nodes in the graph, and as the
input or output of the entire graph. For example, the edges
in DFG1 of Figure 4 are the les
f1
,
f2
, and
f3
, as well as the
unnamed pipe that connects
cat
and
grep
. We x the data
quantum to be character lines, i.e., sequences of characters
followed by the newline character,
3
so edges represent pos-
sibly unbounded sequences of lines. As seen above, an edge
can either refer to a named le, an ephemeral pipe, or a Unix
FIFO used for interprocess communication. Edges that do
not start from a node in the graph represent the graph inputs;
edges that do not point to a node in the graph represent its
outputs.
Nodes—Commands
A node of the graph represents a relation (to capture nondeterminism) from a possibly empty list of input streams to a list of output streams. This representation captures all the commands in the classes S, P, and N, since they only interact with the environment by reading and writing to streams. We require that nodes are monotone, namely that they cannot retract output once they have produced it. As an example, cat, grep, and sort are the nodes in the DFGs of Figure 4.
Streaming Commands
A large subset of the parallelizable S and P classes falls into the special category of streaming commands. These commands have two execution phases. First, they consume a (possibly empty) set of input streams that act as configuration. Then, they transition to the second phase, where they consume the rest of their inputs sequentially, one element at a time, in the order dictated by the configuration phase, and produce a single output stream.
³ This is a choice that is not baked into PaSh's DFG model, which supports arbitrary data elements such as characters and words, but was made to simplify alignment with many Unix commands.
Fig. 5. Stateless parallelization transformation. The cat node is commuted with the stateless node to exploit available data parallelism.
conguration phase and produce a single output stream. The
simplest example of a streaming command is
cat
, which has
an empty rst phase and then consumes its inputs in order,
producing their concatenation as output. A more interesting
example is
grep
invoked with
-f patterns.txt
as arguments;
it rst reads
patterns.txt
as its conguration input, identi-
fying the patterns for which to search on its input, and then
reads a line at a time from its standard input, stopping when
it reaches EOF.
4.3 Graph Transformations
PaSh denes a set of semantics-preserving graph transfor-
mations that act as parallelism-exposing optimizations. Both
the domain and range of these transformations is a graph in
PaSh’s DFG model; transformations can be composed arbi-
trarily and in any order. Before describing the dierent types
of transformations, we formalize the intuition behind classes
S
and
P
described informally earlier (§3.1).
Stateless and Parallelizable Pure Commands
Stateless commands such as tr operate independently on individual lines of their input stream without maintaining any state (§3.1). To avoid referring to the internal command state, we can instead determine that a command is stateless if its output is the same if we "restart" it after it has read an arbitrary prefix of its input. If a command was stateful, then it would not produce the same output after the restart. Formally, a streaming command f is stateless if it commutes with the operation of concatenation on its streaming input, i.e., it is a semigroup homomorphism:

∀x, x′, c:  f(x · x′, c) = f(x, c) · f(x′, c)

In the above, x · x′ is the concatenation of the two parts of f's streaming input and c is the configuration input (which needs to be passed to both instances of f). The above equation means that applying the command f to a concatenation of two inputs x, x′ produces the same output as applying f to each input x, x′ separately, and concatenating the outputs. Note that we only focus on deterministic stateless commands, which is why f is a function and not a relation in the above.
Pure commands such as sort and wc can also be parallelized, using divide-and-conquer parallelism. These commands can be applied independently on different segments of their inputs, and then their outputs are aggregated to produce the final result. More formally, these pure commands f can be implemented as a combination of a function map and an associative function aggregate that satisfy the following equation:

∀x, x′, c:  f(x · x′, c) = aggregate(map(x, c), map(x′, c), c)
Fig. 6. Auxiliary transformations. These augment the DFG with cat, split, and relay nodes.
Parallelization Transformations
Based on these equations, we can define a parallelization transformation on a node v ∈ S whose streaming input is a concatenation, i.e., produced using the command cat, of n input streams and is followed by a node f (Fig. 5). The transformation replaces v with n new nodes, routing each of the n input streams to one of them, and commutes the cat node after them to concatenate their outputs and transfer them to f. Formally:

v(x1 · x2 ⋯ xn, s)  →  v(x1, s) · v(x2, s) ⋯ v(xn, s)

The transformation can be extended straightforwardly to nodes v ∈ P, implemented by a (map, aggregate) pair:

v(x1 · x2 ⋯ xn, s)  →  aggregate(map(x1, s), map(x2, s), ..., map(xn, s), s)

Both transformations can be shown to preserve the behavior of the original graph, assuming that the pair (map, aggregate) meets the three invariants outlined earlier (§3.2) and the aforementioned equations hold.
Auxiliary Transformations
PaSh also performs a set of auxiliary transformations t1–t3 that are depicted in Fig. 6. If a node has many inputs, t1 concatenates these inputs by inserting a cat node to enable the parallelization transformations. In cases where a parallelizable node has one input and is not preceded by a concatenation, t2 inserts a cat node that is preceded by its inverse split, so that the concatenation can be commuted with the node. Transformation t3 inserts a relay node that performs the identity transformation. Relay nodes can be useful for performance improvements (§5), as well as for monitoring and debugging.
Degree of Parallelism
The degree of parallelism achieved by PaSh is affected by the width of the final dataflow graph. The dataflow width corresponds, intuitively, to the number of data-parallel copies of each node of the sequential graph and thus the fanout of the split nodes that PaSh introduces. The dataflow width is configured using the --width parameter, which can be chosen by the user depending on their script characteristics, input data, and target execution environment. By default, PaSh assigns a width of 2 if it is executing on a machine with 2–16 processors, and floor(cpu_cores/8) if it is executing on a machine with more than 16 processors.
This is a conservative limit that achieves benefits due to parallelism but does not consume all system resources. It is not meant to be optimal, and as shown in our evaluation, different scripts achieve optimal speedup with different --width values, which indicates an interesting direction for future work.
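A shell rendering of this default heuristic (a sketch only; PaSh computes the value internally, and nproc is the GNU coreutils way to query the processor count) would be:

# default --width: 2 on small machines, floor(cores/8) on larger ones
cores=$(nproc)
if [ "$cores" -le 16 ]; then
  width=2
else
  width=$(( cores / 8 ))
fi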
4.4 Backend: From DFGs to a Parallel Shell Script
After applying transformations (§4.3), PaSh translates all DFGs back into a shell script. Nodes of the graph are instantiated with the commands and flags they represent, and edges are instantiated as named pipes. A prologue in the script creates the necessary intermediate pipes, and a trap call takes care of cleaning up when the script aborts.
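The emitted prologue and cleanup hook follow a pattern roughly like the sketch below (pipe names are illustrative, not the ones PaSh generates):

# prologue: named pipes that carry the intermediate streams of the DFG
mkfifo /tmp/pash_t1 /tmp/pash_t2
# cleanup: remove the pipes if the script exits or is interrupted
trap 'rm -f /tmp/pash_t1 /tmp/pash_t2' EXIT INT TERM
# ... DFG nodes instantiated as commands reading/writing these pipes ...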
5 Runtime
This section describes technical challenges related to the
execution of the resulting script and how they are addressed
by PaSh’s custom runtime primitives.
Overcoming Laziness
The shell’s evaluation strategy is
unusually lazy, in that most commands and shell constructs
consume their inputs only when they are ready to process
more. Such laziness leads to CPU underutilization, as com-
mands are often blocked when their consumers are not re-
questing any input. Consider the following fragment:
mkfifo t1 t2
grep "foo" f1 > t1 & grep "foo" f2 > t2 & cat t1 t2
The cat command will consume input from t2 only after it completes reading from t1. As a result, the second grep will remain blocked until the first grep completes (Fig. 7a).
To solve this, one might be tempted to replace FIFOs with files, a central Unix abstraction, simulating pipes of arbitrary buffering (Fig. 7b). Aside from severe performance implications, naive replacement can lead to subtle race conditions, as a consumer may reach EOF before a producer. Alternatively, consumers could wait for producers to complete before opening the file for reading (Fig. 7c); however, this would insert artificial barriers impeding task-based parallelism and wasting disk resources—that is, this approach allows for data parallelism to the detriment of task parallelism.

To address this challenge, PaSh inserts and instantiates eager relay nodes at these points (Fig. 7d). These nodes feature tight multi-threaded loops that consume input eagerly while attempting to push data to the output stream, forcing upstream nodes to produce output when possible while also preserving task-based parallelism. In PaSh's evaluation (§6), these primitives have the names presented in Fig. 7.
Splitting Challenges
To offer data parallelism, PaSh needs to split an input data stream into multiple chunks that are operated upon in parallel. Such splitting is needed at least once at the beginning of a parallel fragment, and possibly every time within the parallel program when an aggregate function of a stage merges data into a single stream.
Fig. 7. Eager primitive. Addressing intermediary laziness is challenging: (a) FIFOs are blocking; (b) files alone introduce race conditions between producer/consumer; (c) files + wait inhibit task parallelism. Eager relay nodes (d) address the challenge while remaining within the PaSh model.
To achieve this, PaSh's transformations insert split nodes that correspond to a custom split command. For split to be effective, it needs to disperse its input uniformly. PaSh does not do this in a round-robin fashion, as that would require augmenting the data stream with additional metadata to maintain FIFO ordering—a challenge for both performance and correctness. PaSh instead splits chunks in order, which necessitates knowledge of the input size beforehand, knowledge that is not always available. To address this challenge, PaSh provides a split implementation that first consumes its complete input, counts its lines, and then splits it uniformly across the desired number of outputs. PaSh also inserts eager relay nodes after all split outputs (except for the last one) to address laziness as described above.
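The behavior of this split can be emulated for two outputs with standard tools; the sketch below (temporary and output file names are illustrative) mirrors the consume-count-split sequence just described:

# sketch of in-order splitting into two chunks
tmp=$(mktemp)
cat > "$tmp"                               # consume the complete input first
lines=$(wc -l < "$tmp")                    # count its lines
half=$(( (lines + 1) / 2 ))
head -n "$half" "$tmp" > chunk1            # first half, in input order
tail -n +"$(( half + 1 ))" "$tmp" > chunk2 # remaining lines
rm -f "$tmp"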
Dangling FIFOs and Zombie Producers
Under normal operation, a command exits after it has produced and sent all its results to its output channel. If the channel is a pipe and its reader exits early, the command is notified to stop writing early. In Unix, this notification is achieved by an out-of-band error mechanism: the operating system delivers a PIPE signal to the producer, notifying it that the pipe's consumer has exited. This handling is different from the error handling for other system calls and unusual compared to non-Unix systems,⁴ primarily because pipes and pipelines are at the heart of Unix. Unfortunately though, if a pipe has not been opened for writing yet, Unix cannot signal this condition. Consider the following script:
mkfifo fifo1 fifo2
cat in1 > fifo1 & cat in2 > fifo2 &
cat fifo1 fifo2 | head -n 1 & wait
In the code above, head exits early, causing the last cat to exit before opening fifo2. As a result, the second cat never receives a PIPE signal that its consumer exited—after all, fifo2 never even had a consumer! This, in turn, leaves the second cat unable to make progress, as it is both blocked and unaware of its consumer exiting. Coupled with wait at the end, the entire snippet reaches a deadlock.
This problem is not unique to PaSh; it occurs even when manually parallelizing scripts using FIFOs (but not when using, e.g., intermediary files, Cf. §5, Laziness). It is exacerbated, however, by PaSh's use of the cat fifo1 fifo2 construct, used pervasively when parallelizing commands in S.
⁴ For example, Windows indicates errors for WriteFile using its return code—similar to DeleteFile and other Win32 functions.
Tab. 2. Summary of Unix one-liners. Structure summarizes the different classes of commands used in the script. Input and seq. time report on the input size fed to the script and the timing of its sequential execution. Nodes and compile time report on PaSh's resulting DFG size (which is equal to the number of resulting processes and includes aggregators, eager, and split nodes) and compilation time for two indicative --widths.

Script            Structure     Input   Seq. Time   #Nodes (16, 64)  Compile Time (16, 64)  Highlights
nfa-regex         3×S           1 GB    79m35.197s  49, 193          0.056s, 0.523s         complex NFA regex
sort              S, P          10 GB   21m46.807s  77, 317          0.090s, 1.083s         sorting
top-n             2×S, 4×P      10 GB   78m45.872s  96, 384          0.145s, 1.790s         double sort, uniq reduction
wf                3×S, 3×P      10 GB   22m30.048s  96, 384          0.147s, 1.809s         double sort, uniq reduction
spell             4×S, 3×P      3 GB    25m7.560s   193, 769         0.335s, 4.560s         comparisons (comm)
difference        2×S, 2×P, N   10 GB   25m49.097s  125, 509         0.186s, 2.341s         non-parallelizable diffing
bi-grams          3×S, 3×P      3 GB    38m9.922s   185, 761         0.313s, 4.310s         stream shifting and merging
set-difference    5×S, 2×P, N   10 GB   51m32.313s  155, 635         0.316s, 4.358s         two pipelines merging to a comm
sort-sort         S, 2×P        10 GB   31m26.147s  154, 634         0.293s, 3.255s         parallelizable P after P
shortest-scripts  5×S, 2×P      85 MB   28m45.900s  142, 574         0.328s, 4.657s         long S pipeline ending with P
Fig. 8. Runtime setup lattice. Parallel No Eager and Blocking Eager improve over sequential, but are not directly comparable. PaSh w/o Split adds PaSh's optimized eager relay, and PaSh uses all primitives in §5 (Fig. 9).
To solve this problem, PaSh emits cleanup logic that operates from the end of the pipeline towards its start. The emitted code first gathers the IDs of the output processes and passes them as parameters to wait; this causes wait to block only on the output producers of the dataflow graph. Right after wait, PaSh inserts a routine that delivers PIPE signals to any remaining processes upstream.
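Adapting the tail of Fig. 3, the emitted epilogue follows roughly this pattern (a sketch; get-pids stands for PaSh's runtime primitive that gathers the IDs of the remaining upstream processes, and the pipe names are those of Fig. 3):

cat $t15 | head -n 1 > $out1 &   # final output producer of the DFG
wait $!                          # block only on the graph's output producer
# deliver PIPE to upstream processes still blocked on never-opened FIFOs
get-pids | xargs -n 1 kill -SIGPIPE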
Aggregator Implementations
As discussed earlier, commands in P can be parallelized using a map and an aggregate stage (§3). PaSh implements aggregate for several commands in P to enable parallelization. A few interesting examples are aggregate functions for (i) sort, which amounts to the merge phase of a merge-sort (and on GNU systems is implemented as sort -m), (ii) uniq and uniq -c, which need to check conditions at the boundary of their input streams, (iii) tac, which consumes stream descriptors in reverse order, and (iv) wc, which adds inputs with an arbitrary number of elements (e.g., wc -lw or wc -lwc etc.). The aggregate functions iterate over the provided stream descriptors, i.e., they work with more than two inputs, and apply pure functions at the boundaries of input streams (with the exception of sort, which has to interleave inputs).
6 Evaluation
This section reports on whether PaSh can indeed offer performance benefits automatically and correctly, using several scripts collected from the wild along with a few micro-benchmarks for targeted comparisons.
Highlights
This paragraph highlights results for width=16, but PaSh's evaluation reports on varying widths (2–64). Overall, applying PaSh to all 44 unmodified scripts accelerates 39 of them by 1.92–17.42×; for the rest, the parallel performance is comparable to the sequential (0.89, 0.91, 0.94, 0.99, 1.01×). The total average speedup over all 44 benchmarks is 6.7×. PaSh's runtime primitives offer significant benefits—for the 10 scripts that we measured with and without the runtime primitives, they bump the average speedup from 5.9× to 8.6×. PaSh significantly outperforms sort --parallel, a hand-tuned parallel implementation, and performs better than GNU parallel, which returns incorrect results if used without care.

Using PaSh's standard library of annotations for POSIX and GNU commands (§3), the vast majority of programs (>40, with >200 commands) require no effort to parallelize other than invoking PaSh; only 6 (<3%) commands, outside this library, needed a single-record annotation (§6.4).

In terms of correctness, PaSh's results on multi-GB inputs are identical to the sequential ones. Scripts feature ample opportunities for breaking semantics (§6.5), which PaSh avoids.
Setup
PaSh was run on 512GB of memory and 64 physical ×2.1GHz Intel Xeon E5-2683 cores, Debian 4.9.144-3.1, GNU Coreutils 8.30-3, GNU Bash 5.0.3(1), and Python 3.7.3—without any special configuration in hardware or software. Except as otherwise noted, (i) all pipelines are set to (initially) read from and (finally) write to the file-system, and (ii) curl fetches data from a different physical host on the same network connected by 1Gbps links.
We note a few characteristics of our setup that minimize statistical non-determinism: (1) our evaluation experiments take several hours to complete (about 23 hours for the full set), (2) our experimental infrastructure is hosted on premises, not shared with other groups or researchers, (3) the runtime does not include any managed runtimes, virtualization, or containment,⁵ and (4) many commands are repeated many times—for example, there are more than 40 instances of grep in our benchmark set. The set of benchmarks also executes with smaller inputs multiple times a week (using continuous integration), reporting minimal statistical differences between runs.
⁵ While PaSh is available via Docker too, all results reported in this paper are from non-containerized executions.
Fig. 9. PaSh's speedup for width=2–64. Different configurations per benchmark: (i) PaSh: the complete implementation with eager and split enabled, (ii) PaSh w/o split: eager enabled (no split), (iii) Blocking Eager: only blocking eager enabled (no split), (iv) No Eager: both eager and split disabled. For some pairs of configurations, PaSh produces identical parallel scripts and thus only one is shown.
Parallelism
PaSh's degree of parallelism is configured by the --width flag (§4.3). PaSh does not control a script's initial parallelism (e.g., a command could spawn 10 processes), and thus the resulting scripts often reach maximum parallelization benefits with a value of width smaller than the number of physical cores available in our setup (in our case 64).
6.1 Common Unix One-liners
We rst evaluate PaSh on a set of popular, common, and clas-
sic Unix pipeline patterns [
3
,
4
,
53
]. The goal is to evaluate
performance benets due to PaSh’s (i) DFG transformations
alone, including how
--width
aects speedup, and (ii) run-
time primitives, showing results for all points on the runtime
conguration lattice (Fig. 8).
Programs
Tab. 2 summarizes the first collection of programs. NFA-Regex is centered around an expensive NFA-based backtracking expression and all of its commands are in S. Sort is a short script centered around a P command. Wf and Top-n are based on McIlroy's classic word-counting program [4]; they use sorting, rather than tabulation, to identify high-frequency terms in a corpus. Spell, based on the original spell developed by Johnson [3], is another Unix classic: after some preprocessing, it makes clever use of comm to report words not in a dictionary. Shortest-scripts extracts the 15 shortest scripts in the user's PATH, using the file utility and a higher-order wc via xargs [53, pg. 7]. Diff and Set-diff compare streams via a diff (in N, non-parallelizable) and comm (in P), respectively. Sort-sort uses consecutive P commands without interleaving them with commands that condense their input size (e.g., uniq). Finally, Bi-grams replicates and shifts a stream by one entry to calculate bigrams.
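For reference, the following are illustrative renditions of two of these classics, Top-n word counting (after McIlroy) and Spell (after Johnson), written from their well-known textbook forms; corpus.txt and document.txt are placeholder inputs, and the benchmark scripts may differ in flags and preprocessing.

# Top-n: report the 10 most frequent words in a corpus.
tr -cs 'A-Za-z' '\n' < corpus.txt |
  tr 'A-Z' 'a-z' |
  sort |
  uniq -c |
  sort -rn |
  head -n 10

# Spell: report words that do not appear in the system dictionary.
tr -cs 'A-Za-z' '\n' < document.txt |
  tr 'A-Z' 'a-z' |
  sort -u |
  comm -23 - /usr/share/dict/words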
Results
Fig. 9 presents PaSh's speedup as a function of width=2–64. Average speedups of the optimized PaSh, i.e., with eager and split enabled, for width={2, 4, 8, 16, 32, 64} are {1.97, 3.5, 5.78, 8.83, 10.96, 13.47}×, respectively. For No Eager, i.e., PaSh's transformations without its runtime support, speedups drop to {1.63, 2.54, 3.86, 5.93, 7.46, 9.35}×.
Plots do not include lines for configurations that lead to identical parallel programs. There are two types of such cases. In the first, the PaSh (blue) and PaSh w/o Split (red, hidden) lines are identical for scripts where PaSh does not add split, as the width of the DFG is constant; conversely, when both lines are shown (e.g., Spell, Bi-grams, and Sort), PaSh has added splits due to changes in the DFG width (e.g., due to an N command). In the second type, PaSh w/o Split (red) is identical to No Eager (green, hidden) and Blocking Eager (orange, hidden) because the input script features a command in P or N relatively early. This command requires an aggregator, whose output is of width 1, beyond which PaSh w/o Split configurations are sequential and thus see no speedup. Finally, Tab. 2 shows that PaSh's transformation time is negligible, and its COST [35], i.e., the degree-of-parallelism threshold over which PaSh starts providing absolute execution-time benefits, is 2.
Discussion
As expected, scripts with commands only in S see linear speedup. PaSh's split benefits scripts with P or N commands, without negative effects on the rest. PaSh's eager primitive improves over No Eager and Blocking Eager for all scripts. No Eager is usually faster than Blocking Eager since it allows its producer and consumer to execute in parallel. Sort-sort illustrates the full spectrum of primitives: (i) PaSh w/o Split offers benefits despite the lack of split because it fully parallelizes the first sort, and (ii) PaSh gets full benefits because splitting allows parallelizing the second sort too.
As described earlier, PaSh often achieves the maximum possible speedup for a width that is lower than the number of available cores—i.e., width=16–32 for a 64-core system. This is also because PaSh's runtime primitives spawn new processes—e.g., Sort with width=8 spawns 37 processes: 8 tr, 8 sort, 7 aggregate, and 14 eager processes.
Take-aways
PaSh accelerates scripts by up to 60×, depending on the characteristics of the commands involved in a script. Its runtime constructs improve over the baseline speedup achieved by its parallelization transformations.
6.2 Unix50 from Bell Labs
We now turn to a set of Unix pipelines found out in the wild.
Programs
In a recent celebration of Unix's 50-year legacy, Bell Labs created 37 challenges [29] solvable by Unix pipelines. The problems were designed to highlight Unix's modular philosophy [33]. We found unofficial solutions to all-but-three problems on GitHub [5], expressed as pipelines with 2–12 stages (avg.: 5.58). They make extensive use of standard commands under a variety of flags, and appear to be written by non-experts (contrary to §6.1, they often use sub-optimal or non-Unix-y constructs). PaSh executes each pipeline as-is, without any modification.
Results
Fig. 10 shows the speedup (left) over the sequential runtime (right) for 31 pipelines, with width=16 and 10GB inputs. It does not include 3 pipelines that use head fairly early, thereby finishing execution in under 0.1 seconds. We refer to each pipeline using its x-axis index (#0–30) in Fig. 10. Average speedup is 6.02×, and the weighted average (with the absolute times as weights) is 5.75×.
Discussion
Most pipelines see significant speedup, except #25–30, which see no speedup because they contain general commands that PaSh cannot parallelize without risking breakage—e.g., awk and sed -d. A Unix expert would notice that some of them can be replaced with Unix-specific commands—e.g., awk "{print \$2, \$0}" | sort -nr, used to sort on the second field, can be replaced with a single sort -nr -k 2 (#26). The targeted expressiveness of the replacement commands can be exploited by PaSh—in this specific case, achieving 8.1× speedup (vs. the original 1.01×).
For all other scripts (#0–24), PaSh's speedup is capped due to a combination of reasons: (i) scripts contain pure commands that are parallelizable but don't scale linearly, such as sort (#5, 6, 7, 8, 9, 19, 20, 21, 23, 24), (ii) scripts are deep pipelines that already exploit task parallelism (#4, 10, 11, 13, 15, 17, 19, 21, 22), or (iii) scripts are not CPU-intensive, resulting in pronounced I/O and constant costs (#3, 4, 11, 12, 14, 16, 17, 18, 22).
Take-aways
PaSh accelerates unmodified pipelines found in the wild; small tweaks can yield further improvements, showing that PaSh-awareness and scripting expertise can improve performance. Furthermore, PaSh does not significantly decelerate non-parallelizable scripts.
6.3 Use Case: NOAA Weather Analysis
We now turn our attention to Fig. 2’s script (§2).
Fig. 10. Unix50 scripts. Speedup (left axis) over sequential execution (right axis) for Unix50 scripts. Parallelism is 16× on 10GB of input data (cf. §6.2). Pipelines are sorted in descending speedup order.
Program
This program is inspired by the central example in "Hadoop: The Definitive Guide" [59, §2], where it exemplifies a realistic analytics pipeline comprising 3 stages: fetch NOAA data (shell), convert them to a Hadoop-friendly format (shell), and calculate the maximum temperature (Hadoop). While the book focuses only on the last stage, PaSh parallelizes the entire pipeline.
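A minimal sketch of such a three-stage pipeline is shown below, assuming NOAA-style yearly archives served over HTTP at a placeholder BASE URL, gzip-compressed records, and the fixed-column temperature format used in the book; the actual benchmark script differs in its URLs, filtering, and formatting.

BASE="http://example.com/noaa"             # placeholder data source
for year in 2010 2011 2012 2013 2014; do   # Stage 1: fetch five years of data
  curl -s "$BASE/$year.gz"
done |
gunzip -c |                                # Stage 2: convert to line-oriented text
cut -c 88-92 |                             # Stage 3: extract the temperature field...
grep -v '+9999' |                          # ...drop missing readings...
sort -rn | head -n 1                       # ...and report the maximum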
Results
The complete pipeline executes in 44m2s for five years (82GB) of data. PaSh with width=16 leads to 2.52× speedup, with different phases seeing different benefits: 2.04× speedup (vs. 33m58s) for all the pre-processing (75% of the total running time) and 12.31× speedup (vs. 10m4s) for computing the maximum.
Discussion
The speedup of the preprocessing phase of the pipeline is bound by the network and I/O costs, since curl downloads 82GB of data. However, the speedup for the processing phase (CPU-bound) is 12.31×, much higher than what would be achieved by parallelizing per year (for a total of five years). Similar to Unix50 (§6.2), we found that large pipelines enable significant freedom in terms of expressiveness.
Take-aways
PaSh can be applied to programs of notable size and complexity to offer significant acceleration. PaSh is also able to extract parallelism from fragments that are not purely compute-intensive, i.e., fragments outside the usual focus of conventional parallelization systems.
6.4 Use Case: Wikipedia Web Indexing
We now apply PaSh to a large web-indexing script.
Program
This script reads a file containing Wikipedia URLs, downloads the pages, extracts the text from HTML, and applies natural-language processing—e.g., trigrams, character conversion, term frequencies—to index it. It totals 34 commands written in multiple programming languages.
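The overall shape of such a script is sketched below; urls.txt, extract-text.js, and stem-words.py are placeholders for the script's actual inputs and its non-POSIX stages (the real script has 34 stages with different names).

cat urls.txt |
  while read -r url; do curl -s "$url"; done |    # download the pages
  node extract-text.js |                          # HTML-to-text conversion (JavaScript)
  tr -cs 'A-Za-z' '\n' |                          # character conversion / tokenization
  tr 'A-Z' 'a-z' |
  python3 stem-words.py |                         # word stemming (Python)
  sort | uniq -c | sort -rn > term-frequencies.txt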
Results
The original script takes 191min to execute on 1% of Wikipedia (1.3GB). With width=16, PaSh brings it down to 15min (12.7×), with the majority of the speedup coming from the HTML-to-text conversion.
Discussion
The original script contains 34 pipeline stages, and thus the sequential version already benefits from task-based parallelism. It also uses several utilities not part of the standard POSIX/GNU set—e.g., its url-extraction is written in JavaScript and its word-stemming is in Python. PaSh can still operate on them, as their parallelizability properties—S for url-extract and word-stem—can be trivially described by annotations. Several other stages are in S, allowing PaSh to achieve benefits by exposing data parallelism.
Take-aways
PaSh operates on programs with (annotated) commands outside the POSIX/GNU subsets and leads to notable speedups, even when the original program features significant task-based parallelism.
6.5 Further Micro-benchmarks
As there are no prior systems directly comparable to PaSh,
we now draw comparisons with two specialized cases that
excel within smaller fragments of PaSh’s proposed domain.
Parallel Sort
First, we compare a sort parallelized by PaSh (S_p) against the same sort invoked with its --parallel flag set (S_g); both sorts use the same buffer size internally [44]. While the --parallel flag is not a general solution, the comparison serves to establish a baseline for PaSh. S_g's parallelism is configured to 2× that of S_p's --width (i.e., the rightmost plot point for S_g is for --parallel=128), to account for PaSh's additional runtime processes.
A few points are worth noting. S_p without eager performs comparably to S_g, and with eager it outperforms S_g (2×); this is because eager adds intermediate buffers that ensure CPU utilization is high. S_g indicates that sort's scalability is inherently limited (i.e., due to sort, not PaSh); this is why all scripts that contain sort (e.g., §6.1–6.4) are capped at 8× speedup. The comparison also shows PaSh's benefits to command developers: a low-effort parallelizability annotation achieves better scalability than a custom flag (and underlying parallel implementation) manually added by developers.
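Concretely, the two configurations compared here correspond roughly to invocations of the following shape, where in.txt is a placeholder input; the actual benchmark fixes buffer sizes and measures both variants across widths.

# S_g: GNU sort's built-in parallelism, given twice PaSh's width.
sort --parallel=128 in.txt > out-sg.txt

# S_p: the same sequential sort, parallelized by PaSh instead.
echo 'sort in.txt > out-sp.txt' > sort-script.sh
./pa.sh -w 64 sort-script.sh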
GNU Parallel
We compare PaSh to parallel (v.20160422), a GNU utility for running other commands in parallel [52], on a small bio-informatics script. Sequential execution takes 554.8s vs. PaSh's 128.5s (4.3×), with most of the overhead coming from a single command—cutadapt.
There are a few possible ways users might attempt to use GNU parallel on this program. They could use it on the bottleneck stage, assuming they can deduce it, bringing execution down to 304.4s (1.8× speedup). Alternatively, they could (incorrectly) sprinkle parallel across the entire program. This would lead to a 3.2× performance improvement but incorrect results with respect to the sequential execution—with 92% of the output showing a difference between sequential and parallel execution. PaSh's conservative program transformations are not applied in program fragments with unclear parallelizability properties.
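For intuition, the "bottleneck-only" option discussed above might look roughly like the following, assuming per-sample input files and a placeholder adapter sequence; the point is that the user must identify the bottleneck stage manually and wire up GNU parallel themselves.

# Manually parallelize only the presumed bottleneck stage (cutadapt):
# one job per input file, each writing its own trimmed output.
ls samples/*.fastq | parallel 'cutadapt -a AACCGGTT {} > {}.trimmed'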
7 Related Work
Existing techniques for exploiting parallelism are not directly comparable to PaSh, because they either require significantly more user effort (see §1 for the distinction between users and developers) or are too specialized, targeting narrow domains or custom programming abstractions.
Parallel Shell Scripting
Utilities exposing parallelism on modern Unixes—e.g., qsub [14], SLURM [60], parallel [52]—are limited to embarrassingly parallel (and short) programs and are predicated upon explicit and careful user invocation: users have to navigate through a vast array of different configurations, flags, and modes of invocation to achieve parallelization without jeopardizing correctness. For example, parallel contains flags such as --skip-first-line, --trim, and --xargs, and introduces (and depends on) other programs with complex semantics, such as ones for SQL querying and CSV parsing. In contrast, PaSh manages to parallelize large scripts correctly with minimal-to-zero user effort.
Several shells [10, 32, 49] add primitives for non-linear pipe topologies—some of which target parallelism. Here too, however, users are expected to manually rewrite scripts to exploit these new primitives, contrary to PaSh.
Recently, Greenberg [17] argued that the shell and its constructs can be seen as a DSL for orchestrating concurrent processes. PaSh's extraction of dataflow regions is based on a similar observation, but its central focus is on achieving data parallelism from these dataflow regions automatically.
Developed concurrently with PaSh, the Process-Offload SHell (POSH) [45] is a shell and runtime that automatically reduces data movement when running shell pipelines on data stored in remote storage à la NFS. POSH accelerates I/O-heavy pipelines that access files in remote filesystems by offloading computation to servers closer to the data. PaSh is a shell-to-shell compiler that parallelizes Unix shell scripts running on a single multi-processor machine by transforming them to DFGs, applying transformations, and then transforming them back to parallel shell scripts augmented with PaSh's runtime primitives that are executed on the user's shell. Both PaSh and POSH observe that Unix commands can have arbitrary behaviors (§2.2), and thus each introduces an annotation language that fits its problem: POSH uses annotations to identify which files are accessed by a pipeline, and thus co-locates commands and their dependencies; PaSh uses annotations to identify whether a command is parallelizable and, if so, how to translate it to a dataflow node. Both systems descend from a lineage of annotation-based black-box transformations [41, 54, 55, 61].
Low-level Parallelization
There exists significant work on automating parallelization at the instruction level, starting with explicit DOALL and DOACROSS annotations [7, 30] and continuing with compilers that attempt to automatically extract parallelism [19, 40]. These efforts operate at a lower level than PaSh (e.g., that of instructions or loops rather than the boundaries of programs that are part of a script), within single-language or single-target environments, and require source modifications.
More recent work focuses on extracting parallelism from domain-specific programming models [13, 15, 28] and interactive parallelization tools [24, 26]. These tools simplify the expression of parallelism, but still require significant user involvement in discovering and exposing parallelism.
Correct Parallelization of Dataflow Graphs
The DFG is a prevalent model in several areas of data processing, including batch- [9, 62] and stream-processing [8, 37]. Systems implementing DFGs often perform optimizations that are correct given subtle assumptions on the dataflow nodes that do not always hold, introducing erroneous behaviors. Recent work [21, 25, 31, 47] attempts to address this issue by performing optimizations only in cases where correctness is preserved, or by testing that applied optimizations preserve the original behavior. PaSh draws inspiration from these efforts, in that it delegates the satisfaction of assumptions to the annotation writers, who are expected to be command developers rather than shell users (§1), ensuring that transformations preserve the behavior of the original dataflow. Its DFG model, however, is different from earlier efforts in that it explicitly captures and manipulates ordering constraints. The constraints are due to the intricacies of the Unix model—e.g., FIFO streams, argument processing, and concatenation operators.
Parallel Userspace Environments
By focusing on simplifying the development of distributed programs, a plethora of environments additionally assist in the construction of parallel software. Such systems [1, 36, 39], languages [27, 48, 58], or system-language hybrids [11, 43, 56] hide many of the challenges of dealing with concurrency as long as developers leverage the provided abstractions—which are strongly coupled to the underlying operating or runtime system. Even shell-oriented efforts such as Plan 9's rc are not backward-compatible with the Unix shell, and often focus primarily on hiding the existence of a network rather than automating parallel processing.
Parallel Frameworks
Several frameworks [2, 6, 12, 16, 51] offer fully automated parallelism as long as special primitives are used—e.g., map-reduce-style primitives for Phoenix [51]. These primitives make strong assumptions about the nature of the computation—e.g., commutative and associative aggregation functions that can be applied on their inputs in any order. By targeting specific classes of computation (viz. PaSh's parallelizability), these primitives are significantly optimized for their target domains. PaSh instead chooses an approach that is better tailored to the shell: it does not require rewriting parts of a shell script using specific parallelization-friendly primitives, but rather lifts arbitrary commands to a parallelization-friendly space using an annotation framework.
Dryad [23] is a distributed system for dataflow graphs. Dryad offers a scripting language, Nebula, that allows using shell commands such as grep or sed in place of individual dataflow nodes. The main difference with PaSh is that in Dryad the programmer needs to explicitly express the dataflow graph, which is then executed in a distributed fashion, whereas PaSh automatically parallelizes a given shell script by producing a parallel script that runs on an unmodified shell of choice.
8 Conclusion
Shell programs are ubiquitous, use blocks written in a plethora of programming languages, and spend a significant fraction of their time interacting with the broader environment to download, extract, and process data—falling outside the focus of conventional parallelization systems. This paper presents PaSh, a system that allows shell users to parallelize shell programs mostly automatically. PaSh can be viewed as (i) a source-to-source compiler that transforms scripts to DFGs, parallelizes them, and transforms them back to scripts, coupled with (ii) a runtime component that addresses several practical challenges related to performance and correctness. PaSh's extensive evaluation over 44 unmodified Unix scripts demonstrates non-trivial speedups (0.89–61.1×, avg: 6.7×).
PaSh's implementation, along with all the example code and benchmarks presented in this paper, is open source and available for download: github.com/andromeda/pash.
Acknowledgments
We want to thank André DeHon, Ben Karel, Caleb Stanford, Thurston Dang, Jean-Sébastien Légaré, Nick Roessler, Sage Gerard, and several open-source contributors. We are grateful to our shepherd, Julia Lawall, for her guidance. This material is based upon work supported by DARPA contracts no. HR00112020013 and no. HR001120C0191, and NSF awards CCF 1763514 and 2008096. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DARPA or NSF.
References
[1]
Amnon Barak and Oren La’adan. 1998. The MOSIX multicomputer
operating system for high performance cluster computing. Future
Generation Computer Systems 13, 4 (1998), 361–372.
[2]
Jonathan C Beard, Peng Li, and Roger D Chamberlain. 2017. RaftLib: A
C++ template library for high performance stream parallel processing.
The International Journal of High Performance Computing Applications
31, 5 (2017), 391–404.
[3]
Jon Bentley. 1985. Programming Pearls: A Spelling Checker. Commun.
ACM 28, 5 (May 1985), 456–462. https://doi.org/10.1145/3532.315102
[4]
Jon Bentley, Don Knuth, and Doug McIlroy. 1986. Programming Pearls:
A Literate Program. Commun. ACM 29, 6 (June 1986), 471–483. https://doi.org/10.1145/5948.315654
[5]
Pawan Bhandari. 2020. Solutions to unixgame.io. https://git.io/Jf2dn
Accessed: 2020-04-14.
[6]
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fata-
halian, Mike Houston, and Pat Hanrahan. 2004. Brook for GPUs:
Stream Computing on Graphics Hardware. ACM Trans. Graph. 23, 3
(2004), 777–786. https://doi.org/10.1145/1015706.1015800
[7]
Michael Burke and Ron Cytron. 1986. Interprocedural Dependence
Analysis and Parallelization. In Proceedings of the 1986 SIGPLAN Sym-
posium on Compiler Construction (SIGPLAN ’86). ACM, New York, NY,
USA, 162–175. https://doi.org/10.1145/12276.13328
[8]
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl,
Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and
Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38 (2015),
28–38.
[9]
Jerey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplied
Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008),
107–113. https://doi.org/10.1145/1327452.1327492
[10]
Tom Du. 1990. Rc-A shell for Plan 9 and Unix systems. AUUGN 12,
1 (1990), 75.
[11]
Je Epstein, Andrew P. Black, and Simon Peyton-Jones. 2011. Towards
Haskell in the Cloud. In Proceedings of the 4th ACM Symposium on
Haskell (Haskell '11). ACM, New York, NY, USA, 118–129. https://doi.org/10.1145/2034675.2034690
[12]
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson,
Pradeep Kumar Gunda, and Jon Currey. 2009. DryadLINQ: A system for
general-purpose distributed data-parallel computing using a high-level
language. Proc. LSDS-IR 8 (2009).
[13]
Matteo Frigo, Charles E Leiserson, and Keith H Randall. 1998. The
implementation of the Cilk-5 multithreaded language. ACM Sigplan
Notices 33, 5 (1998), 212–223.
[14]
Wolfgang Gentzsch. 2001. Sun grid engine: Towards creating a com-
pute power grid. In Proceedings First IEEE/ACM International Sympo-
sium on Cluster Computing and the Grid. IEEE, 35–36.
[15]
Michael I Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S
Meli, Andrew A Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann,
David Maze, et al. 2002. A stream compiler for communication-exposed
architectures. In ACM SIGOPS Operating Systems Review, Vol. 36. ACM,
291–303.
[16]
Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin,
Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann,
David Maze, and Saman Amarasinghe. 2002. A Stream Com-
piler for Communication-Exposed Architectures. In Proceedings of
the 10th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOS X). Associ-
ation for Computing Machinery, New York, NY, USA, 291–303.
https://doi.org/10.1145/605397.605428
[17]
Michael Greenberg. 2018. The POSIX shell is an interactive DSL for
concurrency. hps://cs.pomona.edu/~michael/papers/dsldi2018.pdf.
[18]
The Open Group. 2018. POSIX. https://pubs.opengroup.org/
onlinepubs/9699919799/. [Online; accessed November 22, 2019].
[19]
Mary W Hall, Jennifer M Anderson, Saman P. Amarasinghe, Brian R
Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S Lam. 1996.
Maximizing multiprocessor performance with the SUIF compiler. Com-
puter 29, 12 (1996), 84–89.
[20]
Shivam Handa, Konstantinos Kallas, Nikos Vasilakis, and Martin Ri-
nard. 2020. An Order-aware Dataflow Model for Extracting Shell Script
Parallelism. arXiv preprint arXiv:2012.15422 (2020).
[21]
Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert
Grimm. 2014. A Catalog of Stream Processing Optimizations. ACM
Computing Surveys (CSUR) 46, 4, Article 46 (March 2014), 34 pages.
hps://doi.org/10.1145/2528412
[22]
Lluis Batlle i Rossell. 2016. tsp(1) Linux User’s Manual.
https://vicerveza.homeunix.net/~viric/soft/ts/.
[23]
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis
Fetterly. 2007. Dryad: distributed data-parallel programs from sequen-
tial building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys
European Conference on Computer Systems 2007. 59–72.
[24]
Makoto Ishihara, Hiroki Honda, and Mitsuhisa Sato. 2006. Develop-
ment and implementation of an interactive parallelization assistance
tool for OpenMP: iPat/OMP. IEICE transactions on information and
systems 89, 2 (2006), 399–407.
[25]
Konstantinos Kallas, Filip Niksic, Caleb Stanford, and Rajeev Alur.
2020. DiStream: Dierential Output Testing for Stream Processing
Programs. Proceedings of the ACM on Programming Languages 4,
OOPSLA (2020), 1–29.
[26]
Ken Kennedy, Kathryn S McKinley, and C-W Tseng. 1991. Interactive
parallel programming using the ParaScope Editor. IEEE Transactions
on Parallel and Distributed Systems 2, 3 (1991), 329–341.
[27] Charles Edwin Killian, James W. Anderson, Ryan Braud, Ranjit Jhala,
and Amin M. Vahdat. 2007. Mace: Language Support for Building Dis-
tributed Systems. In Proceedings of the 28th ACM SIGPLAN Conference
on Programming Language Design and Implementation (PLDI ’07). ACM,
New York, NY, USA, 179–188. https://doi.org/10.1145/1250734.1250755
[28]
Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Rama-
narayanan, Kavita Bala, and L Paul Chew. 2007. Optimistic parallelism
requires abstractions. ACM SIGPLAN Notices 42, 6 (2007), 211–222.
[29]
Nokia Bell Labs. 2019. The Unix Game—Solve puzzles using Unix pipes.
hps://unixgame.io/unix50 Accessed: 2020-03-05.
[30]
Amy W. Lim and Monica S. Lam. 1997. Maximizing Parallelism and
Minimizing Synchronization with Affine Transforms. In Proceedings
of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Pro-
gramming Languages (POPL ’97). ACM, New York, NY, USA, 201–214.
hps://doi.org/10.1145/263699.263719
[31]
Konstantinos Mamouras, Caleb Stanford, Rajeev Alur, Zachary G. Ives,
and Val Tannen. 2019. Data-Trace Types for Distributed Stream Pro-
cessing Systems. In Proceedings of the 40th ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI 2019). ACM,
New York, NY, USA, 670–685. https://doi.org/10.1145/3314221.3314580
[32]
Chris McDonald and Trevor I Dix. 1988. Support for graphs of pro-
cesses in a command interpreter. Software: Practice and Experience 18,
10 (1988), 1011–1016.
[33]
Malcolm D McIlroy, Elliot N Pinson, and Berkley A Tague. 1978. UNIX
Time-Sharing System: Foreword. Bell System Technical Journal 57, 6
(1978), 1899–1904.
[34]
Peter M McIlroy, Keith Bostic, and M Douglas McIlroy. 1993. Engi-
neering radix sort. Computing systems 6, 1 (1993), 5–27.
[35]
Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability!
But at what COST?. In 15th Workshop on Hot Topics in Operating
Systems (HotOS XV).
[36]
Sape J Mullender, Guido Van Rossum, AS Tanenbaum, Robbert Van Re-
nesse, and Hans Van Staveren. 1990. Amoeba: A distributed oper-
ating system for the 1990s. Computer 23, 5 (1990), 44–53. https://www.cs.cornell.edu/home/rvr/papers/Amoeba1990s.pdf
[37]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul
Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System.
In Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles (SOSP ’13). ACM, New York, NY, USA, 439–455.
hps://doi.org/10.1145/2517349.2522738
[38]
National Oceanic and Atmospheric Administration. 2017. National
Climatic Data Center. https://www.ncdc.noaa.gov/.
[39]
John K Ousterhout, Andrew R. Cherenson, Fred Douglis, Michael N.
Nelson, and Brent B. Welch. 1988. The Sprite network operating
system. Computer 21, 2 (1988), 23–36. http://www.research.ibm.com/people/f/fdouglis/papers/sprite.pdf
[40]
David A Padua, Rudolf Eigenmann, Jay Hoeflinger, Paul Petersen,
Peng Tu, Stephen Weatherford, and Keith Faigin. 1993. Polaris: A
new-generation parallelizing compiler for MPPs. In CSRD Rept. No.
1306. Univ. of Illinois at Urbana-Champaign.
[41] Shoumik Palkar and Matei Zaharia. 2019. Optimizing Data-intensive
Computations in Existing Libraries with Split Annotations. In Pro-
ceedings of the 27th ACM Symposium on Operating Systems Principles
(SOSP '19). ACM, New York, NY, USA, 291–305. https://doi.org/10.1145/3341301.3359652
[42]
Davide Pasetto and Albert Akhriev. 2011. A comparative study of
parallel sort algorithms. In Proceedings of the ACM international con-
ference companion on Object oriented programming systems languages
and applications companion. 203–204.
[43]
Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, et al. 1990.
Plan 9 from Bell Labs. In Proceedings of the summer 1990 UKUUG
Conference. 1–9. http://css.csail.mit.edu/6.824/2014/papers/plan9.pdf
[44]
Pixelbeat. 2015. Answer to: Sort --parallel isn't parallelizing. https://superuser.com/a/938634 Accessed: 2020-04-14.
[45]
Deepti Raghavan, Sadjad Fouladi, Philip Levis, and Matei Zaharia.
2020. POSH: A Data-Aware Shell. In 2020 USENIX Annual Technical
Conference (USENIX ATC 20). 617–631.
[46]
Dennis M. Ritchie and Ken Thompson. 1973. The UNIX Time-sharing
System. SIGOPS Oper. Syst. Rev. 7, 4 (Jan. 1973), 27–. https://doi.org/10.1145/957195.808045
[47]
Scott Schneider, Martin Hirzel, Buğra Gedik, and Kun-Lung Wu. 2015.
Safe Data Parallelism for General Streaming. IEEE Trans. Comput. 64,
2 (Feb 2015), 504–517. https://doi.org/10.1109/TC.2013.221
[48]
Peter Sewell, James J. Leifer, Keith Wansbrough, Francesco Zappa
Nardelli, Mair Allen-Williams, Pierre Habouzit, and Viktor Vafeiadis.
2005. Acute: High-level Programming Language Design for Distributed
Computation. In Proceedings of the Tenth ACM SIGPLAN International
Conference on Functional Programming (ICFP ’05). ACM, New York,
NY, USA, 15–26. https://doi.org/10.1145/1086365.1086370
[49]
Diomidis Spinellis and Marios Fragkoulis. 2017. Extending Unix
Pipelines to DAGs. IEEE Trans. Comput. 66, 9 (2017), 1547–1561.
[50]
Richard M Stallman and Roland McGrath. 1991. GNU Make—A Pro-
gram for Directing Recompilation. https://www.gnu.org/software/make/manual/make.pdf.
[51]
Justin Talbot, Richard M. Yoo, and Christos Kozyrakis. 2011.
Phoenix++: Modular MapReduce for Shared-Memory Systems. In Pro-
ceedings of the Second International Workshop on MapReduce and Its
Applications (MapReduce ’11). Association for Computing Machinery,
New York, NY, USA, 9–16. https://doi.org/10.1145/1996092.1996095
[52]
Ole Tange. 2011. GNU Parallel—The Command-Line Power Tool. ;login:
The USENIX Magazine 36, 1 (Feb 2011), 42–47. https://doi.org/10.5281/zenodo.16303
[53]
Dave Taylor. 2004. Wicked Cool Shell Scripts: 101 Scripts for Linux, Mac
OS X, and Unix Systems. No Starch Press.
[54]
Nikos Vasilakis, Ben Karel, Yash Palkhiwala, John Sonchack, André
DeHon, and Jonathan M. Smith. 2019. Ignis: Scaling Distribution-
oblivious Systems with Light-touch Distribution. In Proceedings of the
40th ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI 2019). ACM, New York, NY, USA, 1010–1026.
hps://doi.org/10.1145/3314221.3314586
[55]
Nikos Vasilakis, Ben Karel, Nick Roessler, Nathan Dautenhahn, André
DeHon, and Jonathan M. Smith. 2018. BreakApp: Automated, Flexi-
ble Application Compartmentalization. In Networked and Distributed
Systems Security (NDSS'18). https://doi.org/10.14722/ndss.2018.23131
[56]
Nikos Vasilakis, Ben Karel, and Jonathan M. Smith. 2015. From Lone
Dwarfs to Giant Superclusters: Rethinking Operating System Abstrac-
tions for the Cloud. In Proceedings of the 15th USENIX Conference on
Hot Topics in Operating Systems (HOTOS’15). USENIX Association,
Berkeley, CA, USA, 15–15. http://dl.acm.org/citation.cfm?id=2831090.2831105
[57]
Nikos Vasilakis, Jiasi Shen, and Martin Rinard. 2020. Automatic Syn-
thesis of Parallel and Distributed Unix Commands with KumQuat.
arXiv preprint arXiv:2012.15443 (2020).
[58]
Robert Virding, Claes Wikström, and Mike Williams. 1996. Concurrent
Programming in ERLANG (2nd Ed.). Prentice Hall International (UK)
Ltd., Hertfordshire, UK, UK.
[59]
Tom White. 2015. Hadoop: The Definitive Guide (4th ed.). O'Reilly
Media, Inc.
[60]
Andy B Yoo, Morris A Jette, and Mark Grondona. 2003. Slurm: Simple
linux utility for resource management. In Workshop on Job Scheduling
Strategies for Parallel Processing. Springer, 44–60.
[61]
Gina Yuan, Shoumik Palkar, Deepak Narayanan, and Matei Zaharia.
2020. Ooad Annotations: Bringing Heterogeneous Computing to
Existing Libraries and Workloads. In 2020 USENIX Annual Technical
Conference (USENIX ATC 20). USENIX Association, 293–306. https://www.usenix.org/conference/atc20/presentation/yuan
[62]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and
Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Ab-
straction for In-memory Cluster Computing. In Proceedings of the
9th USENIX Conference on Networked Systems Design and Implemen-
tation (NSDI’12). USENIX Association, Berkeley, CA, USA, 15–28.
hp://dl.acm.org/citation.cfm?id=2228298.2228301
A Annotation for the Command cut
The code below shows the full annotation for cut.
{ "command": "cut",
"cases": [
{ "predicate": {
"operator": "or",
"operands": [
{ "operator": "val_opt_eq",
"operands": [ "-d", "\n" ] },
{ "operator": "exists",
"operands": [ "-z" ] }
]
},
"class": "pure",
"inputs": [ "args[:]" ],
"outputs": [ "stdout" ]
},
{ "predicate": "default",
"class": "stateless",
"inputs": [ "args[:]" ],
"outputs": [ "stdout" ]
}
],
"options": [ "stdin-hyphen", "empty-args-stdin" ],
"short-long": [
{ "short": "-d", "long": "--delimiter" },
{ "short": "-z", "long": "--zero-terminated" }
]
}
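For intuition, the two cases above roughly correspond to invocations like the following (illustrative, not from the paper): in the default case cut emits one output line per input line, so it is stateless; when records are NUL-delimited (or the delimiter is a newline), splitting the input on line boundaries is no longer safe, so cut is only pure.

# Stateless (default case): one output line per input line.
cut -d: -f1 /etc/passwd

# Pure (matches the predicate): NUL-delimited records mean that naive
# line-based splitting of the input could break records apart.
printf 'a\tb\0c\td\0' | cut -z -f1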
B Artifact Appendix
Summary
The artifact consists of several parts: (i) a mirror of PaSh's GitHub repository (git commit e5f56ec, available permanently in branch eurosys-2021-aec-frozen) including annotations, the parallelizing compiler, and the runtime primitives presented in this paper; (ii) instructions for pulling code and experiments, building from source, preparing the environment, and running the experiments; (iii) a 20-minute video walk-through of the entire artifact; (iv) instructions for directly pulling a pre-built Docker container and building a Docker image from scratch; and (v) scripts, descriptions, and instructions to run the experiments (automatically or manually) to reproduce the graphs and results presented in the paper.
Tab. 3. Major experiments presented in the paper. There are four major experiments presented in the paper: (i) Common Unix one-liners, (ii) Unix50 from Bell Labs, (iii) NOAA Weather Analysis, and (iv) Wikipedia Web Indexing.
Experiment                 Section  Location
Common Unix one-liners     §6.1     https://git.io/JYi9m
Unix50 from Bell Labs      §6.2     https://git.io/JYi9n
NOAA Weather Analysis      §6.3     https://git.io/JYi9C
Wikipedia Web Indexing     §6.4     https://git.io/JYi98
Codebase information
Below is a summary of key infor-
mation about PaSh’s repository:
Repository: https://github.com/andromeda/pash
License: MIT
Stats: 2,278 commits, from 14 contributors
Artifact requirements
Below is a summary of requirements for running PaSh and its evaluation experiments:
CPU: a modern multi-processor, to show performance results (the more CPUs, the merrier)
Disk: about 10GB for small-input (quick) evaluation, about 100GB+ for full evaluation
Software: Python 3.5+, OCaml 4.05.0, Bash 5+, and GNU Coreutils (details below)
Time: about 30min for small-input, about 24h for full evaluation
Dependencies
The artifact depends on several packages; on Ubuntu 18.04: libtool, m4, automake, opam, pkg-config, libffi-dev, python3, python3-pip, wamerican-insane, bc, bsdmainutils, curl, and wget. PaSh and its experimental and plotting infrastructure make use of the following Python packages: jsonpickle, PyYAML, numpy, matplotlib. Experiments and workloads have their own dependencies—e.g., pandoc-2.2.1, nodejs, and npm (Web indexing), or p7zip-full (Wikipedia dataset).
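A minimal sketch of installing these dependencies on Ubuntu 18.04, assuming apt and pip3 are available (package names as listed above; versions may need pinning):

sudo apt-get update
sudo apt-get install -y libtool m4 automake opam pkg-config libffi-dev \
  python3 python3-pip wamerican-insane bc bsdmainutils curl wget
pip3 install jsonpickle PyYAML numpy matplotlib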
Access PaSh is available via several means, including:
Git: git clone git@github.com:andromeda/pash.git
Docker: curl img.pash.ndr.md | docker load
HTTP: wget pkg.pash.ndr.md
Shell: curl -s up.pash.ndr.md | sh
Code Structure
This repo hosts the core PaSh develop-
ment. The artifact’s directory structure is as follows:
annotations: Parallelizability study and associated com-
mand annotations.
compiler: Shell-dataflow translations and associated par-
allelization transformations.
docs: Design documents, tutorials, installation instruc-
tions, etc.
evaluation: Shell pipelines and example scripts used in the
evaluation of PaSh.
runtime: Runtime component—e.g., eager, split, and asso-
ciated aggregators.
scripts: Scripts related to installation, continuous integra-
tion, deployment, and testing.
Calling PaSh
To parallelize a script hello-world.sh with a parallelization degree of 2, from the top-level directory of the repository run:
./pa.sh hello-world.sh
PaSh will compile and execute hello-world.sh on the fly.
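For other degrees of parallelism, or to inspect what PaSh emits, the flags documented below can be combined (hello-world.sh and in.txt are placeholders):

./pa.sh -w 8 hello-world.sh                        # 8-way data parallelism
./pa.sh -w 8 -p hello-world.sh                     # also print the emitted parallel script
./pa.sh -w 8 -t hello-world.sh                     # report the time taken by every step
./pa.sh -w 4 -c 'cat in.txt | tr A-Z a-z | sort'   # parallelize a one-liner string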
Tutorial
To go through a longer tutorial, see docs/tutorial.
Available subcommands
Run ./pa.sh --help to get more information about the available PaSh subcommands:
Usage: pa.sh [-h] [--preprocess_only] [--output_preprocessed]
[-c COMMAND] [-w WIDTH] [--no_optimize]
[--dry_run_compiler] [--assert_compiler_success]
[-t] [-p] [-d DEBUG] [--log_file LOG_FILE]
[--no_eager] [--speculation {no_spec,quick_abort}]
[--termination {clean_up_graph,drain_stream}]
[--config_path CONFIG_PATH] [-v] [input]
Positional arguments:
input The script to be compiled and executed.
optional arguments:
-h, --help
Show this help message and exit.
--preprocess_only
Pre-process (not execute) input script.
--output_preprocessed
Output the preprocessed script.
-c COMMAND, --command COMMAND
Evaluate the following COMMAND as a
script, rather than a file.
-w WIDTH, --width WIDTH
Set degree of data-parallelism.
--no_optimize
Not apply transformations over the DFG.
--dry_run_compiler
Not execute the compiled script, even
if the compiler succeeded.
--assert_compiler_success
Assert that the compiler succeeded
(used to make tests more robust).
-t, --output_time
Output the time it took for every step.
-p, --output_optimized
Output the parallel script for inspection.
-d DEBUG, --debug DEBUG
Configure debug level; defaults to 0.
--log_file LOG_FILE
Location of log file; defaults to stderr.
--no_eager
Disable eager nodes before merging nodes.
--termination {clean_up_graph,drain_stream}
Determine the termination behavior of the
DFG. Defaults to cleanup after the last
process dies, but can drain all streams
until depletion.
--config_path CONFIG_PATH
Determine the config file path, by
default 'PASH_TOP/compiler/config.yaml'.
-v, --version Show program's version number and exit