Tools, glue, scripts, and automation on Unix - Old school stuff

venam
Transcript:
# Tools, glue, scripts, and automation on Unix


## Introduction


Today we take for granted the concept of software as a tool, but it
didn't always exist: mini-scripts, interoperable programs, small
utilities for specific tasks, etc..

This is what we're going to discuss: where they come from, their history,
and a bit more.


## What are tools


There is a type of program that we write for a special run-time
environment that automates the execution of tasks that could alternatively
be executed one-by-one manually.

This is what we call scripts.

There are other definitions of what scripts are but let's stick to this
one for this podcast.

Usually scripts are interpreted, hence the "run-time" above, rather than
compiled.

For example, the shell is a scripting environment. Perl also, and other
dynamic high level languages, could be considered scripting environments.

> Perl might be a bad example since it actually uses a bytecode compiler.

The usual idea of scripting is to combine smaller repetitive tasks to
create more complex tasks and automate them, script them.

However one problem that emerges from this technique is that programs
may still need to communicate with one another. It becomes mandatory to
insert a middle-man that will reshape the output of one program so that
it's compatible with the input of another.

This code or software that is added is the glue code or glue language
and its sole purpose is to connect software components together.

The concept of glue code is seen everywhere, from wrappers like decorators
in some languages, to adapter patterns in object oriented programming,
to the pipelines on the shell.

Glue code is useful for a lot of reasons: it's quick to write, it keeps
the parts on both sides separate and thus easier to maintain,
it creates interoperability between components, it's simpler than
breaking apart the separate pieces or digging into their code to learn
how they work, and it's even viable when you don't have access to the
source of the software.

On the other hand, it's considered somewhat of a duct-tape style of
programming. A quick fix that might not last. It also carries the cost of
performance penalties for the adaptation code and the transfer of this
info, which is usually some sort of input/output mechanism.

You get the gist of it.

Let's give some notable examples, be they text processing, macro
languages, preprocessors, etc.. Mostly text processing utilities.

* roff/nroff/troff/ditroff/groff
* dc/bc
* ed/ex
* g/re/p
* sed
* diff
* The C preprocessor
* m4
* vi
* Makefile
* Awk

And much more.

You've definitely heard of those names before, they're very Unixy. All of
them are used on the shell, in scripts to automate tasks. But it wasn't
always that way; we may take for granted these programs, we may take
for granted the idea of software as tools.

Where did the idea come from?

Pipes were implemented in UNIX in 1972 at the insistence of Doug
McIlroy (this was actually the second attempt, as the note below
explains). The idea is simple but powerful: to allow the standard output
of the program on the left side of the pipe to be the standard input of
the program on the right side of it.

> That was the second attempt of implementing pipes. Their first
> approach - which only lasted a couple of months - used "<" and ">"
> to define the direction.

This sounds more practical than using other means of inter-process
communication such as databases, sockets, or simply files.
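
As a tiny sketch of the idea (file.txt is just a placeholder name):

```
# count the unique words in a file by chaining small single-purpose tools:
# tr puts one word per line, sort groups duplicates, uniq drops them, wc counts
tr -s ' ' '\n' < file.txt | sort | uniq | wc -l
```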

The invention of pipes left room for new ways of thinking, but that shift
didn't really happen before a certain event.

Doug McIlroy, the instigator of pipes, still warm and happy about them,
was working on a text-to-speech program and wanted to manipulate large
dictionaries. The ed editor didn't fit the task properly, it was too
cumbersome. With the fresh idea of pipes in mind, he asked Ken Thompson
if he could extract the regex feature from ed and make it standalone,
capable of accepting input and output so that he could use it in a
pipeline. That was the creation of grep.

It became obvious to McIlroy, after this small but meaningful event,
that there was a useful pattern to be extracted, the one of the "software
tool". An idea that was later better articulated in the book "Software
Tools" by Kernighan and Plauger.

This insightful idea became a guiding principle to build programming
environments.

The software methodology that followed is the well known one, the so
often heard "each tool/utility achieving its end and role internally",
the "do one thing and do it well", the "single functionality program",
the "program as generic as possible, accepting stdin/stdout".

This is where the concept of software as tools comes from. This was
truly a "new style of computing and thinking on how to attack problems",
from a bottom-up approach. With a bunch of tools we link/combine them
together to create a piece of software. Small parts coupled to build a
whole and not a big monolithic block of software.

If a feature is part of a particular environment but is useful to many
other developers, then why not separate it out as its own utility, making
it fun and helpful for other programmers to use.

Not only that, there were now tools to facilitate the creation of tools,
such as yacc and lex.

This general concept is a trademark feature of Unix.

> Not anymore: There are blown-up Unix tools (tried Solaris
> recently?) today while the "KISS" thing had been ported to VMS and other
> operating systems soon after its invention.

So let's dive into the history and concept of some of the most popular of
those tools that are used as glue code in scripts or more.


## roff/nroff/troff/ditroff/groff


> A utility without a manual is of no utility at all.

roff apparently was the first Unix text-formatting computer program,
and the first application to run on the first machine specifically
purchased to run UNIX.

But it has many predecessors; to understand it we have to go back in
history.

It was a Unix version of the runoff text-formatting program from Multics,
which was a descendant of RUNOFF for CTSS (Compatible Time Sharing System,
a project of MIT, the Massachusetts Institute of Technology) in 1964.

The RUNOFF of the CTSS was one of the first, if not the first, computer
text formatting programs.

> In the modern sense of the word, yes.

How did RUNOFF work and what did it do?

It was composed of two programs, TYPSET, a document editor, and RUNOFF,
the output processor.

The role of a typesetter is to do typesetting, which is arranging text on
a page, changing the font, the colors, etc.. Kind of like what a word
processor does. TeX and LaTeX are typesetters too; markdown includes some
of the features of typesetting though it lacks a lot of them, like
pagination, spacing, and text modification.

Generically you have a specific syntax or way of doing things that applies
something over the text, like spacing, bullets, colors, style, slant,
size, aspect, anything.

That's the idea.
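
To make this concrete, here is a minimal sketch of what roff-family input
looks like, using the man macro package as it ships with groff today (the
file name is only illustrative):

```
.\" example.1 - a tiny man-style roff document
.TH EXAMPLE 1
.SH NAME
example \- show what roff requests look like
.SH DESCRIPTION
Requests start with a dot in the first column; everything else is text.
.B This sentence is set in bold.
```

Something like `groff -man -Tascii example.1 | less` would render it, much
like the nroff invocation for manual pages shown further down.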

RUNOFF supported pagination, headers, and text justification.

The name is said to have come from the phrase "I'll run off a copy" or
"run off a document" which means to print it out.

This explains why it's such an important piece of software: remember the
move from physical typewriters to typewriters connected to computers, to
glass teletypes, to raster graphics.

Typewriters are made to edit text and people wanted to keep that initial
feature. Right?

Multics had its own version, runoff, that was the successor of RUNOFF
for CTSS. roff (which already existed and was ported to Unix) and nroff
are the successors of the Multics version.

> roff already existed in a Multics version. (see history tree in the notes)

One of its important usage was for manual pages in Unix.

However, the pages were only written with troff since the 4th Unix
edition; some accounts say that versions 1 through 3 used the older
roff. Some sources say that for the first 2 years Unix didn't have
any documentation at all.

According to jkl:

> Not one single manpage on Unix was written in runoff as runoff never
> existed on Unix. There was not even a roff command on PDP-7 UNIX. Before
> he did troff, Joe Ossanna ("jfo") - double-N - implemented his own
> version of McIlroy's BCPL version of Robert Morris's implementation of
> RUNOFF named "roff" into V1 UNIX: man(1).
> http://www.tuhs.org/cgi-bin/utree.pl?fil...an1/roff.1

In all cases, they started writing roff documentation because Doug McIlroy
insisted on it. And to this day this is regarded as one of the great
advancements, being able to have a manual for everything on the system. A
lack of documentation on Unix is even seen as a lack of quality.

```
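# format a man page source with the man macros, ASCII output, and page it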
nroff -man Documents/programming/2bwm/2bwm.man -Tascii | less
```

The Unix version, in 1970, was a port of the BCPL version of runoff made
for Multics into the PDP-7 assembly language.

When roff started being used for manpages it instantly became popular
within the Bell Labs patent department. It was the first Unix application
used by people other than the developers themselves, real usage.

It was flexible, easily modifiable, and had real-world usage. So
adding features to it was an important factor in the adoption of Unix;
it filled the word-processing needs (of the patent department at Bell
Labs here). Altogether this gave credibility to the project and secured
funding for the purchase of a PDP-11/45.

When they got the PDP-11, in late 1970, they started transliterating
it to PDP-11 assembly; this version was done in 1971.

Some say that runoff was the hook that justified the cost of getting
the PDP-11.

There's a long history of passing the code/torch, adding more
and more features to runoff, polishing the code, making it more
interoperable, making it work on all machines, all output types,
etc.. I've linked in the show notes a tree hierarchy of the oh so many
runoff-like programs: <https://manpages.bsd.lv/history.html>.

This is a noble piece of software.

In 1972 for example, Joseph Ossanna took over the PDP-11 roff and built a
version for the Graphic Systems CAT phototypesetter, which the lab had
just acquired. He called it troff, t-roff, for "Times roff" (the Times
font family was the most used), or as others say for "typesetter roff". A
typesetter is sort of the ancestor of the modern printer, don't sweat the
definition: it takes some specific input, typesets the text, and prints it.

troff was basically an add-on over nroff, with some #ifdefs to produce
and remove specific features for the new typesetter.

Preprocessing here means transforming parts of the document so that they
are compatible with the target device.

Ossanna's troff was not very extensible, written in PDP-11 assembly, and
produced output specific to the CAT phototypesetter only. He started
rewriting it in C to make it support multiple vendors, however he died
of a heart attack before finishing the work (1977).

Brian Kernighan took over the task and called it ditroff, for device
independent troff. This was in 1979.

On and on, several new pre- and post-processors appeared: for graphs,
pictures, references, etc..

This went on.

In 1989, James Clark implemented a GNU version of ditroff, called groff,
which was released in 1990.

It included many features, postprocessors for character devices,
postscript, TeX DVI, X Windows (as in a frontend to facilitate the
construction of pipelines), etc..

runoff/roff was an important piece in the Unix arsenal, even though
it might not look like software that would matter so much to us
now.


## dc/bc


Let's move to other well known tools, dc and bc.

dc stands for desk calculator.

It's the oldest surviving Unix language. It was the first program to be
run on the newly acquired PDP-11, the one that they secured funding for
because of the runoff success.

> Depends on how you define a "language". Wouldn't that be B? dc is
> the oldest surviving interpreted language though.

> According to McIlroy's notes (p. 10), it was the first language to
> be run there. I could not find a source about whether it was the first
> program, but Ritchie suggested that it was, in fact, "one of the earliest
> programs to run on the PDP-11". As they had received their PDP-11/20 in
> 1970 and Unix ran on it in the same year, I would want to assume that
> dc predates 1971. Ken Thompson said that they already had a version of
> it on the PDP-7.
> http://www.cs.dartmouth.edu/~doug/reader.pdf
> http://webarchive.loc.gov/all/2010050623.../hist.html

The first version was written in B, predating the C language, by Robert
Morris and Lorinda Cherry, who are also the authors of bc, which we'll
tackle in a bit. I'm not sure at what time it was created but it was
present in the Unix 1st edition, so 1971 or before.

So what is dc all about?

It's a calculator language that uses reverse-polish notation and has
arbitrary precision arithmetic.

And what does that mean?

dc is first of all a mini-language: it has storage in registers, usually
single-letter variables, it has operations, basic conditional
expressions, and some versions even have support for macros.

It's not your average general purpose programming language but it's a
domain specific language, it's made specifically as a calculator.

A calculator of arbitrary precision.

What does arbitrary precision mean?

It's something we take for granted today: the way you can control the
number of fractional digits, the scale, aka the number of digits following
the decimal point.

However by default it's set to 0, so setting this at the start of the
program is a good idea.

So, the last part, the one we didn't discuss yet, is the particularity
of this calculator: that it uses RPN, the reverse-polish notation. What
is it?

RPN, reverse polish notation, is a method of expressing a mathematical
expression by putting the operands before the operator.

As in: `4 5 +` would mean "keep 4 and 5 on the side then apply + to them".

This is also called a postfix notation in contrast with the infix
notation, the one we commonly use where the operators are in between
the operands.

The advantage is that programmatically it is easier to implement by
using stacks. We push operands on the stack and when we see operators
we take as many operands from the top of the stack as is needed, compute
the result, and push it back on the top of the stack.

However, as you would've guessed, it's not as intuitive to use.

So here comes bc.

bc stands for basic or bench calculator and it's the infix-notation
version of dc. It is more user-friendly, with a syntax resembling the
C programming language.

It was introduced in 1975, in Unix V6, also by Robert Morris and Lorinda
Cherry, the same authors as dc. It was implemented as a wrapper/frontend
over dc: a simple, hundred-line compiler written in Yacc converted the
bc infix syntax into dc's postfix notation and then piped the results
through dc.

The conditions, the variables, and all the rest are easier to write
in bc.

Later on, in 1991, POSIX standardized bc. There exist today two main
implementations of the standard: the front-end version implemented in
Plan 9 and the GNU bc version. Though I'm sure there are others.

The GNU bc version, released in 1991, is no longer a front-end to
dc and has many extensions and new features, like variables, arrays,
and function names no longer being limited to one character only.

So let's discuss exactly that, the usage of both.

dc is quite straightforward when it comes to usage.

You type the RPN directly and the terms get added to a stack in the
program. To print the result, aka the value on the top of the stack,
you can type 'p', and to quit you can use 'q'. The interface is ed-like.

dc also offers one letter variables called registers. They are a second
place of storage other than the stack of the RPN. To store the value
that is on top of the stack in the 'c' register you can use 'sc' and to
retrieve it, to put it on the top of the stack, you can use 'lc'.
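
Here is a small sketch of a GNU dc session (the # comments are a GNU
extension; traditional dc has no comment syntax):

```
$ dc
4 5 + p        # push 4 and 5, add them, print the top of the stack
9
2 k            # pop 2 and use it as the scale (fractional digits)
3 sc           # pop 3 and store it in register c
lc 7 / p       # load c back, divide it by 7, print the result
.42
q
```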

As for bc, if you start it, it'll be in interactive mode like dc; you
type your calculation the way you'd normally write it in C: `(1 + 4) *
2` and it'll output 10 directly. No need for a 'p' command if you don't
assign the value to a variable.

bc has conditions and loops too, with a syntax similar to C's; dc has
conditions but they're kind of ugly to write once the expression becomes
big enough.
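
For instance, a small sketch of a GNU bc session (-q just hides the GNU
startup banner; an expression on its own line prints its value):

```
$ bc -q
scale = 4
for (i = 1; i <= 3; i++) 2 ^ i   /* C-style loop; prints 2, 4, 8 */
2
4
8
if (1 < 2) 10 / 3                /* C-style condition */
3.3333
quit
```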

As we said, the precision of both calculators is 0 by default. To set it
in dc you push the scale you want onto the stack and pop it using the 'k'
command, which will set it as the scale. For bc you assign the scale to
the scale variable, or you can start bc with the '-l' flag, which will set
the scale to 20 and will also enable some mathematical
functions in the language. sqrt(), square root, is POSIX bc's only
built-in math function otherwise.

With -l you get:

s(x) sine
c(x) cosine
a(x) arctangent
l(x) natural logarithm
e(x) exponential function
j(n,x) bessel function

Example:

```
$ bc -lq
scale=10000
4*a(1) # The atan of 1 is 45 degrees, which is pi/4 in radians.
# This may take several minutes to calculate.
```

There are a lot more features and you can dig more into those languages.
Let's just mention that there are some subtle differences between C
syntax and bc:

()[]{}>=< etc.. act exactly like the C operators.
% is like C's but depends on the scale variable.
^ and ^= resemble C's XOR but are in fact exponentiation.

Now there's an interesting back story I'd like to mention...

Robert Morris is also an expert cryptographer who later worked for the
NSA. His son Robert Tappan Morris is the author of the Morris worm,
the first worm on the internet, and was the first person convicted under
the then-new Computer Fraud and Abuse Act. He also founded a company
with Paul Graham, of Y Combinator, which you might have heard of through
Hacker News.

The world is small, isn't it. And I find this ironic considering that for
some time dc was used as a memento of RSA encryption and the controversy
around crypto export in the USA and Canada. You can find some links
about it in the show notes.

Compute: g ^ e % m
gen ^ exp % mod
```
#!/usr/bin/perl -- -export-a-crypto-system-sig Diffie-Hellman-2-lines
($g,$e,$m)=@ARGV,$m||die"$0 gen exp mod\n";
print`echo "16dio1[d2%Sa2/d0<X+d*La1=z\U$m%0]SX$e"[$g*]\EszlXx+p|dc`
```
<http://online.offshore.com.ai/arms-trafficker/>
<http://www.cypherspace.org/rsa/>
<http://fringe.davesource.com/Fringe/Crypt/RSA/Algorithm.html>


## ed/ex


Another classic is ed.

ed is a line editor, aka the "standard editor". It was one of the first
parts of the Unix OS to be developed, in August 1969. It is also part
of the POSIX standard as one of the essential tools.

It was developed early on by Ken Thompson on the PDP-7 as a mandatory
element of the OS, it being the editor that goes along with the assembler
and the shell.

Many features came from qed, a piece of software that originated at
Berkeley, the university Thompson studied at, which made him very
familiar with it. He even reimplemented it on the CTSS and also on
Multics with a BCPL version.

His big addition to the mix was regular expressions, which is also a
big feature of ed.

QED stands for "quick editor" and it is also a line-oriented text
editor. It was developed at Berkeley in 1965 for the Timesharing System
running on the SDS 940.

QED was not influential for Thompson alone, it also influenced
Dennis Ritchie and Brian Kernighan and the whole team; they wrote the
QED manuals used at Bell Labs.

So they were well familiar with qed.

Dennis Ritchie later on produced a version of ed that Doug McIlroy
described as the "definitive" ed. I can't find much about it but I think
it was a rewrite in C.

> There were quite some "eds" involved.
> http://web.mit.edu/kolya/misc/txt/editors

So why the line-oriented text editor, what does that mean?

Line-oriented means a line-by-line editor: actions apply to lines and
the output is very succinct. This is in contrast with editors such as
vim or emacs that are more visual.

To understand this decision we have to realize the type of machines they
were using to interact with the OS: teletypewriters. If you've listened to
the podcast about terminals you might have a clue to what they are:
very simple terminals that consist of a typewriter mechanism, a keyboard,
and paper that is printed on.

This kind of interaction is slow, so visual, interactive editors are not
especially good with them. But it is fine for line-by-line editing
and scripting.

So, ed usage is very linear, as you would've imagined, but what does it
actually look like?

You specify a file to ed on the command line so that it opens it,
and then you're faced with a prompt, like the shell: you give it
commands to execute on lines and some of the commands return output on
the screen directly.

You don't see the entire file as you are editing it. You specify
a range of lines, comma separated and an operation to perform over
them. For example "1,5l" will display lines 1 through 5, and ",l" will
print the whole file.

If you are familiar with vim then this shouldn't be surprising, because
most of it was carried over through ex and vi, which it inspired; we'll
talk about those in another part.

Here are some of the commands:

```
l to print, with a range before like 1,3 or just , for the whole file
i for insert
a for append
w for write
q for quit
1 to go to the first line
$ to go to the last line
d to delete
g/re/command/ to run a command on every line matching re
s/something/somethingelse/g
```

As you could've guessed, ed has become famous, or rather infamous, for its
terseness; it has even been called the "most user-hostile editor ever
created". Probably because by today's standards the lack of visual
feedback is regarded as backward-minded.

For example, the message that ed will produce in case of error, or
when it wants to make sure the user wishes to quit without saving, is
"?". Some older versions didn't even ask for confirmation when quitting.

ed doesn't do anything unless it's requested to do it or display it.

So why should you use ed, and where, today in a world of editors with
visual feedback?

* Available on essentially all Unix systems (and mandatory on systems
conforming to the Single Unix Specification).

* It's a modal editor supporting command mode, text mode, and viewing
mode, with support for regular expressions

* Powerful automation can be achieved by feeding commands from standard
input, in scripts for example, as sketched below.
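
For example, here's a hedged sketch of that kind of non-interactive use
(the file name is just a placeholder):

```
# replace every "teh" with "the" throughout draft.txt, then write and quit;
# -s keeps ed quiet by suppressing the byte counts it normally prints
printf '%s\n' ',s/teh/the/g' 'w' 'q' | ed -s draft.txt
```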

The ED style of interface, the "ed pattern", or maybe we have to put it
as the QED pattern, has inspired many more descendants.

Many other programs use this type of interface, for example gdb.

The pattern sometimes comes with a downside, it's not always as easily
scriptable.

ex, the extended ed, is also a line editor, written by Bill Joy in 1976
as a more friendly version of ed, or more precisely a modified version
of George Coulouris' improved ed called 'em', which took advantage of
video terminals, which Bell Labs didn't have.

Coulouris considered ed to be way too hard to use, only suitable for
"immortals" as he put it, thus he modified it and created the 'em'
editor, which was specifically designed for display terminals and was
a single-line-at-a-time visual editor. It was arguably one of the
first programs to extensively use the "raw terminal input mode",
bypassing the terminal line discipline; check the podcast about terminals
to know more about this topic. Anyway, Bill Joy liked it and modified his
own version further to create ex.

Then a full-screen visual interface was added to ex which turned it into
the vi text editor. Thus ex is a part or mode of vi.

ed also influenced the creation of grep and sed, we'll talk about them
separately later.

So it is more or less true that ed doesn't have much practical use today
as an interactive program, and that most of its cool features are present
in other places rather than on its own. Still, it's nice to learn how it's
an integral part of many other tools.


## vi


Let's specifically talk about vi now.

vi, unlike ed, is a screen oriented text editor. It was written by Bill
Joy in 1976, as we said, and it's part of POSIX.

It was made as the visual mode for the ex line editor, and so the first
few releases within BSD were also named ex; in May 1979 the editor
was installed under the name "vi", short for visual, which took the user
directly into the visual mode of ex.

However the vi we use today isn't really a direct descendant of this one,
there's way more history behind it.

As we said, Bill Joy and friends were inspired by Coulouris' 'em' and
extended it to create 'ex' then added a full-screen visual mode to ex
and there was 'vi'.

As with all the work that was done, it was all a superposition of ideas
and code, one over the other. The same goes for the visual mode, it
was inspired by the Bravo editor, which is said to be the first WYSIWYG
editor. Not surprisingly Bravo was made at Xerox PARC in 1974.

Many ideas were taken from Bravo, for example the dot command is the
double-escape from Bravo, and for those not in the know it's a redo
command.

So what kicked off vi?

Joy created BSD Unix in March 1978, which gave him the option to freely
ship his new editor, ex and then vi as we said, within it. That gave it
exposure, and because at the time there weren't that many editors that
came with Unix, basically only ed, it gave rise to the success of vi.

On the other side there was Emacs, whose version at the time wasn't free
as in free beer and would cost hundreds of dollars. To be clear we're not
talking about GNU Emacs, which didn't appear before 1985, but an earlier
version called Gosling Emacs or Unipress Emacs.

> Gosling Emacs "didn't appear" before 1981, so, in fact, the expensive
> versions of Emacs were not a thing when vi was published. 1978 already had
> (versions of) TECO, Emacs's birthplace, as well as Stallman's original
> MIT Emacs itself which was available at no cost either - we had the topic
> "everything was free software once" in a previous discussion. Here is
> a list of free Emacsen.
> http://www.finseth.com/emacs.html#25

I digress, it was free, at least within Unix, as we'll see in a bit.

We mentioned earlier that vi was made as an update to ex, but it
was actually a hardlink to ex's visual mode (ex 2.0) that started to
be shipped within BSD2, because everyone that was using ex was using its
visual mode more than anything.

Which brings the question: is vi ex? Yes, in a sense, until it got
upgraded specifically for its visual mode only, with that new
visual-first way of thinking. The same goes for all the previous projects
we talked about in this podcast: they are all work upon work upon
work. As people learned new techniques, even though those were not always
clean and elegant, they just tried implementing new things over the work
of others.

Bill Joy implemented the visual mode because he thought it would primarily
be useful for others. These days the attraction to vi stems from how
lightweight and efficient it is; however, in its first iteration vi
was a very large program, it could barely fit in the memory of the
PDP-11/70. Even version 3.1, shipped in BSD3 in 1979, couldn't fit
anymore within the memory of a PDP-11.

Now back to what I said "it was free, at least within Unix".

That is because it was only available through commercial Unix vendors
such as Sun, HP, DEC, and IBM, which included the code up to vi 3.7, with
some of their own customizations, in their OSes, namely Solaris, HP-UX,
Tru64 UNIX, and AIX.

The version included in those OSes descends directly from vi, though with
modifications.

I mention this because this isn't the case everywhere.

ex and vi, being based on ed, were burdened by the AT&T licensing, and
thus people looking for a free editor would have to use something else.

For example Minix shipped a third-party vi clone called Elvis.

> Tanenbaum added the third-party Elvis editor to Minix - it was
> not created "by Minix" although its creator published it to the Minix
> newsgroup first.
> https://groups.google.com/forum/#!origin...lQiUR7aeYJ

Remember the whole war that AT&T led against the BSDs. It was not until
BSD Unix started relicensing its code that they addressed the issue:
first, in 1992, 386BSD used Elvis as a replacement for vi;
then in BSD 4.4-Lite, in 1994, Elvis was used as a starting point to
create nvi, a closer one-to-one correspondence with vi. This is the
version of vi that is used today in FreeBSD and NetBSD.

It's only recently, in 2002, that the original version of vi was allowed
to be distributed legally. It is sort of ironic that the BSDs, which were
created by Bill Joy, don't include the initial version of vi that he wrote.

So what's so special about vi usage that separates it from ed?

First of all, vi is modal, that means it can work in different modes,
there's a mode to access the ex features, a mode for applying commands,
normal mode, there's a mode for direct editing, insert mode, and others.

It also has the idea of combining single letters keys to create a full
command to be applied over text.

For example, in vi, you can type `cw<replacement text>Esc` and it would
mean "change a word to 'replacement text'". And the operation can be
repeated by the redo command we mentioned, the dot.

There are many criticisms of this way of interaction: that there are too
many single-character commands and that they are hard to remember.

Though arguably those were single characters because the connection
of the terminal was slow and it needed to be efficient rendering-wise,
though not memory-wise, which is fun to think about.

Another criticism is that it lacks mouse interaction, feedback when
switching modes, and multiple undo levels. Which is all because internally
vi is just ex with a nicer UI.

The last thing to tackle with vi is the origin of the iconic HJKL keys and
the usage of escape. Those keys were chosen because on the ADM-3A terminal,
the one Joy used, the escape key was at the location where tab is these
days, which is a good location, and the hjkl keys served as cursor
movements because on that keyboard they were printed with arrows on
them, the ADM-3A having no cursor keys to begin with.

A nota bene here is that you can use control sequences to simulate escape,
`ctrl-[` does the trick.


## g/re/p


Back in time again, with grep.
We talked a bit about its history earlier so let's get back to it.

grep is simple, it's one of the most commonly used commands on Unix. It's
a command that searches the input file for a pattern, using regular
expressions, and prints the lines that contain it.

This is it, end of the line, nothing more.

grep, as we said, was created by Ken Thompson in March 1973, soon after
the implementation of pipes, as a standalone application, at the request
of McIlroy who wanted ed's matching facilities outside of ed.

The name is literally the command that was initially used inside ed,
g/re/p: globally search a regular expression and print it.

g is for global, /regular expression/ for the search, and p to
print the results.

If you're familiar with vi, it is the same kind of command, as we
mentioned earlier. For example s/test/test2/g would replace all
occurrences of the word "test" with "test2".

With ex for example you would do:

```
ex -c 'g/test/p' file
```

And with grep simply `grep 'test' file`.
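
And since the whole point was to use it in a pipeline, here is a small
sketch in the spirit of McIlroy's dictionary work (the dictionary path
varies between systems):

```
# list the ten longest words containing "script" in the system dictionary
grep 'script' /usr/share/dict/words | awk '{ print length, $0 }' | sort -rn | head
```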

However grep didn't immediately get included as a tool in Unix. According
to McIlroy he kept it as a personal tool for a while, until he finally
decided to make it public.

From what I could see, the tool's official release came only 7 to 8 months
after its creation in March, in November of that year. However, I'm not
sure of my sources, and I'm not sure if it was released internally way
earlier.

He was parsing a dictionary in a horribly inefficient way for a project,
and once he got his hands on grep it worked like a charm in a chain
of piped programs.

It became such a useful tool that it led to the start of the understanding
of the concept of software as tools. It is often cited as the
"prototypical software tool" that is credited with "irrevocably
ingraining" Thompson's tools philosophy in Unix.

As with others, the software morphed and diverged over time.

Al Aho, also a researcher at Bell Labs and the co-author of AWK, wrote
egrep and fgrep one weekend in 1975, and they were introduced in Unix V7.

egrep is for extended regexes; fgrep is for a list of fixed
strings, using the Aho-Corasick string matching algorithm.

Historically, grep and egrep took turns being faster than the other.

The modern versions within GNU and BSD have deprecated egrep and fgrep;
they are now aliases for grep. According to POSIX, grep should instead
provide the -E and -F switches for compatibility.

Back to the topic. grep is so iconic that it is used by everyone as a
word for searching for something.

So much so that in December 2003, the Oxford English Dictionary Online
added draft entries for "grep" as both a noun and a verb. I'm not sure
if it has now been officially included, it seems like it. There's nothing
indicating that it's a draft.

The last thing to mention is also related to the massive influence it
had. As an example, there are a lot of tools with the word grep in their
names, like the pgrep utility, which displays the processes whose names
match a given regular expression.

It got something started within the community, the creation of mini tools
that did one thing. The next one is exactly that, we're going to tackle
`sed`.


## sed


It wasn't long before the creation of grep led to the invention of
other special-purpose tools.

The next ideas were kind of obvious: a grep-like tool but for
substitution, a so-called gres, a gred for deletion, a grea for appending,
and so on... There seemed to be no end to this program-as-tool thing.
That's when Lee McMahon, in 1973, the same year as before, decided
to merge all those concepts into one single tool, a sort of ed that
could be used in the pipeline. He called it sed, for "stream editor",
to edit right in the pipeline, in the stream.

sed is more or less a literal ed in the pipeline, thus its language is
very similar to ed's. ed was kind of universal knowledge at the time,
so it was intuitive to use sed.

The power of sed comes from how it lets you edit in the middle of a
pipeline pass-through, and from the flexibility of the complex editing
and pattern matching it allows. This is the epitome of the middle-man,
glue scripting, we talked about: it can easily adapt its input to fit
the output, and it does that really fast. It's on par with languages like
awk or perl for that matter. It's even Turing complete.

It was a dream come true for programmers that were using slow teletype
terminals and had to do basic editing. And it still is a key tool
today to edit huge files in a single pass through instead of using an
interactive editor.

It does that by reading the file line by line into an internal buffer
called the pattern space, then applying operations to this line, outputting
the modified line if it needs to, and continuing the cycle. It is quite
efficient.

However you'll rarely see sed used to its full potential today;
all you'll see is its simple replace feature, while in fact there are
more than 20 other commands available in the sed language.

It's a full Turing-complete language with only two variables: the
pattern space we talked about, and the hold space, a sort of storage.
It has conditions via a sort of GOTO-like branching (a label is a
colon followed by a string, and the b command branches to that label).
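
As a sketch of what those two spaces and the branching make possible, here
is the classic one-liner that prints a file in reverse line order:

```
# 1!G  on every line except the first, append the hold space to the pattern space
# h    copy the (growing) pattern space back into the hold space
# $p   on the last line only, print the accumulated, reversed text
sed -n '1!G;h;$p' file
```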

So how is it used?

You call it on the command line as 'sed', followed by either a sed script
file passed as an argument (with the -f option) or the script written
directly on the command line (with or without the -e option). It can
accept a file or a stream as input, as we said.

The important part is the expression syntax. I won't list all the commands
here but I can mention some.

For example, s is for substitution; it is followed by slashes separating
the pattern to match (a regex) and the string to substitute the match
with. d is used for deletion, i for insertion, a for appending, r to
append the text of a specific file to the current one, y to transliterate
from one character set to another, and there is much more.
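
A few of those commands in action, as a sketch (notes.txt is only a
placeholder):

```
sed 's/colour/color/g' notes.txt   # s: substitute on every matching line
sed '/^#/d' notes.txt              # d: delete the lines matching the address
sed 'y/abc/xyz/' notes.txt         # y: transliterate a->x, b->y, c->z
```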

> About sed's ed compatibility:
> ed does not limit you to the slash character: sxaxe or swawe, whatever
> works for you, replaces "a" by "e". sed does not support that.

Commands in sed may also take an optional address so that you can specify
where to apply them; it can be a line number or a regular expression.

For example 2d would delete only the second input line and print all
the others, while `/^ /d` would delete all lines beginning with a space.

You can see why for a long time no one saw a need for the 'head'
tool and instead intuitively used `sed 10q`, which prints the
first 10 lines and then applies the 'q' (quit) command.

Today we even have more powerful sed versions available, arguably and
probably a bit slower though.

For instance, GNU sed added several new features, including in-place
editing of files.

The original sed is sort of lost in history, though Eric Raymond reverse
engineered it, probably in 1995, and released it as `minised`, which was
the default sed for a while instead of GNU sed, and is also the default
`sed` in Minix.

sed was the big inspiration for the next mini-language we're going to
discuss, AWK. Which in itself was the inspiration for Perl. Everyone
influences everyone else.


## Awk


AWK stands for the initials of its creators: Alfred Aho, who, if you
remember, is the guy who implemented the extended-regex grep version
"egrep", Peter Weinberger, and Brian Kernighan.

It was made in 1977, 4 years after the previous tools.

In 1985, a new version, called nawk or new-awk, made the programming
language more powerful, now letting the user define his own functions,
adding multiple input streams, and computed regular expressions. This
version was popularized within Unix System V released in 1987.

As with the other tools, AWK is in the POSIX specifications standards.

AWK, like sed, is a programming language designed for text
processing. However, unlike sed it isn't mainly for editing streams; it
specializes in a more data-driven approach, it is more for extraction
and transformation than editing.

As with sed, it can act on both input streams and files.

To understand AWK you have to get the generic usage: it takes a pattern
(a regex) for matching and an action to extract and transform the data
from the lines matching the pattern. You can also omit either the pattern
or the action and keep only one of the two.


```
awk '/pattern/ { action }' file
```

The extraction mechanism automatically understands separators and
format. By default it splits the input into columns, which you can act
upon via variables that get assigned directly.
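
As a hedged sketch (the file name and field position are only
illustrative):

```
# print the second whitespace-separated column of every line containing "error"
awk '/error/ { print $2 }' server.log
```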

The action part of AWK is a full language; it includes strings and
associative arrays. It's actually the language that first popularized the
usage of associative arrays, arrays in which the index keys are strings.

The associative arrays make it easy to generate and report information
over huge amounts of data, accumulating information while parsing the
stream.
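
For example, a classic sketch of that report pattern (access.log is a
placeholder):

```
# count how many times each first field occurs, then report once the input ends
awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' access.log
```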

The language itself is easier to read than sed as it's C-like.

As with other tools there are newer versions of them.

GAWK was written between 1985-1988 by Paul Rubin with advice from Richard
Stallman. If you remember NAWK was released 1985-1987, so the releases
were close in time.

GAWK was, like any other tool, extended by even more people afterward.

In May 1997 it got extended even more by Jürgen Kahrs, who added
network access features to gawk.

Also, as with sed, vi, and others, the original version of the awk tool
was kept behind closed doors until Brian Kernighan's nawk (new AWK)
source was publicly released in the late 1990s; this version is used by
many BSD systems to avoid the GPL license.


## Makefile


The last tool we're going to talk about is make and the makefile. It
isn't really a glue tool, but it's an important one for scripting.

Make is a build automation tool, most often, though not exclusively, used
to create executable programs and libraries from source code by reading a
file named Makefile.

The make utility is used to interpret the Makefile.

More precisely it's a dependency-tracking build utility. It knows when
the timestamps of the dependencies change and acts accordingly, doing
the minimum amount of work necessary, rebuilding only what depends on
those updated files.

Historically it was created in April 1976 by Stuart Feldman at Bell
labs, and included in Unix starting with the first edition of PWB/UNIX
(programmer's workbench).

Feldman got the idea for make when one of his coworkers was having trouble
debugging a program and didn't realize that the executable wasn't updated
with the changes he'd made to the code, thus rendering the changes useless
and wasting time chasing invisible bugs.

This person that was debugging was Steve Johnson, the author of YACC,
a parser generator, that was used for a bunch of other tools that we
mentioned.

Thus Feldman thought of building a dependency analyzer but came up with
something much simpler, and created the make utility over a weekend.

The makefile language parsing was made using Johnson's YACC.

This was great because before make they used separate shell scripts
accompanying source code to build projects. Including all the abstract
target dependencies and tracking them in a single file was a big move
forward.

So the makefile lists the dependencies, but what exactly is in a makefile?

Makefiles are relatively simple. They are a set of directives, targets,
dependencies, and the commands accompanying them.

Concretely it contains five kinds of things: explicit rules, implicit
rules, variable definitions, directives, and comments.

A comment is anything that is after the '#' character.

An explicit rule is a label, a target; it lists the other files or targets
it depends on, its prerequisites. Make will rebuild this
target if any of the dependencies changes.

The implicit rule is similar but applies to a class of files based on
their names; for example you could create a rule for any C source file.

The variable definition is simple assignment, like in the shell. It can
be used and substituted later.

Building a target means executing the commands, the recipe, listed under
its label. Those command lines must be tab indented.

There are two criticisms of this. First, each command is executed in a
separate shell and this is system dependent, thus not so portable.
Secondly, the tab is a whitespace character and can easily be confused
with spaces, which misleads users. Making the tab character visible
solves this.

The tab was used because yacc was new and the author wasn't accustomed
to it; he was just learning to create something useful out of it. It
worked and spread too fast to revert the decision.

Anyway, this is as far as it goes, you got targets, dependencies for
it, and commands to execute after the dependencies/prerequisites have
been fulfilled.

```
target: dependencies
system command(s)
```

Or, separated by a semicolon:

```
targets : prerequisites ; command
```
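
Putting this together, here is a minimal illustrative Makefile (the file
names are hypothetical, the recipe lines must be indented with a real tab,
and the `%.o: %.c` pattern-rule syntax is a GNU make convenience; POSIX
make uses suffix rules instead):

```
# variable definitions, substituted later as $(CC) and $(CFLAGS)
CC = cc
CFLAGS = -O2

# explicit rule: relink the program when any listed .o file changes
hello: hello.o util.o
	$(CC) $(CFLAGS) -o hello hello.o util.o

# implicit (pattern) rule: how to build any .o from the .c of the same name;
# $< expands to the first prerequisite
%.o: %.c
	$(CC) $(CFLAGS) -c $<

# housekeeping target, invoked explicitly with `make clean`
clean:
	rm -f hello *.o
```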

You can see that this is not limited to C programming, contrary to
what some may think. It's a valid build system for anything that has
dependencies or that needs changes to be tracked. They're a good way to
bundle up little procedures together, to script, for example when making
a distribution package.

Usage-wise, make is invoked on the command line with `make target`. It
uses the file named Makefile by default but it could have any other name.

Without argument it'll build the first target that appears in the
file. And one thing to note here is that the targets don't need to be in
any specific order in the Makefile, make will understand the dependencies
between them.

Today there are a ton of other build systems, which I won't cover: Imake,
autoconf, cmake, automake, GNU make, etc..

All have their advantages and disadvantages.


## Conclusion


You can probably guess what kind of conclusion I'm going to give to this
podcast. You realized it after the first two of those tools: they're all
linked through history, they all learned from the ones before. And this is
probably totally unrelated to the topic but it's the most apparent
relation, everything is a remix, and it's great. It made me respect those
humble tools.

And there are much more that I didn't mention.

After learning most of this history I think the concept of the tool as
glue code is less about gluing code and more about solving problems in
efficient and human ways. The pipeline has its place there, but probably
not the skewed, extremist view of the Unix philosophy as minimalist
radicalism we kind of see in some places today.

So get your pipeline ready, get your scripts ready, think of how to
improve them, try all the tools mentioned and others, and more importantly
have fun.


## References


Scripting or Glue-coding:
<https://nixers.net/showthread.php?tid=1893>
<https://en.wikipedia.org/wiki/Scripting_language>
<https://en.wikipedia.org/wiki/Glue_code>
<https://en.wikipedia.org/wiki/Adapter_pattern>
<http://www.catb.org/~esr/writings/taoup/html/ch11s06.html>
<http://www.columbia.edu/~rh120/ch001j.c11>
<https://www.bell-labs.com/usr/dmr/www/hist.html#pipes>

roff:
<https://en.wikipedia.org/wiki/Roff_(computer_program)>
<https://www.gnu.org/software/groff/manual/html_node/History.html>
<http://troff.org/history.html>
<https://manpages.bsd.lv/history.html>
<https://lists.gnu.org/archive/html/groff/2014-02/msg00104.html>
<http://www.tuhs.org/cgi-bin/utree.pl?file=V1/man/man1/roff.1>

bc/dc:
<http://man.cat-v.org/unix-1st/1/dc>
<https://en.wikipedia.org/wiki/Bc_(programming_language)>
<https://en.wikipedia.org/wiki/Dc_%28computer_program%29>
<https://compilers.iecc.com/comparch/article/95-09-015>
<https://www.gnu.org/software/bc/manual/html_mono/bc.html>
<http://online.offshore.com.ai/arms-trafficker/>
<http://www.cypherspace.org/rsa/>
<http://fringe.davesource.com/Fringe/Crypt/RSA/Algorithm.html>
<http://www.cs.dartmouth.edu/~doug/reader.pdf>
<http://webarchive.loc.gov/all/20100506231949/http://cm.bell-labs.com/cm/cs/who/dmr/hist.html>


ed/ex:
<http://man.cat-v.org/unix-1st/1/ed>
<http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/ed.c>
<https://en.wikipedia.org/wiki/Ed_(text_editor)>
<https://www.ibm.com/developerworks/community/blogs/brian/entry/ed_editor?lang=en>
<https://sanctum.geek.nz/arabesque/category/ed/>
<https://en.wikipedia.org/wiki/Ex_(text_editor)>
<http://web.mit.edu/kolya/misc/txt/editors>
<https://nixwindows.wordpress.com/2018/03/13/ed1-is-turing-complete/>

vi:
<https://en.wikipedia.org/wiki/Vi>
<http://thomer.com/vi/vi.html>
<https://en.wikipedia.org/wiki/Bravo_(software)>
<http://www.viemu.com/a-why-vi-vim.html>
<https://en.wikipedia.org/wiki/Gosling_Emacs>
<https://en.wikipedia.org/wiki/GNU_Emacs>
<http://www.finseth.com/emacs.html#25>
<https://groups.google.com/forum/#!original/comp.os.minix/RhqVtXMWiN8/6GlQiUR7aeYJ>

g/re/p:
<https://www.reddit.com/r/unix/comments/7meimh/something_interesting_that_i_read_today_about/>
<https://en.wikipedia.org/wiki/Grep>
<https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48>
<http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html>
<https://en.oxforddictionaries.com/definition/us/grep>

sed:
<https://en.wikipedia.org/wiki/Sed>
<https://blog.sourcerer.io/a-brief-history-of-sed-6eaf00302ed>
<http://groups.engin.umd.umich.edu/CIS/course.des/cis400/sed/sed.html>
<http://exactcode.com/opensource/minised/>

awk:
<https://www.gnu.org/software/gawk/manual/html_node/History.html>
<https://en.wikipedia.org/wiki/AWK>

Make:
<https://en.wikipedia.org/wiki/Makefile>
<https://en.wikipedia.org/wiki/Make_(software)>
<http://www.catb.org/esr/writings/taoup/html/ch15s04.html>

<http://sasamat.xen.prgmr.com/michaelochurch/wp/index.php/2013/01/09/ide-culture-vs-unix-philosophy/>
<https://mkaz.tech/geek/unix-is-my-ide/>
<https://sanctum.geek.nz/arabesque/series/unix-as-ide/>
<https://dspinellis.github.io/unix-history-man/>


## Music


Catmosphere - Candy-Coloured Sky: <https://soundcloud.com/argofox/catmosphere-candy-coloured-sky?in=photochic16/sets/creative-commons-license>


