Nixers Book Club - Book #1: The UNIX Programming Environment - Community & Forums Related Discussions

movq
Long time nixers
(05-12-2020, 11:26 AM)venam Wrote: Interestingly, there's a lot of discussion about the efficiency of different ways to do something, especially in conditions.
They advise using the ":" built-in instead of calling true as an external command, and relying on case matching instead of external calls, especially within loops. Calling a program was something you had to think about.

It still is, especially if you’re looping over stuff. Forking is damn expensive. Granted, it probably was a lot worse back then, but it’s still one of the first things I try to optimize.

(05-12-2020, 11:26 AM)venam Wrote: One thing that caught my attention that I didn't know about was that you don't have to give for loops a value, it by default loops over `$*`.

It’s actually "$@", the original argument vector (without argv[0]). I wonder why they don’t use "$@" all the time and instead make a lot of examples using $*. (I already made that remark, didn’t I?) Luckily, they explain what "$@" does on page 161.
phillbush
Long time nixers
It's time to bump this thread.
How was your week? Did you have a good read?

I lost my notes for chapters 6 and 7; I could not find them on my hard drive.
I probably deleted them.
Sorry about that.
I'll write what I remember from the last part of those chapters.

The main challenge of these chapters is to convert the old K&R C code into modern C code. The translation is very straightforward if you know the K&R peculiarities.
Also, the book covers something that does not work anymore: open(2)ing directories.

Chapter 7 begins with an introduction to file descriptors and low-level I/O functions like read(2) and write(2), and provides the implementation of a cat-like program called readslow(1) that reads a file while waiting for more input, similar in behavior to tail -f.

Quote:Even though one call to read(2) returns 0 and thus signals end of file, if more data is written on that file, a subsequent read(2) will find more bytes available. This observation is the basis of a program called readslow(1), which continues to read its input regardless of whether it got an end of file or not.

Structurally, readslow(1) is identical to cat(1) except that it loops instead of quitting when it encounters the current end of the input. It has to use low level I/O because the standard library routines continue to report EOF after the first end of file.
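
A minimal sketch of that idea, not the book's code: the function names and the polling interval are made up for illustration. It relies only on read(2) returning 0 at the current end of the file and then finding more bytes once the file has grown.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Copy everything currently readable from `in` to `out`.
 * Returns the number of bytes copied, or -1 on error.
 * A return of 0 only means "nothing more right now", not "never". */
long copy_available(int in, int out)
{
    char buf[BUFSIZ];
    long total = 0;
    ssize_t n;

    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n)
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}

/* readslow-style loop: after each end of file, sleep and try again.
 * `delay` is a hypothetical polling interval in seconds. */
void read_slowly(int in, int out, unsigned delay)
{
    while (copy_available(in, out) >= 0)
        sleep(delay);
}
```

Calling read_slowly(0, 1, 10) with standard input redirected from a growing log file would behave roughly like tail -f on that file, started from the beginning.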

Then, it provides a simplified and limited implementation for cp(1).

It provides the implementation of a mail watcher using sleep() and stat() in a loop.
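
The sleep()/stat() loop can be sketched roughly like this, assuming that "something new arrived" simply means the file got bigger. The function names, the interval and the message are illustrative, not taken from the book.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Report whether `path` is larger than the size stored in *last,
 * then update *last to the current size.
 * Returns 1 if it grew, 0 if not, and -1 if stat(2) failed. */
int file_grew(const char *path, off_t *last)
{
    struct stat st;
    int grew;

    if (stat(path, &st) == -1)
        return -1;
    grew = st.st_size > *last;
    *last = st.st_size;
    return grew;
}

/* Polling loop in the spirit of a mail watcher; never returns.
 * `interval` is the number of seconds between checks. */
void watch_file(const char *path, unsigned interval)
{
    off_t last = 0;

    file_grew(path, &last);          /* record the starting size */
    for (;;) {
        if (file_grew(path, &last) == 1)
            fprintf(stderr, "\n%s has grown\a\n", path);
        sleep(interval);
    }
}
```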

It then provides sv(1), a handy utility that implements the same behavior as the -u flag of GNU cp(1).

Quote:Exercise 7-15 (scapegoat).
LOL, I remember this exercise blaming Ken Thompson when I first read the book.

It implements waitfile(1), which nowadays can be implemented with inotify on Linux.

In the end, it implements timeout(1), an application that “illustrates almost everything talked about in the past two sections”

Quote:There is no detailed description of the UNIX system implementation, in part because the code is proprietary
Thank goodness we have free UNIXes now.

Sorry for the short review this week.
venam
Administrators
Chapter 6 and 7 introduce C programming with the same philosophy as previously in the book, namely combining program functionalities, using the hard work done by someone else. However, here it focuses on things other than text such as monitoring files and inspecting metadata.

Overall, these two chapters took longer to tackle because of the number of examples. Many of them have to be modified to work in our current environment, mostly by adding headers (<unistd.h> and <stdlib.h> in particular) that the code could get away without at the time. The style of programming in C is also very different: the function parameters are declared after the function head, and the return type is omitted when it is an int. The lack of curly braces around single-statement loops and conditions bugs me, as do for(;;) instead of while and the use of exit instead of return. Though it clearly mentions that return can be used in main and will be the exit value, I guess it's to make it more explicit. The programming style feels a bit "hacky" compared to today's stricter standards. It's ironic considering these are the main authors; it shows how people reinterpret things with time. Maybe for the best, maybe for the worst.

C is introduced as the standard language of UNIX systems because the kernel and user programs are written in C. I'm not sure it's a valid reason, but that makes the system coherent.


Chapter 6

The first example, vis, a cat that shows non-printing characters as octal, was justified as useful because sed wasn't able to handle long input. That's interesting, today it certainly can.
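
For reference, the core of such a filter fits in a few lines. This is a sketch in modern C rather than the book's listing, and the function names are invented here: printable characters and common whitespace pass through, everything else is shown as a backslash and three octal digits.

```c
#include <ctype.h>
#include <stdio.h>

/* Write a visible form of byte `c` into `out` (at least 5 bytes):
 * printable characters, newline and tab pass through unchanged,
 * anything else becomes a backslash followed by three octal digits. */
void vis_char(int c, char *out)
{
    unsigned char uc = (unsigned char)c;

    if (isprint(uc) || uc == '\n' || uc == '\t')
        sprintf(out, "%c", uc);
    else
        sprintf(out, "\\%03o", (unsigned)uc);
}

/* Filter a whole stream the way a minimal vis would. */
void vis_stream(FILE *in, FILE *outfp)
{
    char buf[5];
    int c;

    while ((c = getc(in)) != EOF) {
        vis_char(c, buf);
        fputs(buf, outfp);
    }
}
```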

It's fun to see the authors emphasize why using existing macros and functions is important, even though you could write them yourself: they've already been debugged and are faster. Maybe people argued for writing these functions themselves, I'm not sure.

In the first example, vis, the string.h header is missing for it to compile against a newer C library. This recurs throughout the exercises; it seems the code could get away without many of these headers at the time.

argc and argv are introduced as the command line arguments, along with FILE I/O and the idea of the default streams: stdin, stdout, stderr. There used to be a limit of 20 files open at the same time, which is extremely small by today's standards.

They recommend using getopt(3) but don't in their example and go on to parse the arguments manually.

In an example we implement a screen-at-a-time printer because according to the authors there was no standard program to print per screen because UNIX was initially paper based. Wasn't there less/more at the time? Also 24 lines terminal :D .

There's a discussion about which features to include in programs. There's no definitive answer, but the main principles are that the program shouldn't be hard to debug, shouldn't do too many things, and that features should have a reason to be there so they don't lie unused.

We go on to rewrite the pick command from the previous chapter, but in C this time. Note the use of stderr to ask the user a question, and then select the output on stdout.

I've also noted a weird way of having external function signatures right in the middle of the functions using them when we know we'll define them later:

Code:
FILE *efopen();

There's a section about linting using lint(1), and debugging core dumps using adb and sdb, which they call arcane but indispensable. To me these both look similar to gdb, which isn't any less arcane.

zap gets rewritten in C because of the problem with spawning too many processes. It uses popen(3) and calls ps to parse its output.
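
The popen(3) part can be sketched as follows. This is not zap itself: the function name is hypothetical and it only counts matching lines, but it shows the point of the rewrite, one child process for the whole scan instead of several per line.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Run `cmd` through the shell and count the output lines that
 * contain `needle`. Returns -1 if the command could not be started.
 * A line longer than the buffer is read in chunks, and each chunk
 * that matches is counted. */
int count_matching_lines(const char *cmd, const char *needle)
{
    char line[BUFSIZ];
    int count = 0;
    FILE *fp = popen(cmd, "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL)
        if (strstr(line, needle) != NULL)
            count++;
    pclose(fp);
    return count;
}
```

Something like count_matching_lines("ps -e", "sleep") would then scan the process list in a single pass; the ps flags are only an example and differ between systems.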

idiff introduces mktemp(3) and unlink(3).

Reading environment variables is taught to be a way to not have to type the same arguments all the time, this is done through getenv(3).
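
A tiny sketch of that pattern, with a hypothetical helper name: look the variable up and fall back to a built-in default when it is not set.

```c
#include <stdlib.h>

/* Return the value of environment variable `name`, or `fallback`
 * when the variable is unset or empty. */
const char *env_or(const char *name, const char *fallback)
{
    const char *v = getenv(name);

    return (v != NULL && *v != '\0') ? v : fallback;
}
```

A pager could then pick its page length with atoi(env_or("PAGESIZE", "22")); the variable name and default here are only illustrative.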

Chapter 7: Unix System Calls

System calls are the lowest level of interaction with a UNIX OS, stuff like IO, inodes, processes, signals and interrupts.

The concept of a file is polished a bit here: we define files through the interaction with a file descriptor. By default we have 0, 1 and 2 open, and new file descriptors will increment from there. The shell is allowed to change the meaning of the descriptors, which is what allows us to redirect output from one to another.
Anything is a file descriptor, and on them you can read and write. Reading or writing one byte at a time means the I/O is unbuffered; otherwise it's advised to use a size equal to the size of a disk block, 512 or 1024 (the BUFSIZ constant).

We can see the effect of a wrong buffer size in the table shown: reading a 54KB file was extremely slow when the size did not match the block size.
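
Most of that cost is the number of system calls. The sketch below (the function name is made up) does not reproduce the book's timings; it only counts how many read(2) calls a given buffer size needs, which is easy to check on any machine.

```c
#include <stdlib.h>
#include <unistd.h>

/* Read `fd` from its current position to end of file in chunks of
 * `bufsize` bytes (which must be greater than 0) and return how many
 * read(2) calls that took, including the final one that returns 0.
 * Returns -1 on error. */
long count_reads(int fd, size_t bufsize)
{
    char *buf = malloc(bufsize);
    long calls = 0;
    ssize_t n;

    if (buf == NULL)
        return -1;
    do {
        n = read(fd, buf, bufsize);
        calls++;
    } while (n > 0);
    free(buf);
    return n < 0 ? -1 : calls;
}
```

For 4096 bytes of input this gives 4097 calls with a 1-byte buffer and 9 calls with a 512-byte buffer.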

Then there's the question of what happens when many processes read or write to the same file. I guess this needs to be mentioned in a multi-user system.

We write a program called readslow that continues reading even when read returns 0 (no bytes left, i.e. EOF), which is kind of like tail -f.

The file lifecycle is covered through open, creat, close and unlink, plus the 9 permission bits: rwx for owner, group and everyone else. A newly created file is said to be opened directly for writing, but I think it depends on the flag given during creation.

In most examples, I find myself adding missing headers. The next one, introducing error handling, was giving me trouble with sys_nerr and sys_errlist, which were not exposed in glibc, so I used the err.h BSD extension.

Code:
#include <err.h> // BSD extension
#include <errno.h>

errno is not reset to 0 when things go well; you must reset it manually. Signal handlers, on the other hand, are reset automatically. Yep, things are a bit hacky.
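
A portable alternative to sys_errlist is strerror(3), which is standard C. The sketch below is not the book's error routine; the function name and message format are chosen here for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Try to open `path` for reading. On failure, store a message of the
 * form "path: reason" in `msg` and return NULL. */
FILE *open_or_explain(const char *path, char *msg, size_t msglen)
{
    FILE *fp;

    errno = 0;                   /* errno is never cleared for us */
    fp = fopen(path, "r");
    if (fp == NULL)
        snprintf(msg, msglen, "%s: %s", path, strerror(errno));
    return fp;
}
```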

Another example teaches us to seek in a file, and the next one to walk a directory.
At that time a directory was readable directly as a normal file; these days it isn't, and we have to rely on a system call.
The modern way to do this is getdirentries, present on the BSDs and a few other systems. Or, if we want a structure like the one in the book, struct dirent, we can use readdir(3), which POSIX declares in dirent.h (sys/dir.h is the older BSD name for that header).
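
With opendir(3) and readdir(3) from dirent.h, a directory scan looks roughly like this. The helper is hypothetical and much simpler than the book's example; it only checks whether a name is present.

```c
#include <dirent.h>
#include <string.h>

/* Return 1 if directory `dir` contains an entry called `name`,
 * 0 if it does not, and -1 if the directory cannot be opened. */
int dir_contains(const char *dir, const char *name)
{
    DIR *dp = opendir(dir);
    struct dirent *de;
    int found = 0;

    if (dp == NULL)
        return -1;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(dp);
    return found;
}
```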

Now that we have an understanding of some low level structure, the book goes into inodes and the stat system call to inspect them.

Then it goes into spawning processes and signal handling.

We can either use system(3) or the exec family of system calls. The latter overlay the existing program. This is part of the lifecycle of processes.
The technique that lets us regain control after running a program is the fork() system call, which splits the process into two copies; the child can then exec the new program. The call returns the ID of the child to the parent, and 0 to the child.
The parent can wait for the child, and will receive its exit status.
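
Put together, the fork/exec/wait sequence is short. This is a sketch with an invented function name; returning 127 when exec fails follows the shell convention and is a choice made here, not something from the book.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program named by argv[0] (searched for in PATH) and wait
 * for it. Returns its exit status (127 if the exec failed), or -1 if
 * fork/wait failed or the child was killed by a signal. */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {              /* child: fork returned 0 */
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }
    /* parent: fork returned the child's process ID */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```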

Fork and exec let the child inherit the 3 file descriptors, so both have the same files open. And so we think again of the issue of multiple processes reading or writing the same file, and of flushing to avoid problems.

That's when we are shown how we can save these file descriptors into variables and disconnect and reconnect them as we see fit. The following idiom is interesting: closing a descriptor, then duplicating another one into the lowest unallocated file descriptor:

Code:
close(0); dup(tty);
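
Spelled out with error checks, the idiom could look like the sketch below. The helper names are hypothetical, and it assumes descriptor 0 is open on entry. It works because dup(2) always returns the lowest free descriptor, which is 0 right after close(0).

```c
#include <fcntl.h>
#include <unistd.h>

/* Make standard input read from `path` using the close-then-dup idiom.
 * Returns a saved copy of the old descriptor 0 (to restore later),
 * or -1 on failure. */
int redirect_stdin(const char *path)
{
    int fd, saved;

    if ((fd = open(path, O_RDONLY)) == -1)
        return -1;
    if ((saved = dup(0)) == -1) {
        close(fd);
        return -1;
    }
    close(0);                /* 0 is now the lowest free descriptor */
    if (dup(fd) != 0) {      /* ...so dup() should hand back 0 */
        close(fd);
        close(saved);
        return -1;
    }
    close(fd);
    return saved;
}

/* Undo redirect_stdin(): put the saved descriptor back at 0.
 * Returns 0 on success, -1 on failure. */
int restore_stdin(int saved)
{
    close(0);
    if (dup(saved) != 0)
        return -1;
    close(saved);
    return 0;
}
```

Today dup2(fd, 0) does the close and the duplication in one call, without the window in which descriptor 0 is closed.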

Signal handling was messy: it still used signal(2) instead of sigaction(2), so the signal handler was reset every time it was called, and the first line of the handler always had to install it again.
That reminds me of something I wrote.
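
For comparison, a sketch with sigaction(2), where the handler stays installed after it runs. The handler and counter names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught;   /* number of signals seen */

static void on_signal(int signo)
{
    (void)signo;
    caught++;            /* no need to re-install the handler here */
}

/* Install on_signal for `signo`. With sa_flags set to 0 the handler
 * is not reset to the default action after delivery.
 * Returns 0 on success, -1 on error. */
int install_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}
```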

One thing I wasn't aware of is this pair of functions for non-local jumps:

Code:
jmp_buf sjbuf;
setjmp(sjbuf);
longjmp(sjbuf, 0); // non local goto
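
A slightly fuller sketch of how the pair is typically used for error recovery; the function names are made up. setjmp() returns 0 when called directly and a non-zero value when control comes back through longjmp() (a value of 0 passed to longjmp is turned into 1).

```c
#include <setjmp.h>

static jmp_buf recover;        /* where to resume after an error */

/* Deep inside the computation: bail out on a zero divisor. */
static double divide(double a, double b)
{
    if (b == 0.0)
        longjmp(recover, 1);   /* does not return; setjmp yields 1 */
    return a / b;
}

/* Compute a/b and store the quotient in *out.
 * Returns 0 on success, -1 if divide() jumped out. */
int safe_divide(double a, double b, double *out)
{
    if (setjmp(recover) != 0)
        return -1;             /* arrived here via longjmp */
    *out = divide(a, b);
    return 0;
}
```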

These were 2 fun chapters, giving insight into how writing C at the time was done and for which reasons.
movq
Long time nixers
(I couldn’t really keep up with you this time. Quite a fast pace in
general, if you ask me. :))

One thing that caught my eye at the beginning of chapter 6:

Quote:We will be writing in C because it is the standard language of UNIX
systems … and, realistically, no other language is nearly as well
supported.

This reminds me so much of the old DOS days. There were batch files and
BASIC. OS/2 also came with another scripting language, REXX. But that
was it, the basic systems didn’t provide anything else. From today’s
perspective, this feels like a massive restriction, like a cage. (Sure,
you could install more programs, but you had to know where to get them
from or buy them.)

Starting with SUSE Linux at the end of the 1990s, a whole new world
opened up. I still have the box of version 6.4 from 2000 with its 6 (!)
CD-ROMs, which we bought (!) in a store.

And today, we have so many things at our fingertips. Want to do Rust?
Sure, download it. Oh, and the Rust book is online, too.
phillbush
Long time nixers
(13-12-2020, 04:17 AM)vain Wrote: (I couldn’t really keep up with you this time. Quite a fast pace in
general, if you ask me. :))

Introduction of chapter 8:
Quote:This is a very long chapter, because there's a lot of detail involved in getting a non-trivial program written correctly, let alone presented. [...] Hang in, and be prepared to read the chapter a couple of times

To let everyone who is reading keep the same pace, and because of the length and terseness of the 8th chapter, I think it is a good idea to delay the next book club session to the next weekend.
What do you think?
movq
Long time nixers
(18-12-2020, 09:31 AM)phillbush Wrote: I think it is a good idea to delay the next book club session to the next weekend.
Well, I’d certainly appreciate it. :) I also saw a couple of other people on IRC who have fallen behind, so …

Maybe even two weeks? Because Christmas and all that. :)
venam
Administrators
I agree, let's wait till the end of this week to see if people are on pace then you can decide if you want to initiate the discussion or not.
phillbush
Long time nixers
Ok, let's do the next book club session on 2021-01-02.
Happy xmas to everyone!
phillbush
Long time nixers
Merry Christmas and happy new year, nixers!
It's time to bump this thread!

Chapter 8 is about the full use of the UNIX environment and its development tools to develop an interpreter for a language that deals with floating point arithmetic. The development is broken down into six stages that “parallel the way that they actually wrote the program”. The stages go from a simple calculator executed as it is parsed, to a stack machine interpreter, to a full language with function calls and recursion.

In the first stage, expressions are evaluated as the input is parsed. In more complicated situations (including hoc4 and its successors), the parsing process generates code for later execution. We implement a four-function calculator with parentheses.

In the second stage, we use an array of characters from a through z to implement variables.

In the third stage, we implement arbitrarily long variable names and built-in functions by implementing a symbol table (which is actually a linked list of structures holding variable names and a union of a value and pointer to function).
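
Such a symbol table can be sketched as below. The field names, type tags and function signatures are illustrative and not copied from the book.

```c
#include <stdlib.h>
#include <string.h>

enum { VAR, BLTIN };                /* what kind of symbol this is */

typedef struct Symbol {
    char *name;
    int type;                       /* VAR or BLTIN */
    union {
        double val;                 /* used when type is VAR */
        double (*func)(double);     /* used when type is BLTIN */
    } u;
    struct Symbol *next;            /* next entry in the list */
} Symbol;

static Symbol *symlist = NULL;      /* head of the linked list */

/* Find `name` in the table; return NULL if it is not there. */
Symbol *lookup(const char *name)
{
    Symbol *s;

    for (s = symlist; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}

/* Add `name` at the front of the list with its value zeroed.
 * The caller fills in u.val or u.func. NULL if out of memory. */
Symbol *install(const char *name, int type)
{
    Symbol *s = malloc(sizeof *s);

    if (s == NULL)
        return NULL;
    s->name = malloc(strlen(name) + 1);
    if (s->name == NULL) {
        free(s);
        return NULL;
    }
    strcpy(s->name, name);
    s->type = type;
    s->u.val = 0.0;
    s->next = symlist;
    symlist = s;
    return s;
}
```

Lookup is linear, which is fine for a handful of variables; a hash table would be the next step for anything larger.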

In the fourth stage, we move from executing code as the input is parsed to generating code as the input is parsed, for later execution.

In the fifth stage, we add control-flow (if-else and while), conditions and statement grouping.

In the sixth and last stage, we add recursive functions and string handling.

The main tool is the yacc(1) parser generator, that is used to write one of the most important parts of the project (along with the machine interpreter): the grammatical parser. The chapter also introduces the make(1) utility to manage the project, and does a short digression on lex(1) (the lexical analyzer generator).

The interpreter we develop is hoc, the High Order Calculator. Although hoc(1) hasn't survived into modern UNIXes, it survives on Plan 9. Its source code is available here (bundled in a single shell script!).

It was very fun to write hoc, you can see my implementation here on github.

Quote:The machine itself is an array of pointers that point either to routines like `mul` that perform an operation, or to data in the symbol table.

I think that the `Inst` type is a botch. It is used not only to type an instruction operation in the machine memory, but also to type instruction operands (entries in the symbol table) and pointers to other instructions (as needed in stage 5 for while and if statements). The original implementation uses a multi-purpose pointer to function that is cast to other things. I chose to use a union instead. The code in `whilecode` and `ifcode` became more straightforward with my change: those complex type casts are no longer needed.
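
To make the union idea concrete, here is a toy stack machine in that style. It is a sketch only: it is neither the book's code nor the implementation linked above, operands are stored as plain doubles instead of symbol table pointers, the `target` member is shown but unused, and there is no stack bounds checking.

```c
#include <stddef.h>

typedef union Inst Inst;
union Inst {
    void (*op)(void);     /* an operation such as add or mul */
    double val;           /* an operand following constpush */
    Inst *target;         /* a jump target, as while/if code would need */
};

#define NSTACK 256
static double stack[NSTACK];
static double *sp;        /* next free slot on the stack */
static Inst *pc;          /* next instruction to execute */

static void push(double d) { *sp++ = d; }
static double pop(void)    { return *--sp; }

/* The operand sits in the slot right after the opcode. */
static void constpush(void) { push((pc++)->val); }
static void add(void) { double b = pop(); push(pop() + b); }
static void mul(void) { double b = pop(); push(pop() * b); }

/* Run `prog` until an instruction whose op is NULL (the stop marker)
 * and return the value left on top of the stack. */
double execute(Inst *prog)
{
    sp = stack;
    for (pc = prog; pc->op != NULL; )
        (pc++)->op();     /* pc already points past the opcode */
    return pop();
}
```

A program for (2 + 3) * 4 is then the sequence constpush 2, constpush 3, add, constpush 4, mul, followed by the stop marker.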

Quote:By the way, we chose to put all this information in one file instead of two. In a larger program, it might be better to divide the header information into several files so that each is included only where really needed.

The one-header method is used throughout the entire project. I didn't like this approach: as the program grows, the header becomes messy. I opted to separate the code into more files and write a header for each module.

Another feature I implemented was variable-length argument lists for built-in functions, so that builtins can take a variable number of arguments. I implemented an argument list in the parser as a grammatical class and assigned it a pointer to a structure containing the arguments of a built-in function. For example, atan2(y, x) will allocate an Arg structure with two elements and pass its address to the built-in function. The structure is freed with delargs() after the built-in function is evaluated. This also prohibits assignment to constants, because I implemented constants as nullary functions, for example pi(), and hoc prohibits assignment to a built-in function.

This is the first time I used yacc(1) and I liked it a lot. It is a really useful utility. I have seen other implementations of hoc(1) that use GNU Bison extensions, but I had to stick to POSIX yacc, as that's what my system has.
venam
Administrators
Chapter 8: Program Development

This chapter is about constructing a mini programming language using specific UNIX toolsets, because UNIX is a program development environment. I wasn't expecting writing a compiler/interpreter in this book.

Writing a programming language is explained as a problem of converting inputs into actions and outputs. For this we'll need the following:

yacc: generates a parser from a grammatical description
make: control process to compile programs
lex: lexical analyzer

The process is divided into 6 steps.

1. A four-function calculator
2. Variables a-z
3. Long variables name, built-in functions
4. Refactoring
5. Control flow
6. Recursive functions and procedures with arguments

Stage 1:

We get introduced to yacc and a little bit to make's automation.
Yacc is used to generate a parser whose grammar rules have actions attached, while the makefile saves us from typing the compilation command over and over again.

I didn't know that make was yacc file aware, it can automatically know how to build them.


Stage 2:

This step is about adding memory to hoc for 26 single-letter variables, a to z. This is done with a fixed-size array, which is impressively simple to add.

We also see how error handling is done, here using longjmp and signal. The signal handler doesn't need to be re-installed inside the handler, because it jumps back to a place where it is set again right after the label.


Stage 3:

This stage drops the fixed size array in favor of a linked list for arbitrary variable names and built-in functions.
Weirdly, the programming style also changes, the syntax isn't really the same.
There's a lot of heavy changes involved, and we definitely need a makefile to make the linking process simpler.

We see that make -n can be used to have a view of the full process:
Code:
yacc -d hoc.y
mv -f y.tab.c hoc.c
cc    -c -o hoc.o hoc.c
make: *** No rule to make target 'y.tab.h', needed by 'init.o'.  Stop.
rm hoc.c

The authors of the book also insist on a "pr" rule in the makefile, to print the file content. I'm not sure it's actually useful these days.
The "clean" rule is definitely useful.

Lastly, in this section we get to know lex, the lexical analyzer. Instead of having to write yylex() yourself, you can rely on it to create the tokens for you. However, the book only uses it here and argues that they'll revert back to C for size related reasons. Using lex makes the binary a bit bigger, on my machine the lex version is 42KB while the one without is 24KB.

Stage 4:

In this section we stop interpreting the code right away and instead generate an intermediate code (IR, intermediate representation). This is then loaded into a stack machine and executed.
In other words: compilation into a machine.

Starting in this section, you are left alone to figure out which functions are missing and that you need to write yourself for the program to compile properly. For example you have to add sub,mul,power,negate yourself.

Stage 5:

In this section we continue the previous stage by adding control flow and relational operators. You're similarly left to write them yourself and figure it out.

The code is a bit fragile as it doesn't have statement separator other than surrounding them with braces. Now you can write stuff like:

Code:
a = 10
while (a>1) {
    print(a)
    a = a-1
}

Or on one line:
Code:
a = 10
while (a>1) { {print(a)} { a = a-1 } }

Stage 6:

In this section we add functions and procedures along with input/output. Though I couldn't really test the input/output, as it wasn't explained well and wasn't integrated in the final hoc code.

To make functions work in our yacc parse tree we have to rely on embedded actions. These actions are used to delimit the start of the body of a function and the end of it.

The code is tedious, and the authors again recommend relying on lex for lexical analysis instead of the gigantic yylex() function we ended up with.

There's a lot of stack play here to make functions possible but which I couldn't really test in the end as I was not able to write the lexer for this.

Finally, we get to compare the speed of our implementation against bas, bc, and C for fib and ackermann's function.
By replacing calls to push and pop in action.c with macros we can actually speed up our hoc. So function calls are pretty slow when writing a language.
phillbush
Long time nixers
I didn't like this sixth stage, it tries to add too much stuff at once.
It could well be broken down into at least three stages (one for implementing string manipulation, one preparing code.c for the next stage (in the same way stage 4 was a preparation for stage 5), and a final one implementing the stack frame and function definition).

Anyway, I'm doing the exercises and implementing stuff, my hoc(1) is in stage 5 and I'm implementing string manipulations rn.

I'm also using plan9port's hoc as reference, and just found some bugs in it, lol. It is always fun to use other people's code for learning and find something wrong in it while you learn.
movq
Long time nixers
Hmmm, I found an old repo of mine from 2010 where I implemented hoc. Well, hoc1, that is. And then I stopped. For some reason, this chapter never really sparked a lot of interest in me, apparently.

Only once did I ever have a need to use a tool like yacc: When writing a raytracer. I needed a simple language to describe the objects in the 3D scene. That might have been a great use of yacc/lex, but I didn’t know they existed back then and my program was written in Java anyway.

Maybe it’s because we have so many tools at hand these days. I don’t really feel a need to come up with a new language. And for data, we have JSON, XML, and this YAML abomination.
phillbush
Long time nixers
I forgot to bump this thread yesterday, sorry about that.

The 9th and last chapter is a very light chapter compared to the previous one.
It has five main topics: the troff(1) "low-level" commands, the ms(7) and man(7) macro packages, and the tbl(7) and eqn(7) preprocessors.

We write two documents: a manual for the hoc language using ms(7) and a manual for the hoc interpreter using man(7). The former is presented in its final form as an appendix.

However, man(7) may be considered deprecated in favor of mdoc(7), which is a semantic markup for manual pages. See this video on the topic. I have to confess that to this day I still use man(7). This is a bad practice that I take with me from the time I learned how to write manpages in Linux, as some GNU and Linux manuals are written in it; compared to BSD, in which mdoc(7) is the norm.

Quote:The man language was the standard formatting language for AT&T UNIX manual pages from 1979 to 1989. Do not use it to write new manual pages: it is a purely presentational language and lacks support for semantic markup. Use the mdoc(7) language, instead.
-- OpenBSD man(7)

Chapter 10 (the Epilog) summarizes the UNIX philosophy and history. It explains how a system free of market pressure or commercial interest became a success.

The chapter also cites the feature creep in modern UNIXes.
Quote:The UNIX system [...] with market dominance has come responsibility and the need for “features” provided by competing systems. As a result, the kernel has grown in size by a factor of 10 in the past decade, although it has certainly not improved by the same amount. This growth has been accompanied by a surfeit of ill-conceived programs that don't build on the existing environment. Creeping featurism encrusts commands with options that obscure the original intention of the programs.
Because source code is often not distributed with the system, models of good style are harder to come by.

... the UNIX philosophy
Quote:The principles on which UNIX is based -- simplicity of structure, the lack of disproportionate means, building on existing programs rather than recreating, programmability of the command interpreter, a tree-structured file system, and so on.
[...]
We said in the preface that there is a UNIX approach or philosophy, a style of how to approach a programming task.

... this approach is summarized:
  • First, let the machine do the work: use existing programs to mechanize tasks that you might do by hand on other systems.
  • Second, let other people do the work: use programs that already exist as building blocks in your programs, with the shell and the programmable filters to glue them together.
  • Third, do the job in stages: build the simplest thing that will be useful, and let your experience with that determine what (if anything) is worth doing next.
  • Fourth, build tools: write programs that mesh with the existing environment, enhancing it rather than merely adding to it;

...and a comment on the future
Quote:The UNIX system can't last forever, but systems that hope to supersede it will have to incorporate many of its fundamental ideas.

That's it.
It has been a good read. I learned a lot of stuff: yacc(1), lex(1), troff(1), ms(7), the UNIX history and principles, its tools and the way to glue them together, etc.
venam
Administrators
Chapter 9 - Document Preparation

This chapter focuses on document editing and formatting. It is a bit special because this was UNIX's first application: it was the word processor of the time.
In the chapter you get introduced to troff, which originated from roff, which was then adapted into nroff to support more features and be programmable.

troff is used as a formatting macro language, sort of like markdown today (or bbcode here on the forums), where you define big blocks such as headers, title, pagination, paragraphs, etc..
And it has several macro packages, such as mm (newer, System V) and ms (standard).

However, troff by itself doesn't support everything and you have to combine it with other tools such as tbl and eqn, for tables and math equations respectively.

troff syntax is made up of commands that start with a dot (.) followed by one or two letters or digits, and possibly parameters.

Example:

Code:
.PP
.ft B
Bold font

Or the command can be within the text itself when they begin with \, for example \fB to switch font to bold.

Then you can generate a neat document that looks really clean. I've used groff on my machine:

Code:
groff -ms hoc.ms -T pdf > hoc.pdf

As for tbl, it processes everything between these commands:

Code:
.TS
.TE

And eqn processes equations between:
Code:
.EQ
.EN

Then you can create a pipeline to process all this:

Code:
tbl hoc.ms | eqn -Tpdf | groff -ms -T pdf > hoc.pdf && xdg-open hoc.pdf

NB: In the code I've used `inf` instead of infinity.

Additionally you have other things you can add to the pipeline such as the refer command for bibliography.

Troff can be used to write manpages. The man command is a printer for man pages: it finds them in the /usr/man directory and uses nroff or troff with a macro package to print them.

Interestingly, the man command used to be a shell script. It's weird how many of the commands used to be shell scripts and now they are compiled binaries. It was much simpler to inspect what programs do at the time, explore to discover. I think that might be because of license-related issues.

They give us a typical layout for a man page, but basically there's nothing enforcing it other than goodwill. At least the command name and section should be there.

Bunch of useful commands mentioned:
  • refer(1) to look up references and cite authors
  • pic(1) to insert pictures/diagrams
  • spell(1) to check spelling errors
  • style(1)
  • diction(1)

Chapter 10 - Epilogue

The epilogue tries to answer the question of why Unix systems got popular. Is it because talented people created an environment that is good for development?
Then, as usual, growth leads to the insertion of ill-designed software, creeping featurism.

Maybe it's because of the programming style that is encouraged on unix-like systems, one that (like phillbush also summarizes):
1. Let machine do the work
2. Let others do the work
3. Do the job in stages
4. Build tools

Unix systems leave room for trial and error; they are not burdened like big operating systems.
movq
Long time nixers
Alright, here are my notes:

Quote:Users talk about the logical components of a document … instead of sizes, fonts, and positions.

And yet to this day, users don’t do that. Not even (most) manpages (on GNU). Instead, we have conventions like “bold text = type exactly as shown”.

Time to look into mandoc.

(Or is it time to abandon all this and write manpages in Markdown?)

Quote:9.1 … The paper takes this general form in ms:

Nice and clean. Easier and faster than LaTeX.

Quote:9.3 The tbl and eqn preprocessors

Alright, this is where the fun starts. Up until this point, everything was pretty basic and simple.

It’s pretty crazy how much “code” is produced even by a simple table. The eqn example is even longer. On the other hand, I have no idea what LaTeX does behind the scenes when you use math stuff. At least with roff, it’s easy to inspect the result.

And the results are good. This is one of my favorite chapters in the book. Always makes me want to use roff for everything. :) (I once used it for e-mails and my phlog.)

Quote:Epilogue

Lol, they complain about feature creep. I wonder what they think about the systems we have today.


---


(10-01-2021, 11:49 AM)phillbush Wrote: However, man(7) may be considered deprecated in favor of mdoc(7), which is a semantic markup for manual pages.

People didn’t really notice, it seems. I have a couple of mdoc pages on my system, but the vast majority of files is still “traditional” manpages. As are the manpages of my own projects …

It’s all dying anyway. If anything, projects are using Markdown and then convert it to a manpage using something like pandoc. Or there are no manpages at all. :(