The role of distributions &/or Unix flavors, where does pkg management stands

The role of distributions &/or Unix flavors, where does pkg management stands - Psychology, Philosophy, and Licenses

Users browsing this thread:

	venam Offline \| 29-03-2020, 02:23 PM \| #11

I finally got the time to write something about this. I thought of recording a podcast but I found it easier to simply post the content here in text/blog form. So here we go, I hope you enjoy the research.

What is a distribution

What are software distributions? You may think you know everything there
is to know about the term software distribution, but take a moment to
think about it, take a step back and try to see the big picture.

We often have in mind the thousands of Linux distributions when we hear
it, however, this is far from limited to Linux, BSD, Berkeley Software
Distribution, has software distribution right in the name. Android,
and iOS are software distributions too.

Actually, it's so prevalent, we may have stopped paying attention to
the concept. We find it hard to put a definition together.
There's definitely the part about distributing software in it. Software
that may be commercial or not, open source or not.
To understand it better maybe investigating what problems software
distributions address would clear things up.

Let's imagine a world before software distributions, does that world
exist? A world where software stays within boundaries, not shared with anyone
outside of it.
Once we break these boundaries and we want to share it, we'll find that we
have to package all the software together in a meaningful way, configure
them so that they work well together, adding some glue in between when
necessary, find the appropriate medium to distribute the bundle, get
it all from one end to another safely, make sure it installs properly,
and follow up on it.

Thus, software distribution is about the mechanism and the community
that takes the burden and decisions to build an assemblage of coherent
software that can be shipped.

The operating system, or kernel if you like, could be, and is often,
part of the collage offered, a software just like others.

The people behind it are called distribution maintainers, or package
maintainers. Their role vary widely, they could write the software that
stores all the packages called the repository, maintain a package manager
with its format, maintain a full operating system installer, package and
upload software they built or that someone else built on a specific time
frame/life cycle, make sure there aren't any malicious code uploaded on
the repository, follow up on the latest security issues and bug reports,
fix third party software to fit the distribution philosophical choices
and configurations, and most importantly test, plan, and make sure
everything holds up together.
These maintainers are the source of trust of the distribution, they
take responsibility for it. In fact, I think it's more accurate to call
them distributors.

Different ways to approach it

There's so many distributions it can make your head spin. The software
world is booming, especially the open source one. For instance, we can
find bifurcations of distributions that get copied by new maintainers
and divert. This creates a tree like aspect, a genealogy of both common
ancestors and/or influences in technical and philosophical choices.
Overall, we now have a vibrant ecosystem where a thing learned on a
branch can help a completely unrelated leaf on another tree. There's
something for everyone.

Target and speciality

So what could be so different between all those software distributions,
why not have a single platform that everyone can build on.

One thing is specialization and differentiation. Each distro caters to
a different audience and is built by a community with its philosophy.

Let's go over some of them:

A distribution can support specific sets and combinations of hardware:
from CPU ISA to peripherals drivers
A distribution may be specifically optimized for a type of environment:
Be it desktop, portable mobile device, servers, warehouse size computers,
embedded devices, virtualised environment, etc..
A distribution can be commercially backed or not
A distribution can be designed for different levels of knowledge in a
domain, professional or not. For instance, security research, scientific
computing, music production, multimedia box, HUD in cars, mobile device
interface, etc..
A distribution might have been certified to follow certain standards
that need to be adhere to in professional settings, for example security
standards and hardening
A distribution may have a single purpose in a commodity machine,
specific machine functionalities such as firewall, a computer cluster,
a router, etc..

That all comes to the raison d'être, the philosophy of the distribution,
it guides every decision the maintainers have to make. It guides how they
configure every software, how they think about security, portability,
comprehensiveness.

For example, if a distribution cares about free software, it's going to
be strict about what software it includes and what licenses it allows
in its repository, having software to check the consistency of licenses
in the core.
Another example is if their goal is to target a desktop audience then
internationalization, ease of use, user friendliness, having a large
number of packages, is going to be prioritized. While, again, if the
target is a real time embedded device, the size of the kernel is going
to be small, configured and optimized for this purpose, and limiting and
choosing the appropriate packages that work in this environment. Or if
it's targeted at advanced users that love having control of their machine,
the maintainers will choose to let the users make most of the decisions,
providing as many packages as possible with the latest version possible,
with a loosely way to install the distribution, having a lot of libraries
and software development tools.

What this means is that a distribution does anything it can to provide
sane defaults that fit its mindset. It composes and configures a layer
of components, a stack of software.

The layering

Distribution maintainers often have at their disposition different blocks
and the ability to choose them, stacking them to create a unit we call a
software distribution. There's a range of approaches to this, they could
choose to have more, or less, included in what they consider the core of
the distribution and what is externally less important to it.
Moreover, sometimes they might even leave the core very small and loose,
instead providing the glue software that makes it easy for the users
to choose and swap the blocks at specific stages in time: installation,
run time, maintenance mode, etc..

So what are those blocks of interdependent components.

The first part is the method of installation, this is what everything
hinges on, the starting point.

The second part is the kernel, the real core of all operating systems
today. But that doesn't mean that the distribution has to enforce
it. Some distributions may go as far as to provide multiple kernels
specialised in different things or none at all.

The third part is the filesystem and file hierarchy, the component that
manages where and how files are spread out on the physical or virtual
hardware. This could be a mix and match where sections of the file system
tree are stored on separate filesystems.

The fourth part is the init system, PID 1. This choice has generated
a lot of contention these days. PID 1 being the mother process of all
other processes on the system. What role it has and what functionalities
it should include is a subject of debate.

The fifth part is composed of the shell utilities, what we sometimes
refer to as the userland or user space, as its the first layer the user
can directly interface with to have control of the operating system, the
place where processes run. The userland implementations on Unix-based
systems usually tries to follow the POSIX standard. There are many such
implementations, also subject of contention.

The sixth part is made up of services and their management. The daemons,
long running processes that keep the system in order. Many argue if the
management functionality should be part of the init system or not.

The seventh part is documentation. Often it is forgotten but it is still
very important.

The last part is about everything else, all the user interfaces and
utilities a user can have and ways to manage them on the system.

Stable releases vs Rolling

There exists a spectrum on which distributions place themselves when
it comes to keeping up to date with the versions of the software they
provide. This most often applies to external third party open source
software.
The spectrum is the following: Do we allow the users to always have the
latest version of every software while running the risk of accidentally
breaking their system, what we call bleeding edge or rolling distro, or
do we take a more conservative approach and take the time to test every
software properly before allowing it in the repository, while not having
all the latest updates, features, and optimizations of those software,
what we call release based distro.

The extreme of the first scenario would be to let users directly download
from the software vendor/creator source code repository, or the opposite,
let the software vendor/creator push directly to the distribution
repository. Which could easily break or conflict with the user's system
or lead to security vulnerability. We'll come back to this later, as
this could be avoided if the software runs in a containerized environment.

When it comes to release distributions, it usually involves having a long
term support stable version that keeps receiving and syncing with the
necessary security updates and bug fixes on the long run while having
another version running a bit ahead testing the future changes. On
specific time frames, users can jump to the latest release of the
distribution, which may involve a lot of changes in both configuration
and software.
Some distributions decide they may want to break ABI or API of the kernel
upon major releases, that means that everything in the system needs to
be rebuilt and reinstalled.

The release cycle, and the rate of updates is really a spectrum.

When it comes to updates, in both cases, the distribution maintainers
have to decide how to communicate and handle them. How to let the users
know what changes. If a user configuration was swapped for a new one or
merged with the new one, or copied aside.
Communication is essential, be it through official channels, logging,
mails, etc.. Communication needs to be bi-directional, users report bugs
and maintainers posts what their decisions are and if users need to be
involved in them. This creates the community around the distribution.

Rolling releases require intensive efforts from package maintainers as
they constantly have to keep up with software developers. Especially
when it comes to the thousands of newest libraries that are part of
recent programming languages and that keep on increasing.

Various users will want precise things out of a system. Enterprise
environments and mission critical tasks will prefer stable releases,
and software developers or normal end users may prefer to have the
ability to use the latest current software.

Interdistribution standard

With all this, can't there be an interdistribution standard that creates
order, and would we want such standard.

At the user level, the differences are not always noticeable, most
of the time everything seems to work as Unix systems are expected to
work.
There's no real standard between distributions other than that they are
more or less following the POSIX standards.

Within the Linux ecosystem, the Free Standards Group tries to improve
interoperability of software by fixing a common Linux ABI, file system
hierarchy, naming conventions, and more. But that's just the tip of the
iceberg when it comes to having something that works interdistributions.

Furthermore, each part of the layering we've seen before could be said
to have its own standards: There are desktop interoperability standards,
filesystem standards, networking standards, security standards, etc..

The biggest player right now when it comes to this is systemd in
association with the free desktop group, it tries to create (force)
an interdistribution standard for Linux distribution.

But again, the big Question: Do we actually want such inter-distribution
standards, can't we be happy with the mix and match we currently
have. Would we profit from such thing?

The package manager and packaging

Let's now pay attention to the package themselves, how we store them, how
we give secure access to them, how we are able to search amongst them,
download them, install them, remove them, and anything related to their
local management, versioning, and configuration.

Method of distribution

How do we distribute software, share them, what's the front-end to
this process.

First of all, where do we store this software.

Historically and still today, software can be shared via physical
medium such as CD-ROM, DVD, USBs, etc.. This is common when it comes
to proprietary vendors to have the distribution come with a piece of
hardware they are selling, it's also common for the procurement of the
initial installation image.
However, with today's hectic software growth, using a physical medium
isn't flexible. Sharing over the internet is more convenient, be it via
FTP, HTTP, HTTPS, a publicly available svn or git repo, via central
website hubs such as Github or appliation stores such the ones Apple
and Google provide.

A requirement is that the storage and the communication to it should be
secure, reliable against failures, and accessible from anywhere. Thus,
replication is often done to avoid failures but also to have a sort of
edge network speeding effect across the world, load balancing. Replication
could be done in multiple ways, it could be a P2P distributed system
for instance.

How we store it and in what format is up to the repository
maintainers. Usually, this is a file system with a software API users
can interact with over the wire. Two main format strategies exist:
source based repositories and binary repositories.

Second of all, who can upload and manage the host of packages. Who has
the right to replicate the repository.

As a source of truth for the users, it is important to make sure the
packages have been verified and secured before being accepted on the
repository.

Many distribution have the maintainers be the only ones that are able
to do this. Giving them cryptographic keys to sign packages and validate
them.

Others have their own users build the packages, send them to a central
hub for automatic or manual verification and then uploaded to the
repository. Each user having their own cryptographic key for signature
verification.

This comes down to an issue of trust and stability. Having the users
upload packages isn't always feasible when using binary packages if the
individual packages are not containerized properly.

There's a third option, the road in between, having the two types, the
core managed by the official distribution maintainers and the rest by
its user community.

Finally, the packages reach the user.

How the user interact with the repository locally and remotely depends on
the package management choices. Do users cache a version of the remote
repository, like is common with the BSD port tree system.
How flexible can it be to track updates, locking versions of software,
allowing downgrades. Can users download from different sources. Can
users have multiple version of the same software on the their machine.

Format

As we've said there are two main philosophy of software sharing format:
source code port-style and pre-built binary packages.

The software that manages those on the user side is called the package
manager, it's the link with the repository. Though, in source based repo
I'm not sure we can call them this way, but regardless I'll still refer
to them as such.
Many distributions create their own or reuse a popular one. It does the
search, download, install, update, and removal of local software. It's
not a small task.

The rule of the book is that if it isn't installed by the package manager
then it won't be aware of its existence. Noting that distributions don't
have to be limited to a single package manager, there could be many.

Each package manager relies on a specific format and metadata to be able
to manage software, be it source or binary formatted. This format can
be composed of a group of files or a single binary file with specific
information segments that together create recipes that help throughout
its lifecycle. Some are easier to put together than others, incidentally
allowing more user contributions.

Here's a list of common information that the package manager needs:

The package name
The version
The description
The dependencies on other packages, along with their versions
The directory layout that needs to be created for the package
Along with the configuration files that it needs and if they should
be overwritten or not
An integrity, or ECC, on all files, such as SHA256
Authenticity, to know that it comes from the trusted source, such as
cryptographic signatures checked against a trusted store on the user's
machine
If this is a group of package, meta package, or a direct one
The actions to take on certain events: pre-installation,
post-installation, pre-removal, and post removal
If there are specific configuration flags or parameter to pass to the
package manager upon installation

So what's the advantage of having pre-compiled binary packages instead
of cloning the source code and compiling ourselves. Won't that remove
a burden from package maintainers.

One advantage is that pre-compiled packages are convenient, it's easier to
download them and run them instantly. It's also hard, if not impossible,
these days, and energy intensive, to compile huge software such as
web browsers.
Another point, is that proprietary software are often already distributed
as binary packages, which would creates a mix of source and binary
packages.

Binary formats are also space efficient as the code is stored in a
compressed archived format. For example: APK, Deb, Nix, ORB, PKG, RPM,
Snap, pkg.tar.gz/xz, etc..
Some package managers may also choose to leave the choice of compression
up to the user and dynamically discern from its configuration file how
to decompress packages.

Let's add that there exists tools, such as "Alien", that facilitate the
job of package maintainers by converting from one binary package format
to another.

Conflict resolution & Dependencies management

Resolving dependencies

One of the hardest job of the package manager is to resolve dependencies.

A package manager has to keep a list of all the packages and their
versions that are currently installed on the system and their
dependencies.
When the user wants to install a package, it has to take as input the
list of dependencies of that package, compare it against the one it
already has and output a list of what needs to be installed in an order
that satisfies all dependencies.

This is a problem that is commonly encountered in the software development
world with build automation utilities such as make. The tool creates a
directed acyclic graph (DAG), and using the power of graph theory and the
acyclic dependencies principle (ADP) tries to find the right order. If
no solution is found, or if there are conflicts or cycles in the graph,
the action should be aborted.

The same applies in reverse, upon removal of the package. We have to make
a decision, do we remove all the other packages that were installed as
a dependency of that single one. What if newer packages depend on those
dependencies, should we only allow the removal of the unused dependencies.

This is a hard problem, indeed.

Versioning

This problem increases when we add the factor of versioning to the mix,
if we allow multiple versions of the same software to be installed on
the system.

If we don't, but allow switching from one version to another, do we also
switch all other packages that depend on it too.

Versioning applies everywhere, not only to packages but to release
versions of the distribution too. A lot of them attach certain version
of packages to specific releases, and consequentially releases may have
different repositories.

The choice of naming conventions also plays a role, it should convey to
users what they are about and if any changes happened.

Should the package maintainer follow the naming convention of the software
developer or should they use their own. What if the name of two software
conflict with one another, this makes it impossible to have it in the
repo, some extra information needs to be added.

Do we rely on semantic versioning, major, minor, patch, or do we rely
on names like so many distributions releases do (toy story, deserts,
etc..), or do we rely on the date it was released, or maybe simply an
incremental number.

All those convey meaning to the user when they search and update packages
from the repository.

Static vs dynamic linking

One thing that may not apply to source based distro, is the decision
between building packages as statically linked to libraries or dynamically
linked.

Dynamic linking is the process in which a program chooses not to include
a library it depends upon in its executable but only a reference to it,
which is then resolved at run-time by a dynamic linker that will load
the shared object in memory upon usage. On the opposite, static linking
means storing the libraries right inside the compiled executable program.

Dynamic linking is useful when many software rely on the same library,
thus only a single instance of the library has to be in memory at a time.
Executables sizes are also smaller, and when it is updated all programs
relying on it get the benefit (as long as the interfaces are the same).

So what does this have to do with distributions and package management.

Package managers in dynamic linking environment have to take care of
the versions of the libraries that are installed and which packages
depend on them. This can create issues if different packages rely on
different versions.

For this reason, some distro communities have chosen to get rid of dynamic
linking altogether and rely on static linking, at least for things that
are not related to the core system.

Another incidental advantage of static linking is that it doesn't have
to resolve dependencies with the dynamic linker, which makes it gain a
small boost in speed.

So static builds simplify the package management process. There
doesn't need to be a complex DAG because everything is self
contained. Additionally, this can allow to have multiple versions of the
same software installed alongside one another without conflicts. Updates
and rollbacks are not messy with static linking.

This gives rise to more containerised software, and continuing on this
path leads to market platforms such as Android and iOS where distribution
can be done by the individual software developers themselves, skipping the
middle-man altogether and giving the ability for increasingly impatient
users to always have the latest version that works for their current
OS. Everything is self-packaged.
However, this relies heavily on the trust of the
repository/marketplace. There needs to be many security mechanisms in
place to not allow rogue software to be uploaded. We'll talk more about
this when we come back to containers

This is great for users and, from a certain perspective, software
developers too as they can directly distribute pre-built packages,
especially when there's a stable ABI for the base system.

All this breaks the classic distribution scheme we're accustomed to on
the desktop.

Is it all roses and butterflies, though.

As we've said, packages take much more space with static linking, thus
wasting resources (storage, memory, power).
Moreover, because it's a model where software developers push directly
to users, this removes the filtering that distribution maintainers have
over the distro, and encourages licenses uncertainties. There's no more
overall philosophies that surrounds the distribution.
There's also the issue of library updates, the weight is on the software
developers to make sure they have no vulnerabilities or bugs in their
code. This adds a veil on which software uses what, all we see is the
end products.

From a software developer using this type of distribution perspective,
this adds extra steps to download the source code of each libraries their
software depends on, and build each one individually. Turning the system
into a source based distro.

Reproducibility

Because package management is increasingly becoming messier the past few
years, a new trend has emerged to put back a sense of order in all this,
reproducibility.

It has been inspired by the world of functional programming and the world
of containers. Package managers that respect reproducibility have each
of their builds asserted to always produce the same output.
They allow for packages of different versions to be installed alongside
one another, each living in its own tree, and it allows normal users
to install packages only them can access. Thus, many users can have
different packages.

They can be used as universal package managers, installed alongside any
other package managers without conflict.

The most prominent example is Nix and Guix, that use a purely functional
deployment model where software is installed into unique directories
generated through cryptographic hashes. Dependencies from each software
are included within each hash, solving the problem of dependency
hell. This approach to package management promises to generate more
reliable, reproducible, and portable packages.

Stateless and verifiable systems

The discussion about trust, portability, and reproducibility can also
be applied to the whole system itself.

When we talked about repositories as marketplaces, where software
developers push directly to it and the users have instant access to the
latest version, we said it was mandatory to have additional measures
for security.

One of them is to containerised, to sandbox every software. Having each
software run in their own space not affecting the rest of the system
resources. This removes the heavy burden of auditing and verifying each
and every software. Many solutions exist to achieve this sandboxing, from
docker, chroot, jails, firejail, selinux, cgroups, etc..

We could also distance the home directory of the users, making them
self-contained, never installing or modifying the globally accessible
places.

This could let us have the core of the system verifiable as it is not
changed, as it stays pristine. Making sure it's secure would be really
easy.

The idea of having the user part of the distro as atomic, movable,
containerized, and the rest reproducible is game changing. But again,
do we want to move to a world where every distro is interchangeable?

Do Distros matter with containers, virtualisation, and specific and universal package managers

It remains to be asked if distributions still have a role today with
all the containers, virtualisation, and specific and universal package
managers.

When it comes to containers, they are still very important as they most
often are the base of the stack the other components build upon.

The distribution is made up of people that work together to build and
distribute the software and make sure it works fine. It isn't the role
of the person managing the container and much more convenient for them
to rely on a distribution.

Another point, is that containers hide vulnerabilities, they aren't
checked after they are put together, while on the other hand, distribution
maintainers, have as a role to communicate and follow up on security
vulnerabilities and other bugs. Community is what solves daunting problems
that everyone shares.
A system administrator building containers can't possibly have the
knowledge to manage and builds hundreds of software and libraries and
ensure they work well together.

If packages are self-contained

Do distributions matter if packages are self-contained?

To an extent they do as they could be in this ecosystem the
providers/distributors of such universal self-contained packages. And
as we've said it is important to keep the philosophy of the distro and
offer a tested toolbox that fits the use case.

What's more probable is that we'll move to a world with multiple package
managers, each trusted for its specific space and purpose. Each with a
different source of philosophical and technical truth.

Programming language package management specific

This phenomena is already exploding in the world of programming language
package management.

The speed and granularity at which software is built today is almost
impossible to follow using the old method of packaging. The old software
release life cycle has been thrown out the window. Thus language-specific
tools were developed, not limited to installing libraries but also
software. We can now refer to the distribution offered package manager as
system-level and others as application-level or specific package managers.

Consequentially, the complexity and conflicts within a system has
exploded, and distribution package managers are finding it pointless
to manage and maintain anything that can already be installed via those
tools. Vice-versa, the specific tool makers are also not interested in
having what they provide included in distribution system-level package
managers.

Package managers that respect reproducibility, such as Nix, that we've
mentioned, handle such cases more cleanly as they respect the idea
of locality, everything residing withing a directory tree that isn't
maintained by the system-level package manager.

Again, same conclusion here, we're stuck with multiple package managers
that have different roles.

Going distro-less

A popular topic in the container world is "distro-less".

It's about replacing everything provided in a distribution, removing
it's customization, or building an image from scratch and maybe relying
on universal package managers or none.

The advantage of such containers is that they are really small and
targeted for a single purpose. This let the sysadmin have full control
of what happens on that box.

However, remember that there's a huge cost to controlling everything,
just like we mentioned earlier. This moves the burden upon the sysadmin
to manage and be responsible to keep up with bugs and security updates
instead of the distribution maintainers

Conclusion

With everything we've presented about distributions, I hope we now have
a clearer picture of what they are providing and their place in our
current times.

What's your opinion on this topic? Do you like the diversity? Which stack
would you use to build a distribution? What's your take on static builds,
having users upload their own software to the repo? Do you have a solution
to the trust issue? How do you see this evolve?