The role of distributions &/or Unix flavors, where does pkg management stand - Psychology, Philosophy, and Licenses
I finally got the time to write something about this. I thought of recording a podcast but I found it easier to simply post the content here in text/blog form. So here we go, I hope you enjoy the research.
What is a distribution

What are software distributions? You may think you know everything there is to know about the term software distribution, but take a moment to think about it, take a step back and try to see the big picture.

We often have in mind the thousands of Linux distributions when we hear it. However, this is far from limited to Linux: BSD, the Berkeley Software Distribution, has software distribution right in the name. Android and iOS are software distributions too. Actually, it's so prevalent that we may have stopped paying attention to the concept, and we find it hard to put a definition together. There's definitely the part about distributing software in it, software that may be commercial or not, open source or not.

To understand it better, maybe investigating what problems software distributions address would clear things up. Let's imagine a world before software distributions. Does that world exist? A world where software stays within boundaries, not shared with anyone outside of them. Once we break these boundaries and want to share software, we'll find that we have to package all the software together in a meaningful way, configure the pieces so that they work well together, add some glue in between when necessary, find the appropriate medium to distribute the bundle, get it all from one end to another safely, make sure it installs properly, and follow up on it.

Thus, software distribution is about the mechanism and the community that take on the burden and the decisions of building a coherent assemblage of software that can be shipped. The operating system, or kernel if you like, could be, and often is, part of the collage offered, a piece of software just like the others.

The people behind it are called distribution maintainers, or package maintainers.
Their role varies widely: they could write the software that stores all the packages, called the repository, maintain a package manager with its format, maintain a full operating system installer, package and upload software they built or that someone else built on a specific time frame/life cycle, make sure no malicious code is uploaded to the repository, follow up on the latest security issues and bug reports, fix third party software to fit the distribution's philosophical choices and configurations, and most importantly test, plan, and make sure everything holds together. These maintainers are the source of trust of the distribution, they take responsibility for it. In fact, I think it's more accurate to call them distributors.

Different ways to approach it

There are so many distributions it can make your head spin. The software world is booming, especially the open source one. For instance, we can find bifurcations of distributions that get copied by new maintainers and diverge. This creates a tree-like aspect, a genealogy of common ancestors and/or influences in technical and philosophical choices. Overall, we now have a vibrant ecosystem where a thing learned on one branch can help a completely unrelated leaf on another tree. There's something for everyone.

Target and speciality

So what could be so different between all those software distributions? Why not have a single platform that everyone can build on? One thing is specialization and differentiation. Each distro caters to a different audience and is built by a community with its own philosophy. Let's go over some of them:
That all comes down to the raison d'être, the philosophy of the distribution, which guides every decision the maintainers have to make. It guides how they configure every piece of software, how they think about security, portability, comprehensiveness.

For example, if a distribution cares about free software, it's going to be strict about what software it includes and what licenses it allows in its repository, with software in the core to check the consistency of licenses. Another example: if the goal is to target a desktop audience, then internationalization, ease of use, user friendliness, and having a large number of packages are going to be prioritized. Then again, if the target is a real time embedded device, the kernel is going to be small, configured and optimized for this purpose, and the packages limited to the ones that work in this environment. Or if it's targeted at advanced users who love having control of their machine, the maintainers will choose to let the users make most of the decisions, providing as many packages as possible in the latest version possible, a loose way to install the distribution, and a lot of libraries and software development tools.

What this means is that a distribution does anything it can to provide sane defaults that fit its mindset. It composes and configures a layer of components, a stack of software.

The layering

Distribution maintainers often have different blocks at their disposal and the ability to choose among them, stacking them to create a unit we call a software distribution. There's a range of approaches to this: they could choose to have more, or less, included in what they consider the core of the distribution, as opposed to what is external and less important to it.
Moreover, sometimes they might even leave the core very small and loose, instead providing the glue software that makes it easy for the users to choose and swap the blocks at specific stages in time: installation, run time, maintenance mode, etc.

So what are those blocks of interdependent components?

The first part is the method of installation. This is what everything hinges on, the starting point.

The second part is the kernel, the real core of all operating systems today. But that doesn't mean that the distribution has to enforce it. Some distributions may go as far as to provide multiple kernels specialised in different things, or none at all.

The third part is the filesystem and file hierarchy, the component that manages where and how files are spread out on the physical or virtual hardware. This could be a mix and match where sections of the file system tree are stored on separate filesystems.

The fourth part is the init system, PID 1. This choice has generated a lot of contention these days. PID 1 is the mother process of all other processes on the system. What role it has and what functionalities it should include is a subject of debate.

The fifth part is composed of the shell utilities, what we sometimes refer to as the userland or user space, as it's the first layer the user can directly interface with to have control of the operating system, the place where processes run. The userland implementations on Unix-based systems usually try to follow the POSIX standard. There are many such implementations, also a subject of contention.

The sixth part is made up of services and their management: the daemons, long running processes that keep the system in order. Many argue over whether the management functionality should be part of the init system or not.

The seventh part is documentation. It is often forgotten but it is still very important.
The last part is about everything else, all the user interfaces and utilities a user can have and the ways to manage them on the system.

Stable releases vs Rolling

There exists a spectrum on which distributions place themselves when it comes to keeping up to date with the versions of the software they provide. This most often applies to external third party open source software.

The spectrum is the following: do we allow the users to always have the latest version of every piece of software while running the risk of accidentally breaking their system, what we call a bleeding edge or rolling distro, or do we take a more conservative approach and take the time to test everything properly before allowing it in the repository, at the cost of not having all the latest updates, features, and optimizations, what we call a release based distro?

The extreme of the first scenario would be to let users directly download from the software vendor/creator's source code repository or, the other way around, to let the software vendor/creator push directly to the distribution repository. This could easily break or conflict with the user's system or lead to security vulnerabilities. We'll come back to this later, as this could be avoided if the software runs in a containerized environment.

When it comes to release distributions, it usually involves having a long term support stable version that keeps receiving the necessary security updates and bug fixes in the long run, while another version runs a bit ahead, testing the future changes. At specific time frames, users can jump to the latest release of the distribution, which may involve a lot of changes in both configuration and software.

Some distributions decide they may want to break the ABI or API of the kernel upon major releases, which means that everything in the system needs to be rebuilt and reinstalled.

The release cycle and the rate of updates are really a spectrum.
When it comes to updates, in both cases, the distribution maintainers have to decide how to communicate and handle them: how to let the users know what changes, and whether a user configuration was swapped for a new one, merged with the new one, or copied aside. Communication is essential, be it through official channels, logging, mails, etc. It needs to be bi-directional: users report bugs, and maintainers post what their decisions are and whether users need to be involved in them. This creates the community around the distribution.

Rolling releases require intensive efforts from package maintainers as they constantly have to keep up with software developers, especially when it comes to the thousands of new libraries that are part of recent programming languages and whose number keeps on increasing.

Different users want different things out of a system. Enterprise environments and mission critical tasks will prefer stable releases, while software developers or normal end users may prefer to have the ability to use the latest software.

Interdistribution standard

With all this, can't there be an interdistribution standard that creates order, and would we want such a standard?

At the user level, the differences are not always noticeable; most of the time everything seems to work as Unix systems are expected to work. There's no real standard between distributions other than that they more or less follow the POSIX standards.

Within the Linux ecosystem, the Free Standards Group tries to improve interoperability of software by fixing a common Linux ABI, file system hierarchy, naming conventions, and more. But that's just the tip of the iceberg when it comes to having something that works across distributions.

Furthermore, each part of the layering we've seen before could be said to have its own standards: there are desktop interoperability standards, filesystem standards, networking standards, security standards, etc.
The biggest player right now when it comes to this is systemd, in association with the free desktop group: it tries to create (force) an interdistribution standard for Linux distributions.

But again, the big question: do we actually want such inter-distribution standards? Can't we be happy with the mix and match we currently have? Would we profit from such a thing?

The package manager and packaging

Let's now pay attention to the packages themselves: how we store them, how we give secure access to them, how we are able to search amongst them, download them, install them, remove them, and anything related to their local management, versioning, and configuration.

Method of distribution

How do we distribute software and share it? What's the front-end to this process?

First of all, where do we store this software?

Historically, and still today, software can be shared via physical media such as CD-ROMs, DVDs, USB drives, etc. It is common for proprietary vendors to have the distribution come with a piece of hardware they are selling, and it's also common for the procurement of the initial installation image.

However, with today's hectic software growth, using a physical medium isn't flexible. Sharing over the internet is more convenient, be it via FTP, HTTP, HTTPS, a publicly available svn or git repo, central website hubs such as GitHub, or application stores such as the ones Apple and Google provide.

A requirement is that the storage and the communication with it should be secure, resilient to failures, and accessible from anywhere. Thus, replication is often done to avoid failures but also to have a sort of edge network speeding effect across the world, load balancing. Replication could be done in multiple ways; it could be a P2P distributed system, for instance.

How we store it and in what format is up to the repository maintainers. Usually, this is a file system with a software API users can interact with over the wire.
Two main format strategies exist: source based repositories and binary repositories.

Second of all, who can upload and manage the host of packages? Who has the right to replicate the repository?

As the repository is a source of truth for the users, it is important to make sure the packages have been verified and secured before being accepted into it.

Many distributions have the maintainers be the only ones who are able to do this, giving them cryptographic keys to sign packages and validate them. Others have their own users build the packages and send them to a central hub for automatic or manual verification, after which they are uploaded to the repository, each user having their own cryptographic key for signature verification.

This comes down to an issue of trust and stability. Having the users upload packages isn't always feasible when using binary packages if the individual packages are not containerized properly.

There's a third option, the road in between, having the two types: the core managed by the official distribution maintainers and the rest by the user community.

Finally, the packages reach the user. How the user interacts with the repository locally and remotely depends on the package management choices. Do users cache a version of the remote repository, as is common with the BSD ports tree system? How flexible can it be when it comes to tracking updates, locking versions of software, allowing downgrades? Can users download from different sources? Can users have multiple versions of the same software on their machine?

Format

As we've said, there are two main philosophies of software sharing format: source code port-style and pre-built binary packages.

The software that manages those on the user side is called the package manager; it's the link with the repository. In source based repos I'm not sure we can call it that, but regardless I'll still refer to it as such. Many distributions create their own or reuse a popular one.
It does the search, download, install, update, and removal of local software. It's not a small task. The rule of thumb is that if something isn't installed by the package manager, then the package manager won't be aware of its existence. Note that distributions don't have to be limited to a single package manager, there could be many.

Each package manager relies on a specific format and metadata to be able to manage software, be it source or binary formatted. This format can be composed of a group of files or a single binary file with specific information segments that together create recipes that help throughout the package's lifecycle. Some are easier to put together than others, incidentally allowing more user contributions. Here's a list of common information that the package manager needs:
So what's the advantage of having pre-compiled binary packages instead of cloning the source code and compiling it ourselves? Wouldn't that remove a burden from package maintainers?

One advantage is that pre-compiled packages are convenient: it's easier to download them and run them instantly. It's also hard, if not impossible, and energy intensive these days to compile huge software such as web browsers. Another point is that proprietary software is often already distributed as binary packages, which would create a mix of source and binary packages.

Binary formats are also space efficient as the code is stored in a compressed archive format. For example: APK, Deb, Nix, ORB, PKG, RPM, Snap, pkg.tar.gz/xz, etc. Some package managers may also choose to leave the choice of compression up to the user and dynamically discern from their configuration file how to decompress packages.

Let's add that there exist tools, such as "Alien", that facilitate the job of package maintainers by converting from one binary package format to another.

Conflict resolution & Dependencies management

Resolving dependencies

One of the hardest jobs of the package manager is to resolve dependencies.

A package manager has to keep a list of all the packages currently installed on the system, with their versions and their dependencies. When the user wants to install a package, it has to take as input the list of dependencies of that package, compare it against what is already installed, and output a list of what needs to be installed, in an order that satisfies all dependencies.

This is a problem that is commonly encountered in the software development world with build automation utilities such as make. The tool creates a directed acyclic graph (DAG) and, using the power of graph theory and the acyclic dependencies principle (ADP), tries to find the right order. If no solution is found, or if there are conflicts or cycles in the graph, the action should be aborted.
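As a minimal sketch of the ordering step described above, the following uses Python's standard graphlib module to turn a dependency map into an install order and to abort when the graph has a cycle. The package names, the DEPENDS table, and the closure and install_order helpers are all hypothetical, made up for illustration. Real package managers also have to deal with versions and conflicts, which this ignores.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical repository metadata: package -> packages it depends on.
DEPENDS = {
    "browser": {"gui-toolkit", "tls-lib"},
    "gui-toolkit": {"libc"},
    "tls-lib": {"libc"},
    "libc": set(),
}


def closure(target, depends):
    """Collect the target and everything it transitively depends on."""
    seen, stack = set(), [target]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        if pkg not in depends:
            raise KeyError(f"unknown package: {pkg}")
        seen.add(pkg)
        stack.extend(depends[pkg])
    return seen


def install_order(target, depends, installed=frozenset()):
    """Return the packages to install for target, dependencies first.

    Raises CycleError if the dependency graph contains a cycle, in which
    case the caller should abort the whole operation.
    """
    needed = closure(target, depends)
    graph = {pkg: depends[pkg] for pkg in needed}
    order = TopologicalSorter(graph).static_order()
    # Skip what the (hypothetical) local database says is already installed.
    return [pkg for pkg in order if pkg not in installed]


if __name__ == "__main__":
    print(install_order("browser", DEPENDS, installed={"libc"}))
    try:
        install_order("a", {"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected, aborting")
```

The order between packages that don't depend on each other (here gui-toolkit and tls-lib) is not fixed. The only guarantee is that a package always comes after everything it depends on.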
The same applies in reverse, upon removal of a package. We have to make a decision: do we remove all the other packages that were installed as dependencies of that single one? What if newer packages depend on those dependencies, should we only allow the removal of the unused ones? This is a hard problem, indeed.

Versioning

This problem gets harder when we add the factor of versioning to the mix, if we allow multiple versions of the same software to be installed on the system. If we don't, but allow switching from one version to another, do we also switch all the other packages that depend on it?

Versioning applies everywhere, not only to packages but to release versions of the distribution too. A lot of distributions attach certain versions of packages to specific releases, and consequently releases may have different repositories.

The choice of naming conventions also plays a role; names should convey to users what the packages are about and whether any changes happened. Should the package maintainer follow the naming convention of the software developer or should they use their own? What if the names of two pieces of software conflict with one another? This makes it impossible to have both in the repo unless some extra information is added.

Do we rely on semantic versioning (major, minor, patch), or on names like so many distribution releases do (Toy Story, desserts, etc.), or on the date of release, or maybe simply on an incremental number?

All those convey meaning to the user when they search and update packages from the repository.

Static vs dynamic linking

One thing that may not apply to source based distros is the decision between building packages as statically linked to libraries or dynamically linked.

Dynamic linking is the process in which a program chooses not to include a library it depends upon in its executable but only a reference to it, which is then resolved at run-time by a dynamic linker that will load the shared object in memory upon usage.
By contrast, static linking means storing the libraries right inside the compiled executable program.

Dynamic linking is useful when many programs rely on the same library, as only a single instance of the library has to be in memory at a time. Executable sizes are also smaller, and when the library is updated all programs relying on it get the benefit (as long as the interfaces stay the same).

So what does this have to do with distributions and package management?

Package managers in a dynamic linking environment have to take care of the versions of the libraries that are installed and which packages depend on them. This can create issues if different packages rely on different versions.

For this reason, some distro communities have chosen to get rid of dynamic linking altogether and rely on static linking, at least for things that are not related to the core system.

Another incidental advantage of static linking is that the program doesn't have to resolve dependencies through the dynamic linker, which gives it a small boost in speed.

So static builds simplify the package management process. There doesn't need to be a complex DAG because everything is self-contained. Additionally, this makes it possible to have multiple versions of the same software installed alongside one another without conflicts. Updates and rollbacks are not messy with static linking.

This gives rise to more containerised software, and continuing on this path leads to market platforms such as Android and iOS, where distribution can be done by the individual software developers themselves, skipping the middle-man altogether and giving increasingly impatient users the ability to always have the latest version that works for their current OS. Everything is self-packaged.

However, this relies heavily on the trust of the repository/marketplace. There need to be many security mechanisms in place to keep rogue software from being uploaded.
We'll talk more about this when we come back to containers.

This is great for users and, from a certain perspective, for software developers too, as they can directly distribute pre-built packages, especially when there's a stable ABI for the base system.

All this breaks the classic distribution scheme we're accustomed to on the desktop.

Is it all roses and butterflies, though?

As we've said, packages take much more space with static linking, thus wasting resources (storage, memory, power).

Moreover, because it's a model where software developers push directly to users, this removes the filtering that distribution maintainers have over the distro and encourages license uncertainty. There's no more overall philosophy that surrounds the distribution.

There's also the issue of library updates: the weight is on the software developers to make sure they have no vulnerabilities or bugs in their code. This adds a veil over which software uses what; all we see is the end product.

From the perspective of a software developer using this type of distribution, this adds extra steps: downloading the source code of each library their software depends on and building each one individually, turning the system into a source based distro.

Reproducibility

Because package management has become increasingly messy over the past few years, a new trend has emerged to put back a sense of order in all this: reproducibility.

It has been inspired by the world of functional programming and the world of containers. Package managers that respect reproducibility have each of their builds asserted to always produce the same output. They allow packages of different versions to be installed alongside one another, each living in its own tree, and they allow normal users to install packages only they can access. Thus, different users can have different packages. They can be used as universal package managers, installed alongside any other package manager without conflict.
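To make the idea of each package living in its own tree a bit more concrete, here is a rough sketch of one way an install directory could be derived: hash everything that goes into a build, and use the digest in the directory name, so that two versions, or two differently built copies, never collide. The /store prefix, the store_path function, and the recipe fields are assumptions for illustration only and do not reproduce the exact scheme of any real package manager.

```python
import hashlib
import json

STORE = "/store"  # hypothetical prefix, not a real tool's default


def store_path(name, version, build_recipe, dependency_paths):
    """Derive a unique install directory from everything that affects the build.

    The paths of the dependencies are part of the hashed payload, so a change
    in any dependency also changes the path of whatever is built on top of it.
    """
    payload = json.dumps(
        {
            "name": name,
            "version": version,
            "recipe": build_recipe,
            "deps": sorted(dependency_paths),
        },
        sort_keys=True,
    ).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:32]
    return f"{STORE}/{digest}-{name}-{version}"


if __name__ == "__main__":
    libc_old = store_path("libc", "2.0", "make install", [])
    libc_new = store_path("libc", "2.1", "make install", [])
    # Same tool and version, built against two different libc trees.
    print(store_path("tool", "1.0", "make install", [libc_old]))
    print(store_path("tool", "1.0", "make install", [libc_new]))
```

Identical inputs always give the same path, while any change in version, recipe, or dependencies gives a new one, which is what lets several variants sit side by side.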
The most prominent examples are Nix and Guix, which use a purely functional deployment model where software is installed into unique directories generated through cryptographic hashes. The dependencies of each piece of software are included within each hash, solving the problem of dependency hell. This approach to package management promises to generate more reliable, reproducible, and portable packages.

Stateless and verifiable systems

The discussion about trust, portability, and reproducibility can also be applied to the whole system itself.

When we talked about repositories as marketplaces, where software developers push directly to them and the users have instant access to the latest version, we said it was mandatory to have additional security measures.

One of them is to containerise, to sandbox, every piece of software, having each run in its own space without affecting the rest of the system's resources. This removes the heavy burden of auditing and verifying each and every piece of software. Many solutions exist to achieve this sandboxing: docker, chroot, jails, firejail, selinux, cgroups, etc.

We could also set apart the home directories of the users, making them self-contained, never installing to or modifying the globally accessible places.

This could let us have the core of the system verifiable, as it is not changed, as it stays pristine. Making sure it's secure would be really easy.

The idea of having the user part of the distro atomic, movable, and containerized, and the rest reproducible, is game changing. But again, do we want to move to a world where every distro is interchangeable?

Do Distros matter with containers, virtualisation, and specific and universal package managers

It remains to be asked whether distributions still have a role today with all the containers, virtualisation, and specific and universal package managers.

When it comes to containers, distributions are still very important as they are most often the base of the stack the other components build upon.
The distribution is made up of people who work together to build and distribute the software and make sure it works fine. That isn't the role of the person managing the container, and it's much more convenient for them to rely on a distribution.

Another point is that containers hide vulnerabilities, they aren't checked after they are put together, while distribution maintainers, on the other hand, have as a role to communicate and follow up on security vulnerabilities and other bugs. Community is what solves daunting problems that everyone shares. A system administrator building containers can't possibly have the knowledge to manage and build hundreds of pieces of software and libraries and ensure they work well together.

If packages are self-contained

Do distributions matter if packages are self-contained?

To an extent they do, as in this ecosystem they could be the providers/distributors of such universal self-contained packages. And as we've said, it is important to keep the philosophy of the distro and offer a tested toolbox that fits the use case.

What's more probable is that we'll move to a world with multiple package managers, each trusted for its specific space and purpose, each with a different source of philosophical and technical truth.

Programming language package management specific

This phenomenon is already exploding in the world of programming language package management.

The speed and granularity at which software is built today is almost impossible to follow using the old method of packaging. The old software release life cycle has been thrown out the window. Thus language-specific tools were developed, not limited to installing libraries but also software. We can now refer to the package manager offered by the distribution as system-level and to the others as application-level or specific package managers.
Consequently, the complexity and the conflicts within a system have exploded, and distribution package managers are finding it pointless to manage and maintain anything that can already be installed via those tools. Conversely, the specific tool makers are also not interested in having what they provide included in distribution system-level package managers.

Package managers that respect reproducibility, such as Nix, which we've mentioned, handle such cases more cleanly as they respect the idea of locality, everything residing within a directory tree that isn't maintained by the system-level package manager.

Again, same conclusion here: we're stuck with multiple package managers that have different roles.

Going distro-less

A popular topic in the container world is "distro-less".

It's about replacing everything provided in a distribution, removing its customization, or building an image from scratch and maybe relying on universal package managers or none.

The advantage of such containers is that they are really small and targeted at a single purpose. This lets the sysadmin have full control of what happens on that box.

However, remember that there's a huge cost to controlling everything, just like we mentioned earlier. This moves the burden onto the sysadmin, who instead of the distribution maintainers becomes responsible for keeping up with bugs and security updates.

Conclusion

With everything we've presented about distributions, I hope we now have a clearer picture of what they provide and of their place in our current times.

What's your opinion on this topic? Do you like the diversity? Which stack would you use to build a distribution? What's your take on static builds, or on having users upload their own software to the repo? Do you have a solution to the trust issue? How do you see this evolving?