[PureOS] Next steps for Laniakea: The archive problem
Matthias Klumpp
matthias.klumpp at puri.sm
Mon Feb 1 11:47:43 PST 2021
Hi!
I would like to share part of the plan for future Laniakea
improvements. It will likely tie up a significant amount of my time,
but may also bring big improvements to the PureOS tooling.
What is Laniakea?
-------------------------
In order to build a Debian derivative, many tools are needed - besides
managing the archive, you also want to do QA, migrate packages between
suites, sync packages from Debian, show some web UI about packages,
etc.
If all tools are separate and don't know of each other, you have to
manage them separately, which is a lot of work, but also coordinate
them somehow and manage their configuration (e.g. if a suite
stabilizes, you need to adjust the new development target 5 times for
5 different tools).
Laniakea is a suite of tools which either provides features for
archive management on its own, or wraps existing tools and manages
their configuration. All tools share one database for configuration
which is the single source of truth in the system, and individual
Laniakea modules can also communicate directly via a ZeroMQ-based
messaging protocol.
The Problem
-----------------
Laniakea was originally created as part of the Tanglu Debian
derivative, and then adapted for PureOS. One of the original ideas of
Laniakea was to work with any archive management software, be it dak
(Debian Archive Kit), reprepro, aptly, ... and just wrap it like it
wraps any other module.
That is what its design was based on: Be very modular and make
each module easily replaceable in case something better comes along.
For a while this also allowed Laniakea to host two ways of getting
packages built; nowadays these have pretty much been collapsed
into a single job management system, which handles package builds
as well as image builds.
The problem with making the archive manager pluggable is that it goes
against Laniakea's principle of having a single source of truth in its
database. Currently, only dak is properly supported as an archive
manager for Laniakea. dak keeps its own package database, handles all
incoming packages and generates data which Laniakea then imports into
its own copy of the archive database. This is terrible for multiple
reasons:
* Laniakea and dak can - even if just briefly - desynchronize. In
practice this makes people wonder which version of a package is
currently in which suite, as the web UI does not reflect the actual
archive status.
* Due to potentially desynchronized states, Laniakea cannot trigger
certain actions, or can only trigger them in a fixed sequence of
operations when it *knows* that the archive data is currently the same
as the data in its database. This results in a lot of archive
operations being executed in strict sequences, and some changes will
not take effect until those have run (like package removals,
migrations, synchronizations, ...)
* Laniakea does not know of certain actions taken by dak: For
example, if dak rejects a package from upload, Laniakea can only guess
that this has happened as there is no direct communication channel
that would be race-free. This results in heuristics for the package
build system to figure out whether an upload actually reached its
destination suite.
* The constant interactions between Laniakea and dak are a huge
performance hit - Laniakea needs to constantly synchronize its
database with the dak data, which can occasionally take more than
20 minutes. That, combined with the sequences that Laniakea has to run
anyway destroys performance and the "immediateness" that archive
actions could have, as everything runs at timed intervals by cron
jobs.
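
To make the desynchronization problem concrete, here is a minimal
sketch of comparing two views of the archive. Everything in it is an
assumption made for illustration: the function name, the dict layout
keyed by (suite, package), and the sample versions are hypothetical
and are not taken from Laniakea or dak.

```python
def find_desync(archive_view, laniakea_view):
    """Compare two {(suite, package): version} mappings.

    Returns {(suite, package): (archive_version, laniakea_version)} for
    every entry where the two views disagree. A missing entry is
    reported as None on the side that lacks it.
    """
    mismatches = {}
    for key in archive_view.keys() | laniakea_view.keys():
        in_archive = archive_view.get(key)
        in_laniakea = laniakea_view.get(key)
        if in_archive != in_laniakea:
            mismatches[key] = (in_archive, in_laniakea)
    return mismatches


# Hypothetical sample data: what the archive manager holds vs. what
# was last imported into the second database.
archive_view = {
    ("landing", "hello"): "2.10-2",
    ("landing", "foo"): "1.0-1",
    ("landing", "same"): "3.1-1",
}
laniakea_view = {
    ("landing", "hello"): "2.10-1",  # stale version
    ("landing", "bar"): "0.3-1",     # already removed from the archive
    ("landing", "same"): "3.1-1",
}

for (suite, package), versions in sorted(find_desync(archive_view, laniakea_view).items()):
    print(suite, package, versions)
```

With a single database owned by one archive module, a check like this
would not be needed in the first place, since there would be only one
view.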
The Solution
-----------------
The solution seems simple: Just implement archive management in
Laniakea itself! I shied away from this for years, though. One reason
is that creating an APT archive is actually not an easy task, and a
lot of work went into dak for a lot of nice features that would all
have to be implemented again. Also, the Debian archive is constantly
evolving, and by detaching from dak we would not benefit from those
changes unless we put in the work on our side to implement them again.
Another reason why this is a bad idea is that it would tie up my time,
so other features which I consider important would get realized much
later.
There are also personal reasons for not attempting that rewrite: Dak
has some insane SQL to generate indices performantly directly from
database tables, and my not being an experienced database developer
would not help with achieving a better result.
However, despite all of these concerns I think I should attempt to
write the Laniakea archive module, with the aim of replacing dak
mid-term. The reasons are the following:
* The issues caused by interfacing with dak touch every area of
our archive, and it is honestly a pain to work with. Also, 90% of
feature requests brought forth by users tie in directly with this
issue and could be easily addressed if Laniakea could control the
archive directly, and were the only entity doing so (instead of
having some external thing to synchronize with).
* Ubuntu's Launchpad is a good example of what can be built if the
archive is integrated a bit more tightly with other distribution
management features.
* Launchpad also uses apt-ftparchive for some metadata read actions.
Performance and flexibility are a huge worry for me in an endeavour
like this, so I looked into using apt-ftparchive for a "Laniakea
archive" prototype. It's not a tool that makes sense to run directly
on the archive, but it is great to obtain exactly the metadata we need
to build archive index files. Those can be transformed into JSON and
stored in Postgres JSON table elements, so we can very easily extend
the extra metadata elements in future, if we need to. Also, reading a
large quantity of data elements from Postgres tables with SQLAlchemy's
ORM works very quickly, and converting and writing them with C++ code
is almost instant. Building on top of apt-ftparchive and APT's
libapt-pkg also ensures we benefit from improvements made on these
tools in Debian.
* We don't actually need to support all features initially. Some
features we would certainly lose initially are pdiffs, by-hash etc,
but a functioning APT archive, without any legacy support but also
without any of the very fancy new features, could actually be built in
a reasonable amount of time. Of course that would be annoying for some
users, but the PureOS audience is unlikely to mind these changes
much. And we could gradually add the features back later.
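
As a rough illustration of the "transform into JSON, store it, write
index files from it" idea, here is a minimal sketch using only the
Python standard library. It is not the planned implementation: the
function names are hypothetical, the sample stanza is hand-written
rather than real apt-ftparchive output, and the actual database
storage (a Postgres json/jsonb column) is left out. The parser only
handles simple "Field: value" lines plus indented continuation lines.

```python
import json


def parse_stanza(text):
    """Parse one control-file style stanza into a dict of field -> value."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Indented line: continuation of the previous field.
            fields[current] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


def render_stanza(fields):
    """Render a field dict back into a stanza for an index file."""
    lines = []
    for name, value in fields.items():
        first, *rest = value.split("\n")
        lines.append(f"{name}: {first}")
        lines.extend(f" {cont}" for cont in rest)
    return "\n".join(lines) + "\n"


# Hand-written sample, standing in for metadata extracted from a package.
SAMPLE = """\
Package: hello
Version: 2.10-2
Architecture: amd64
Depends: libc6 (>= 2.14)
Description: example package
 The classic greeting program.
"""

fields = parse_stanza(SAMPLE)

# This string is what would be stored in a JSON column. Extra metadata
# can be added as new keys later without changing any table schema.
as_json = json.dumps(fields)

# Reading it back and rendering it yields the index stanza again.
restored = json.loads(as_json)
print(render_stanza(restored), end="")
```

The point of the round trip is that the JSON document is the stored
form and the index file is just one rendering of it, so new fields
only need to be added in one place.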
So, given all that, I think it makes sense to spend some serious time
looking into implementing this module. I would build on top of apt-pkg
and apt-ftparchive, using C++ code where necessary for performance,
and Python code everywhere else. I would also make heavy use of
json/jsonb entries in Postgres for many of the metadata elements, as
it's a convenient, performant but most of all easily extensible way to
store most of the auxiliary metadata.
For actions such as dependency-resolution I would also use the
existing APT library. There are a few open questions on design details
that I haven't nailed down, but unlike a year ago, I think a project
like this is actually doable and will not take years to complete. But
as always with these things, what may look easily achievable now can
turn out to be much harder in future ;-)
So, what do you think? Am I crazy, or is this a good thing to finally attempt?
Cheers,
Matthias
P.S.: To address the obvious question "Why not adjust dak?" - I looked
into this, but the number of changes we would need is quite large, and
since there is no chance of merging these upstream, we would
essentially be forking dak and would still have a separate database and
separate-ish tooling. Also, dak serves its own purpose and serves it well for
Debian - trying to make it something that it isn't just to work for
Laniakea is not a great plan.