Elfshaker: Version control system fine-tuned for binaries

2021-11-19 12:41 · github.com

elfshaker stores binary objects efficiently.

elfshaker is a low-footprint, high-performance version control system fine-tuned for binaries.

  • elfshaker is a CLI tool written in the Rust programming language.

  • It stores snapshots of directories into highly-compressed pack files and provides fast on-demand access to the stored files. It is particularly good for storing lots of similar files, for example object files from an incremental build.

  • It allows few-second access to any commit of clang with the manyclangs project. For example, this accelerates bisection of LLVM by a factor of 60! This is done by extracting builds of LLVM on-demand from locally stored elfshaker packs, each of which contains ~1,800 builds and is about 100 MiB in size, even though the full originals would take TiBs to store! Extracting a single build takes 2-4 seconds on modern hardware.
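
A rough sketch of the basic workflow (the snapshot and pack names here are made up, and the exact invocation may differ; see the Usage guide for details):

```sh
elfshaker store my-snapshot     # capture the current directory as a named snapshot
elfshaker pack my-pack          # compress accumulated snapshots into a pack file
elfshaker extract my-snapshot   # restore a stored snapshot on demand
```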

Getting started

See our Installation guide for instructions.

System Compatibility

We run CI tests on a number of platforms and officially support a specific set of architectures; see the repository for the current lists. We aim to support all popular Linux platforms, macOS and Windows in production.

Current Status

The file format and directory structure are stable. We intend that pack files created with the current elfshaker version will remain compatible with future versions. Please kick the tyres and do your own validation, and file bugs if you find any. We have done a fair amount of validation for our use cases, but there may be things we haven't needed yet, so please start a discussion and file issues/patches.

Documentation

See our Usage guide for instructions.

Contributing

Contributions are highly appreciated. Refer to our Contributing guide.

Contact

The best way to reach us is to join the elfshaker/community room on Gitter. The original authors of elfshaker are Peter Waller (@peterwaller-arm) <peter.waller@arm.com> and Veselin Karaganev (@veselink1) <veselin.karaganev@arm.com>, and you may also contact us via email.

Security

Refer to our Security policy.

License

elfshaker is licensed under the Apache License 2.0.





Comments

  • By nh2 2021-11-19 14:24 · 2 replies

    I experimented with something similar with a Linux distribution's package binary cache.

    Using `bup` (deduplicating backup tool using git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
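
    (Roughly, such an experiment amounts to repeated bup saves of each build tree into one repository; the directory name below is illustrative.)

    ```sh
    bup init                              # create the bup repository
    bup index chromium-build/             # scan the build tree
    bup save -n chromium chromium-build/  # save a deduplicated snapshot; only new chunks are stored
    ```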

    Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.

    Details: https://github.com/NixOS/nixpkgs/issues/89380

    • By peterwaller-arm 2021-11-19 14:32 · 3 replies

      Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. I think this could in principle be fixed in bup one day, perhaps.

      • By nh2 2021-11-19 14:42 · 1 reply

        I have also used bup for a long time, but found that for very large server backups I'm hitting performance problems (both in time and memory usage).

        I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but is younger and thus lacks some features.

        Here is somebody's benchmark of bupstash (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/

        The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.

        I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that work with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.

        • By pdimitar 2021-11-19 22:52 · 1 reply

          I used `bupstash` and evaluated it for a while. I am looking to do 5+ offsite backups of a small personal directory to services that offer 5GB of cloud space for free.

          `bupstash` lacked good compression. I settled with `borg` because I could use `zstd` compression with it. Currently at 60 snapshots of the directory and the `borg` repo directory is at ~1.52GB out of 5GB quota. The source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.
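
          (For reference, a setup like the one described here boils down to something along these lines; the repository path, archive name, and compression level are illustrative.)

          ```sh
          borg init --encryption=repokey /path/to/backup-repo
          borg create --stats --compression zstd,19 \
              /path/to/backup-repo::snapshot-{now} ~/my-directory
          ```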

          I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression compared to what it can do now. It's a shame, I really wanted to use it.

          I do recognize that for many other scenarios `bupstash` is very solid though.

          • By lathiat 2021-11-20 0:42

            Borg has been working great for me with zstd.

      • By ybkshaw 2021-11-19 17:05

        Thank you for having such a good description of the project! Sometimes the links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on, but not yours.

      • By Siira 2021-11-19 17:04

        Is elfshaker any good for backing up non-text data?

    • By iTokio 2021-11-20 6:43

  • By mhx77 2021-11-19 16:59 · 2 replies

    Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs

    I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.

    • By peterwaller-arm 2021-11-19 17:05

      Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.

    • By pdimitar 2021-11-19 22:55

      That's super impressive, I will definitely give it a go. Thanks for sharing!

  • By thristian 2021-11-19 13:33 · 1 reply

    This seems very much like the Git repository format, with loose objects being collected into compressed pack files - except I think Git has smarter heuristics about which files are likely to compress well together. It would be interesting to see a comparison between this tool and Git used to store the same collection of similar files.

    • By peterwaller-arm 2021-11-19 13:41 · 2 replies

      An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.

      We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.

      On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
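
      (A minimal version of that comparison might look like the sketch below; it assumes snapshots can be extracted into the working tree one at a time and that `elfshaker list` prints one snapshot name per line, neither of which is confirmed above.)

      ```sh
      git init
      for snap in $(elfshaker list); do   # assumed output: one snapshot name per line
          elfshaker extract "$snap"       # restore this snapshot into the working tree
          git add -A
          git commit -qm "$snap"
      done
      git gc --aggressive                 # let git repack with its own delta heuristics
      du -sh .git/objects/pack            # compare against the elfshaker pack size
      ```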

      • By johnyzee 2021-11-19 14:03 · 2 replies

        As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, and it is suboptimal.

        • By peterwaller-arm 2021-11-19 14:13 · 2 replies

          Performing poorly with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion/deletion is going to result in a very different compressed representation for which there is no good way to have an efficient delta. elfshaker will suffer the same problem here.

          • By derefr 2021-11-19 16:07 · 1 reply

            One could, in theory, write a git-clean filter (like the one used for git-lfs), that teaches git various heuristic approaches to "take apart" well-known binary container formats into trees of binary object leaf-nodes.

            Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree — sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.

            On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.

            Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])

            I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
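
            (For context, the clean/smudge plumbing this idea builds on, which is the same mechanism git-lfs uses, is wired up roughly as follows; `psd-explode` and `psd-implode` are hypothetical commands standing in for the "take apart" and "reassemble" steps.)

            ```sh
            # Register a hypothetical filter pair for PSD files.
            git config filter.psdtree.clean "psd-explode"    # runs at commit: original file -> canonical form
            git config filter.psdtree.smudge "psd-implode"   # runs at checkout: canonical form -> original file

            # Apply the filter to matching paths.
            echo '*.psd filter=psdtree' >> .gitattributes
            ```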

            • By josephg 2021-11-20 3:40 · 2 replies

              As I understand it, this is essentially what the Google Chrome updater does. It disassembles the binary and recalculates jump labels. Then it generates a diff based on the assembly code. When it applies that diff on people’s computers, your computer again pulls the Chrome binary apart and reconstructs it. The code for this is complex, but it’s all open source.

              I remember reading about this technique years ago, thinking “cool, when this catches on, software updates in all my software will be tiny”. But no, for some reason macOS updates are still gigabytes in size. I have no idea why.

              • By hcs 2021-11-20 4:29

                I think you're referring to Courgette, which does the disassemble + patch + reassemble thing. Fwiw, Chrome now uses a simpler but competitive format called Zucchini, see my recent comment: https://news.ycombinator.com/item?id=29028534

              • By derefr 2021-11-20 17:36

                It does!

                The major distinction (besides this being generalized to pluggable binary container formats) is that what Courgette does happens in the context of delta comparison between two known binaries, where the goal is to create a minimal delta patch that can move you from one known binary to another known binary (given a sufficiently-smart updater, which actually "contains" a lot of the information in a Kolmogorov-complexity sense.)

                There are efficiencies you can gain in encoding, that only work in a one-to-one, known-to-known comparison context. (Think "interstitial frames" in video; or RLE'd 1-bit delta-sigma audio encoding.) In these contexts, you need known1 on-hand already, plus a patch specific to the pair of {known1, known2}, to be able to recover known2. (This is why, in streaming video where non-delta-compressed "keyframes" are rare, you can't seek to arbitrary locations in the stream without the decoded video turning into a garbled mess for a while: you're receiving patches, but you don't have the [correct, clean] known1s to apply them to.)

                But these efficiencies don't apply in the context of a system like git, where you might be checking out any arbitrary version of the data at any time, with any [or no!] other arbitrary version(s) already cached or checked out. To enable people to efficiently grab the delta they need to do the particular, arbitrary version-to-version jumps they need to do, you'd either need to generate the power-set of all updates; or at least enough entries to populate a skip-list (as Microsoft apparently does for Windows updates! [1]).

                The powerset of updates is impractical to generate+store, but gets clients the ability to perform arbitrary jumps in O(1) time. With a skip-list of updates, on the other hand, either the server or the client needs to compute its way through all the skips, to combine them into one final update — so each jump would take O(log N) "patch application" computation steps to resolve. (This is why it takes the Microsoft servers a while to figure out which Windows updates your computer needs!) Neither approach really makes sense for a system like Git, where Git repos are entirely mirrored to every client (so the clients would need to store all those update patches), and where git checkouts are something you're supposed to be able to do often and quickly, in the middle of your daily workflow.

                Meanwhile, the sort of approach I outlined would be optimized instead for storing currently known binaries, in a way most amenable to inter-compression against future, unknown versions of the binary, without the need to store any explicit known-to-known patches, or to compute the composition of such patches.

                Rather than seeking to compress together existing available things as well as possible, a known-to-unknown compression approach seeks to segment and normalize the data in such a way that future revisions of the same data, put through the same process, would generate lots of content-identical duplicate component objects, which would end up being dedup'ed away when stored in a Content-Addressable Store.

                It's the difference between "meeting in the middle" and canonicalization. Levenshtein distance vs. document fingerprinting. RAID5 vs. a fountain-codec-encoded dataset. Or, to be very literal, it's streaming-window Huffman coding, vs. merkle-trie encoding.

                This "normalize, to later deduplicate" known-to-unknown compression approach, is the approach that Git itself is built on — but it only works well when the files Git works with are decomposed into mostly-independent leaf-node chunks. My proposed thought here, is just that it wouldn't be impossible to translate other types of files, into "mostly-independent leaf-node chunks" at commit time, such that Git's approach would then be applicable to them.

                [1] https://devblogs.microsoft.com/oldnewthing/20200212-00/?p=10...

          • By JoshTriplett 2021-11-19 14:57 · 1 reply

            Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?

            • By peterwaller-arm 2021-11-19 15:12 · 1 reply

              We've just added an applicability section, which explains a bit more about what we do. We don't have any ELF-specific heuristics [0].

              https://github.com/elfshaker/elfshaker#applicability

              In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit identical object files, even though the build system often wants to rebuild them because some input has changed.

              elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40 KiB after compression.

              [0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
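
              (A rough illustration of the two steps described above, compiling with section flags and then packing size-sorted objects; this is not the actual elfshaker implementation, the file names are made up, and deduplication of identical objects is omitted.)

              ```sh
              # Build with per-function/per-data sections so insertions and deletions
              # only perturb section-relative addresses:
              clang -c -ffunction-sections -fdata-sections foo.c -o foo.o

              # Pack heuristic, roughly: order the objects by size, concatenate,
              # and compress the whole stream with zstandard.
              cat $(ls -Sr objects/*.o) | zstd -19 -o pack.zst
              ```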

              • By JoshTriplett 2021-11-19 18:13

                Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.

        • By ChrisMarshallNY 2021-11-19 14:23 · 1 reply

          > we store assets (like big PSDs that change often) outside of version control and it is suboptimal.

          Perforce is still used by game developers and other creatives because it handles large binaries quite well.

          In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.

          • By mdaniel 2021-11-19 16:14

            It was my recollection, and I confirmed it, that they've almost always had a "the first hit is free" model for small teams, and they also explicitly call out indie game studios as getting free stuff too: https://www.perforce.com/how-buy

      • By 3np 2021-11-19 14:47

        Do you think it would be feasible to do a git-lfs replacement based on elfshaker?

        Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)

HackerNews