New Project: cl-tar

Tagged as blog, common-lisp

Written on 2021-09-23 15:10:00 UTC

I have just published the first release of a new project: cl-tar. This was supposed to be my summer side-project, but it ran long as they often do :).

The goal of this project is to provide a Common Lisp interface to tar archives. It has its foundations in Nathan Froyd's archive library, but has been significantly extended and improved.

cl-tar-file

There are actually two subprojects under the cl-tar umbrella. The first is cl-tar-file, which provides the ASDF system and package tar-file. This project provides low-level access to physical entries in tar files. As a consequence, two tar files that extract to the same set of files on your filesystem may have two very different sets of entries from tar-file's point of view, depending on the tar format used (PAX vs ustar vs GNU vs v7).
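To make the physical-entry point concrete, here is a rough sketch in Python rather than Common Lisp, using only the standard `tarfile` module. It is not cl-tar-file's API; the helper names `build_archive` and `physical_typeflags` are made up for illustration. It stores the same long-named file in GNU and PAX format and then walks the raw 512-byte headers to list the type flag of each physical entry.

```python
import io
import tarfile

BLOCK = 512  # tar headers and payloads are laid out in 512-byte blocks


def build_archive(name, data, fmt):
    """Return the bytes of an in-memory tar archive holding one regular file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def physical_typeflags(raw):
    """Walk the raw headers and return the type flag of every physical entry."""
    flags = []
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block marks the end of the archive
            break
        flags.append(header[156:157].decode("ascii"))  # type flag is at byte 156
        size_field = header[124:136].rstrip(b"\0 ")    # size is octal text at bytes 124-135
        size = int(size_field.decode("ascii") or "0", 8)
        # Skip this header plus its payload, rounded up to whole blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return flags


# A name longer than the 100 bytes available in the classic header.
long_name = "d/" + "x" * 150 + ".txt"

gnu_flags = physical_typeflags(build_archive(long_name, b"hello", tarfile.GNU_FORMAT))
pax_flags = physical_typeflags(build_archive(long_name, b"hello", tarfile.PAX_FORMAT))

print(gnu_flags)  # GNU: a long-name entry, then the file itself
print(pax_flags)  # PAX: an extended-header entry, then the file itself
```

Both archives extract to the same single file, but the GNU one carries an extra `L` entry and the PAX one an extra `x` entry. Those extra entries are the kind of format-specific detail a low-level reader gets to see.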

The cl-tar-file project is technically a fork of archive, but it has diverged quite a bit. All non-portable bits (such as the code to create symlinks) and cpio support have been removed. Support for the various archive variants has been improved, as has blocking support (tar readers/writers are supposed to read/write in some multiple of 512 bytes). A test suite has been added, along with other miscellaneous fixes and improvements.
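As a small illustration of what blocking means, the sketch below (again Python's standard `tarfile`, not cl-tar-file; `padded_size` is a hypothetical helper) rounds payload sizes up to whole 512-byte blocks and checks that a freshly written archive is a whole number of records. Python's writer happens to pad to `tarfile.RECORDSIZE`, which is itself a multiple of 512; other writers may use a different blocking factor.

```python
import io
import tarfile

BLOCK_SIZE = 512


def padded_size(n):
    """Round a payload size up to the next multiple of the 512-byte block size."""
    return (n + BLOCK_SIZE - 1) // BLOCK_SIZE * BLOCK_SIZE


# Write a tiny archive into memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    payload = b"hello"
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

# A 5-byte file still occupies one full block of payload.
print(padded_size(5))
# The finished archive is padded out to a whole number of records.
print(len(archive) % tarfile.RECORDSIZE)
```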

cl-tar

The second subproject is cl-tar itself, which provides three ASDF systems and packages: tar, tar-simple-extract, and tar-extract.

The tar system provides a thin wrapper over the tar-file system that operates on logical entries in tar files. That is, a regular file is represented as a single entry, no matter how many physical entries it is composed of in the actual bits that get written to the tar file. This system is useful for analyzing a tar file or for creating one from data that does not come directly from the file system.
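For comparison, this is roughly what a logical view looks like in another tar library. The sketch uses Python's standard `tarfile` module as a stand-in (it is not the tar system's API, and the file name is invented): the archive is built from in-memory data, the long name forces the GNU writer to emit more than one physical entry, and the reader still reports a single member.

```python
import io
import tarfile

# A name too long for the classic 100-byte name field.
long_name = "docs/" + "chapter-" * 20 + "notes.txt"
payload = b"generated in memory, never touched the file system"

# Create the archive from in-memory data only.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
    info = tarfile.TarInfo(long_name)
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

# Read it back: the reader folds the helper entry into one logical member.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tf:
    members = tf.getmembers()
    names = [m.name for m in members]
    content = tf.extractfile(members[0]).read()

print(len(members), names[0] == long_name)
```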

The tar-simple-extract system provides a completely portable interface to extract a tar archive to your file system. The downside of portability is that there is information loss. For example, file owners, permissions, and modification times cannot be set. Additionally, symbolic links cannot be extracted as symbolic links (but they can be dereferenced).

The tar-extract system preserves much more of that information when extracting. The downside is that it is more demanding (osicat must support your implementation and OS) and it raises security concerns.

A common security concern is that a malicious tar file can extract a symlink that points to an arbitrary location in your filesystem and then trick you into overwriting files at the location by extracting later files through that symlink. This system tries its best to mitigate that (but makes no guarantees), so long as you use its default settings. If you find a bug that allows an archive to extract to an arbitrary location in your filesystem, I'd appreciate it if you report it!
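The attack described above can be sketched in a few lines. This is not how tar-extract implements its mitigation (the post does not say how it does); it is a deliberately simplified, purely lexical check written in Python, with hypothetical helpers `resolve`, `escapes`, and `unsafe_members`. It tracks the symlinks an archive would create and flags any later member whose path, followed through those symlinks, would land outside the extraction root. It ignores hard links, symlinks already present on disk, and races, so treat it as a demonstration of the problem rather than a defense.

```python
import io
import posixpath
import tarfile


def resolve(path, links, follow_last):
    """Lexically resolve a relative path through symlinks recorded in `links`."""
    parts = [p for p in path.split("/") if p not in ("", ".")]
    resolved = "."
    for i, part in enumerate(parts):
        resolved = posixpath.normpath(posixpath.join(resolved, part))
        is_last = i == len(parts) - 1
        if resolved in links and (follow_last or not is_last):
            resolved = links[resolved]
    return resolved


def escapes(path):
    """True if a resolved path points outside the extraction root."""
    return posixpath.isabs(path) or path == ".." or path.startswith("../")


def unsafe_members(tf):
    """Return names of members that would be written outside the root."""
    links = {}  # symlink location inside the root -> resolved target
    bad = []
    for member in tf.getmembers():
        if posixpath.isabs(member.name):
            bad.append(member.name)
            continue
        # Creating a symlink replaces the final component instead of following it.
        where = resolve(member.name, links, follow_last=not member.issym())
        if escapes(where):
            bad.append(member.name)
            continue
        if member.issym():
            if posixpath.isabs(member.linkname):
                links[where] = posixpath.normpath(member.linkname)
            else:
                target = posixpath.join(posixpath.dirname(where), member.linkname)
                links[where] = resolve(target, links, follow_last=True)
    return bad


def add_file(tf, name, data):
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))


def add_symlink(tf, name, target):
    info = tarfile.TarInfo(name)
    info.type = tarfile.SYMTYPE
    info.linkname = target
    tf.addfile(info)


# Build a small malicious archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    add_file(tf, "ok/readme.txt", b"harmless")
    add_symlink(tf, "evil", "../outside")         # the symlink itself looks innocent
    add_file(tf, "evil/payload.txt", b"gotcha")   # but this write goes through it

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tf:
    flagged = unsafe_members(tf)

print(flagged)
```

Note that the symlink entry on its own is not flagged; it is the later file written through it that escapes the root. That is why an extractor has to remember what it has already created instead of validating each entry in isolation.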

Also note that tar-extract currently requires a copy of osicat that has the commits associated with this PR applied.

next steps

First, close the loop on the osicat PR. It started off as a straightforward PR that just added new functions. However, when I tested on Windows, I realized I couldn't load osicat, so I added a commit that fixed that. There may be some feedback and changes requested on how I actually accomplished that.

Second, integrate tar-extract into CLPM. CLPM currently shells out to a tar executable to extract archives. I'd like to use this pure CL solution instead. Plus, using it with CLPM will act as a stress test by exposing it to many tar files.

Third, add it to Quicklisp. tar-extract won't compile without the osicat changes, so those definitely need to be merged first. Additionally, I want to have at least some experience with real world tar files before making this project widely available.

Fourth, add support for creating archives from the filesystem.

Fifth, add the ability to compile to an executable so you could use this in place of GNU or BSD tar :).

If the fourth and fifth steps excite you, I'd love to have your help making them a reality! They're not on my critical path for anything at the moment, so it'll likely be a while before I can get to them.
