DROID Signature File Minimizer

Filter a DROID signature file based on a list of PUIDs

This project has been discontinued

Its source code is still on GitHub. Feel free to fork it, or let me know if you see a real-world use case (beyond the fun research project it was) that justifies actively maintaining it. There’s a blog post that explains the ideas behind this tool.

Use droidsfmin to analyze and filter the contents of a DROID signature file. Minimize it based on a list of PUIDs, remove all format entries you don’t need and speed up the file format identification process.

You should not use droidsfmin if you need exact identification for all file formats: removing format entries from the signature file makes identification less precise (see also the note on container signatures below). It’s the usual tradeoff between speed and accuracy. And you simply don’t need droidsfmin if you don’t care how fast DROID runs when identifying files.

That said, droidsfmin can be useful if you want DROID to run faster, you need exact identification for only some file formats (like those accepted by your archival policy or those that are of interest in your current project), and it’s okay to label all others as “Unknown”.


A little droidsfmin tutorial

Getting started

Use the --help option to show the available options:

$ droidsfmin --help

The DROID Signature File Minimizer - filter a signature file based on
a list of PUIDs and keep only entries for those file formats that you
really need.

Usage: droidsfmin [options] [signature-file]

-h       --help                  show help message
-p PUID  --puid=PUID             include file format with this PUID in the
                                 output
-P FILE  --puids-from-file=FILE  like -p, but read list of PUIDs from file
                                 (one PUID per line)
-l       --list                  return a list of PUIDs instead of XML
-S       --include-supertypes    include file formats that are supertypes
                                 of the selected formats
-s       --include-subtypes      include file formats that are subtypes of
                                 the selected formats
-o FILE  --output=FILE           output file

List all file formats that occur in a signature file, giving you an overview of its content:

$ droidsfmin --list signatures.xml

fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation   95
fmt/126 Microsoft Powerpoint Presentation   97-2003
fmt/124 Encapsulated PostScript File Format 3
...

Create a new signature file that contains entries only for the formats x-fmt/111 (plain text) and fmt/95 (PDF/A-1a):

$ droidsfmin --puid x-fmt/111 --puid fmt/95 signatures.xml -o new.xml

Adding some command line power

If you don’t specify input/output files then droidsfmin will read from STDIN and write to STDOUT, just like any decent command line tool would. This allows some nice signature file analysis and creation techniques.
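Because droidsfmin talks to STDIN and STDOUT, it can also be driven from a script. The following is a minimal Python sketch, not part of droidsfmin itself: the helper names build_command and minimize are made up for illustration, and it assumes that a droidsfmin executable is on your PATH. Only the --list and --puid options from the help text above are used.

```python
import subprocess


def build_command(puids, list_only=False):
    """Build a droidsfmin argument list (hypothetical helper).

    No signature file and no --output are passed, so droidsfmin reads
    the signature file from STDIN and writes its result to STDOUT.
    """
    cmd = ["droidsfmin"]
    if list_only:
        cmd.append("--list")
    for puid in puids:
        cmd += ["--puid", puid]
    return cmd


def minimize(signature_xml, puids):
    """Pipe signature file bytes through droidsfmin and return its output.

    Assumes a droidsfmin executable can be found on PATH.
    """
    result = subprocess.run(
        build_command(puids),
        input=signature_xml,   # bytes, passed through unchanged
        capture_output=True,
        check=True,            # raise if droidsfmin exits with an error
    )
    return result.stdout
```

Calling minimize(open("signatures.xml", "rb").read(), ["x-fmt/111", "fmt/95"]) should then return the same XML that the --puid example above writes to new.xml.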

Linux/Bash

First, list all PUIDs in a signature file to get an overview:

$ droidsfmin -l signatures.xml

fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation   95
fmt/126 Microsoft Powerpoint Presentation   97-2003
fmt/124 Encapsulated PostScript File Format 3
...

Use piping and grep to find interesting PUIDs in this list:

$ droidsfmin -l signatures.xml | grep 'PDF/A'

fmt/95  Acrobat PDF/A - Portable Document Format    1a
fmt/354 Acrobat PDF/A - Portable Document Format    1b
...

$ droidsfmin -l signatures.xml | grep -E 'Waveform Audio|Broadcast WAVE'

fmt/6   Waveform Audio
fmt/2   Broadcast WAVE  1 Generic
...

The PUID, format name and format version fields in each line are separated by tab characters. That makes it easy to use another classic Unix tool, cut, to write only the PUIDs to a file:

$ droidsfmin -l signatures.xml | grep 'PDF/A' | cut -f 1 > puids.txt

Now you can make a new signature file containing only the interesting PUIDs:

$ droidsfmin -P puids.txt signatures.xml -o pdfa-signatures.xml

For some extra geekiness you could even merge all this into a single command, although this relies on process substitution (which Bash supports, but not every shell does) and looks a bit weird:

$ droidsfmin signatures.xml \
    -P <(droidsfmin -l signatures.xml | grep 'PDF/A' | cut -f 1)

Windows/PowerShell

First, list all PUIDs in a signature file to get an overview:

PS> droidsfmin.exe -l signatures.xml

fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation   95
fmt/126 Microsoft Powerpoint Presentation   97-2003
fmt/124 Encapsulated PostScript File Format 3
...

Use piping and the Select-String cmdlet (which can be abbreviated to sls) to find interesting PUIDs in this list:

PS> droidsfmin.exe -l signatures.xml | sls 'PDF/A'

fmt/95  Acrobat PDF/A - Portable Document Format    1a
fmt/354 Acrobat PDF/A - Portable Document Format    1b
...

PS> droidsfmin.exe -l signatures.xml | sls 'Waveform Audio|Broadcast WAVE'

fmt/6   Waveform Audio
fmt/2   Broadcast WAVE  1 Generic
...

The PUID, format name and format version fields in each line are separated by tab characters. As far as I know, PowerShell has no direct equivalent to the Unix tool cut, but you can do something like this to write only the PUIDs to a file:

PS> droidsfmin.exe -l signatures.xml | sls 'PDF/A' | `
    % { ($_ -split '\t')[0] } > puids.txt

Now you can make a new signature file containing only the interesting PUIDs:

PS> droidsfmin.exe -P puids.txt signatures.xml -o pdfa-signatures.xml
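If you would rather not depend on grep and cut or on PowerShell, the same filtering step can be sketched in a few lines of Python. This is only an illustration and not part of droidsfmin: the function name select_puids is made up, and the sample lines are copied from the listings above, assuming the tab-separated layout described there.

```python
import re


def select_puids(list_output, pattern):
    """Return the PUID (first tab-separated field) of every line of
    `droidsfmin --list` output that matches the regular expression."""
    puids = []
    for line in list_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if re.search(pattern, line):
            puids.append(line.split("\t")[0])
    return puids


# Sample lines in the PUID <TAB> name <TAB> version layout shown above.
sample = (
    "fmt/95\tAcrobat PDF/A - Portable Document Format\t1a\n"
    "fmt/354\tAcrobat PDF/A - Portable Document Format\t1b\n"
    "fmt/6\tWaveform Audio\t\n"
)

# Print one PUID per line, the format that --puids-from-file expects.
print("\n".join(select_puids(sample, r"PDF/A")))
```

Redirect the printed lines to puids.txt and pass that file to droidsfmin with -P, just as in the shell examples.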

Analyzing format relationships

The --include-supertypes option will include all file formats that are direct or indirect supertypes of one of the selected formats. That means selecting fmt/354 (PDF/A-1b) will implicitly also include fmt/18 (PDF 1.4, on which PDF/A-1 is based), amongst others.

Likewise, the --include-subtypes option will include all file formats that are direct or indirect subtypes of one of the selected formats. That means selecting fmt/18 (PDF 1.4) will implicitly also include fmt/354 (PDF/A-1b, which is based on PDF 1.4), amongst others. Yes, this is like --include-supertypes the other way around.

Using both --include-supertypes and --include-subtypes together will yield only (direct and indirect) supertypes and subtypes of the given formats (--puid or --puids-from-file). It will not collect all supertypes of subtypes of …, since that path quickly leads to madness.
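To make the difference concrete, here is a small Python sketch of that selection rule. It is my own illustration of the behaviour described above, not droidsfmin's actual code. The priority table is toy data: the fmt/354 and fmt/18 entries come from the examples above, while the fmt/19 and fmt/17 entries are only there for illustration and may not match a real signature file.

```python
# Toy "has priority over" table: subtype -> list of supertypes.
HAS_PRIORITY_OVER = {
    "fmt/354": ["fmt/18"],  # PDF/A-1b is based on PDF 1.4
    "fmt/19": ["fmt/18"],   # illustrative entry
    "fmt/18": ["fmt/17"],   # illustrative entry
}


def invert(edges):
    """Turn subtype -> supertypes into supertype -> subtypes."""
    inverted = {}
    for sub, supers in edges.items():
        for sup in supers:
            inverted.setdefault(sup, []).append(sub)
    return inverted


def reachable(start, edges):
    """Collect everything reachable from `start` (direct and indirect)."""
    seen = set()
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select(puids, supertypes=False, subtypes=False):
    """Selected formats plus, optionally, their supertypes and subtypes.

    Both closures start from the originally selected PUIDs only, so
    subtypes of supertypes (and vice versa) are not pulled in.
    """
    result = set(puids)
    if supertypes:
        result |= reachable(puids, HAS_PRIORITY_OVER)
    if subtypes:
        result |= reachable(puids, invert(HAS_PRIORITY_OVER))
    return result


print(sorted(select(["fmt/354"], supertypes=True, subtypes=True)))
# prints ['fmt/17', 'fmt/18', 'fmt/354']
```

Note that fmt/19 is missing from the result: in the toy table it is a subtype of the supertype fmt/18, but neither a supertype nor a subtype of fmt/354 itself.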

These options are primarily useful when analyzing the relationship between several file formats in a given signature file. (In particular, including supertypes is pretty useless for minimizing a signature file, because supertypes have a lower priority and thus can never “win” against one of their subtypes, so we can safely ignore them.) For example, the following command will output a list of all direct or indirect supertypes of the PDF/A-1b format:

$ droidsfmin --list -p fmt/354 --include-supertypes signatures.xml

The sub/supertype information is expressed in the signature file element HasPriorityOverFileFormatID of the subtype and has the semantics “subtype has priority over supertype”. Actually, priority is used not only for sub/supertype relationships in the DROID signature file but in a more general form: For example, both PDF/A-1b and PDF 1.5 have priority over PDF 1.4. But while PDF/A-1b may well be regarded as a subtype of PDF 1.4, PDF 1.5 is merely a newer version! I am aware of the semantic imprecision introduced by this simplification, but since --include-formats-with-higher-priority would make a considerably less memorable option name, I am willing to live with this ambiguity.
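If you want to look at these priority entries yourself, a few lines of Python are enough. The sketch below is not taken from droidsfmin. It assumes the usual signature file layout, namely FileFormat elements with ID and PUID attributes in the PRONOM SignatureFile namespace, and HasPriorityOverFileFormatID children that refer to other formats by their ID. The embedded XML is a tiny hand-written sample with made-up IDs, not a real signature file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the DROID signature file.
NS = {"sf": "http://www.nationalarchives.gov.uk/pronom/SignatureFile"}

# Hand-written sample; the ID values are made up for illustration.
SAMPLE = """<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile">
  <FileFormatCollection>
    <FileFormat ID="1" Name="Acrobat PDF 1.4 - Portable Document Format" PUID="fmt/18" Version="1.4"/>
    <FileFormat ID="2" Name="Acrobat PDF/A - Portable Document Format" PUID="fmt/354" Version="1b">
      <HasPriorityOverFileFormatID>1</HasPriorityOverFileFormatID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>"""


def priority_pairs(xml_text):
    """Return (subtype PUID, supertype PUID) pairs found in the XML."""
    root = ET.fromstring(xml_text)
    formats = root.findall(".//sf:FileFormat", NS)
    # The priority element refers to formats by ID, so map IDs to PUIDs.
    puid_by_id = {f.get("ID"): f.get("PUID") for f in formats}
    pairs = []
    for fmt in formats:
        for ref in fmt.findall("sf:HasPriorityOverFileFormatID", NS):
            target = puid_by_id.get((ref.text or "").strip())
            pairs.append((fmt.get("PUID"), target))
    return pairs


print(priority_pairs(SAMPLE))
# prints [('fmt/354', 'fmt/18')]
```

Each pair reads as “the first format has priority over the second”, which is exactly the relation that --include-supertypes and --include-subtypes follow.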

A note on container signatures

Besides the regular signature file, DROID also uses a container signature file to improve identification of container-based file formats like some newer office formats. If these are relevant to you, make sure that the respective trigger PUIDs from the container signature file are included in the format list you feed to droidsfmin! The container signature file itself cannot be processed by droidsfmin since its internal structure is different from the “original” signature file.
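Collecting the trigger PUIDs by hand is tedious, so here is a hedged Python sketch that pulls them out of a container signature file. It assumes that the file lists them as TriggerPuid elements with a Puid attribute and without an XML namespace. Please check this against your own container signature file. The function name trigger_puids and the embedded sample are mine, not part of droidsfmin or DROID.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking the assumed container signature layout.
SAMPLE = """<ContainerSignatureMapping>
  <TriggerPuids>
    <TriggerPuid ContainerType="OLE2" Puid="fmt/111"/>
    <TriggerPuid ContainerType="ZIP" Puid="x-fmt/263"/>
  </TriggerPuids>
</ContainerSignatureMapping>"""


def trigger_puids(xml_text):
    """Return the Puid attribute of every TriggerPuid element."""
    root = ET.fromstring(xml_text)
    return [el.get("Puid") for el in root.iter("TriggerPuid") if el.get("Puid")]


# One PUID per line, ready to be appended to a --puids-from-file list.
print("\n".join(trigger_puids(SAMPLE)))
```

Append the printed PUIDs to the list you pass with -P so that the minimized signature file still contains the formats that trigger container identification.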