Use droidsfmin to analyze and filter the contents of a DROID signature file. Minimize it based on a list of PUIDs, remove all format entries you don’t need and speed up the file format identification process.
You should not use droidsfmin if you need exact identification for all file formats because obviously removing format entries from the signature file will make identification less precise (see also the note on container signatures below). It’s the usual tradeoff between speed and accuracy. And of course you simply don’t need droidsfmin if you don’t care how fast DROID runs when identifying files.
That said, droidsfmin can be useful if you want DROID to run faster, you need exact identification for only some file formats (like those accepted by your archival policy or those of interest in your current project), and it’s okay to label all others as “Unknown”.
Use the --help option to show the available options:
$ droidsfmin --help
The DROID Signature File Minimizer - filter a signature file based on
a list of PUIDs and keep only entries for those file formats that you
really need.
Usage: droidsfmin [options] [signature-file]
  -h       --help                   show help message
  -p PUID  --puid=PUID              include file format with this PUID in the
                                    output
  -P FILE  --puids-from-file=FILE   like -p, but read list of PUIDs from file
                                    (one PUID per line)
  -l       --list                   return a list of PUIDs instead of XML
  -S       --include-supertypes     include file formats that are supertypes
                                    of the selected formats
  -s       --include-subtypes       include file formats that are subtypes of
                                    the selected formats
  -o FILE  --output=FILE            output file
List all file formats that occur in a signature file, giving you an overview of its content:
$ droidsfmin --list signatures.xml
fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation 95
fmt/126 Microsoft Powerpoint Presentation 97-2003
fmt/124 Encapsulated PostScript File Format 3
...
Create a new signature file that contains entries only for the
formats x-fmt/111 (plain text) and fmt/95 (PDF/A-1a):
$ droidsfmin --puid x-fmt/111 --puid fmt/95 signatures.xml -o new.xml
If you don’t specify input/output files then droidsfmin will read from STDIN and write to STDOUT, just like any decent command line tool would. This allows some nice signature file analysis and creation techniques.
First, list all PUIDs in a signature file to get an overview:
$ droidsfmin -l signatures.xml
fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation 95
fmt/126 Microsoft Powerpoint Presentation 97-2003
fmt/124 Encapsulated PostScript File Format 3
...
Use piping and grep to find interesting PUIDs in this list:
$ droidsfmin -l signatures.xml | grep 'PDF/A'
fmt/95 Acrobat PDF/A - Portable Document Format 1a
fmt/354 Acrobat PDF/A - Portable Document Format 1b
...
$ droidsfmin -l signatures.xml | grep -E 'Waveform Audio|Broadcast WAVE'
fmt/6 Waveform Audio
fmt/2 Broadcast WAVE 1 Generic
...
The PUID, format name and format version fields in each line are separated by tab characters. That makes it easy to use another classic Unix tool, cut, to write only the PUIDs to a file:
$ droidsfmin -l signatures.xml | grep 'PDF/A' | cut -f 1 > puids.txt
Now you can make a new signature file containing only the interesting PUIDs:
$ droidsfmin -P puids.txt signatures.xml -o pdfa-signatures.xml
For some extra geekiness you could even merge all this into a single command, although this works only in Bash and looks a bit weird:
$ droidsfmin signatures.xml \
-P <(droidsfmin -l signatures.xml | grep 'PDF/A' | cut -f 1)
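One caveat of the grep step in the pipeline above: grep matches anywhere in the line, so a pattern can also hit the PUID or version column. Since the fields are tab-separated, awk can restrict the match to the format name. The following is only a sketch; `sample_list` is a made-up stand-in for the output of `droidsfmin -l`, and `puids_by_name` is a hypothetical helper, not part of droidsfmin.

```shell
# Hypothetical stand-in for `droidsfmin -l signatures.xml` output:
# PUID, format name and format version, separated by tabs.
sample_list() {
  printf 'fmt/95\tAcrobat PDF/A - Portable Document Format\t1a\n'
  printf 'fmt/354\tAcrobat PDF/A - Portable Document Format\t1b\n'
  printf 'fmt/18\tAcrobat PDF 1.4 - Portable Document Format\t1.4\n'
}

# Print the PUID (column 1) of every line whose format name (column 2)
# matches the extended regular expression given as first argument.
puids_by_name() {
  awk -F '\t' -v pattern="$1" '$2 ~ pattern { print $1 }'
}

sample_list | puids_by_name 'PDF/A'
```

With the real tool, the same helper would slot in where grep and cut were used before, for example `droidsfmin -l signatures.xml | puids_by_name 'PDF/A' > puids.txt`.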
In PowerShell, the same workflow looks like this. First, list all PUIDs in a signature file to get an overview:
PS> droidsfmin.exe -l signatures.xml
fmt/122 Encapsulated PostScript File Format 1.2
fmt/125 Microsoft Powerpoint Presentation 95
fmt/126 Microsoft Powerpoint Presentation 97-2003
fmt/124 Encapsulated PostScript File Format 3
...
Use piping and the Select-String cmdlet (which can be abbreviated
to sls) to find interesting PUIDs in this list:
PS> droidsfmin.exe -l signatures.xml | sls 'PDF/A'
fmt/95 Acrobat PDF/A - Portable Document Format 1a
fmt/354 Acrobat PDF/A - Portable Document Format 1b
...
PS> droidsfmin.exe -l signatures.xml | sls 'Waveform Audio|Broadcast WAVE'
fmt/6 Waveform Audio
fmt/2 Broadcast WAVE 1 Generic
...
The PUID, format name and format version fields in each line are separated by tab characters. As far as I know PowerShell has no direct equivalent to the Unix tool cut, but you can do something like this to write only the PUIDs to a file:
PS> droidsfmin.exe -l signatures.xml | sls 'PDF/A' | `
% { ($_ -split '\t')[0] } > puids.txt
Now you can make a new signature file containing only the interesting PUIDs:
PS> droidsfmin.exe -P puids.txt signatures.xml -o pdfa-signatures.xml
The --include-supertypes option will include all file
formats that are direct or indirect supertypes of one of the selected
formats. That means selecting fmt/354 (PDF/A-1b) will implicitly also
include fmt/18 (PDF 1.4, on which PDF/A-1 is based), amongst others.
Likewise, the --include-subtypes option will include all
file formats that are direct or indirect subtypes of one of the selected
formats. That means selecting fmt/18 (PDF 1.4) will implicitly also
include fmt/354 (PDF/A-1b, which is based on PDF 1.4), amongst others.
Yes, this is like --include-supertypes the other way around.
Using both --include-supertypes and --include-subtypes
together will yield only (direct and indirect) supertypes and subtypes
of the given formats (--puid or --puids-from-file). It will
not collect all supertypes of subtypes of …, since that path
quickly leads to madness.
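To make the "direct or indirect" part concrete, here is a rough sketch of such a walk over priority relations. It is not how droidsfmin is implemented, just an illustration: `sample_edges`, `supertypes_of` and `subtypes_of` are hypothetical names, and the edge from fmt/18 to fmt/17 is invented to get an indirect supertype.

```shell
# Hypothetical priority edges, one "SUBTYPE<TAB>SUPERTYPE" pair per line.
# fmt/354 -> fmt/18 and fmt/19 -> fmt/18 follow the PDF examples in this
# document; fmt/18 -> fmt/17 is made up for illustration.
sample_edges() {
  printf 'fmt/354\tfmt/18\n'
  printf 'fmt/19\tfmt/18\n'
  printf 'fmt/18\tfmt/17\n'
}

# Print all direct and indirect supertypes of the PUID given as $1,
# using a breadth-first walk over the edges read from STDIN.
supertypes_of() {
  awk -F '\t' -v start="$1" '
    { count[$1]++; edge[$1, count[$1]] = $2 }
    END {
      queue[1] = start; head = 1; tail = 1; seen[start] = 1
      while (head <= tail) {
        current = queue[head++]
        for (i = 1; i <= count[current]; i++) {
          next_puid = edge[current, i]
          if (!(next_puid in seen)) {
            seen[next_puid] = 1
            queue[++tail] = next_puid
            print next_puid
          }
        }
      }
    }'
}

# Subtypes are the same walk with the edge direction reversed.
subtypes_of() {
  awk -F '\t' '{ print $2 "\t" $1 }' | supertypes_of "$1"
}

sample_edges | supertypes_of fmt/354   # prints fmt/18, then fmt/17
sample_edges | subtypes_of fmt/354     # prints nothing
```

In this sample, combining both walks for fmt/354 gives fmt/18 and fmt/17 only. fmt/19 is a subtype of the supertype fmt/18, but it is neither a supertype nor a subtype of fmt/354 itself, so it stays out.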
These options are primarily useful when analyzing the relationship between several file formats in a given signature file. (In particular, including supertypes is pretty useless for minimizing a signature file because supertypes have a lower priority and thus can never “win” against one of their subtypes, so we can safely ignore them.) For example, the following command will output a list of all direct or indirect supertypes of the PDF/A-1b format:
$ droidsfmin --list -p fmt/354 --include-supertypes signatures.xml
The sub/supertype information is expressed in the signature file
element HasPriorityOverFileFormatID of the subtype and has
the semantics “subtype has priority over supertype”. Actually, priority
is used not only for sub/supertype relationships in the DROID signature
file but in a more general form: For example, both PDF/A-1b and PDF 1.5
have priority over PDF 1.4. But while PDF/A-1b may well be regarded as a
subtype of PDF 1.4, PDF 1.5 is merely a newer version! I am aware of the
semantic imprecision introduced by this simplification, but since
--include-formats-with-higher-priority would make a
considerably less memorable option name, I am willing to live with this
ambiguity.
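If you want to look at these priority relations yourself, something like the following sketch can pull them out of a signature file. Everything here is an assumption for illustration: the excerpt in `sample_formats` is hand-written with invented ID values, it assumes that HasPriorityOverFileFormatID holds the ID attribute of the lower-priority FileFormat element, and the line-based awk parsing only works when each element sits on its own line. `priority_pairs` is a hypothetical helper, not a droidsfmin feature.

```shell
# Hand-written excerpt in the style of a DROID signature file.
# The ID values (10, 20) and the layout are invented for this example.
sample_formats() {
cat <<'EOF'
<FileFormat ID="10" Name="Acrobat PDF 1.4 - Portable Document Format" PUID="fmt/18" Version="1.4">
</FileFormat>
<FileFormat ID="20" Name="Acrobat PDF/A - Portable Document Format" PUID="fmt/354" Version="1b">
  <HasPriorityOverFileFormatID>10</HasPriorityOverFileFormatID>
</FileFormat>
EOF
}

# Print "SUBTYPE-PUID<TAB>SUPERTYPE-PUID" for every priority relation.
# IDs are resolved to PUIDs at the end, since the referenced format
# may appear later in the file than the format that refers to it.
priority_pairs() {
  awk '
    /<FileFormat / {
      match($0, /PUID="[^"]*"/)
      puid = substr($0, RSTART + 6, RLENGTH - 7)
      match($0, / ID="[^"]*"/)
      id = substr($0, RSTART + 5, RLENGTH - 6)
      puid_of[id] = puid
      current = puid
    }
    /<HasPriorityOverFileFormatID>/ {
      target = $0
      gsub(/[^0-9]/, "", target)   # keep only the numeric ID
      n++
      sub_puid[n] = current
      super_id[n] = target
    }
    END {
      for (i = 1; i <= n; i++)
        print sub_puid[i] "\t" puid_of[super_id[i]]
    }'
}

sample_formats | priority_pairs
```

For anything beyond a quick look, a real XML parser would be the safer choice.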
Besides the regular signature file, DROID also uses a container signature file to improve identification of container-based file formats like some newer office formats. If these are relevant to you, make sure that the respective trigger PUIDs from the container signature file are included in the format list you feed to droidsfmin! The container signature file itself cannot be processed by droidsfmin since its internal structure is different from the “original” signature file.
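One way to collect the trigger PUIDs is to grab them from the container signature file with sed. This is a sketch under assumptions: it presumes the trigger PUIDs are listed as TriggerPuid elements with a Puid attribute, one per line, as in the hand-written excerpt below. Check your own container signature file before relying on it. `trigger_puids` is a hypothetical helper.

```shell
# Hand-written excerpt; the TriggerPuid element and its attributes are
# an assumption about the container signature file layout.
sample_container() {
cat <<'EOF'
<TriggerPuids>
  <TriggerPuid ContainerType="OLE2" Puid="fmt/111"/>
  <TriggerPuid ContainerType="ZIP" Puid="x-fmt/263"/>
</TriggerPuids>
EOF
}

# Print the Puid attribute of every TriggerPuid element read from STDIN.
trigger_puids() {
  sed -n 's/.*<TriggerPuid [^>]*Puid="\([^"]*\)".*/\1/p'
}

sample_container | trigger_puids

# With real files (file names here are placeholders), the result could be
# merged into an existing PUID list before running droidsfmin:
#   { cat puids.txt; trigger_puids < container-signatures.xml; } | sort -u > all-puids.txt
#   droidsfmin -P all-puids.txt signatures.xml -o new.xml
```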