XML stream processing with Haskell
Memory-efficient processing of large XML documents requires the use of a streaming parser. This post gives an introduction to XML stream processing with the Haskell programming language, in particular to the streaming API of the xml-conduit package. It shows examples for reading, writing and transforming XML data in a conduit pipeline.
Introduction
There are two widespread types of XML parsers, DOM-based and streaming. DOM-based parsers read a whole XML document and build a tree representation of the XML structure (the DOM, or Document Object Model) in memory where each node in the tree represents a part of the document like an element or a piece of text content. This makes XML processing very flexible because we can navigate through the tree representation in every direction, upwards, downwards, sideways, collecting all the information we need from different places around the DOM or even changing parts of the tree. Combined with a concise syntax to address specific locations, like XPath, this is quite convenient.
However, maintaining the DOM in memory can be infeasible when it requires too much, well, memory. Don’t get me wrong, you can get very far with a DOM-based parser; it took me years of working with XML before I ran into serious memory issues. But when the day comes and the DOM blows up your memory, you want a streaming parser.
Streaming parsers take another approach; they don’t build an in-memory tree representation of the whole XML document but read it piece by piece, flattening its tree structure into a linear stream of so-called events where each event represents a piece of XML syntax like an opening or closing tag, a comment, or text content. (Yes, that’s like a stream of tokens if you’re more into classic parsing terms. Just like the DOM can be seen as an abstract syntax tree.) With the right streaming framework, this makes XML processing in constant memory possible, where “constant” can be a pretty small amount of memory (see below for numbers).
Compared to DOM-based parsing this is strictly more low-level; it requires a whole different approach to XML processing because we can no longer move around in the DOM tree and access any node we like. Instead we have to deal with the linear stream of events, reacting to events as they fly by. As we will see though, that’s not too bad because we can often turn the stream of events into a stream of values of a (custom) data type that’s closer to our application domain. If you have ever used a parser combinator library like Parsec, Megaparsec or Attoparsec this will feel very similar.
Prerequisites
To follow along with the code examples below you should, apart from
some XML basics, know enough Haskell to feel comfortable with type
classes like Monoid, Applicative, and Monad. You should also know the
fundamentals and terminology of the conduit streaming
framework (if you don’t, the first half of the conduit tutorial
should get you up and running). In particular, you should understand the
different parameters of the ConduitT i o m r
type (tl;dr →
type of input stream, output stream, base monad, and result).
We will use xml-conduit (version 1.9.1.3 at the time of writing) throughout the examples, but you don’t have to know anything about that package yet. We will start with a high-level overview in the rest of this section and explore the relevant details as we move along.
The xml-conduit package contains both a DOM-based and a streaming XML parser and renderer for the conduit framework. The DOM-based API is defined in the Text.XML and Text.XML.Cursor modules, the streaming API (which we will use) can be found in the Text.XML.Stream.Parse and Text.XML.Stream.Render modules. The official tutorial does a good job at explaining the DOM-based API but unfortunately it doesn’t cover the streaming interface; there is only a short code example in the API docs.
Parsers (and renderers) based on xml-conduit are centered around the
Event
type. It is defined as follows:
data Event
= EventBeginDocument
| EventEndDocument
| EventBeginDoctype Text (Maybe ExternalID)
| EventEndDoctype
| EventInstruction Instruction
| EventBeginElement Name [(Name, [Content])]
| EventEndElement Name
| EventContent Content
| EventComment Text
| EventCDATA Text
As you see, the Event
type describes different elements
of XML syntax, so a stream of this type contains values that look like
EventBeginElement "book" [("isbn", ["..."])]
or
EventEndElement "book"
when tags
<book isbn="...">
or </book>
are
encountered. (OK, it will get a little more verbose if we don’t use the
IsString
instances for Name
and
Content
, but you get the idea.) However, although many
streams in xml-conduit use the Event
type we can often get
away without touching it directly, using higher level parser generator
functions instead.
Now with the basics out of the way, let’s dive straight into some
examples! In the next sections, we will see how to read XML into custom
data types, how to write XML from custom data types, and how to
transform XML by working with the Event
values directly
(yes, I know I just said we rarely have to do that, but it’s still
useful to know how it works). The source code for the examples is also
available on GitHub.
Reading XML into custom data types
Suppose we want to read the following XML data:
<?xml version="1.0" encoding="UTF-8"?>
library xmlns:dc="http://purl.org/dc/elements/1.1/">
<book isbn="9781593272838">
<dc:title>Learn You a Haskell for Great Good!</dc:title>
<dc:creator>Miran Lipovača</dc:creator>
<dc:description>This is a book. It contains text. And pictures!</dc:description>
<book>
</book>
<dc:title>Pride and Prejudice</dc:title>
<dc:creator>Jane Austen</dc:creator>
<dc:date>1813</dc:date>
<dc:subject>marriage</dc:subject>
<dc:subject>wealth</dc:subject>
<dc:subject>class</dc:subject>
<book>
</<!-- many more books ... -->
library> </
Nothing fancy, just a library
root element with a long
list of book
elements inside, where each book has at least
a title
and an author (aka creator
), plus an
optional description
and date
and a (possibly
empty) list of keywords (subject
). For the sake of example,
there is also an optional isbn
attribute. Note that the
child elements of book
live in the Dublin Core namespace,
while the library
and book
elements have no
namespace.
For reference, here is the schema this XML conforms to. Technically, our Haskell programs won’t require a schema definition. I think though that the possibility of validating input data up-front is one of the nicer aspects of working with XML. Also, being Haskell programmers, we do like some precise data type definitions, right?
<?xml version="1.0" encoding="UTF-8"?>
xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/">
<xs:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/simpledc20021212.xsd"/>
<xs:element name="library">
<xs:complexType>
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element ref="dc:title"/>
<xs:element ref="dc:creator"/>
<xs:element ref="dc:date" minOccurs="0"/>
<xs:element ref="dc:description" minOccurs="0"/>
<xs:element ref="dc:subject" minOccurs="0" maxOccurs="unbounded"/>
<xs:sequence>
</xs:attribute name="isbn" type="xs:string"/>
<xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema> </
So now that we have a solid understanding of the XML input, let’s move on to some Haskell code! First, the usual preamble of language pragmas and imports (explicit import lists omitted for brevity):
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}
module Read where
import Conduit
import Data.Maybe
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Parse
Next, we will define a data type to represent books in Haskell. Let’s
just agree that we are not interested in the description
XML element, shall we? Apart from that the record fields directly
correspond to the schema definition.
data Book = Book
isbn :: Maybe Text,
{ title :: Text,
author :: Text,
date :: Maybe Text,
keywords :: [Text]
}deriving (Show)
Now our XML processing task consists of three steps: reading XML from
a file, turning the XML into values of our Book
type, and
using the Book
values for fun and profit (in this case, we
simply print them to STDOUT). These steps are reflected by the three
components of our conduit pipeline, separated by the .|
operator:
main :: IO ()
= runConduitRes $ parseFile def "data.xml" .| readXml .| output
main
readXml :: (MonadThrow m) => ConduitT Event Book m ()
= _
readXml
output :: (MonadIO m) => ConduitT Book o m ()
= mapM_C (liftIO . print) output
First,
parseFile :: (MonadResource m) => ParseSettings -> FilePath -> ConduitT i Event m ()
reads the XML file content and turns it into a stream of
Event
values. Then readXml
, to be defined
later, turns the stream of Event
s into a stream of
Book
values. These are finally piped into the
output
component which prints the Book
s to
STDOUT one by one as they are received from upstream. Note how the
stream types of the three components align: parseFile
has
an output stream of type Event
, readXml
has an
input stream of type Event
and an output stream of type
Book
, and output
has an input stream of type
Book
.
To implement readXml
, we could manually process
the stream of Event
values received from the upstream
component, but that would essentially mean looking at each
Event
and keeping track of opening and closing tags,
content, etc. – a little too low-level. Luckily, the
Text.XML.Stream.Parse module provides functions that make parsing the
Event
stream easier. The most general of them is the
tag
function.
tag ::
MonadThrow m) =>
(NameMatcher a -> -- (1)
-> AttrParser b) -> -- (2)
(a -> ConduitT Event o m c) -> -- (3)
(b ConduitT Event o m (Maybe c)
The tag
function takes a description of an XML element
and returns a ConduitT
that parses this element (opening
and closing tag and all) in a stream of Event
values. Its
final result Maybe c
signals parser success or failure and
contains an arbitrary value based on the parsed XML. The element
description has three parts:
A
NameMatcher
for the tag name. Thanks to itsIsString
instance this can be as simple as"book"
for names without a namespace. For namespace sensitive matching you can either use Clark notation like"{http://purl.org/dc/elements/1.1/}title"
or construct a properNameMatcher
likematching (== Name "title" (Just "http://purl.org/dc/elements/1.1/") (Just "dc"))
. Note that the prefix part (Just "dc"
) is ignored when comparing names with==
, just as one would expect from XML namespace semantics.A function that takes the result of the
NameMatcher
(which will usually be the tag name with typeName
, unless we are doing something funny) and returns anAttrParser
for the attributes of the element. AnAttrParser
can be built from simple functions provided by the Text.XML.Stream.Parse module likeattr :: Name -> AttrParser (Maybe Text)
for an optional attribute,requireAttr :: Name -> AttrParser Text
for a required attribute, andignoreAttrs :: AttrParser ()
to skip any unparsed attributes. Parsers for single attributes are combined using theirApplicative
(orMonad
) instance:(,) <$> attr "foo" <*> attr "bar" <* ignoreAttrs
.A function that takes the result of the
AttrParser
and returns aConduitT
that parses the element content (i.e., child elements or text content) and produces the final result of the parser (which will eventually be wrapped inMaybe c
, signalling parser success or failure). When parsing the element content, text content can be dealt with using thecontent :: MonadThrow m => ConduitT Event o m Text
function, while child elements are handled by recursively building more parsers with thetag
function and combining them using theMonad
orApplicative
instances ofConduitT
. (Bear with me, example ahead!)
As you see, the result of each parsing stage (tag name, attributes,
content) is forwarded to the next and can thus be accessed when
producing the final result. If that’s not necessary it may be more
convenient to use one of the following simplified variants of the
tag
function (also from Text.XML.Stream.Parse):
-- Does not forward the NameMatcher result to the AttrParser.
tag' ::
MonadThrow m) =>
(NameMatcher a ->
AttrParser b ->
-> ConduitT Event o m c) ->
(b ConduitT Event o m (Maybe c)
-- Like tag', but requires that no attributes exist.
tagNoAttr ::
MonadThrow m) =>
(NameMatcher a ->
ConduitT Event o m c ->
ConduitT Event o m (Maybe c)
-- Like tag', but ignores any attributes.
tagIgnoreAttrs ::
MonadThrow m) =>
(NameMatcher a ->
ConduitT Event o m c ->
ConduitT Event o m (Maybe c)
Now, without further ado, here’s a parser for a single
book
element, including a parser for its isbn
attribute and a sequence of nested parsers for its child elements:
book :: (MonadThrow m) => ConduitT Event o m (Maybe Book)
= tag' "book" (attr "isbn") $ \isbn -> do
book <- fromMaybe "" <$> tagNoAttr (withName $ dc "title") content
title <- fromMaybe "" <$> tagNoAttr (withName $ dc "creator") content
author <- tagNoAttr (withName $ dc "date") content
date <- ignoreTree (withName $ dc "description") ignoreAttrs
_ <- many $ tagNoAttr (withName $ dc "subject") content
keywords pure $ Book {..}
withName :: Name -> NameMatcher Name
= matching . (==)
withName
dc :: Text -> Name
= Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc") dc n
Note how we use fromMaybe
to provide a default value (an
empty string) for the title
and creator
elements, but not for date
. That’s because
date
is an optional XML element, so we actually want a
Maybe
result in our Book
value. On the other
hand, title
and creator
are required by the
schema, so we could even use fromJust
or throw a useful
error when their parsers fail because this can only happen when the XML
input is invalid – use your judgement.
Remember how we agreed that we are not interested in the
description
element when we defined the Book
data type? Well, we have to let the parser know about that; otherwise it
will fail when it encounters this unexpected element. We use the
ignoreTree :: (MonadThrow m) => NameMatcher a -> AttrParser b -> ConduitT Event o m (Maybe ())
function to do this. It parses an element similar to tag
but without returning a result (other than Maybe ()
) or
yielding anything downstream, effectively skipping the
description
element if it’s present.
The
many :: (Monad m) => ConduitT Event o m (Maybe a) -> ConduitT Event o m [a]
function used to parse the subject
elements does just what
you probably expect: It takes a parser and applies it as long as it
succeeds (i.e., returns Just
), collecting the results in a
(possibly empty) list.
Now there is only one small part missing; we have to parse the
library
root element and apply our book
parser
to its content:
readXml :: (MonadThrow m) => ConduitT Event Book m ()
= force "failed reading XML" $ tagNoAttr "library" (manyYield book) readXml
The parser for the library
element should by now look
familiar, the only new function is
manyYield :: (Monad m) => ConduitT i o m (Maybe o) -> ConduitT i o m ()
which takes a parser and applies it as long as it succeeds, yielding its
results downstream one by one. Don’t confuse this with many
which also applies a parser as long as it succeeds but does not
stream the results! Instead, it collects the results in a list and waits
until the parser returns Nothing
. Sometimes this is just
what we want, and sometimes it unexpectedly blows up memory. (As a rule
of thumb, use many
if you are still collecting data to
construct a bigger result, like when parsing the subject
elements as part of a Book
value. Use
manyYield
when you are sending complete results like our
Book
values down the pipeline.)
The force
function makes sure that a parser actually
succeeds (i.e., its Maybe
result type is a
Just
) and throws an error otherwise.
Reading nested XML
As an aside, the fact that our example XML data is only a flat list
of books should not leave you with the impression that stream processing
of deeply, maybe recursively nested XML trees is significantly harder or
even impossible. Consider this structure of nested node
elements:
<?xml version="1.0" encoding="UTF-8"?>
node label="A">
<node label="B">
<node label="C"/>
<node label="D">
<node label="E"/>
<node label="F"/>
<node>
</node>
</node label="G">
<node label="H"/>
<node label="I">
<node label="J"/>
<node label="K"/>
<node>
</node>
</node> </
How can we write a streaming parser that finds all leaf nodes (i.e., nodes that contain no child nodes: C, E, F, H, J, K) on all levels of the structure, no matter how deeply nested?
readXml :: (MonadThrow m) => ConduitT Event Text m ()
= force "failed reading XML" node
readXml
node :: (MonadThrow m) => ConduitT Event Text m (Maybe ())
= tag' "node" (requireAttr "label") $ \label -> do
node <- node
mbNode case mbNode of
Just () -> void $ many node
Nothing -> yield label
We start with a parser for a node
tag and its
label
attribute. When parsing the content of the
node
element we first try to parse exactly one child node
by recursive application of our parser. If we succeed, the current
element has child nodes (at least one) and thus obviously is an inner
node, so we keep parsing the remaining child nodes. If we fail, the
current node has no child nodes and thus is a leaf node, so we send its
label downstream.
But let’s get back to our list of books.
Writing XML from custom data types
Now suppose we would like to render a stream of Book
values into XML. We reuse the Book
data type from the
previous section, as well as the language extensions and most of the
imports; but instead of Text.XML.Stream.Parse we now use the
Text.XML.Stream.Render module.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}
module Write where
import Conduit
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Render
data Book = Book
isbn :: Maybe Text,
{ title :: Text,
author :: Text,
date :: Maybe Text,
keywords :: [Text]
}deriving (Show)
In our conduit pipeline we provide a stream of Book
values, turn it into a stream of Event
values, and render
the XML corresponding to the Event
stream to STDOUT:
main :: IO ()
= runConduit $ mkBooks .| writeXml .| output
main
mkBooks :: (Monad m) => ConduitT i Book m ()
=
mkBooks
yieldManyBook
[ = Just "9781593272838",
{ isbn = "Learn You a Haskell for Great Good!",
title = "Miran Lipovača",
author = Nothing,
date = []
keywords
},Book
= Nothing,
{ isbn = "Pride and Prejudice",
title = "Jane Austen",
author = Just "1813",
date = ["marriage", "wealth", "class"]
keywords
}
]
writeXml :: (Monad m) => ConduitT Book Event m ()
= _
writeXml
output :: (MonadIO m, PrimMonad m) => ConduitT Event o m ()
= renderBytes def .| stdoutC output
The mkBooks
component turns a list of Book
values into a stream with the yieldMany
function. At the
other end of the pipeline, the output
component renders a
stream of Event
values into a stream of
ByteString
s and prints them to STDOUT one by one, resulting
in a full XML document. It follows from the types that the
writeXml
component in between has to turn the stream of
Book
values into a stream of Event
values.
Again, manually dealing with Event
values in the
implementation of writeXml
would be more tedious than
necessary: Just like there is a high level parsing function in
the Text.XML.Stream.Parse module that reads XML elements in a stream of
Event
values, there is also a high level rendering
function in the Text.XML.Stream.Render module that writes a stream of
Event
values representing an XML element. They are both
called tag
, which may be slightly (but only slightly)
confusing. The tag
function from Text.XML.Stream.Render has
the following type:
tag ::
Monad m) =>
(Name -> -- (1)
Attributes -> -- (2)
ConduitT i Event m () -> -- (3)
ConduitT i Event m ()
Similar to its parsing counterpart it takes a description of an XML
element and returns a ConduitT
that renders this element
(opening and closing tag and all) into a stream of Event
values. The element description has three parts:
The tag name. The
IsString
instance of theName
type is quite similar to that of theNameMatcher
type used for parsing, so"book"
,"{http://purl.org/dc/elements/1.1/}title"
, andName "title" (Just "http://purl.org/dc/elements/1.1/") (Just "dc")
are all valid choices.The attributes. Single attributes are built with the
attr :: Name -> Text -> Attributes
andoptionalAttr :: Name -> Maybe Text -> Attributes
functions (whereoptionalAttr
renders the given attribute only if its value is aJust
) and combined using theirMonoid
instance.The element content. Simple text content can be created with the
content :: (Monad m) => Text -> ConduitT i Event m ()
function, while child elements are built by nested invocations of thetag
function and combined using theMonad
orApplicative
instances ofConduitT
.
With the tag
function we are now ready to implement the
writeXml
component:
writeXml :: (Monad m) => ConduitT Book Event m ()
= tag "library" mempty $ awaitForever book
writeXml
book :: (Monad m) => Book -> ConduitT Book Event m ()
Book {..} = tag "book" (optionalAttr "isbn" isbn) $ do
book "title") mempty (content title)
tag (dc "creator") mempty (content author)
tag (dc maybe (pure ()) (tag (dc "date") mempty . content) date
.| awaitForever (tag (dc "subject") mempty . content)
yieldMany keywords
dc :: Text -> Name
= Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc") dc n
The writeXml
component renders the library
root element. This element has no attributes, so the second argument is
mempty
. For its content, we need a sequence of (XML)
book
elements based on the (Haskell) Book
values received from upstream.
A single Book
value is rendered using the
book
function which is essentially just a sequence of
nested tag
function calls that create the necessary XML
elements. But how do we map the book
function over the
input stream of Book
values, creating an output stream of
Event
values? We can’t use something like
mapC :: (Monad m) => (i -> o) -> ConduitT i o m ()
here because our book
function doesn’t have type
Book -> Event
but returns a full-blown
ConduitT
. We need a function that applies our function to
every Book
value in the input stream and combines the
resulting Event
output streams into one. As it happens,
awaitForever :: (Monad m) => (i -> ConduitT i o m r) -> ConduitT i o m ()
does exactly that.
By the way, we use the same pattern to render the list of keywords
into subject
elements, first creating a stream of
Text
values from the list and then calling
awaitForever
with a function with type
Text -> ConduitT Text Event m ()
.
Transforming XML (events) directly
So now we know how to parse XML into custom data types, and we know
how to render XML from custom data types, all in a streaming fashion.
What if we want to modify XML data so that our input and output are in
the same XML format (i.e., conform to the same schema), only with some
changes applied? For a toy example, suppose we want to append another
keyword to the list of XML subject
elements. We could of
course combine the two steps we already learned about in a parse →
transform → render approach, where the transform step could conveniently
work with our Book
data type, pushing XML processing to the
start and end of the pipeline:
main :: IO ()
=
main $
runConduitRes "data.xml"
parseFile def .| readXml
.| transform
.| writeXml
.| output
transform :: (Monad m) => ConduitT Book Book m ()
= mapC (\b -> b {keywords = keywords b <> ["..."]}) transform
Isn’t this nice and modular and all? Whenever this is a viable approach I recommend you do it that way.
But then, in real life, XML can get quite unwieldy, and some XML
schemas are huge, complex beasts. In such cases it is way too much
trouble to define and work with custom data types that are able to
preserve all information we need to re-construct the XML representation.
(By the way, did you notice we already would have lost the
description
element by now?) That’s even more annoying when
we modify only small parts of an otherwise unchanged, complex XML
document.
In my experience, in such situations we just have to bite the bullet
and work with the XML Event
stream directly. This is more
low-level and inconvenient than the parse → transform → render approach,
but it enables us to modify the XML data very precisely, and only modify
the parts we are actually interested in while leaving everything else
unchanged. (If you are familiar with XSLT this might vaguely remind you
of an identity
transform.)
We start pretty simliar to the previous examples by outlining a
conduit pipeline that reads an XML file, turns it into a stream of
Event
values, transforms those values and finally renders
them to STDOUT. Since we are now working directly with the
Event
stream the input and output streams of our
transform
component both have type Event
.
{-# LANGUAGE OverloadedStrings #-}
module Transform where
import Conduit
import Control.Monad
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Parse (parseFile)
import Text.XML.Stream.Render (content, def, renderBytes, tag)
main :: IO ()
= runConduitRes $ parseFile def "data.xml" .| transform .| output
main
transform :: (Monad m) => ConduitT Event Event m ()
= _
transform
output :: (MonadIO m, PrimMonad m) => ConduitT Event o m ()
= renderBytes def .| stdoutC output
Note that we now need both the Text.XML.Stream.Parse and Render
modules which produce some name clashes when used together (most
notably, the tag
and content
functions), so
it’s time for explicit import lists.
Before we move on to implementing transform
let’s
revisit what we are trying to achieve. In each book
element, we want to append another keyword to the list of
subject
elements. So we have to insert new
subject
elements into the stream, and we have to insert
them at the right places. From the XML schema we know that the list of
subject
elements may be empty, and we know that the
subject
elements, if present, are always the last children
of a book
element. In other words, since our new
subject
element has to be appended to the list
(i.e., it becomes its last element) it has to be inserted right before
the closing </book>
tag, regardless of whether the
list is empty or not.
So how about that: We look at the Event
stream and just
pass on its values until we encounter an Event
signalling a
closing </book>
tag. Then we stop, call the
tag
function to create a series of Event
values representing a new subject
element and send them
downstream. After that we send the Event
for the
</book>
tag downstream as well and continue looking
for more </book>
tags.
transform :: (Monad m) => ConduitT Event Event m ()
= do
transform /= EventEndElement "book")
takeWhileC (<- nullC
atEndOfStream $ do
unless atEndOfStream "subject") mempty (content "...")
tag (dc 1
takeC
transform
dc :: Text -> Name
= Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc") dc n
Since we are now dealing with the “raw” Event
stream (as
opposed to higher level parsers based on the tag
function)
the functions from the more general Conduit module come in handy. We use
takeWhileC
to skip over (and send downstream) all
Event
values that don’t denote a closing
</book>
element. That means when this function stops
consuming values we are either right before a
</book>
tag or we have reached the end of input. If
the latter is not the case (which we check with nullC
) we
first insert a new subject
element into the output stream
with the tag
function, then pass on exactly one more value
from the input stream with takeC 1
(which, thanks to the
predicate given to takeWhileC
, can’t be anything but an
EventEndElement "book"
aka a </book>
tag) and finally repeat the process by calling transform
recursively.
Conclusion
In this post we have worked through examples for reading, writing and transforming XML in a streaming fashion. I hope you agree with me that this, while not as convenient as with a DOM-based parser, was actually quite doable. In my experience it just takes some time getting used to thinking in terms of a flat stream instead of a DOM tree, particularly if one has only used DOM-based parsers before.
Granted, some use cases are more awkward than others when using a
streaming parser. For example, compiling data from a lot of different
parts of an XML document is easy with a DOM representation but tedious
with a stream representation. However, in my experience most XML
documents that are so large that they require a streaming approach
usually (at some level) consist of a very long but linear list of not
too deeply structured records of some kind. Particularly when using the
tag
functions discussed above a streaming parser is usually
very well suited for such XML documents.
If you would like to see more examples/exercises I recommend my xml-processing repository which contains examples built with both the DOM-based and the streaming parser from the xml-conduit package, as well as with other Haskell packages and in other languages.
Epilogue: runtime statistics
To justify the claim that a streaming parser is much more
memory-efficient than a DOM-based parser it’s about time to look at some
numbers. If you recall the first example above when
we read XML into our Book
data type – the following program
does the same but uses the DOM-based parser from the xml-conduit package
instead of the streaming parser:
{-# LANGUAGE OverloadedStrings #-}
module ReadDom where
import Data.Maybe
import Data.Text
import Text.XML
import Text.XML.Cursor
import Prelude hiding (readFile)
data Book = Book
isbn :: Maybe Text,
{ title :: Text,
author :: Text,
date :: Maybe Text,
keywords :: [Text]
}deriving (Show)
main :: IO ()
= do
main <- fromDocument <$> readFile def "data.xml"
xml mapM_ print $ xml $/ element "book" &| mkBook
mkBook :: Cursor -> Book
=
mkBook xml Book
= listToMaybe $ attribute "isbn" xml,
{ isbn = mconcat $ xml $/ element (dc "title") &/ content,
title = mconcat $ xml $/ element (dc "creator") &/ content,
author = listToMaybe $ xml $/ element (dc "date") &/ content,
date = xml $/ element (dc "subject") &/ content
keywords
}
dc :: Text -> Name
= Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc") dc n
Let’s inflate our XML file to contain not just two but 200,000 books (amounting to a file size of 47 MiB) and run it through both the streaming and the DOM-based version, gathering some runtime statistics along the way!
Here are the stats from my laptop running GHC 9.6.2 for the streaming version:
$ cabal run -O2 -- read +RTS -s > /dev/null
27,353,012,488 bytes allocated in the heap
17,311,640 bytes copied during GC
181,432 bytes maximum residency (111 sample(s))
33,728 bytes maximum slop
7 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6514 colls, 0 par 0.060s 0.065s 0.0000s 0.0001s
Gen 1 111 colls, 0 par 0.015s 0.015s 0.0001s 0.0003s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.853s ( 4.839s elapsed)
GC time 0.075s ( 0.080s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 4.928s ( 4.920s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 5,636,624,439 bytes per MUT second
Productivity 98.5% of total user, 98.4% of total elapsed
And here is the DOM-based version:
$ cabal run -O2 -- read-dom +RTS -s > /dev/null
27,471,655,576 bytes allocated in the heap
4,237,957,944 bytes copied during GC
571,523,864 bytes maximum residency (17 sample(s))
4,545,184 bytes maximum slop
1572 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6727 colls, 0 par 2.616s 2.623s 0.0004s 0.0063s
Gen 1 17 colls, 0 par 1.979s 1.980s 0.1165s 0.5329s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.361s ( 4.338s elapsed)
GC time 4.595s ( 4.603s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 8.957s ( 8.950s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 6,299,123,852 bytes per MUT second
Productivity 48.7% of total user, 48.5% of total elapsed
Probably the most interesting number is the “bytes maximum residency” which according to the GHC User’s Guide is an approximation of the “maximum space actually used by your program”. In the streaming case this is a mere 0.17 MiB while the DOM-based approach gobbles 545 MiB. Of course there might also be a memory leak hiding somewhere in my program (in either the streaming or the DOM-based version, or both) but I daresay these numbers make a point for streaming XML parsers.