XML stream processing with Haskell

2023-09-11 | Martin Hoppenheit | 26 min read

Memory-efficient processing of large XML documents requires the use of a streaming parser. This post gives an introduction to XML stream processing with the Haskell programming language, in particular to the streaming API of the xml-conduit package. It shows examples for reading, writing and transforming XML data in a conduit pipeline.

Introduction

There are two widespread types of XML parsers, DOM-based and streaming. DOM-based parsers read a whole XML document and build a tree representation of the XML structure (the DOM, or Document Object Model) in memory where each node in the tree represents a part of the document like an element or a piece of text content. This makes XML processing very flexible because we can navigate through the tree representation in every direction, upwards, downwards, sideways, collecting all the information we need from different places around the DOM or even changing parts of the tree. Combined with a concise syntax to address specific locations, like XPath, this is quite convenient.

However, maintaining the DOM in memory can be infeasible when it requires too much, well, memory. Don’t get me wrong, you can get very far with a DOM-based parser; it took me years of working with XML before I ran into serious memory issues. But when the day comes and the DOM blows up your memory, you want a streaming parser.

Streaming parsers take another approach; they don’t build an in-memory tree representation of the whole XML document but read it piece by piece, flattening its tree structure into a linear stream of so-called events where each event represents a piece of XML syntax like an opening or closing tag, a comment, or text content. (Yes, that’s like a stream of tokens if you’re more into classic parsing terms. Just like the DOM can be seen as an abstract syntax tree.) With the right streaming framework, this makes XML processing in constant memory possible, where “constant” can be a pretty small amount of memory (see below for numbers).

Compared to DOM-based parsing this is strictly more low-level; it requires a whole different approach to XML processing because we can no longer move around in the DOM tree and access any node we like. Instead we have to deal with the linear stream of events, reacting to events as they fly by. As we will see though, that’s not too bad because we can often turn the stream of events into a stream of values of a (custom) data type that’s closer to our application domain. If you have ever used a parser combinator library like Parsec, Megaparsec or Attoparsec this will feel very similar.

Prerequisites

To follow along with the code examples below you should, apart from some XML basics, know enough Haskell to feel comfortable with type classes like Monoid, Applicative, and Monad. You should also know the fundamentals and terminology of the conduit streaming framework (if you don’t, the first half of the conduit tutorial should get you up and running). In particular, you should understand the different parameters of the ConduitT i o m r type (tl;dr → type of input stream, output stream, base monad, and result).

We will use xml-conduit (version 1.9.1.3 at the time of writing) throughout the examples, but you don’t have to know anything about that package yet. We will start with a high-level overview in the rest of this section and explore the relevant details as we move along.

The xml-conduit package contains both a DOM-based and a streaming XML parser and renderer for the conduit framework. The DOM-based API is defined in the Text.XML and Text.XML.Cursor modules, the streaming API (which we will use) can be found in the Text.XML.Stream.Parse and Text.XML.Stream.Render modules. The official tutorial does a good job at explaining the DOM-based API but unfortunately it doesn’t cover the streaming interface; there is only a short code example in the API docs.

Parsers (and renderers) based on xml-conduit are centered around the Event type. It is defined as follows:

data Event
  = EventBeginDocument
  | EventEndDocument
  | EventBeginDoctype Text (Maybe ExternalID)
  | EventEndDoctype
  | EventInstruction Instruction
  | EventBeginElement Name [(Name, [Content])]
  | EventEndElement Name
  | EventContent Content
  | EventComment Text
  | EventCDATA Text

As you see, the Event type describes different elements of XML syntax, so a stream of this type contains values that look like EventBeginElement "book" [("isbn", ["..."])] or EventEndElement "book" when tags <book isbn="..."> or </book> are encountered. (OK, it will get a little more verbose if we don’t use the IsString instances for Name and Content, but you get the idea.) However, although many streams in xml-conduit use the Event type we can often get away without touching it directly, using higher level parser generator functions instead.

Now with the basics out of the way, let’s dive straight into some examples! In the next sections, we will see how to read XML into custom data types, how to write XML from custom data types, and how to transform XML by working with the Event values directly (yes, I know I just said we rarely have to do that, but it’s still useful to know how it works). The source code for the examples is also available on GitHub.

Reading XML into custom data types

Suppose we want to read the following XML data:

<?xml version="1.0" encoding="UTF-8"?>
<library xmlns:dc="http://purl.org/dc/elements/1.1/">
  <book isbn="9781593272838">
    <dc:title>Learn You a Haskell for Great Good!</dc:title>
    <dc:creator>Miran Lipovača</dc:creator>
    <dc:description>This is a book. It contains text. And pictures!</dc:description>
  </book>
  <book>
    <dc:title>Pride and Prejudice</dc:title>
    <dc:creator>Jane Austen</dc:creator>
    <dc:date>1813</dc:date>
    <dc:subject>marriage</dc:subject>
    <dc:subject>wealth</dc:subject>
    <dc:subject>class</dc:subject>
  </book>
  <!-- many more books ... -->
</library>

Nothing fancy, just a library root element with a long list of book elements inside, where each book has at least a title and an author (aka creator), plus an optional description and date and a (possibly empty) list of keywords (subject). For the sake of example, there is also an optional isbn attribute. Note that the child elements of book live in the Dublin Core namespace, while the library and book elements have no namespace.

For reference, here is the schema this XML conforms to. Technically, our Haskell programs won’t require a schema definition. I think though that the possibility of validating input data up-front is one of the nicer aspects of working with XML. Also, being Haskell programmers, we do like some precise data type definitions, right?

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xs:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/simpledc20021212.xsd"/>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref="dc:title"/>
              <xs:element ref="dc:creator"/>
              <xs:element ref="dc:date" minOccurs="0"/>
              <xs:element ref="dc:description" minOccurs="0"/>
              <xs:element ref="dc:subject" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:attribute name="isbn" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

So now that we have a solid understanding of the XML input, let’s move on to some Haskell code! First, the usual preamble of language pragmas and imports (explicit import lists omitted for brevity):

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}

module Read where

import Conduit
import Data.Maybe
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Parse

Next, we will define a data type to represent books in Haskell. Let’s just agree that we are not interested in the description XML element, shall we? Apart from that the record fields directly correspond to the schema definition.

data Book = Book
  { isbn :: Maybe Text,
    title :: Text,
    author :: Text,
    date :: Maybe Text,
    keywords :: [Text]
  }
  deriving (Show)

Now our XML processing task consists of three steps: reading XML from a file, turning the XML into values of our Book type, and using the Book values for fun and profit (in this case, we simply print them to STDOUT). These steps are reflected by the three components of our conduit pipeline, separated by the .| operator:

main :: IO ()
main = runConduitRes $ parseFile def "data.xml" .| readXml .| output

readXml :: (MonadThrow m) => ConduitT Event Book m ()
readXml = _

output :: (MonadIO m) => ConduitT Book o m ()
output = mapM_C (liftIO . print)

First, parseFile :: (MonadResource m) => ParseSettings -> FilePath -> ConduitT i Event m () reads the XML file content and turns it into a stream of Event values. Then readXml, to be defined later, turns the stream of Events into a stream of Book values. These are finally piped into the output component which prints the Books to STDOUT one by one as they are received from upstream. Note how the stream types of the three components align: parseFile has an output stream of type Event, readXml has an input stream of type Event and an output stream of type Book, and output has an input stream of type Book.

To implement readXml, we could manually process the stream of Event values received from the upstream component, but that would essentially mean looking at each Event and keeping track of opening and closing tags, content, etc. – a little too low-level. Luckily, the Text.XML.Stream.Parse module provides functions that make parsing the Event stream easier. The most general of them is the tag function.

tag ::
  (MonadThrow m) =>
  NameMatcher a ->                -- (1)
  (a -> AttrParser b) ->          -- (2)
  (b -> ConduitT Event o m c) ->  -- (3)
  ConduitT Event o m (Maybe c)

The tag function takes a description of an XML element and returns a ConduitT that parses this element (opening and closing tag and all) in a stream of Event values. Its final result Maybe c signals parser success or failure and contains an arbitrary value based on the parsed XML. The element description has three parts:

  1. A NameMatcher for the tag name. Thanks to its IsString instance this can be as simple as "book" for names without a namespace. For namespace sensitive matching you can either use Clark notation like "{http://purl.org/dc/elements/1.1/}title" or construct a proper NameMatcher like matching (== Name "title" (Just "http://purl.org/dc/elements/1.1/") (Just "dc")). Note that the prefix part (Just "dc") is ignored when comparing names with ==, just as one would expect from XML namespace semantics.

  2. A function that takes the result of the NameMatcher (which will usually be the tag name with type Name, unless we are doing something funny) and returns an AttrParser for the attributes of the element. An AttrParser can be built from simple functions provided by the Text.XML.Stream.Parse module like attr :: Name -> AttrParser (Maybe Text) for an optional attribute, requireAttr :: Name -> AttrParser Text for a required attribute, and ignoreAttrs :: AttrParser () to skip any unparsed attributes. Parsers for single attributes are combined using their Applicative (or Monad) instance: (,) <$> attr "foo" <*> attr "bar" <* ignoreAttrs.

  3. A function that takes the result of the AttrParser and returns a ConduitT that parses the element content (i.e., child elements or text content) and produces the final result of the parser (which will eventually be wrapped in Maybe c, signalling parser success or failure). When parsing the element content, text content can be dealt with using the content :: MonadThrow m => ConduitT Event o m Text function, while child elements are handled by recursively building more parsers with the tag function and combining them using the Monad or Applicative instances of ConduitT. (Bear with me, example ahead!)

As you see, the result of each parsing stage (tag name, attributes, content) is forwarded to the next and can thus be accessed when producing the final result. If that’s not necessary it may be more convenient to use one of the following simplified variants of the tag function (also from Text.XML.Stream.Parse):

-- Does not forward the NameMatcher result to the AttrParser.
tag' ::
  (MonadThrow m) =>
  NameMatcher a ->
  AttrParser b ->
  (b -> ConduitT Event o m c) ->
  ConduitT Event o m (Maybe c)

-- Like tag', but requires that no attributes exist.
tagNoAttr ::
  (MonadThrow m) =>
  NameMatcher a ->
  ConduitT Event o m c ->
  ConduitT Event o m (Maybe c)

-- Like tag', but ignores any attributes.
tagIgnoreAttrs ::
  (MonadThrow m) =>
  NameMatcher a ->
  ConduitT Event o m c ->
  ConduitT Event o m (Maybe c)

Now, without further ado, here’s a parser for a single book element, including a parser for its isbn attribute and a sequence of nested parsers for its child elements:

book :: (MonadThrow m) => ConduitT Event o m (Maybe Book)
book = tag' "book" (attr "isbn") $ \isbn -> do
  title <- fromMaybe "" <$> tagNoAttr (withName $ dc "title") content
  author <- fromMaybe "" <$> tagNoAttr (withName $ dc "creator") content
  date <- tagNoAttr (withName $ dc "date") content
  _ <- ignoreTree (withName $ dc "description") ignoreAttrs
  keywords <- many $ tagNoAttr (withName $ dc "subject") content
  pure $ Book {..}

withName :: Name -> NameMatcher Name
withName = matching . (==)

dc :: Text -> Name
dc n = Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc")

Note how we use fromMaybe to provide a default value (an empty string) for the title and creator elements, but not for date. That’s because date is an optional XML element, so we actually want a Maybe result in our Book value. On the other hand, title and creator are required by the schema, so we could even use fromJust or throw a useful error when their parsers fail because this can only happen when the XML input is invalid – use your judgement.

Remember how we agreed that we are not interested in the description element when we defined the Book data type? Well, we have to let the parser know about that; otherwise it will fail when it encounters this unexpected element. We use the ignoreTree :: (MonadThrow m) => NameMatcher a -> AttrParser b -> ConduitT Event o m (Maybe ()) function to do this. It parses an element similar to tag but without returning a result (other than Maybe ()) or yielding anything downstream, effectively skipping the description element if it’s present.

The many :: (Monad m) => ConduitT Event o m (Maybe a) -> ConduitT Event o m [a] function used to parse the subject elements does just what you probably expect: It takes a parser and applies it as long as it succeeds (i.e., returns Just), collecting the results in a (possibly empty) list.

Now there is only one small part missing; we have to parse the library root element and apply our book parser to its content:

readXml :: (MonadThrow m) => ConduitT Event Book m ()
readXml = force "failed reading XML" $ tagNoAttr "library" (manyYield book)

The parser for the library element should by now look familiar, the only new function is manyYield :: (Monad m) => ConduitT i o m (Maybe o) -> ConduitT i o m () which takes a parser and applies it as long as it succeeds, yielding its results downstream one by one. Don’t confuse this with many which also applies a parser as long as it succeeds but does not stream the results! Instead, it collects the results in a list and waits until the parser returns Nothing. Sometimes this is just what we want, and sometimes it unexpectedly blows up memory. (As a rule of thumb, use many if you are still collecting data to construct a bigger result, like when parsing the subject elements as part of a Book value. Use manyYield when you are sending complete results like our Book values down the pipeline.)

The force function makes sure that a parser actually succeeds (i.e., its Maybe result type is a Just) and throws an error otherwise.

Reading nested XML

As an aside, the fact that our example XML data is only a flat list of books should not leave you with the impression that stream processing of deeply, maybe recursively nested XML trees is significantly harder or even impossible. Consider this structure of nested node elements:

<?xml version="1.0" encoding="UTF-8"?>
<node label="A">
  <node label="B">
    <node label="C"/>
    <node label="D">
      <node label="E"/>
      <node label="F"/>
    </node>
  </node>
  <node label="G">
    <node label="H"/>
    <node label="I">
      <node label="J"/>
      <node label="K"/>
    </node>
  </node>
</node>

How can we write a streaming parser that finds all leaf nodes (i.e., nodes that contain no child nodes: C, E, F, H, J, K) on all levels of the structure, no matter how deeply nested?

readXml :: (MonadThrow m) => ConduitT Event Text m ()
readXml = force "failed reading XML" node

node :: (MonadThrow m) => ConduitT Event Text m (Maybe ())
node = tag' "node" (requireAttr "label") $ \label -> do
  mbNode <- node
  case mbNode of
    Just () -> void $ many node
    Nothing -> yield label

We start with a parser for a node tag and its label attribute. When parsing the content of the node element we first try to parse exactly one child node by recursive application of our parser. If we succeed, the current element has child nodes (at least one) and thus obviously is an inner node, so we keep parsing the remaining child nodes. If we fail, the current node has no child nodes and thus is a leaf node, so we send its label downstream.

But let’s get back to our list of books.

Writing XML from custom data types

Now suppose we would like to render a stream of Book values into XML. We reuse the Book data type from the previous section, as well as the language extensions and most of the imports; but instead of Text.XML.Stream.Parse we now use the Text.XML.Stream.Render module.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}

module Write where

import Conduit
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Render

data Book = Book
  { isbn :: Maybe Text,
    title :: Text,
    author :: Text,
    date :: Maybe Text,
    keywords :: [Text]
  }
  deriving (Show)

In our conduit pipeline we provide a stream of Book values, turn it into a stream of Event values, and render the XML corresponding to the Event stream to STDOUT:

main :: IO ()
main = runConduit $ mkBooks .| writeXml .| output

mkBooks :: (Monad m) => ConduitT i Book m ()
mkBooks =
  yieldMany
    [ Book
        { isbn = Just "9781593272838",
          title = "Learn You a Haskell for Great Good!",
          author = "Miran Lipovača",
          date = Nothing,
          keywords = []
        },
      Book
        { isbn = Nothing,
          title = "Pride and Prejudice",
          author = "Jane Austen",
          date = Just "1813",
          keywords = ["marriage", "wealth", "class"]
        }
    ]

writeXml :: (Monad m) => ConduitT Book Event m ()
writeXml = _

output :: (MonadIO m, PrimMonad m) => ConduitT Event o m ()
output = renderBytes def .| stdoutC

The mkBooks component turns a list of Book values into a stream with the yieldMany function. At the other end of the pipeline, the output component renders a stream of Event values into a stream of ByteStrings and prints them to STDOUT one by one, resulting in a full XML document. It follows from the types that the writeXml component in between has to turn the stream of Book values into a stream of Event values.

Again, manually dealing with Event values in the implementation of writeXml would be more tedious than necessary: Just like there is a high level parsing function in the Text.XML.Stream.Parse module that reads XML elements in a stream of Event values, there is also a high level rendering function in the Text.XML.Stream.Render module that writes a stream of Event values representing an XML element. They are both called tag, which may be slightly (but only slightly) confusing. The tag function from Text.XML.Stream.Render has the following type:

tag ::
  (Monad m) =>
  Name ->                   -- (1)
  Attributes ->             -- (2)
  ConduitT i Event m () ->  -- (3)
  ConduitT i Event m ()

Similar to its parsing counterpart it takes a description of an XML element and returns a ConduitT that renders this element (opening and closing tag and all) into a stream of Event values. The element description has three parts:

  1. The tag name. The IsString instance of the Name type is quite similar to that of the NameMatcher type used for parsing, so "book", "{http://purl.org/dc/elements/1.1/}title", and Name "title" (Just "http://purl.org/dc/elements/1.1/") (Just "dc") are all valid choices.

  2. The attributes. Single attributes are built with the attr :: Name -> Text -> Attributes and optionalAttr :: Name -> Maybe Text -> Attributes functions (where optionalAttr renders the given attribute only if its value is a Just) and combined using their Monoid instance.

  3. The element content. Simple text content can be created with the content :: (Monad m) => Text -> ConduitT i Event m () function, while child elements are built by nested invocations of the tag function and combined using the Monad or Applicative instances of ConduitT.

With the tag function we are now ready to implement the writeXml component:

writeXml :: (Monad m) => ConduitT Book Event m ()
writeXml = tag "library" mempty $ awaitForever book

book :: (Monad m) => Book -> ConduitT Book Event m ()
book Book {..} = tag "book" (optionalAttr "isbn" isbn) $ do
  tag (dc "title") mempty (content title)
  tag (dc "creator") mempty (content author)
  maybe (pure ()) (tag (dc "date") mempty . content) date
  yieldMany keywords .| awaitForever (tag (dc "subject") mempty . content)

dc :: Text -> Name
dc n = Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc")

The writeXml component renders the library root element. This element has no attributes, so the second argument is mempty. For its content, we need a sequence of (XML) book elements based on the (Haskell) Book values received from upstream.

A single Book value is rendered using the book function which is essentially just a sequence of nested tag function calls that create the necessary XML elements. But how do we map the book function over the input stream of Book values, creating an output stream of Event values? We can’t use something like mapC :: (Monad m) => (i -> o) -> ConduitT i o m () here because our book function doesn’t have type Book -> Event but returns a full-blown ConduitT. We need a function that applies our function to every Book value in the input stream and combines the resulting Event output streams into one. As it happens, awaitForever :: (Monad m) => (i -> ConduitT i o m r) -> ConduitT i o m () does exactly that.

By the way, we use the same pattern to render the list of keywords into subject elements, first creating a stream of Text values from the list and then calling awaitForever with a function with type Text -> ConduitT Text Event m ().

Transforming XML (events) directly

So now we know how to parse XML into custom data types, and we know how to render XML from custom data types, all in a streaming fashion. What if we want to modify XML data so that our input and output are in the same XML format (i.e., conform to the same schema), only with some changes applied? For a toy example, suppose we want to append another keyword to the list of XML subject elements. We could of course combine the two steps we already learned about in a parse → transform → render approach, where the transform step could conveniently work with our Book data type, pushing XML processing to the start and end of the pipeline:

main :: IO ()
main =
  runConduitRes $
    parseFile def "data.xml"
      .| readXml
      .| transform
      .| writeXml
      .| output

transform :: (Monad m) => ConduitT Book Book m ()
transform = mapC (\b -> b {keywords = keywords b <> ["..."]})

Isn’t this nice and modular and all? Whenever this is a viable approach I recommend you do it that way.

But then, in real life, XML can get quite unwieldy, and some XML schemas are huge, complex beasts. In such cases it is way too much trouble to define and work with custom data types that are able to preserve all information we need to re-construct the XML representation. (By the way, did you notice we already would have lost the description element by now?) That’s even more annoying when we modify only small parts of an otherwise unchanged, complex XML document.

In my experience, in such situations we just have to bite the bullet and work with the XML Event stream directly. This is more low-level and inconvenient than the parse → transform → render approach, but it enables us to modify the XML data very precisely, and only modify the parts we are actually interested in while leaving everything else unchanged. (If you are familiar with XSLT this might vaguely remind you of an identity transform.)

We start pretty simliar to the previous examples by outlining a conduit pipeline that reads an XML file, turns it into a stream of Event values, transforms those values and finally renders them to STDOUT. Since we are now working directly with the Event stream the input and output streams of our transform component both have type Event.

{-# LANGUAGE OverloadedStrings #-}

module Transform where

import Conduit
import Control.Monad
import Data.Text
import Data.XML.Types
import Text.XML.Stream.Parse (parseFile)
import Text.XML.Stream.Render (content, def, renderBytes, tag)

main :: IO ()
main = runConduitRes $ parseFile def "data.xml" .| transform .| output

transform :: (Monad m) => ConduitT Event Event m ()
transform = _

output :: (MonadIO m, PrimMonad m) => ConduitT Event o m ()
output = renderBytes def .| stdoutC

Note that we now need both the Text.XML.Stream.Parse and Render modules which produce some name clashes when used together (most notably, the tag and content functions), so it’s time for explicit import lists.

Before we move on to implementing transform let’s revisit what we are trying to achieve. In each book element, we want to append another keyword to the list of subject elements. So we have to insert new subject elements into the stream, and we have to insert them at the right places. From the XML schema we know that the list of subject elements may be empty, and we know that the subject elements, if present, are always the last children of a book element. In other words, since our new subject element has to be appended to the list (i.e., it becomes its last element) it has to be inserted right before the closing </book> tag, regardless of whether the list is empty or not.

So how about that: We look at the Event stream and just pass on its values until we encounter an Event signalling a closing </book> tag. Then we stop, call the tag function to create a series of Event values representing a new subject element and send them downstream. After that we send the Event for the </book> tag downstream as well and continue looking for more </book> tags.

transform :: (Monad m) => ConduitT Event Event m ()
transform = do
  takeWhileC (/= EventEndElement "book")
  atEndOfStream <- nullC
  unless atEndOfStream $ do
    tag (dc "subject") mempty (content "...")
    takeC 1
    transform

dc :: Text -> Name
dc n = Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc")

Since we are now dealing with the “raw” Event stream (as opposed to higher level parsers based on the tag function) the functions from the more general Conduit module come in handy. We use takeWhileC to skip over (and send downstream) all Event values that don’t denote a closing </book> element. That means when this function stops consuming values we are either right before a </book> tag or we have reached the end of input. If the latter is not the case (which we check with nullC) we first insert a new subject element into the output stream with the tag function, then pass on exactly one more value from the input stream with takeC 1 (which, thanks to the predicate given to takeWhileC, can’t be anything but an EventEndElement "book" aka a </book> tag) and finally repeat the process by calling transform recursively.

Conclusion

In this post we have worked through examples for reading, writing and transforming XML in a streaming fashion. I hope you agree with me that this, while not as convenient as with a DOM-based parser, was actually quite doable. In my experience it just takes some time getting used to thinking in terms of a flat stream instead of a DOM tree, particularly if one has only used DOM-based parsers before.

Granted, some use cases are more awkward than others when using a streaming parser. For example, compiling data from a lot of different parts of an XML document is easy with a DOM representation but tedious with a stream representation. However, in my experience most XML documents that are so large that they require a streaming approach usually (at some level) consist of a very long but linear list of not too deeply structured records of some kind. Particularly when using the tag functions discussed above a streaming parser is usually very well suited for such XML documents.

If you would like to see more examples/exercises I recommend my xml-processing repository which contains examples built with both the DOM-based and the streaming parser from the xml-conduit package, as well as with other Haskell packages and in other languages.

Epilogue: runtime statistics

To justify the claim that a streaming parser is much more memory-efficient than a DOM-based parser it’s about time to look at some numbers. If you recall the first example above when we read XML into our Book data type – the following program does the same but uses the DOM-based parser from the xml-conduit package instead of the streaming parser:

{-# LANGUAGE OverloadedStrings #-}

module ReadDom where

import Data.Maybe
import Data.Text
import Text.XML
import Text.XML.Cursor
import Prelude hiding (readFile)

data Book = Book
  { isbn :: Maybe Text,
    title :: Text,
    author :: Text,
    date :: Maybe Text,
    keywords :: [Text]
  }
  deriving (Show)

main :: IO ()
main = do
  xml <- fromDocument <$> readFile def "data.xml"
  mapM_ print $ xml $/ element "book" &| mkBook

mkBook :: Cursor -> Book
mkBook xml =
  Book
    { isbn = listToMaybe $ attribute "isbn" xml,
      title = mconcat $ xml $/ element (dc "title") &/ content,
      author = mconcat $ xml $/ element (dc "creator") &/ content,
      date = listToMaybe $ xml $/ element (dc "date") &/ content,
      keywords = xml $/ element (dc "subject") &/ content
    }

dc :: Text -> Name
dc n = Name n (Just "http://purl.org/dc/elements/1.1/") (Just "dc")

Let’s inflate our XML file to contain not just two but 200,000 books (amounting to a file size of 47 MiB) and run it through both the streaming and the DOM-based version, gathering some runtime statistics along the way!

Here are the stats from my laptop running GHC 9.6.2 for the streaming version:

$ cabal run -O2 -- read +RTS -s > /dev/null

  27,353,012,488 bytes allocated in the heap
      17,311,640 bytes copied during GC
         181,432 bytes maximum residency (111 sample(s))
          33,728 bytes maximum slop
               7 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6514 colls,     0 par    0.060s   0.065s     0.0000s    0.0001s
  Gen  1       111 colls,     0 par    0.015s   0.015s     0.0001s    0.0003s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.853s  (  4.839s elapsed)
  GC      time    0.075s  (  0.080s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    4.928s  (  4.920s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    5,636,624,439 bytes per MUT second

  Productivity  98.5% of total user, 98.4% of total elapsed

And here is the DOM-based version:

$ cabal run -O2 -- read-dom +RTS -s > /dev/null

  27,471,655,576 bytes allocated in the heap
   4,237,957,944 bytes copied during GC
     571,523,864 bytes maximum residency (17 sample(s))
       4,545,184 bytes maximum slop
            1572 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6727 colls,     0 par    2.616s   2.623s     0.0004s    0.0063s
  Gen  1        17 colls,     0 par    1.979s   1.980s     0.1165s    0.5329s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.361s  (  4.338s elapsed)
  GC      time    4.595s  (  4.603s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time    8.957s  (  8.950s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,299,123,852 bytes per MUT second

  Productivity  48.7% of total user, 48.5% of total elapsed

Probably the most interesting number is the “bytes maximum residency” which according to the GHC User’s Guide is an approximation of the “maximum space actually used by your program”. In the streaming case this is a mere 0.17 MiB while the DOM-based approach gobbles 545 MiB. Of course there might also be a memory leak hiding somewhere in my program (in either the streaming or the DOM-based version, or both) but I daresay these numbers make a point for streaming XML parsers.