SAX parsing of character content

We override DocumentHandler::characters() so that our handler can act on the character content that the Xerces SAX parser finds in our XML document.

The basic idea is the same as in the previous posts, but we have to pay attention to one detail: there is no guarantee that a single call to characters() delivers the whole character content of an element. The parser may split it across several calls, which implies we have to add some logic to our handler class to accumulate the pieces correctly.
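To see why accumulating matters, here is a minimal sketch that does not use Xerces at all. The ChunkCollector class and the collectChunks() function are hypothetical names made up for this illustration: the first stands in for a handler, the second plays the role of a parser that delivers the content of one element in several pieces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a SAX handler; not part of Xerces.
class ChunkCollector
{
public:
    // Called once per chunk, like characters(): append, never overwrite.
    void characters(const wchar_t* buffer, std::size_t size)
    {
        assert(buffer != nullptr);
        text_.append(buffer, size);
    }

    // Called when the element ends: hand back the text and reset the state.
    std::wstring endElement()
    {
        std::wstring result;
        result.swap(text_);
        return result;
    }

private:
    std::wstring text_;
};

// Plays the role of the parser: delivers one element's content in pieces.
std::wstring collectChunks(const std::vector<std::wstring>& chunks)
{
    ChunkCollector collector;
    for (const std::wstring& chunk : chunks)
        collector.characters(chunk.data(), chunk.size());
    return collector.endElement();
}
```

Calling collectChunks({L"Bl", L"ack"}) gives back the whole word "Black", whereas a handler that assigned instead of appending would keep only the last piece.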

As an example we use an XML document like this:

<?xml version="1.0" encoding="UTF-8"?>
<train>
<car type="Engine">
<color>Black</color>
<!-- ... more stuff here -->
</car>
<car type="Baggage">
<color>Green</color>
<weight>80 tons</weight>
<!-- ... more stuff here -->
</car>

<!-- ... more stuff here -->

<car type="Caboose">
<color>Red</color>
<!-- ... more stuff here -->
</car>
</train>

As a result we would like to see this on standard output:

Engine has color Black
Baggage has color Green
...
Caboose has color Red

To get this, we change our SimpleHandler: we add a few private data members to keep track of the current element and its character content, rewrite the startElement() method, and add two new methods, characters() and endElement().

Here are the changes:

// ...

namespace
{
    // The wide literals below assume that XMLCh is wchar_t, as in Windows/MSVC builds of Xerces.
    const XMLCh* const ELEM_CAR = L"car";
    const XMLCh* const ATTR_TYPE = L"type";
    const XMLCh* const ELEM_COLOR = L"color";
}

class SimpleHandler : public HandlerBase
{
    // ...

private:
    bool isColor; // 1.
    std::wstring carType; // 2.
    std::wstring carColor; // 3.

public:
    SimpleHandler() : isColor(false) {} // 4.

    /**
     * override HandlerBase::startElement(name, attrs)
     */
    void startElement(const XMLCh* const name, AttributeList& attrs)
    {
        if(wcscmp(name, ELEM_CAR) == 0) // 5.
        {
            const XMLCh* const type = attrs.getValue(ATTR_TYPE);
            if(type != 0)
                carType = type;
        }
        else if(wcscmp(name, ELEM_COLOR) == 0) // 6.
        {
            isColor = true;
        }
    }

    /**
     * override HandlerBase::characters(buffer, size)
     */
    void characters(const XMLCh* const buffer, const XMLSize_t size)
    {
        if(isColor) // 7.
            carColor.append(buffer, size); // buffer is not guaranteed to be null-terminated
    }

    /**
     * override HandlerBase::endElement(name)
     */
    void endElement(const XMLCh* const name)
    {
        if(isColor) // 8.
        {
            std::wcout << carType.c_str() << " has color " << carColor.c_str() << std::endl;
            isColor = false;
            carColor.clear();
        }
    }
};

1. isColor is used to signal when the parser is working with a "color" element.
2. carType is the wide character string where we locally store the value for the "type" attribute for the current "car" element.
3. carColor is the wide character string for the current "color" character content.
4. Initially we are not inside a "color" element, so the flag starts as false.
5. If the starting element is a "car" we try to get the value of its "type" attribute. If we succeed, we store it in carType (2).
6. Otherwise, we check if the starting element is a "color".
7. We are in a "color" element: append the current chunk of characters to the carColor wide string.
8. The SAX parser has reached the end tag of a "color" element: we output the generated string containing its character content, and then reset the local color state.
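One detail about point 7 deserves a closer look: characters() receives a size together with the buffer, and the buffer should not be assumed to end with a null character. The sketch below is independent of Xerces and uses two hypothetical helpers, appendChunk() and colorFromWindow(), to show what can happen when the chunk is only a window into a larger buffer. Appending exactly size characters keeps the result clean.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append exactly `size` characters from `buffer`,
// without assuming that a terminating null follows them.
void appendChunk(std::wstring& target, const wchar_t* buffer, std::size_t size)
{
    assert(buffer != nullptr);
    target.append(buffer, size);
}

// Simulates a parser that hands out a window into a larger buffer.
std::wstring colorFromWindow()
{
    // Only the first 5 characters belong to the chunk we are given.
    const wchar_t raw[] = L"Black</color>";
    std::wstring color;
    appendChunk(color, raw, 5);
    return color;
}
```

Here colorFromWindow() returns just "Black". Writing color += raw instead would read up to the terminating null and also pick up the trailing "</color>" text.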

More details on Xerces (in its Java implementation) and SAX can be found in chapter 12 of Beginning XML by David Hunter et al. (Wrox).
