Configuring Cheerio
In this guide, we'll cover how to configure Cheerio to work with different types of documents, and how to use and configure the different parsers that ship with the library.
Parsing HTML with parse5
By default, Cheerio uses the parse5 parser for HTML
documents. parse5 is an excellent project that rigorously conforms to the HTML
standard. However, if you need to modify parsing options for HTML input, you may
pass an extra object to .load():
const cheerio = require('cheerio');
const $ = cheerio.load('<noscript><h1>Nested Tag!</h1></noscript>', {
  scriptingEnabled: false,
});
The example above sets the scriptingEnabled option to false, which causes the
contents of <noscript> tags to be parsed as HTML instead of plain text.
For a full list of options and their effects, have a look at the API documentation.
Fragment Mode
By default, parse5 treats documents it receives as full HTML documents and
will structure content in an <html> document element with nested <head> and
<body> tags.
const $ = cheerio.load('<li>Apple</li><li>Banana</li>');
$.html(); // => '<html><head></head><body><li>Apple</li><li>Banana</li></body></html>'
parse5 also supports a "fragment mode" that allows you to parse HTML
fragments, rather than complete documents. To use this mode, pass false as the
third argument to .load(); this boolean indicates whether the input is a full document:
// Note that we are passing `false`, as we are not parsing a full document.
const $ = cheerio.load('<li>Apple</li><li>Banana</li>', {}, false);
$.html(); // => '<li>Apple</li><li>Banana</li>'
This parses the input as a fragment: the markup is kept as-is, without being wrapped in <html>, <head>, and <body> elements.
Parsing XML with htmlparser2
By default, Cheerio uses htmlparser2 for XML documents. htmlparser2 is a
fast and memory-efficient parser that can handle both HTML and XML. To parse
XML, pass the xml option to .load():
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: true,
});
If you need to customize the parsing options for XML input, you may pass an
object as the xml option to .load(), with the options you want to change:
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    withStartIndices: true,
  },
});
When xml is set, the default options are:
{
  xmlMode: true, // Enable htmlparser2's XML mode.
  decodeEntities: true, // Decode HTML entities.
  withStartIndices: false, // Add a `startIndex` property to nodes.
  withEndIndices: false, // Add an `endIndex` property to nodes.
}
The options in the xml object are passed directly to htmlparser2, so any option that htmlparser2 accepts is valid in Cheerio as well.
For a full list of options and their effects, see the API documentation.
Using htmlparser2 for HTML
Some users may wish to parse markup with the htmlparser2 library, and traverse
and manipulate the resulting structure with Cheerio. This may be the case for
those upgrading from pre-1.0 releases of Cheerio (which relied on
htmlparser2), for those dealing with invalid markup (because htmlparser2 is
more forgiving [1]), or for those operating in performance-critical situations
(because htmlparser2 is often faster and the resulting DOM consumes less
memory).
To support these cases, you can simply disable xmlMode inside of the xml
option:
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    // Disable `xmlMode` to parse HTML with htmlparser2.
    xmlMode: false,
  },
});
.load() also accepts an htmlparser2-compatible data structure as its first
argument. Users may install htmlparser2, use it to parse input, and pass the
result to .load():
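import * as cheerio from 'cheerio';
import * as htmlparser2 from 'htmlparser2';

// `document` is the markup string to parse; `options` are htmlparser2 options.
const dom = htmlparser2.parseDocument(document, options);
const $ = cheerio.load(dom);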
The caveat of this method is that it still uses parse5's serializer, so
the resulting output will be HTML, not XML, and will not respect any of the
supplied options. Disabling xmlMode, as shown above, is therefore the
recommended approach.
You can also use Cheerio's slim export, which always uses htmlparser2. This
avoids loading parse5, which saves some bytes, e.g. in browser environments:
import * as cheerio from 'cheerio/slim';
Conclusion
In this guide, we explored how to configure Cheerio for parsing HTML and XML
documents using parse5 and htmlparser2 respectively. We also discussed how
to modify parsing options and use htmlparser2 directly.
Footnotes
1. Note that "more forgiving" means htmlparser2 has error-correcting mechanisms that aren't always a match for the standards observed by web browsers. This behavior may be useful when parsing non-HTML content.