If You Could Write HTML, You Could Make Ebooks

Shuqi Khor
Aug 24, 2023

Photo by Spencer on Unsplash

You might have heard of the EPUB format. It is the standard e-book format on most e-book platforms (Kindle, which uses Amazon’s own formats, is the main exception).

Believe it or not, it is just a bunch of HTML files zipped together 😂

See the .xhtml files inside the EPUB? They’re essentially just HTML

Unzipping an EPUB

This may sound dumb, but try it for yourself:

  1. First, download a DRM-free EPUB, like Metamorphosis by Kafka.
  2. Change the file extension to .zip (e.g. rename it to pg5200-images-3.zip).
  3. Unzip it.

And voilà! You just got yourself the source files of an EPUB!
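
If you would rather skip the renaming step, the same thing can be sketched in a few lines of Python, since the standard zipfile module does not care about the file extension. The function names and the output folder below are placeholders of my own, not part of any EPUB tool:

```python
import zipfile


def list_epub_entries(epub_path):
    """Return the names of all files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_path) as archive:
        return archive.namelist()


def extract_epub(epub_path, target_dir):
    """Unpack an EPUB into target_dir, exactly like unzipping it by hand."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(target_dir)


# Example (assumes the downloaded file sits in the current folder):
# extract_epub("pg5200-images-3.epub", "metamorphosis-source")
```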

Tools You Need to Edit an EPUB

I’ll briefly explain the EPUB file structure later. For now, let’s look at these tools that I use every day:

1. Visual Studio Code / Sublime Text / Any IDE

Since you are most likely an experienced developer, just use your favorite editor. As usual, you can initialize a Git repository (optional), open the root folder in a window and save it as a workspace/project before you start.

My usual work involves cleaning up messy EPUB exports from Adobe InDesign. Even though I mostly code in VSCode, I use Sublime Text for EPUB editing, as its UI makes regex find-and-replace much easier.

2. Calibre

Calibre is a popular free application for managing and editing e-books. It has a built-in editor designed for EPUB editing.

What I like about it is not the editor, though, as I prefer my usual one.

I like that Calibre can open a folder for editing directly, unlike Sigil, which can only open zipped EPUB files.

You can open a folder, just like in VSCode or Sublime Text

The coolest feature it provides is the ability to subset embedded fonts. Subsetting removes the characters your content never uses from the font files, which cuts down the EPUB file size.

This is especially important for East Asian fonts (Chinese / Japanese), as those font files are huge.

Subsetting fonts can cut down the EPUB file size by a lot

That’s why I always use Calibre for the final touch, after I have completed all the content.
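
To get a feel for what “based on your content” means, here is a rough Python sketch that collects every distinct character used in a set of XHTML pages. This is not how Calibre implements subsetting (a real subsetter works per font and rewrites the font file); it only illustrates the idea that a book typically needs a small fraction of the glyphs in a CJK font. The function name is my own:

```python
import xml.etree.ElementTree as ET


def used_characters(xhtml_sources):
    """Collect every distinct non-whitespace character found in the text
    of the given XHTML documents (passed as strings or bytes)."""
    chars = set()
    for source in xhtml_sources:
        root = ET.fromstring(source)
        # itertext() walks all text nodes, ignoring the tags themselves.
        for text in root.itertext():
            chars.update(ch for ch in text if not ch.isspace())
    return chars


page = "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>aab</p><p>b c</p></body></html>"
print(sorted(used_characters([page])))  # prints ['a', 'b', 'c']
```

Only the characters in that set need to survive in the embedded font; everything else is dead weight.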

3. eCanCrusher

eCanCrusher is just a zipping tool for EPUB. Why use it instead of 7-Zip, though? Because EPUB has a special requirement when zipping: the mimetype file must be the first entry in the archive and must not be compressed.

If you’re familiar with the terminal, you can do this instead (from Stack Overflow):

cd "folder of epub content"

# add mimetype first, stored without compression (-0)
zip -0 -X ../file.epub mimetype

# add the rest with maximum compression (-9)
zip -9 -X -r -u ../file.epub *

…but eCanCrusher makes it super easy, as you just need to drag and drop. My only complaint is that its icon isn’t very visually appealing. Luckily, you can easily replace an app icon on a Mac.
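
If you prefer a script over either option, the same packing rule can be sketched with Python’s standard zipfile module. This is a minimal version under my own assumptions: the function name is made up, and the output file is expected to live outside the source folder so the archive does not try to include itself.

```python
import zipfile
from pathlib import Path


def pack_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with mimetype first and uncompressed.

    Assumes epub_path lies outside source_dir.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype entry goes in first, stored without compression.
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed as usual.
        for path in sorted(source.rglob("*")):
            name = path.relative_to(source).as_posix()
            if path.is_file() and name != "mimetype":
                archive.write(path, name, compress_type=zipfile.ZIP_DEFLATED)


# Example with placeholder names:
# pack_epub("my-book-folder", "file.epub")
```

It is still worth running the result through a validator, as a script like this only takes care of the zipping rule.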

4. Pagina EPUB-Checker

When you upload your EPUB to Google Play Books, it runs your file through a validator before your book can go live. The validator is EPUBCheck, an open-source program maintained on behalf of the W3C.

Pagina EPUB-Checker is a GUI tool that features the same validator engine. All you need to do is drag your root folder or the compiled EPUB into it, and it will tell you all the errors:

No errors!

5. Any EPUB Reader

You need at least one EPUB reader to view your compiled EPUB file. I keep several just to be sure, such as Calibre, Adobe Digital Editions and Thorium Reader, because each of them behaves very differently.

In fact, it’s quite a challenge to get your EPUB to display reliably across readers, especially when your pages use special CSS layouts (just like email clients, ughh), so it’s important to keep your HTML/CSS as simple as possible.

EPUB 3 File Structure

Before I start: you can download any public domain EPUB, whether from Project Gutenberg or GitHub, and use it as your template.

The latest version of the EPUB format is EPUB 3.3. EPUB 3 allows JavaScript in the spec, but reader support for it is limited and inconsistent, partly due to security concerns, so don’t rely on it.

Below are the must-have files that you’d find in any EPUB file. Other than these, you’re free to organise other files in your own way.

mimetype

First, you need a file called mimetype in the root folder, without any file extension. It should contain only this exact string, with no trailing newline or spaces:

application/epub+zip
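
As a quick sanity check on a compiled file, the sketch below inspects the mimetype entry of an EPUB using Python’s zipfile module. It is not a replacement for a real validator, and the function name and messages are my own:

```python
import zipfile

EXPECTED = b"application/epub+zip"


def mimetype_problems(epub_path):
    """Return a list of problems found with the mimetype entry (empty if fine)."""
    problems = []
    with zipfile.ZipFile(epub_path) as archive:
        entries = archive.infolist()
        if not entries or entries[0].filename != "mimetype":
            return ["mimetype is not the first entry in the archive"]
        first = entries[0]
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        if archive.read(first) != EXPECTED:
            problems.append("mimetype does not contain the exact string")
    return problems
```

An empty list means the entry looks right; a stray newline at the end of the file is enough to trigger the last message.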

META-INF/container.xml

Next, you need a container.xml in a folder named META-INF. It points to the location of an .opf file (an XML file that defines the book details) relative to the root folder, like this:

<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
<rootfiles>
<rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
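
This is also how a reading app finds the rest of the book. A minimal Python sketch of that lookup, using only the standard library (the function name is my own):

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml):
    """Return the full-path of the first <rootfile> in a container.xml document."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]
```

Fed the container.xml shown above, it returns content.opf.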

content.opf

You can name this file anything you want, as long as container.xml points to it. It should be an XML file that lists the metadata, the files (manifest) and the page order (spine) of your e-book.

Here’s an example:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="BookId" prefix="calibre: https://calibre-ebook.com">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title id="BookTitle">一万个你也比不上这个你</dc:title>
<dc:creator id="BookAuthor">许书芹</dc:creator>
<dc:identifier id="BookId">isbn:9789672466017</dc:identifier>
<dc:language>zh</dc:language>
<dc:date>2019-06-18T16:00:00+00:00</dc:date>
<dc:publisher>Odonata Publishing Sdn Bhd</dc:publisher>
<meta name="cover" content="cover.jpg"/>
<meta property="belongs-to-collection" id="BookSeries">恋习</meta>
<meta refines="#BookSeries" property="collection-type">series</meta>
<meta refines="#BookSeries" property="group-position">3</meta>
</metadata>
<manifest>
<item id="cover.xhtml" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="titlepage" href="Text/titlepage.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter1" href="Text/chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter2" href="Text/chapter2.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter3" href="Text/chapter3.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter4" href="Text/chapter4.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter5" href="Text/chapter5.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter6" href="Text/chapter6.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter7" href="Text/chapter7.xhtml" media-type="application/xhtml+xml"/>
<item id="copyright" href="Text/copyright.xhtml" media-type="application/xhtml+xml"/>
<item id="nav.xhtml" href="Text/nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
<item id="style.css" href="Styles/style.css" media-type="text/css"/>
<item id="cover.jpg" href="Images/cover.jpg" media-type="image/jpeg" properties="cover-image"/>
</manifest>
<spine>
<itemref idref="cover.xhtml" linear="no"/>
<itemref idref="titlepage"/>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
<itemref idref="chapter3"/>
<itemref idref="chapter4"/>
<itemref idref="chapter5"/>
<itemref idref="chapter6"/>
<itemref idref="chapter7"/>
<itemref idref="copyright"/>
</spine>
</package>
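
To see how the manifest and spine work together, here is a small Python sketch that resolves the spine’s idref values against the manifest and returns the files in reading order. The function name is my own, and the sketch ignores details such as the linear attribute:

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(opf_xml):
    """Return the hrefs of the spine items, in the order a reader shows them."""
    root = ET.fromstring(opf_xml)
    # Map each manifest id to the file it points at.
    hrefs = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine only lists ids; look each one up in the manifest.
    return [
        hrefs[ref.attrib["idref"]]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]
```

For the example above, the result starts with Text/cover.xhtml and Text/titlepage.xhtml and ends with Text/copyright.xhtml. Files such as style.css are in the manifest but not in the spine, so they never show up as pages.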

toc.ncx

This file is required in EPUB 2, an older version of the EPUB format. If you find it in an EPUB 3 book, it’s there for backward compatibility.

nav.xhtml or toc.xhtml

If you see this file, the book must be EPUB 3. The file can be named anything, as long as it is declared in the manifest section of content.opf:

<item id="nav.xhtml" href="Text/nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
<!-- The item with properties="nav" will be used as the Table of Contents -->

This file must contain a <nav> tag, and inside the <nav> there must be an <ol> tag.
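
As a rough illustration of that structure, the Python sketch below pulls the table of contents out of a navigation document. The sample document and the function name are made up for this example; a real nav.xhtml usually also has a <head> and a heading inside the <nav>.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def toc_entries(nav_xhtml):
    """Return (title, href) pairs from the first <nav> element's <ol> list."""
    root = ET.fromstring(nav_xhtml)
    nav = root.find(f".//{XHTML}nav")
    if nav is None:
        raise ValueError("no <nav> element found")
    ol = nav.find(f"{XHTML}ol")
    if ol is None:
        raise ValueError("<nav> has no <ol> child")
    return [
        ("".join(link.itertext()).strip(), link.attrib.get("href"))
        for link in ol.iter(f"{XHTML}a")
    ]


# A hypothetical, stripped-down navigation document.
SAMPLE_NAV = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<body>
<nav epub:type="toc">
<ol>
<li><a href="chapter1.xhtml">Chapter 1</a></li>
<li><a href="chapter2.xhtml">Chapter 2</a></li>
</ol>
</nav>
</body>
</html>"""

print(toc_entries(SAMPLE_NAV))
# prints [('Chapter 1', 'chapter1.xhtml'), ('Chapter 2', 'chapter2.xhtml')]
```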

Difference Between EPUB 2 and 3

This is not very important. TL;DR Just ignore EPUB 2 and create EPUB 3 whenever you can.

In EPUB 3, it is possible to have a fixed layout just like a PDF, but I won’t go into details as it’s not very useful except for comics. Some platforms, like Google Play, accept PDF directly for fixed layout.

Technically, the most noticeable difference is, as mentioned above, that EPUB 2 uses a toc.ncx file for navigation, while EPUB 3 uses any XHTML file declared with the properties="nav" attribute in content.opf.

What about EPUB 1?

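TL;DR you don’t need to know about it, because there was never really an EPUB 1. EPUB 2 was the first release under the EPUB name, succeeding the older Open eBook (OEB) format.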

Difference Between XHTML and HTML

doctype

XHTML files in an EPUB should follow this structure (notice the XML declaration on the first line and the namespace attributes on the <html> tag):

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
<!-- headers here -->
</head>
<body>
<!-- content here -->
</body>
</html>

Single Tags

In XHTML, every single (void) tag must be closed with a slash at the end of the tag:

<meta name="viewport" content="width=device-width" />
<br />
<hr />
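
An easy way to catch unclosed tags before a reader chokes on them is to run each page through a strict XML parser. Here is a minimal Python sketch using the standard library; the function name is my own, and it only checks that the file is well-formed XML, not that it is valid EPUB content:

```python
import xml.etree.ElementTree as ET


def xhtml_error(source):
    """Return the parser's error message if source is not well-formed XML, else None."""
    try:
        ET.fromstring(source)
    except ET.ParseError as error:
        return str(error)
    return None


print(xhtml_error("<p>line one<br />line two</p>"))  # prints None
print(xhtml_error("<p>line one<br>line two</p>"))    # prints a "mismatched tag" error
```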

HTML Entities

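Most named HTML entities are unusable in EPUB, because XHTML is parsed as XML and XML only predefines five of them (&amp;, &lt;, &gt;, &quot; and &apos;). For example, the &copy; entity for the copyright symbol: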

Error displayed when the XHTML file is opened in web browser

Instead, you can use a numeric character reference, such as &#169; for the copyright symbol (or simply type the character itself, since the file is UTF-8).
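
If you have many files to clean up, the conversion can be scripted. The sketch below uses the entity table in Python’s standard html.entities module to turn named entities into numeric ones, while leaving the five XML built-ins alone. The function name is my own, and unknown names are left untouched rather than guessed:

```python
import re
from html.entities import name2codepoint

# These five are predefined in XML, so they are safe to keep as they are.
XML_BUILTIN = {"amp", "lt", "gt", "quot", "apos"}


def numeric_entities(text):
    """Replace named HTML entities such as &copy; with numeric ones such as &#169;."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILTIN or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


print(numeric_entities("&copy; 2023 Tom &amp; Jerry"))
# prints &#169; 2023 Tom &amp; Jerry
```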

That’s All, Folks

If you’re writing your own e-book, or have accepted a job converting PDFs to EPUB, I hope this simple introduction helps you kick-start your journey.
