How to Use Python in Ebook Production with Orca Book Publishers’ Bruce Keith

Learn how Python scripts make ebook production easier for Orca Book Publishers.

Subject(s): Blog
Resource Type(s): Tools
Audience: Technical

Note: This is a guest post from Bruce Keith, Digital Publishing Specialist at Orca Book Publishers.

Introduction

In 2022, Orca Book Publishers had a dozen accessible titles that had been remediated via conversion projects with organizations such as Books BC and eBOUND Canada. The books had been created as mostly accessible EPUBs by outsourced ebook developers. When Orca made a commitment to creating accessible ebooks, the immediate goals were to pursue Benetech certification, with an eye to adopting a born-accessible workflow, and to start remediating backlist titles.

Orca has three main streams of books from an EPUB point of view: highly illustrated non-fiction, fiction with few or no images, and picture books. We started by remediating the fiction titles that were already mostly accessible and bringing them up to Benetech standards. Concurrently, we brought the non-fiction production in-house to begin to develop a functional accessible workflow. Non-fiction titles usually feature 80+ illustrations and photographs, multiple sidebars, a glossary, index, and a bibliography.

In publishing circles, a fair amount of time is spent bemoaning the shortcomings of InDesign as a platform for creating good EPUBs, let alone making accessible ones. With a complex design, you can spend a lot of time and effort prepping an InDesign file to export a “well-formed” file and still end up with a “messy” end result. Instead, Orca’s approach was to ignore InDesign as much as possible, export the bare necessities (styles, table of contents, page break markers, etc.), clean out the junk in the EPUB it produces using a series of scripted search and replaces, and then rely on post-processing to produce well-formed, accessible EPUBs in a more efficient manner.

To that end, we started building two things: a comprehensive standard structure for an Orca ebook, with accompanying documentation, and a series of Python scripts to apply that structure to EPUBs. These scripts needed to be robust enough both to work with new books and to remediate older titles, which spanned everything from old EPUB 2 files to mostly accessible titles that didn’t quite meet Benetech standards.

Python in EPUB Production

Python was the obvious choice for these tasks. It is a programming language well suited to text and data manipulation, it is highly extensible, with thousands of external libraries available, and it has a focus on readability. It is freely available for macOS, Windows, and Linux, and many systems already have it installed.

Python is easy to learn and fairly easy to use. You can simply write a Python script in a text file, e.g.:

print('Enter your name:')
name = input()
print('Hello, ' + name)

Then save it as script.py and run it using a Python interpreter. As a general rule, writing and running Python scripts from within an IDE (Integrated Development Environment) like Visual Studio Code, a free IDE created and maintained by Microsoft, makes this pretty simple. Using VS Code allows a developer to easily modify scripts and then run them from within the same application.

Regular Expressions

The other important part of the process is regular expressions (regex), which are well worth learning as thoroughly as ebook developers can, even if they don’t dive into Python. This is a system of patterns that allows you to search and replace highly complex strings of code.

For instance, if you wanted to replace all the <p>’s in a glossary with <li>’s:

<p class="glossary"><b>Regular expression</b>: a sequence of characters that specifies a match pattern in text.</p>

You could search for:

<p class="glossary">(.*?)</p>

where the part in parentheses is a capture group that matches any run of text, and replace it with:

<li class="glossary">\1</li>

For each occurrence found, the bit in the parentheses would be stored and then reinserted correctly in the new string.
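In Python, the same search and replace can be sketched with the standard re module. This is a minimal illustration rather than Orca’s actual script; the variable names and the sample string are made up for the example.

```python
import re

# Sample glossary entry, as it might appear in an exported XHTML file.
html = '<p class="glossary"><b>Regular expression</b>: a match pattern.</p>'

# (.*?) is a non-greedy capture group: it grabs everything up to the
# first closing </p> tag. \1 in the replacement reinserts that text.
pattern = r'<p class="glossary">(.*?)</p>'
replacement = r'<li class="glossary">\1</li>'

result = re.sub(pattern, replacement, html)
print(result)
```

Note that by default the dot does not match line breaks, so an entry that spans several lines would need the re.DOTALL flag passed to re.sub.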

Once you start to use regexes, you’ll quickly get addicted to their power and flexibility, and quite a few text editors (and even InDesign, via GREP) support regular expressions.

Scripting Python

With these two tools you can write a fairly basic script that opens a folder (an uncompressed EPUB), loops through it to find a file named glossary.xhtml, and replaces each <p class="glossary"> tag with an <li>, or makes whatever other change you might need. You can add more regexes to change the <title> to <title>Glossary</title>, add the proper section epub:types and roles, and more. InDesign tends to export fairly regular EPUB code once you clean out the junk, so if you create a standard set of styles, you can clean and revise the whole file in a few keystrokes.
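A script of that kind might look roughly like the sketch below. It assumes the EPUB has already been unzipped into a folder; the function name fix_glossary and the GLOSSARY_FIXES list are hypothetical names chosen for this example, not taken from Orca’s repository.

```python
import re
from pathlib import Path

# Hypothetical (pattern, replacement) pairs for the glossary file.
GLOSSARY_FIXES = [
    (r'<p class="glossary">(.*?)</p>', r'<li class="glossary">\1</li>'),
    (r'<title>.*?</title>', '<title>Glossary</title>'),
]

def fix_glossary(epub_folder):
    """Apply GLOSSARY_FIXES to every glossary.xhtml under epub_folder.

    Returns the list of files that were actually changed.
    """
    changed = []
    # rglob searches the folder and all of its subfolders.
    for path in Path(epub_folder).rglob('glossary.xhtml'):
        original = path.read_text(encoding='utf-8')
        text = original
        for pattern, replacement in GLOSSARY_FIXES:
            text = re.sub(pattern, replacement, text)
        if text != original:
            path.write_text(text, encoding='utf-8')
            changed.append(path)
    return changed
```

Calling fix_glossary('my-book') would rewrite the file in place. A real script would also need to wrap the new <li> elements in a list element, which this sketch leaves out.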

Taking that one step further, if you ensure that the individual files in an EPUB are named according to their function (e.g., about-the-author.xhtml, copyright.xhtml, dedication.xhtml), you can keep custom lists of search/replaces that are specific to each file. That way, things like epub:types and ARIA roles are applied automatically, and existing text in places like the OPF file can be replaced with new standardized text.
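One way to organize such file-specific lists is a dictionary keyed by file name, as in this sketch. The dictionary, the function name, and the assumption that each file contains a bare <section> tag are all hypothetical; the epub:type and role values come from the EPUB structural semantics vocabulary and DPUB-ARIA, and which ones to apply is a house decision.

```python
import re

# Hypothetical rules keyed by file name: each value is a list of
# (pattern, replacement) pairs applied only to that file.
RULES_BY_FILE = {
    'dedication.xhtml': [
        (r'<section>',
         '<section epub:type="dedication" role="doc-dedication">'),
    ],
    'glossary.xhtml': [
        (r'<section>',
         '<section epub:type="glossary" role="doc-glossary">'),
    ],
}

def apply_rules(file_name, text):
    """Run the rules registered for file_name; other files pass through."""
    for pattern, replacement in RULES_BY_FILE.get(file_name, []):
        text = re.sub(pattern, replacement, text)
    return text

print(apply_rules('dedication.xhtml', '<section><p>For M.</p></section>'))
```

Supporting a new kind of file then means adding one more entry to the dictionary rather than writing a new script.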

If you build basic functions to perform search and replaces, you can continually update and revise the list of things you want them to fix as you discover both InDesign’s and your designers’ quirks, such as moving spaces outside of spans or restructuring the headers. If you can conceptualize what you want to do, you can build a regex to do it and just add it to the list.
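As a rough sketch of that idea, the function below runs a text through a list of fixes in order. The list contents are illustrative only: the class name chapter-title is invented, and the patterns are simplified versions of the kinds of fixes described above.

```python
import re

# A growing list of fixes: add a new (pattern, replacement) pair
# whenever a new quirk turns up.
CLEANUP = [
    # Move whitespace at the start of a span to just before it.
    (r'(<span[^>]*>)(\s+)', r'\2\1'),
    # Move whitespace at the end of a span to just after it.
    (r'(\s+)(</span>)', r'\2\1'),
    # Turn a styled paragraph into a real heading (made-up class name).
    (r'<p class="chapter-title">(.*?)</p>', r'<h1>\1</h1>'),
]

def clean(text, rules=CLEANUP):
    """Apply each (pattern, replacement) pair to text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = '<p>the<span class="i"> word </span>here</p>'
print(clean(sample))
```

Because the rules run in order, a later rule can rely on the cleanup done by an earlier one.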

You can also build multiple scripts for different stages of the process or expand into automating other common tasks. For instance, the Orca toolset currently has the following scripts:

You can see the InDesign cleaning script here: github.com/b-t-k/epub-python-scripts as a basic example. As we continue to clean up and modify the rest, they will slowly be added to the repository.
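One example of a common task worth automating is packing the edited folder back into an .epub file. The sketch below uses Python’s standard zipfile module; the function name pack_epub is hypothetical and this is not one of the scripts in the repository above. The only firm rule it encodes is from the EPUB container format: the mimetype file must be the first entry in the archive and must be stored uncompressed.

```python
import zipfile
from pathlib import Path

def pack_epub(folder, epub_path):
    """Zip an uncompressed EPUB folder back into a single .epub file."""
    folder = Path(folder)
    mimetype = folder / 'mimetype'
    with zipfile.ZipFile(epub_path, 'w') as zf:
        # mimetype goes in first, with no compression.
        zf.write(mimetype, 'mimetype', compress_type=zipfile.ZIP_STORED)
        # Everything else is added with normal compression.
        for path in sorted(folder.rglob('*')):
            if path.is_file() and path != mimetype:
                zf.write(path, path.relative_to(folder).as_posix(),
                         compress_type=zipfile.ZIP_DEFLATED)
```

Write the output file somewhere outside the folder being packed, and consider skipping system files such as .DS_Store, which this sketch does not filter out.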

As you can see, learning and using Python in your workflow can speed up a lot of repetitive and time-consuming tasks and actually ensure a better-quality, more standardized book, which incidentally makes future changes to EPUBs much more efficient.

Documentation

Alongside all this, Orca maintains and continually revises a set of documents that records all the code and standards we have decided on. It is kept in a series of text files that automatically update a local website. It contains everything from the CSS solutions we use to specific lists of how TOCs are presented, our standard schema, how we deal with long descriptions, lists of epub:types and ARIA roles, and a record of pretty much any decision made about how Orca builds EPUBs. Because the website is searchable, a quick search finds the answer to most questions.

Our Books

This type of automation has allowed us to produce accessible non-fiction titles in-house and in a reasonable timeframe. Books like Open Science or Get Out and Vote! can be produced as Benetech-certifiable EPUBs in just a few days, even though they feature things like indexes, linked glossaries, long descriptions for charts, and a lot of alt-text that was written after the fact.

Cover: Open Science: Knowledge for Everyone by Monique Polak, illustrated by Catherine Chan.
Cover: Get Out and Vote! How You Can Shape the Future by Elizabeth Macleod, illustrated by Emily Chu.

And if we have the alt-text ready (which is now starting to happen as part of the workflow), producing a non-fiction title usually takes less than two days. The time frame does grow when we are remediating old EPUBs that were produced out-of-house, and we are giving serious thought to redoing those from scratch, as it might be quicker. With the new workflow established, remediating fiction titles (or producing them from scratch) now takes just a couple of hours, excluding QA.

Producing an Accessible EPUB

Orca’s production process has been continually evolving. We started by focusing on making accessible non-fiction EPUBs without alt-text, and then brought alt-text into the mix after about 9 months (two seasons)—the scripts meant it was easy to go back and update those titles after alt-text was created. Meanwhile we pursued Benetech certification for our fiction titles that were produced out-of-house and developed a QA process to ensure compliance. And just recently we have brought fiction production in-house as well.

At this point, as soon as the files have been sent to the printer, the InDesign files are handed over to produce the EPUB. Increasingly, the alt-text is produced before this stage and entered in a spreadsheet, and it is then merged into the completed EPUB. A “first draft” is produced and run through Pagina’s EPUBCheck and Ace by DAISY to ensure compliance. Then, along with a fresh export of the alt-text in a separate Excel file, it is sent to our production editor, who works through a checklist of code elements using BBEdit. He then views the files in Thorium and Apple Books, and occasionally in Colibrio’s excellent online Vanilla Reader, checking styles, hierarchy, and visual presentation, and listening to the alt-text.
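The alt-text merge step could be sketched as follows. This is an assumption-heavy illustration, not Orca’s script: it supposes the spreadsheet has been exported to CSV with two columns named image and alt (hypothetical names), and that images are matched by file name.

```python
import csv
import html
import io
import re

def load_alt_text(csv_file):
    """Read rows with hypothetical columns 'image' and 'alt' into a dict."""
    return {row['image']: row['alt'] for row in csv.DictReader(csv_file)}

def merge_alt_text(xhtml, alt_by_image):
    """Fill in the alt attribute of each <img> whose file name is listed."""
    def replace_img(match):
        tag = match.group(0)
        src = re.search(r'src="([^"]*)"', tag)
        if not src:
            return tag
        name = src.group(1).rsplit('/', 1)[-1]  # file name without the path
        if name not in alt_by_image:
            return tag  # leave images that are not in the spreadsheet alone
        # Escape quotes and angle brackets so the attribute stays valid.
        alt = html.escape(alt_by_image[name], quote=True)
        if re.search(r'\salt="[^"]*"', tag):
            # A function replacement avoids backslash surprises in alt text.
            return re.sub(r'\salt="[^"]*"', lambda m: f' alt="{alt}"',
                          tag, count=1)
        # No alt attribute yet: add one right after "<img".
        return tag.replace('<img', f'<img alt="{alt}"', 1)
    return re.sub(r'<img\b[^>]*>', replace_img, xhtml)

sheet = io.StringIO('image,alt\nfig-01.jpg,"A child holding a ""Vote"" sign"\n')
page = '<img src="../images/fig-01.jpg" alt=""/>'
print(merge_alt_text(page, load_alt_text(sheet)))
```

Keeping the alt-text in a spreadsheet and merging it by script is what makes it practical to add or update descriptions after the EPUB itself is finished.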

Changes come back, and usually within one or two rounds the EPUB is declared finished and passed on to the distribution pipeline. There, our Data Specialist does one last check of the metadata, ensuring it matches the ONIX files, and reruns EPUBCheck and Ace before sending it out.

Spreading the Workload

In the background, we have marketing and sales staff working on spreadsheets of all our backlist, writing and proofing alt-text for the covers and interior illustrations of the fiction books so it is ready to go as titles are remediated. The hope is to incorporate this cover alt-text into all of our marketing materials and websites as the work is completed. The editors, meanwhile, are just starting to incorporate character styles in Word (especially for specifying things like languages and italics vs. emphasis) and are working with authors to build alt-text creation into the existing caption-writing process.

The designers are slowly incorporating standardized character and paragraph styles into their design files and changing how they structure their documents to facilitate EPUB exports. They are also working with the illustrators to collect and preserve illustration notes, which help capture the intent of illustrations and can be used as a basis for alt-text. In addition, they are documenting cover discussions to support more interesting and accurate cover alt-text.

It will take a few more years, but eventually the whole process for producing born accessible, reflowable EPUBs should be fully in place.

The Future

Orca is currently working towards a goal of 300 Benetech accessible EPUB titles in our catalog by February 2024, including everything back to 2020. After that, we will continue to remediate the rest of our backlist of over 1,200 titles over the next few years.

As soon as the process for fiction EPUBs has solidified, we’d also like to start in on our picture books and ensure that these fixed-layout EPUBs are as accessible as possible. It is currently an extremely time-consuming task, but we hope that we can eventually work out a way to automate a lot of the repetitive work.

This means we need to continue to educate ourselves and our suppliers and work towards a way to standardize as many aspects of the workflow as possible. The more standards we create and maintain, the more automation we can employ.

And of course, this means learning even more Python…
