
Why can’t I format book descriptions properly?

Posted in Just for Writers

[Image: cat lying on keyboard]
You know you typed it correctly. What happened afterwards to turn your text into ill-spaced gibberish?

Over the years you've learned all the tricks to producing good-looking text in end-user applications like Word and Facebook. You know how to use <shift><enter> instead of <enter> to trick Facebook into giving you a line break without ending your message. You know how to use special keyboard control key clusters to enter non-English accented characters directly, instead of looking them up tediously in some sort of character-set chart and selecting them by mouse click.

And now you feel betrayed. All the beautiful text you enter into your book descriptions, add to your ebook's internal metadata, offer to Bowker, and use for author bios, editorial reviews, and all the rest at retailers and distributors… all of it loses paragraph breaks, turns smart-quotes into garbage, and generally looks like a pratfall.

I'm going to try to skim over this topic because it's arcane and geeky, and most people don't care about how we got here, but you need to understand where we are today so that you can work toward improvement. See this link for more on book descriptions.

Teletype machines

[Image: teletype machine reading paper tape]

God help me, but I'm actually old enough to have used teletype machines for beginner programming, and to have punched paper tape that guided cannibalized Friden machines (early electro-mechanical calculators) through producing invoices as a teenage summer job.

And this is basically where our story begins, in the setting of standards for character encoding that evolved commercially from the first Jacquard looms and became systematized as a standard via ASCII.

ASCII character encoding standard

(If you're not a computer geek, you might not know the proper pronunciation of ASCII: “ASS-kee”. If you don't understand what “character encoding” means, think of a computer about to display text from a string of binary numbers, and looking up the character to display from a list of matching values.)

With the rise of teletype and similar usage in the 50s and 60s, there was a need for a formal standard, and the first American Standard Code for Information Interchange was proposed in 1963, based on US English character requirements. Modern computers almost all use ASCII (or its backward-compatible successor, UTF-8) in some form. (IBM had a competing standard, EBCDIC: “EBB-seh-DIK”.)

[Image: chart of basic ASCII characters]

The basic ASCII character set is quite small and parochial to English and the US — no accented characters, a dollar sign but no other currency, limited punctuation.

For reference, here is a list of the basic character set for which no special coding is necessary, all of which can be found on a standard English computer keyboard:

  • space character
  • capital letters A–Z
  • lower-case letters a–z
  • digits 0–9
  • punctuation ! " ' , - . : ; ?
  • brackets ( ) [ ] { }
  • symbols # $ % * + / = < > \ @ _ ` | ~
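
The mapping is easy to inspect from any scripting language. Here's a minimal Python sketch (illustrative only) showing that the basic repertoire above occupies code points 0 through 127, one byte per character:

```python
# Each basic ASCII character has a numeric code point below 128,
# and encodes to exactly one byte.
for ch in ["A", "z", "0", "$", "~"]:
    print(ch, ord(ch), ch.encode("ascii"))

# Characters outside the set simply can't be encoded as ASCII:
try:
    "é".encode("ascii")
except UnicodeEncodeError as e:
    print("not ASCII:", e.reason)
```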

Because ASCII originated from the needs of physical devices that printed on paper, there are also “non-printable” control characters (for “controlling” devices), such as tab, line-feed, and carriage-return. Those are allowed but often ignored (converted into spaces) by software that no longer has line printers to consider. Other control characters (e.g., backspace) are still in the code but ignored everywhere.

The ASCII code standard introduced a lot of ambiguity about the meaning of some of the control characters (if you really want to know, read the ASCII article in depth), with different device manufacturers and operating system coders choosing different interpretations to serve their local needs. This is, in fact, the origin of our woes with how linefeeds (paragraphs) are treated on screen displays.

The inherent ambiguity of many control characters, combined with their historical usage, created problems when transferring “plain text” files between systems. The best example of this is the newline problem on various operating systems. Teletype machines required that a line of text be terminated with both “Carriage Return” (which moves the printhead to the beginning of the line) and “Line Feed” (which advances the paper one line without moving the printhead). The name “Carriage Return” comes from the fact that on a manual typewriter the carriage holding the paper moved while the position where the typebars struck the ribbon remained stationary. The entire carriage had to be pushed (returned) to the right in order to position the left margin of the paper for the next line.

DEC operating systems (OS/8, RT-11, RSX-11, RSTS, TOPS-10, etc.) used both characters to mark the end of a line so that the console device (originally Teletype machines) would work. By the time so-called “glass TTYs” (later called CRTs or terminals) came along, the convention was so well established that backward compatibility necessitated continuing the convention…

Unfortunately, requiring two characters to mark the end of a line introduces unnecessary complexity and questions as to how to interpret each character when encountered alone. To simplify matters plain text data streams, including files, on Multics used line feed (LF) alone as a line terminator. Unix and Unix-like systems, and Amiga systems, adopted this convention from Multics. The original Macintosh OS, Apple DOS, and ProDOS, on the other hand, used carriage return (CR) alone as a line terminator; however, since Apple replaced these operating systems with the Unix-based macOS operating system, they now use line feed (LF) as well. The Radio Shack TRS-80 also used a lone CR to terminate lines.
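The three conventions are easy to see side by side. Here's a small Python sketch (the strings are invented examples) showing the same two lines terminated each way, and how modern software copes by accepting all three:

```python
# The same two lines of text, terminated per three historical conventions.
dos_text  = "line one\r\nline two\r\n"  # CR+LF (Teletype, DEC, Windows)
unix_text = "line one\nline two\n"      # LF alone (Multics, Unix, modern macOS)
mac_text  = "line one\rline two\r"      # CR alone (classic Mac OS, Apple DOS)

# Python's splitlines() understands all three terminators,
# which is how cross-platform tools paper over the legacy:
for text in (dos_text, unix_text, mac_text):
    print(text.splitlines())  # ['line one', 'line two'] every time
```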

You get the idea. It was not only necessary to learn ASCII to program, but you had to understand the history and context of the environment — the operating system and the devices.

Many countries have extended the ASCII character set in various ways (the yellow, green, and purple lines in the chart below) to accommodate other language and punctuation characters, and some of those extensions are still in use.

In today's world, we tend to think of worldwide standards, but there are many issues like this still preserved in the tools we use, and text passing from one environment to another, with all of the non-text characters it contains, is still vulnerable to these problems of interpretation. This is why new standards arise.

Unicode

[Image: chart of ASCII vs UTF-8 adoption on the web]

ASCII was the most popular character encoding standard for the web until 2007, when it was finally surpassed by Unicode in the form of the UTF-8 standard, which is backward-compatible with ASCII. Today, UTF-8 is the de facto standard for the web, but that doesn't mean that people (and tools) have stopped accommodating ASCII in idiosyncratic ways.

What we have is a legacy problem, and the legacy part of it isn't going to go away any time soon.

Take a look at the full Unicode chart, in all its splendor, language by language, symbols, Braille, Egyptian hieroglyphs, etc., and scroll down through it for a while, with your eye on the sidebar. Note that ASCII's original non-printing device control characters and text characters occupy the same position in Unicode as they do in ASCII, even if they are no longer functional, which is how Unicode remains backwardly compatible with ASCII.
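That backward compatibility is easy to verify. A minimal Python sketch (illustrative only):

```python
# Pure ASCII text produces identical bytes whether encoded
# as ASCII or as UTF-8 -- that's the backward compatibility.
s = "Hello"
assert s.encode("ascii") == s.encode("utf-8")

# Characters beyond ASCII need multi-byte UTF-8 sequences:
print("é".encode("utf-8"))  # b'\xc3\xa9' -- two bytes
print(len("\u201c".encode("utf-8")))  # a left smart quote takes three
```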

The latest version of Unicode contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets. Quite some distance from the original ASCII, with its 128 code points, English only.

The underlying problem we have now isn't how to refer to the character we want — it's how to tell the interpreting software what character encoding system we're using, and finding out whether it uses the same one. Or not. I can refer to any character by its Unicode code point, if I want to, but only if I can use HTML to do it. It would also be nice if we could format the text a little (bold, italics, linefeeds).
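For example, HTML lets you name any Unicode character by its numeric code point, sidestepping encoding questions entirely. A short Python sketch using only the standard library (the sample string is invented):

```python
# HTML numeric character references name characters by code point:
# &#233; is U+00E9 (é), &#8220;/&#8221; are the smart double quotes.
import html

markup = "caf&#233; &#8220;smart quotes&#8221;"
print(html.unescape(markup))  # café “smart quotes”
```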

This problem requires us to understand some more standards…

HTML, XML, XHTML

[Image: HTML example]

HTML is the markup language used for presenting web pages that will be read by humans, including formatted text. It has a long history which we'll ignore here.

Its focus is on the presentation of a variety of information types for display by browsers and similar interfaces. Browsers are forgiving of errors in HTML markup, preferring to show as much of a page as possible despite errors. So there can be more than one way to code for styling, graceful ways to fail, or even raw code presented on the webpage without the page failing altogether.

[Image: XML example]

XML is a markup language used for presenting data structures for the internet that are readable both by humans and software. It is a strict language, suitable for transmitting data.

What happens if some of the data in an XML data snippet were HTML markup code? In other words, if the answer to “Who was the forty-second president of the U.S.A.?” were “<p><b>William Jefferson Clinton</b></p>”?

Wouldn't that confuse the XML interpreter, to see angle brackets used to mark up data content instead of XML code?

This was the reason that XHTML was created.

XHTML is an extension application of HTML that allows strict HTML to be used as data in XML.

Web content that is largely text-based is generally styled for presentation in a web page using the HyperText Markup Language (HTML). HTML has been the language of the World Wide Web since its inception and is still the most popular language for constructing web pages. HTML was based upon the Standard Generalized Markup Language (SGML), which has been in use for preparing electronic content in academic and professional publishing since the early 1990s.

XML was developed in the late 1990s as demand grew for ways to use the web for exchanging data and messages that didn’t have to be presented as human-readable web pages. XML is a much stricter language than SGML, so it is generally not possible to incorporate HTML-tagged content directly into an XML message. Responding to demand to make it possible to embed HTML in XML, the World Wide Web Consortium has defined an XML-compatible version of HTML, called XHTML. XHTML text fragments can be embedded in XML messages, provided this is allowed by the tagging rules of the XML application.
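The escaping that makes HTML safe to carry inside XML can be sketched in a few lines of Python, using the standard library (the example answer is the one from above):

```python
# To carry HTML markup as *data* inside an XML element, the angle
# brackets and ampersands must be escaped so the XML parser doesn't
# mistake them for XML tags.
from xml.sax.saxutils import escape, unescape

answer_html = "<p><b>William Jefferson Clinton</b></p>"
escaped = escape(answer_html)
print(escaped)  # &lt;p&gt;&lt;b&gt;William Jefferson Clinton&lt;/b&gt;&lt;/p&gt;

# The receiving system unescapes the data to recover the original HTML:
assert unescape(escaped) == answer_html
```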

So, why should you care about any of this? Because it manifests as an issue for our book descriptions and other formatted text snippets.

We enter our book description and other text data into whatever web form the retailer presents to us. But what do they accept, and what do they do with it? (See this detailed document (APPNOTE HTML markup in ONIX) from BISG for more.)

Traditional publishers don't end up with unformatted book descriptions and other rich text missing its paragraph breaks at online stores. Why is that?

ONIX

The book trade sends its data to its partners in the form of ONIX XML data feeds. (Some are no doubt still using spreadsheets, but I'm not talking about them.)

I've spoken about ONIX at some length elsewhere but for this discussion what matters is that the traditional publishers can use HTML-enabled fields to format their text snippets in an ONIX data system, and ONIX will send that data in XML form to their trading partners. It's not perfect — not all the trading partners are themselves ready to receive ONIX XML data — but it provides much more consistency than what we have to deal with as self-publishers.

We face different input forms for text data from each of our retailers, and from the distributors, too, if we use them. Some of those forms accept formatting, but they all have different rules. They're inconsistent with each other and poorly documented, and we can't predict how that text data will be marked up and converted into the data presented on the retailer online store pages.

I'm exploring an ONIX program right now (ONIXEDIT), but the only place I can send that XML datafeed as an indie is to distributors like PublishDrive and StreetLib — I'm not a trading partner in the conventional sense with Amazon, et alia. But it's a start. Maybe I can convince, say, Ingram to take an XML feed from me. Over time, I bet I can improve the situation.

In the meantime, all we can really do is document how each of our retailers behaves and try to guess the best approach for formatted text.

[Image: traditional vs indie publisher using ONIX]

 

 


6 Comments

  1. Seems it ought to be an Amazon problem, not the problem of the self-publishers it invites and encourages. They turn all kinds of input into ebooks – it shouldn’t be that hard to tell us what they want.

    At the very minimum, how to paragraph.

    August 21, 2017
    |Reply
    • But it’s not just Amazon: it’s every place we have to use a web page (HTML) form field or something similar to enter formatted text.

      Some accept HTML, or limited HTML (as in XHTML). Some accept rich text. Do they turn it into XHTML? Some accept no rich text or HTML, and we have to supply a linefeed twice, and hope for the best. Some turn two newlines into nominal HTML paragraph markups, and some don’t.

      Some, like Bowker and ebook CONTENT.OPF description files require older ASCII-based workarounds to put HTML formatting into fields that won’t accept that directly. If you want your book description formatted in your internal ebook metadata (CONTENT.OPF), for example, even just to get reliable paragraph breaks, you need to embed HTML into a field in a manner compatible with XML, according to these rules: XML Escape Characters.

      See this link for a good discussion of the potential complexities from a coder’s point of view (what’s behind all of those input forms and subsequent processing).

      And even when we do use a distributor like PublishDrive, its retail partners may not accept an XML feed, but only a text one (spreadsheets, not ONIX).

      Standards evolutions, like ASCII to Unicode or spreadsheets to XML, are relatively recent and sometimes slow to spread. And all the partners in a chain have to move forward, which they can never do at exactly the same time. That’s why the most recent companies, like PublishDrive, have the most advanced standards — no legacies to maintain.

      What the indie-facing companies CAN do, however, is to properly document exactly what format they expect to receive from us, for their best output of formatted text, instead of leaving us floundering along with trial and error, hoping to hit upon the optimal input requirements.

      August 21, 2017
      |Reply
      • The only indie-facing company I’m likely to deal with for a while is Amazon, so they could start it.

        And Amazon, being the 400-lb gorilla, can sometimes set de facto standards. Wish it would.

        August 21, 2017
        |Reply
        • In their defense, I’ll say that Amazon is one of the most reliable. Their input forms offer previews, which is a key element to ensuring we get it right.

          When you go to Bowker, for example, it may take 24 hours or longer to see the result of your entry in Bookwire, before you can tell whether the formatting worked. Very inefficient.

          Even if you’re “Amazon-only”, that still means Amazon, CreateSpace, and Bowker (or equivalent). There’s no escaping the problem. 🙂

          August 21, 2017
          |Reply
          • I know you go very wide – I, with my energy limitations, will never have that many books, so things like descriptions on Amazon have to count. Still fiddling with those.

            It’s not just the previews – you can see when things go live what they look like.

            I found, to my horror, that Look Inside had messed up my careful formatting – and it looked as if a Kindergartener had been playing with the computer. It wasn’t me – I called Amazon, and they admitted it was their fault (phew!) and fixed it as quickly as they could, but it had been fine. That I didn’t need.

            August 21, 2017
  2. […] right-facing single or double quotes) are not understood by all systems, depending upon things like character encoding schemes. When you create your book descriptions in a spreadsheet (with default smart quotes) and copy/paste […]

    March 10, 2018
    |Reply
