opubWriter

the online wysiwyg epub editor

opubWriter Source Code available under licence

There has been enough interest in the source code of opubWriter for me to consider making it available under a Software Code Licence Agreement. Accordingly, I have added a page to the website to give details of the offer.

Source Code For Sale

Written by netkingcol

December 2, 2011 at 4:40 pm

Over 600 users

More than 600 people have now tried opubWriter, the online wysiwyg epub editor:

600 users

Written by netkingcol

July 11, 2011 at 4:39 pm

Enhancements to user management

After nine months of operation and with 350 people having registered to use opubWriter, an error report today revealed the incompleteness of the opubWriter user management functionality. A user reported that they had forgotten their password and could find no way to retrieve or reset it.

Rather hastily, then, I have provided two new functions to opubWriter:

  1. The ability to request a password reset. When a user correctly specifies their User Name and answers the security question that was defined at registration, a new password is emailed to them.
  2. Once a user has logged in, the ‘Change Password’ option is enabled. By selecting this option a user can change their password.

Both of these options are described in more detail in the Register and Login page of this blog.

Written by netkingcol

January 4, 2011 at 5:31 pm

Posted in epub, open publication, publishing

Checking path-rootless with a regular expression

Section 3.5.1 of the Open Container Format says:

The values of the full-path attribute MUST contain a “path-component” (as defined by RFC3986) which MUST only take the form of a “path-rootless” (as defined by RFC3986). The path components are relative to the root of the container in which they are used.

RFC3986 discusses the generic syntax of the Uniform Resource Identifier (URI) and declares a URI to comprise a hierarchical sequence of components with the layout:

       URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

       hier-part   = "//" authority path-abempty
                   / path-absolute
                   / path-rootless
                   / path-empty

The RFC gives a very thorough definition of each of these components, including path:

       path          = path-abempty    ; begins with "/" or is empty
                     / path-absolute   ; begins with "/" but not "//"
                     / path-noscheme   ; begins with a non-colon segment
                     / path-rootless   ; begins with a segment
                     / path-empty      ; zero characters

       path-abempty  = *( "/" segment )
       path-absolute = "/" [ segment-nz *( "/" segment ) ]
       path-noscheme = segment-nz-nc *( "/" segment )
       path-rootless = segment-nz *( "/" segment )
       path-empty    = 0<pchar>

These are the “five rules of path component disambiguation” and the RFC goes on to explain what segment-nz-nc and segment-nz mean. My question for epub validation is: how can I check that the full-path attribute in a <rootfile> in container.xml is of type path-rootless?

The first task is to work out what kind of URI we have in the full-path attribute; it’s just a string and it might contain anything – especially if someone edited it by hand. The C# language has a decent string class and it would be possible to analyse the value using methods like Contains and LastIndexOf. However, there’s a more powerful approach available – regular expressions.

RFC3986 itself provides a regular expression that will perform pattern matching on a URI:

       ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

You have to be a particular type of person to want to live and breathe regular expressions. I have dipped into them occasionally but I find that the interval between dips is just long enough for me to forget what I knew before. The regular expression shown here does the job, and that’s good enough for me – almost. I want to be able to identify which parts of a URI are present in the full-path attribute. Therefore, I extended this expression by inserting group names, resulting in this expression:

       ^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?

To see how this works, consider the following erroneous <rootfile> definition:

       <rootfile full-path="http://opubwriter.com/OPS/epub.opf?option=42#package-doc"
                 media-type="application/oebps-package+xml"/>

The full-path attribute clearly has, in addition to a path, a scheme (http), an authority (opubwriter.com), a query (option=42), and a fragment (package-doc). Running the following C# code shows how the regular expression analyses the full-path string:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// REGEX_URI holds the named-group expression; fullPath holds the attribute value.
Regex pathExpr = new Regex(REGEX_URI);
Match m = pathExpr.Match(fullPath);
if (m.Success)
{
    int i = 0;
    foreach (Group g in m.Groups)
    {
        // Unnamed groups report their number as their name and fall through the switch.
        string groupName = pathExpr.GroupNameFromNumber(i++);

        // Skip optional groups that took no part in the match.
        if (!g.Success)
        {
            continue;
        }

        switch (groupName)
        {
            case "scheme":
                Debug.WriteLine(String.Format("scheme: {0} at {1}", g.Value, g.Index));
                break;
            case "authority":
                Debug.WriteLine(String.Format("authority: {0} at {1}", g.Value, g.Index));
                break;
            case "path":
                Debug.WriteLine(String.Format("path: {0} at {1}", g.Value, g.Index));
                break;
            case "query":
                Debug.WriteLine(String.Format("query: {0} at {1}", g.Value, g.Index));
                break;
            case "fragment":
                Debug.WriteLine(String.Format("fragment: {0} at {1}", g.Value, g.Index));
                break;
        }
    }
}

The regular expression pathExpr is initialised with REGEX_URI, a constant set to the raw value of the expression shown above. The Match method is given the fullPath variable, which contains the string value of the full-path attribute.

If the match is successful, then the Groups of the match are examined. The GroupCollection corresponds with the subexpressions found in the regular expression. Where the group is named, we can look to see if it’s an unwanted URI component. Running this code with the <rootfile> shown above gives the following output:

       scheme: http at 0
       authority: opubwriter.com at 7
       path: /OPS/epub.opf at 21
       query: option=42 at 35
       fragment: package-doc at 45

This shows the subexpressions found in the matching process and gives the position at which each match was found. Rootfile validation can now include checks on the full-path attribute. If any part of a URI other than a path-component is detected a validation failure can be reported. If only a path is matched there is the further requirement that it be path-rootless. My interpretation of the five rules of disambiguation is that to be path-rootless, the path must have length greater than zero, not begin with a slash (“/”), and not include a colon (“:”) in the first segment.
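The checks described in the previous paragraph could be combined along the following lines. This is only a sketch of my interpretation, not the code used in opubWriter: the class name FullPathCheck and the method IsPathRootless are hypothetical, and the three path conditions are simply the ones stated above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullPathCheck
{
    // The RFC 3986 expression with named groups, written as a verbatim string.
    private const string REGEX_URI =
        @"^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?";

    // Hypothetical helper: true when fullPath contains only a path component
    // and that path satisfies the three path-rootless conditions given above.
    public static bool IsPathRootless(string fullPath)
    {
        if (string.IsNullOrEmpty(fullPath))
        {
            return false;
        }

        Match m = Regex.Match(fullPath, REGEX_URI);
        if (!m.Success)
        {
            return false;
        }

        // Any URI component other than the path is a validation failure.
        if (m.Groups["scheme"].Success || m.Groups["authority"].Success ||
            m.Groups["query"].Success || m.Groups["fragment"].Success)
        {
            return false;
        }

        string path = m.Groups["path"].Value;
        if (path.Length == 0 || path[0] == '/')
        {
            return false;
        }

        // The first segment is everything up to the first slash.
        int slash = path.IndexOf('/');
        string firstSegment = slash < 0 ? path : path.Substring(0, slash);
        return firstSegment.IndexOf(':') < 0;
    }
}
```

With this sketch, a value such as OPS/epub.opf is accepted, while /OPS/epub.opf and the erroneous http URL from the example above are both rejected.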

I think that trying to explain something to someone else is a good measure of how much you understand; my problem is I’m already starting to forget.

Written by netkingcol

April 7, 2010 at 12:25 pm

The MUST, SHOULD, MAY approach to epub validation

This post discusses the approach to epub validation taken by opubWriter. The features described are currently in test and will be available online shortly.

Each of the epub standard documents has a section called Conformance. For instance, the Open Container Format section 1.4 begins:

1.4 Conformance

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document MUST be interpreted as described in (http://www.ietf.org/rfc/rfc2119.txt).

This section defines conformance requirements for OCF.

The meanings of the keywords listed are summarised below:

1. MUST/REQUIRED/SHALL - an absolute requirement of the specification.
2. MUST NOT/SHALL NOT - an absolute prohibition of the specification.

3. SHOULD/RECOMMENDED - there must be good reasons for deviating from the specification.
4. SHOULD NOT/NOT RECOMMENDED - there must be good reasons for doing what the specification advises against.

5. MAY/OPTIONAL - a truly optional item.

There is a deliberate space between items 2 and 3 and another between items 4 and 5. This splits the list into three groups:

  1. Mandatory aspects of the specification.
  2. Items where there may be good reasons for deviating from the specification, but they have to be considered carefully.
  3. Optional items where the content provider has complete freedom.

Validation of a publication that purports to conform to the epub standards involves checking the .epub file against the specifications. To do this programmatically requires a degree of organisation, both in the preparation and in the implementation.

The rules for conformance must be extracted from the specifications. This involves scanning the text of the specifications to find each occurrence of the keywords listed above, and identifying a test that can be performed on the files of the .epub under test.

For instance, section 3.1 of the OCF says:

The virtual file system for the OCF “Abstract Container” MUST have a single common root directory for all of the contents of the container.

The following file names in the root directory are reserved:

  • “mimetype”
  • “META-INF”

The “mimetype” file is discussed in Section 4. The META-INF/ directory contains the reserved files used by OCF. These reserved files are described in the following sections. All other files used by the publication rendition(s) within the Abstract Container MAY be in any location descendant from the root directory except for “mimetype” at the root level or within the META-INF directory.

It is RECOMMENDED that the contents of individual publications be stored within dedicated sub-directories to minimize potential file name collisions in the event that multiple renditions are used or that multiple publications per container are supported in future versions of this Specification.

The keywords are in bold. That first sentence:

The virtual file system for the OCF “Abstract Container” MUST have a single common root directory for all of the contents of the container.

contains the MUST keyword. At this point the specification is talking about the virtual file system of an “Abstract Container”. When checking an actual publication which, in the case of a .epub book, is a zip file, this requirement can be formulated as a test of the structure of the zip contents.

Having identified the validation rules and how to test them, design decisions were needed for the opubWriter editor on what to test when and how to show the results to the user.

opubWriter has a ‘file upload’ option that allows the user to add an epub book to their library. This is risky because nothing is known about the uploaded book. There is a case for validating a book before allowing its inclusion in the library. On the other hand, the user should have an opportunity to fix a flawed publication, so it’s not appropriate to refuse to handle such files.

There is also the question of when to perform validation checks. To perform full validation when the user wants to open a book and edit a content document would be onerous and probably slow. A preferred approach is to perform only those checks that are part of unpacking the epub when the user selects it in their library. These could be classified as Container Validation and Package Validation.

The Container Validation checks at this stage don’t even need to be comprehensive; to open the book for editing we need to load and parse container.xml which should be in the META-INF folder. There must be at least one rootfile element with a media-type attribute set to ‘application/oebps-package+xml’ and the full-path attribute of that rootfile element should point to a package document which can be parsed as an XML document. There are plenty of other validation checks that could be performed on the container, but they needn’t prevent the user from writing content.
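As an illustration of the minimal container check just described, here is a sketch using LINQ to XML. It is not the opubWriter code: the class name ContainerCheck and the method FindPackagePath are hypothetical, and the caller is assumed to have already read container.xml from the META-INF folder into a string.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class ContainerCheck
{
    // Namespace of container.xml as defined by the Open Container Format.
    static readonly XNamespace Ocf = "urn:oasis:names:tc:opendocument:xmlns:container";

    const string PackageMediaType = "application/oebps-package+xml";

    // Hypothetical helper: returns the full-path of the first rootfile that
    // declares the package media type, or null if container.xml cannot be
    // parsed or contains no such rootfile.
    public static string FindPackagePath(string containerXml)
    {
        XDocument doc;
        try
        {
            doc = XDocument.Parse(containerXml);
        }
        catch (XmlException)
        {
            return null;
        }

        return doc.Descendants(Ocf + "rootfile")
                  .Where(r => (string)r.Attribute("media-type") == PackageMediaType)
                  .Select(r => (string)r.Attribute("full-path"))
                  .FirstOrDefault(p => !string.IsNullOrEmpty(p));
    }
}
```

A null result would be treated as fatal for editing purposes, since without a package document there is nothing to open; the remaining container checks can wait for full validation.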

Package Validation comprises checking that the XML document pointed to by the rootfile contains <metadata>, <manifest>, and <spine> nodes. Further, we must check that the NCX document identified in the ‘toc’ attribute of the <spine> is present and can be loaded as an XML document. This document must include a <navMap> element that holds the table of contents as a collection of <navPoint> elements.
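The package checks could be sketched in the same style. Again this is an illustration under stated assumptions rather than the opubWriter implementation: PackageCheck, CheckPackage, and HasNavMap are hypothetical names, both documents are assumed to be available as strings, and requiring at least one <navPoint> is my reading of 'a collection of <navPoint> elements'.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PackageCheck
{
    // Namespaces used by OPF 2.0 package documents and by NCX documents.
    static readonly XNamespace Opf = "http://www.idpf.org/2007/opf";
    static readonly XNamespace Ncx = "http://www.daisy.org/z3986/2005/ncx/";

    // Hypothetical helper: returns observations about the package document.
    // An empty list means these particular checks passed.
    // XDocument.Parse throws XmlException if the text is not well-formed XML.
    public static List<string> CheckPackage(string packageXml)
    {
        var findings = new List<string>();
        XElement root = XDocument.Parse(packageXml).Root;

        foreach (string name in new[] { "metadata", "manifest", "spine" })
        {
            if (root.Element(Opf + name) == null)
            {
                findings.Add("Missing <" + name + "> element");
            }
        }

        // The toc attribute of <spine> holds the id of the manifest item for the NCX.
        XElement spine = root.Element(Opf + "spine");
        string tocId = spine == null ? null : (string)spine.Attribute("toc");
        if (spine != null && tocId == null)
        {
            findings.Add("<spine> has no toc attribute");
        }
        else if (tocId != null &&
                 !root.Descendants(Opf + "item").Any(i => (string)i.Attribute("id") == tocId))
        {
            findings.Add("No manifest item with id '" + tocId + "'");
        }

        return findings;
    }

    // True when the NCX document has a <navMap> holding at least one <navPoint>.
    public static bool HasNavMap(string ncxXml)
    {
        XElement navMap = XDocument.Parse(ncxXml).Root.Element(Ncx + "navMap");
        return navMap != null && navMap.Elements(Ncx + "navPoint").Any();
    }
}
```

Returning a list of observations rather than a single pass/fail value fits the approach described next, where findings are gathered during unpacking and shown to the user later.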

The approach used by opubWriter is to gather observations during the process of unpacking an epub book and, provided they are not fatal, to present the user with an icon indicating the validation status. The following screenshots illustrate this.

Valid Epub Icon

In this screenshot, I have downloaded Wuthering Heights from epubBooks, uploaded it to the opubWriter library, and selected it.

The checks described above are performed as part of the unpacking process. If all is well, and in this case it is, the icon displayed to the left of the book title includes a tick:

valid-epub

Remember, this only means that the book passed the tests that were applied; it doesn’t mean it would pass the full validation performed by opubWriter or a third-party epub validation tool like epubCheck. On the other hand, if some problems were encountered, the icon displayed incorporates a red cross.

This is shown in the following screenshot where Holmes.epub has some problems.

Epub with validation errors

When validation failures are detected, the icon becomes a hyperlink which, when clicked, takes the user to a screen which lists the findings. That screen is shown below as part of the discussion of full epub validation.

There is a visual indication that there are problems with the publication but the user’s workflow is not interrupted; they can continue to create content, deciding for themselves when they will review and address the validation failures.

If a more serious error occurs while opening a book for editing, a different approach to reporting the problem is taken. As an example, suppose the epub could not be unzipped because it had become corrupted in some way. That’s not a validation error; it’s a run-of-the-mill System Exception about which the application can do nothing.

An Invalid Epub

Such errors are reported to the user by displaying a message at the top of the screen, as shown here. The test epub is called ‘an invalid.epub’; it’s not in the epub format and is not even zipped.

When the application tries to unzip the file a ZipException is raised; this is trapped and passed to the web page. A headline error message is displayed ‘Failed to open book…’ while the full detail of the problem is held as an InnerException.
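The pattern of trapping the low-level error and wrapping it under a headline message might look something like this. Note the assumptions: this sketch uses the framework's System.IO.Compression classes, which signal a malformed archive with InvalidDataException, in place of whichever zip library raises the ZipException mentioned above, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class BookOpener
{
    // Hypothetical helper: opens the epub bytes as a zip archive and returns
    // the number of entries. A malformed archive is reported with a headline
    // message while the original error is kept as the InnerException.
    public static int CountEntries(byte[] epubBytes)
    {
        try
        {
            using (var stream = new MemoryStream(epubBytes))
            using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
            {
                return zip.Entries.Count;
            }
        }
        catch (InvalidDataException ex)
        {
            // Stand-in for the ZipException described in the text.
            throw new ApplicationException("Failed to open book…", ex);
        }
    }
}
```

The web page would show only the Message of the outer exception; the detail remains available by walking the InnerException chain.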

In a future release of opubWriter, the user will have access to the stack of errors that can accumulate as system failures bubble up the code hierarchy.

The other validation approach is to perform a systematic and thorough examination of the .epub file using all of the rules extracted from the epub standards. That’s what the ‘validate’ button, beneath each book in the library, is intended to do.

Validation Review

Taking ‘an invalid.epub’ again as the example, if the user clicks on the validate button a more comprehensive series of tests is performed.

These tests will go as far as they can; obviously if there are serious deficiencies in the structure and content of the book it might not be possible to test everything.

The results are shown in the screenshot here. Each observation is tagged with a coloured light. Red indicates a violation of the specifications – a MUST/REQUIRED/SHALL rule that has not been followed.

Yellow indicates a failure to follow a SHOULD/RECOMMENDED rule. These should be treated as a warning that the feature in question should be checked. It may be what is wanted; it may be a mistake.

A green light indicates that additional information is displayed.
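One way to express the mapping from specification keyword to traffic light is sketched below. The names Light and Severity are hypothetical, and two points are my assumptions rather than anything stated above: that the prohibitions (MUST NOT, SHALL NOT) are shown in red like the other absolute rules, and that anything else is treated as informational.

```csharp
using System;

public enum Light { Red, Yellow, Green }

public static class Severity
{
    // Hypothetical helper: maps the RFC 2119 keyword of the rule that was
    // not followed to the colour shown against the observation.
    public static Light ForKeyword(string keyword)
    {
        switch (keyword.Trim().ToUpperInvariant())
        {
            case "MUST":
            case "REQUIRED":
            case "SHALL":
            case "MUST NOT":   // assumption: prohibitions are as serious as requirements
            case "SHALL NOT":
                return Light.Red;

            case "SHOULD":
            case "RECOMMENDED":
            case "SHOULD NOT":
            case "NOT RECOMMENDED":
                return Light.Yellow;

            default:
                // Assumption: anything else is additional information.
                return Light.Green;
        }
    }
}
```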

For example, when ‘an invalid.epub’ was examined, the initial tests were concerned with the first 60 or so bytes of the file even before the attempt to unzip it. The application found that the first 2 bytes were not ‘PK’ as they should be, mimetype is not the first file in the zip as indicated by the absence of that file name at position 30 in the file, and the application failed to find the bytes ‘application/epub+zip’ at position 38 in the file, which it should do if the mimetype file is valid.

These errors are reported with a red light. As additional information, and therefore with a green light against it, the application displays the first 60 bytes of the file. The application gets as far as it can, but then comes up against the serious failure that it can’t unzip the file. No further analysis is possible.
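The three header tests can be sketched as follows, assuming the first 60 or so bytes of the file have already been read into a byte array. HeaderCheck and its methods are hypothetical names; the offsets 0, 30, and 38 are the ones given above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HeaderCheck
{
    // Hypothetical helper: examines the start of the file before any attempt
    // to unzip it and returns one observation per failed test.
    public static List<string> Check(byte[] header)
    {
        var findings = new List<string>();

        if (!HasAsciiAt(header, 0, "PK"))
        {
            findings.Add("The first 2 bytes are not 'PK'");
        }
        if (!HasAsciiAt(header, 30, "mimetype"))
        {
            findings.Add("mimetype is not the first file in the zip");
        }
        if (!HasAsciiAt(header, 38, "application/epub+zip"))
        {
            findings.Add("'application/epub+zip' was not found at position 38");
        }

        return findings;
    }

    // True when the bytes starting at offset spell out the expected ASCII text.
    static bool HasAsciiAt(byte[] data, int offset, string expected)
    {
        if (data == null || data.Length < offset + expected.Length)
        {
            return false;
        }
        return Encoding.ASCII.GetString(data, offset, expected.Length) == expected;
    }
}
```

Each finding from this sketch would be shown with a red light, and the raw bytes themselves could be listed alongside as the green-light additional information.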

Conclusion

This post has presented the approach taken in opubWriter to epub validation. A textual analysis of the epub specifications reveals the requirements; the requirements are classified based on the keywords used and tests are devised to check each requirement. When an epub is opened by opubWriter, enough validation is performed to unpack the book and get the user to the point of editing content. A visual indication is provided of the validation status. Full validation is available using the ‘validate’ button. Validation results from both routes are available on a new Validation screen which lists all observations against a traffic light icon indicating their level of violation. Additional information is provided where it might help diagnose a problem.

Written by netkingcol

April 6, 2010 at 2:59 pm

TinyMCE valid_elements and the OPS Preferred Vocabulary

opubWriter uses the TinyMCE editor for the creation and modification of content documents. TinyMCE can output XHTML documents, which is what we need to conform to the Open Publication Structure (OPS). Section 1.4.1.2 of the OPS lists the conditions which a conformant content document must meet. Condition v says:

v. all XHTML elements and attributes not contained in an Inline XML Island are drawn from the XHTML subset identified in this document.

The list of acceptable XHTML elements and attributes can be found in Section 2.2. These make up the OPS Preferred Vocabulary and include all the usual suspects like: h1, h2 , h3, strong, big, small, em, table, img, etc. (hands up – who thinks ‘etc’ is an XHTML element?)

I recognised a while ago that I hadn’t configured opubWriter’s TinyMCE instance to restrict what the author can enter to the OPS Preferred Vocabulary. TinyMCE has a configuration setting called ‘valid_elements’ which is the list of elements and attributes that an instance of the editor will allow. Elements and attributes not in the list are stripped out when TinyMCE outputs its content. This is exactly what I needed and, by comparing the list in the OPS with the default valid_elements set, I was able to derive an OPS-conformant set of valid_elements. So now I’m testing a version of opubWriter with the following configuration setting.

valid_elements : "@[id|class|style|title|dir<ltr?rtl|lang|xml::lang|onclick|ondblclick|"
+ "onmousedown|onmouseup|onmouseover|onmousemove|onmouseout|onkeypress|"
+ "onkeydown|onkeyup],a[rel|rev|charset|hreflang|tabindex|accesskey|type|"
+ "name|href|target|title|class|onfocus|onblur],strong/b,em/i,u,"
+ "#p,-ol[type|compact],-ul[type|compact],-li,br,img[longdesc|usemap|"
+ "src|border|alt=|title|hspace|vspace|width|height|align],-sub,-sup,"
+ "-blockquote[cite],-table[border=0|cellspacing|cellpadding|width|frame|rules|"
+ "height|align|summary|bgcolor|background|bordercolor],-tr[rowspan|width|"
+ "height|align|valign|bgcolor|background|bordercolor],tbody,thead,tfoot,"
+ "#td[colspan|rowspan|width|height|align|valign|bgcolor|background|bordercolor"
+ "|scope],#th[colspan|rowspan|width|height|align|valign|scope],caption,-div,"
+ "-span,-code,-pre,address,-h1,-h2,-h3,-h4,-h5,-h6,hr[size|noshade],"
+ "dd,dl,dt,cite,abbr,acronym,del[datetime|cite],ins[datetime|cite],"
+ "object[classid|width|height|codebase|*],param[name|value|_value]"
+ ",script[src|type],map[name],area[shape|coords|href|alt|target],bdo,"
+ "col[align|char|charoff|span|valign|width],colgroup[align|char|charoff|span|"
+ "valign|width],dfn,kbd,noscript,q[cite],samp,small,"
+ "textarea[cols|rows|disabled|name|readonly],tt,var,big"
You can review the syntax of this statement at the link provided above. The following elements and their attributes were removed from the default valid_elements provided by TinyMCE:

strike, u, font, embed, button, fieldset, form, input, label, legend, optgroup, option, select

Written by netkingcol

April 2, 2010 at 12:50 pm

Preparing to release version 1.0

For two weeks I have been grappling manfully with treeview data bindings and nested repeater event bubbles to create a new look for opubWriter. I realised that I was trying to do too much on the ‘Content’ screen – edit documents, add and remove documents, and rearrange the table of contents – so I have split this functionality to create two new screens.

The Editor screen simply shows the table of contents as a treeview and the wysiwyg editing area, along with a Save button. A significant improvement is that opubWriter can now create and handle book structures of arbitrary depth – hence the nested repeater event bubbles.

New Editor screen

This screenshot shows four levels. To display a document, you click on its name in the treeview.

The plus and minus icons can be used to expand and collapse the divisions of the book so you can bring the part you’re working on conveniently into view.

The screenshot also shows that the editing area can be resized to match your preference.

One feature that’s slightly less attractive than before is that I’ve removed the ‘save’ option from the editing area and replaced it with a ‘Save’ button. I couldn’t get the tinyMCE plugin to work reliably, and there are risks that it will behave differently in different browsers. I notice that WordPress and Blogger both have separate Save buttons, so opubWriter is in good company in that respect.

The next screenshot shows the Navigation tab of the Organisation screen.

Organisation/Navigation screen

On this screen you also see the hierarchical structure of the book, but this time you don’t expand and collapse the view.

To the left of each content document you will see a group of icons, each of which represents an action you can perform. The actions are listed below.

Download the content document.
Delete the content document.
Move the content document up within its level.
Move the content document down within its level.
Save the title of the content document.

The download option allows you to save a single content document, which you might want to do in order to upload it into another book. This post shows you how to do that below. This action keeps the document in the current publication and doesn’t affect it in any way.

The delete option allows you to remove a content document completely from the publication. Obviously, you must think carefully before taking this action. For instance, you might want to take a copy of the .epub file or download the content document before you delete it.

The actions to move a content document up and down in the reading order act only at the current level of the document. Where a content document is the first one at its level e.g. the Introduction to Part 1 in the screenshot, the ‘move up’ icon is not shown. Conversely, a document which is the last at its level does not have a ‘move down’ icon. If you really must promote or demote a content document to a different level, the approach is to save it first by downloading it and then uploading it to its new position. Again, uploading a content document is explained below.

The ‘save’ action refers to the text of the table of contents entry. You can use this option to change the text displayed in the table of contents. Technically, this will be saved in the NCX file of the epub in the <navLabel><text> element.

The second tab to mention on the Organisation screen is the Content tab. This provides the ability to add new content at any position within an epub publication. The new content document can either be created from scratch using details you provide or it can be uploaded from an existing content document on your computer system.

The screenshot below shows the Organisation/Content screen.

Organisation/Content screen

To create a new content document, you enter three pieces of information:

1. The entry to appear in the table of contents.
2. A unique filename for the document.
3. An optional document heading.

To upload an existing content document, you use the Browse button to locate the file on your computer system; then you enter the following two pieces of information:

1. The entry to appear in the table of contents.
2. A unique filename for the document.

Having selected the source of the new content, you then select where it will go. There is a dropdown list that lets you specify the position of the new content relative to an existing document.

The document relative to which the new content is placed is selected in the next dropdown list. The entries in this list show the full path to each document. This allows you to distinguish between, for instance, “PART ONE/Chapter 5” and “PART TWO/Chapter 5”.

The screenshot shows the Content screen completed and ready for the Create button to be clicked.

Notice there’s also a tab on the Organisation screen called ‘Spine’. At some point this will be developed to allow control over the reading sequence defined by the spine and the ability to change the ‘linear’ attribute of a spine item, which identifies content as primary or auxiliary.

Written by netkingcol

March 30, 2010 at 11:32 am

Images and Styles added to User Guide

This morning I added two new articles to the opubWriter User Guide.

The first is concerned with adding images to your epub books, both uploading them to be included in the manifest and then inserting them into the content documents where they are relevant. There are 4 image formats you can use (gif, jpeg, png, and svg). In the current release of opubWriter there’s a limit of 100Kbytes for any one image file.

The second User Guide article covers the subject of CSS styling: how to add, modify, and apply CSS stylesheets to your content documents. The Styles screen is where you upload new stylesheets and edit their style rules. It’s also here that you can delete a stylesheet, provided it’s not referenced anywhere in your book. You use the Style Usage screen to control which stylesheets are applied to which documents. The article shows how a content document can have more than one stylesheet and also how you can get close control of styling by creating your own style classes, which appear in a dropdown list in the editing window. Don’t assume that your readers will like your style; they may want to read taupe text on a puce background. I say ‘let them’.

Written by netkingcol

March 20, 2010 at 1:28 pm

opubWriter new features

Today I implemented the following changes to opubWriter:

  1. Pass multiple CSS stylesheets to TinyMCE initialisation. Now that the ‘Style Usage’ form is available, the user can assign more than one stylesheet to a content document. The TinyMCE editor has the ability to work with multiple stylesheets; what was needed for TinyMCE to apply a document’s stylesheets was to find all the ‘text/css’ link elements in the selected document and set the content_css property of TinyMCE during initialisation to a comma-separated list of absolute URLs to these stylesheets.
  2. Disallow stylesheet deletion if it’s in use. On the Styles form the user has the ability to upload, modify, and delete stylesheets. An epub would fail validation if a referenced item were not in the manifest. When a user selects a stylesheet for deletion, a check is now made across all content documents (media-type = ‘application/xhtml+xml’) in the manifest for a <link> element that refers to the selected stylesheet. If any reference is found, a message is displayed and the stylesheet is not deleted.
  3. New Epub form now has a dropdown list of languages. When a user creates a new epub book using the New button on the Library form, a screen is displayed which captures some initial metadata. One of the mandatory items captured is the language code identifying the language in which the intellectual content is written. The code must conform to RFC 3066. In order to save the user the effort of looking up the code for their selected language, the New Epub form now offers a dropdown list of language names, defaulted to English, and the list delivers the language code for the selected language to the application.
  4. More validation. Characters that are not allowed in filenames and XML documents are detected when the user types into the Contents Entry field when adding a content document to a publication. A suitable error message is displayed. On the New Epub form, the user must provide a title and an identifier which are mandatory metadata items in epub.
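Items 1 and 2 both rest on finding the ‘text/css’ link elements in a content document. The following is a sketch of how that could be done with LINQ to XML, not the code in opubWriter: the class StylesheetLinks and its methods are hypothetical, the document URL is an assumed input, and the in-use test assumes that the href in the document matches the stylesheet's href exactly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StylesheetLinks
{
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the href of every text/css <link> element in an XHTML document.
    public static List<string> GetStylesheetHrefs(string xhtml)
    {
        return XDocument.Parse(xhtml)
                        .Descendants(Xhtml + "link")
                        .Where(l => (string)l.Attribute("type") == "text/css")
                        .Select(l => (string)l.Attribute("href"))
                        .Where(h => !string.IsNullOrEmpty(h))
                        .ToList();
    }

    // Builds the comma-separated list of absolute URLs for content_css,
    // resolving each href against the URL of the content document.
    public static string BuildContentCss(string documentUrl, string xhtml)
    {
        var baseUri = new Uri(documentUrl);
        return string.Join(",",
            GetStylesheetHrefs(xhtml).Select(h => new Uri(baseUri, h).AbsoluteUri));
    }

    // True when any of the content documents links to the given stylesheet.
    // Assumption: hrefs are compared as written, with no path normalisation.
    public static bool IsInUse(string stylesheetHref, IEnumerable<string> contentDocuments)
    {
        return contentDocuments.Any(d => GetStylesheetHrefs(d).Contains(stylesheetHref));
    }
}
```

A real implementation would probably need to normalise relative paths before comparing them, since two documents in different folders can refer to the same stylesheet with different hrefs.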

opubWriter User Guide

I have started a User Guide for opubWriter which is available as Pages within this blog. The articles to date (shown to the right of this screen) are:

  • Register and Login
  • Managing Your Library
  • Editing Epub Contents

Articles in the pipeline are:

  • Managing Metadata
  • Managing Styles
  • Document Organisation

Written by netkingcol

March 17, 2010 at 4:59 pm
