opubWriter

the online wysiwyg epub editor

Archive for April 2010

Checking path-rootless with a regular expression

Section 3.5.1 of the Open Container Format says:

The values of the full-path attribute MUST contain a “path-component” (as defined by RFC3986) which MUST only take the form of a “path-rootless” (as defined by RFC3986). The path components are relative to the root of the container in which they are used.

RFC3986 discusses the generic syntax of the Uniform Resource Identifier (URI) and declares a URI to comprise a hierarchical sequence of components with the layout:

       URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

       hier-part   = "//" authority path-abempty
                   / path-absolute
                   / path-rootless
                   / path-empty

The RFC gives a very thorough definition of each of these components, including path:

       path          = path-abempty    ; begins with "/" or is empty
                     / path-absolute   ; begins with "/" but not "//"
                     / path-noscheme   ; begins with a non-colon segment
                     / path-rootless   ; begins with a segment
                     / path-empty      ; zero characters

       path-abempty  = *( "/" segment )
       path-absolute = "/" [ segment-nz *( "/" segment ) ]
       path-noscheme = segment-nz-nc *( "/" segment )
       path-rootless = segment-nz *( "/" segment )
       path-empty    = 0<pchar>

These are the “five rules of path component disambiguation” and the RFC goes on to explain what segment-nz-nc and segment-nz mean. My question for epub validation is: how can I check that the full-path attribute in a <rootfile> in container.xml is of type path-rootless?

The first task is to work out what kind of URI we have in the full-path attribute; it’s just a string and it might contain anything – especially if someone edited it by hand. The C# language has a decent string class and it would be possible to analyse the value using methods like Contains and LastIndexOf. However, there’s a more powerful approach available – regular expressions.

RFC3986 itself provides a regular expression that will perform pattern matching on a URI:

       ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

You have to be a particular type of person to want to live and breathe regular expressions. I have dipped into them occasionally but I find that the interval between dips is just long enough for me to forget what I knew before. The regular expression shown here does the job, and that’s good enough for me – almost. I want to be able to identify which parts of a URI are present in the full-path attribute. Therefore, I extended this expression by inserting group names, resulting in this expression:

       ^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?

To see how this works, consider the following erroneous <rootfile> definition:

       <rootfile full-path="http://opubwriter.com/OPS/epub.opf?option=42#package-doc"
                 media-type="application/oebps-package+xml"/>

The full-path attribute clearly has, in addition to a path, a scheme (http), an authority (opubwriter.com), a query (option=42), and a fragment (package-doc). Running the following C# code shows how the regular expression analyses the full-path string:

// Requires: using System; using System.Diagnostics; using System.Text.RegularExpressions;
// REGEX_URI holds the named-group expression; fullPath holds the attribute value.
Regex pathExpr = new Regex(REGEX_URI);
Match m = pathExpr.Match(fullPath);
if (m.Success)
{
    int i = 0;
    foreach (Group g in m.Groups)
    {
        // Map the group number back to its name. Unnamed groups are named
        // after their number, so they fall through the switch unreported.
        string groupName = pathExpr.GroupNameFromNumber(i++);

        switch (groupName)
        {
            case "scheme":
            case "authority":
            case "path":
            case "query":
            case "fragment":
                Debug.WriteLine(String.Format("{0}: {1} at {2}", groupName, g.Value, g.Index));
                break;
        }
    }
}

The regular expression pathExpr is initialised with REGEX_URI – a constant set to the raw value of the expression shown above. The Match method is given the ‘fullPath’ variable, which contains the string value of the full-path attribute.

If the match is successful, then the Groups of the match are examined. The GroupCollection corresponds with the subexpressions found in the regular expression. Where the group is named, we can look to see if it’s an unwanted URI component. Running this code with the <rootfile> shown above gives the following output:

       scheme: http at 0
       authority: opubwriter.com at 7
       path: /OPS/epub.opf at 21
       query: option=42 at 35
       fragment: package-doc at 45

This shows the subexpressions found in the matching process and gives the position at which each match was found. Rootfile validation can now include checks on the full-path attribute. If any part of a URI other than a path-component is detected, a validation failure can be reported. If only a path is matched, there is the further requirement that it be path-rootless. My interpretation of the five rules of disambiguation is that to be path-rootless, the path must have length greater than zero, not begin with a slash (“/”), and not include a colon (“:”) in the first segment. Strictly, the colon restriction belongs to path-noscheme rather than path-rootless, but a first segment such as “a:b” would be matched as a scheme by the expression above, so I treat it as a failure too.
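The checks just described can be sketched as a single method. This is only an illustration of my interpretation, not the code used in opubWriter: the class name FullPathChecker and the method name IsPathRootless are invented for the example, and the constant repeats the named-group expression as a C# verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullPathChecker
{
    // The RFC3986 expression with named groups, as a verbatim string.
    private const string RegexUri =
        @"^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?" +
        @"(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?";

    private static readonly Regex UriExpr = new Regex(RegexUri);

    // Hypothetical helper: true when fullPath is a bare path-rootless value.
    public static bool IsPathRootless(string fullPath)
    {
        if (string.IsNullOrEmpty(fullPath)) return false;

        Match m = UriExpr.Match(fullPath);
        if (!m.Success) return false;

        // Any URI component other than the path is a validation failure.
        if (m.Groups["scheme"].Success || m.Groups["authority"].Success ||
            m.Groups["query"].Success || m.Groups["fragment"].Success)
        {
            return false;
        }

        // The path must be non-empty and must not begin with a slash.
        string path = m.Groups["path"].Value;
        if (path.Length == 0 || path[0] == '/') return false;

        // The first segment must not contain a colon.
        int slash = path.IndexOf('/');
        string firstSegment = slash < 0 ? path : path.Substring(0, slash);
        return firstSegment.IndexOf(':') < 0;
    }
}
```

With this sketch, "OPS/epub.opf" is accepted, while "/OPS/epub.opf" and the erroneous http example are rejected.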

I think that trying to explain something to someone else is a good measure of how much you understand; my problem is I’m already starting to forget.

Written by netkingcol

April 7, 2010 at 12:25 pm

The MUST, SHOULD, MAY approach to epub validation

This post discusses the approach to epub validation taken by opubWriter. The features described are currently in test and will be available online shortly.

Each of the epub standards documents has a section called Conformance. For instance, the Open Container Format section 1.4 begins:

1.4 Conformance

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document MUST be interpreted as described in (http://www.ietf.org/rfc/rfc2119.txt).

This section defines conformance requirements for OCF.

The meanings of the keywords listed are summarised below:

1. MUST/REQUIRED/SHALL – an absolute requirement of the specification.
2. MUST NOT/SHALL NOT – an absolute prohibition of the specification.

3. SHOULD/RECOMMENDED – there must be good reasons for deviating from the specification.
4. SHOULD NOT/NOT RECOMMENDED – there must be good reasons for doing what the specification advises against.

5. MAY/OPTIONAL – a truly optional item.

There is a deliberate space between items 2 and 3 and another between items 4 and 5. This splits the list into three groups:

  1. Mandatory aspects of the specification.
  2. Items where there may be good reasons for deviating from the specification, but they have to be considered carefully.
  3. Optional items where the content provider has complete freedom.
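The three groups can be represented directly in code. The following is a minimal sketch and not taken from opubWriter: the names RequirementLevel, ConformanceKeywords, and TryClassify are invented for the illustration.

```csharp
using System.Collections.Generic;

// The three groups the keywords fall into.
public enum RequirementLevel { Mandatory, Recommended, Optional }

public static class ConformanceKeywords
{
    // Each RFC 2119 keyword mapped to its group.
    private static readonly Dictionary<string, RequirementLevel> Levels =
        new Dictionary<string, RequirementLevel>
        {
            { "MUST", RequirementLevel.Mandatory },
            { "REQUIRED", RequirementLevel.Mandatory },
            { "SHALL", RequirementLevel.Mandatory },
            { "MUST NOT", RequirementLevel.Mandatory },
            { "SHALL NOT", RequirementLevel.Mandatory },
            { "SHOULD", RequirementLevel.Recommended },
            { "RECOMMENDED", RequirementLevel.Recommended },
            { "SHOULD NOT", RequirementLevel.Recommended },
            { "NOT RECOMMENDED", RequirementLevel.Recommended },
            { "MAY", RequirementLevel.Optional },
            { "OPTIONAL", RequirementLevel.Optional }
        };

    // Returns false when the text is not one of the listed keywords.
    public static bool TryClassify(string keyword, out RequirementLevel level)
    {
        return Levels.TryGetValue(keyword, out level);
    }
}
```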

Validation of a publication that purports to conform to the epub standards involves checking the .epub file against the specifications. To do this programmatically requires a degree of organisation, both in the preparation and in the implementation.

The rules for conformance must be extracted from the specifications. This involves scanning the text of the specifications to find each occurrence of the keywords listed above, and identifying a test that can be performed on the files of the .epub under test.
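Finding the keywords is itself a job for a regular expression. The sketch below is one way to do it, assuming the specification text is available as a plain string; KeywordScanner and Find are names made up for the example. Working out which test each occurrence implies still has to be done by reading the surrounding sentence.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeywordScanner
{
    // Two-word keywords come first so that "MUST NOT" is not reported as "MUST".
    // Matching is case-sensitive, so ordinary uses of "may" or "must" are ignored.
    private static readonly Regex KeywordExpr = new Regex(
        @"\b(MUST NOT|SHALL NOT|SHOULD NOT|NOT RECOMMENDED|" +
        @"MUST|REQUIRED|SHALL|SHOULD|RECOMMENDED|MAY|OPTIONAL)\b");

    // Returns each keyword found, paired with its position in the text.
    public static List<KeyValuePair<int, string>> Find(string specText)
    {
        var found = new List<KeyValuePair<int, string>>();
        foreach (Match m in KeywordExpr.Matches(specText))
        {
            found.Add(new KeyValuePair<int, string>(m.Index, m.Value));
        }
        return found;
    }
}
```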

For instance, section 3.1 of the OCF says:

The virtual file system for the OCF “Abstract Container” MUST have a single common root directory for all of the contents of the container.

The following file names in the root directory are reserved:

  • “mimetype”
  • “META-INF”

The “mimetype” file is discussed in Section 4. The META-INF/ directory contains the reserved files used by OCF. These reserved files are described in the following sections. All other files used by the publication rendition(s) within the Abstract Container MAY be in any location descendant from the root directory except for “mimetype” at the root level or within the META-INF directory.

It is RECOMMENDED that the contents of individual publications be stored within dedicated sub-directories to minimize potential file name collisions in the event that multiple renditions are used or that multiple publications per container are supported in future versions of this Specification.

The keywords are the words in capitals. That first sentence:

The virtual file system for the OCF “Abstract Container” MUST have a single common root directory for all of the contents of the container.

contains the MUST keyword. At this point the specification is talking about the virtual file system of an “Abstract Container”. When checking an actual publication which, in the case of a .epub book, is a zip file, this requirement can be formulated as a test of the structure of the zip contents.

Having identified the validation rules and how to test them, design decisions were needed for the opubWriter editor on what to test when and how to show the results to the user.

opubWriter has a ‘file upload’ option that allows the user to add an epub book to their library. This is risky because nothing is known about the uploaded book. There is a case for validating a book before allowing its inclusion in the library. On the other hand, the user should have an opportunity to fix a flawed publication, so it’s not appropriate to refuse to handle such files.

There is also the question of when to perform validation checks. To perform full validation when the user wants to open a book and edit a content document would be onerous and probably slow. A preferred approach is to perform only those checks that are part of unpacking the epub when the user selects it in their library. These could be classified as Container Validation and Package Validation.

The Container Validation checks at this stage don’t even need to be comprehensive; to open the book for editing we need to load and parse container.xml which should be in the META-INF folder. There must be at least one rootfile element with a media-type attribute set to ‘application/oebps-package+xml’ and the full-path attribute of that rootfile element should point to a package document which can be parsed as an XML document. There are plenty of other validation checks that could be performed on the container, but they needn’t prevent the user from writing content.
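A minimal sketch of this container check follows, using LINQ to XML. It is an illustration, not opubWriter's code: ContainerCheck and FindPackagePaths are invented names, and the namespace URI is the one I understand the OCF to define for container.xml, so treat it as an assumption to verify against the specification.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ContainerCheck
{
    // Assumed namespace of container.xml elements.
    private static readonly XNamespace Ocf =
        "urn:oasis:names:tc:opendocument:xmlns:container";

    private const string PackageMediaType = "application/oebps-package+xml";

    // Returns the full-path of every rootfile that claims to be a package document.
    // XDocument.Parse throws XmlException if container.xml is not well-formed.
    public static List<string> FindPackagePaths(string containerXml)
    {
        XDocument doc = XDocument.Parse(containerXml);
        return doc.Descendants(Ocf + "rootfile")
            .Where(r => (string)r.Attribute("media-type") == PackageMediaType)
            .Select(r => (string)r.Attribute("full-path"))
            .Where(p => !string.IsNullOrEmpty(p))
            .ToList();
    }
}
```

An empty list means there is no usable rootfile and the book cannot be opened for editing; otherwise each path returned still has to be loaded and parsed as XML.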

Package Validation comprises checking that the XML document pointed to by the rootfile contains <metadata>, <manifest>, and <spine> nodes. Further, we must check that the NCX document identified in the ‘toc’ attribute of the <spine> is present and can be loaded as an XML document. This document must include a <navMap> element that holds the table of contents as a collection of <navPoint> elements.
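The package checks can be sketched in the same way. Again this is only an illustration under stated assumptions: PackageCheck, CheckPackage, and HasNavMap are invented names, the two namespace URIs are the ones I believe the OPF and NCX documents use, and the sketch relies on the 'toc' attribute holding the id of a manifest item. Loading the NCX file itself is left to the caller.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PackageCheck
{
    // Assumed namespaces for the package and NCX documents.
    private static readonly XNamespace Opf = "http://www.idpf.org/2007/opf";
    private static readonly XNamespace Ncx = "http://www.daisy.org/z3986/2005/ncx/";

    // Returns the observations gathered; an empty list means every check passed.
    public static List<string> CheckPackage(XDocument package)
    {
        var observations = new List<string>();
        XElement root = package.Root;

        foreach (string name in new[] { "metadata", "manifest", "spine" })
        {
            if (root.Element(Opf + name) == null)
                observations.Add("Package has no <" + name + "> element.");
        }

        XElement spine = root.Element(Opf + "spine");
        XElement manifest = root.Element(Opf + "manifest");
        if (spine != null && manifest != null)
        {
            // The 'toc' attribute should name a manifest item (the NCX document).
            string tocId = (string)spine.Attribute("toc");
            bool tocListed = tocId != null && manifest.Elements(Opf + "item")
                .Any(item => (string)item.Attribute("id") == tocId);
            if (!tocListed)
                observations.Add("The spine 'toc' attribute does not identify a manifest item.");
        }
        return observations;
    }

    // True when a loaded NCX document contains a <navMap> element.
    public static bool HasNavMap(XDocument ncx)
    {
        return ncx.Root != null && ncx.Root.Element(Ncx + "navMap") != null;
    }
}
```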

The approach used by opubWriter is to gather observations during the process of unpacking an epub book and, provided they are not fatal, to present the user with an icon indicating the validation status. The following screenshots illustrate this.

Valid Epub Icon

In this screenshot, I have downloaded Wuthering Heights from epubBooks, uploaded it to the opubWriter library, and selected it.

The checks described above are performed as part of the unpacking process. If all is well, and in this case it is, the icon displayed to the left of the book title includes a tick:

valid-epub

Remember, this only means that the book passed the tests that were applied; it doesn’t mean it would pass the full validation performed by opubWriter or a third-party epub validation tool like epubCheck. On the other hand, if some problems were encountered, the icon displayed incorporates a red cross.

This is shown in the following screenshot where Holmes.epub has some problems.

Epub with validation errors

When validation failures are detected, the icon becomes a hyperlink which, when clicked, takes the user to a screen which lists the findings. That screen is shown below as part of the discussion of full epub validation.

There is a visual indication that there are problems with the publication but the user’s workflow is not interrupted; they can continue to create content, deciding for themselves when they will review and address the validation failures.

If a more serious error occurs while opening a book for editing, a different approach to reporting the problem is taken. As an example, suppose the epub could not be unzipped because it had become corrupted in some way. That’s not a validation error; it’s a run-of-the-mill System Exception about which the application can do nothing.

An Invalid Epub

Such errors are reported to the user by displaying a message at the top of the screen, as shown here. The test epub is called ‘an invalid.epub’; it’s not in the epub format and is not even zipped.

When the application tries to unzip the file a ZipException is raised; this is trapped and passed to the web page. A headline error message is displayed ‘Failed to open book…’ while the full detail of the problem is held as an InnerException.
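The pattern of trapping the low-level failure and keeping it as an InnerException might look like the sketch below. It is not opubWriter's code. The ZipException mentioned above comes from whichever zip library is in use; to keep the example self-contained I have swapped in System.IO.Compression, which reports an unreadable archive with InvalidDataException instead. BookOpenException, BookOpener, and CountEntries are hypothetical names, as is the exact wording of the message.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical exception type carrying the headline message for the web page.
public class BookOpenException : Exception
{
    public BookOpenException(string message, Exception inner)
        : base(message, inner) { }
}

public static class BookOpener
{
    // Opens the epub as a zip archive and returns the number of entries.
    public static int CountEntries(string epubPath)
    {
        try
        {
            using (ZipArchive zip = ZipFile.OpenRead(epubPath))
            {
                return zip.Entries.Count;
            }
        }
        catch (InvalidDataException ex)
        {
            // Headline for the user; the full detail stays in InnerException.
            throw new BookOpenException(
                "Failed to open book: " + Path.GetFileName(epubPath), ex);
        }
    }
}
```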

In a future release of opubWriter, the user will have access to the stack of errors that can accumulate as system failures bubble up the code hierarchy.

The other validation approach is to perform a systematic and thorough examination of the .epub file using all of the rules extracted from the epub standards. That’s what the ‘validate’ button, beneath each book in the library, is intended to do.

Validation Review

Taking ‘an invalid.epub’ again as the example, if the user clicks on the validate button a more comprehensive series of tests is performed.

These tests will go as far as they can; obviously if there are serious deficiencies in the structure and content of the book it might not be possible to test everything.

The results are shown in the screenshot here. Each observation is tagged with a coloured light. Red indicates a violation of the specifications – a MUST/REQUIRED/SHALL rule that has not been followed.

Yellow indicates a failure to follow a SHOULD/RECOMMENDED rule. These should be treated as a warning that the feature in question should be checked. It may be what is wanted, or it may be a mistake.

A green light indicates that additional information is displayed.

For example, when ‘an invalid.epub’ was examined, the initial tests looked at the first 60 or so bytes of the file, before any attempt to unzip it. The application found that the first 2 bytes were not ‘PK’ as they should be; that mimetype was not the first file in the zip, as indicated by the absence of that file name at position 30 in the file; and that the bytes ‘application/epub+zip’ were not at position 38, where they should be if the mimetype file is valid.

These errors are reported with a red light. As additional information, and therefore with a green light against it, the application displays the first 60 bytes of the file. The application gets as far as it can, but then comes up against the serious failure that it can’t unzip the file. No further analysis is possible.
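The three byte-level tests are simple to express. The following sketch assumes the caller has already read the start of the file into a byte array; EpubHeaderCheck, Check, and HasAsciiAt are names invented for the example, and the failure messages are illustrative.

```csharp
using System.Collections.Generic;

public static class EpubHeaderCheck
{
    // Returns one message per failed test; an empty list means all three passed.
    public static List<string> Check(byte[] header)
    {
        var failures = new List<string>();
        if (!HasAsciiAt(header, 0, "PK"))
            failures.Add("The first 2 bytes are not 'PK'.");
        if (!HasAsciiAt(header, 30, "mimetype"))
            failures.Add("'mimetype' is not the first file in the zip.");
        if (!HasAsciiAt(header, 38, "application/epub+zip"))
            failures.Add("'application/epub+zip' was not found at position 38.");
        return failures;
    }

    // True when the ASCII text appears in data starting at the given offset.
    private static bool HasAsciiAt(byte[] data, int offset, string expected)
    {
        if (data == null || data.Length < offset + expected.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            if (data[offset + i] != (byte)expected[i]) return false;
        }
        return true;
    }
}
```

Because 'application/epub+zip' is 20 bytes long and starts at position 38, the first 58 bytes of the file are enough for all three tests.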

Conclusion

This post has presented the approach taken in opubWriter to epub validation. A textual analysis of the epub specifications reveals the requirements; the requirements are classified based on the keywords used and tests are devised to check each requirement. When an epub is opened by opubWriter, enough validation is performed to unpack the book and get the user to the point of editing content. A visual indication is provided of the validation status. Full validation is available using the ‘validate’ button. Validation results from both routes are available on a new Validation screen which lists all observations against a traffic light icon indicating their level of violation. Additional information is provided where it might help diagnose a problem.

Written by netkingcol

April 6, 2010 at 2:59 pm

TinyMCE valid_elements and the OPS Preferred Vocabulary

opubWriter uses the TinyMCE editor for the creation and modification of content documents. TinyMCE can output XHTML documents, which is what we need to conform to the Open Publication Structure (OPS). Section 1.4.1.2 of the OPS lists the conditions which a conformant content document must meet. Condition v says:

v. all XHTML elements and attributes not contained in an Inline XML Island are drawn from the XHTML subset identified in this document.

The list of acceptable XHTML elements and attributes can be found in Section 2.2. These make up the OPS Preferred Vocabulary and include all the usual suspects like: h1, h2, h3, strong, big, small, em, table, img, etc. (hands up – who thinks ‘etc’ is an XHTML element?)

I recognised a while ago that I hadn’t configured opubWriter’s TinyMCE instance to restrict what the author can enter to the OPS Preferred Vocabulary. TinyMCE has a configuration setting called ‘valid_elements’ which is the list of elements and attributes that an instance of the editor will allow. Elements and attributes not in the list are stripped out when TinyMCE outputs its content. This is exactly what I needed and, by comparing the list in the OPS with the default valid_elements set, I was able to derive an OPS-conformant set of valid_elements. So now I’m testing a version of opubWriter with the following configuration setting.

valid_elements : "@[id|class|style|title|dir<ltr?rtl|lang|xml::lang|onclick|ondblclick|"
+ "onmousedown|onmouseup|onmouseover|onmousemove|onmouseout|onkeypress|"
+ "onkeydown|onkeyup],a[rel|rev|charset|hreflang|tabindex|accesskey|type|"
+ "name|href|target|title|class|onfocus|onblur],strong/b,em/i,u,"
+ "#p,-ol[type|compact],-ul[type|compact],-li,br,img[longdesc|usemap|"
+ "src|border|alt=|title|hspace|vspace|width|height|align],-sub,-sup,"
+ "-blockquote[cite],-table[border=0|cellspacing|cellpadding|width|frame|rules|"
+ "height|align|summary|bgcolor|background|bordercolor],-tr[rowspan|width|"
+ "height|align|valign|bgcolor|background|bordercolor],tbody,thead,tfoot,"
+ "#td[colspan|rowspan|width|height|align|valign|bgcolor|background|bordercolor"
+ "|scope],#th[colspan|rowspan|width|height|align|valign|scope],caption,-div,"
+ "-span,-code,-pre,address,-h1,-h2,-h3,-h4,-h5,-h6,hr[size|noshade],"
+ "dd,dl,dt,cite,abbr,acronym,del[datetime|cite],ins[datetime|cite],"
+ "object[classid|width|height|codebase|*],param[name|value|_value]"
+ ",script[src|type],map[name],area[shape|coords|href|alt|target],bdo,"
+ "col[align|char|charoff|span|valign|width],colgroup[align|char|charoff|span|"
+ "valign|width],dfn,kbd,noscript,q[cite],samp,small,"
+ "textarea[cols|rows|disabled|name|readonly],tt,var,big"

You can review the syntax of this statement at the link provided above. The following elements and their attributes were removed from the default valid_elements provided by TinyMCE:

strike, u, font, embed, button, fieldset, form, input, label, legend, optgroup, option, select

Written by netkingcol

April 2, 2010 at 12:50 pm
