Checking path-rootless with a regular expression
Section 3.5.1 of the Open Container Format says:
The values of the full-path attribute MUST contain a “path-component” (as defined by RFC3986) which MUST only take the form of a “path-rootless” (as defined by RFC3986). The path components are relative to the root of the container in which they are used.
RFC3986 discusses the generic syntax of the Uniform Resource Identifier (URI) and declares a URI to comprise a hierarchical sequence of components with the layout:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
The RFC gives a very thorough definition of each of these components, including path:
path = path-abempty ; begins with "/" or is empty
     / path-absolute ; begins with "/" but not "//"
     / path-noscheme ; begins with a non-colon segment
     / path-rootless ; begins with a segment
     / path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
These are the “five rules of path component disambiguation” and the RFC goes on to explain what segment-nz-nc and segment-nz mean. My question for epub validation is: how can I check that the full-path attribute in a <rootfile> in container.xml is of type path-rootless?
The first task is to work out what kind of URI we have in the full-path attribute; it’s just a string and it might contain anything – especially if someone edited it by hand. The C# language has a decent string class and it would be possible to analyse the value using methods like Contains and LastIndexOf. However, there’s a more powerful approach available – regular expressions.
RFC3986 itself provides a regular expression that will perform pattern matching on a URI:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
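Appendix B of the RFC notes that, with this expression, the scheme, authority, path, query and fragment are captured by groups 2, 4, 5, 7 and 9 respectively. A minimal sketch of reading those numbered groups in C# follows; the class name, method name and sample URI are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RfcUriGroups
{
    // The expression from RFC 3986, Appendix B, held in a verbatim string.
    const string RfcPattern = @"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?";

    // Returns scheme, authority, path, query and fragment, in that order.
    // A component that is absent comes back as an empty string.
    public static string[] Split(string uri)
    {
        Match m = Regex.Match(uri, RfcPattern);
        return new[]
        {
            m.Groups[2].Value, m.Groups[4].Value, m.Groups[5].Value,
            m.Groups[7].Value, m.Groups[9].Value
        };
    }

    public static void Main()
    {
        // Hypothetical sample input, not taken from an epub.
        foreach (string part in Split("http://example.com/a/b?x=1#top"))
        {
            Console.WriteLine(part);
        }
    }
}
```

For the sample input this prints http, example.com, /a/b, x=1 and top on separate lines. Numbered groups work, but the numbers are easy to get wrong, which is one reason to prefer names.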
You have to be a particular type of person to want to live and breathe regular expressions. I have dipped into them occasionally but I find that the interval between dips is just long enough for me to forget what I knew before. The regular expression shown here does the job, and that’s good enough for me – almost. I want to be able to identify which parts of a URI are present in the full-path attribute. Therefore, I extended this expression by inserting group names, resulting in this expression:
^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?
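In C# the expression can be held in a verbatim string (prefixed with @) so that the backslash needs no extra escaping. The following is a sketch of how a constant such as REGEX_URI might be declared and used to pull out one named component; the class and method names are assumptions for illustration, not the code from the original project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriPattern
{
    // Verbatim strings: the backslash in \? reaches the regex engine unchanged.
    public const string REGEX_URI =
        @"^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?" +
        @"(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?";

    // Returns the text captured by a named group, or "" if the group did not match.
    public static string Component(string value, string groupName)
    {
        return Regex.Match(value, REGEX_URI).Groups[groupName].Value;
    }

    public static void Main()
    {
        // A well-formed full-path value has a path and nothing else.
        Console.WriteLine("path:   [" + Component("OPS/epub.opf", "path") + "]");
        Console.WriteLine("scheme: [" + Component("OPS/epub.opf", "scheme") + "]");
    }
}
```

Splitting the literal across two lines is only for readability; the compiler joins the two constants into a single pattern.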
To see how this works, consider the following erroneous <rootfile> definition:
<rootfile full-path="http://opubwriter.com/OPS/epub.opf?option=42#package-doc"
          media-type="application/oebps-package+xml"/>
The full-path attribute clearly has, in addition to a path, a scheme (http), an authority (opubwriter.com), a query (option=42), and a fragment (package-doc). Running the following C# code shows how the regular expression analyses the full-path string:
// Requires: using System.Diagnostics; using System.Text.RegularExpressions;
Regex pathExpr = new Regex(REGEX_URI);
Match m = pathExpr.Match(fullPath);
if (m.Success)
{
    // i is the number of the group being visited. Unnamed groups have
    // numeric names ("0", "1", ...) and fall through the switch.
    int i = 0;
    foreach (Group g in m.Groups)
    {
        string groupName = pathExpr.GroupNameFromNumber(i++);
        switch (groupName)
        {
            case "scheme":
                Debug.WriteLine(String.Format("scheme: {0} at {1}", g.Value, g.Index));
                break;
            case "authority":
                Debug.WriteLine(String.Format("authority: {0} at {1}", g.Value, g.Index));
                break;
            case "path":
                Debug.WriteLine(String.Format("path: {0} at {1}", g.Value, g.Index));
                break;
            case "query":
                Debug.WriteLine(String.Format("query: {0} at {1}", g.Value, g.Index));
                break;
            case "fragment":
                Debug.WriteLine(String.Format("fragment: {0} at {1}", g.Value, g.Index));
                break;
        }
    }
}
The regular expression pathExpr is initialised with REGEX_URI, a constant set to the raw value of the expression shown above. The Match method is given the fullPath variable, which contains the string value of the full-path attribute.
If the match is successful, then the Groups of the match are examined. The GroupCollection corresponds with the subexpressions found in the regular expression. Where the group is named, we can look to see if it’s an unwanted URI component. Running this code with the <rootfile> shown above gives the following output:
scheme: http at 0
authority: opubwriter.com at 7
path: /OPS/epub.opf at 21
query: option=42 at 35
fragment: package-doc at 45
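One detail worth noting: in .NET a named group that took no part in the match is still present in the GroupCollection, with Success set to false and an empty Value. The sketch below, whose class and method names are hypothetical, lists only the components that actually matched, which is what a validator needs in order to spot an unwanted scheme or query.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UriComponents
{
    const string REGEX_URI =
        @"^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?" +
        @"(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?";

    static readonly string[] Names = { "scheme", "authority", "path", "query", "fragment" };

    // Returns the names of the URI components that took part in the match.
    public static List<string> Present(string fullPath)
    {
        var present = new List<string>();
        Match m = Regex.Match(fullPath, REGEX_URI);
        foreach (string name in Names)
        {
            if (m.Groups[name].Success)
            {
                present.Add(name);
            }
        }
        return present;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Present("OPS/epub.opf")));
        Console.WriteLine(string.Join(", ",
            Present("http://opubwriter.com/OPS/epub.opf?option=42#package-doc")));
    }
}
```

The path group is not optional in the expression, so it always reports as present, even when it captures an empty string. The other four appear only when their delimiters are found.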
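This shows the subexpressions found in the matching process and gives the position at which each match was found. Rootfile validation can now include checks on the full-path attribute. If any part of a URI other than a path component is detected, a validation failure can be reported. If only a path is matched, there is the further requirement that it be path-rootless. My interpretation of the five rules of disambiguation is that to be path-rootless, the path must have length greater than zero, not begin with a slash (“/”), and not include a colon (“:”) in the first segment. Strictly speaking, the no-colon condition belongs to path-noscheme rather than path-rootless, whose segment-nz does allow a colon, but since a full-path value has no scheme in front of it, a colon in the first segment could be read as the end of a scheme, so the stricter test seems the safer one.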
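The checks described above can be sketched as a single method. This is only an illustration of my interpretation, not a complete validator: the class and method names are invented, and it does not check that each character is a legal pchar as the RFC defines it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullPathValidator
{
    const string REGEX_URI =
        @"^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?" +
        @"(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?";

    // Returns true if fullPath consists of a path only, and that path is
    // non-empty, does not begin with "/", and has no colon in its first segment.
    public static bool IsPathRootless(string fullPath)
    {
        Match m = Regex.Match(fullPath, REGEX_URI);

        // Any component other than a path is a validation failure.
        if (m.Groups["scheme"].Success || m.Groups["authority"].Success ||
            m.Groups["query"].Success || m.Groups["fragment"].Success)
        {
            return false;
        }

        string path = m.Groups["path"].Value;
        if (path.Length == 0 || path[0] == '/')
        {
            return false;
        }

        // Only the first segment is checked for a colon.
        int slash = path.IndexOf('/');
        string firstSegment = slash < 0 ? path : path.Substring(0, slash);
        return firstSegment.IndexOf(':') < 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsPathRootless("OPS/epub.opf"));   // True
        Console.WriteLine(IsPathRootless("/OPS/epub.opf"));  // False
    }
}
```

A value such as ":foo/bar" has no scheme as far as the expression is concerned, because the scheme group needs at least one character before the colon, so the explicit first-segment test is what rejects it.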
I think that trying to explain something to someone else is a good measure of how much you understand; my problem is I’m already starting to forget.