This document attempts to describe the format of a .rb file -- the book format that is downloaded into NuvoMedia's hand-held wonder, the Rocket eBook.
Note: All multi-byte integers are stored in Vax/Intel order (little-endian, the opposite of network byte order). Most integers are 4 bytes (an int32), but there are some minor exceptions (as detailed below).
Also, the following document refers to the .rb file sections as "pages".
The first 4 bytes of the file seem to be a magic number (in hex): B0 0C B0 0C. I like to think of this as a hexadecimal pun on the word "book" (repeated). [Matt Greenwood has reported seeing a magic number of "B0 0C F0 0D" in another type of ReB-related file -- i.e. "book food".]
The next two bytes appear to be a version number, currently "02 00". I assume this means major version 2, minor version 0.
The next 4 bytes are the string "NUVO", followed by 4 bytes of 00h. (I have also seen an old title that had 0s in place of the "NUVO".)
This brings us up to offset 0Eh, at which point we have a 4-byte representation of the date the book was created (Matt Greenwood pointed this out to me -- thanks!). The year is encoded as an int16. An older version of the RocketLibrary encoded the year's full value (e.g. 1999 was "CF 07" and 2000 was "D0 07"), but a more recent version uses the tm_year value verbatim (years since 1900), i.e. it stores 100 for the year 2000 ("64 00"). The year is followed by an int8 for the 1-relative month number, and an int8 for the day of the month.
After that is 6 bytes of 00h. These may be reserved for setting the time of creation (at a guess).
Then, at offset 18h, we have an int32 that contains the absolute offset of the "Table of Contents" (the directory of the pages contained within this .rb file). In all of the .rb files I've seen, this remains constant with a value of 128h. However, I have tested an atypical .rb file where I placed the ToC at the end of the file (after all the file contents), and it worked fine. (I've chosen not to build any books in such a non-standard format, however.)
Immediately following this is an int32 with the length of the .rb file (so we can check if the file is complete or not).
All the bytes from here (offset 20h) up to offset 128h appear to only be used by an encrypted title. In a non-encrypted title, they are always 0.
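The fixed header fields described above can be read in one pass. The following is only a sketch of that layout, not code from rbmake: the function name `parse_rb_header` and the returned dictionary keys are made up for illustration, the int32s are assumed to be unsigned, and a year below 1900 is assumed to be a tm_year value.

```python
import struct

MAGIC = b"\xb0\x0c\xb0\x0c"

# Little-endian layout of offsets 00h-1Fh:
#   magic(4) version(2) "NUVO"(4) zeros(4)
#   year(int16) month(int8) day(int8) zeros(6)
#   toc_offset(int32) file_length(int32)
HEADER = struct.Struct("<4s2s4s4xHBB6xII")


def parse_rb_header(data: bytes) -> dict:
    """Decode the first 20h bytes of an .rb file (hypothetical helper)."""
    if len(data) < HEADER.size:
        raise ValueError("file is too short to hold an .rb header")
    (magic, version, nuvo, year, month, day,
     toc_offset, file_length) = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number, not an .rb file")
    # Assumption: small values are tm_year (years since 1900),
    # large values are the full year written by older software.
    if year < 1900:
        year += 1900
    return {
        "version": (version[0], version[1]),  # assumed (major, minor)
        "nuvo": nuvo,                          # b"NUVO", or zeros in old titles
        "date": (year, month, day),
        "toc_offset": toc_offset,              # normally 128h
        "file_length": file_length,
    }
```

Comparing `file_length` against the actual size of the file gives the completeness check mentioned above.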
The table of contents typically comes next (at offset 128h). It starts with an int32 count of the number of "page" entries (.rb-file sections) in the ToC. Each entry consists of a name (zero-padded to 32 bytes), followed by 3 int32s: the length of this entry's data segment, the absolute offset of the data in the .rb file, and a flag. The known flag values are: 1 (encrypted), 2 (info page), and 8 (deflated). The names are tweaked as needed to ensure that they are all unique. The current RocketWriter software uses a unique 6-digit number, a dash, up to 8 characters from the filename, and then the re-mapped suffix for the data (.html, .hidx, .png, .info, etc.). My rbmake library simply ensures that the names are no longer than 15 characters (not counting the suffix) and are all unique.
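As a rough illustration of the ToC layout just described, here is a small sketch that walks the entries. The names `parse_toc`, `ENTRY`, and the `FLAG_*` constants are invented for this example; decoding names as Latin-1 is an assumption, and the flag is returned as a plain number because I don't know whether the values can be combined.

```python
import struct

# Known flag values (from the description above).
FLAG_ENCRYPTED = 1
FLAG_INFO = 2
FLAG_DEFLATED = 8

# One ToC entry: 32-byte zero-padded name, then length, offset, flag.
ENTRY = struct.Struct("<32sIII")


def parse_toc(data: bytes, toc_offset: int = 0x128) -> list:
    """Return the ToC entries of an .rb file image (hypothetical helper)."""
    (count,) = struct.unpack_from("<I", data, toc_offset)
    entries = []
    pos = toc_offset + 4
    for _ in range(count):
        raw_name, length, offset, flag = ENTRY.unpack_from(data, pos)
        # Drop the zero padding; the character set is an assumption.
        name = raw_name.split(b"\x00", 1)[0].decode("latin-1")
        entries.append({"name": name, "length": length,
                        "offset": offset, "flag": flag})
        pos += ENTRY.size
    return entries


def page_data(data: bytes, entry: dict) -> bytes:
    """Slice out the stored (possibly deflated) bytes of one page."""
    return data[entry["offset"]:entry["offset"] + entry["length"]]
```

The `toc_offset` argument should come from the int32 at offset 18h rather than being hard-coded, since the ToC does not have to sit at 128h.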
Often the first item in the ToC is the info page, but it doesn't have to be. This page of information contains NAME=VALUE pairs that note the author, title, what the root-page's name is, etc. (See appendix A). This data is never encrypted nor compressed, so this entry's flag value is always "2".
An image page is always stored as a B&W image in PNG format. Since it has its own compression, it is stored without any additional attempt at deflation. I have also never seen an encrypted image, so its flag value is always 0.
An HTML page contains the tags and text that were re-written into a consistent syntax (this presumably makes the HTML renderer in the ReB itself simpler). HTML pages are typically compressed (See appendix B). Every HTML page appears to use the suffix .html no matter what the file name was on import (but I have seen older files with .htm used as the suffix, so the rocket appears to support both).
For every HTML page there is a corresponding .hidx page that contains a summary of the paragraph formatting and the position of the anchor names in the associated .html page (See appendix C). This page is sometimes compressed, depending on length (See appendix B).
There are also reference titles that have a .hkey page that contains a list of words that can be looked up in the associated .html page (See appendix D).
Immediately following the ToC is the data for each piece mentioned in the ToC, in the same order as it appeared in the ToC.
Finally, the end of the file appears to be padded with 20 bytes of 01h.
The info page consists of a series of lines that contain "NAME=VALUE" strings. Each line is terminated by a single newline. Here are the values that the RocketWriter generates:
COMMENT=Info file for <title>
TYPE=2
TITLE=<title>
AUTHOR=<author>
URL=ebook:<long, unique string used for the file's name by the librarian>
GENERATOR=<e.g. RocketLibrarian 1.3.216>
PARSE=1
OUTPUT=1
BODY=<name of root HTML page (as it appears in the ToC)>
MENUMARK=menumark.html
SuggestedRetailPrice=<usually empty>
Encrypted titles have a few more entries (in addition to those listed above):
ISBN=<ISBN number, including dashes>
REVISION=<digits>
TITLE_LANGUAGE=<en-us>
PUB_NAME=<Publisher's name>
PUBSERVER_ID=<digits>
GENERATOR=<e.g. RocketPress 1.3.121>
VERSION=<digits>
USERNAME=<rocket-ID>
COPY_ID=<digits>
COPYRIGHT=<copyright>
COPYTITLE=<another copyright?>
A reference title also has an indication that there is a .hkey page present, and may also have a GENRE of "Reference":
HKEY=1
GENRE=Reference
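Since the info page is just newline-terminated NAME=VALUE lines, reading it takes very little code. This is a sketch only: `parse_info` is a made-up name, and treating the bytes as Latin-1 is an assumption on my part.

```python
def parse_info(data: bytes) -> dict:
    """Turn an info page into a dict of NAME -> VALUE (hypothetical helper)."""
    info = {}
    # Assumption: the page is Latin-1 text; the format notes don't say.
    for line in data.decode("latin-1").split("\n"):
        if "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first "=" only, so values may contain "=".
        name, value = line.split("=", 1)
        info[name] = value
    return info
```

For example, `parse_info(page)["BODY"]` would give the ToC name of the root HTML page.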
Compressed pages have a data section in the .rb file with the following format:
The first int32 is a count of the number of 4096-byte chunks of data we broke the uncompressed page into (the last chunk can be shorter than 4096 bytes, of course).
This is immediately followed by an int32 with the length of the entire uncompressed data.
After this there are <count> int32s that indicate the size of each chunk's compressed data.
Following these length int32s is the output from a deflation (the algorithm used in gzip) for each 4096-byte chunk of the original data. It appears that you must use a window-bit size of 13 and a compression level of "best" to be compatible with the Rocket eBook's system software.
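To make the chunked layout concrete, here is a sketch that packs and unpacks a page using Python's zlib module. The function names are invented, and two things are assumptions rather than facts from the format: that each chunk is an independent zlib-wrapped stream (as opposed to raw deflate data), and that nothing else follows the last chunk.

```python
import struct
import zlib

CHUNK = 4096


def deflate_page(raw: bytes) -> bytes:
    """Build a compressed data section from uncompressed page data."""
    chunks = [raw[i:i + CHUNK] for i in range(0, len(raw), CHUNK)]
    packed = []
    for chunk in chunks:
        # Level 9 ("best") with 13 window bits, one stream per chunk.
        comp = zlib.compressobj(9, zlib.DEFLATED, 13)
        packed.append(comp.compress(chunk) + comp.flush())
    header = struct.pack("<II", len(chunks), len(raw))
    sizes = struct.pack("<%dI" % len(packed), *[len(p) for p in packed])
    return header + sizes + b"".join(packed)


def inflate_page(data: bytes) -> bytes:
    """Recover the uncompressed page from a compressed data section."""
    count, total = struct.unpack_from("<II", data, 0)
    sizes = struct.unpack_from("<%dI" % count, data, 8)
    pos = 8 + 4 * count
    out = []
    for size in sizes:
        # The default window (15 bits) can read a 13-bit stream.
        out.append(zlib.decompress(data[pos:pos + size]))
        pos += size
    result = b"".join(out)
    if len(result) != total:
        raise ValueError("uncompressed length does not match the header")
    return result
```

If real pages turn out to hold raw deflate data, passing `wbits=-13` to `zlib.compressobj` and `wbits=-15` to `zlib.decompress` would be the change to try.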
The .hidx page's purpose is to allow the renderer to quickly look up the format of each paragraph (useful for random access to the data), and the position of the anchor names.
The first section lists the various paragraph-producing tags. It is headed by a line of "[tags <count>]", where <count> is the number of tags that follow this header. The tags are listed one per line, and have an implied enumeration from 0 to N-1 (which the other tags and the upcoming paragraph sections reference).
The first tag is typically (always?) "<HTML> -1". The number trailing the tag is the index of the tag in which it is nested; following that tag's own trailing number, and so on, gives the full sequence of enclosing tags. So, if we have a <BR> nested inside a <P ALIGN="center">, it would be listed separately from a <BR> that was nested inside a normal paragraph, and each one would have a different trailing index number.
Following the tag section is the paragraph section. The heading is "[paragraphs <count>]", and is followed by a line for each paragraph. These lines consist of a character offset into the .html page for the start of the paragraph followed by a 0-relative offset into the tag section (indicating what kind of formatting to use for the indicated paragraph).
The paragraph-section character offsets point to the first bit of text after the associated tag.
The last section details the anchor names. The heading is "[names <count>]", and each item that follows is a quoted string of the anchor name, followed by a character offset into the .html page where we'll find that name. If there are no names in the associated HTML section, the heading is included with a 0 count (i.e. "[names 0]").
The name-section character offsets point to the start of the anchor tag (not after the tag, like the offsets in the "paragraphs" section).
The lines are terminated by newlines (in standard unix fashion).
For example:
[tags 10]
<HTML> -1
<BODY> 0
<P ALIGN="right"> 1
<P ALIGN="left"> 1
<P> 1
<H3 ALIGN="center"> 1
<P ALIGN="center"> 1
<BR> 6
<H2 ALIGN="center"> 1
<BR> 1
[paragraphs 42]
160 9
164 9
184 8
220 8
261 6
316 5
359 1
379 6
410 6
460 7
511 7
564 7
616 7
668 7
720 7
773 7
827 7
880 7
933 7
988 7
1043 7
1100 7
1157 7
1214 7
1270 7
1328 7
1385 7
1442 7
1497 7
1556 7
1561 7
1635 1
1656 5
1690 6
1737 7
1773 5
1798 4
1826 3
2663 1
2668 4
2689 2
2730 8
[names 1]
"ch1" 2689
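A .hidx page in this layout can be read with a small line-oriented parser. The sketch below is my own illustration (the names `parse_hidx` and `tag_chain` are made up), it assumes one item per line as described above, and `tag_chain` reflects my reading that a trailing -1 ends the nesting sequence.

```python
import re

SECTION = re.compile(r"^\[(tags|paragraphs|names) (\d+)\]$")


def parse_hidx(text: str) -> dict:
    """Split a .hidx page into its three sections (hypothetical helper)."""
    result = {"tags": [], "paragraphs": [], "names": []}
    section = None
    for line in text.split("\n"):
        if not line:
            continue
        match = SECTION.match(line)
        if match:
            section = match.group(1)
            continue
        if section is None:
            raise ValueError("data found before the first section heading")
        # Every item ends in a number; the rest may contain spaces.
        head, _, tail = line.rpartition(" ")
        if section == "tags":
            result["tags"].append((head, int(tail)))        # (tag, parent index)
        elif section == "paragraphs":
            result["paragraphs"].append((int(head), int(tail)))  # (offset, tag index)
        else:
            result["names"].append((head.strip('"'), int(tail)))  # (name, offset)
    return result


def tag_chain(tags: list, index: int) -> list:
    """List the tags enclosing entry `index`, outermost first."""
    chain = []
    while index != -1:
        tag, parent = tags[index]
        chain.append(tag)
        index = parent
    return chain[::-1]
```

With the example above, `tag_chain(tags, 7)` would walk from the first <BR> through <P ALIGN="center"> and <BODY> up to <HTML>.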
The .hkey page contains a list of words, one per line, sorted in a strict ASCII sequence, each one followed by a tab and the offset in the .html page of the word's data. I presume that the .hkey page must share the same name prefix as its related .html page.
If the names contain high-bit characters, they are translated into regular ASCII in the .hkey file, since this allows the user to search for the words using unaccented characters.
The lines are terminated with a newline (in standard unix fashion).
An example:
a	5
apple	38
b	84
book	104
Each of these offsets points to a paragraph tag in the associated .html page. I have only seen this sequence of tags used so far:
<P><BIG><B>word</B></BIG> other stuff</P>
I have seen multiple <B>...</B> tags in the middle of the single set of <BIG>...</BIG> tags, but this is the basic tag format.
The offset in the .hkey page points to the start of the <P> tag.
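Because the .hkey list is sorted in ASCII order, a word can be found with a binary search. This sketch uses invented names (`parse_hkey`, `lookup`) and assumes the page is pure ASCII with exactly one tab before the offset, as described above.

```python
import bisect


def parse_hkey(data: bytes) -> list:
    """Return the (word, offset) pairs of a .hkey page (hypothetical helper)."""
    entries = []
    for line in data.decode("ascii").split("\n"):
        if not line:
            continue
        # The offset follows the last tab on the line.
        word, _, offset = line.rpartition("\t")
        entries.append((word, int(offset)))
    return entries


def lookup(entries: list, word: str):
    """Find the .html offset of `word`, or None if it is not listed."""
    words = [w for w, _ in entries]
    # Python compares ASCII strings by byte value, matching the sort order.
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return entries[i][1]
    return None
```

The returned offset is where the word's <P> tag starts in the associated .html page.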