incomplete/typesetting-engine/CHAPTER.rst
Typesetting engine
A typesetting engine deals with the composition of text for visual display or printing. In the digital era we live in, the main challenge for achieving an aesthetically pleasant result is to figure out the best possible line breaks in a stream of text representing paragraphs.
To illustrate the need of a good algorithm for breaking paragraphs into lines, I will let the god speak:
.. image:: http://jill-jenn.net/metasentence.png :align: center
Our contribution takes a stream of paragraphs with titles and creates PostScript slides.
For the rendering part, we use the page description language PostScript because of its simplicity. The main command that suits our needs is x y moveto (c) show: it puts the character c at the position x y of the current slide.
This algorithm was described by Donald E. Knuth and Michael F. Plass in a 1981 paper entitled "Breaking Paragraphs into Lines".
Our text is composed of blocks of several types:
Among these blocks, we need to identify at which positions we will insert breakpoints, within glues or after hyphens. A choice of breakpoints is thus a list of increasing positions within the list of blocks.
Between two consecutive breakpoints, if we know the size of each box, it is easy to compute the stretching/shrinking ratio of the spaces in order to justify the text. Basically, we want to minimize the deformation of the glue and use as few hyphens as possible: with low shrink/stretch ratio comes great readability. Each breakpoint will have a badness score called total of demerits. We thus want to solve the following optimisation problem: how to determine efficiently the breakpoint sequence achieving the fewest demerits?
If we ask ourselves the question: “From a certain breakpoint, what are the next feasible breakpoints?”, we can imagine a graph where the nodes are the breakpoints and there is an edge between nodes A and B if B is a feasible breakpoint right after the breakpoint A. Please note that this graph is acyclic, because the nodes along a path have increasing positions.
Knowing this graph, it is easy to compute efficiently the shortest path, if we regard the demerits as distances. But here, the algorithm really is original because we compute the best path while constructing the directed acyclic graph (DAG). To be precise, we maintain a linked list of breakpoints that are part of the DAG.
The raw text becomes a sequence of blocks for which we have to figure out the best position of the breakpoints (typesetting part) so that we can paint all blocks on a slide (rendering part).
The best position is chosen according to a certain “score of badness” called demerits.
To represent blocks, the Block object needs the following attributes:
Enum);SPACE_STRETCH and SPACE_SHRINK);penalty value if the block is a penalty: an amount of badness. For example, putting a breakpoint after a block corresponding to an hyphen is considered worse than putting it in place of a glue;flag boolean: “Is this a flagged penalty?” Two consecutive breakpoints at flagged penalties get extra demerits. In Knuth's paper, all penalties are flagged ones. As of tradition, we kept it in the implementation.All those attributes are defined at the beginning of the typesetting step and are not supposed to be modified. This is why we chose a namedtuple, which is not mutable, instead of a new class.
Breakpoints are at the core of our main data structure: a linked list of nodes that are part of a DAG. The Breakpoint class has the following attributes:
Please keep in mind that previous and link are not symmetric. In the following graph, there is an arrow from A to B if B.previous = A but for example, if those breakpoints are part of the linked list, we have so.link = seen and seen.link = dark.
.. image:: http://i.imgur.com/8YUCMeM.png :align: center
If we want to justify the line between two breakpoints, we must compute the ratio of its spaces. In the best possible world, spaces need not be modified (ratio = 0). If the line is too short, spaces need to be stretched, all of the same amount ratio * SPACE_STRETCH (ratio > 0). If the line is too long, spaces need to be shrinked, so we need to add to each of them the negative value ratio * SPACE_SHRINK (ratio < 0).
Thus, knowing the value current_line_length - width and the number of spaces within the current line (actually, the sum of their shrink/stretch parameters), we can compute the ratio using a simple division. In order to get a pleasant result, abs(ratio) needs to be as low as possible. This is why the demerits depend of the ratio parameter.
Using the ratio and a few other parameters (fitness class), we can compute the increase of demerits between any two breakpoints. Now our goal is to find the best sequence of breakpoints i.e. the one achieving the lowest demerits. If we knew the complete list of breakpoints, we could compute that value by dynamic programming:::
For each line
For each breakpoint registered at this line
Find the best previous breakpoint (the one achieving the lowest demerits)
But we don't. As we compute the demerits while adding breakpoints to our data structure, we need to use cunning.::
For each block
If this block is a possible breakpoint
For each line
For each breakpoint registered at this line in my linked list
Check if this breakpoint is the best previous breakpoint for the current block
Add into the linked list a breakpoint for my current block registered at the next line, knowing its best previous breakpoint
Well, that's clever.
Those steps ensure that the linked list is sorted by line number, which is essential for our dynamic programming approach. (Those consecutive nested for loops are represented by nested while loops in the find_best_previous_breakpoints method.)
More optimizations are made so that the linked list has no obsolete breakpoints (i.e. achieving a ratio less than -1, which corresponds to an ugly shrink). The first nodes of the linked list are progressively pruned.
As we know the width of each box, if we get the optimal sequence of breakpoints and the corresponding ratios, we can compute the position of each glue as well, thus paint all blocks at their right positions, so that the text can be justified.
It was hard to find how to get the width of all characters of a given font. There were several ways to estimate them but we wanted to get the true value. Our final solution used the output of ttf2tfm (TeX font metric) as stated in compute_font_metrics.py.