Dark Medieval Times

A TABLE RECOGNITION HEURISTIC

if you're blind that's too bad --- this is a really cool image

Given an image of a table, such as the one above, is it possible to compute the approximate position of the left edge of each column in a simple way? One solution may look like this:

In the image above, the left edge of each column is colored red, except for the left edge of the first column, which is defined to be the left edge of the image.

I propose a cheap, simple heuristic for determining the position of the left edge of each column in such an image.

First, partition the image into columns of width W. I have found W = 15 pixels to be effective in practice. Roughly speaking, the smaller W is, the more accurate the heuristic.

Now randomly sample N pixels from each partition, finding the simple a*(r+g+b) sum of the samples for each partition and normalize the result. I have found N = (W/3)*H, where H is the height of the image, to be sufficient in most cases. The higher N is, the slower but the more accurate the heurstic.

One such result is visualized by shading the partions with the color

	rgba(255, 0, 0, 1 - s)

In this expression, s is the partition sum. 1 is maximum opaqueness and 0 is fully transparent, meaning he more opaque the partition in the graphic below, the smaller its sum, s.

Before the final step of the heuristic is described, examine the partitions which have significantly large s, say s > x̄ + 0.35*σ, where x̄ is the mean of the sums and σ is the standard deviation. These columns are shaded blue in the following graphic.

Unsurprisingly, the significant partitions are almost exactly the ones which contain very little text. The final step of the heuristic simply identifies the middle partition in each contiguous subsequence of significant partitions as the left edge of a column.