Competition on Word Recognition from Segmented Historical Documents

The goal of this competition is to maximize automation while minimizing error in recognizing, labeling, clustering, or otherwise extracting “meaning” or “information” from segmented word fields. This competition targets the subset of historical documents that exhibit some degree of structured layout and repetition of handwritten data in tables, rows, columns, forms, etc. To exploit the consistent structure or layout of these documents, the organizers have provided bounding box coordinates locating each field. To exploit repetition of data within a field (illustrated below), the organizers have also provided a lexicon or dictionary with word frequency data for each field type.
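As a rough illustration of how bounding box coordinates might be used to isolate a field, the sketch below crops a rectangular region out of a page image. The function name `crop_field`, the `(left, top, right, bottom)` coordinate order with exclusive right and bottom edges, and the representation of the image as a list of pixel rows are all assumptions made for this example; the actual coordinate format is defined on the Protocol page.

```python
def crop_field(image, box):
    """Return the sub-image inside a bounding box.

    Assumptions (hypothetical, not the competition's format):
    - image is a list of rows, each row a list of pixel values
    - box is (left, top, right, bottom) in pixels, with right and
      bottom exclusive
    """
    left, top, right, bottom = box
    # Slice the rows first, then the columns within each kept row.
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x6 "page" whose pixel values encode their own position (row*10 + col).
page = [[r * 10 + c for c in range(6)] for r in range(4)]
field = crop_field(page, (1, 1, 4, 3))
```

With a real image library the same idea applies; only the slicing call changes.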

The challenges motivating this competition are illustrated with the following image of a UK 1911 census document that contains (in this case) ten rows (records), each comprising 15 columns. A particular entry in this table at a specific row and column is referred to as a field. A field contains handwritten word(s) of known type(s), which in this example are Name, Surname, Relationship, Age, and Place (as highlighted below).

Specific fields, indicated in red, yellow, green and blue highlighting, have been identified within the original document. Bounding box coordinates are provided for locating each field of interest. The following are four snippets extracted from the above image:

In addition to the predictable structure or layout of the documents, we also observe that there is frequently repetition in the data within the fields as illustrated below. Lexicons with word frequency data are provided for each field of interest.

We anticipate that many algorithms and techniques, including but not limited to word recognition, word spotting, and clustering, will be useful in solving the problems presented in this competition.

Information regarding deadlines, registration, and resources provided for participating in this competition is accessible via the corresponding tabs in the menu above. Details pertaining to file formats, executables, and scoring are located on the Protocol and Evaluation pages. Finally, categories and amounts of cash prizes are described on the Awards tab, where the winners and results will be posted following the competition.