Wednesday, November 18, 2015

The Lurking Ligature

I ran into another horror on the road to making texts instantly available. In typography, a ligature is a special character made by joining two letters. For example, æ. There are also ligatures for prettifying text. There's an fl ligature, which won't render in Blogger, so I can't show it in this sentence. There's an Adobe PDF editor that includes ligatures as an option when you save a document. It will convert the fl's in the document into the special fl character, with the tops of the f and l connected, like the fl in 'reflections' from this book:
Yet another anomaly confusing machines, and people searching for 'reflections', although I expect Windows and Google handle ligatures automatically. Something to consider if you author PDF docs, and want your doc to be find-able.

Sunday, November 15, 2015

Added SundZ to GA App

Earlier this year I added full text search to the GA App, and I started adding content to the GA App. I started with some papers, and then added pages they cited, mainly individual German pages from GA volumes and pages from translated books. I would manually create each page in HTML and then manually add a hyperlink to the page, to the citation in the paper. Generally I would pick pages that were cited a lot, so adding a new page would allow me to add hyperlinks to multiple papers.

I would also add the new pages to the full text search index.

I created and added a couple hundred HTML pages, but creating each page manually didn't scale. I would take too long to create HTML pages for most of Heidegger's works. So at the beginning of summer I set that aside, and started looking into ways of generating the HTML pages automatically.

Most of the texts available on the internet are in PDF files. I spent some time experimenting with different tools and libraries of PDF functions. Getting the raw text out of a PDF file isn't difficult, but I wanted the HTML pages to look as much like the page in the book as possible. That meant getting information about the fonts (size, italics, etc.) and position (is a paragraph indented?).

I settled on using a library called pdfbox (it's on GitHub). I see they have a new version, 2.0, released last month. I'll have to try out the new one. I used the .NET version of 1.8.7. With pdfbox, I can get a list of all the characters on a page, with the coordinates of each, and font information.

I tried several different PDF files of books, and they were all sufficient different. That I ended up writing different programs to extract pages form different books. The programs have a common pattern and share common functions, but a single program that could handle all the books would be too complex. I decided to concentrate on the program to generate HTML pages from a PDF file of Sein und Zeit., SundZSteller.EXE. By October I had it working, generating HTML pages for all 437 pages of text in the book.

To date, when I need to add new data to the GA App, I rebuild the database; throw away the old database and create a new one. To add all 437 book pages, I wanted to figure out how to add a book with rebuilding everything. So I wrote a program that (1) adds the pages to Azure Storage, (2) adds the pages to Azure Search, and (3) adds the pages to the GA database. I finished the program yesterday and ran it and added the 437 pages to the GA App.

When someone searches for a term in the GA App, and Azure Search finds the term in a SundZ page, it returns a link to that page (and links to anywhere else the term was found). The link fetches the page's data from the database, displays the book page details web page with the data, and fetches the book page HTML from Azure Storage, and also displays the book page in the same web page.

Now I will go on to the next book, and write a program to extract all the pages from its PDF file. It should take less effort the second time.