Saturday, March 9, 2019

Make a Heidegger tool. Part I: assemble an archive

You're going to need Heidegger's texts. There are 100 volumes in his complete works. More than half of the German volumes are shared on the internet. A few are very good: every character in the electronic text is correct. A few are almost useless because the text can't be searched. Most of the usable texts are shared as PDF files.

1. Get the best version of each volume
Most PDFs contain images of the pages plus a text layer extracted from those images by OCR. For our purposes, what matters is the quality of the text, not the quality of the images.
If you have better OCR software than the one that produced the PDF, rerun it on the page images to extract better text.
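
One rough way to compare two versions of the same volume is to score the exported text of each. The sketch below is only a heuristic, and the names plausible_token_ratio and best_version are made up for illustration. It counts the share of tokens that consist purely of letters, on the assumption that OCR noise tends to show up as digits or symbols inside words. It cannot detect a wrong letter that is still a letter, so it only helps to rule out the worst versions.

```python
import re

# A token counts as "plausible" if it is made only of letters (Unicode-aware,
# so umlauts, ß and Greek pass), optionally joined by internal hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def plausible_token_ratio(text: str) -> float:
    """Return the share of tokens that look like real words (0.0 to 1.0)."""
    # Strip common punctuation from the edges of each whitespace-separated token.
    tokens = [t.strip(".,;:!?()[]\"'»«„“”") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0  # no text layer at all, e.g. an unsearchable PDF
    good = sum(1 for t in tokens if WORD.fullmatch(t))
    return good / len(tokens)

def best_version(candidates: dict[str, str]) -> str:
    """Given {version name: exported text}, return the best-scoring name."""
    return max(candidates, key=lambda name: plausible_token_ratio(candidates[name]))
```

For example, best_version({"scan_a": text_a, "scan_b": text_b}) returns the key of the text with the higher ratio; the exported texts themselves have to come from whatever PDF tool you use.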

2. Convert the text to HTML files
Export the text from the PDF. Create one HTML page for each page of relevant text in the book. Try to keep as much information from the PDF as possible, such as font styles (e.g. italics).
You will now have an archive of the best available texts. It'll be about 90% reliable for simple words, but only about 10% reliable for words with umlauts or Greek.
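
A minimal sketch of the page-per-file conversion, assuming the text has already been exported to a single string in which pages are separated by a form feed character (some PDF text exporters do this; check yours). The function names, the volume id "ga02", and the file naming scheme are hypothetical. This version keeps only paragraphs and does not carry over font information such as italics.

```python
from html import escape
from pathlib import Path

def page_to_html(volume: str, page_no: int, text: str) -> str:
    """Wrap the text of one book page in a small standalone HTML document."""
    # Treat blank lines as paragraph breaks; escape <, > and & in the text.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="de">\n'
        f'<head><meta charset="utf-8"><title>{escape(volume)}, p. {page_no}</title></head>\n'
        f"<body>\n{body}\n</body>\n</html>\n"
    )

def write_volume(volume: str, exported_text: str, out_dir: Path,
                 first_page: int = 1) -> list[Path]:
    """Split exported text on form feeds and write one HTML file per page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for offset, page_text in enumerate(exported_text.split("\f")):
        if not page_text.strip():
            continue  # skip blank pages but keep the page numbering
        page_no = first_page + offset
        path = out_dir / f"{volume}_p{page_no:04d}.html"
        path.write_text(page_to_html(volume, page_no, page_text), encoding="utf-8")
        written.append(path)
    return written
```

The first_page parameter is there because the first page of the PDF rarely corresponds to printed page 1; naming files after the printed page number makes it easier to check a page against the book.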

3. Correct the text
Correct the text in the HTML pages so that it matches the printed page; only then can it be searched reliably. Most of the errors come from OCR, which makes the same mistakes consistently, so you can apply each correction across all the text files at once. Run a spellchecker too; you'll need to add Heidegger's neologisms to its dictionary.
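
Because the OCR errors repeat, a table of corrections can be applied to every file in one pass. The sketch below is one possible way to do it; the entries in CORRECTIONS are illustrative guesses, not a real list, and the function names are hypothetical. It replaces whole words only, and it treats the HTML as plain text, so keep tag and attribute names out of the table. The unknown_words helper is a stand-in for a spellchecker: it is meant for text without markup and lists words missing from a word list, which is where you would add the neologisms.

```python
import re
from pathlib import Path

# Illustrative entries only; build the real table from your own OCR output.
CORRECTIONS = {
    "Uber": "Über",
    "fiir": "für",
    "Seyri": "Seyn",
}

def correct_text(text: str, corrections: dict[str, str] = CORRECTIONS) -> str:
    """Replace each misread word with its correction, whole words only."""
    # Longest keys first, so a longer entry wins over a shorter one it contains.
    words = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)

def correct_files(directory: Path, corrections: dict[str, str] = CORRECTIONS) -> int:
    """Apply the corrections to every .html file; return how many files changed."""
    changed = 0
    for path in sorted(directory.glob("*.html")):
        original = path.read_text(encoding="utf-8")
        fixed = correct_text(original, corrections)
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")
            changed += 1
    return changed

def unknown_words(text: str, known_words: set[str]) -> list[str]:
    """List words not in known_words (a lowercase set that includes neologisms)."""
    words = set(re.findall(r"[^\W\d_]+", text))
    return sorted(w for w in words if w.lower() not in known_words)
```

Reviewing the output of unknown_words is also a convenient way to find new entries for the corrections table.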

4. Put the HTML files on a web server

5. Have search engines index the pages
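
One common way to help search engines find every page is to publish a sitemap, an XML file listing the URLs of the site. The sketch below builds one from a directory of HTML files. The build_sitemap name and the base URL https://example.org/heidegger/ are placeholders; whether and when a search engine actually indexes the pages is up to the search engine.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(base_url: str, html_dir: Path) -> str:
    """Return sitemap XML listing every .html file in html_dir under base_url."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in sorted(html_dir.glob("*.html")):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + path.name
    # Add the XML declaration by hand so the declared encoding is always UTF-8.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(urlset, encoding="unicode")
```

Writing the result to sitemap.xml next to the HTML files, for example with Path("sitemap.xml").write_text(build_sitemap("https://example.org/heidegger/", Path("html")), encoding="utf-8"), gives search engines a single file to read.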

You can then look up individual pages or search across all the texts.

The goal is to have 100% of the texts, and to have them 100% correct, so that they can be searched reliably.