machinations

Saturday, May 4, 2019

How to sort Greek in C#

Found here.

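using System;
using System.Collections.Generic;

class GreekComparer : IComparer<string>
{
    // Decompose accented letters (FormD), then compare culture-invariantly, ignoring case
    public int Compare(string s1, string s2)
    {
        return String.Compare(s1.Normalize(System.Text.NormalizationForm.FormD),
                              s2.Normalize(System.Text.NormalizationForm.FormD),
                              StringComparison.InvariantCultureIgnoreCase);
    }
}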

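GreekComparer gc = new GreekComparer();
// List<T>.Sort sorts in place and returns void, so sort a copy of the list
sortedWordList = new List<string>(wordList);
sortedWordList.Sort(gc);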
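
For comparison, here is a rough JavaScript sketch of the same idea: decompose each word, then compare. It is only an approximation and not the .NET comparison algorithm. It drops the combining marks outright and compares by code unit, so the final sigma sorts just before the ordinary sigma. The function names greekSortKey, compareGreek and sortGreek are made up for this example.

```javascript
// Build a sort key: decompose (NFD), drop combining marks, then lowercase.
function greekSortKey(word) {
  return word.normalize('NFD').replace(/\p{M}/gu, '').toLowerCase();
}

// Comparer in the spirit of GreekComparer (an approximation, see the note above).
function compareGreek(a, b) {
  const ka = greekSortKey(a);
  const kb = greekSortKey(b);
  if (ka < kb) return -1;
  if (ka > kb) return 1;
  return 0;
}

// Returns a sorted copy; the input array is left untouched.
function sortGreek(words) {
  return [...words].sort(compareGreek);
}

console.log(sortGreek(['ὥρα', 'λόγος', 'Ἀγαθόν', 'ἀρετή']));
```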

Saturday, March 16, 2019

Normalize Unicode

When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings

So important.

In JavaScript:

const normalized = str.normalize('NFC')
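
A small sketch of why this matters: the same name can be stored with a precomposed ë or with a plain e followed by a combining diaeresis. The two strings look identical but compare as different until both are normalized. The helper name sameText is made up for this example.

```javascript
// Two spellings of the same name
const precomposed = 'Zo\u00EB';  // ë as a single code point
const decomposed = 'Zoe\u0308';  // e followed by a combining diaeresis

// Hypothetical helper: compare two strings after normalizing both to NFC
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(precomposed === decomposed);         // false
console.log(sameText(precomposed, decomposed));  // true
```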

Saturday, March 9, 2019

Make a Heidegger tool. Part I: assemble an archive

You're going to need Heidegger's texts. There are 100 volumes in his complete works. More than half of the German volumes are shared on the internet. A few are very good: all the characters in the electronic text are correct. A few are almost useless because the text can't be searched. Most of the usable texts are shared as PDF files.

1. Get the best version of each volume
Most PDFs have images of the pages and the text extracted (OCR) from the images. For our purposes, what matters is the quality of the text, not the quality of the images.
If you have better OCR software, use it to extract better text from the images.

2. Convert the text to HTML files
Export the text from the PDF. Create an HTML page for each page of relevant text in the book. Keep as much information from the PDF as possible, such as font styles (e.g. italics).
You will now have an archive of the best available texts. It'll be 90% reliable for simple words and 10% reliable for words with umlauts or Greek.

3. Correct the text
Correct the text in the HTML pages so that it matches what is on the printed page and can be searched reliably. Most of the errors will come from OCR, which makes the same mistakes consistently, so you can apply each correction across all the text files. Run a spellchecker; you'll need to add Heidegger's neologisms to its dictionary.

4. Put the HTML files on a web server

5. Have search engines index the pages

You can look up individual pages or search inside all texts.

The goal is to have 100% of the texts, 100% correct, so that they can be searched reliably.
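
Because OCR repeats the same mistakes, the corrections in step 3 can be kept in a table and applied to every page. Here is a minimal sketch, assuming the page text has already been read into a string. The function names and the two table entries are illustrative only, and reading and writing the files is left out.

```javascript
// Illustrative correction table: OCR misreading -> correct text.
// These two entries are examples, not a real list.
const corrections = new Map([
  ['fiir', 'für'],
  ['iiber', 'über'],
]);

// Escape regular-expression syntax characters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Apply every correction to whole words only, so that a misreading
// inside a longer word is left for manual review.
function correctOcr(text, table) {
  let result = text;
  for (const [wrong, right] of table) {
    const pattern = new RegExp(`(?<!\\p{L})${escapeRegExp(wrong)}(?!\\p{L})`, 'gu');
    // Use a function so "$" in the replacement is never treated specially.
    result = result.replace(pattern, () => right);
  }
  return result;
}

console.log(correctOcr('fiir das Sein, iiber das Seiende', corrections));
```

Matching whole words only is a cautious choice: it misses misreadings inside compounds, but it avoids damaging words that merely contain the same letters.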

Sunday, February 10, 2019

The Blazor Sofist

There's a new low-level language in browsers called WebAssembly. Microsoft has built an experimental mechanism for running .NET virtual machines on WebAssembly called Blazor. That means that .NET languages like C# can now be used to write apps for web browsers.

I've written a simple Blazor app for a new, post-codex "book": the first third of Heidegger's lectures on Plato's Sophist, which are about Aristotle's Metaphysics and Nicomachean Ethics. The app has the English and German text. I call the app Preliminary Sofist.

I've written the C# code to link Greek words on a page to their wiki entry, if that entry exists, the first time the word appears on a page. I've also written code to decorate Greek words with their English translation, if it appears in the glossary at the back of the book, so that the translation appears when the pointer hovers over the Greek word.
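
The app's code is C#, but the idea can be sketched in JavaScript. This is not the app's code, only an illustration. The function names, the example tables, and the wiki base URL are all made up, and the hover text is assumed to come from a plain title attribute.

```javascript
// A Greek word: a Greek letter followed by Greek letters or combining marks.
const GREEK_WORD = /\p{Script=Greek}[\p{Script=Greek}\p{M}]*/gu;

// Escape text for use inside a double-quoted HTML attribute.
function escapeAttr(s) {
  return s.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// wikiEntries: Set of words that have a wiki entry (hypothetical).
// glossary: Map from Greek word to English translation (hypothetical).
function decorateGreek(text, wikiEntries, glossary, wikiBase) {
  const linked = new Set(); // words already linked on this page
  return text.replace(GREEK_WORD, (word) => {
    const key = word.normalize('NFC');
    let html = word;
    const gloss = glossary.get(key);
    if (gloss !== undefined) {
      // The title attribute shows the translation on hover.
      html = `<span title="${escapeAttr(gloss)}">${html}</span>`;
    }
    if (wikiEntries.has(key) && !linked.has(key)) {
      // Link only the first occurrence on the page.
      linked.add(key);
      html = `<a href="${wikiBase}${encodeURIComponent(key)}">${html}</a>`;
    }
    return html;
  });
}

// Example with made-up data and a placeholder wiki address.
const exampleWiki = new Set(['λόγος']);
const exampleGlossary = new Map([['λόγος', 'discourse']]);
console.log(decorateGreek('Der λόγος und der λόγος', exampleWiki, exampleGlossary, 'https://example.org/wiki/'));
```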

When Blazor's features improve, I want to add a dialog with Greek help, that pops up on double-clicking a Greek word, like I did with the Angular 5 Volpi app last year.

The app only works with Chrome. I use the iframe srcdoc attribute to insert the page content into the app, and Edge doesn't fully support HTML5.

I couldn't figure out how to get the app's URL routing to work from a sub-folder on a web site, so I had to host the app on its own domain.

I still have to proofread and correct OCR errors in two-thirds of the German text, and add all the Greek words to the glossary and wiki links.

Tuesday, May 8, 2018

Why I haven't updated the GA App and its book links in a few years

What happened is that I was putting the books on my OneDrive and then creating a link URL on the site. The URLs had a fixed format, with an id parameter. I store the id with the book record in the books table and, when generating the page HTML, generate the book's OneDrive URL from the id.

A couple of years ago, OneDrive changed the format of its URLs. The old book URLs still work, but new URLs have a different format, so I need to rewrite the code that generates the URL. But I can't. I wrote the GA app four years ago, and since then Microsoft has changed its web tools (moving from the old .NET to the new .NET Core, which is open source and works on Mac and Linux), and Visual Studio no longer supports the old .NET tools I used. So I need to upgrade to .NET Core to compile any new code and rebuild the website.

I haven't got around to that upgrade project yet. Currently I'm more interested in learning client technologies, pushing more work to the browser, just serving files, and not having to keep a database running in the cloud to support the app.

Sunday, February 18, 2018

How to generate the Volpi book into a single file

Each page in the Volpi translation app is in a separate HTML file. I edit and update the individual files, and I don't maintain a single file with all of the book's pages.

However, you can generate a single HTML file with all the pages yourself, by pulling the individual pages from the app and concatenating them into a single file.

Here's how.

First create a start.html file with the HTML tags at the top:


<!DOCTYPE html>
<html>
<head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 <title>Franco Volpi - Heidegger and Aristotle</title>
</head>
<body>
    <h1>Franco Volpi - Heidegger and Aristotle</h1>
    <h2>Translated by Pete Ferreira</h2>
    <hr/><br/><br/>

Then create an end.html file with the HTML tags at the bottom:


</body>
</html>

Then use this PowerShell script to concatenate start.html, all the pages from the app, and end.html.

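# Start with the opening HTML (read as UTF-8 so non-ASCII characters survive)
$bookContent = Get-Content 'start.html' -Encoding UTF8
# Loop over pages 1 through 117
for ($pagenumber = 1; $pagenumber -lt 118; $pagenumber++) {
    # Page URLs use a zero-padded, three-digit page number
    $paddedpagenumber = "{0:D3}" -f $pagenumber
    $url = "http://beyng.com/volpi/assets/EN/Volpi.$paddedpagenumber.html"
    $resp = Invoke-WebRequest -Uri $url
    # Insert the page number ahead of each page's content
    $bookContent += "<br/><br/><p style=""text-align:center"">$pagenumber</p>`r`n"
    # Decode the raw bytes as UTF-8 so the Greek characters are not mangled
    $bookContent += [System.Text.Encoding]::UTF8.GetString($resp.RawContentStream.ToArray())
}
$bookContent += Get-Content 'end.html' -Encoding UTF8
# Write UTF-8 to match the charset declared in start.html
$bookContent | Out-File VolpiBook.html -Encoding utf8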

In between each page, the script inserts HTML with the page number. The padded page number is required for the page URLs - e.g. page 1 as 001. The UTF8.GetString stuff is required to keep the Greek characters from getting munged.
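
The two details called out here, the zero-padded page number and the explicit UTF-8 decoding, can be shown in isolation with a short JavaScript sketch. The URL pattern is taken from the PowerShell script; the function names pageUrl and decodeUtf8 are made up for this example, and fetching the pages is left out.

```javascript
// Build the URL for one page; page 1 becomes Volpi.001.html.
function pageUrl(pageNumber) {
  const padded = String(pageNumber).padStart(3, '0');
  return `http://beyng.com/volpi/assets/EN/Volpi.${padded}.html`;
}

// Decode raw response bytes explicitly as UTF-8, rather than
// letting a default encoding be guessed.
function decodeUtf8(bytes) {
  return new TextDecoder('utf-8').decode(bytes);
}

console.log(pageUrl(1));
// The two bytes 0xCE 0xBB are the UTF-8 encoding of the Greek letter lambda.
console.log(decodeUtf8(new Uint8Array([0xCE, 0xBB])));
```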

Wednesday, November 15, 2017

Volpi book

In April, after talking to some folks at Heidegger Circle 2017 in Walla Walla, I started translating Franco Volpi's Heidegger e Aristotele. I finished my first pass at the beginning of November. I now need to ensure I've translated words consistently, and that the English is readable. There are a few diagrams I need to render in SVG. But now that I have all the English text in HTML, I can create an app to review the pages.

I grabbed the Italian text from an e-book that circulates. I used a website to convert the Kindle file to a PDF. Then I used the tools I've developed for other PDF books to get the text out into an HTML file per paragraph. The Kindle version of the book doesn't have page numbers - page 1 is the cover.

For the previous book I extracted from PDF to HTML last year, I created an AngularJS app to display the book in browsers, making it easy to page through or jump to any page. Earlier this year I wrote an app to try out Angular 2.0. Angular is now at version 5.0, so I decided to create the app for the Volpi book with the latest Angular and figure out how it works. I ran through the tutorial at angular.io last weekend, and today, on a day off, I wrote an app for the book.

And it's fully functional: it can navigate from the ToC to any page, page through, jump to a page, and flip to the Italian text. And I got it running on beyng.com. Now I just need to pimp it up with some CSS. And add filters to allow users to select their favorite English word for controversial Heidegger terms. And finish the translation.