A few years ago I wrote an app for the Volpi book, in Angular 5, to make it easier to study the text. For example, double-clicking on a Greek word popped up a screen about that Greek word.
This year I've been working on two texts and trying to create apps for them. I started with the first ~200 pages of GA 19 (Plato's Sophist) and wrote a Blazor app. Blazor is an experimental framework for writing code in browsers using WebAssembly. The app hosts the German pages and English translation. It has some features like "hover over a Greek word to see glossary look-up". Some features, like responding to a double-click on a selected word, don't quite work in Blazor yet. I need to do some more experimenting with Blazor as it matures.
In the summer I joined a B&T reading group, and created a B&T app, with the first ~100 pages of that text. In addition to the German and English, this app also hosts Tom Sheehan's paraphrastic condensation; users can flip between English translation and paraphrase.
Creating apps for texts is labor-intensive if the text is not ready -- e.g., if it needs OCR corrections.
Monday, November 11, 2019
Saturday, May 4, 2019
How to sort Greek in C#
Found here.
using System;
using System.Collections.Generic;

class GreekComparer : IComparer<string>
{
    public int Compare(string s1, string s2)
    {
        // Decompose (FormD) so precomposed characters and combining
        // diacritics compare as equal
        return String.Compare(s1.Normalize(System.Text.NormalizationForm.FormD),
                              s2.Normalize(System.Text.NormalizationForm.FormD),
                              StringComparison.InvariantCultureIgnoreCase);
    }
}

GreekComparer gc = new GreekComparer();
wordList.Sort(gc); // List<T>.Sort sorts in place and returns void
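The same normalization-insensitive comparison can be sketched in JavaScript with Intl.Collator, which handles decomposition internally (the locale choice and word list here are just illustrative):

```javascript
// Sort Greek words ignoring accents and case, analogous to the C# comparer.
// 'el' is the Greek locale; sensitivity 'base' ignores diacritics and case.
const collator = new Intl.Collator('el', { sensitivity: 'base' });

const words = ['ψυχή', 'λόγος', 'ἀλήθεια'];
words.sort(collator.compare);
console.log(words); // ['ἀλήθεια', 'λόγος', 'ψυχή']
```

Note that 'ἀλήθεια' sorts under alpha even though it starts with the polytonic character U+1F00, which is the point of the normalization-aware comparison.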
Saturday, March 16, 2019
Normalize Unicode
When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
So important.
In JavaScript:
const normalized = str.normalize('NFC')
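A minimal example of why this matters -- the two spellings of "Zoë" below look identical but contain different code points until normalized:

```javascript
const a = 'Zo\u00EB';   // 'Zoë' with a single precomposed 'ë' (U+00EB)
const b = 'Zoe\u0308';  // 'Zoë' as plain 'e' plus combining diaeresis (U+0308)

console.log(a === b);                                    // false
console.log(a.normalize('NFC') === b.normalize('NFC'));  // true
```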
Saturday, March 9, 2019
Make a Heidegger tool. Part I: assemble an archive
You're going to need Heidegger's texts. There are 100 volumes in his complete works, and more than half of the German volumes are shared on the internet. A few are very good: all the characters in the electronic text are correct. A few are almost useless: the text can't be searched. Most of the usable texts are shared as PDF files.
1. Get the best version of each volume
Most PDFs have images of the pages and the text extracted from the images by OCR. For our purposes, what matters is the quality of the text, not the quality of the images. If you have better OCR software, extract better text from the images yourself.
2. Convert the text to HTML files
Export the text from the PDF and create an HTML page for each page of relevant text in the book. Try to carry over as much information as possible from the PDF, such as fonts (e.g., italics).
You will now have an archive of the best available texts. It'll be 90% reliable for simple words, but only 10% reliable for words with umlauts or Greek.
3. Correct the text
Update the text in the HTML pages to match what is on the printed page, so that it can be searched reliably. Most of the errors will be the result of OCR, which makes consistent mistakes, so you can apply the same corrections across all the text files. Run a spellchecker; you'll need to add Heidegger's neologisms to its dictionary.
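Because the OCR mistakes are consistent, they can be fixed in bulk with a small script. Here is a sketch in JavaScript; the substitution table is hypothetical, not an actual list of errors from the GA texts:

```javascript
// Hypothetical table of consistent OCR misreadings and their corrections.
const corrections = new Map([
  ['Scin', 'Sein'],     // 'e' misread as 'c'
  ['Dascin', 'Dasein'],
  ['fiir', 'für'],      // lost umlaut
]);

// Apply every correction to the text of one page.
function correctOcr(text) {
  for (const [bad, good] of corrections) {
    text = text.split(bad).join(good); // replace every occurrence
  }
  return text;
}

console.log(correctOcr('Das Scin des Dascins')); // 'Das Sein des Daseins'
```

To run this across the archive, loop over the HTML files (e.g., with fs.readdirSync) and rewrite each one through the same function.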
4. Put the HTML files on a web server
5. Have search engines index the pages
You can look up individual pages or search inside all texts.
The goal is to have 100% of the texts, 100% correct, so that they can be searched reliably.
Sunday, February 10, 2019
The Blazor Sofist
There's a new low-level language in browsers called WebAssembly. Microsoft has built an experimental mechanism, called Blazor, for running .NET virtual machines on WebAssembly. That means .NET languages like C# can now be used to write apps for web browsers.
I've written a simple Blazor app for a new, post-codex "book": the first third of Heidegger's lectures on Plato's Sophist, which are about Aristotle's Metaphysics and Nicomachean Ethics. The app has the English and German text. I call the app Preliminary Sofist.
I've written C# code to link Greek words on a page to their wiki entries, if an entry exists, the first time the word appears on a page. I've also written code to decorate Greek words with their English translations, when they appear in the glossary at the back of the book, so that the translation appears when the pointer hovers over the Greek word.
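The app does this decoration in C#, but the idea can be sketched in JavaScript like this (the glossary entries and the character ranges are illustrative assumptions, not the app's actual data):

```javascript
// Hypothetical glossary; the real one comes from the back of the book.
const glossary = { 'λόγος': 'discourse', 'ἀλήθεια': 'unconcealment' };

// Wrap each Greek word found in the glossary in a <span> whose title
// attribute shows the English on hover. The character classes cover the
// basic Greek and Greek Extended (polytonic) Unicode blocks.
function decorate(html) {
  return html.replace(/[\u0370-\u03FF\u1F00-\u1FFF]+/g, (word) =>
    word in glossary
      ? `<span title="${glossary[word]}">${word}</span>`
      : word);
}

console.log(decorate('ὁ λόγος'));
// 'ὁ <span title="discourse">λόγος</span>'
```

Only λόγος is wrapped, since ὁ is not in the glossary; words would need to be NFC-normalized first for the look-up to be reliable.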
When Blazor's features improve, I want to add a dialog with Greek help that pops up on double-clicking a Greek word, like I did with the Angular 5 Volpi app last year.
The app only works in Chrome. I use the IFrame srcdoc attribute to insert the page content into the app, and Edge doesn't fully support HTML5.
I couldn't figure out how to get the app's URL routing to work from a sub-folder on a web site, so I had to host the app on its own domain.
I still have to proof-read and correct OCR errors in two-thirds of the German text, and add all the Greek words to the glossary and wiki links.
Tuesday, May 8, 2018
Why I haven't updated the GA App and its book links in a few years
What happened is that I was putting the books on my OneDrive and then creating a link URL on the site. The URLs had a fixed format, with an id parameter. I stored the id with the book record in the books table, and when generating the page HTML, I generated the book's OneDrive URL from the id. A couple of years ago, OneDrive changed the format of their URLs. The old book URLs still work, but new URLs have a different format, so I need to rewrite the code that generates the URL. But I can't. I wrote the GA app four years ago, and Microsoft has since changed its web tools (moving from the old .NET to .NET Core -- open source, works on Mac and Linux), and Visual Studio no longer supports the old .NET tools I used four years ago. So I need to upgrade to .NET Core to compile any new code and rebuild the website, and I haven't gotten around to that upgrade project yet. Currently I'm more interested in learning client technologies: pushing more work to the browser, just serving files, and not having to keep a database running in the cloud to support the app.
Sunday, February 18, 2018
How to generate the Volpi book into a single file
Each page in the Volpi translation app is in a separate HTML file. I edit and update the individual files, and I don't maintain a single file with all of the book's pages.
However, you can generate a single HTML file with all the pages yourself, by pulling the individual pages from the app and concatenating them into a single file.
Here's how.
First create a start.html file with the HTML tags at the top:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Franco Volpi - Heidegger and Aristotle</title>
</head>
<body>
<h1>Franco Volpi - Heidegger and Aristotle</h1>
<h2>Translated by Pete Ferreira</h2>
<hr/><br/><br/>
Then create a end.html file with the HTML tags at the bottom:
</body>
</html>
Then use this PowerShell script to concatenate start.html, all the pages from the app, and end.html.
$bookContent = Get-Content 'start.html'
For ($pagenumber=1; $pagenumber -lt 118; $pagenumber++) {
$paddedpagenumber = ("{0:D3}" -f $pagenumber)
$url = "http://beyng.com/volpi/assets/EN/Volpi.$paddedpagenumber.html"
$resp = Invoke-WebRequest -URI $url
$bookContent += "<br/><br/><p style=""text-align:center"">$pagenumber</p>`r`n"
$bookContent += [System.Text.Encoding]::UTF8.GetString($resp.RawContentStream.ToArray())
}
$bookContent += Get-Content 'end.html'
$bookContent | Out-File VolpiBook.html
In between each page, the script inserts HTML with the page number. The padded page number is required for the page URLs -- e.g., page 1 as 001. The UTF8.GetString call is required to keep the Greek characters from getting munged.