Tuesday, May 8, 2018
Why I haven't updated the GA App and its book links in a few years
What happened is that I was putting the books on my OneDrive and then creating a link URL on the site. The URLs had a fixed format, with an id parameter. I store the id with the book record in the books table, and when generating the page HTML, I generate the book's OneDrive URL from the id. A couple of years ago, OneDrive changed the format of their URLs. The old book URLs still work, but new URLs have a different format. So I need to rewrite the code that generates the URL.
But I can't. I wrote the GA app four years ago, and Microsoft has since changed its web tools (moving from the old .NET to .NET Core, which is open source and works on Mac and Linux), and Visual Studio no longer supports the old .NET tools I used back then. So I need to upgrade to .NET Core to compile any new code and rebuild the website. And I haven't gotten around to that upgrade project yet. Currently I'm more interested in learning client technologies: pushing more work to the browser, just serving files, and not having to keep a database running in the cloud to support the app.
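For the record, the URL scheme itself was simple. Here's a minimal C# sketch of the idea (the BookRecord type and the URL template are hypothetical placeholders for illustration, not the actual OneDrive format):

// Hypothetical sketch of the id-to-URL scheme described above.
// The template below is a placeholder, NOT the real OneDrive URL format.
public class BookRecord
{
    public string Title { get; set; }
    public string OneDriveId { get; set; }  // the id stored with the book record
}

public static class BookLinks
{
    // The old code assumed one fixed URL template; when OneDrive
    // changed its format, this is what needed rewriting.
    private const string UrlTemplate = "https://onedrive.example/?id={0}";

    public static string GetBookUrl(BookRecord book)
    {
        return string.Format(UrlTemplate, book.OneDriveId);
    }
}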
Sunday, February 18, 2018
How to generate the Volpi book into a single file
Each page in the Volpi translation app is in a separate HTML file. I edit and update the individual files, and I don't maintain a single file with all of the book's pages.
However, you can generate a single HTML file with all the pages yourself, by pulling the individual pages from the app and concatenating them into a single file.
Here's how.
First create a start.html file with the HTML tags at the top:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Franco Volpi - Heidegger and Aristotle</title>
</head>
<body>
<h1>Franco Volpi - Heidegger and Aristotle</h1>
<h2>Translated by Pete Ferreira</h2>
<hr/><br/><br/>
Then create an end.html file with the HTML tags at the bottom:
</body>
</html>
Then use this PowerShell script to concatenate start.html, all the pages from the app, and end.html.
$bookContent = Get-Content 'start.html'
For ($pagenumber = 1; $pagenumber -lt 118; $pagenumber++) {
    # Page URLs use a zero-padded page number, e.g. page 1 as 001
    $paddedpagenumber = ("{0:D3}" -f $pagenumber)
    $url = "http://beyng.com/volpi/assets/EN/Volpi.$paddedpagenumber.html"
    $resp = Invoke-WebRequest -URI $url
    # Insert a centered page number between pages
    $bookContent += "<br/><br/><p style=""text-align:center"">$pagenumber</p>`r`n"
    # Decode the raw bytes as UTF-8 so the Greek characters aren't munged
    $bookContent += [System.Text.Encoding]::UTF8.GetString($resp.RawContentStream.ToArray())
}
$bookContent += Get-Content 'end.html'
# Write out as UTF-8 to match the charset declared in start.html
$bookContent | Out-File VolpiBook.html -Encoding UTF8
Between each page, the script inserts HTML with the page number. The padded page number is required for the page URLs (e.g. page 1 as 001), and the UTF8.GetString call keeps the Greek characters from getting munged.
Wednesday, November 15, 2017
Volpi book
In April, after talking to some folks at Heidegger Circle 2017 in Walla Walla, I started translating Franco Volpi's Heidegger e Aristotele. I finished my first pass at the beginning of November. I now need to ensure I've translated words consistently, and that the English is readable. There are a few diagrams I need to render in SVG. But now that I have all the English text in HTML, I can create an app to review the pages.
I grabbed the Italian text from an e-book that circulates. I used a website to convert the Kindle file to a PDF. Then I used the tools I've developed for other PDF books to get the text out into an HTML file per paragraph. The Kindle version of the book doesn't have page numbers - page 1 is the cover.
With the previous book I extracted from PDF to HTML last year, I created an AngularJS app to display the book in browsers, making it easy to page through or jump to any page. Earlier this year I wrote an app to try out Angular 2.0. Angular is currently at version 5.0, so I decided to create the app for the Volpi book with the latest Angular and figure out how it works. I ran through the tutorial at angular.io last weekend, and today, on a day off, I wrote an app for the book.
And it's fully functional: I can navigate from the ToC to any page, page through, jump to a page, and flip to the Italian text. And I got it running on beyng.com. Now I just need to pimp it up with some CSS. And add filters to allow users to select their favorite English word for controversial Heidegger terms. And finish the translation.
Wednesday, March 1, 2017
Last year, after my last post, I decided what I really needed were true (correct) texts to work with.
I took the C# program I wrote to extract the text of SuZ from its PDF file, and modified the code to work with the PDF for another book. And then I repeated the exercise, creating another program for the next PDF book. The PDF of each book is sufficiently different from the others that I haven't been able to write a general program for converting PDF pages to HTML pages. I wrote programs to convert the PDFs for Logic, Being & Truth, and Basic Concepts of Aristotelian Philosophy. I started from the highest quality PDFs, so that the generated HTML pages would not require any editing. I modified the program I had written to upload the SuZ pages to the cloud and index them in the GA App's search system.
In the summer I went to a seminar with assigned Heidegger texts. I wrote programs for the PDFs of the assigned books, to convert their pages to HTML. Because the PDFs were low quality, I had to edit corrections on all the pages. I was learning AngularJS at the time, so I wrote an Angular app for each of the texts, to be able to easily find pages during the seminar. Links to the apps are in this post. I also added the pages to the GA App, to be able to search them.
At the Heidegger Circle in Chicago, I agreed with Andrew Mitchell to host a seminar on his book The Fourfold. When Andrew picked the Heidegger texts for the seminar, I created programs to extract the text from the PDF files, and wrote Angular apps to page through each text. For a couple of the essays, I also created HTML pages for their corresponding German pages from the GA, and linked the translation pages to the original pages. I created HTML pages for all the Bremen lectures, and Andrew's book, and indexed everything in the GA App.
I expect to keep generating more pages and adding them to the GA App's index.
Saturday, April 2, 2016
I think I'll be moving the GA App to dot net core this year, for the teaching-myself-current-tech side of the project. I built a dot net core web site for a small business last month, and that went pretty smoothly with Visual Studio.
Now I want to get more minimalist. This morning I figured out the fewest steps to getting a Hello World console app built and running, with dot net core RC2. And I'm sure I'll have to refer back to it.
Install Node.js to get npm.
npm install -g yo generator-aspnet
yo aspnet
Tell it to gen Console App HelloWorld
cd .\HelloWorld\
dnvm upgrade -r coreclr
dnu restore
dnx run
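For reference, the generated console app is just the classic hello-world Program.cs, roughly this (a sketch from memory, not the exact template output):

using System;

namespace HelloWorld
{
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Hello World");
        }
    }
}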
Wednesday, November 18, 2015
The Lurking Ligature
I ran into another horror on the road to making texts instantly available. In typography, a ligature is a special character made by joining two letters. For example, æ. There are also ligatures for prettifying text. There's an fl ligature, which won't render in Blogger, so I can't show it in this sentence. There's an Adobe PDF editor that includes ligatures as an option when you save a document. It will convert the fl's in the document into the special fl character, with the tops of the f and l connected, like the fl in 'reflections'.
Yet another anomaly confusing machines, and people searching for 'reflections', although I expect Windows and Google handle ligatures automatically. Something to consider if you author PDF docs and want your doc to be findable.
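If you're generating or extracting text yourself, Unicode normalization is the standard fix: NFKC decomposes compatibility characters like the fl ligature back into plain letters. A minimal C# sketch:

using System;
using System.Text;

class LigatureDemo
{
    static void Main()
    {
        // "re\uFB02ections" contains U+FB02, the fl ligature
        string extracted = "re\uFB02ections";

        // NFKC normalization decomposes the single fl glyph back into 'f' + 'l'
        string searchable = extracted.Normalize(NormalizationForm.FormKC);

        Console.WriteLine(searchable);                  // reflections
        Console.WriteLine(searchable == "reflections"); // True
    }
}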
Sunday, November 15, 2015
Added SundZ to GA App
Earlier this year I added full text search to the GA App, and I started adding content to it. I started with some papers, and then added pages they cited, mainly individual German pages from GA volumes and pages from translated books. I would manually create each page in HTML and then manually add a hyperlink from the citation in the paper to the page. Generally I would pick pages that were cited a lot, so adding a new page would allow me to add hyperlinks to multiple papers.
I would also add the new pages to the full text search index.
I created and added a couple hundred HTML pages, but creating each page manually didn't scale. It would take too long to create HTML pages for most of Heidegger's works. So at the beginning of summer I set that aside, and started looking into ways of generating the HTML pages automatically.
Most of the texts available on the internet are in PDF files. I spent some time experimenting with different tools and libraries of PDF functions. Getting the raw text out of a PDF file isn't difficult, but I wanted the HTML pages to look as much like the page in the book as possible. That meant getting information about the fonts (size, italics, etc.) and position (is a paragraph indented?).
I settled on using a library called pdfbox (it's on GitHub). I see they have a new version, 2.0, released last month. I'll have to try out the new one. I used the .NET version of 1.8.7. With pdfbox, I can get a list of all the characters on a page, with the coordinates of each, and font information.
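The per-character API looks roughly like this (a sketch assuming the IKVM-based .NET build of pdfbox 1.8, which keeps the Java class and method names; written from memory, so treat the details as approximate):

// Dumps every character on a page with its position and font info.
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public class CharDumper : PDFTextStripper
{
    // pdfbox calls this once per character as it processes the page
    protected override void processTextPosition(TextPosition text)
    {
        Console.WriteLine("'{0}' at ({1}, {2}) font {3} size {4}",
            text.getCharacter(),          // the character itself
            text.getXDirAdj(),            // x coordinate on the page
            text.getYDirAdj(),            // y coordinate on the page
            text.getFont().getBaseFont(), // font name (for detecting italics etc.)
            text.getFontSize());
    }

    public static void Dump(string path, int page)
    {
        PDDocument doc = PDDocument.load(path);
        CharDumper stripper = new CharDumper();
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(doc);  // triggers processTextPosition for each character
        doc.close();
    }
}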
I tried several different PDF files of books, and they were all sufficiently different that I ended up writing a different program to extract pages from each book. The programs have a common pattern and share common functions, but a single program that could handle all the books would be too complex. I decided to concentrate on the program to generate HTML pages from a PDF file of Sein und Zeit, SundZSteller.EXE. By October I had it working, generating HTML pages for all 437 pages of text in the book.
To date, when I need to add new data to the GA App, I rebuild the database: throw away the old database and create a new one. To add all 437 book pages, I wanted to figure out how to add a book without rebuilding everything. So I wrote a program that (1) adds the pages to Azure Storage, (2) adds the pages to Azure Search, and (3) adds the pages to the GA database. I finished the program yesterday, ran it, and added the 437 pages to the GA App.
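The Azure Storage step is the simplest of the three. With the storage client library of that era, it amounts to something like this (a sketch; the container name and file layout are made-up placeholders):

// Sketch of step (1) using the Microsoft.WindowsAzure.Storage client library.
// Connection string, container name, and file names are placeholders.
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class PageUploader
{
    static void UploadPages(string connectionString, string folder)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("sundz-pages");
        container.CreateIfNotExists();

        // One HTML file per book page
        foreach (string path in Directory.GetFiles(folder, "*.html"))
        {
            CloudBlockBlob blob = container.GetBlockBlobReference(Path.GetFileName(path));
            blob.Properties.ContentType = "text/html; charset=utf-8";
            using (FileStream stream = File.OpenRead(path))
            {
                blob.UploadFromStream(stream);
            }
        }
    }
}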
When someone searches for a term in the GA App and Azure Search finds the term in a SundZ page, it returns a link to that page (and links to anywhere else the term was found). The link fetches the page's data from the database, displays the book page details web page with that data, fetches the book page HTML from Azure Storage, and displays the book page in the same web page.
Now I will go on to the next book, and write a program to extract all the pages from its PDF file. It should take less effort the second time.