I ran into another horror on the road to making texts instantly available. In typography, a ligature is a special character made by joining two letters. For example, æ. There are also ligatures for prettifying text. There's an fl ligature, which won't render in Blogger, so I can't show it in this sentence. There's an Adobe PDF editor that includes ligatures as an option when you save a document. It will convert the fl's in the document into the special fl character, with the tops of the f and l connected, like the fl in 'reflections' from this book:
Yet another anomaly confusing machines, and people searching for 'reflections', although I expect Windows and Google handle ligatures automatically. Something to consider if you author PDF docs, and want your doc to be find-able.
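For text that has already been extracted from such a PDF, one possible remedy is Unicode compatibility normalization. The following is a minimal sketch, not code from the GA App; the class and method names are made up for illustration. It relies on NFKC mapping presentation forms such as U+FB02 (the fl ligature) back to plain letters, while leaving a real letter like æ alone.

```csharp
using System.Text;

static class Ligatures
{
    // NFKC compatibility normalization replaces presentation forms such as
    // U+FB02 (the fl ligature) with their plain-letter equivalents.
    // Note: NFKC also rewrites other compatibility characters
    // (superscripts, full-width forms), which may or may not be wanted.
    public static string Expand(string text) =>
        text.Normalize(NormalizationForm.FormKC);
}
```

Calling `Ligatures.Expand("re\uFB02ections")` should give back the searchable word "reflections".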
Wednesday, November 18, 2015
Sunday, November 15, 2015
Added SundZ to GA App
Earlier this year I added full text search to the GA App, and I started adding content to the GA App. I started with some papers, and then added pages they cited, mainly individual German pages from GA volumes and pages from translated books. I would manually create each page in HTML and then manually add a hyperlink to the page, to the citation in the paper. Generally I would pick pages that were cited a lot, so adding a new page would allow me to add hyperlinks to multiple papers.
I would also add the new pages to the full text search index.
I created and added a couple hundred HTML pages, but creating each page manually didn't scale. It would take too long to create HTML pages for most of Heidegger's works. So at the beginning of summer I set that aside, and started looking into ways of generating the HTML pages automatically.
Most of the texts available on the internet are in PDF files. I spent some time experimenting with different tools and libraries of PDF functions. Getting the raw text out of a PDF file isn't difficult, but I wanted the HTML pages to look as much like the page in the book as possible. That meant getting information about the fonts (size, italics, etc.) and position (is a paragraph indented?).
I settled on using a library called pdfbox (it's on GitHub). I see they have a new version, 2.0, released last month. I'll have to try out the new one. I used the .NET version of 1.8.7. With pdfbox, I can get a list of all the characters on a page, with the coordinates of each, and font information.
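To make the "is a paragraph indented?" question concrete, here is a rough sketch of how a list of positioned characters could be turned into lines with an indent flag. It does not use the pdfbox API; the `Glyph` class is a hypothetical stand-in for whatever the library reports per character, and the rounding and the 10-point threshold are assumptions that would need tuning per book.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the per-character data a PDF library reports.
sealed class Glyph
{
    public char Char;
    public double X, Y;       // position on the page, in points
    public double FontSize;
    public bool Italic;
}

static class PageLayout
{
    // Group glyphs into text lines by their rounded Y coordinate, then flag a
    // line as indented when it starts noticeably to the right of the leftmost
    // line on the page. Assumes Y grows downward.
    public static List<(string Text, bool Indented)> Lines(
        IEnumerable<Glyph> glyphs, double indentThreshold = 10.0)
    {
        var lines = glyphs
            .GroupBy(g => Math.Round(g.Y))
            .OrderBy(grp => grp.Key)
            .Select(grp => grp.OrderBy(g => g.X).ToList())
            .ToList();
        if (lines.Count == 0) return new List<(string, bool)>();

        double leftMargin = lines.Min(l => l[0].X);
        return lines
            .Select(l => (new string(l.Select(g => g.Char).ToArray()),
                          l[0].X - leftMargin > indentThreshold))
            .ToList();
    }
}
```

Font size and italics could be carried through the same grouping to decide which HTML tags to emit for each run of characters.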
I tried several different PDF files of books, and they were all sufficiently different that I ended up writing different programs to extract pages from different books. The programs have a common pattern and share common functions, but a single program that could handle all the books would be too complex. I decided to concentrate on the program to generate HTML pages from a PDF file of Sein und Zeit, SundZSteller.EXE. By October I had it working, generating HTML pages for all 437 pages of text in the book.
To date, when I need to add new data to the GA App, I rebuild the database; throw away the old database and create a new one. To add all 437 book pages, I wanted to figure out how to add a book without rebuilding everything. So I wrote a program that (1) adds the pages to Azure Storage, (2) adds the pages to Azure Search, and (3) adds the pages to the GA database. I finished the program yesterday and ran it and added the 437 pages to the GA App.
When someone searches for a term in the GA App, and Azure Search finds the term in a SundZ page, it returns a link to that page (and links to anywhere else the term was found). The link fetches the page's data from the database and displays it in the book page details web page, then fetches the book page HTML from Azure Storage and displays the book page in the same web page.
Now I will go on to the next book, and write a program to extract all the pages from its PDF file. It should take less effort the second time.
Wednesday, September 9, 2015
ἀλήϑεια != ἀλήθεια
Sad day today, to discover there are two thetas, the Greek letter θ, and the mathematical symbol ϑ; Unicodes U+03B8 and U+03D1 respectively. Some clever chaps, like Wikipedia and Google, are aware that both forms refer to the same character when embedded in words, and fold the mathematical symbol into the Greek letter. Sadly Adobe PDFs and Microsoft are clueless. If you search Bing for ἀλήϑεια, you get a bunch of papers by Thomas Sheehan, who, thankfully, appears to be the only Heideggerian to use the mathematical symbol instead of the Greek letter. I discovered this looking up the definition of ἀλήθεια-1 in the PDF version of Sheehan's Making Sense of Heidegger.
I can fix this in the GA app, since all the texts I entered get their Greek normalized anyway -- all the ersatz Greek fonts from Win95 and Mac 9 days are converted to Unicode and non-standard diacritics fixed. The goal in the GA app is to make things easy to find, and not to resolutely follow the original text.
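The fold itself can be a one-line replacement during normalization. This is a sketch of the idea only; the class and method names are invented here and are not the GA App's actual code.

```csharp
static class GreekFold
{
    // Replace the theta symbol (U+03D1) with the Greek small letter theta (U+03B8).
    public static string FoldTheta(string text) =>
        text.Replace('\u03D1', '\u03B8');
}
```

Applied to both the indexed text and the search query, this makes the two spellings of ἀλήθεια compare equal.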
Friday, July 3, 2015
character expansion or handling Eszett
This week I discovered that Azure Search does not handle character expansion, meaning that it considers Strasse and Straße to be different words. They are alternative spellings of the same word - it depends on what keyboard is ready-to-hand, or which official spelling directive rules your world. If you search Strasse you will not find documents containing Straße, and vice-versa. That's disappointing. I entered a feature request at the Azure Search site. Vote for this feature if you care!
In addition to using Azure Search to search all the documents, searching for a term in the GA App also searches the glossaries I've added to the database. There I've been able to enable character expansion. With search term Schluß, the glossary search returns Abschluss,
and searching for Schluss returns Schluß.
The trick to getting character expansion to work, is to tell the search that we want the search to be done with Invariant Culture; the default is Ordinal.
dw = db.DeWords.Where(w => w.ISBN == isbn).SingleOrDefault(w => w.Word.Equals(q, StringComparison.InvariantCultureIgnoreCase));
In the database, in table DeWords, with the records from a book (ISBN), find a record where the Word field matches the search query.
If there aren't exact matches in a glossary, GA App then looks for partial matches. InvariantCulture is automatic when searching for substrings, but always case sensitive, so:
dw = db.DeWords.Where(w => w.ISBN == isbn).FirstOrDefault(w => w.Word.ToLower().Contains(q.ToLower()));
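Where a culture-aware comparison isn't available (for example, an index that only compares ordinally), the expansion can be done by hand before comparing. The following is a small sketch under that assumption; `GermanFold` and its methods are hypothetical names, and it handles only ß, not other expansions.

```csharp
static class GermanFold
{
    // Lower-case with the invariant culture, then expand ß to ss.
    // string.Replace(string, string) performs an ordinal replacement.
    public static string Fold(string s) =>
        s.ToLowerInvariant().Replace("ß", "ss");

    // Exact match after folding both sides.
    public static bool SameWord(string a, string b) => Fold(a) == Fold(b);

    // Substring match after folding both sides.
    public static bool ContainsWord(string haystack, string needle) =>
        Fold(haystack).Contains(Fold(needle));
}
```

With this, `SameWord("Schluss", "Schluß")` and `ContainsWord("Abschluss", "Schluß")` both hold, which mirrors the glossary behaviour described above.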
Friday, June 5, 2015
Azure problems, sites down
My Azure sites are down until 6/19.
I have a Microsoft Developers Network subscription that gives me $150 in credits on Azure per month. I use Azure to host various experiments I have in the cloud (including my GA app and Ereignis beta), and pay for resources (web traffic, storage, etc.) from my Azure credits. Typically, my resource use is less than $50 a month.
Earlier this year, I signed up for the Azure Search beta. Search resources were free. Azure Search went live (became a billable product) last month, and there are now two tiers, free and standard. When Azure Search went live, Microsoft switched the users of the free beta, to the standard plan ($8/day), instead of the free plan. So Azure Search used up $8 every day, until it stopped working this morning.
My Azure web sites won't work again until 6/18, when the monthly cycle rolls over and they get another $150. At that point I'll shut down the Search service, create a new free search service, and re-index all the content for the search service.
Tuesday, April 14, 2015
Added Google Scholar links
Inspired by a paper with all its citations linked to Google Scholar, I had another look at Google's service. It's useful if you can find the item you are looking for, and if the item is cross-referenced - Google figured out its citations, and other items that cited it.
I added Google Scholar links to the Gesamtausgabe App's (the GAApp?) Paper and Book details pages. In the case of papers, the URL linked to Google has the title (URL encoded; e.g., "Logic: The Question of Truth" => "Logic%3a+The+Question+of+Truth") and the Author's name in the form Google likes it (Initials and last name; e.g., "Andrew J. Mitchell" => "AJ+Mitchell"). I had to write a function to convert names to that format. With Books, if there's an author, the URL is the same as papers. If the book doesn't have an author, but has an editor, then the URL uses the editor as the author. If the book has a translator but no author, then the URL uses "M Heidegger" as the author.
The links appear to return pretty good results. Google Scholar either finds an exact match if it's there, or reports it can't find it, without returning a bunch of unrelated results. Google should support searches with books' ISBNs to improve results when looking up books.
I'm considering also linking People in the GAApp to Google Scholar, and also individual Texts - most of the Texts in Wegmarken appear in Google as individual citations.
I have not pushed this latest version of the GAApp to the cloud yet. It'll be in version 1.3.
[Update 4/16/15: I just discovered my simple author's name algorithm doesn't work with "Miguel de Beistegui" => "Md+Beistegui", no matches for his "The New Heidegger" on Google Scholar. The search has to be specifically for "M+de+Beistegui" to return the book. Other of his books require searching for specifically either "M+Beistegui" or "Md+Beistegui". Sheesh. Come on Google. Apply some intelligence, artificial or otherwise, and link all the variations to the same person, and have a query with any variation return all of a person's citations.]
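For illustration, the simple name conversion described in this post could look roughly like the sketch below. It is a reconstruction from the examples given, not the GAApp's actual function, and the class and method names are invented. It reproduces both the "AJ+Mitchell" case and the problematic "Md+Beistegui" case. `WebUtility.UrlEncode` writes the colon as "%3A" with upper-case hex digits, which is equivalent to "%3a".

```csharp
using System;
using System.Linq;
using System.Net;

static class ScholarQuery
{
    // "Andrew J. Mitchell" -> "AJ+Mitchell": the initials of every name part
    // except the last, then '+', then the last part.
    public static string AuthorParam(string fullName)
    {
        var parts = fullName.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0) return "";
        var initials = string.Concat(parts.Take(parts.Length - 1).Select(p => p[0]));
        var last = parts[parts.Length - 1];
        return initials.Length == 0 ? last : initials + "+" + last;
    }

    // WebUtility.UrlEncode turns spaces into '+' and ':' into "%3A".
    public static string TitleParam(string title) => WebUtility.UrlEncode(title);
}
```

Handling particles such as "de" would need a special case, for instance keeping lower-case name parts whole instead of reducing them to an initial, but as noted above Google Scholar is not consistent about which form it expects.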
Friday, March 27, 2015
Pernicious link rot
I started Ereignis, the web page in 1995, and one of the early gripes was that links stopped working, because the content was moved or deleted. When I was informed that a site had moved or I noticed a link was broken, I would update or remove it, but I did not regularly test the links and remove the dead links.
One of the goals with the Gesamtausgabe app is to only display valid links. Towards that goal, I've written a module that checks all the links in the database, and checks the links embedded in the paper and book page content. Once that was working, it was easy to re-purpose the code and point it at the Ereignis pages on beyng.com, and have it check the links on Ereignis.
This I did. I skipped the Ereignis pages that have links by subject, and only repeat links that are already on the general pages. I included the book pages in the bibliography, where the links are mainly to authors, publishers, and reviews.
Out of 2691 hyperlinks on Ereignis, 705 were broken, 26%. Testing the 2691 links took 80 minutes. The oldest page of links, from the 1990's, had over 80% rotten links. 10% of the links from the last year have rotted. The distribution appears linear. Link rot occurs consistently. Surprisingly, links with the most rot were those to people, rather than papers. Links to institutional web sites are likelier to rot than links to individual web sites. Universities and publishers are changing their web hosting software regularly and tossing their old content, while individuals are more likely to ensure that their URLs continue to work.
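A link check of this kind can be sketched in a few lines. This is not the module described above, just a minimal illustration with invented names. It assumes that a HEAD request is enough to tell whether a page exists, which is not true of every server (some reject HEAD and would need a GET fallback).

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class LinkChecker
{
    // Returns true when the URL answers with a 2xx status. Anything else
    // (malformed URL, DNS failure, timeout, 4xx/5xx) counts as a broken link.
    public static async Task<bool> IsAliveAsync(HttpClient client, string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            return false;
        try
        {
            using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
            using (var response = await client.SendAsync(request))
                return response.IsSuccessStatusCode;
        }
        catch (HttpRequestException) { return false; }    // connection or DNS failure
        catch (TaskCanceledException) { return false; }   // timeout
    }
}
```

Checking links one at a time like this is slow, roughly in line with the 80 minutes for 2691 links reported above; running several checks concurrently would shorten that.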
Thursday, March 19, 2015
A helpful suggestion from History Today on broken hyperlinks.
Digital library researchers at Los Alamos National Laboratory found in a survey of three and a half million scholarly articles from scientific journals between 1997 and 2012 that one in five links provided in the footnotes suffered from 'reference rot'. Another survey, this time of law and policy publications, revealed that after six years nearly half of URLs cited had become inaccessible. Historians (perhaps unsurprisingly, given their profession) have been slower to place this most modern of problems at the top of their agenda. They are, however, not immune from its effect. An American study of two leading history journals found that in articles published seven years earlier, 38 percent of web citations were dead. Missing web pages can sometimes be relocated by academics through digital archives, the biggest of them being the Wayback Machine in San Francisco. A good many web pages, however, have not been archived and are permanently irretrievable.
A tool called Perma.cc was launched in beta phase in 2014. Developed by the Harvard Law School Library, it ‘allows users to create citation links that will never break’. If you want to secure the future of an Internet link in your footnotes, you create an archived version of the page you are referring to and anyone later clicking on your link will be taken through to the archived version. This ‘permalink’ does not repair Internet citations that have already decayed, but it does effectively fix the problem going forward. It has already been taken up by law reviews in America.
It would be cool if philosophy papers had links, instead of just referencing paper editions.
Thursday, February 26, 2015
Azure Search and diacritics
I've been experimenting with Azure Search, to improve searching the Gesamtausgabe. I've got all the content indexed on Azure, and it's returning decent results; compensating for misspellings, and providing suggestions-as-you-type. I still need to figure out how to integrate with the Bootstrap typeahead control, before I update the website with the new search feature.
One of the features Azure Search doesn't have yet is asciifolding, so that a search for "αληθεια" will return documents containing "ἀλήθεια". Who can remember the polytonic Greek keyboard's diacritics' layout? And not every document uses diacritics consistently. If this feature is important to you, you can cast three votes for it here.
[Update March 9, 2015]
Asciifolding now works with Azure Search, with api-version=2015-02-28-Preview. The new release cadence from Microsoft is much better than the old days; "we'll fix that in the next Windows release". I've rebuilt the indexes and asciifolding is working in the app version that I'm currently working on. I hope to release it soon, within several weeks.
In the fields you want to be searchable with asciifolding you set:
Analyzer = "standardasciifolding.lucene"
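To show what folding diacritics means in practice, here is a small sketch of the general idea. It says nothing about how Azure Search or Lucene implement their analyzer; the class and method names are invented. The approach is to decompose the text, drop the combining marks, and recompose.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

static class DiacriticFold
{
    // Decompose to NFD, drop the non-spacing (combining) marks, recompose to NFC.
    public static string Strip(string text)
    {
        var kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
```

With this, `DiacriticFold.Strip("ἀλήθεια")` yields the bare letters "αληθεια", so a query typed without breathings and accents can match.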
Wednesday, February 25, 2015
Drawing diagrams with SVG
I've figured out how to do large curly brackets, like these in Thomas Sheehan's "Astonishing! Things Make Sense!" (P. 14):
I don't want to use bitmaps, like the picture above, because they don't scale, they're relatively large downloads, and their text isn't searchable. Today I figured out how to draw the curly brackets using SVG (Scalable Vector Graphics) to draw lines. A curly-bracket is two lines and four quarter circles.
Here's what it looks like in an HTML document. It won't render inside a blogspot blog, so you'll have to click there to see.
This is the code that renders the diagram.
<svg height="700" width="800">
  <path d="M 200 260 a 7 7 0 0 1 7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 200 260 l 0 143" stroke-width="2" stroke="black" />
  <path d="M 200 403 a 7 7 0 0 1 -7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 200 417 a 7 7 0 0 0 -7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 200 417 l 0 143" stroke-width="2" stroke="black" />
  <path d="M 200 560 a 7 7 0 0 0 7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 420 160 a 7 7 0 0 1 7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 420 160 l 0 93" stroke-width="2" stroke="black" />
  <path d="M 420 253 a 7 7 0 0 1 -7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 420 267 a 7 7 0 0 0 -7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 420 267 l 0 93" stroke-width="2" stroke="black" />
  <path d="M 420 360 a 7 7 0 0 0 7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 95 a 7 7 0 0 1 7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 95 l 0 43" stroke-width="2" stroke="black" />
  <path d="M 560 138 a 7 7 0 0 1 -7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 152 a 7 7 0 0 0 -7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 152 l 0 43" stroke-width="2" stroke="black" />
  <path d="M 560 195 a 7 7 0 0 0 7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 315 a 7 7 0 0 1 7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 315 l 0 43" stroke-width="2" stroke="black" />
  <path d="M 560 358 a 7 7 0 0 1 -7 7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 372 a 7 7 0 0 0 -7 -7" fill="none" stroke-width="2" stroke="black" />
  <path d="M 560 372 l 0 43" stroke-width="2" stroke="black" />
  <path d="M 560 415 a 7 7 0 0 0 7 7" fill="none" stroke-width="2" stroke="black" />
  <g fill="black" font-size="14" font-family="Helvetica, sans-serif" stroke="none">
    <text x="70" y="390">EVERY λόγος =</text>
    <text x="70" y="410">σημαντιχός</text>
    <text x="70" y="430">A MEANINGFUL</text>
    <text x="70" y="450">UTTERANCE</text>
    <text x="220" y="240">λόγος ἀποφαντιχός</text>
    <text x="220" y="260">DECLARATIVE SENTENCE</text>
    <text x="220" y="280">I DECLARE P OF S</text>
    <text x="220" y="560">λόγος ἀναποφαντιχός</text>
    <text x="220" y="580">NON-DECLARATIVE SENTENCE</text>
    <text x="220" y="600">I WISH, HOPE, ASK, OR</text>
    <text x="220" y="620">COMMAND SOMETHING</text>
    <text x="450" y="140">AFFIRMATIVE</text>
    <text x="450" y="160">χατάφασις</text>
    <text x="450" y="360">NEGATIVE</text>
    <text x="450" y="380">ἀπόφασις</text>
    <text x="580" y="85">TRUE</text>
    <text x="580" y="105">ἀληθής</text>
    <text x="580" y="205">FALSE</text>
    <text x="580" y="225">ψευδής</text>
    <text x="580" y="305">TRUE</text>
    <text x="580" y="325">ἀληθής</text>
    <text x="580" y="425">FALSE</text>
    <text x="580" y="445">ψευδής</text>
  </g>
</svg>
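Since every bracket in the diagram follows the same pattern (hook, line, two arcs meeting at the point, line, hook), the six paths can be generated from four numbers. The sketch below is an illustration derived from the coordinates in the SVG above; the class, method, and parameter names are invented. It assumes a bracket whose point faces left, with `len` the length of each straight segment and `r` the arc radius.

```csharp
using System.Collections.Generic;

static class SvgBracket
{
    // Six path elements for one curly bracket:
    // top hook, upper line, two arcs meeting at the point, lower line, bottom hook.
    public static string Paths(int x, int yTop, int len, int r = 7)
    {
        int yMid1 = yTop + len;      // end of the upper line
        int yMid2 = yMid1 + 2 * r;   // start of the lower line
        int yBottom = yMid2 + len;   // end of the lower line
        var d = new List<string>
        {
            $"M {x} {yTop} a {r} {r} 0 0 1 {r} -{r}",
            $"M {x} {yTop} l 0 {len}",
            $"M {x} {yMid1} a {r} {r} 0 0 1 -{r} {r}",
            $"M {x} {yMid2} a {r} {r} 0 0 0 -{r} -{r}",
            $"M {x} {yMid2} l 0 {len}",
            $"M {x} {yBottom} a {r} {r} 0 0 0 {r} {r}",
        };
        return string.Join("\n", d.ConvertAll(p =>
            $"<path d=\"{p}\" fill=\"none\" stroke-width=\"2\" stroke=\"black\" />"));
    }
}
```

For example, `SvgBracket.Paths(200, 260, 143)` reproduces the path data of the first bracket in the diagram, and `SvgBracket.Paths(420, 160, 93)` the second.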