Friday, November 21, 2014

The problem with Google Scholar

Suppose you want to find the most cited papers on a subject. There are professionally maintained library databases, but they have a limited scope (for example, they cover only a subset of all journals), and it is difficult to get casual access to them. Then there is Google Scholar, which scans any paper Google can find, plus some databases, and is free to search. So Google Scholar should be a good source for citation counts.

But Google Scholar doesn't quite work.

For example, if you want to know how many times "Sein und Zeit" has been cited, you find:

"Sein und Zeit" is treated as a different work from "Sein und Zeit (1927)", "Sein und Zeit [Being and Time]", "sein und Zeit, tübingen", and "Martin Heidegger: Sein und Zeit". Also included are any papers or books that have "Sein und Zeit" in their titles or abstracts, and the list goes on for dozens more pages of results.

The problem is that Google Scholar runs automatically, trying to extract citations from texts that are formatted in many different, inconsistent ways. It has no editors who would recognize that a set of different entries all refer to a single work and link them. Algorithms don't understand the meaning of text. But they can get better.
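
To make the idea of "connecting" entries concrete, here is a rough sketch in Python of how the variants listed above could be folded into one key. It is not how Google Scholar works. The function names, the normalization rules, and the citation counts are all made up for illustration, and the rules are crude: a real title that contains a colon or a comma would be mangled by them.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(raw: str) -> str:
    """Reduce a cited title to a rough comparison key (heuristic, not robust)."""
    # Split accented letters into base letter + accent, then drop the accents.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop anything in brackets or parentheses, e.g. "(1927)" or "[Being and Time]".
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)
    # Assumption: text before a colon is an author prefix ("Martin Heidegger: ...").
    text = text.split(":")[-1]
    # Assumption: text after a comma is a place or publisher (", tübingen").
    text = text.split(",")[0]
    # Keep only letters, digits and single spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def merge_citation_counts(entries):
    """Sum citation counts over entries whose normalized titles match."""
    totals = defaultdict(int)
    for title, count in entries:
        totals[normalize_title(title)] += count
    return dict(totals)


# Hypothetical counts, invented for this example only.
HYPOTHETICAL_ENTRIES = [
    ("Sein und Zeit", 10),
    ("Sein und Zeit (1927)", 20),
    ("Sein und Zeit [Being and Time]", 5),
    ("sein und Zeit, tübingen", 3),
    ("Martin Heidegger: Sein und Zeit", 2),
]

if __name__ == "__main__":
    print(merge_citation_counts(HYPOTHETICAL_ENTRIES))
```

With these rules all five variants collapse to the same key, so their counts are added together. The hard part, and the reason an editor would still do better, is telling a sloppy citation of the book apart from a different paper that merely has "Sein und Zeit" in its title.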
