
Google Refine: Metadata Power Tool

I’ve long said that as librarians we should be able to say “I’ve never metadata I didn’t like.” (ba-dump). In other words, we should be excellent at acquiring, massaging, translating, merging, deduping, and all the other types of activities that might be required to make metadata really useful. But doing those things well requires new kinds of tools: metadata crosswalks, metadata visualization and summarization tools, and so on. That’s why I was so excited this week to discover that Google has released Google Refine 2.0.

Google Refine is based on an application developed by Freebase, which Google recently bought. It was called Gridworks under Freebase; Google changed the name to Refine and put some more work into it. A specific design decision was to have it run as a desktop application instead of on the web. This has a couple of key ramifications. The upside is that you don’t have to lose control of your data to refine it. It can stay completely in your hands on your desktop. The downside is that it needs to load all of your data into RAM to process it. This means that if you have a large data set (a million or more rows, say) you will need serious computing power to process it.

Before I discovered this, I spent hours waiting for a data set of 7 million rows to load, and it eventually threw a Java error. The developers are considering various ways to enable it to handle very large datasets, but for now stick with processing much smaller sets of data.
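If your file is too big to load, one workaround is to carve out a smaller random sample first and refine that. Here is a minimal Python sketch of the idea, not anything that ships with Refine. The function name `sample_csv` and the file names in the usage note are made up for illustration, and it assumes a UTF-8 CSV file with a header row. It reads one row at a time (reservoir sampling), so the full file never has to fit in memory.

```python
import csv
import random


def sample_csv(src_path, dst_path, sample_size, seed=0):
    """Copy the header plus a uniform random sample of data rows.

    Rows are streamed one at a time, so memory use is bounded by
    sample_size rather than by the size of the input file.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)  # assumes the first row is a header
        for index, row in enumerate(reader):
            if len(reservoir) < sample_size:
                # Fill the reservoir with the first sample_size rows.
                reservoir.append(row)
            else:
                # Replace an existing entry with decreasing probability.
                slot = rng.randint(0, index)
                if slot < sample_size:
                    reservoir[slot] = row
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

Calling something like `sample_csv("records.csv", "records_sample.csv", 50000)` (hypothetical file names) would give you a file small enough to experiment with. Note that the sampled rows do not come out in their original order.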

Also, although Refine is a stand-alone application (and there are flavors for Windows, Mac, and Unix), you use it via your web browser. When you start it up it opens a new tab in your web browser, which then allows you to create a project and locate the data on your disk or via a URL.

Probably the best way to find out if this application has something for you is to watch the first of the introductory videos on the Google Refine site. Watch it through to the end to get a good sense of some of its capabilities. In a nutshell, it allows you to see common values within a cell and detect variances of those values and quickly correct them, find data that is similar and merge it, visually depict the range of numeric values in a column, locate numeric errors, select rows that meet certain criteria, and other similar tasks. Plus, it keeps track of all the operations you perform so you can roll back any of your changes all the way to opening the file.
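To give a feel for what “find data that is similar and merge it” can mean, here is a rough Python sketch of one common approach: reduce each value to a normalized “fingerprint” and group the values whose fingerprints collide. This is an illustration of the general technique only. Whether it matches what Refine does internally is an assumption on my part, and the function names `fingerprint` and `cluster_values` are invented for the example.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value):
    """Reduce a value to a normalized key so near-duplicates collide."""
    # Trim, lowercase, and split accented letters into base + accent.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the accent marks, then drop punctuation.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and dedupe the words so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster_values(values):
    """Group distinct values that share a fingerprint.

    Returns a list of clusters; each cluster lists its variants with
    the most frequent one first, as a candidate for the merged value.
    """
    counts = Counter(values)
    groups = defaultdict(list)
    for value in counts:
        groups[fingerprint(value)].append(value)
    clusters = []
    for variants in groups.values():
        if len(variants) > 1:
            variants.sort(key=lambda v: (-counts[v], v))
            clusters.append(variants)
    return clusters
```

Given a column containing "Smith, John", "John Smith", and "john smith ", all three reduce to the same fingerprint and land in one cluster, while a value like "Doe, Jane" with no variants is left alone. The same `Counter` also gives you the per-value tallies that make odd spellings stand out in the first place.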

Overall, this looks like a very useful tool for anyone working with metadata — particularly metadata that may be messy and inconsistent. Of course, our data is perfect, right?

