Some Gridworks tips

Much of the data that’s lying around is a mess. So we need badly tools that help us to clean up this mess.
Freebase Gridworks of Metaweb, recently acquired by Google is such a tool that allows us to:
  • Merge similar names using multiple methods:
    • Automatic title-casing
    • using an expression language (GEL)
    • using several clustering algorithms to detect similarities
  • Split multi-valued cells over columns and rows
  • Create new colums based on content of other columns
  • ...

Make sure you see the videos at http://code.google.com/p/freebase-gridworks/

Some tricks I want to remember for myself:

  • how to use regular expressions to correct cell values
  • how to fill in a 'null' column with a value.

Correcting cell values using regular expressions

I have some phone numbers in an existing dataset which should be formatted according to following structure:
+ followed by the country code: e.g. +32
followed by the area code with the 0 between parentheses: e.g. (0)15
followed by the local code following this pattern X?XX XX XX: e.g. 23 45 67
Full example: +32 (0)15 23 45 67

The existing dataset contains slightly different phone numbers; shown as loaded into Gridworks.

phone numbers
Using regular expressions we can split the existing numbers in 2 groups using parentheses to indicate the groups:

  • the characters before the area code
    start-of-line, followed by the '+' character, followed by 2 digits, followed by a space
    (^\+\d{2}\s)
  • everything starting from the area code
    1 or 2 digits followed by a space, followed by multiple digits, followed by a space, followed by 2 digits followed by a space (2 times), followed by end-of-line
    (\d{1,2}\s\d+\s\d{2}\s\d{2}$)

The regular expression as shown in the RX Toolkit of Komodo IDE.

regex in Komodo IDE

Now using these groups to replace the existing values with a value conforming the wished structure using the replacement expression
group 1 followed by '(0) followed by group 2.

\1(0)\2


Replacement using groups

Now that we have our regex working, let's move on to Gridworks now.

On the column containing the telephone numbers, choose Edit cells, Transform ...

Gridworks Transform cell
Now we can use the Gridworks expression language (GEL) to do our transform.

GEL offers a whole list of functions; we will be using 'replace'. 'Replace' takes 3 arguments:

  • the input string
  • the replacement string which can be a regex
  • the returned string after replacement which can contain the captured groups defined in the regex.
It took us a few minutes to discover what the precise syntax was when using regex's. This is the template to be used:

replace(value,//,'')

where value refers to the value in the cell
where // delimits the regex
and '' contains the replacement string using the captured groups being indicated with '$', e.g. $1, $2.

In our case the expression became:

replace(value,/(^\+\d{2}\s)(\d{1,2}\s\d+\s\d{2}\s\d{2}$)/,'$1(0)$2')


Result of regex replacement

Filling in a column with a fixed value

From a spreadsheet containing addresses of musea I have a column with 'null' values which I want to use to indicate the type of the entity e.g. 'museum'.

Empty column
I've done this using the GEL 'forNonBlank' function. 'forNonBlank' takes 4 parameters:
forNonBlank(e, v, eNonBlank, eBlank)
  • an expression to be evaluated
  • captured in a variable
  • when not null or empty string, evaluate eNonBlank
  • when null or empty string, evaluate eBlank

In our case

forNonBlank(value, v, 'not relevant', 'museum')

setting column

Conclusion

For everyone pursuing data quality Gridworks should become a central component of his/her toolset.
I'll try to investigate how NeedleBase compares.



Comments