First Things First
The second step in establishing your SaffronMemoryBase is mapping your input data (documents, databases, events, ...) to a representation in your Space. Often, this is mistaken for the first step: "Hey, I've got some data, let's put it into SMB and see what we get!" The problem with this approach is that before you can do a good job of mapping, you need to know how the associations will be queried. So, given your input data, first take the time to describe the query use case(s). Don't agonize too much over it, since you frequently gain insight into your data from iterating a few times on the ingestion. Iterating and taking baby steps is a key concept here. Fortunately, the REST API makes it pretty easy to iterate on your memory design.
Step 1 - Create your initial data set
Create a small set of XML files representing your data. Don't try to ingest all of it the first time. Depending on the size of your resources, you probably want to keep the initial number of files under 5K, or to something that can be ingested in under 15 minutes. Try to make the set of files as diverse as possible so that it is representative of the larger data set.
Step 2 - Create your space
Use the create space API to describe your attribute mapping to the system. More on this later.
Step 3 - Ingest your data set (we call these "resources")
If you have chosen to generate XML files, you can download and run Ingest.java from the example code repository to ingest the data into Sierra.
Step 4 - Run some queries
After your data has been ingested (use the /resources/status REST resource to check the status), run queries on your space to see the results. You can use the REST test harness, which doesn't require writing any code, or write a simple program to execute the queries you are interested in.
Step 5 - Make tweaks and Iterate!
Go back to step 1 if you want to change the specific attributes/values and/or the structure of your resources. If the attributes and the structure of the resources look good, you may just want to change how the attributes are mapped into the space (memories, matrices, rows, columns). In this case, go back to step 2 and re-create your space with a new definition.
Getting Down To It
Now it's time to roll up our sleeves and walk through an example.
Creating resources
Let's say you are interested in ingesting tweets from a Twitter feed. You want a better view of what the people you follow are talking about. Specifically, you want to know the tags they use, the people they are talking to, and maybe where these tweets are coming from. You may also want to discover similar tweets and how these associations trend over time. To answer these questions, you need to understand all the "raw" elements that go into answering them. A healthy perspective is to ask how you, as a subject matter expert on your data, would think about and answer those questions. Whatever that process is, we will try to emulate it using SMB. In our example, a tweet contains a number of attributes, and you may decide that for your first pass you will consider the following elements:
- author's screen name
- reply to screen name (if it exists)
- other screen names mentioned in the tweet
- creation date
- geo location of the tweet
- hashtags
- any mentioned urls
Now you need to create a Saffron Resource to submit to Sierra containing the above data from a tweet. The first decision is what to call the 7 elements of data, and whether the data values need any pre-parsing before they are stored in SMB. For the first iteration, you'll keep it as simple as possible.
- author's screen name - call this attribute "person" and store it as is.
- reply to screen name (if it exists) - also call this attribute "person" and store it as is.
- other screen names mentioned in the tweet - call these attributes "person" and remove the leading '@' before storing.
- creation date - call this attribute "date" and normalize to mm-dd-yy.
- geo location of the tweet - call this attribute "geocoordinate" and store it as is.
- hashtags - call these attributes "hashtag" and remove the leading '#' before storing.
- any mentioned urls - call these attributes "url" and store them as is.
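The pre-parsing rules in this list can be sketched in a few lines of Python. This is only an illustration, not part of SMB or Sierra: the input field names (`author`, `reply_to`, `mentions`, `created_at`, `geo`, `hashtags`, `urls`) are hypothetical, the timestamp format is assumed to be the one Twitter used at the time (for example "Wed Jan 27 18:09:03 +0000 2010"), and URLs are kept unchanged as an assumption.

```python
from datetime import datetime


def strip_prefix(value: str, prefix: str) -> str:
    """Remove a single leading marker such as '@' or '#', if present."""
    return value[len(prefix):] if value.startswith(prefix) else value


def normalize_date(created_at: str) -> str:
    """Convert a Twitter-style timestamp to mm-dd-yy."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.strftime("%m-%d-%y")


def tweet_to_attributes(tweet: dict) -> list:
    """Map a tweet (hypothetical dict layout) to (category, value) pairs."""
    attrs = [("person", tweet["author"])]
    if tweet.get("reply_to"):
        attrs.append(("person", tweet["reply_to"]))
    attrs += [("person", strip_prefix(m, "@")) for m in tweet.get("mentions", [])]
    attrs.append(("date", normalize_date(tweet["created_at"])))
    if tweet.get("geo"):
        attrs.append(("geocoordinate", tweet["geo"]))
    attrs += [("hashtag", strip_prefix(h, "#")) for h in tweet.get("hashtags", [])]
    attrs += [("url", u) for u in tweet.get("urls", [])]  # assumption: stored unchanged
    return attrs


example = {
    "author": "pentaho",
    "created_at": "Wed Jan 27 18:09:03 +0000 2010",
    "hashtags": ["#datamining"],
    "urls": ["http://bit.ly/dAxgFJ"],
}
for category, value in tweet_to_attributes(example):
    print(category, value)
```

Because every screen name ends up in the same "person" category regardless of its role, the output is a flat list of category/value pairs that can be written straight into a resource.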
You could have distinguished the author from the reply-to from the other mentioned names, but you're taking a simple approach and not changing this unless you really need to. One advantage of giving screen names the same attribute name (whether they show up as author, reply-to, or mentioned) is that each really is the same entity, just in different roles. As a result, the storage of that attribute in SMB better reflects what it truly represents. The topic of parsing attributes is far too large to cover in this document, but in general, there are 3 strategies you can take when it comes to attribute parsing:
- Parse it yourself, as in this example, before putting into a Saffron Resource.
- Let Saffron parse the attribute. If you have the SMB Enterprise product, you will see this as one of the more significant functional differences between Sierra and SMB Enterprise. SMB Enterprise ships a large number of pre-packaged parsers in an ETL framework, allowing you to customize existing beans to fit your needs or extend the capability by writing your own. Sierra exposes "templates" as a way to ingest and process certain data formats. For example, Sierra has a Twitter template (discussed below) that knows how to properly parse a tweet. There is also a growing number of other templates that you will become familiar with.
- A combination of the two strategies above.
Now that you have your attributes, the second decision you need to make is whether you want all the data from the tweet to be associated together or whether you want to partition the associations in some way. Partitioning prevents attributes in one segment (partition) from being associated with attributes in another. The reason you might want to do this is so that you don't "over associate" your data. For example, not everything in a document is necessarily associated with everything else. There are sentences and paragraphs that contain distinct thoughts and concepts. A Saffron Resource can reflect this by putting attributes in separate "segments" (like buckets). Attributes from one segment are not associated with attributes from another segment; however, they are all associated with the unique resource id and any other global resource attributes. In the Twitter case, however, tweets are so small that everything is in fact related to everything else, so we don't need to segment a tweet. If you were ingesting a news article, for example, you probably would want to segment the data along paragraph or sentence boundaries.
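To see why segmenting limits "over association", here is a small conceptual sketch in Python. It is not how SMB stores data; it simply counts the attribute pairs that would co-occur when the same four (made-up) values are placed in one bucket versus two segments, and it ignores the resource id and global attributes that every segment is still associated with.

```python
from itertools import combinations


def associations(segments):
    """Return the unordered attribute pairs that share a segment."""
    pairs = set()
    for segment in segments:
        # Only attributes inside the same segment are paired with each other.
        pairs.update(combinations(sorted(set(segment)), 2))
    return pairs


# The same four attribute values, first as one bucket, then as two segments.
unsegmented = [["apple", "iphone", "rain", "weather"]]
segmented = [["apple", "iphone"], ["rain", "weather"]]

print(len(associations(unsegmented)))  # 6 pairs, including apple-rain
print(len(associations(segmented)))    # 2 pairs: apple-iphone and rain-weather
```

With a single bucket, unrelated terms such as "apple" and "rain" become associated; splitting along paragraph-like boundaries keeps only the pairs that genuinely belong together.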
So what would a Twitter Saffron Resource look like? Given our XML structure, here would be one example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n='1'>
<as>
<a c='date'>
<v>01-27-10</v>
</a>
<a c='person'>
<v>pentaho</v>
</a>
<a c='hashtag'>
<v>datamining</v>
</a>
<a c='url'>
<v>http://bit.ly/dAxgFJ</v>
</a>
</as>
</r>
Notice that all attributes ("<a>" elements) are top-level attributes and there are no segments. For an example of how to do this programmatically, see Observing & Querying Tweets (Part 1). For the complete resource specification, see the Saffron Resource DTD.
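As a rough illustration of building such a resource in code, the following Python sketch produces the same element structure as the sample above using only the standard library. The function name `build_resource` is hypothetical, and emitting one "<a>" element per value is an assumption based on the sample; consult the Saffron Resource DTD for the authoritative structure.

```python
import xml.etree.ElementTree as ET

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">\n'
)


def build_resource(resource_id, attributes):
    """Serialize (category, value) pairs as an unsegmented resource."""
    r = ET.Element("r", {"n": str(resource_id)})
    container = ET.SubElement(r, "as")
    for category, value in attributes:
        a = ET.SubElement(container, "a", {"c": category})
        ET.SubElement(a, "v").text = value  # special characters are escaped on output
    return HEADER + ET.tostring(r, encoding="unicode")


print(build_resource(1, [
    ("date", "01-27-10"),
    ("person", "pentaho"),
    ("hashtag", "datamining"),
    ("url", "http://bit.ly/dAxgFJ"),
]))
```

Letting the XML library do the serialization means values containing characters such as "&" or "<" are escaped correctly, which is easy to get wrong with string concatenation.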
Resource Templates
Some of the Sierra Resource Templates have built-in parsers. For example, TweetDive is an application that uses Sierra's Twitter Template. Rather than taking on all the data pre-processing in the application, it forms a simple resource of 5 attributes: msg, author, id, date, and location. The msg attribute simply contains the text of the tweet. Sierra will then do the appropriate things to the individual attributes. Specifically, it formats the date to a short form (mm-dd-yyyy, SMB's internal format) and extracts all the relevant terms in the tweet. Terms such as screen names, urls, nouns, verbs, adjectives, adverbs, phone numbers, hash tags, stocks, etc. are all extracted and appropriately tagged. Below is an example of a tweet that might be submitted using the Twitter Template.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n='1'>
<as>
<a c='msg'>
<v>Weka Developer Support now available! New support options for Pentaho Data Mining Community Edition http://bit.ly/dAxgFJ #datamining</v>
</a>
<a c='date'>
<v>Wed Jan 27 18:09:03 +0000 2010</v>
</a>
<a c='id'>
<v>8288064139</v>
</a>
<a c='author'>
<v>pentaho</v>
</a>
</as>
</r>
To gain a more complete understanding of the Saffron Resource and the various elements, see Anatomy of a Saffron Resource.
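To make the kind of processing the Twitter Template performs a little more concrete, here is a rough Python approximation of two of the steps described above: shortening the date to mm-dd-yyyy and pulling screen names, urls, and hashtags out of the message text. This is not Sierra's parser. The regular expressions and the output category names are assumptions, and part-of-speech terms, phone numbers, and stocks are not handled.

```python
import re
from datetime import datetime

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def short_date(created_at):
    """Reformat a Twitter-style timestamp as mm-dd-yyyy."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.strftime("%m-%d-%Y")


def extract_terms(msg):
    """Pull urls, hashtags, and screen names out of raw tweet text."""
    urls = URL_RE.findall(msg)
    # Blank out urls first so fragments such as '#top' are not read as hashtags.
    text = URL_RE.sub(" ", msg)
    return {
        "url": urls,
        "hashtag": HASHTAG_RE.findall(text),
        "person": MENTION_RE.findall(text),
    }


msg = ("Weka Developer Support now available! New support options for "
       "Pentaho Data Mining Community Edition http://bit.ly/dAxgFJ #datamining")
print(short_date("Wed Jan 27 18:09:03 +0000 2010"))  # 01-27-2010
print(extract_terms(msg))
```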
Creating a space
So now that you have your resource, and have defined your use-cases and attribute schema, you can create your space. The first thing to do is to come up with a name. The space name shouldn't be too long, since you use it in practically every REST call. Also, although whitespace in your name is allowed, it isn't recommended. Pick a name that reflects the ingested data set(s) or application name. Once you have chosen your space name it is time to define your attributes. The primary job here is to declare to the system how you want to map your attributes. Do you want the "person" attribute to be a memory? Do you want the "hashtag" attribute to be a row (input) or a column (output)? So first go through your list of attributes and select how you would like them to be mapped into the SMB object hierarchy.
Using the Twitter example, you can build a "capabilities matrix" for each attribute:
| attribute category | memory (input) | matrix (input) | row (input) | column (output) |
|---|---|---|---|---|
| person | x | | x | x |
| date | | x | | |
| hashtag | | | x | x |
| url | | | | x |
| geocoordinate | x | | x | x |
Given the above matrix, how will your data be organized in SMB, and what will that allow you to do? Let's assume all references to a Twitter screen name are labeled as "person", which isn't absolutely correct, but okay for this illustration. Declaring that "person" maps to memories, rows, and columns tells the system that you want to use "person" as both an input and an output. This covers the row and column settings, respectively. However, we also indicated that we want "person" to be a memory. Configuring "person" as a memory allows us to see contextual links around specific people: not just that joe is connected to bob, but that joe is connected to bob in the context of a third attribute like "junk bonds" (assuming we had noun phrase extraction in our tweet).
What does mapping date into matrix mean? It means that when SMB encounters a "date" attribute during the ingestion process, it takes the value of that date and stores all the attributes in the segment in the matrix named by that value. So if there were 2 people, a date, and several other keywords in a tweet, we might end up with several triples stored in each of the 2 people's "date value matrix", containing associations of all the keywords to each other (assuming they are all configured as rows and columns). What happens when there are no dates or, for that matter, people in the tweet? If there are no dates and the space has been configured to contain temporal memories, the ingestion date/time is used. If there are also no people attributes (or other memory attributes), row-column associations are not lost. The row-column associations are observed in the directory memory, in the ingestion date/time matrix.
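The following Python sketch is a deliberately simplified mental model of the behavior just described, not SMB's actual storage engine. It assumes the role mapping from the capabilities matrix, models each memory as a dictionary of named matrices, and counts row/column co-occurrences. The `DIRECTORY` key and the choice to fall back independently for missing memories and missing dates are assumptions made for illustration.

```python
from collections import defaultdict

# Role mapping taken from the capabilities matrix in this section.
ROLES = {
    "person": {"memory", "row", "column"},
    "date": {"matrix"},
    "hashtag": {"row", "column"},
    "url": {"column"},
    "geocoordinate": {"memory", "row", "column"},
}

# Placeholder key for the fallback "directory memory" (name is an assumption).
DIRECTORY = ("directory", None)


def new_store():
    """memory -> matrix name -> (row attribute, column attribute) -> count"""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def observe(store, attributes, ingest_date):
    """Record one unsegmented resource given as (category, value) pairs."""
    def with_role(role):
        return [(c, v) for c, v in attributes if role in ROLES.get(c, ())]

    memories = with_role("memory") or [DIRECTORY]
    matrices = [v for _, v in with_role("matrix")] or [ingest_date]
    rows, columns = with_role("row"), with_role("column")
    for memory in memories:
        for matrix in matrices:
            for row in rows:
                for column in columns:
                    if row != column:
                        store[memory][matrix][(row, column)] += 1


store = new_store()
observe(store, [("person", "joe"), ("person", "bob"), ("date", "01-27-10"),
                ("hashtag", "datamining"), ("url", "http://bit.ly/dAxgFJ")],
        ingest_date="02-01-10")
print(sorted(store))  # one memory per person: bob and joe
```

In this model, a tweet with two people and one date lands in two memories, each holding a single matrix named "01-27-10"; a tweet with neither people nor a date lands in the directory memory under the ingestion date.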
Once the roles are figured out, the other somewhat common attribute setting is datatype. By default, values are treated as case-insensitive strings and are converted to lowercase. You can mark an attribute's value as a case-sensitive string, point, date, or scalar type (covered later).
There are several advanced schema options and several space runtime parameters, but these are infrequently used. Let's first look at an example of the XML payload that would be delivered with the Create Space POST to create the space given the above definition.
<space name="twitter">
<attributes>
<attribute category="person" role="memory,row,column"/>
<attribute category="date" role="matrix"/>
<attribute category="hashtag" role="row,column"/>
<attribute category="url" role="column"/>
<attribute category="geocoordinate" role="memory,row,column"/>
</attributes>
</space>
That's it, your space is created. If you want to delete your space and redefine it, send a DELETE to /spaces/twitter and re-POST your new definition. If you just want to clear all the data from your space, send a PUT to /spaces/twitter.
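Since you will likely re-create the space several times while iterating, it can help to generate the definition from the capabilities matrix rather than edit XML by hand. The Python sketch below is one way to do that, under the assumption that the payload needs only the elements shown in the sample above; the function name `space_definition` is hypothetical, and sending the payload is left out because the request details belong to the REST API reference.

```python
import xml.etree.ElementTree as ET

# Emit roles in a fixed order so regenerated definitions are easy to compare.
ROLE_ORDER = ("memory", "matrix", "row", "column")


def space_definition(name, roles):
    """Build the space XML from a {category: set of roles} mapping."""
    space = ET.Element("space", {"name": name})
    attributes = ET.SubElement(space, "attributes")
    for category, assigned in roles.items():
        unknown = set(assigned) - set(ROLE_ORDER)
        if unknown:
            raise ValueError(f"unknown roles for {category}: {sorted(unknown)}")
        role = ",".join(r for r in ROLE_ORDER if r in assigned)
        ET.SubElement(attributes, "attribute", {"category": category, "role": role})
    return ET.tostring(space, encoding="unicode")


twitter_roles = {
    "person": {"memory", "row", "column"},
    "date": {"matrix"},
    "hashtag": {"row", "column"},
    "url": {"column"},
    "geocoordinate": {"memory", "row", "column"},
}
print(space_definition("twitter", twitter_roles))
```

Changing a mapping for the next iteration then means editing one dictionary entry and regenerating the payload.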
Ingesting data
Once your space has been created, you can begin ingesting your data. Assuming you have XML documents, the easiest way to put your data into SMB is to use the Ingest program from the examples SVN repository. It is important to understand that POSTing to the /resources resource is an asynchronous call that queues your data for ingest. This means that the data may not be immediately available to query after your call returns. You can get the status of the ingestion process by GETting the /resources/status resource. It will tell you how many Saffron resources are still pending ingestion.
You can also use PUT on the /resources URL to perform job control. Your XML payload tells the system that you want to 'pause', 'resume' or 'clear' the current job queue.
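Because ingestion is asynchronous, a client that needs complete results typically polls until nothing is pending. The Python sketch below shows such a loop in a transport-agnostic way. The `get_pending` callable is hypothetical: it stands in for whatever code GETs /resources/status and extracts the pending count, since the response format is not described here. The default interval and timeout are arbitrary choices.

```python
import time


def wait_for_ingest(get_pending, poll_seconds=5.0, timeout_seconds=900.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until no resources are pending; return False on timeout.

    get_pending: callable returning the number of resources still queued.
    sleep/clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_pending() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Simulated status readings standing in for real calls to /resources/status.
readings = iter([3, 1, 0])
print(wait_for_ingest(lambda: next(readings), sleep=lambda seconds: None))  # True
```

Returning a boolean instead of raising on timeout lets the caller decide whether to keep waiting, query partial results, or inspect the job queue.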
Running queries
After you have ingested your data, you can begin to look at results using the SMB queries. The data does not need to be completely ingested first; you can query the system while it is ingesting to see some results. The easiest way to start looking at query results is by using the Test Harness. The Test Harness is simply a web page installed on the SMB server that allows you to interactively issue queries via REST. Results are displayed in the native JSON format. Alternatively, you can write a custom program to issue your queries. If you are an enterprise customer who uses Firefox and has disabled API keys, you can also use the REST Client Plugin for Firefox.
For more information on query APIs see the REST API Reference Guide.