 

Quick Start

Page history last edited by Jim Fleming 7 mos ago

A Quick Start to the SaffronMemoryBase service

 

Welcome to Saffron Sierra, a developer site for using SaffronMemoryBase hosted on the web. There are a couple of things to know before getting started. First, you can sign up for a development system, which allows you to provision a Saffron server for application development whenever you need it. With your account, you will be given a secret key that you will use when making your REST calls into the service. You can find example code to help you get going. Be sure to check back, as we are always adding more examples to this repository. If you have questions, problems, or suggestions, you can visit http://help.saffrontech.com for assistance. You can also find our developers' blog at http://blog.saffrontech.com to see what is on the minds of our engineers. And of course, here at http://docs.saffronsierra.com is where you will find documentation for the service. You will also see extended documentation for our enterprise product, for folks who have our system installed inside their organization. In addition to the REST interfaces, the enterprise product provides a rich set of ETL tools for customized, high-performance data ingestion, as well as extended data management APIs. To help you get started in understanding and using our APIs, this quick-start covers the RESTful object design and methods. If you haven’t registered, this quick-start will give you a sense of the simplicity and power of our memory-based approach for fast and easy data analysis, and we hope it encourages you to register and give it a try. Join us.

 

First, you will learn our namespace to understand how the objects and methods of this memory base are organized.  Second, you will see how to transform and load data through “observations” into memories.  Third, you will then be able to navigate your created memory space, and leverage the easy to use associative memory query methods.  Finally, we’ll tell you what to expect as we grow this service and build a memory making community with your support.  So let’s get started!

 

The Name Space

 

At the highest level, the interfaces are arranged in a RESTful object name space.  The name space is hierarchical as follows:

 

Spaces

Memories

Matrices

Rows

Columns

 

Here is a quick description of each level:

  • Space.  Think of a space as an application container.  For example, if you want to represent the knowledge in a data set or a number of related data sets, you will load them all into one space.  If you want to build another application with a different design based on the same or different data, you can create and use a second space.  The different spaces can also be used for different purposes within one application.  For example, you might want to represent source data in one space and represent user behaviors in another: one space might remember all the associations in Twitter feeds while the other space remembers which tweets are of interest to different users.  Think of a space as a graph of memories for different “people, places, and things”.  As you will see, memories represent semantic triples and their frequency counts.  For the mathematically minded, a space is really a manifold of a near-infinite number of weighted and context-dependent sub-graphs.  Any one graph you see depends on the query you ask.  We’ll get back to the kinds of queries you can ask of a space, but first let’s continue to understand the overall name space.
  • Memory.  Each memory stores all the information about one particular thing across all the data loaded into a space.  A “thing” can be a person, company, organization, country, or city, for example.  If you are interested in life science and healthcare, these might be patients, doctors, drugs, genes.  If your data contains decisions and outcomes, then each memory can represent an action or a consequence.  Memories can represent an instance of a thing or concept of a general topic, or as we call it, a category.  For example you may have structured data that contains partNumber:12345n or unstructured data that has been tagged or extracted with the term, city:Raleigh.  Wherever your data comes from, whatever your structured data describes in field:values or can be marked with tags in unstructured text, you should think of each of these things as mapping to a memory.
  • Matrix.  Each memory contains at least one matrix.  Spaces and memories are conceptual levels of the name space.  Matrices (and their rows and columns) are physical.  We define a memory as an association matrix, which connects and counts things.  You can ignore the matrix level to begin with because each memory has one by default.  If you make a memory, it will contain one matrix called “default”.  But as you develop your memory-based application, matrices give you a fourth dimension to play with, such as for the representation of time slices within each thing.  There are other uses of matrix slices too, but to begin, just remember that each memory contains a matrix as the physical representation of a memory for each thing.
  • Row.  Guess what?  Matrices are defined by rows and columns.  It’s pretty simple.  The things on the rows are associated with things on the columns.  The important point to remember is that the matrix representation is row-dominant.  In other words, we have implemented our matrices so that the associations are always grouped along the row.  This allows the rows to be used as inputs from queries and the columns to be used as outputs for answers.  Rows, like memories, are represented as category:value pairs.  In fact, some rows in one matrix of a memory may refer to another memory.  Other rows may not refer to other memories and just supply "context" for other associations.  Keywords are good examples of terms that don't justify their own memory, but add context to an association between two memories.
  • Column.  Guess what again?  Matrices have columns, and as just described, columns are used as outputs.  Like rows, they are also category:value pairs.  For now, just remember this name space and begin to visualize the physical structure.  Within a space, there are many memories.  Each memory-matrix has rows and columns.  If you “clamp” a question onto the rows, you can get the associated output from the columns.
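
To help visualize the hierarchy described above, here is a minimal Python sketch of it as nested dictionaries. It is only a toy model for building intuition, not the service's API or storage format; the names new_space and clamp and the sample category:value pairs are made up for illustration.

```python
from collections import defaultdict

# Toy model only (hypothetical, not the service's API):
# space -> memory -> matrix -> row -> column -> count

def _columns():
    return defaultdict(int)        # column name -> association count

def _rows():
    return defaultdict(_columns)   # row name -> columns

def _matrices():
    return defaultdict(_rows)      # matrix name (e.g. "default") -> rows

def new_space():
    return defaultdict(_matrices)  # memory name -> matrices

def clamp(space, memory, row, matrix="default"):
    """Clamp one row of a memory's matrix and return its column counts."""
    return dict(space[memory][matrix][row])

space = new_space()
# Two associations stored in the memory for "person:Joe Smith".
space["person:Joe Smith"]["default"]["city:London"]["verb:visited"] += 2
space["person:Joe Smith"]["default"]["city:London"]["docID:17"] += 1

answer = clamp(space, "person:Joe Smith", "city:London")
print(answer)
```

The row is the input and the returned columns are the output, which mirrors the row-dominant layout described in the Row and Column entries.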

 

The overall structure also includes the management of “scopes”.  This might be hard to remember at first, but if you think about the concept of scoping in computer science, you will get the idea.  You can create a “scope” in order to group memories.  For example, you can group different memories together that have different names but represent the same concept (e.g. person).   The different names might be spelling variants or might be intentional aliases.  Or for drugs, one name might be the brand name while the other is the compound name, but they are the same thing.  For whatever creative purpose you might need, this is a way to keep separate memories but treat them as one.  Scopes are applied at query time so they are not permanent.  They are completely dynamic and can be combined forming composite groups.
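
As a rough illustration of the idea, the following Python sketch treats a scope as a list of memory names whose counts are summed at query time while the memories themselves stay separate. The data layout and the function name query_with_scope are assumptions made for this example, not the service's interface.

```python
from collections import Counter

def query_with_scope(memories, scope, row):
    """Sum the column counts for `row` across every memory in the scope.

    `memories` maps memory name -> row -> column -> count (a toy layout).
    The scope exists only for this call; nothing is merged permanently.
    """
    total = Counter()
    for name in scope:
        total.update(memories.get(name, {}).get(row, {}))
    return total

memories = {
    "drug:Tylenol": {
        "symptom:headache": {"outcome:relief": 3},
    },
    "drug:Acetaminophen": {
        "symptom:headache": {"outcome:relief": 5, "outcome:nausea": 1},
    },
}

# Brand name and compound name, treated as one thing for this query only.
scope = ["drug:Tylenol", "drug:Acetaminophen"]
combined = query_with_scope(memories, scope, "symptom:headache")
print(combined)
```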

 

Note that the name space also includes the use of category:value pairs.  Category and value names are used to address 1) memories, 2) rows, and 3) columns.  In other words, if a memory is used to store the information about a specific thing, the naming convention should also include the type of thing.  If you create a memory for “Joe Smith”, its category:value should be “Person : Joe Smith” or “Customer : Joe Smith”.  Other examples are “City : London” and “Drug : Acetaminophen”.  Spaces and matrices are not typed, and are addressed only by name.  You can partition and name your spaces however you want.  Data within a memory can be partitioned into several matrices, each having a name that reflects the associations contained within that particular matrix slice.  For example, if you use matrices to time slice the memories, then a time frame or time code can be used as your way to partition, list and retrieve these partitions.  However, the use of category:value naming for memories, rows, and columns provides the organization we need for a powerful triple store of how things are related, organized by their category types as well as their name values.

 

Learning from Data

 

So how do you get your data into the memory base?  First, do not think of SaffronMemoryBase as storing your data.  It is a memory base, not a database.  It stores the information and knowledge in your data, not the data itself.  SaffronMemoryBase is a learning machine, based on the representation and reasoning of memories.  This approach to machine learning is schema-free and on-the-fly.  You can load-query, query-load-load-load-query, or whatever you want when you want.  It does not require batch training.  It does not require a lot of parameter-tweaking to “fit” data to a model.  It just observes data as it arrives.  Whenever you want to ask a question, it reasons from what it knows (the observations to date).

 

The first thing to do is to create a space.  There is a "default" space that you can use, which exists when the system starts up.  The "default" space, however, treats everything as both a row and a column, with no declared memories.  No memories?  What does that mean?  Where would I find my matrices?  It turns out that in addition to all the attributes that you label as "memories", there is one special memory that sees everything.  It is what we call the directory memory.  Every row and column that is stored in any declared memory is also stored in the directory memory.  Think of the default space as a single-matrix system.  If you want to customize your data mapping to memories, rows and columns (which you probably will), you will need to create a new space.  To create a space, you POST a simple XML schema definition of your space to the system.  Similar in concept to the SQL 'create table' statement, you declare to the system what to do with attributes as they are observed.

 

In defining your schema, you declare which categories should be memories, rows and/or columns.  Remember that memories and rows are used as inputs and columns are outputs.  Suppose we wanted to recall document IDs as outputs but never use them as inputs.  Then we could declare the “docID” category as a column.  But maybe these IDs can be used as inputs.  Maybe you are unsure right now about inputs and outputs.  That’s okay.  Don’t declare any rows and columns and all the following data will be used as both rows and columns.  In other words, any of the attributes can be used as inputs or outputs.  More advanced designs can use these declarations, but keeping everything as both inputs and outputs is a fine design for universal discovery: asking anything and answering anything.  For now, just declare which categories are the things in your data that are most important to learn about as memories.  This is a bit of an art, but of all the attributes in your system, memories should represent your primary concepts.  For example, for entity analytics, memories are the people, places and things, while the nouns and verbs are not.  In this case, the focus is on entity associations in the context of other entities and keywords (nouns, verbs).  One rule of thumb is to avoid the creation of memories for highly specific and arbitrary values, such as numbers.  Numbers such as the cash value of a transaction will be learned and remembered as rows and columns, but do you really need a memory for “amount : $2,310.56”?  Yes, everything can become a memory, but does this make sense for your data and application?  Who bought, what was bought, what month it was bought, and other things would make good memories for sure.  You’ll get a better sense of this as you go, so let’s just get some data in to start.
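
To make the declaration idea concrete, here is a small Python sketch that assembles a space definition as XML. Every element and attribute name in it (space, memory, row, column, name, category) is a placeholder invented for this example; the real schema is defined in the service documentation, so treat this only as an illustration of declaring which categories play which role.

```python
import xml.etree.ElementTree as ET

def build_space_definition(name, memories, rows=(), columns=()):
    """Build a HYPOTHETICAL space definition payload.

    The element and attribute names are assumptions for illustration;
    check the service documentation for the actual schema.
    """
    root = ET.Element("space", {"name": name})
    roles = (("memory", memories), ("row", rows), ("column", columns))
    for role, categories in roles:
        for category in categories:
            ET.SubElement(root, role, {"category": category})
    return ET.tostring(root, encoding="unicode")

# Who bought, what was bought, and when become memories;
# document IDs are declared as outputs only.
xml_payload = build_space_definition(
    "purchases",
    memories=["customer", "product", "month"],
    columns=["docID"],
)
print(xml_payload)
```

Leaving rows and columns empty in this sketch corresponds to the "declare nothing" design above, where every attribute can act as both input and output.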

 

Once you have created your spaces, you can list them, delete them and re-initialize them.  Aside from basic space management, the first API under spaces is called “observe”.  This is how you "ingest" your data into the memories within a space.  The POST method takes your XML data (attribute vectors) and assimilates it into the associative memory system.  Notice that each vector is a flat list of category:value pairs.  This is a very simple and universal approach.  No matter if your data contains transaction records or unstructured text, almost anything can be transformed to such a vector-based description.  If you have sentences to observe, then entity extractors can mark the people, places, and things.  Concept maps or human tagging might also be part of the description.  The other parts of speech might also be marked, to describe verbs for example.  More advanced processing might mark events or relationships.  In any case, each observation of the world is described by such a vector.  When you ask the space to observe the vector, each memory (the categories you declared to be memories) will update itself with all the associations between all the other attributes in the vector.  You will see how this works when we navigate the name space below.
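
The update described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the service's implementation: it stores each association as an ordered (row, column) pair, counts both directions, and ignores the matrix level. The function name observe and the sample vector are illustrative.

```python
from collections import defaultdict
from itertools import permutations

def observe(space, vector, memory_categories):
    """Toy observe: each attribute whose category is declared a memory
    counts every ordered pair of the OTHER attributes in the vector."""
    for attr in vector:
        category = attr.split(":", 1)[0]
        if category not in memory_categories:
            continue  # e.g. keywords supply context but get no memory
        others = [a for a in vector if a != attr]
        for row, col in permutations(others, 2):
            space[attr][(row, col)] += 1

# memory name -> (row, column) -> count
space = defaultdict(lambda: defaultdict(int))

vector = ["person:Joe", "city:London", "verb:visited"]
observe(space, vector, memory_categories={"person", "city"})
observe(space, vector, memory_categories={"person", "city"})

print(dict(space["person:Joe"]))
```

Observing the same vector twice simply raises the counts, which is why no batch training step is needed in this picture.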

 

As a matter of performance, pack several observations, if appropriate, into a single payload.  If source data arrives piecemeal, you can load each observation one at a time.  But for improved ingestion rates, it is beneficial to bulk-load.  Each "observe" POST defines a transaction boundary.  After the call, the memories of the data are committed and can take effect in the next queries.  Document scope is one example of a reasonable transaction boundary.  Tens or hundreds of sentences in a document can each be represented as a series of observations, all packed together to process at one time.  Within a single observation, there are two stanzas, the attributes and the matrices.  You can ignore the matrix stanza for now and just put everything in “default”, but use the matrix name too if you have a clever idea for using it as another dimension within memories.  The observation, along with the space definition, determines what associations to form and which matrices to store them in.

 

Forget is the reverse of observe.  The observe method constructs the necessary objects and increments the associative counts.  Forget decrements the counts (with a floor of zero) and deconstructs the objects (if the counts reduce to zero).  You might never use this if you are building a “total recall” of corporate or personal memories of all the data forever.  However, in dynamic memory applications, data might be retracted, in which case the memory can also keep in sync by forgetting it.  To modify a record, forget the old record and re-remember the new.  It will be like nothing ever happened except for the new associations.  More advanced applications, such as tracking in non-stationary environments, can use forget to punish a memory that is in error.  In this sense, forgetting acts to “sculpt” the memory, whether or not the exact observation was actually seen before.  However, when it comes to sculpting memories, there are other strategies like observing negative feedback memories, which offer query-time advantages such as the option of whether to apply the feedback and to what extent.  But let’s not worry about forgetting right now.  Get data into the memories.  Then let’s start using them, which is what really matters.
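
The counting rules for observe and forget can be sketched as follows. This is a minimal Python illustration of the increment, the floor of zero, and the removal of empty entries; the flat dictionary of triples and the function names are assumptions for the example, not the service's internals.

```python
def observe(counts, triple):
    """Increment the count for one memory-row-column triple."""
    counts[triple] = counts.get(triple, 0) + 1

def forget(counts, triple):
    """Decrement with a floor of zero; drop the entry once it reaches zero."""
    remaining = max(counts.get(triple, 0) - 1, 0)
    if remaining == 0:
        counts.pop(triple, None)
    else:
        counts[triple] = remaining

counts = {}
old = ("person:Joe", "city:London", "verb:visited")
new = ("person:Joe", "city:Paris", "verb:visited")

observe(counts, old)

# Modify a record: forget the old one, then remember the new one.
forget(counts, old)
observe(counts, new)

# Forgetting something never observed is harmless in this sketch.
forget(counts, ("person:Ann", "city:Rome", "verb:left"))

print(counts)
```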

 

Namespace Navigation

 

Once you observe data into a space, you can navigate all the associations.  Beyond just navigation as a developer of the space, you can provide navigation to your users as a powerful method itself.  List your spaces.  Within each space, list the categories, the kinds of things it contains.  Remember that category names and value names are parts of the hierarchical name space.   Given any one category, list the values in the space, the specific things in the space.  Given a specific category:value of a category mapped to memories, you have a handle to one of the memories.  Ignoring the matrix layer for now, just use “default”.  You can ask this matrix for the categories it contains as rows.  List the values for the category.  Now we are getting somewhere interesting!  Imagine having a person memory and asking it for a list of associated persons.    For that associated person, list the verbs off the columns that associate them.  Now we are starting to query and complete triples.  Or ask for time stamps when the two persons met or the document IDs as evidence of their meeting.  Or if you have transactions, ask one memory of a product for its associated problems.  Given one or more of these problems, what parts of the product are also associated?  What customers are having problems with this product?

 

This navigation supports faceted search.  Given one answer from a column output, you can clamp it to the row as an added input and ask other questions, now with greater precision to the growing context.  You can also pivot, recalling one associated category:value, and if that category is also a memory type, you can go to ask the same or similar question of its memory.  You can explore the memory network at will, discovering the overall structure or clamping rows to get back only the answers that are relevant to a context.    This navigation can be given to your user’s control or you can think about automated ways to “think” over this very large and complex manifold for more advanced functions.

 

SaffronMemoryBase also provides you with more than just the existence of the associations.  Because the cells in each matrix store counts, clamping rows as inputs and asking for columns as outputs also provides you with the frequencies of these associations.  In fact, when spaces become so large and complex that mere lists of associations are overwhelming, these counts are used to rank order and page the strongest associations first.  Even if there are thousands of answers, you can ask for just so many at a time, looking first at the strongest associations and paging down a sort of “entity rank” as needed.  Having raw access to this massive counting engine, you can also pull any sub-matrix of relevant frequency and frequency distributions.  Think of what you can compute statistically from such a basic return of connections and counts!  You can reason semantically over the memory-row-column triples and you can reason statistically over the frequencies that also define them.  Get creative!  You now have the full power of the memory representation to start your own memory-based reasoning and its application.
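
Ranking and paging by count can be pictured with a short Python sketch. The function name page_by_strength, the tie-breaking by name, and the sample counts are assumptions for illustration; the service's own ordering rules may differ.

```python
def page_by_strength(column_counts, page, page_size):
    """Return one page of (column, count) pairs, strongest first.

    Ties are broken alphabetically so that paging is deterministic.
    """
    ranked = sorted(column_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    start = page * page_size
    return ranked[start:start + page_size]

# Column counts returned after clamping some row (made-up numbers).
column_counts = {
    "person:Ann": 12,
    "person:Bob": 3,
    "person:Carl": 7,
    "person:Dan": 3,
    "person:Eve": 1,
}

first_page = page_by_strength(column_counts, page=0, page_size=2)
second_page = page_by_strength(column_counts, page=1, page_size=2)
print(first_page)
print(second_page)
```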

 

Space Views

 

You can now query and compute anything you want over a space, but the space also provides some of the application level views you might use.  We will continue to add to this list, but here are the current application and utility views you might like to use (or provide food for your own creative thoughts):

  • Analogies.  Given the category:value name of any attribute, analogies will be returned as a list of similar other attributes.  Knowing who is similar to whom is one side of the coin of what is sometimes called “entity analytics”.  This view is also often called entity resolution, such as for finding duplicate IDs, name variants, or intentional aliases.  If different category:values are determined to be the same individual, then you can join them with the scoping methods mentioned above.  Even if you are not looking for dups, variants, and aliases, this view can be used to look up and reason by similarity.  If you are in manufacturing and need to replace a part, are there any other similar parts?  Are you experiencing similar problems with other parts?  If you find one customer with a particular problem, are there other customers who have or are likely to have the same problem?  If you are trying to close a deal, are there any similar cases in history to recollect if and how they closed?  You don’t need to write any matching rules for this.  SaffronMemoryBase recalls the most informative signature of the category:values you provide and uses it to look up other values in the category with similar signatures.  The weighting of factors in the signature is by a measure of the information, or entropy, in the data observed.  If all the people in your space are male, then “gender : male” doesn’t help any when computing who is most similar.  If two and only two people were born in Timbuktu, well then, this is informative!  Entropy is a measure of information in a frequency distribution, and frequency distributions are what define the memory base.
  • Connections.  Knowing who is related to whom defines the second side of the entity analytics coin.  As discussed above, you can navigate the network of memories yourself, but the connections view also provides you a higher-level query language for searching and retrieving entities rather than documents.  The connections view returns entities according to entity rank rather than documents according to document rank.  Look at the Attribute Query Language (AQL) to see a search engine-like query capability including AND, OR, and NOT operators.  You can also specify the category type of entity you want returned if you are looking only for people or only for organizations, for example.  The entities are returned in rank order of several metrics.  If you want a simple list of entities and the general rank order of how they are connected to the query, then the list and relative ordering metric are provided by default.  If you would like to dig deeper into the ranking metrics, then you can request advanced metrics.  The ordering factors include whether the entities are associated by double or triple connections to the query terms (triples are more specific), the number of such connections each has to the query terms, and the frequency strength of these connections (of course).  Through the raw navigation views, you can create your own entity rank views, such as sorting by the unusualness or novelty of the connection.  Otherwise, the connections view offers one way to provide an entity search engine over a space of entity memories.
  • Networks.  Visualizing a network of entities is another way to analyze them.  Rather than a rank ordered list returned by the connections view, the network view shows how entities in a list (a set) are connected to each other.  Provide a list of category:values, and the network view will return the connections and connection strengths between them.  The view allows you to define two lists, a source list and destination list.  The second set is optional, but if included, then the network view will return a bipartite graph, showing only the connections between the sets.  Again, you can also build such functions yourself, perhaps returning the context of how the entities are connected, perhaps marking themes for how they are all connected.  But the network view shows you another idea of what you can do with a memory base and we will continue to provide more such functions for you as well.
  • Classifications.  In addition to entity analytics to make sense of entity networks, the classifications view introduces the ability to help make decisions.  The graph structure that is possible within the name space supports schema-free semantics.  The classifications view shows how the memory-based approach also supports schema-free statistics.  You can use any AQL query to include Boolean functions in what you want to classify, but think of a simple vector query as a list of category:values.  In the same way that objects can be described as vectors to the observe method, objects can be described as vectors to query their classification.  Think of the query vector as the independent variables of a new case.  Pick another category as the dependent variable, the output variable within which you want to classify the new case.  Suppose the space of observed cases includes the category “suspicious” with values “yes, no”.  Or the space includes the category “classification” with values “fish, amphibian, reptile, mammal”.  Given a new case, you want to know whether a transaction is likely suspicious or not.  Or given a new animal, what kind of animal is it?  Specify the classification category (any category) you want as the dimension for the classification.  The classifications view recalls similar cases and the associated classification of those cases to provide an answer.  This is called a nearest neighbor classifier.  It is kind of like the analogies view, but now applied to inductive reasoning.  If the category selected represents an action or outcome, you can classify a new case according to associated actions and outcomes from past experience.  This is experience-based or memory-based reasoning.
  • Attributes.  The attribute view is a utility to list any categories or values that match a search string.  For example, you can ask for all the values of people names that match “Jo*”, perhaps returning “Joe, Joseph, John, Joanne, …”.  This is just a way for you to lookup lexical values and to know what attribute terms (categories or values) exist in your space.
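
To give a feel for the information weighting behind the analogies view, here is a small Python sketch. It does not reproduce the service's actual measure: as a stand-in, it weights each shared attribute by its self-information, -log2 of the fraction of entities that have it, so an attribute shared by everyone counts for nothing and a rare one counts for a lot. All names and data are made up for the example.

```python
import math

def information_weight(attribute, entities):
    """Self-information of an attribute: rarer means more informative.
    (A stand-in for the entropy-based weighting, chosen for simplicity.)"""
    having = sum(1 for attrs in entities.values() if attribute in attrs)
    return -math.log2(having / len(entities))

def analogies(target, entities):
    """Rank the other entities by the weighted attributes shared with target."""
    scores = {}
    for name, attrs in entities.items():
        if name == target:
            continue
        shared = entities[target] & attrs
        scores[name] = sum(information_weight(a, entities) for a in shared)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

entities = {
    "person:Al":   {"gender:male", "born:Timbuktu"},
    "person:Bob":  {"gender:male", "born:Timbuktu"},
    "person:Carl": {"gender:male", "born:Raleigh"},
    "person:Dan":  {"gender:male", "born:London"},
}

ranked = analogies("person:Al", entities)
print(ranked)
```

Everyone here is male, so “gender:male” contributes zero, while the shared birthplace puts person:Bob at the top of the list, in line with the Timbuktu example in the Analogies entry.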

 

Download our sample code.  You can see Java and HTML examples for calling these views.  Launch index.html in particular.  This page serves as an example as well as a little utility to show you the HTTP signatures for these queries, along with the returns, to view in your browser.  This will give you a sense of the ins and outs of each view before you start calling them in your own code.

 

By the way, both XML and JSON can be used to observe data, but JSON is used exclusively for the results.  As a Web Service, we imagine that SaffronMemoryBase will be used for machine-to-machine communications (“mom” style communications), and JSON is more efficient than XML for this.  However, there is a lot of XML data in the world, geared for both automated machine reading and human reading (“pop” style).  XML is easier to read (at least easier than JSON), but is verbose.  In any case, observing XML lets you take existing XML data and more easily transform it to observation payloads.  Then JSON is the most efficient way to talk to the memory base.

 

Summary

 

To wrap up in review, SaffronMemoryBase REST APIs are organized into an object name space.  Spaces can be used to partition your memory-base into different applications or different functions within one application.  Memories define the people, places, things, actions, outcomes, events, or whatever you want to learn about.  Matrices give you a fourth dimension of partitioning the memories within a space, perhaps to represent time.  Matrices are the physical representations of a memory, defined by rows and columns.  Rows are for inputs and columns are for outputs. 

 

You can go crazy thinking about what you can now do with this highly scalable, super fast, and universally powerful representation.  You can use some of the already developed methods, or you can create your own.  We’ll be having fun and making more such methods ourselves!

 

What’s next

 

Heads up!  There are a few things for you to know as you think about what you can and can’t do with the current interface.  Here is what we are up to next:

  • Numeric values.  Observed values are all treated as strings.  Of course, lots of data includes both strings and numerics.  The enterprise version of SaffronMemoryBase does support numeric values, including generalizing functions to infer the associations to unseen values.  We will add this capability to the REST interface.  When you send observations with the observation schema, you will also be able to declare the categories that are numeric, so that the memory represents and uses scalar semantics, generalizing from seen to unseen values.
  • Memory signatures.  The analogy method uses the signature of an entity (a category:value represented by a memory).  The signature attributes are weighted by an entropy measure.  This signature can be useful for many purposes once we expose it as a separate method.  But even for analogy, you might want to take the computed signature weights and modify them based on other knowledge (when creating a fake identity, it is easier to change your hair color than your height).  Signatures will be available as either linear or non-linear.  Linear signatures are described by the most informative attributes (features and relationships for an entity, for example).  Non-linear signatures are described by the most informative associations between these attributes.  When distinctions become tougher or your data has more “noise”, the non-linear signature can more deeply describe the unique patterns within a memory.
  • Fundamental computations.  You can use the namespace methods to pull any set of submatrix counts and do anything you want, but we will also be developing some computations for you to use.  For example, entropy is a fundamental measure of a count distribution.  You might also want to know the expected value of a distribution.  Given the observed count that is also stored in SaffronMemoryBase, what is the expected value, and given the difference, how surprising is the actual count?  Is a link missing, likely to exist but not yet observed?  Is a count unusually high, too frequent to what was expected?  Other computations will include a measure of novelty.  How much information in an observation vector is currently unknown and new, for example?  You can also compute these basic functions, but we will keep packaging them ourselves to further exploit the statistical nature of the memory base.
  • Classification methods.  Nearest neighbor classification is only one method of classification.  We are also using conditional probability, weighting the attribute space, and other “thought processes” as we call them in the enterprise system.  We will increasingly provide our experience through these REST interfaces for increased support of decision making as well as sense making applications.
  • Advanced Design Patterns.  There are so many more things you can do once you get creative with how you can observe and design your memories with spaces.  For one, the memory-matrices can store co-distances as well as co-mentions.  This lets you predict not only that another category:value is likely to be associated but also how far away it is likely to occur, whether the distance is in space or in time.  Temporal prediction, representing and predicting event sequences, is one such function that will be available soon.
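
Until such computations are packaged, you can compute the basics yourself from the counts you pull. The Python sketch below shows Shannon entropy of a count distribution and one common way to form an expected count, the independence estimate from row and column totals. The choice of the independence estimate and the function names are assumptions for this example, not a statement of how the service will compute them.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_count(row_total, column_total, grand_total):
    """Expected co-occurrence count if row and column were independent."""
    return row_total * column_total / grand_total

# An evenly split distribution carries one bit; a single value carries none.
print(entropy([5, 5]))
print(entropy([10]))

# Compare an observed count against what independence would predict.
observed = 15
expected = expected_count(row_total=20, column_total=30, grand_total=100)
print(observed - expected)
```

A large positive difference suggests an unusually frequent association, while an observed count of zero against a high expected value hints at a link that is likely to exist but has not yet been observed.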

 

Help us build the memory making community.  If you have any problems, let us know through our on-line support.  If you want to add any example code, let’s put it out there together.  If you need more support or want to take bigger steps, give us a ring.  You will also see more and more about performance and scaling to help you with capacity planning.  For now, just know that your one cloud instance can observe thousands of triples per second and that results to date show the memory base to be near-linear in scalability as you add more and more machines.  We will be developing more and more examples and hope you join us in sharing yours.  There are so many possibilities we have not yet thought of!  We hope you will see this memory-based approach as fast and easy.  It is cool new stuff and we hope you will have fun developing applications with it.  There is so much more to come.
