Ingestion templates describe the format of the incoming data. Each template defines an ETL process that transforms data from the source into the normalized Saffron Resource Format (see Saffron Resource DTD for the complete specification). As customer needs grow, additional templates will likely become available. Below are the currently available templates for Saffron Sierra.
"WS XML Raw" Template
Usage: <template>WS XML Raw</template>
The WS XML Raw template uses the native Saffron XML format. It is essentially a passthrough template where no data pre-processing is performed. As a result, the client can supply all the elements of a full Saffron resource. See Anatomy of a Saffron Resource for more information on the various elements of a resource. A simple example of a Saffron XML resource may look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<as>
<a c="person">
<v><![CDATA[john]]></v>
</a>
<a c="city">
<v><![CDATA[raleigh]]></v>
</a>
<a c="state">
<v><![CDATA[nc]]></v>
</a>
</as>
</r>
The above example defines three resource-level attributes: person:john, city:raleigh and state:nc. These three attributes will all be associated with each other. A more advanced resource with segments might look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<as>
<a c="company">
<v><![CDATA[acme]]></v>
</a>
<a c="industry">
<v><![CDATA[computer software]]></v>
</a>
</as>
<ss>
<s>
<as>
<a c="person">
<v><![CDATA[mary]]></v>
</a>
<a c="city">
<v><![CDATA[san jose]]></v>
</a>
<a c="state">
<v><![CDATA[ca]]></v>
</a>
</as>
</s>
<s>
<as>
<a c="person">
<v><![CDATA[john]]></v>
</a>
<a c="city">
<v><![CDATA[raleigh]]></v>
</a>
<a c="state">
<v><![CDATA[nc]]></v>
</a>
</as>
</s>
</ss>
</r>
The above example has both resource-level attributes and segment-level attributes. The attributes within each segment are associated with each other (and with the "global" resource-level attributes). However, attributes are not associated across segments, i.e.
attributes company:acme, industry:computer software, person:mary, city:san jose, state:ca would all be associated with each other
and...
attributes company:acme, industry:computer software, person:john, city:raleigh, state:nc would all be associated with each other
but...
attributes person:mary, city:san jose, state:ca would not be associated with attributes person:john, city:raleigh, state:nc
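The grouping rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not Saffron's implementation; the helper names (attrs, association_groups) are hypothetical, and the XML prolog and DOCTYPE are omitted from the sample for brevity.

```python
import xml.etree.ElementTree as ET

# The segmented resource from the example above (prolog and DOCTYPE omitted).
RESOURCE = """<r n="1">
<as>
<a c="company"><v><![CDATA[acme]]></v></a>
<a c="industry"><v><![CDATA[computer software]]></v></a>
</as>
<ss>
<s><as>
<a c="person"><v><![CDATA[mary]]></v></a>
<a c="city"><v><![CDATA[san jose]]></v></a>
<a c="state"><v><![CDATA[ca]]></v></a>
</as></s>
<s><as>
<a c="person"><v><![CDATA[john]]></v></a>
<a c="city"><v><![CDATA[raleigh]]></v></a>
<a c="state"><v><![CDATA[nc]]></v></a>
</as></s>
</ss>
</r>"""


def attrs(as_element):
    """Return "category:value" strings for the <a> children of an <as> element."""
    if as_element is None:
        return []
    return [f'{a.get("c")}:{a.findtext("v")}' for a in as_element.findall("a")]


def association_groups(xml_text):
    """One group per segment: the resource-level attributes plus that segment's own."""
    root = ET.fromstring(xml_text)
    global_attrs = attrs(root.find("as"))
    segments = root.findall("ss/s")
    if not segments:
        # No segments: the resource-level attributes form a single group.
        return [global_attrs]
    return [global_attrs + attrs(s.find("as")) for s in segments]


groups = association_groups(RESOURCE)
for group in groups:
    print(group)
```

Each printed list is one set of attributes that would be associated with each other. No list contains attributes from both segments.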
Another example might involve text where entities may have been extracted:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<ss>
<s>
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="person" o="9" l="4">
<v><![CDATA[john]]></v>
</exa>
<exa c="verb" o="19" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="company" o="27" l="4">
<v><![CDATA[acme]]></v>
</exa>
</exas>
</a>
</as>
</s>
<s>
<as>
<a c="text">
<v><![CDATA[Mary works in San Jose with Bob.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="verb" o="5" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="city" o="14" l="8">
<v><![CDATA[san jose]]></v>
</exa>
<exa c="person" o="28" l="3">
<v><![CDATA[bob]]></v>
</exa>
</exas>
</a>
</as>
</s>
</ss>
</r>
In the above example, only the extracted attributes are associated with each other. If an attribute contains extracted attributes, the attribute itself is not associated with anything; only its child extracted attributes are. As a result,
attributes person:mary, person:john, verb:work, company:acme would all be associated with each other
and...
attributes person:mary, verb:work, city:san jose, person:bob would all be associated with each other
In the end, the double association person:mary <-> verb:work would have an association count of 2, while all other double and triple associations would have counts of 1.
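The counts described above can be checked with a short Python sketch. It assumes that each segment contributes one observation per distinct pair of attributes and that a parent attribute with extracted children is skipped. Both assumptions are read off this page rather than taken from Saffron's code, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# The extracted, segmented resource from the example above (prolog omitted).
RESOURCE = """<r n="1"><ss>
<s><as><a c="text"><v><![CDATA[Mary and John both work at Acme.]]></v><exas>
<exa c="person" o="0" l="4"><v><![CDATA[mary]]></v></exa>
<exa c="person" o="9" l="4"><v><![CDATA[john]]></v></exa>
<exa c="verb" o="19" l="4"><v><![CDATA[work]]></v></exa>
<exa c="company" o="27" l="4"><v><![CDATA[acme]]></v></exa>
</exas></a></as></s>
<s><as><a c="text"><v><![CDATA[Mary works in San Jose with Bob.]]></v><exas>
<exa c="person" o="0" l="4"><v><![CDATA[mary]]></v></exa>
<exa c="verb" o="5" l="4"><v><![CDATA[work]]></v></exa>
<exa c="city" o="14" l="8"><v><![CDATA[san jose]]></v></exa>
<exa c="person" o="28" l="3"><v><![CDATA[bob]]></v></exa>
</exas></a></as></s>
</ss></r>"""


def observed_attributes(segment):
    """Collect the "category:value" attributes that take part in associations."""
    observed = set()
    for a in segment.findall("as/a"):
        extracted = a.findall("exas/exa")
        if extracted:
            # The parent attribute is skipped; only its extracted children count.
            observed.update(f'{e.get("c")}:{e.findtext("v")}' for e in extracted)
        else:
            observed.add(f'{a.get("c")}:{a.findtext("v")}')
    return observed


def pair_counts(xml_text):
    """Count how many segments each unordered attribute pair appears in."""
    counts = Counter()
    for segment in ET.fromstring(xml_text).findall("ss/s"):
        for pair in combinations(sorted(observed_attributes(segment)), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(RESOURCE)
print(counts[("person:mary", "verb:work")])
```

For simplicity the sketch ignores resource-level attributes, which this example does not have.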
"WS XML Text" Template
Usage: <template>WS XML Text</template>
The WS XML Text template uses the native Saffron XML format. It differs from the WS XML Raw template in that for every attribute with a category named "text", it performs an extraction and segmentation process. First, attributes are extracted from the text field in the following order:
1. NamelistExtractor - extracts entities that are defined in your namelist (see the Namelist APIs).
2. RegexExtractor - extracts common terms: dates, email addresses, phone numbers, URLs and geocoordinates
3. WordNetExtractor - extracts keywords: nouns, verbs, adverbs, adjectives, dates, phone numbers, times, timezones, currencies, numbers, alphanums, unknowns (uncategorized terms)
Then the text block is segmented on sentence boundaries. Segmentation defines the observation boundaries: as described above, attributes from one segment are not associated with attributes from another segment. For example, if the following resource was submitted with the "WS XML Text" template, and assuming person:mary, person:john, company:acme and city:san jose have been inserted into your namelist, it would be converted to the same resource as above.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme. Mary works in San Jose with Bob.]]></v>
</a>
</as>
</r>
The final resource that is then observed into SMB is a fully extracted, segmented resource.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<ss>
<s>
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="person" o="9" l="4">
<v><![CDATA[john]]></v>
</exa>
<exa c="verb" o="19" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="company" o="27" l="4">
<v><![CDATA[acme]]></v>
</exa>
</exas>
</a>
</as>
</s>
<s>
<as>
<a c="text">
<v><![CDATA[Mary works in San Jose with Bob.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="verb" o="5" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="city" o="14" l="8">
<v><![CDATA[san jose]]></v>
</exa>
<exa c="person" o="28" l="3">
<v><![CDATA[bob]]></v>
</exa>
</exas>
</a>
</as>
</s>
</ss>
</r>
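As a rough illustration of the segmentation step, the following Python sketch splits a text block on sentence boundaries and records where each sentence starts. The splitting rule (a period, exclamation mark or question mark followed by whitespace or the end of the text) is an assumption made for this sketch; the rule Saffron actually applies is not documented on this page, and split_sentences is a hypothetical name.

```python
import re


def split_sentences(text):
    """Return (start_offset, sentence) pairs for a block of text.

    A sentence is assumed to end at '.', '!' or '?' when that character is
    followed by whitespace or by the end of the text.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        spans.append((start, text[start:end]))
        # Skip the whitespace between sentences.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        # Trailing text without closing punctuation becomes its own segment.
        spans.append((start, text[start:]))
    return spans


TEXT = "Mary and John both work at Acme. Mary works in San Jose with Bob."
sentences = split_sentences(TEXT)
for offset, sentence in sentences:
    print(offset, sentence)
```

Each returned sentence corresponds to one segment in the final resource above.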
"WS XML Text Extraction" Template
Usage: <template>WS XML Text Extraction</template>
The WS XML Text Extraction template uses the native Saffron XML format. It differs from the WS XML Text template in that the text has already been segmented but not extracted. For every attribute with a category named "text", it performs the extraction process only. The attributes are extracted from the text field in the following order:
1. NamelistExtractor - extracts entities that are defined in your namelist (see the Namelist APIs).
2. RegexExtractor - extracts common terms: dates, email addresses, phone numbers, URLs and geocoordinates
3. WordNetExtractor - extracts keywords: nouns, verbs, adverbs, adjectives, dates, phone numbers, times, timezones, currencies, numbers, alphanums, unknowns (uncategorized terms)
For example, the following segmented but not yet extracted resource could be submitted with the "WS XML Text Extraction" template:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<ss>
<s>
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme.]]></v>
</a>
</as>
</s>
<s>
<as>
<a c="text">
<v><![CDATA[Mary works in San Jose with Bob.]]></v>
</a>
</as>
</s>
</ss>
</r>
"text" attributes do not necessarily need to be in segments for the "WS XML Text Extraction" template. You could pass in resource-level ("global") attributes to be extracted as well. For the segmented input above, the final resource that is then observed into SMB is a fully extracted, segmented resource:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<ss>
<s>
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="person" o="9" l="4">
<v><![CDATA[john]]></v>
</exa>
<exa c="verb" o="19" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="company" o="27" l="4">
<v><![CDATA[acme]]></v>
</exa>
</exas>
</a>
</as>
</s>
<s>
<as>
<a c="text">
<v><![CDATA[Mary works in San Jose with Bob.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="verb" o="5" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="city" o="14" l="8">
<v><![CDATA[san jose]]></v>
</exa>
<exa c="person" o="28" l="3">
<v><![CDATA[bob]]></v>
</exa>
</exas>
</a>
</as>
</s>
</ss>
</r>
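To show where the o and l values in the extracted resource come from, here is a minimal Python sketch of a namelist-style lookup. It assumes that o is the zero-based character offset of the match within the text and that l is the length of the match, which is consistent with the examples on this page. The NAMELIST dictionary and namelist_extract function are hypothetical and do not represent the NamelistExtractor or the Namelist APIs. The sketch also adds person:bob to the namelist so that its output matches the person entries in the example.

```python
import re

# Hypothetical namelist: lowercase value -> category.
NAMELIST = {
    "mary": "person",
    "john": "person",
    "bob": "person",
    "acme": "company",
    "san jose": "city",
}


def namelist_extract(text, namelist):
    """Find namelist values in text (case-insensitive, whole words only)."""
    found = []
    for value, category in namelist.items():
        pattern = r"\b" + re.escape(value) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append({
                "c": category,                    # category, as in <exa c="...">
                "o": match.start(),               # offset of the match in the text
                "l": match.end() - match.start(), # length of the match
                "v": value,                       # normalized (lowercase) value
            })
    return sorted(found, key=lambda e: e["o"])


extracted = namelist_extract("Mary works in San Jose with Bob.", NAMELIST)
for entry in extracted:
    print(entry)
```

The verb:work entry in the example is not produced here, since keywords such as verbs are listed under the WordNetExtractor rather than the namelist.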
"WS XML Text Segmentation" Template
Usage: <template>WS XML Text Segmentation</template>
The WS XML Text Segmentation template uses the native Saffron XML format. It differs from the WS XML Text Extraction template in that the text has already been extracted but not segmented. For every attribute with a category named "text", it performs the segmentation process only (on sentence boundaries). Segmentation defines the observation boundaries: as described above, attributes from one segment are not associated with attributes from another segment. For example, if the following resource, whose attributes have already been extracted, was submitted with the "WS XML Text Segmentation" template, it would be converted to the same resource as above.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme. Mary works in San Jose with Bob.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="person" o="9" l="4">
<v><![CDATA[john]]></v>
</exa>
<exa c="verb" o="19" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="company" o="27" l="4">
<v><![CDATA[acme]]></v>
</exa>
<exa c="person" o="33" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="verb" o="38" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="city" o="47" l="8">
<v><![CDATA[san jose]]></v>
</exa>
<exa c="person" o="61" l="3">
<v><![CDATA[bob]]></v>
</exa>
</exas>
</a>
</as>
</r>
The final resource that is then observed into SMB is a fully extracted, segmented resource.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE r SYSTEM "saffron-normalized-v4-single.dtd">
<r n="1">
<ss>
<s>
<as>
<a c="text">
<v><![CDATA[Mary and John both work at Acme.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="person" o="9" l="4">
<v><![CDATA[john]]></v>
</exa>
<exa c="verb" o="19" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="company" o="27" l="4">
<v><![CDATA[acme]]></v>
</exa>
</exas>
</a>
</as>
</s>
<s>
<as>
<a c="text">
<v><![CDATA[Mary works in San Jose with Bob.]]></v>
<exas>
<exa c="person" o="0" l="4">
<v><![CDATA[mary]]></v>
</exa>
<exa c="verb" o="5" l="4">
<v><![CDATA[work]]></v>
</exa>
<exa c="city" o="14" l="8">
<v><![CDATA[san jose]]></v>
</exa>
<exa c="person" o="28" l="3">
<v><![CDATA[bob]]></v>
</exa>
</exas>
</a>
</as>
</s>
</ss>
</r>
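Comparing the input and output above, the extracted attributes in the second sentence move from offsets 33, 38, 47 and 61 in the full text to offsets 0, 5, 14 and 28 within their segment. The Python sketch below reproduces that re-basing. It assumes o is a character offset and l a length, and it takes the sentence spans as given; the segment_extractions function is hypothetical and is not Saffron's implementation.

```python
def segment_extractions(extractions, sentence_spans):
    """Group (category, offset, length, value) tuples by sentence.

    Offsets are re-based so that they are relative to the start of the
    sentence that contains the extraction.
    """
    segments = []
    for start, sentence in sentence_spans:
        end = start + len(sentence)
        rebased = [
            (c, o - start, l, v)
            for (c, o, l, v) in extractions
            if start <= o and o + l <= end  # extraction lies inside this sentence
        ]
        segments.append((sentence, rebased))
    return segments


TEXT = "Mary and John both work at Acme. Mary works in San Jose with Bob."
# Sentence spans as (start offset in TEXT, sentence text).
SPANS = [(0, TEXT[:32]), (33, TEXT[33:])]
# Extracted attributes with offsets relative to the full text, as in the input above.
EXTRACTIONS = [
    ("person", 0, 4, "mary"),
    ("person", 9, 4, "john"),
    ("verb", 19, 4, "work"),
    ("company", 27, 4, "acme"),
    ("person", 33, 4, "mary"),
    ("verb", 38, 4, "work"),
    ("city", 47, 8, "san jose"),
    ("person", 61, 3, "bob"),
]

segments = segment_extractions(EXTRACTIONS, SPANS)
for sentence, rebased in segments:
    print(sentence, rebased)
```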