Aseem Bajaj1, Thorsten Joachims2
Cornell University, Ithaca, NY
{ab3781, tj2} @cs.cornell.edu
 
With the popularity of XML for representing & exchanging data, the requirement for agreement between parties on common Schema or DTD has become significant. Getting various parties to agree on a common standard is often time consuming and complex problem.
 
This work suggests an approach to identify mapping between XML documents with different schemas.
 
Identifying mapping between two sets of XML documents would find usage in various areas. Here is a scenario. A user that hosts her web log on blogger.com wants to shift to City Desk (another web logging system). There has been little standardization of web log data representation. Both the systems support XML Import & Export. The matcher can help create a transformer by identifying corresponding tags in the two schemas.
 
|   <item>        <title>My name
  is...</title>        <description>        Jack Bauer and ths is going to be the
  longest day of my life - well,        these were exactly my thoughts during
  our trip to Vegas - it took us more        than 24 hours to get there. Nice! But
  overall it went were smoothly without        any serious problems. I just tried the
  network connection here and everything        seems to work, even VPN - Great! So,
  only 10 minutes to go before the        conference starts. More later...perhaps
  :)        </description>         <guid>http://radio.weblogs.com/0107211/2003/11/17.html#a183</guid>        <pubDate>Mon, 17 Nov 2003 16:21:20
  GMT</pubDate>   </item>   | 
|   <entry>        <time>October 25,
  2003</time>        <URL>http://www.joelonsoftware.com/items/2003/10/25.html</URL>        <heading>Tidbits</heading>        <posting>        My incoming spam is running at over 200
  junk emails a day,        but SpamBayes is catching them all,
  with virtually no false positives.        Bayesian filtering, invented by Paul
  Graham and available in        many open source implementations, is
  the best answer yet.        </posting>                             </entry>   | 
|   entry                        =              item time                          =              pubDate URL                         =              guid heading                   =              title posting                    =              description   | 
 
Above example is a simple case of XML Fragment mismatch. A generalized problem can be stated as below.
Given a collection of XML documents, all adhering to a common structure, for a new test document, find mapping for each of the tags in the test document to the tags in document collection.
 
The test XML document could be very different from the training documents.
 
The problem of structural mapping between XML documents is closely related to some other problems
 
Recently there has work on matching structural similarity between XML Documents.
 
Measuring Structural Similarity among XML Documents and DTDs [5] is used for clustering XML Documents based on similarity. For a given XML document, the approach finds a matching DTD from a group. The approach defines and/or uses four key data structures
The approach uses the following measures for similarity
The paper defines an evaluation function and matching algorithm to calculate the “similarity measure” between the XML document and DTD.
 
But the approach considers tag equivalence instead of tag similarity to calculate the document similarity, though later, it uses attributes and sub elements to check tag similarity. This approach cannot be used in this context as we are trying to find an approach that works even in case the tag & attributes names don’t match.
 
Similarity Search in XML Data using Cost-Based Query Transformations [6] proposes approXQL, a query language that returns ranked results from XML documents. Traditional query languages on XML like XPATH & XQL are like structured query languages that return results if any found that match the queries. The query language looks like XPATH. The basic approach creates a query tree. Then it applies following transformations on it to get a bigger tree
The algorithm then matches elements in the XML documents that match this wider (more relaxed) query condition/tree. For each match, the conditions that are used are used to evaluate the rank for the search.
 
XML Retrieval Models based on Structural Proximity for Irregular Structured Document Sets [8] use structural information to improve Information Retrieval. The approach takes care of the fact that XML data is generally unregulated. The paper suggests a query model called the Set of Regular Paths, SRP, to get the ranked results. The query may be used for structure-content or content only information. The approach then uses the edit distance, proximity & term weights to calculate the similarity between XML documents.
 
Though information retrieval is another potential application of our work, and that is what the above two papers discuss, these works cannot be useful in suggesting a mapping between XML document structures.
 
Evaluating Structural Similarity in XML Documents [7] discusses an approach to partition a group of XML documents into sub-groups based on similarity. The approach uses dynamic programming to calculate edit distances between trees. The edit distance is used as a measure of similarity. The approach identifies the following types operations that would result in tree transformation
Though this approach can find good measure of similarity (and that’s another application that can be built over our work), this approach cannot be adapted well to identify mapping between XML Documents in the first place.
 
Text Classification and Information Retrieval problems have pretty successfully used machine learning by taking words in document as features and learning from them.
 
Unlike traditional text classification model, that learns and classifies documents, in our approach machine learning is used to classify each of the tags in the test document. The class learning is based on the each of the tag instances in the training documents. Here is the outline of the process
 
The diagram below shows the vector space representation for each of the tags.
 
 
 
 
 
| Tag |   | Content | Features |   |   | Content Info Features | Structural | Info Features | 
|   |   |   |   |   |   |   |   |   | 
| Item |   |   |   |   |   | Nullcontent:1 | Repeats:1 | Children5:1 | 
|   |   |   |   |   |   |   |   |   | 
| Creator |   | Matt:1 | Welch:1 |   |   | Tinycontent:1 | IsLeaf:1 |   | 
|   |   |   |   |   |   |   |   |   | 
| Title |   | Saudis:1 | Fear:1 | Sand:1 | Shortage:1 | Tinycontent:1 | IsLeaf:1 |   | 
Using each of the leaf tags in the XML collection as a training example, with tag name as the training class, a text classifier can classify each of the leaf tags in the test XML fragment with the correct tag name from the training schema.
Given sufficient training, identifying mapping for each leaf tag
The experiment consists of the following steps
Using various properties of content in various tags in training collection can improve classification accuracy.
For leaf tags, following properties can have stated results.
· Content Length: The classifier differentiates between tags based on size, like Title and Description.
To the above experiment process, add the following steps.
· While extracting out tags and content from XML documents, add the following to the content before writing it in Lemur Document format
o Depending the size, add information for sizes 0, 1, 2-4, 5-32, 33-256, 257 and above
Using structural information properties as features in vector representation of the tags can provide matches with fair accuracy.
The classification for tags especially non-leaf tags in the test fragment would give better accuracy with the use of structural information as features.
Use the experiment processes of first hypothesis and instead of tag content, use the following properties similar to the second experiment above.
Suggested mappings for all of the instances of a given tag in XML documents should be the same. Use of voting & confidence values after classification can suggest correct uniform mapping.
A few of the classification made by the classifier would be weak.
The post-processor would use the classification results & it’s own rules to come up with a common, stronger & better suggestion all the test instances of the same tag. This would improve the accuracy results.
Use the above experiments to suggest tags for all the tags in the test document. Then apply the following post processing rules.
a. If there is at least one positive match (confidence > 0): Among the tag names with positive matches, choose the one that is suggested the most.
b. If there is no tag with a positive value of confidence: Among the tags names with negative values, choose the class that had the highest confidence level.
 
 
Learns the structure of the training document in the following steps
Note: Among the above properties, “Parent’s Siblings” and “Depth” are not being used for the tests because the XML document in the collection have the same structure and these properties would give undue advantage to the tests.
Identifies the mapping between provided XML document and the document structure learnt by the Learner.
 
The test collection consists of main page content published by 6 web log sites. All the 6 XML documents follow a common RSS 1.0 schema.
The test procedure is explained here.
Shown below are the accuracy results for overall accuracy of each of the documents tested. The ‘Total’ column prints the number of tags tested in each document, while the ‘+’ & ‘-‘ identifies the number of correct & incorrect classifications.
 
Class Total + - Acc
 
Overall 131 100 31 0.7634
Overall 131 115 16 0.8779
Overall 131 100 31 0.7634
Overall 131 115 16 0.8779
Overall 132 115 17 0.8712
Overall 131 115 16 0.8779
 
Average Accuracy 0.8386
 
For each test document there are just one or two errors in mapping (as is visible in the mappings shown in the next table). An average accuracy of 83% is achieved.
The Matcher suggests the tags for each of the tags in the test document. Here are the suggested mappings for the last run in the above test.
 
       
Actual Tag                                   SVM Suggestions    Final Suggestion
 
generatorAgent( 1)                                +generatorAgent(1)      generatorAgent
      
creator(16)                            
+creator(1)+title(15)              
title
  
description(16)                        
+description(15)+title(1)        
description
     
language( 1)                                      +language(1)            language
        
title(16)        
+title(11)+subject(3)-title(1)-subject(1)               title
         
item(15)                                         +item(15)                item
      
subject(15)                                      +subject(15)             subject
           
li(15)                                           +li(15)                  li
      
channel( 1)                                       +channel(1)             channel
        
items( 1)                                         +items(1)               items
          
RDF( 1)                                           +RDF(1)                 RDF
         
date(16)                                         +date(16)                date
          
Seq( 1)                                           +Seq(1)                 Seq
         
link(16)                                         +link(16)                link
 
The post processor helps in making correct judgments about the mappings. In the above example out of the 3 cases where multiple suggestions are made by the SVM, the post processor makes right judgment in two cases, while in the remaining case it makes a wrong selection (but even in that case, there is no way it could have made right selection because the classifier results were heavily in favor of the answer the post processor selected)
To analyze the above results and to test out our hypotheses, we conducted the one on one test. The test was conducted as such
 
Here are the results for the best case found in the above random selection.
 
Class Total + - Acc
Overall 131 115 16 0.8779
 
Even though the training was done with just one document, the accuracy here is comparable to the case of training with 5 documents. This is because each of the documents has many instances of a tag & the SVM gets a chance to train on many examples.
The above tests were repeated without using Structural Information. The table below represents results using content and it’s properties as features.
 
Class Total + - Acc
Overall 131 95 36 0.7252
 
Simply using content and its properties is enough to give an accuracy of 72%. Though this is not as good as the previous case, it proves the hypothesis that content information can be used to identify structural mapping between XML fragments.
 
Note: Tests were also conducted by using purely content as the features. The accuracy for the document was poor, around 1%. By using just one document as training hardly gave the SVM a chance to learn, hence the results. Since these are obvious results, they have not being included here.
The one on one test was repeated without using Content & it’s properties & only using structural information. Here is the table of results.
 
Class Total + - Acc
Overall 131 35 96 0.2672
 
The ‘items’ & ‘RDF’ tags in the documents are non-leaf tags. Note that they have been mapped correctly here, while they were mapped incorrectly in the last test, while simply using content information. This is because they don’t have any content and the SVM doesn’t have a chance to learn any feature about them except the fact that they have null content.
 
The results of the above table, shows that structural information in itself is not enough to find mapping between the structures. But the structural information supplements the content information to boost the performance significantly.
The above four one-on-one test (using various features) were repeated for all 30 pairs in the document collection. The graph below shows the results.

 
 
Similarly the leave one out test was repeated by making the same variations in choice of features as above.
 

 
The accuracy results for tests using pure content as features improve in the latter case because the SVM get a chance to learn more from the data. But the accuracy results using pure structural information remains exactly the same. This is because the structure of the document is same and there is nothing new to learn for the SVM after it has looked at one document.
 
In both the cases it’s clear that even though content & structural information are useful in identifying mapping between XML documents, combining them gives better accuracy numbers.
The work presented the idea that it is possible to use various properties of XML documents to identify mapping between them. Specifically, we use properties of tag content & information about structure of the XML documents as features in Vector Space Model to find the mapping between tags of two XML documents. Though each of these two pieces of information can find mappings to some extent, using them in conjugation gives the best results.
 
Here is a list of possible suggestions for further development going ahead from here
[1] Thorsten Joachims, SVM light, a Support Vector Machine Implementation
http://svmlight.joachims.org/
 
[2] Heloise Hwawen Hse, jSVM, a Java Wrapper for SVM light
http://www-cad.eecs.berkeley.edu/~hwawen/research/projects/jsvm/doc/manual/
 
[3] Lemur Project
http://www.cs.cmu.edu/~lemur/
 
[4] Jakarta Project
http://jakarta.apache.org
 
[5] E. Bertino, G. Guerrini, M. Mesiti, I. Rivara, C. Tavella: Measuring Structural Similarity between XML Documents and DTDs
 
[6] Torsten Schlieder: Similarity Search in XML Data using Cost-Based Query Transformations
 
[7] Andrew Nierman, H. V. Jagadish: Evaluating Structural Similarity in XML Documents
 
[8] Shinjae Yoo: XML Retrieval Models based on Structural Proximity for Irregular Structured Document Sets