Validating XML Data
with an XML Schema
Contents
1. XML Validation Concepts
a. Concepts
b. Errors
c. Resources
2. Example: Validation with XMLSpy
a. Downloading Spy
b. Creating a new XMLSpy Project
c. Associate the homestead XML Schema with a folder
d. Open the file in XMLSpy
e. Add the active file to the folder
f. Click the "Validate" button
3. Example: Manipulating Large XML Data Sets with Ant & Eclipse
a. Tools for Records and Metadata vs. Tools for Data
b. Apache Ant – DOS command line
c. Eclipse – GUI interface
Disclaimer
• The information and examples in this document are for demonstration purposes only.
• They are provided to help counties work with and validate XML datasets against Minnesota Revenue XML schemas.
• The Minnesota Department of Revenue neither endorses nor supports any products mentioned in this presentation. It is beyond the scope of the mission of the Property Tax Division to support tools within each
XML Validation Concepts
[Diagram: an XML file and an XML Schema go into an XML validator, which reports either "Validates" or validation errors.]
If you have:
1) a valid XML file, and
2) a well-defined XML Schema, you can
3) use any standard XML validation program to check that the file is well-formed XML and has all the required tags defined by the schema.
XML Validation Concepts
• XML is a text file where well defined tags surround each data value.
• An XML Schema describes what tags are needed and where they need to be for a particular file.
Tag example: <Zip_Code>55101</Zip_Code>
Schema example:
<xs:element name="Zip_Code">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
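The Zip_Code pattern restriction can be mimicked outside a validator to see which values it would accept. The following is a minimal sketch, not part of the homestead schema tooling; the function name is_valid_zip_code is made up for illustration. It relies on the fact that XML Schema patterns must match the whole value, which corresponds to re.fullmatch in Python.

```python
import re

# XML Schema patterns implicitly match the entire value, so re.fullmatch is
# the closest Python equivalent of <xs:pattern value="[0-9]{5}"/>.
ZIP_CODE_PATTERN = re.compile(r"[0-9]{5}")


def is_valid_zip_code(value: str) -> bool:
    """Return True if value would satisfy the Zip_Code pattern restriction."""
    return ZIP_CODE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show accepted and rejected cases.
    for candidate in ["55101", "5510", "55101-1234", "ABCDE"]:
        print(candidate, is_valid_zip_code(candidate))
```

Under this pattern a ZIP+4 value such as 55101-1234 is rejected, because the schema allows exactly five digits.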
XML Validation Errors
If you have:
1) A file that is not well-formed XML: you get an "invalid XML", "malformed XML" or content error. Examples are missing tag brackets or other syntax errors.
2) A well-formed XML file with tag errors: you get a list of the tags that are inconsistent with the specific XML Schema being validated against.
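The first kind of error can be reproduced with any XML parser. The sketch below uses Python's standard library parser, and the function name check_well_formed is illustrative. It only covers case 1: Python's standard library does not include an XML Schema validator, so case 2 still needs a schema-aware tool such as the ones described later in this document.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str) -> str:
    """Return "well-formed" or a short description of the first syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is the (line, column) pair reported by the parser.
        line, column = err.position
        return f"malformed XML at line {line}, column {column}: {err}"
    return "well-formed"


if __name__ == "__main__":
    print(check_well_formed("<Zip_Code>55101</Zip_Code>"))
    # Missing closing tag bracket, as in the example above.
    print(check_well_formed("<Zip_Code>55101</Zip_Code"))
```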
XML Validation Errors for XML Escape Characters
Five characters are used in XML syntax and cannot be used directly in a data value. They must be “escaped” by replacing the character with its ampersand representation:
Name           Character   Escape
ampersand      &           &amp;
greater than   >           &gt;
less than      <           &lt;
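When data values are written by a program, escaping is usually delegated to a library rather than done by hand. The sketch below uses Python's xml.sax.saxutils module. The helper names escape_value and unescape_value are illustrative, and the sketch assumes the two characters not named above are the double quote (&quot;) and the apostrophe (&apos;).

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >. The two quote characters are passed in
# as extra entities so that all five characters are covered (an assumption
# of this sketch; the slide itself names only three).
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARACTERS = {"&quot;": '"', "&apos;": "'"}


def escape_value(text: str) -> str:
    """Escape a data value so it can be placed between XML tags."""
    return escape(text, QUOTE_ENTITIES)


def unescape_value(text: str) -> str:
    """Reverse escape_value()."""
    return unescape(text, QUOTE_CHARACTERS)


if __name__ == "__main__":
    # Hypothetical owner name containing an ampersand.
    print("<Owner>" + escape_value("Johnson & Sons") + "</Owner>")
```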
10 Common XML Transmission Errors
1. Malformed XML
2. Missing namespace declarations
3. Invalid document structure
4. Missing required element
5. Missing data in element
6. Invalid document type code values
7. Invalid property type code value
8. Invalid character values
9. Incorrect number of repeating fields
10. Incorrect tax year
XML & Validation Resources
• W3C XML Standards Page – http://www.w3.org/XML/
• OASIS XML Cover Pages – http://xml.coverpages.org/xml.html#xmlValResources (lots of references)
• XML.com – http://www.xml.com (up-to-date XML information)
• XML.com Schema Tools – http://www.xml.com/pub/a/2000/12/13/schematools.html (older list of schema tools)
Example:
Validating a Homestead File with
XMLSpy
Validating with XMLSpy Steps
1. Download XMLSpy (30 day free eval) and the homestead zip file
2. Create a new XMLSpy Project
3. Associate the homestead XML Schema with a folder
4. Open the file in XMLSpy
5. Add the active file to the folder
6. Click the "Validate" button
Download XML Spy
• http://www.altova.com/products/xmlspy/xml_editor.html
Download Homestead Files
Start XML Spy
• Double click the XML Spy icon
New Project Window
• Note: if the window is not visible, use the Window/Project menu to show the project window
Set the Properties of the XML Folder
• Right click over the XML files folder in the project view
• NOTE: RIGHT CLICK not left click
Browse… to homestead schema
• Click OK and then double-click the XML data file to be validated
Add this file to your project
• Right-click and select "Add Active File"
View Results in Validation View
• If your file is valid, a green check will appear in the validation view
File Size Limitations
• XMLSpy tends to have problems validating files over about 25MB on a system with
1GB of RAM
• Use Apache Ant and/or Eclipse if you want to validate larger files
Example:
Manipulating Large
XML Data Sets with Ant & Eclipse
Agenda
• Tools for Records and Metadata vs. Tools for Data
• Apache Ant
– DOS command line
• Eclipse
– GUI interface
• V – The File Viewer
– Viewing large files
• XML databases
Records vs. Databases
• XML File Viewers (like XML Spy) are ideal for viewing single records and metadata (XML Schemas)
• Visual editing tools tend to stop working when file sizes exceed about 25MB (given 2GB of RAM). For comparison, we don't use MS Word to edit 100,000 records in a database.
In Memory vs. Streaming
• There are several different approaches to checking large files
– Load the entire file into memory (DOM)
– Stream the file through memory (SAX)
– Page only relevant sections into memory (Chunking – used in V-The-File-Viewer)
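The streaming (SAX) approach can be sketched with Python's standard xml.sax module. The parser hands each tag to a handler as it is read, so the whole file never has to sit in memory. The class and function names here (ElementCounter, count_elements) and the sample data are made up for illustration. This sketch only checks that the file is well-formed and counts its tags; it does not validate against a schema.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts elements by tag name as they stream past."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing else about the file is retained.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(source):
    """Stream a file name or file object through the parser and return tag counts."""
    handler = ElementCounter()
    xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    # Hypothetical in-memory sample; a real run would pass a file path instead.
    sample = b"<Records><Record><Zip_Code>55101</Zip_Code></Record></Records>"
    print(count_elements(io.BytesIO(sample)))
```

A DOM-style parser would instead build the full tree before any code runs, which is what makes very large files a problem for visual editors.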
Apache Ant
• Open source build manager
• The user gives Ant a high-level description of a task
• Ant executes the task using dependency analysis (e.g. only validate after extract)
• Called from shell (DOS or UNIX)
• Called from Integrated Development Environment (IDE)
Adding tools.jar
• Apache Ant needs one jar file that is often missing, called "tools.jar"; it is free with Sun's software development tools
• It is freely available from the Java download as part of the Java SDK 1.4+ (but not the JRE)
• A temporary copy is on the Java Open Source User Group (JOSUG) web site (www.josug.org/tools.jar)
• The file is about 6MB!
Apache Ant 1.7
• Many new features
• Simple <schemavalidate> task
• Faster execution
<schemavalidate
noNamespaceFile="homestead-data_v0.28.xsd"
<?xml version="1.0" encoding="UTF-8"?>
<project default="validate-homestead">
  <property name="SrcDir" value="C:/homestead/stress-test"/>
  <property name="SchemaDir" value="C:/homestead/schemas"/>
  <target name="validate-homestead">
    <schemavalidate noNamespaceFile="${SchemaDir}/homestead-data_v0.28.xsd"
                    file="${SrcDir}/100MB-test.xml"/>
  </target>
</project>
Ant From DOS Command Line
1. Download Apache Ant version 1.7.0
2. Copy the build.xml into a directory
3. Change the file locations in the properties of the build file to match your local files
[Screenshot: build.xml, with the callout "Change these to match your local system"]
Apache Ant Tasks
• schemavalidate
– New Ant 1.7 optional task just for XML Schema
• xmlvalidate
– very general Ant 1.6 task for validation of XML files
– check for well-formed files
– check for validation against an XML Schema
schemavalidate options
Sample Ant 1.6 Validate Script
Eclipse
• Open source integrated development environment originally sponsored by IBM
• "GUI" front end to Apache Ant
Complete Ant 1.7 Build File
<?xml version="1.0" encoding="UTF-8"?>
<project default="validate-homestead">
  <property name="DataDir" value="C:/homestead/data-files"/>
  <property name="SchemaDir" value="C:/homestead/schemas"/>
  <target name="validate-homestead">
    <schemavalidate noNamespaceFile="${SchemaDir}/homestead-data_v0.28.xsd"
                    file="${DataDir}/my-data-file.xml"/>
  </target>
</project>
GUI "Point and Click" UI
• Sample "point and click" GUI interface
• Alt+Shift+X, Q to run a task
XML Transform
• View a homestead record of a specific parcel ID
[Diagram: a Big File (Gigabytes) passes through an XML Transform With Matching Rules and produces a Very Small File; the branches are labeled "match" and "no match".]
Sample XML Transform
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mn="http://data.state.mn.us"
    xmlns:c="http://niem.gov/niem/common/1.0"
    xmlns:u="http://niem.gov/niem/universal/1.0"
    xmlns:mnr="http://revenue.state.mn.us"
    xmlns:mnr-ptx="http://propertytax.state.mn.us"
    exclude-result-prefixes="mn mnr c u mnr-ptx">
  <xsl:output indent="yes"/>
  <!-- only display the homestead record for this parcel ID -->
  <xsl:template match="/HomesteadRecordsDocument/CountyHomesteadRecord/HomesteadParcels/HomesteadParcel/CountyPropertyTaxStatement[mn:ParcelID='1234567']">
    <!-- copy the CountyHomesteadRecord that matched this parcel ID to the output -->
    <xsl:copy-of select="../../.."/>
  </xsl:template>
  <!-- do not output anything else -->
  <xsl:template match="@*|node()">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>
</xsl:stylesheet>
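The same "big file in, very small file out" filter can be sketched as a streaming pass in Python, for readers without an XSLT 2.0 processor. This is an illustration only, not the transform used above. The element names and the mn: namespace URI are taken from the stylesheet, while the function name find_records and the sample data are hypothetical. Unlike the stylesheet's exact path, this sketch accepts an mn:ParcelID anywhere inside a CountyHomesteadRecord.

```python
import io
import xml.etree.ElementTree as ET

MN_NS = "http://data.state.mn.us"  # namespace bound to the mn: prefix in the stylesheet
PARCEL_ID_TAG = f"{{{MN_NS}}}ParcelID"
RECORD_TAG = "CountyHomesteadRecord"


def find_records(source, parcel_id):
    """Yield each CountyHomesteadRecord, serialized, that contains parcel_id."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != RECORD_TAG:
            continue
        if any(p.text == parcel_id for p in elem.iter(PARCEL_ID_TAG)):
            yield ET.tostring(elem, encoding="unicode")
        # Drop the finished record's contents so memory use stays small.
        elem.clear()


if __name__ == "__main__":
    # Hypothetical one-record document shaped like the path in the stylesheet.
    sample = (
        b'<HomesteadRecordsDocument xmlns:mn="http://data.state.mn.us">'
        b"<CountyHomesteadRecord><HomesteadParcels><HomesteadParcel>"
        b"<CountyPropertyTaxStatement><mn:ParcelID>1234567</mn:ParcelID>"
        b"</CountyPropertyTaxStatement></HomesteadParcel></HomesteadParcels>"
        b"</CountyHomesteadRecord></HomesteadRecordsDocument>"
    )
    for record in find_records(io.BytesIO(sample), "1234567"):
        print(record)
```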
V-The File Viewer
• $20 application (less in quantity)
• Easily allows viewing of files greater than 1GB (uses file
"chunking" technology)
Use Goto Function
• Goto is (Ctrl-G) or
XML Databases
• XML databases store XML in its native format
• You can associate a column in your database or a "collection" with the homestead XML Schema
Example of XML Databases
• IBM DB2 version 9 "PureXML"
– free and low-cost "express" versions for development and testing
• eXist (open source)
– native XML database with XML Schema validation
• Over 50 other free and low-cost solutions with
DB2
• IBM DB2 version 9 supports fast searches on complex XML data sets
• Load records into XML datatype
• Records are quickly validated using an XML Schema
eXist
• Open source
• Built-in web administration
• Easy to set up and configure
• Allows data to be validated on insert
• Fast searches
Microsoft SQL Server 2005
• Supports native XML datatype
• Supports fast indexing
• Add SOAP services to XML documents
• Support for XQuery and XQuery updates