Parsing

I am on my third class of an online XML tutorial. This week’s topic was XML parsing, which covers how to actually use XML: the parser is the piece that reads the XML. Some web browsers can parse XML; Internet Explorer 4 was the first browser to include a parser. Expat is a free XML parser. Lark is a non-validating XML parser written in Java. That’s all the class had to say about parsers. It was pretty light.

So I also went to W3Schools for some XML parser information. All modern browsers have a built-in XML parser. The parser can convert XML into a DOM object, and it traverses XML trees. I also looked up information about the Expat parser since my online course mentioned it. This XML parser is written in the C programming language and is stream-oriented. Applications register handlers, and the parser calls back to the application when events of interest occur.
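Python’s standard library actually ships a binding to Expat, so the stream-oriented handler model is easy to try. A minimal sketch (the document and element names are my own made-up example):

```python
# Minimal sketch of Expat's stream-oriented, callback-based parsing,
# using Python's built-in binding to the Expat library.
import xml.parsers.expat

events = []  # record the callbacks as the parser fires them

def start_element(name, attrs):
    events.append(("start", name, attrs))

def char_data(data):
    if data.strip():
        events.append(("text", data.strip()))

def end_element(name):
    events.append(("end", name))

parser = xml.parsers.expat.ParserCreate()
# Register handlers: Expat calls back into the application on each event.
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.EndElementHandler = end_element

parser.Parse("<note id='1'><to>Tove</to></note>", True)
print(events)
```

The key point is that the application never sees the whole document at once; it just reacts as elements stream past.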

Finally I went to Sun Microsystems to brush up on XML parsers. Their material was heavily oriented toward parsing with the Java programming language. Data becomes available to the application while the XML is being parsed; SAX makes parsing callbacks available. Your first step is to actually obtain a parser, and it should comply with the XML specifications. Sun recommends the Apache Xerces parser. Xerces is free, and it comes in both C++ and Java versions.

Sun says that you should receive the SAX classes along with your parser, as they are parser dependent. The first thing the application needs to do is instantiate the parser. Then you set up callbacks so that SAX can take action on interesting events. This is called registering handlers with SAX. I get the feeling that I actually need to play around with this some more to get a better feel for how it works.
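Playing around with it is easy in Python, whose xml.sax module mirrors the same flow Sun describes: instantiate a parser, then register a handler so the parser calls back on interesting events. A small sketch (the handler and document are my own example):

```python
# SAX flow: obtain a parser, register a handler, let the parser
# call back into the handler as it walks the document.
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        self.in_title = (name == "title")

    def characters(self, content):
        if self.in_title and content.strip():
            self.titles.append(content.strip())

    def endElement(self, name):
        self.in_title = False

handler = TitleHandler()
# parseString instantiates a parser internally and registers our handler.
xml.sax.parseString(b"<books><title>XML 101</title></books>", handler)
print(handler.titles)
```

The handler methods are the "interesting events" the text mentions; anything you don’t override is simply ignored.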

Channels

Kris Zyp has written some interesting entries on the Comet Daily blog that have to do with channels. The first I saw covered HTTP Channels, a publish-and-subscribe model for communicating resource changes over HTTP.

Kris has also written about JSON Channels, an extension to JSON-RPC. They provide a subscription capability as well, but they are easier to implement than HTTP Channels because you do not have to parse the HTTP protocol. They can be thought of as an alternative format to HTTP Channels.
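To make the subscription idea concrete, here is a toy sketch of what a JSON-RPC style subscribe/notify exchange could look like. The method names and paths ("subscribe", "/stock/GOOG") are my own invention for illustration, not the actual JSON Channels protocol:

```python
# Toy JSON-RPC style exchange: the client subscribes to a resource,
# and the server later pushes a notification about a change to it.
import json

# Client -> server: a JSON-RPC request with an id (a reply is expected).
subscribe = json.dumps({"method": "subscribe",
                        "params": ["/stock/GOOG"],
                        "id": 1})

# Server -> client, later: a notification (id of null: no reply expected).
notify = json.loads('{"method": "notify",'
                    ' "params": [{"path": "/stock/GOOG", "price": 472}],'
                    ' "id": null}')
print(notify["params"][0]["price"])
```

The appeal is visible even in this toy: both directions are plain JSON, so neither side needs an HTTP parser to understand the messages.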

In another post, Kris introduces REST Channels. The plan here is to use the WebSocket protocol from HTML 5, but with the data in JSON format. You can mix HTTP and JSON in a browser session, so an efficient technique could use HTTP requests, with updates coming back in JSON format.

I do not want to get too deep into the details; the Comet Daily blog is the source for that. It is interesting to see that HTTP is still a driving force on the web, yet people consider its format a chore to parse. I have heard colleagues mention how JSON is a good, lightweight format for the web.

I myself am trying to get into web development to prepare myself for where my current project is going. So I am going to need a good handle on all the technology involved. It seems this might be a bit more than just learning XML and its supporting technologies. At least I will have to do a lot of homework to be able to speak intelligently at the architecture level.

Valid XML

Last week I subscribed to an online class for XML. This week I received my second class installment. The topic was valid XML documents. This lesson taught that a valid XML document is one that conforms to its DTD or XML Schema. That is different from an XML document being well formed, which means an XML processor can read the document. To me, well formed means it meets the syntax rules of XML.

The course notes this week pointed out that XML is a self-describing language, so you might not always need a DTD or XML Schema. However, when you do use validation, the DTD can be local or public. A validation file such as an XML Schema describes the XML document, and a parser then validates the XML document against the schema.

One of the most interesting experiences with this week’s class was an ad on the lesson page. The ad offered a free XML viewer, so I clicked through and downloaded the Firstobject XML editor. It was a free download. The editor divides the XML document into elements and their sub-elements, using a tree-like structure in the user interface. I found the tool to be very basic compared to XMLSpy.

The Firstobject XML editor displays XML as text. There do not seem to be many graphical viewing capabilities other than the tree structure. One nice thing about this product is that the company will give you the source code to the editor if you buy their product. They are trying to sell a single MFC class that does all things XML for $249. That’s a lot cheaper than my XMLSpy tool would cost.

Perhaps the best course of action would be for me to write my own XML viewer and/or editor. Then I would really learn the ins and outs of parsing XML documents. I would not have to start from scratch. I could use an XML parser library to actually read the XML document. This would not be cheating. I would be learning the parser’s API.
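The core of such a viewer is tiny once a parser library does the reading. A minimal sketch in Python: let ElementTree parse the document, then print the element tree with indentation, much like the tree view in the Firstobject editor:

```python
# A minimal "XML viewer": the parser library reads the document,
# and we just render the element tree with indentation.
import xml.etree.ElementTree as ET

def show(elem, depth=0, out=None):
    out = [] if out is None else out
    out.append("  " * depth + elem.tag)   # indent by nesting depth
    for child in elem:
        show(child, depth + 1, out)
    return out

root = ET.fromstring("<library><book><title>XML 101</title></book></library>")
lines = show(root)
print("\n".join(lines))
```

Everything hard (syntax, entities, encodings) is handled by the parser; the "viewer" is just a tree walk.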

Going to the Web

The Software Maintenance blog gives an example of one project that is going to the Web. Moving from a client-server environment to a web one involves a number of technologies. Of course there are some bare-bones basics like web servers and HTML. But it is so much more than that. Frequently developers are classified as either desktop or web developers. Making the transition can be difficult.

I have sensed a different outlook amongst web developers. That might be a hard trait to emulate. However it should not be difficult to identify some common technologies used by web developers. For example, they may tend to use a combination of Java and JavaScript on the front and back ends. And you can pretty well imagine that they might be using XML format for sending data between systems.

Personally I am planning to enroll in some classes to beef up my web programming skills. The first one is going to be Java 101. I guess that would actually be my second class. I have already taken an introductory XML class.

XML 101

I took an XML training class a couple weeks ago. However I have not been able to do any real XML work due to other constraints at work. Therefore I decided to do an XML refresher to make sure I did not forget the things I learned in class. I signed up for an online course from About.com, which offers courses delivered through e-mail. I chose the course “XML 101” by Jennifer Kyrnin.

Perhaps this online course is not as current as the recent one my company paid for. This course stated that XML is not a programming language. It is also not a language of tags. Instead, XML is instructions on how to create the tags. It is a markup language to define information. XML is a meta language.

The online course covered Document Type Definitions (DTDs), which are the grammar of the XML document. The second line of an XML file should be the DTD declaration (the DOCTYPE line). However, you can omit it.
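As a concrete picture of that layout, here is a minimal document with the XML declaration on line one and the DOCTYPE line second. The element names and DTD filename are invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
</note>
```

Drop the second line and the document is still well formed; it just can no longer be validated against that DTD.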

The latest specifications for XML are at the W3C web site. There is a lot of jargon associated with XML, but XML is not hard to learn. An entity is a storage unit for XML. Processing instructions were also covered. Elements in XML are case-sensitive.

Kyrnin recommended that you learn HTML before trying to learn XML. She says the benefit of XML is that you can move processing from the server to the client. So far I have only read through the first lesson. There are a lot of ads in the course material I receive. However the course is free so it is a reasonable trade off. Given the coverage of DTDs already, I wonder if this course is going to cover XML Schema as well.

XMLSpy Ad

Last weekend I was browsing the latest copy of Oracle magazine. I saw a full-page ad in it for XMLSpy 2008. This product has advanced tools for XML Schema development. It is the self-proclaimed world’s best-selling XML editor. I recall my XML instructor telling me the same thing.

The ad declared that XMLSpy had many useful features such as support for very large XML files, DTD conversion, UML generation, advanced validation, and code generation for Java/C#/C++. It encourages you to leave the XML Schema details to the XMLSpy program while you concentrate on the business at hand. Altova (the maker of XMLSpy) offers a free 30 day trial download of the program.

I went to the Altova web site to find out more about the program. It has Visual Studio and Eclipse plugins. Of course it offers a visual XML schema editor. It also has a DTD editor. It supports Open XML (OOXML). It can debug SOAP. XMLSpy has a CSS editor.

XMLSpy has a number of tools for XQuery, like an editor, debugger, and profiler. It can analyze XPath. It integrates with your database. XMLSpy supports XInclude and XPointer. Hell, I don’t even know what XInclude is. So you know this thing must be good. LOL.
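XPath, at least, is easy to get a taste of: it is a query syntax for picking nodes out of an XML tree. A small example using the limited XPath support in Python’s ElementTree (the catalog document is my own illustration):

```python
# A taste of XPath: select nodes from an XML tree with a path
# expression instead of hand-written tree traversal code.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "<book genre='xml'><title>Learning XML</title></book>"
    "<book genre='web'><title>JSON Basics</title></book>"
    "</catalog>"
)
# Titles of all books whose genre attribute is 'xml'.
titles = [t.text for t in doc.findall(".//book[@genre='xml']/title")]
print(titles)
```

A dedicated tool like XMLSpy adds an editor, debugger, and profiler on top of this, but the query idea is the same.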

The only downside to XMLSpy is the price. I had been warned about this before. The standard edition goes for $149. The professional edition costs $599. And the enterprise edition is $1190. Now that is not a lot for a development tool. We have third party tools that cost a whole lot more. The problem is that I need to justify the cost to my company and/or client. And that is some non-trivial effort.

Luckily my boss already knows the power of XMLSpy. So now we can work together to convince the powers that be that I need this tool. Altova has done a good job with its marketing. From the grapevine I also hear the product itself is excellent.

Benchmark Tool by Intel

I found a product on the web called the XML Benchmark Tool (XBT) by Intel. It is an XML performance and measurement tool. It analyzes the performance of XML processing engines. It tests a number of things such as XML parsing, XSLT, XML Schema validation, and XPath operations, to name a few. You can write your own driver since the tool is provided as a framework. There are versions for Windows and Linux. You can test C++ and Java code.
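The core of what a tool like this measures can be sketched in a few lines. A toy micro-benchmark (my own illustration, nothing to do with XBT’s actual drivers): parse the same document many times and time it:

```python
# Toy version of an XML parsing benchmark: time repeated parses of
# the same document to estimate parser throughput.
import time
import xml.etree.ElementTree as ET

doc = "<root>" + "<item>x</item>" * 100 + "</root>"

start = time.perf_counter()
for _ in range(200):
    ET.fromstring(doc)
elapsed = time.perf_counter() - start
print(f"parsed 200 documents in {elapsed:.4f}s")
```

A real benchmark framework adds pluggable drivers, warm-up runs, and statistics, but the measured quantity is the same.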

The XBT is provided for free. I suspect Intel is trying to gather good will to sell its XML Software Suite. That product costs $199 for the developer edition, and the run-time license goes for $1999. Intel states that it can handle large XML file processing, claims it is standards compliant, and says it is thread safe.

I downloaded the free XBT. It was almost 6 megabytes. The download unzips into a directory structure. There is no install program. I found that a bit strange. You have to manually configure the product yourself. It requires you to obtain and install third-party products such as Cygwin, the JDK, and Visual Studio. The free download comes with a PDF user guide. There were too many dependencies and configurations required to get the application up and running quickly.

XBT comes with many familiar drivers. For example it comes with Xerces, JAXP, libxml, MSXML, JDOM, Saxon, and Xalan. This seemed promising. When I signed up to download the free product, I had to provide my e-mail address. It was embarrassing on September 4th when I received the following e-mail from a guy at Intel:

From: censored@intel.com
Date: Sep 4, 2008 9:39 PM
Subject: Out of Office: XML Benchmark Tool
To: me

I will be on vacation with no email access, returning Sept 2nd. Please contact for any items needing my immediate attention.


Ouch. Somebody has fallen asleep at Intel product support. I was not even given the e-mail address of the other guy at Intel. I suspect this out-of-office message was intended for other Intel Corporation employees. The dude forgot that he was the one who received automatic e-mails every time somebody downloaded the XML Benchmark Tool. LOL.

Schema From XML

A long time ago I used to read the CodeGuru web site for information about software development. I have since graduated to other, more interesting web sites. However I still go back to CodeGuru every now and then. Today I saw an article posted there entitled “Inferring an XML Schema from an XML Document” by Paul Kimmel. I read it with interest as I am becoming more aware of XML Schema.

Paul said that writing plain XML documents is easy. However he does not have enough practice to write XML Schema documents from scratch. The attribute and namespace syntax gets him mixed up sometimes. So he decided to write a program to generate an XML Schema from any given XML document. That sounds like a good idea to me.

The trick is to employ the XmlSchemaInference class from the .NET framework. Specifically, you can use the InferSchema method of this class to produce an XML Schema Definition Language (XSD) schema if you pass it an XML document. That way you let the class handle the formatting and syntax for you. Sure, you could do this yourself by hand if you know the ins and outs of XML Schema. But why take the hard route?
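The real work lives in that .NET class, but the basic idea is easy to sketch in any language. A deliberately naive toy (my own illustration, far simpler than what the .NET inference actually does): walk the document and emit one xs:element declaration per distinct element name:

```python
# Naive sketch of schema inference: collect the distinct element
# names in a document and emit a skeletal xs:element declaration
# for each. Real inference also works out types, nesting, and
# occurrence constraints.
import xml.etree.ElementTree as ET

def infer_element_names(xml_text):
    root = ET.fromstring(xml_text)
    names = []
    for elem in root.iter():          # document order, root first
        if elem.tag not in names:
            names.append(elem.tag)
    return names

names = infer_element_names("<books><book><title>LINQ</title></book></books>")
xsd_lines = [f'<xs:element name="{n}"/>' for n in names]
print("\n".join(xsd_lines))
```

Even this toy shows why generating the schema beats writing it by hand: the element inventory falls straight out of the document.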

I suspect Paul wrote this article to help publicize his new book “LINQ Unleashed: for C#”. The list price for this book is $49.99. However you can get it for $31.49 from Amazon with free shipping. Good luck Paul with your new title. I have some interest in LINQ myself. If I can get my company to pay for it, I will pick up a copy.

Protocol Buffers

Matt Cutts of Google declared on his blog that Google had open sourced protocol buffers. They encode data in binary format for transmission. You write a description of the protocol you desire. The Google code then generates classes to work with the protocol. It supports the C++, Java, and Python programming languages. There are over 10k protocols used by Google itself.

People have said Protocol Buffers look similar to Facebook Thrift. And Thrift supports even more languages, like Perl, PHP, XSD, C#, Ruby, Objective-C, Smalltalk, Erlang, OCaml, and Haskell. Matt Cutts has gone on the record as stating that Protocol Buffers predate Thrift. Both Thrift and Protocol Buffers are based on old ideas such as CORBA and IDL. Some have commented that it would be nice if you could map Protocol Buffers automatically to XML.

Protocol buffers create stubs for your RPCs. People have commented that Protocol Buffers look a lot like JSON. Sometimes Google refers to Protocol Buffers as pbuffers. Google uses them exclusively for talk between servers. Google itself uses C++ for programs that run on production machines, in order to get the best performance.

I consulted the Google developer’s guide on Protocol Buffers. It declares that they are language neutral, platform neutral, and extensible. Their intended purpose is to serialize structured data. It claims that Protocol Buffers are smaller, faster, and simpler than XML. You define message types in “.proto” files. They contain name value pairs. Fields are numbered in each message type. When you add fields, the result is still backward compatible.
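For a picture of what such a file looks like, here is a hypothetical message type in the proto2 syntax of that era. The message and field names are my own example:

```proto
// Hypothetical message type for illustration. The numbers identify
// each field on the wire, so a new optional field can be added later
// without breaking readers of the old format.
message Person {
  required string name  = 1;
  required int32  id    = 2;
  optional string email = 3;
}
```

Because the wire format keys on the field numbers rather than the names, old readers simply skip fields they do not know, which is where the backward compatibility comes from.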

Mark Pilgrim, another Google employee, likens a proto file to a schema. It does not contain data. He says that Protocol Buffers are designed to minimize network traffic and maximize performance. They can be nested. And they are both backward and forward compatible. He stated they will not replace JSON.

I have not used Protocol Buffers myself. However if Google uses them that much, there must be some really good benefits to them. Unfortunately my own project at work seems to be going in the direction of XML. I think we are officially prototyping it next year in production.