Parsing XML to get one value

**NeillHog** · October 29th, 2007

Between all the philosophy about right and wrong and what the monks would do...

Quikee wrote that, because the XML file I am using makes use of a namespace, I have to include

namespace = "http://www.topografix.com/GPX/1/1"

OK! I went off and read about namespaces, but a question remains (at least for mere mortals like me):
How do I know what value I have to give for "namespace"?
I realise that it is in the XML file, but the way I understand things I can only read the file if I already know the namespace value.

Sorry if I am being really dense here. Please bear with me. I am learning. Four months ago I had never heard of Ubuntu and three months ago I thought that Python was a snake.

Thanks!

**pmasiar** · October 29th, 2007

Python still is a snake. The language, though, was named not after the snake but after the British comedy group behind "Monty Python's Flying Circus". If you haven't seen "Life of Brian" you are missing a lot.

**LaRoza** · October 29th, 2007

The namespace name of an XML document means nothing in itself. It is usually a URI of some sort, which makes it likely to be unique, but that URI is never actually fetched.

Namespaces differ from document to document. The namespace is normally declared in the root element, and the declaration may bind it to a prefix (a word followed by a colon). In my page, laroza.freehostia.com/home, you'll see this line in the source:

Code:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

I declared an XML namespace (xmlns) of "http://www.w3.org/1999/xhtml". This is the default namespace: all child elements without a prefix are part of it. The xml:lang="en" attribute is prefixed with xml: because that attribute belongs to another namespace. I could have written:

Code:

<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml" xml:lang="en">

But then I would have had to write xhtml: before every element of this namespace. The prefix doesn't have to be xhtml, it could have been almost any word, but giving the namespace a logical name makes sense.

I don't know what the XML you are reading looks like, and I haven't used the functions you are using, so I don't know whether they pick the elements of one namespace out of several or whether that is the default namespace.

I hope this helps, but I was not sure exactly what you are asking.
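As a small illustration of the point about prefixes, here is a sketch using Python's standard `xml.etree.ElementTree`. The two sample strings are made up for the example; the only thing taken from the post is the XHTML namespace URI. It suggests that the parser cares about the URI, not about the prefix used to write it.

```python
from xml.etree import ElementTree

# The same element, once with a default namespace and once with a prefix.
# The prefix "xhtml" is an arbitrary choice for illustration.
default_form = '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
prefixed_form = (
    '<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    "<xhtml:body/></xhtml:html>"
)

default_root = ElementTree.fromstring(default_form)
prefixed_root = ElementTree.fromstring(prefixed_form)

# ElementTree stores the namespace URI inside the tag as "{uri}localname",
# so both spellings end up with identical tags.
print(default_root.tag)
print(prefixed_root.tag)
print(default_root[0].tag == prefixed_root[0].tag)
```

Both roots should report the tag `{http://www.w3.org/1999/xhtml}html`, which is why code that searches for elements has to spell out the URI rather than the prefix.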

**NeillHog** · October 29th, 2007

I am 44 and originally come from England so I think that answers your question.
But the lumberjack song and the dead parrot are both better than Life of Brian.

But ...
How do I find a value for namespace? It is in the root string in

namespace = "http://www.topografix.com/GPX/1/1"
tree = ElementTree.parse("test.xml")
root = tree.getroot()

but there must be a function to find it.

Mustn't there?

I really "wanted to be a lumberjack" but when I told the careers advisor that he assumed I had overdosed on Monty Python.

**skeeterbug** · October 29th, 2007

Use the DOM example posted, or your own SAX handler. Please don't use a regex, it will be much more difficult to maintain. You may only want one node now, but how about in a month? What if the XML changes?

If you are parsing unstructured data, use regex. It is very powerful. XML is structured and we have libraries to easily work with it, use them!

**Quikee** · October 29th, 2007

if you look at a part of your xml:

Code:

<gpx
  version="1.1"
  creator="Touratech QV 4.0.87 Standard - http://www.ttqv.com"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:topografix="http://www.topografix.com/GPX/Private/TopoGrafix/0/1"
  xmlns="http://www.topografix.com/GPX/1/1"
  xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
  <metadata>
    <time>2007-09-02T17:30:53Z</time>
    <bounds minlat="47.4481272697449" minlon="10.4949045181274" maxlat="47.6289081573486" maxlon="11.0962772369385"/>
  </metadata>
....

you see that <gpx ..> has an attribute xmlns="http://www.topografix.com/GPX/1/1", which means that gpx is part of the "http://www.topografix.com/GPX/1/1" namespace and all its unprefixed children (practically all other elements) are part of this namespace as well.

If you look even further you see that the <gpx..> element has an attribute xmlns:topografix="http://www.topografix.com/GPX/Private/TopoGrafix/0/1", which means that every element written as <topografix:someElementName> is part of the "http://www.topografix.com/GPX/Private/TopoGrafix/0/1" namespace.

Which namespace and which element "fit together" is defined in the specs of the GPX format (the schema of the XML).

In other words namespaces are just some sort of a discriminator for elements that have the same name.
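A minimal sketch of that "discriminator" idea, using the standard library's `xml.etree.ElementTree`. The document below is hypothetical: it borrows the two namespace URIs from the GPX header above, but the pair of `name` elements and their text are invented purely to show two same-named elements being told apart.

```python
from xml.etree import ElementTree

GPX_NS = "http://www.topografix.com/GPX/1/1"
PRIVATE_NS = "http://www.topografix.com/GPX/Private/TopoGrafix/0/1"

# Hypothetical document: two elements called "name", one per namespace.
document = f"""
<gpx xmlns="{GPX_NS}" xmlns:topografix="{PRIVATE_NS}">
  <name>from the GPX namespace</name>
  <topografix:name>from the private namespace</topografix:name>
</gpx>
""".strip()

root = ElementTree.fromstring(document)

# The "{uri}localname" form selects exactly one of the two elements.
gpx_name = root.find("{%s}name" % GPX_NS)
private_name = root.find("{%s}name" % PRIVATE_NS)

print(gpx_name.text)
print(private_name.text)
```

Searching for plain `name` without a namespace would find neither element here, since both belong to a namespace.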

**NeillHog** · October 29th, 2007

Quikee is always there when you need him.

Thank you! I understand all that now. After your hints yesterday I read up all about namespaces.

My problem is that it seems I need to know that the namespace variable is "http://www.topografix.com/GPX/1/1" before I can start parsing. However, I cannot know what the value for namespace is until I have parsed the file.
Like the chicken and egg

Depending on which programme creates a GPX, the namespace variable is different so before starting reading the elevations I need to know what the value for namespace is.

In your example you have hardcoded the value but that will only work if it agrees with the XML file.

As I wrote, the namespace is in the string I see by using "print root".
But I assume that there is some clever way of getting the namespace out of the file. My experiments so far have failed.

Thanks for all your time and help

Neill

**aks44** · October 29th, 2007

Originally Posted by NeillHog

But I assume that there is some clever way of getting the namespace out of the file. My experiments so far have failed.

You should be able to parse your XML document into a DOM tree, and from there extract the namespace from the root element.

Something like:

Code:

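from xml.etree import ElementTree

tree = ElementTree.parse("test.xml")
root = tree.getroot()
# ElementTree has no getnamespace(); the URI is embedded in root.tag as "{uri}name"
namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""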

I used to do it using the Xerces parser, so I guess Python's parser can do it too. Better check the API reference.

**Quikee** · October 29th, 2007

Originally Posted by NeillHog

Depending on which programme creates a GPX, the namespace variable is different so before starting reading the elevations I need to know what the value for namespace is.

In your example you have hardcoded the value but that will only work if it agrees with the XML file.

As I wrote, the namespace is in the string I see by using "print root".
But I assume that there is some clever way of getting the namespace out of the file. My experiments so far have failed.

Thanks for all your time and help

Neill

This is weird. Usually a format defines a namespace (or several of them) for its elements, and they are always the same as long as you parse the same version of the same format. Having different namespaces for one format makes no sense: it would be as if IE, Firefox and Opera each defined their own namespace for HTML elements. Just imagine the confusion.

The ElementTree that is built into Python 2.5 handles namespaces in a very strange way. That's why I prefer lxml, which provides the same interface as the built-in ElementTree plus its own add-ons and backend. One of the "add-ons" is a namespace map (nsmap), which is a map/dictionary of all namespaces defined on the current element.

Code:

from lxml import etree as ElementTree

if __name__ == "__main__":
    tree = ElementTree.parse("gpxExampleNS.xml")
    root = tree.getroot()
    # nsmap maps prefixes to namespace URIs; the default namespace has the key None
    namespace = root.nsmap[None]
    print(root.nsmap)
    # iter() replaces the deprecated getiterator()
    trackSegments = root.iter("{%s}trkseg" % namespace)
    for trackSegment in trackSegments:
        # each child of a track segment is a track point
        for trackPoint in trackSegment:
            print(trackPoint.attrib)
            print(trackPoint.attrib['lat'])
            print(trackPoint.attrib['lon'])
            print(trackPoint.find('{%s}ele' % namespace).text)
            print(trackPoint.find('{%s}time' % namespace).text)

lxml is in the Ubuntu repository.

I don't know how to do this in normal ElementTree.
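For readers without lxml, one possible standard-library counterpart to `nsmap` is sketched below. It relies on `ElementTree.iterparse` with the `start-ns` event, which reports each namespace declaration as a `(prefix, uri)` pair. The helper name `read_namespace_map` and the in-memory sample are assumptions made for illustration; note also that the default namespace appears under the empty string here, not under `None` as in lxml.

```python
import io
from xml.etree import ElementTree


def read_namespace_map(source):
    """Return {prefix: uri} for the namespace declarations found in source.

    source may be a filename or a binary file object. The default
    namespace is stored under the empty-string prefix.
    """
    namespaces = {}
    for _event, (prefix, uri) in ElementTree.iterparse(source, events=("start-ns",)):
        # Keep the first declaration seen for each prefix (the outermost one).
        namespaces.setdefault(prefix, uri)
    return namespaces


# Hypothetical in-memory sample standing in for a GPX file on disk.
sample = io.BytesIO(
    b'<gpx xmlns="http://www.topografix.com/GPX/1/1" '
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.1">'
    b"<trk><trkseg/></trk></gpx>"
)

nsmap = read_namespace_map(sample)
namespace = nsmap.get("", "")
print(namespace)
```

This sketch reads the whole document just to collect the declarations, which is simple but wasteful for large files; stopping after the first event would be a reasonable refinement.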

**NeillHog** · October 29th, 2007

Weird it may be, but here are the xmlns attributes from two GPX files.
They are only a tiny bit different, but different enough.
xmlns="http://www.topografix.com/GPX/1/0"
xmlns="http://www.topografix.com/GPX/1/1"
I think the difference is the version, but nonetheless hardcoding isn't going to work.

Is it possible to use the first code you sent me (the one without a namespace) to parse for the xmlns part of the gpx tag? If that were possible, I would have the namespace and could use your second (namespace-aware) code to do the rest, using the xmlns value as the namespace.

Another possibility would be to extract the namespace from the root.
When I do "print root" I get
<Element {http://www.topografix.com/GPX/1/1}gpx at b7d506ec>
This contains the namespace that I am looking for, but it is not a string and will not let me do any string operations on it.

One of these solutions would be ideal because they use only standard python.

Sorry about all these questions but this is slowly sending me mad. Once I have the values the rest will be easy (famous last words!)

Thanks
Neill
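One way the "extract it from the root" idea might look with only the standard library is sketched below. What `print root` shows is the element's representation, but the element's `tag` attribute is an ordinary string of the form `{uri}name`, so normal string operations apply to it. The helper names `namespace_of` and `elevations`, and the two tiny sample documents with their coordinates and elevations, are invented for illustration.

```python
from xml.etree import ElementTree


def namespace_of(element):
    """Return the namespace URI from a "{uri}name" tag, or "" if there is none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].partition("}")[0]
    return ""


def elevations(gpx_text):
    """Collect the text of every <ele> element, whatever the GPX version."""
    root = ElementTree.fromstring(gpx_text)
    namespace = namespace_of(root)
    prefix = "{%s}" % namespace if namespace else ""
    return [ele.text for ele in root.iter(prefix + "ele")]


# Hypothetical minimal documents, one per GPX version mentioned above.
TEMPLATE = (
    '<gpx xmlns="http://www.topografix.com/GPX/%s">'
    '<trk><trkseg><trkpt lat="47.5" lon="10.5"><ele>%s</ele></trkpt></trkseg></trk>'
    "</gpx>"
)

print(elevations(TEMPLATE % ("1/0", "812.0")))
print(elevations(TEMPLATE % ("1/1", "934.5")))
```

Because the namespace is read from the file itself, the same function should cope with both the 1/0 and the 1/1 URI without any hardcoded value.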
