


Questions about XML processing?


Thank you Terry, Dan and Dieter for encouraging me to post here. I have 
already solved the problem, albeit with a not-so-efficient solution. 
Perhaps it is useful to present it here anyway, in case someone can shed 
more light on it.

My job is to parse a complicated XML document (ISO metadata) and pick up 
the values of certain fields under certain conditions. For the most part 
this goes well. I am working with xml.etree.ElementTree, which has proved 
sufficient for nearly everything else in the project. JSON is not an 
option within this project.

The specific trouble was in this section, itself the child of a more 
complicated parent: (for simplicity tags are renamed and namespaces removed)

    <tagA>
      <tagB>
        <tagC>
          <string>Something</string>
        </tagC>
        <tagC>
          <string>Something else</string>
        </tagC>
        <tagC>
          <note>
            <title>
              <string>value</string>
            </title>
            <date0>
              <date1>
                <date2>
                  <gco:Date>2020-11-06</gco:Date>
                </date2>
                <dateType>
                  <code blah lots of strange things blah />
                </dateType>
              </date1>
            </date0>
          </note>
        </tagC>
      </tagB>
    </tagA>

Basically, I have to get what is in tagC/string but only if the value of 
tagC/note/title/string is "value". As you see, there are several tagC, 
all children of tagB, but tagC can have different meanings(!). And no, I 
have no control over how these XML fields are constructed.

In principle it is easy to make a "findall" and get strings for tagC, using:

elem.findall("./tagA/tagB/tagC/string")

and then get the content and append in case there is more than one 
tagC/string like: "Something, Something else".
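As a minimal sketch of that unconditional step: the sample document below is a cut-down stand-in for the real metadata (the <root> wrapper and the names SAMPLE, texts and joined are made up for illustration, and the date section is left out).

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the real metadata; <root> is a hypothetical wrapper.
SAMPLE = """
<root>
  <tagA>
    <tagB>
      <tagC><string>Something</string></tagC>
      <tagC><string>Something else</string></tagC>
      <tagC><note><title><string>value</string></title></note></tagC>
    </tagB>
  </tagA>
</root>
"""

elem = ET.fromstring(SAMPLE)

# Only <string> elements that are direct children of a tagC match this path,
# so the <string> nested under note/title is not picked up.
texts = [s.text for s in elem.findall("./tagA/tagB/tagC/string")]
joined = ", ".join(texts)
print(joined)
```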

However, the hard thing to do here is to get those only when 
tagC/note/title/string='value'. I was expecting to find a way of 
specifying a certain construction in square brackets, like 
[@string='value'] or [@/tagC/note/title/string='value'], as is usual in 
XML and possible in xml.etree. However this proved difficult (at least 
for me). So this is the "brute" solution I implemented:

- find all children of tagA/tagB
- check if /tagA/tagB/tagC/note/title/string has "value"
- if yes find all tagA/tagB/tagC/string

In quasi-Python:

strings = []
# Visit each tagB; the paths below are relative to the current tagB.
for tag_b in elem.findall("./tagA/tagB"):
    title = tag_b.find("./tagC/note/title/string")
    # find() returns None when no tagC carries a note title.
    if title is not None and title.text == 'value':
        for item in tag_b.findall("./tagC/string"):
            strings.append(item.text)


Crude, but works. As I wrote above, I was wishing that a bracketed 
clause of the type [@ ...] already in the first "findall" would do a 
more efficient job but alas my knowledge of xml is too rudimentary. 
Perhaps something to tinker on in the coming weeks.
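For what it is worth, ElementTree's limited XPath support does include a predicate of the form [tag='text'], which matches elements that have a child named tag with exactly that text. It has no "@" in it, since "@" addresses attributes. The following is only a sketch under assumptions: the helper name strings_for_title, the <root> wrapper and the sample document are invented for illustration. The predicate cannot contain a longer path such as note/title/string, so it is attached to <title> and the loop over tagB remains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two tagB blocks; only the first has title "value".
SAMPLE = """
<root>
  <tagA>
    <tagB>
      <tagC><string>Something</string></tagC>
      <tagC><string>Something else</string></tagC>
      <tagC><note><title><string>value</string></title></note></tagC>
    </tagB>
    <tagB>
      <tagC><string>Ignored</string></tagC>
      <tagC><note><title><string>other</string></title></note></tagC>
    </tagB>
  </tagA>
</root>
"""


def strings_for_title(elem, wanted):
    """Collect tagC/string texts from each tagB whose note title equals wanted.

    Assumes wanted contains no single quote, since it is pasted into the path.
    """
    found = []
    for tag_b in elem.findall("./tagA/tagB"):
        # [string='...'] keeps only a <title> with a matching <string> child.
        if tag_b.find(f"./tagC/note/title[string='{wanted}']") is not None:
            found.extend(s.text for s in tag_b.findall("./tagC/string"))
    return found


root = ET.fromstring(SAMPLE)
print(strings_for_title(root, "value"))
```

This is not necessarily faster than the loop above; it mainly moves the text comparison into the path expression.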

Have a nice weekend!





On 2020-11-06 20:10, Terry Reedy wrote:
> On 11/6/2020 11:17 AM, Hernán De Angelis wrote:
>> I am confronting some XML parsing challenges and would like to ask 
>> some questions to more knowledgeable Python users. Apparently there 
>> exists a group for such questions but that list (xml-sig) has 
>> apparently not received (or archived) posts since May 2018(!). I 
>> wonder if there are other list or forum for Python XML questions, or 
>> if this list would be fine for that.
>
> If you don't hear otherwise, try here. Or try stackoverflow.com and 
> tag questions with python and xml.
>
>