How to set the precedence of one match above another in a regular expression
I have log files from which I need to extract the key and value pairs of each logged line, which uses the following format.
[2018-11-19T13:04:33.031+01:00] Bedrijfsdocument="BD-023005 Document" Richting="Uitgaand" Status="verzonden"; Zaaknummer="2323343333"; MessageID="ef5c6e9e-849e-4d80-af86-92fc127e7178"; ConversationID="5571c03e-62a8-4fce-81ff-9fe31b7b276c"; RefToMessageId="34333139343034303934303135343731"; MMDBestand="2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd"; Bericht="<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:DocumentBericht BDVersie="2.1" BDNaam="TA-022305" xmlns:ns2="com.my.test/berichten/document/2" xmlns="com.my.test/header/1"><Header><ID>58b5708f-4115-462c-93f3-5fb5134c9e25</ID><VerzendendePartijen><VerzendendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000034000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></VerzendendePartij></VerzendendePartijen><OntvangendePartijen><OntvangendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000076000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></OntvangendePartij></OntvangendePartijen><Datum>2018-11-19</Datum><Tijd>13:04:32.952+01:00</Tijd><SchemaVersieID>1.1</SchemaVersieID></Header><ns2:Zaak><ns2:Identificatie>2100008418</ns2:Identificatie></ns2:Zaak><ns2:DocumentAggregatieniveaus><ns2:DocumentAggregatieniveau><ns2:Classificatie><ns2:DocumentSoort>098</ns2:DocumentSoort></ns2:Classificatie><ns2:Identificatiekenmerk>DOC006256</ns2:Identificatiekenmerk><ns2:Foldernaam>02 - Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken 
rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>"; Omvang="3512"
This is currently parsed using the following regular expression to extract the name and value pairs.
(?:^\[.*\])?(?:[\s]+)(?<key>[^=]+)(?:={1}"{1})(?<value>[^"]+)(?:["]{1})
It works fine, except for the key 'Bericht', whose value contains XML with unescaped quotes. The content of this key is a given for me, so I have to handle it in my code that parses the log line. So I am looking for a way to define the end of a parameter value as either " or >", where >" should take precedence over ".
I use the following test code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        Pattern keyValuePairsPattern = Pattern.compile("(?:^\\[.*\\])?(?:[\\s]+)(?<key>[^=]+)(?:={1}\"{1})(?<value>[^\"]+)(?:[\"]{1})");
        String logentry = "[2018-11-19T13:04:33.031+01:00] Bedrijfsdocument=\"BD-023005 Document\" Richting=\"Uitgaand\" Status=\"verzonden\"; Zaaknummer=\"2323343333\"; MessageID=\"ef5c6e9e-849e-4d80-af86-92fc127e7178\"; ConversationID=\"5571c03e-62a8-4fce-81ff-9fe31b7b276c\"; RefToMessageId=\"34333139343034303934303135343731\"; MMDBestand=\"2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd\"; Bericht=\"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><ns2:DocumentBericht BDVersie=\"2.1\" BDNaam=\"TA-022305\" xmlns:ns2=\"com.my.test/berichten/document/2\" xmlns=\"com.my.test/header/1\"><Header><ID>58b5708f-4115-462c-93f3-5fb5134c9e25</ID><VerzendendePartijen><VerzendendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000034000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></VerzendendePartij></VerzendendePartijen><OntvangendePartijen><OntvangendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000076000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></OntvangendePartij></OntvangendePartijen><Datum>2018-11-19</Datum><Tijd>13:04:32.952+01:00</Tijd><SchemaVersieID>1.1</SchemaVersieID></Header><ns2:Zaak><ns2:Identificatie>2100008418</ns2:Identificatie></ns2:Zaak><ns2:DocumentAggregatieniveaus><ns2:DocumentAggregatieniveau><ns2:Classificatie><ns2:DocumentSoort>098</ns2:DocumentSoort></ns2:Classificatie><ns2:Identificatiekenmerk>DOC006256</ns2:Identificatiekenmerk><ns2:Foldernaam>02 - Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>\"; Omvang=\"3512\"";
        // Extract key/value pairs
        Matcher paramMatcher = keyValuePairsPattern.matcher(logentry);
        while (paramMatcher.find()) {
            System.out.println(paramMatcher.group("key") + "<=>" + paramMatcher.group("value"));
        }
    }
}
Which gives the result
Bedrijfsdocument<=>BD-023005 Document
Richting<=>Uitgaand
Status<=>verzonden
Zaaknummer<=>2323343333
MessageID<=>ef5c6e9e-849e-4d80-af86-92fc127e7178
ConversationID<=>5571c03e-62a8-4fce-81ff-9fe31b7b276c
RefToMessageId<=>34333139343034303934303135343731
MMDBestand<=>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd
Bericht<=><?xml version=
encoding<=>UTF-8
standalone<=>yes
BDVersie<=>2.1
BDNaam<=>TA-022305
xmlns:ns2<=>com.my.test/berichten/document/2
xmlns<=>com.my.test/header/1
- Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>"; Omvang<=>3512
The desired result would be
Bedrijfsdocument<=>BD-023005 Document
Richting<=>Uitgaand
Status<=>verzonden
Zaaknummer<=>2323343333
MessageID<=>ef5c6e9e-849e-4d80-af86-92fc127e7178
ConversationID<=>5571c03e-62a8-4fce-81ff-9fe31b7b276c
RefToMessageId<=>34333139343034303934303135343731
MMDBestand<=>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd
Bericht<=><?xml version="1.0" encoding="UTF-8" standalone="yes" .....
Omvang<=>3512
I tried to add another non-capturing group with an optional ">" before the closing quote of the parameter value, but this does not resolve the issue.
(?:^\[.*\])?(?:[\s]+)(?<key>[^=]+)(?:={1}"{1})(?<value>[^"]+)(?:>?)(?:["]{1})
What I probably need is an expression that defines the end of a value as either " or >", where >" takes precedence over ".
Any help is appreciated.
regex key-value
Do not use {1} in any of your regexes; it does nothing, as it is the default behavior: each pattern is matched once if not quantified in any other way. See regex tag info.
– Wiktor Stribiżew
Nov 21 '18 at 7:52
Thanks for making the regex better, but it does not answer my question though.
– pcvnes
Nov 21 '18 at 7:54
Note I suggested leaving out (?:^\[.*\])? from the pattern because the timestamp is not likely to contain = and will not make any difference. If you ever need to put it back, use (?:^\[[^\]]*\])? instead.
– Wiktor Stribiżew
Nov 21 '18 at 8:24
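One way to read the last comment is that a greedy .* between the brackets can run past the timestamp to the last ] on the line, while a negated character class stops at the first one. The sketch below illustrates that reading; the class name, the helper prefix and the sample line are made up for illustration and do not come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketPrefixDemo {
    // Returns the text the regex matches at the start of the line, or null if it does not match there.
    public static String prefix(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical line that contains a second ']' inside a value.
        String line = "[12:00] Key=\"a]b\"";
        // Greedy: .* backtracks only as far as the LAST ']' on the line.
        System.out.println(prefix("\\[.*\\]", line));
        // Negated class: stops at the FIRST ']'.
        System.out.println(prefix("\\[[^\\]]*\\]", line));
    }
}
```

With the greedy form the prefix swallows part of the first key/value pair, which is why the negated class is the safer choice if the prefix group is kept.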
asked Nov 21 '18 at 7:49 by pcvnes
1 Answer
You may use
\s(?<key>[^=\s]+)="(?<value>(?:<[^<>]*>|[^"])*)"
In Java:
String pat = "\\s(?<key>[^=\\s]+)=\"(?<value>(?:<[^<>]*>|[^\"])*)\"";
See the regex demo. The first \s can even be omitted, but it makes the pattern more efficient.
Details
\s - whitespace
(?<key>[^=\s]+) - Group "key": 1 or more chars other than whitespace and =
=" - a literal text
(?<value>(?:<[^<>]*>|[^"])*) - Group "value": zero or more repetitions of either any substring between < and > with no < or > inside (<[^<>]*>) or (|) any char other than a double quotation mark ([^"])
" - a double quote
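As a minimal sketch of how the pattern above could be used, the following program collects the pairs of one line into a map. The class name LogLineParser, the method parse and the shortened sample line are assumptions made for this illustration; only the regex itself comes from the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Pattern from the answer: a tag-like chunk <...> is tried before a single non-quote char.
    private static final Pattern KEY_VALUE = Pattern.compile(
            "\\s(?<key>[^=\\s]+)=\"(?<value>(?:<[^<>]*>|[^\"])*)\"");

    // Returns the key/value pairs in the order they appear in the line.
    public static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = KEY_VALUE.matcher(line);
        while (m.find()) {
            pairs.put(m.group("key"), m.group("value"));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Shortened, made-up variant of the log line from the question.
        String line = "[2018-11-19T13:04:33.031+01:00] Status=\"verzonden\"; "
                + "Bericht=\"<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Header a=\"b\"><ID>42</ID></Header>\"; Omvang=\"3512\"";
        parse(line).forEach((k, v) -> System.out.println(k + "<=>" + v));
    }
}
```

Note that the approach relies on unescaped quotes only ever appearing inside <...> chunks, as they do in the sample log line; a quote in the text between tags would still end the value early.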
Took me some time to understand the <[^<>]*> part, but works like a charm. Learned something again, thanks!
– pcvnes
Nov 21 '18 at 8:23
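The "precedence" asked about in the question comes from the order of the alternatives: a backtracking engine such as java.util.regex tries them left to right, so putting the tag alternative first lets it consume the quotes inside a tag before the single-character alternative sees them. The sketch below compares both orders on a tiny made-up input; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Returns the first "value" group the regex finds in the input, or null if there is no match.
    public static String firstValue(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        // Made-up input: a value holding one tag with a quoted attribute.
        String input = "k=\"<a q=\"1\">t</a>\"";
        // Tag alternative first, as in the answer.
        String tagFirst = "=\"(?<value>(?:<[^<>]*>|[^\"])*)\"";
        // Same alternatives, swapped: the single-char branch wins at every '<'.
        String charFirst = "=\"(?<value>(?:[^\"]|<[^<>]*>)*)\"";
        System.out.println(firstValue(tagFirst, input));
        System.out.println(firstValue(charFirst, input));
    }
}
```

With the tag alternative first the whole value is captured; with the order swapped the value stops at the first quote inside the tag, which is the behavior described in the question.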
add a comment |
answered Nov 21 '18 at 8:09
Wiktor Stribiżew
Do not use {1} in any of your regexes; it does nothing, as it is the default behavior: each pattern is matched once if not quantified in any other way. See regex tag info.
– Wiktor Stribiżew
Nov 21 '18 at 7:52