How to define precedence of one match over another in a regular expression

I have log files from which I need to extract the key/value pairs of each logged line. The lines use the following format.



[2018-11-19T13:04:33.031+01:00]  Bedrijfsdocument="BD-023005 Document" Richting="Uitgaand" Status="verzonden"; Zaaknummer="2323343333"; MessageID="ef5c6e9e-849e-4d80-af86-92fc127e7178"; ConversationID="5571c03e-62a8-4fce-81ff-9fe31b7b276c"; RefToMessageId="34333139343034303934303135343731"; MMDBestand="2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd"; Bericht="<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:DocumentBericht BDVersie="2.1" BDNaam="TA-022305" xmlns:ns2="com.my.test/berichten/document/2" xmlns="com.my.test/header/1"><Header><ID>58b5708f-4115-462c-93f3-5fb5134c9e25</ID><VerzendendePartijen><VerzendendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000034000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></VerzendendePartij></VerzendendePartijen><OntvangendePartijen><OntvangendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000076000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></OntvangendePartij></OntvangendePartijen><Datum>2018-11-19</Datum><Tijd>13:04:32.952+01:00</Tijd><SchemaVersieID>1.1</SchemaVersieID></Header><ns2:Zaak><ns2:Identificatie>2100008418</ns2:Identificatie></ns2:Zaak><ns2:DocumentAggregatieniveaus><ns2:DocumentAggregatieniveau><ns2:Classificatie><ns2:DocumentSoort>098</ns2:DocumentSoort></ns2:Classificatie><ns2:Identificatiekenmerk>DOC006256</ns2:Identificatiekenmerk><ns2:Foldernaam>02 - Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken 
rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>"; Omvang="3512"


This is currently parsed using the following regular expression to extract the name and value pairs.



(?:^\[.*\])?(?:[\s]+)(?<key>[^=]+)(?:={1}"{1})(?<value>[^"]+)(?:["]{1})


It works fine, except for the key 'Bericht', whose value contains XML with unescaped quotes. The content of this key is a given for me, so I have to handle it in the code that parses the log line. I am therefore looking for a way to define the end of a parameter value as either " or >", where >" should take precedence over ".



I use the following test code



import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {

    public static void main(String[] args) {

        // Backslashes and double quotes must be escaped inside a Java string literal.
        Pattern keyValuePairsPattern = Pattern.compile("(?:^\\[.*\\])?(?:[\\s]+)(?<key>[^=]+)(?:={1}\"{1})(?<value>[^\"]+)(?:[\"]{1})");
        String logentry = "[2018-11-19T13:04:33.031+01:00] Bedrijfsdocument=\"BD-023005 Document\" Richting=\"Uitgaand\" Status=\"verzonden\"; Zaaknummer=\"2323343333\"; MessageID=\"ef5c6e9e-849e-4d80-af86-92fc127e7178\"; ConversationID=\"5571c03e-62a8-4fce-81ff-9fe31b7b276c\"; RefToMessageId=\"34333139343034303934303135343731\"; MMDBestand=\"2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd\"; Bericht=\"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><ns2:DocumentBericht BDVersie=\"2.1\" BDNaam=\"TA-022305\" xmlns:ns2=\"com.my.test/berichten/document/2\" xmlns=\"com.my.test/header/1\"><Header><ID>58b5708f-4115-462c-93f3-5fb5134c9e25</ID><VerzendendePartijen><VerzendendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000034000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></VerzendendePartij></VerzendendePartijen><OntvangendePartijen><OntvangendePartij><Volgnummer>1</Volgnummer><RegistratieveRelatiePartij><Identificatie>00000004000000076000_OTA</Identificatie><SoortRegistratie>15</SoortRegistratie></RegistratieveRelatiePartij></OntvangendePartij></OntvangendePartijen><Datum>2018-11-19</Datum><Tijd>13:04:32.952+01:00</Tijd><SchemaVersieID>1.1</SchemaVersieID></Header><ns2:Zaak><ns2:Identificatie>2100008418</ns2:Identificatie></ns2:Zaak><ns2:DocumentAggregatieniveaus><ns2:DocumentAggregatieniveau><ns2:Classificatie><ns2:DocumentSoort>098</ns2:DocumentSoort></ns2:Classificatie><ns2:Identificatiekenmerk>DOC006256</ns2:Identificatiekenmerk><ns2:Foldernaam>02 - Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>\"; Omvang=\"3512\"";

        // Extract key/value pairs
        Matcher paramMatcher = keyValuePairsPattern.matcher(logentry);
        while (paramMatcher.find()) {
            System.out.println(paramMatcher.group("key") + "<=>" + paramMatcher.group("value"));
        }
    }

}



Which gives the result



Bedrijfsdocument<=>BD-023005 Document
Richting<=>Uitgaand
Status<=>verzonden
Zaaknummer<=>2323343333
MessageID<=>ef5c6e9e-849e-4d80-af86-92fc127e7178
ConversationID<=>5571c03e-62a8-4fce-81ff-9fe31b7b276c
RefToMessageId<=>34333139343034303934303135343731
MMDBestand<=>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd
Bericht<=><?xml version=
encoding<=>UTF-8
standalone<=>yes
BDVersie<=>2.1
BDNaam<=>TA-022305
xmlns:ns2<=>com.my.test/berichten/document/2
xmlns<=>com.my.test/header/1
- Correspondentie</ns2:Foldernaam><ns2:Revisie>1</ns2:Revisie><ns2:IndicatieGewijzigdeMetadata>0</ns2:IndicatieGewijzigdeMetadata><ns2:Naam>02 - Toezenden stukken rm</ns2:Naam><ns2:Bijlage><ns2:MimeContent><ns2:MimeContentType>application/pdf</ns2:MimeContentType><ns2:MimeContentId>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_2.pdf@my.com</ns2:MimeContentId></ns2:MimeContent></ns2:Bijlage></ns2:DocumentAggregatieniveau></ns2:DocumentAggregatieniveaus></ns2:DocumentBericht>"; Omvang<=>3512


The desired result would be



Bedrijfsdocument<=>BD-023005 Document
Richting<=>Uitgaand
Status<=>verzonden
Zaaknummer<=>2323343333
MessageID<=>ef5c6e9e-849e-4d80-af86-92fc127e7178
ConversationID<=>5571c03e-62a8-4fce-81ff-9fe31b7b276c
RefToMessageId<=>34333139343034303934303135343731
MMDBestand<=>2018-11-19_9bf1caf8-ca3d-43ae-b046-fa44142faa36_0_MMD.mmd
Bericht<=><?xml version="1.0" encoding="UTF-8" standalone="yes" .....
Omvang<=>3512


I tried to add another non-capturing group with an optional ">" before the closing quote of the parameter value, but this does not resolve the issue.



(?:^\[.*\])?(?:[\s]+)(?<key>[^=]+)(?:={1}"{1})(?<value>[^"]+)(?:>?)(?:["]{1})


What I probably need is an expression that defines the end of a value as either " or >", where >" takes precedence over ".



Any help is appreciated.










  • Do not use {1} in any of your regexes, it does nothing as it is the default behavior: each pattern is matched once if not quantified in any other way. See regex tag info.

    – Wiktor Stribiżew
    Nov 21 '18 at 7:52
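
The point about {1} can be checked with a small sketch. The class name QuantifierOneDemo, the helper matchStarts and the shortened input string are made up for this illustration; only the two regex fragments come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierOneDemo {

    // Collects the start offset of every match of regex in input.
    public static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input in the style of the log line.
        String input = "Status=\"verzonden\"; Omvang=\"3512\"";
        // Fragment from the question, with the explicit {1} quantifiers.
        System.out.println(matchStarts("={1}\"{1}", input));
        // Same fragment without quantifiers.
        System.out.println(matchStarts("=\"", input));
    }
}
```

Both lines should print the same list of offsets, since an unquantified token is already matched exactly once.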













  • Thanks for making the regex better, but it does not answer my question though.

    – pcvnes
    Nov 21 '18 at 7:54











  • Note I suggested leaving out (?:^\[.*\])? from the pattern because the timestamp is not likely to contain =, so removing it will not make any difference. If you ever need to put it back, use (?:^\[[^\]]*\])? instead.

    – Wiktor Stribiżew
    Nov 21 '18 at 8:24
















regex key-value






asked Nov 21 '18 at 7:49









pcvnes

1 Answer














You may use

\s(?<key>[^=\s]+)="(?<value>(?:<[^<>]*>|[^"])*)"

In Java:

String pat = "\\s(?<key>[^=\\s]+)=\"(?<value>(?:<[^<>]*>|[^\"])*)\"";

See the regex demo. The first \s can even be omitted, but it makes the pattern more efficient.

Details

  • \s - a whitespace character

  • (?<key>[^=\s]+) - Group "key": one or more chars other than whitespace and =

  • =" - literal text

  • (?<value>(?:<[^<>]*>|[^"])*) - Group "value": zero or more repetitions of either a substring between < and > with no < or > inside (<[^<>]*>) or (|) any char other than a double quotation mark ([^"]). The tag alternative is listed first, so it is tried first, and double quotes inside a tag do not end the value.

  • " - a double quote
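As a rough illustration of how the pattern might be applied, the sketch below collects every key/value pair from one log line into a map. The class name KeyValueExtractor, the method extract, and the shortened sample line are assumptions made for this example; they are not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and structure are illustrative only.
public class KeyValueExtractor {

    // The pattern from the answer, written as a Java string literal.
    private static final Pattern PAIR = Pattern.compile(
            "\\s(?<key>[^=\\s]+)=\"(?<value>(?:<[^<>]*>|[^\"])*)\"");

    // Returns the pairs in the order in which they appear in the line.
    public static Map<String, String> extract(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group("key"), m.group("value"));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Shortened, made-up variant of the log line from the question.
        String line = "[2018-11-19T13:04:33.031+01:00]  Richting=\"Uitgaand\" Status=\"verzonden\"; "
                + "Bericht=\"<?xml version=\"1.0\"?><a b=\"c\">text</a>\"; Omvang=\"3512\"";
        extract(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A LinkedHashMap is used here only so that the pairs keep their order from the log line; if a key can occur more than once per line, a list of pairs would be a safer choice.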






answered Nov 21 '18 at 8:09

Wiktor Stribiżew








  • Took me some time to understand the <[^<>]*> part, but it works like a charm. Learned something again, thanks!

    – pcvnes
    Nov 21 '18 at 8:23
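The role of the <[^<>]*> part may be easier to see when the two alternatives are swapped. Java's regex engine tries alternatives from left to right, so whichever matching alternative comes first wins. The sketch below compares the answer's order with the reversed order; the class name AlternationOrder, the helper firstValue, and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
public class AlternationOrder {

    // Answer's order: a whole <...> tag is tried before a single non-quote char.
    public static final Pattern TAG_FIRST = Pattern.compile(
            "\\s(?<key>[^=\\s]+)=\"(?<value>(?:<[^<>]*>|[^\"])*)\"");

    // Reversed order, for comparison only: [^"] is tried first, so "<" is
    // consumed as an ordinary char and the tag alternative never gets a chance.
    public static final Pattern CHAR_FIRST = Pattern.compile(
            "\\s(?<key>[^=\\s]+)=\"(?<value>(?:[^\"]|<[^<>]*>)*)\"");

    // Returns the "value" group of the first match, or null if nothing matches.
    public static String firstValue(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        String input = " Bericht=\"<a b=\"c\">text</a>\"";
        System.out.println(firstValue(TAG_FIRST, input));
        System.out.println(firstValue(CHAR_FIRST, input));
    }
}
```

With this input, the first pattern captures <a b="c">text</a>, while the reversed one stops at the first quote inside the tag and captures only <a b=.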













