Maintaining monolithic regex expressions
[Cross posted on my regex blog - http://blogs.regexadvice.com/dneimke/archive/2004/03/31/877.aspx - please leave any comments there]
I think it speaks volumes about the readability and maintainability of a regex several hundred characters long that it received 3 comments within 8 hours pointing out errors and improvements. Picking up where I left off yesterday, I'd like to show how to make the expression more maintainable. Yesterday I showed the BNF for a DateLiteral in VB and then expanded it into a monolithic pattern ( The pattern on RegexLib.com ):
DateLiteral ::= # [ Whitespace+ ] DateOrTime [ Whitespace+ ] #
DateOrTime ::= DateValue Whitespace+ TimeValue | DateValue | TimeValue
DateValue ::= MonthValue / DayValue / YearValue | MonthValue - DayValue - YearValue
TimeValue ::= HourValue : MinuteValue [ : SecondValue ] [ Whitespace+ ] [ AMPM ]
MonthValue ::= IntLiteral
DayValue ::= IntLiteral
YearValue ::= IntLiteral
HourValue ::= IntLiteral
MinuteValue ::= IntLiteral
SecondValue ::= IntLiteral
AMPM ::= AM | PM
A better approach would be to build sub-expressions and to roll them up into a complete production in exactly the same manner as the grammar does it. Start with the Leaf nodes:
Private re_Whitespace As String = "\s"

' Patterns for the components of TimeValue.
' The ':' separator sits inside the minute and second groups, in front of the digits.
Private re_HourValue As String = "(?'HourValue'2[0-3]|[01]?\d) (?# Hour 00 - 23 )"
Private re_MinuteValue As String = "(?'MinuteValue':[0-5]?\d) (?# Minute :00 - :59 )"
Private re_SecondValue As String = "(?'SecondValue':[0-5]?\d) (?# Second :00 - :59 )"
Private re_AMPM As String = "(?'AMPM'[AP]M)"

' Patterns for the components of DateValue
Private re_MonthValue As String = "(?'Month'0?[1-9]|1[0-2]) (?# Month 01 - 12 )"
Private re_DayValue As String = "(?'Day'0?[1-9]|[12]\d|3[01]) (?# Day 01 - 31 )"
Private re_YearValue As String = "(?'Year'\d{4})"
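Because each leaf is just a string, it can be exercised on its own before anything is rolled up. Here is a minimal sketch of that idea; the MatchesWhole helper and the module name are hypothetical (they are not part of the original pattern), and the month fragment is an assumed version written along the lines of re_MonthValue:

```vb
Imports System
Imports System.Text.RegularExpressions

Module LeafPatternChecks

    ' Hypothetical helper: wraps a fragment in ^(?:...)$ so that the
    ' fragment has to account for the whole input, not just part of it.
    Function MatchesWhole(fragment As String, text As String) As Boolean
        Dim anchored As String = "^(?:" & fragment & ")$"
        Return Regex.IsMatch(text, anchored, RegexOptions.IgnorePatternWhitespace)
    End Function

    Sub Main()
        ' Assumed month fragment, modelled on re_MonthValue.
        Dim monthValue As String = "(?'Month'0?[1-9]|1[0-2]) (?# Month 01 - 12 )"

        ' "1", "09" and "12" should be accepted; "13" and "00" should be rejected.
        For Each candidate As String In New String() {"1", "09", "12", "13", "00"}
            Console.WriteLine("{0,-3} -> {1}", candidate, MatchesWhole(monthValue, candidate))
        Next
    End Sub

End Module
```

Checking a fragment this way tends to surface alternation and range slips while the pattern is still a few dozen characters long.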
Now that I've emulated the BNF for the leaf nodes, I can start rolling up into the next 2 productions ( TimeValue and DateValue ):
' TimeValue ::= HourValue : MinuteValue [ : SecondValue ] [ Whitespace+ ] [ AMPM ]
' The optional seconds come before the optional whitespace, as in the grammar.
' SecondValue is wrapped in (?:...) so the "?" applies to the whole fragment,
' including its trailing comment.
Private re_TimeValue As String = _
    "(?'TimeValue'" & _
    re_HourValue & _
    re_MinuteValue & _
    "(?:" & re_SecondValue & ")?" & _
    re_Whitespace & "*" & _
    re_AMPM & "?" & _
    ")"

' DateValue ::= MonthValue / DayValue / YearValue | MonthValue - DayValue - YearValue
Private re_DateValue As String = _
    "(?'DateValue'" & _
    re_MonthValue & _
    " (?'Sep'[-/]) (?# Date separator '-' or '/' )" & _
    re_DayValue & _
    " \k'Sep' (?# whatever date separator was previously matched )" & _
    re_YearValue & _
    ")"
That was painless and didn't add much literal text to the overall production. Maintaining either TimeValue or DateValue is still relatively easy because you can quickly tweak the leaf node expressions. I'll now roll up again into the DateOrTime production:
' DateOrTime ::= DateValue Whitespace+ TimeValue | DateValue | TimeValue
Private re_DateOrTime As String = _
    "(?'DateOrTime'" & _
    re_DateValue & re_Whitespace & "+" & re_TimeValue & "|" & _
    re_DateValue & "|" & _
    re_TimeValue & _
    ")"
Again, this has added almost zero extra pattern to maintain; just the quantifiers and the alternation operator to join the sub-expressions (nonterminals). One last roll-up and we're done:
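' DateLiteral ::= # [ Whitespace+ ] DateOrTime [ Whitespace+ ] #
' The # delimiters are escaped because an unescaped # starts a comment
' when the pattern is used with RegexOptions.IgnorePatternWhitespace.
' The optional whitespace uses "*" to match the grammar's [ Whitespace+ ].
Private re_DateLiteral As String = _
    "\#" & re_Whitespace & "*" & re_DateOrTime & re_Whitespace & "*\#"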
The pattern is now ready for use:
' Requires: Imports System.Text.RegularExpressions
Dim test As String = "#1:45:39 PM#"
Dim re As New Regex(re_DateLiteral, RegexOptions.IgnorePatternWhitespace)
MessageBox.Show(re.Match(test).Success.ToString())
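One detail worth keeping in mind with RegexOptions.IgnorePatternWhitespace: as documented for .NET, the option also enables comments introduced by an unescaped #, which run to the end of the line. A literal # in the pattern therefore has to be written as \# or [#]. The small sketch below illustrates the difference; the module name and the sample patterns are made up for the demonstration and are not taken from the DateLiteral pattern:

```vb
Imports System
Imports System.Text.RegularExpressions

Module HashCommentDemo

    Sub Main()
        Dim opts As RegexOptions = RegexOptions.IgnorePatternWhitespace

        ' Unescaped: everything from the first # onward is treated as a comment,
        ' so the pattern is effectively empty and succeeds on any input.
        Dim unescaped As New Regex("#\d+#", opts)

        ' Escaped: the # characters are literals, so digits between them are required.
        Dim escaped As New Regex("^\#\d+\#$", opts)

        Console.WriteLine(unescaped.IsMatch("no digits here"))  ' True (empty pattern)
        Console.WriteLine(escaped.IsMatch("no digits here"))    ' False
        Console.WriteLine(escaped.IsMatch("#42#"))              ' True
    End Sub

End Module
```

A pattern that silently matches everything is easy to mistake for a working one, so it pays to test at least one input that should fail.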
So, although the original pattern looked quite large, the techniques I've shown here make large patterns much simpler to maintain.