                Pitfalls in UK postcode validation

                         Lukas Mai (mauke)

                       The Perl Conference
                       Glasgow, 2018-08-17

================================================================================

    - we want to validate the format of postal codes
    - including international addresses
    - trivial for most countries:
        e.g. five digits ([0-9]{5}) for Germany
    - unexpectedly difficult: UK
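
    For comparison, the German case can be sketched in a few lines. The
    helper name is made up for illustration, and the check covers the format
    only (five ASCII digits), not whether the code is actually assigned.

```python
import re

# Format check only: exactly five ASCII digits, nothing before or after.
GERMAN_POSTCODE = re.compile(r"[0-9]{5}")

def is_german_postcode_format(text):
    """Hypothetical helper: True if text is exactly five digits."""
    return GERMAN_POSTCODE.fullmatch(text) is not None

print(is_german_postcode_format("10115"))   # True
print(is_german_postcode_format("1011"))    # False
```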

================================================================================

    First stop: Wikipedia
    https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation

    - many possible variants
    - many rules and restrictions
    - at the end: a regex!

================================================================================

    Wikipedia:

    The UK government has also provided the following regular expression that
    can be used for the purpose of validation:[27]

^(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$


    27. "BULK DATA TRANSFER: ADDITIONAL VALIDATION FOR CAS UPLOAD" (PDF)

================================================================================

    https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/488478/Bulk_Data_Transfer_-_additional_validation_valid_from_12_November_2015.pdf

3.1 Expression

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$

3.2 Logic

    "GIR 0AA"
OR
    One letter followed by either one or two numbers
OR
    One letter followed by a second letter that must be one of ABCDEFGHJ
    KLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two 
    numbers
OR
    One letter followed by one number and then another letter
OR
    A two part post code
        where the first part must be
        One letter followed by a second letter that must be one of ABCDEFGH
        JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and 
        optionally a further letter after that
    AND
        The second part (separated by a space from the first part) must be One 
        number followed by two letters.
        
A combination of upper and lower case characters is allowed.

Note: the length is determined by the regular expression and is between 2 and 8
characters.

================================================================================

        ^
        ( [Gg][Ii][Rr][ ]0[Aa]{2} )
    |
        (
            (
                ( [A-Za-z][0-9]{1,2} )
            |
                (
                    ( [A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2} )
                |
                    (
                        ( [A-Za-z][0-9][A-Za-z] )
                    |
                        ( [A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z] )
                    )
                )
            )
            [ ][0-9][A-Za-z]{2}
        )
        $

================================================================================

        ^
        ( GIR[ ]0A{2} )
    |
        (
            (
                ( [A-Z][0-9]{1,2} )
            |
                (
                    ( [A-Z][A-HJ-Y][0-9]{1,2} )
                |
                    (
                        ( [A-Z][0-9][A-Z] )
                    |
                        ( [A-Z][A-HJ-Y][0-9]?[A-Z] )
                    )
                )
            )
            [ ][0-9][A-Z]{2}
        )
        $

================================================================================

        ^
          GIR[ ]0A{2}
    |

            (
                  [A-Z][0-9]{1,2}
            |

                      [A-Z][A-HJ-Y][0-9]{1,2}
                |

                          [A-Z][0-9][A-Z]
                    |
                          [A-Z][A-HJ-Y][0-9]?[A-Z]


            )
            [ ][0-9][A-Z]{2}

        $

================================================================================

        ^
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9]?[A-Z]
        )
        [ ][0-9][A-Z]{2}
        $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9]?[A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $
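
    Wrapping the whole alternation in (?: ... ) appears to be more than
    cosmetic. The published expression has the shape ^(A)|(B)$, where ^ binds
    only to the first branch and $ only to the second. A minimal sketch of
    the difference, using shortened stand-in branches (the variable names and
    test strings are assumptions for illustration):

```python
import re

# Shortened stand-ins for the two top-level branches of the regex.
GIR = r"GIR[ ]0A{2}"
REST = r"[A-Z][0-9]{1,2}[ ][0-9][A-Z]{2}"

# Shape of the published expression: ^ applies to GIR only, $ to REST only.
unwrapped = re.compile("^" + GIR + "|" + REST + "$")
# Shape used on the slide above: both anchors apply to both branches.
wrapped = re.compile("^(?:" + GIR + "|" + REST + ")$")

for text in ["GIR 0AA", "GIR 0AA and more", "junk M1 1AA"]:
    print(repr(text), bool(unwrapped.search(text)), bool(wrapped.search(text)))
```

    With the unwrapped form, trailing text after "GIR 0AA" and leading text
    before any other postcode both slip through.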

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9]?[A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

    3.2 Logic
    
        ...
    OR
        ...
    OR
        ...
    OR
        ...
    OR
            ...
        AND
            ...

================================================================================








            [A-Z][A-HJ-Y][0-9]{1,2}

    "One letter followed by a second letter that must be one of ABCDEFGHJ
    KLMNOPQRSTUVWXY (i.e..not I)"

    What about Z?

================================================================================








            [A-Z][A-HJ-Y][0-9]{1,2}

    "One letter followed by a second letter that must be one of ABCDEFGHJ
    KLMNOPQRSTUVWXY (i.e..not I)"

    What about Z?

    Wikipedia: "The letters IJZ are not used in the second position."

    What about J?
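
    A quick way to see which second letters the character class really
    rejects is to enumerate the alphabet against it. This is only a sketch of
    that check; it says nothing about which letters Royal Mail uses.

```python
import re
from string import ascii_uppercase

# The second-position class exactly as it appears in the regex.
second_letter = re.compile(r"[A-HJ-Y]")

rejected = [c for c in ascii_uppercase if not second_letter.fullmatch(c)]
print(rejected)                              # ['I', 'Z']
print(bool(second_letter.fullmatch("J")))    # True
```

    So the class excludes I and Z but accepts J.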

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9]?[A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================












            [A-Z][A-HJ-Y][0-9]?[A-Z]

    "One letter followed by a second letter that must be one of ABCDEFGH
    JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and 
    optionally a further letter after that"

    [0-9]?[A-Z] makes the digit optional, not the following letter:
    the regex contradicts its own explanation.
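
    The difference shows up on outward codes alone. A small sketch comparing
    the fragment as published with the fragment the explanation describes
    (the sample strings are made up for illustration; "ECA" is not a real
    outward code):

```python
import re

# As published: the digit is optional, the final letter is required.
as_published = re.compile(r"[A-Z][A-HJ-Y][0-9]?[A-Z]")
# As the explanation describes: the digit is required, the final letter is optional.
as_described = re.compile(r"[A-Z][A-HJ-Y][0-9][A-Z]?")

for outward in ["EC1A", "EC1", "ECA"]:
    print(
        outward,
        bool(as_published.fullmatch(outward)),
        bool(as_described.fullmatch(outward)),
    )
```

    The published fragment accepts the digit-free "ECA" and rejects "EC1";
    the described fragment does the opposite.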

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9]?[A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        (?:
            [A-Z][0-9]{1,2}
        |
            [A-Z][A-HJ-Y][0-9]{1,2}
        |
            [A-Z][0-9][A-Z]
        |
            [A-Z][A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
            [0-9]{1,2}
        |
            [A-HJ-Y][0-9]{1,2}
        |
            [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
                    [0-9]{1,2}
        |
            [A-HJ-Y][0-9]{1,2}
        |
                    [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
                    [0-9][0-9]?
        |
            [A-HJ-Y][0-9][0-9]?
        |
                    [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:


            [A-HJ-Y]?[0-9][0-9]?
        |
                    [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
            [A-HJ-Y]?[0-9][0-9]?
        |
                    [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]?
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
            [A-HJ-Y]?[0-9][0-9]?
        |
                    [0-9][A-Z]
        |
            [A-HJ-Y][0-9][A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
            [A-HJ-Y]?[0-9][0-9]?
        |


            [A-HJ-Y]?[0-9][A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        (?:
            [A-HJ-Y]?[0-9][0-9]?
        |
            [A-HJ-Y]?[0-9][A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        [A-HJ-Y]?[0-9]
        (?:
                          [0-9]?
        |
                          [A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        [A-HJ-Y]?[0-9]
        (?:
            [0-9]?
        |
            [A-Z]
        )
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        [A-HJ-Y]?[0-9]
        (?:
            [0-9]
        |
            [A-Z]
        )?
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        [A-HJ-Y]?[0-9]
        (?:
            [0-9A-Z]


        )?
        [ ][0-9][A-Z]{2}
    )
    $

================================================================================

    ^
    (?:
        GIR[ ]0A{2}
    |
        [A-Z]
        [A-HJ-Y]?[0-9]
        [0-9A-Z]?
        [ ][0-9][A-Z]{2}
    )
    $
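
    The refactoring steps after the [0-9][A-Z]? fix are meant to preserve
    behaviour. One way to gain some confidence is to compare the regex from
    just after that fix with the condensed one on many generated strings. The
    sample alphabet and the fixed " 0AA" inward part below are assumptions
    chosen to keep the run short; this is a spot check, not a proof.

```python
import re
from itertools import product

# Regex right after the [0-9][A-Z]? fix, before any factoring.
before = re.compile(
    r"^(?:GIR[ ]0A{2}|(?:[A-Z][0-9]{1,2}|[A-Z][A-HJ-Y][0-9]{1,2}"
    r"|[A-Z][0-9][A-Z]|[A-Z][A-HJ-Y][0-9][A-Z]?)[ ][0-9][A-Z]{2})$"
)
# Condensed regex from the slide above.
after = re.compile(r"^(?:GIR[ ]0A{2}|[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?[ ][0-9][A-Z]{2})$")

# Assumed sample alphabet: letters inside and outside [A-HJ-Y], two digits, a space.
ALPHABET = "AIJZ09 "

def outward_candidates(max_len=5):
    """Yield every string over ALPHABET with 1..max_len characters."""
    for n in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)

mismatches = [
    s + " 0AA"
    for s in outward_candidates()
    if bool(before.search(s + " 0AA")) != bool(after.search(s + " 0AA"))
]
print(mismatches)   # an empty list means the two patterns agreed on every sample
```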

================================================================================

    ^
    (?:
        GIR [ ] 0AA
    |
        [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]?
        [ ]
        [0-9] [A-Z]{2}
    )
    $
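
    The spaced-out layout carries over directly to a verbose-mode pattern.
    The sketch below is a format check only, with a made-up helper name. It
    accepts uppercase input only, because the lowercase ranges were dropped
    during the rewrite, and it keeps the [A-HJ-Y] class as is.

```python
import re

UK_POSTCODE = re.compile(
    r"""
    ^
    (?:
        GIR [ ] 0AA
    |
        [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]?
        [ ]
        [0-9] [A-Z]{2}
    )
    $
    """,
    re.VERBOSE,  # whitespace in the pattern is ignored; [ ] is a literal space
)

def looks_like_uk_postcode(text):
    """Hypothetical helper: True if text has the shape of a UK postcode."""
    # fullmatch also rejects a trailing newline, which a bare $ would allow.
    return UK_POSTCODE.fullmatch(text) is not None

for sample in ["SW1A 1AA", "M1 1AA", "GIR 0AA", "AI1 1AA", "sw1a 1aa"]:
    print(sample, looks_like_uk_postcode(sample))
```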

================================================================================

    ^
    (?:
        GIR [ ] 0AA
    |
        [A-Z] [A-HJ-Y]? [0-9] [0-9A-Z]?
        [ ]
        [0-9] [A-Z]{2}
    )
    $

    Conclusions:
     - the official regex is complicated and wrong
     - the official explanation is also wrong, but in different ways
     - the regex from Wikipedia is complicated and wrong in a third way
     - the explanation on Wikipedia is probably (?) correct

================================================================================