I need to use the regexp_extract function to extract certain parts of a string recently when I am doing big data analysis.
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_string_functions.html, this link shows us how to do, but sadly regex in impala is a little different from those in perl or python, so I have to look through the page and try again.
I keep a note here for regexp_extract for my later own usage.
Say I have a string
1
|
|
and I want to extract the number followd by “hidisp” and ‘”:’, which is number ‘27’ here.
As describe in the above document,
1
|
|
So if we want to represent the numbers here, we have use ‘\d’ rather than just ‘\d’ which is a standard in other programming languages.
regexp_extract usage is in the following format,
1
|
|
- group 0 matches the full pattern string, including the portion outside any () group, so
1
|
|
this will give out:
1
|
|
- group 1 matches just the contents inside the first () group in the pattern string:
1
|
|
will give out
1
|
|
And for the support of non-greedy matches using .*?, take the following string as an example,
1
|
|
How to extract the first match number for “reboot”, we can get the result in the following ways
a. without any “.*?” around the string pattern ‘reboot%22%3A(\d+)’, which is the easiest way I think
1
|
|
or
b. append “.*?” right after the string pattern ‘reboot%22%3A(\d+)’,
1
|
|
or
c. surround the string pattern ‘reboot%22%3A(\d+)’ with ‘.*?’ at both sides
1
|
|
So after the tree ways to get the leftmost match, we can easily guess how to get the rifht-most match.
1
|
|
That is just by appending the string pattern with ‘.*?’ at its left side.