Regular expression operations-Python

Introduction to Regular Expressions

Regular expressions (also known as regexes, regex patterns, or REs) are essentially a small, specialized programming language embedded in Python and made available through the “re” module. With them, we can specify rules for the set of strings we want to match, such as English sentences or email addresses.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. This beginner's guide to regular expressions in Python starts with the basic operations. While you are learning, a Python regex tester is a handy way to try out patterns.

Regular expression operations

The re module in Python provides regular expression operations. Regular expressions use the backslash character (‘\’) to introduce special forms, or to let special characters be used without invoking their special meaning. This conflicts with Python’s own use of the backslash for the same purpose in string literals.

To match a literal backslash, you would have to write ‘\\\\’ as the pattern string, because the regular expression must be ‘\\’ and each of those backslashes must in turn be written as ‘\\’ inside an ordinary Python string literal.

Python’s raw string notation avoids this: backslashes are not treated specially in a string literal prefixed with ‘r’. Most regular expression operations are available both as module-level functions and as methods of compiled pattern objects. A Python regex tester can help when a pattern does not behave as expected.
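As a small illustration of the escaping rules above, the sketch below spells the "match one literal backslash" pattern both ways and checks that they are the same string. The sample text is made up for the example.

```python
import re

# Ordinary string literal: four backslashes in the source produce the
# two-character pattern \\ that the regex engine needs.
plain_pattern = "\\\\"
# Raw string literal: backslashes are kept as typed.
raw_pattern = r"\\"

# Hypothetical sample text containing a single backslash.
sample = "C:\\temp"

same_pattern = plain_pattern == raw_pattern
found = re.search(raw_pattern, sample)
print(same_pattern)   # True
print(found.group())  # a single backslash
```

The raw form is easier to read, which is why patterns are usually written with the ‘r’ prefix.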

Regular Expression Syntax (with examples)

The functions in the re module let us test whether a particular string matches a given regular expression. Regular expressions can also be concatenated to form new regular expressions.

This can be explained with an example:

Suppose ‘P’ and ‘Q’ are both regular expressions; concatenating them gives the regular expression ‘PQ’. If the string ‘a’ matches ‘P’ and the string ‘b’ matches ‘Q’, then ‘ab’ matches ‘PQ’.

Regular expressions contain both special and ordinary characters. The simplest regular expressions are ordinary characters such as ‘A’ or ‘a’, which match themselves. Since ordinary characters can be concatenated as shown above, the pattern ‘word’ matches the string ‘word’.

Some characters, for example ‘(‘ or ‘|’, are special. They either stand for classes of ordinary characters or affect how the regular expressions around them are interpreted.

Some of the special characters are ‘.’, ‘^’, ‘$’, ‘*’, ‘+’, ‘?’ etcetera.

The match function

If zero or more characters at the beginning of the string match the regular expression, re.match() returns a match object. If the beginning of the string does not match, it returns ‘None’.

re.match() only matches at the beginning of the string, even in MULTILINE mode; it does not check the beginning of each line.

The Search Function

The re.search function scans through the string and returns a match object for the first location where the regular expression produces a match. It returns ‘None’ if no position in the string matches the regular expression.
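To make the MULTILINE remark concrete, here is a minimal sketch using a made-up two-line string. It shows that re.match stays tied to position 0, while re.search combined with ‘^’ and re.MULTILINE can match at the start of a later line.

```python
import re

# Hypothetical two-line sample text.
text = "first line\nsecond line"

# re.match only tries the very start of the string, even with re.MULTILINE.
at_start = re.match(r"second", text, re.MULTILINE)

# re.search with ^ and re.MULTILINE may match at the start of any line.
any_line = re.search(r"^second", text, re.MULTILINE)

print(at_start)          # None
print(any_line.group())  # second
```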

Matching Versus Searching

‘re.match’ and ‘re.search’ differ in how much of the string they examine.

‘re.match’ checks for a match only at the beginning of the string. It succeeds only if the pattern matches starting from the first character.

In contrast, ‘re.search’ scans through the whole string looking for a location where the pattern matches.

This example makes the difference clearer:

import re

a = "123abc"
t = re.match(r"[a-z]+", a)   # tries to match only at the start of the string
y = re.search(r"[a-z]+", a)  # scans the whole string for a match
print(t)
print(y)

In this code, we have assigned the string ‘123abc’ to the variable ‘a’. Then we have called both re.match and re.search with the same pattern, which looks for one or more lowercase letters (a-z).

Here is the output of the code (the exact form of the second line depends on the Python version; this is how recent Python 3 releases print a match object):

None
<re.Match object; span=(3, 6), match='abc'>

We called re.match first; it looks for letters only at the very beginning. As expected, it returned ‘None’, because our string “123abc” starts with digits and the pattern only accepts letters.

re.search, on the other hand, looks for letters throughout the string “123abc”. It finds them starting at the fourth character (index 3), so it returns a match object whose description is printed.

Because re.match only tests the start of the string, it can give up immediately on a mismatch, whereas re.search may have to scan the whole string.

Search and Replace

Python lets us replace one substring with another. For a fixed substring, the replace() method of string objects does the job; note that it works on literal text rather than on a regular expression.

Here is the syntax of the method:

str.replace(old, new[, count])

The method is called on the string to be searched and takes these parameters:

old: The old sub-string we want to replace.
new: The new sub-string we want to put in place of the old one.
count: The maximum number of occurrences to replace. If it is omitted, all occurrences are replaced.

Now, we will understand this with the help of an example:

our_str = 'Spider man'
new_str = our_str.replace('Spider', 'Bat')
print(new_str)
new_str = our_str.replace('man', 'Sense')
print(new_str)

In this example, we have the string ‘Spider man’. First we replace ‘Spider’ with ‘Bat’, and then, starting again from the original string, we replace ‘man’ with ‘Sense’. Since we have not passed the optional count argument, every occurrence would be replaced; each sub-string occurs only once here, so one replacement is made per call. Now we will see the output:

Bat man
Spider Sense

Now, as we required, the replacements have taken place.
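The replace() method only handles fixed substrings. For pattern-based replacement, the re module offers re.sub(); the following is a minimal sketch of how the two compare, using a made-up string with repeated words.

```python
import re

# Hypothetical sample string with a repeated word.
our_str = "Spider man, Spider man"

# str.replace with a count of 1 replaces only the first occurrence.
first_only = our_str.replace("Spider", "Bat", 1)

# re.sub replaces every match of a pattern; here each run of whitespace.
collapsed = re.sub(r"\s+", "_", our_str)

# The count argument of re.sub limits the number of replacements as well.
one_sub = re.sub(r"Spider", "Bat", our_str, count=1)

print(first_only)  # Bat man, Spider man
print(collapsed)   # Spider_man,_Spider_man
print(one_sub)     # Bat man, Spider man
```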

Regular Expression Objects (with examples)

The main regular expression operations in Python are as follows:

The two functions ‘re.search’ and ‘re.match’ have already been discussed.

Split:

We can also break a string into smaller strings in Python. This is done with the split function. You pass the separator, such as a comma, that divides the chunks; if you do not specify a separator, runs of whitespace are used as the breaks by default.

#part1
x = 'wind,water,fire'
k = x.split(",")
print (k)
#part2
a,b,c = x.split(",")
print (a,b,c)

In this code, we have assigned the string ‘wind,water,fire’ to the variable ‘x’. Using the split function, we split it into its three words at the commas. In the second part, we have assigned the three separated strings to the three variables ‘a’, ‘b’ and ‘c’.

Now, we will see the output of the above program.

['wind', 'water', 'fire']
wind water fire

First we get the list of separated sub-strings, and then the same sub-strings printed individually after being assigned to the three variables.
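The example above uses the split method of strings, which takes a fixed separator. When the separators vary, re.split() accepts a pattern instead. A small sketch, with a made-up input that mixes commas, semicolons and spaces:

```python
import re

# Hypothetical input using several different separators.
x = "wind, water;fire  earth"

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", x)
print(parts)  # ['wind', 'water', 'fire', 'earth']
```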

Findall:

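findall helps in finding all the occurrences of a pattern in a string. Unlike re.search and re.match, ‘findall’ does not return a match object; it returns a list of the matched strings, or a list of tuples when the pattern contains more than one group.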

import re
text = """
1. Star Wars
2. Star Trek
3. Futurama
"""
S = re.findall(r'^(\d+)\.(.*)$', text, re.MULTILINE)
print(S)

Because the pattern contains two groups in parentheses, the result is a list of tuples, one per matching line: [('1', ' Star Wars'), ('2', ' Star Trek'), ('3', ' Futurama')].

Compile:

The compile() function compiles a pattern into a pattern object. Pattern matching and string substitution can then be performed through the methods of that object, such as search(), match() and sub(). Here is a Python re.compile example that validates a name.

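import re

# Matches any character that is not a letter, whitespace or a period.
name_check = re.compile(r"[^A-Za-z\s.]")
name = input("Please, enter your name: ")
while name_check.search(name):
    print("Please enter your name correctly!")
    # Ask again so that the loop can terminate.
    name = input("Please, enter your name: ")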
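Because the example above waits for keyboard input, here is a non-interactive sketch of the same idea. It assumes the intended character class is letters, whitespace and periods, and the candidate names are invented for the illustration.

```python
import re

# Flags any character that is not a letter, whitespace or a period.
name_check = re.compile(r"[^A-Za-z\s.]")

# Hypothetical sample inputs.
candidates = ["Ada Lovelace", "J. R. Hartley", "R2-D2"]

# A name is accepted when the compiled pattern finds no offending character.
valid = [name for name in candidates if not name_check.search(name)]
print(valid)  # ['Ada Lovelace', 'J. R. Hartley']
```

Compiling once and reusing the pattern object keeps the pattern in one place when it is applied to many strings.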

Match Objects

Match objects support the following methods and attributes. If you ever feel overwhelmed while learning them, a Python regex cheat sheet can help you memorize them:

expand (template):

It returns the string obtained by performing backslash substitution on the template string, in the same way the sub() method does. Escapes such as ‘\n’ are converted to the corresponding characters, and numeric backreferences (‘\1’, ‘\2’) and named backreferences (‘\g<1>’, ‘\g<name>’) are replaced with the contents of the corresponding group.

group ([group1…]):

It returns one or more sub-groups of the match. With a single argument the result is a single string, whereas with multiple arguments the result is a tuple with one item per argument. This is the usual way to extract pieces of matched text, for example the parts of an email address.

groupdict ([default]):

It returns a dictionary containing all the named subgroups of the match, keyed by group name. The default argument is used for groups that did not take part in the match.

span ([group]):

For a match object ‘m’, it returns the 2-tuple:

 (m.start(group), m.end(group)).

pos:

pos is the index in the string at which the regex engine started looking for a match.

endpos:

endpos is the index in the string beyond which the regex engine will not go.

re:

The regular expression object whose match( ) or search( ) method produced this match object.

string:

The string which was passed to match( ) or search( ).
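The methods and attributes listed above can be tried out on a single match. The sketch below uses a hypothetical pattern with two named groups and a made-up email address; the expected values are shown as comments.

```python
import re

# Hypothetical pattern with two named groups: user and domain.
pattern = re.compile(r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)")
text = "Contact: jane.doe@example.com today"
m = pattern.search(text)

print(m.group())                    # jane.doe@example.com
print(m.group("user", "domain"))    # ('jane.doe', 'example.com')
print(m.groupdict())                # {'user': 'jane.doe', 'domain': 'example.com'}
print(m.span("domain"))             # (18, 29)
print(m.expand(r"\g<domain> / \g<user>"))  # example.com / jane.doe
print(m.pos, m.endpos)              # 0 35
print(m.re is pattern)              # True
print(m.string is text)             # True
```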

Regular Expression Modifiers

Regular expression functions accept optional modifiers that control various aspects of matching. These modifiers are passed as optional flags. Multiple modifiers can be combined using the bitwise OR operator (|). Here are some modifiers and their descriptions.

  • re.I  It performs case-insensitive matching.
  • re.L  With this modifier, words are interpreted according to the current locale. This affects the alphabetic group (\w and \W) as well as word boundary behavior (\b and \B). In Python 3 it can only be used with bytes patterns.
  • re.M  With this modifier, ‘$’ also matches at the end of each line, not only at the end of the string, and ‘^’ also matches at the start of each line, not only at the start of the string.
  • re.S  It makes the dot (period) match any character, including a newline.
  • re.U  This modifier interprets letters according to the Unicode character set. In Python 3 this is already the default for string patterns.
  • re.X  This modifier ignores whitespace in the pattern, except inside ‘[]’ or when escaped by a backslash. An unescaped ‘#’ starts a comment.
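The following sketch shows a few of these modifiers in use, including how two flags are combined with ‘|’. The sample text and patterns are made up for the illustration.

```python
import re

# Hypothetical three-line sample text.
text = "Python\npython\nPYTHON"

# Combine flags with bitwise OR: ignore case, and let ^ and $ match on every line.
all_lines = re.findall(r"^python$", text, re.I | re.M)

# Without re.S the dot does not match a newline; with re.S it does.
without_s = re.search(r"Python.python", text)
with_s = re.search(r"Python.python", text, re.S)

# re.X allows whitespace and comments inside the pattern.
year_month = re.compile(r"""
    \d{4}   # four digits
    -       # a literal hyphen
    \d{2}   # two digits
""", re.X)

print(all_lines)                          # ['Python', 'python', 'PYTHON']
print(without_s)                          # None
print(with_s is not None)                 # True
print(bool(year_month.match("2024-05")))  # True
```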

Regular Expression Patterns

Except for the control characters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. A control character can be escaped by preceding it with a backslash. You can use a Python regex cheat sheet if you want to know more patterns. This is a list of some patterns and their descriptions in Python.

  • ^   This matches the beginning of the line.
  • $   This matches the end of the line.
  • .   This matches any single character except a newline. With the re.S option it matches a newline as well.
  • […] It matches any single character that is in the brackets.
  • [^…] It matches any single character that is not present in the brackets.
  • re* It matches 0 or more occurrences of the expression preceding this pattern.
  • re+ It matches 1 or more occurrences of the expression preceding this pattern.
  • re? It matches either 0 or 1 occurrences of the expression preceding this pattern.
  • (?#...) Comment.
  • \w It matches the word characters.
  • \W It matches the non-word characters.

Regular Expression Examples

Literal characters

Literal characters match themselves. For example, the pattern python matches the literal string “python”.

Character classes

A character class matches any one character out of a set of characters. Some of the predefined character classes are described below.

Special character classes

Special Character class    Description
.     Match any character except new line
\d    Match any digit [0-9]
\D    Match anything except a digit [^0-9]
\s    Match any whitespace character [ \t\n\r\f\v]
\S    Match any non-whitespace character [^ \t\n\r\f\v]
\w    Match any single word character [A-Za-z0-9_]
\W    Match any nonword character [^A-Za-z0-9_]
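As a quick check of the classes in the table, here is a minimal sketch that applies three of them to a made-up sample string.

```python
import re

# Hypothetical sample containing digits, a tab and an underscore.
sample = "Room 42, floor 7\tEast_wing"

digits = re.findall(r"\d", sample)            # every single digit
non_space_runs = re.findall(r"\S+", sample)   # runs of non-whitespace
words = re.findall(r"\w+", sample)            # runs of word characters

print(digits)          # ['4', '2', '7']
print(non_space_runs)  # ['Room', '42,', 'floor', '7', 'East_wing']
print(words)           # ['Room', '42', 'floor', '7', 'East_wing']
```

Note how the comma stays attached to ‘42,’ with \S+ but is dropped with \w+, and how the underscore counts as a word character.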

Repetition cases

Here are the repetition cases, which are used when we have to handle repetition in strings.

Repetition Cases    Description
run?      Match either “ru” or “run”. Here ‘n’ is optional.
run*      Match ‘ru’ followed by zero or more n’s.
run+      Match ‘ru’ followed by 1 or more n’s.
\d{4}     Match exactly 4 digits.
\d{4,}    Match 4 or more digits.
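The repetition cases above can be verified with re.fullmatch, which succeeds only if the whole string matches the pattern. The test strings below are made up for the sketch.

```python
import re

candidates = ("ru", "run", "runn")

# For each pattern, record which candidate strings match in full.
optional_n = [bool(re.fullmatch(r"run?", s)) for s in candidates]
zero_or_more = [bool(re.fullmatch(r"run*", s)) for s in candidates]
one_or_more = [bool(re.fullmatch(r"run+", s)) for s in candidates]

# Counted repetition on a hypothetical string of numbers.
numbers = "12345 2024 99"
exactly_four = re.findall(r"\d{4}", numbers)
four_or_more = re.findall(r"\d{4,}", numbers)

print(optional_n)    # [True, True, False]
print(zero_or_more)  # [True, True, True]
print(one_or_more)   # [False, True, True]
print(exactly_four)  # ['1234', '2024']
print(four_or_more)  # ['12345', '2024']
```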

Repetitions:

There are two kinds of repetition available in Python: greedy and non-greedy.

Greedy Repetition:

Greedy repetition tries to match as many repetitions as possible. Here is an example with its output.

Code:

import re
p = 'runnn'
greedy_re = 'n+'
mymatch = re.search(greedy_re, p)
t = mymatch.group()
print(t)

Output:

nnn

As you can see, ‘n+’ matched all three occurrences of ‘n’ in ‘runnn’, which is the result we expected.

Non-greedy repetition:

Non-greedy repetition matches as few repetitions as possible. That means it is satisfied with the first repetition it encounters.

import re
p = 'runnn'
non_greedy_re = 'n+?'
mymatch = re.search(non_greedy_re, p)
t = mymatch.group()
print(t)

Output:

n

As expected, the non-greedy repetition stopped after the first ‘n’ it encountered.
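The difference matters most when a pattern has a closing delimiter. Here is a brief sketch with a made-up snippet of markup, comparing ‘.+’ and ‘.+?’ between angle brackets.

```python
import re

# Hypothetical markup snippet.
markup = "<b>bold</b>"

# Greedy: .+ runs to the last '>' in the string.
greedy = re.search(r"<.+>", markup).group()

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.+?>", markup).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```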

Anchors

Anchors specify where in the string a match must occur.

Here are a few of them:

Anchors     Description
^play       Matches “play” at the beginning of the string, or of any line in MULTILINE mode.
play$       Matches “play” at the end of the string, or of any line in MULTILINE mode.
\Aplay      Matches “play” at the start of the string only.
play\Z      Matches “play” at the end of the string only.
\bRun\b     Matches “Run” as a whole word, with a word boundary on each side.
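The anchors in the table can be compared on one piece of text. This sketch uses a made-up three-line string and the word “play” throughout, including for the word-boundary case.

```python
import re

# Hypothetical three-line sample text.
text = "play now\nreplay\nlet us play"

line_starts = re.findall(r"^play", text, re.M)  # only line 1 starts with "play"
line_ends = re.findall(r"play$", text, re.M)    # lines 2 and 3 end with "play"
string_start = re.search(r"\Aplay", text)       # start of the whole string
string_end = re.search(r"play\Z", text)         # end of the whole string
whole_words = re.findall(r"\bplay\b", text)     # "replay" is not a whole word

print(len(line_starts))                # 1
print(len(line_ends))                  # 2
print(string_start.start())            # 0
print(string_end.end() == len(text))   # True
print(len(whole_words))                # 2
```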

Special syntax with parenthesis

Some constructs written with parentheses have a special meaning.

Example            Description
E(?#Comment)       Matches “E”; the rest is a comment.
E(?i)xample        Turns on case-insensitive matching. In Python the flag applies to the whole pattern, and since Python 3.11 it must appear at the start of the expression.
E(?i:xample)       Case-insensitive while matching “xample” only.
Exampl(?:e|er)     Groups only, without creating a \1 backreference.
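The parenthesized forms can be checked with a few short calls. This is a sketch with made-up test strings; the scoped form (?i:...) assumes Python 3.6 or later.

```python
import re

# (?#...) is a comment and does not affect matching.
with_comment = re.fullmatch(r"E(?#a comment)xample", "Example")

# (?i:...) ignores case only inside the group; the leading E stays case-sensitive.
scoped_upper = re.fullmatch(r"E(?i:xample)", "EXAMPLE")
scoped_lower_e = re.fullmatch(r"E(?i:xample)", "eXAMPLE")

# (?:...) groups without capturing, so no numbered group is created.
non_capturing = re.fullmatch(r"Exampl(?:e|er)", "Exampler")
capturing = re.fullmatch(r"Exampl(e|er)", "Exampler")

print(with_comment is not None)  # True
print(scoped_upper is not None)  # True
print(scoped_lower_e)            # None
print(non_capturing.groups())    # ()
print(capturing.groups())        # ('er',)
```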

Anjaneyulu Naini
About The Author

Anjaneyulu Naini is a content contributor for Mindmajix. He has a strong understanding of today’s technology and statistical analysis environment, including key aspects such as analysis of variance and software. He is familiar with technologies such as Python, SAS, Artificial Intelligence, Oracle, Business Intelligence, and Altrex. Connect with him on LinkedIn and Twitter.
