Replace special characters in a string in Python

I am using urllib to get a string of HTML from a website and need to put each word in the HTML document into a list.

Here is the code I have so far. I keep getting an error. I have also copied the error below.

import urllib.request

url = input("Please enter a URL: ")

removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

words = removeSpecialChars.split()

print ("Words list: ", words[0:20])

Here is the error.

Please enter a URL:
Traceback (most recent call last):
  File "C:\Users\jeremy.KLUG\My Documents\LiClipse Workspace\Python Project 2\", line 7, in <module>
    removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
TypeError: replace() takes at least 2 arguments (1 given)

Here are the solutions:


Solution 1

One way is to use re.sub, which is my preferred approach.

import re
my_str = "hey th~!ere"
my_new_string = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)
print(my_new_string)


hey there

Another way is to build the character set from string.punctuation with re.escape:

import string
import re

my_str = "hey th~!ere"

chars = re.escape(string.punctuation)
print(re.sub('[' + chars + ']', '', my_str))


hey there

A small tip about naming style in Python: according to PEP 8, variable names should be written like remove_special_chars, not removeSpecialChars.

Also, the space inside [^a-zA-Z0-9 \n\.] is what keeps spaces in the result. If you want to remove the spaces as well, change it to [^a-zA-Z0-9\n\.]
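As a minimal sketch of what each of the two patterns does, the snippet below runs both on a made-up sample string. The names sample, keep_spaces, and drop_spaces are illustrative and not part of the original answer.

```python
import re

sample = "hey th~!ere, wor#ld"  # hypothetical input

# The space is listed in the negated set, so spaces survive.
keep_spaces = re.sub(r'[^a-zA-Z0-9 \n\.]', '', sample)

# No space in the set, so spaces are stripped along with the punctuation.
drop_spaces = re.sub(r'[^a-zA-Z0-9\n\.]', '', sample)

print(keep_spaces)          # hey there world
print(drop_spaces)          # heythereworld
print(keep_spaces.split())  # ['hey', 'there', 'world']
```

Keeping the spaces matters for the original question, because split() needs them to separate the words.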

Solution 2

str.replace is the wrong function for what you want to do (apart from it being used incorrectly). You want to replace any character of a set with a space, not the whole set with a single space (the latter is what replace does). You can use translate like this:

removeSpecialChars = z.translate({ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"})

This creates a mapping which maps every character in your list of special characters to a space, then calls translate() on the string, replacing every single character in the set of special characters with a space.
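A small end-to-end sketch of this approach might look as follows. The sample text assigned to z is a hypothetical stand-in for the downloaded HTML, and the names special_chars and table are illustrative.

```python
# Hypothetical stand-in for the HTML text the question stores in z.
z = "<p>Hello, world! Visit example.com/page?id=1</p>"

special_chars = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"

# Map the code point of every special character to a space.
table = {ord(c): " " for c in special_chars}

cleaned = z.translate(table)
words = cleaned.split()
print(words)
# ['p', 'Hello', 'world', 'Visit', 'example', 'com', 'page', 'id', '1', 'p']
```

Because every special character becomes a space, split() then separates the words without any further processing. Note that tag names such as p end up in the list as well.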

Solution 3

You need to call replace on z and not on str, since you want to replace characters located in the string variable z:

removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

But this will not work, because replace looks for the whole argument as a single substring. You will most likely need the regular expression module re and its sub function:

import re
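removeSpecialChars = re.sub(r"[!@#$%^&*()\[\]{};:,./<>?\\|`~\-=_+]", " ", z)

Don’t forget the outer [], which indicates that this is a set of characters to be replaced. Inside the set, the literal [, ], \ and - must be escaped with a backslash, otherwise the ] closes the set early and the - is read as a range.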
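Some characters in that list have a special meaning inside a set. One way to avoid escaping them by hand is to let re.escape do it. This is a sketch under that assumption, not the answerer's own code, and the sample text in z is made up.

```python
import re

special_chars = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"

# re.escape backslash-escapes the characters that are special in a regex,
# so the brackets, backslash, and hyphen are treated literally inside [...].
pattern = re.compile("[" + re.escape(special_chars) + "]")

z = "a[b]c-d\\e|f"  # hypothetical sample text
print(pattern.sub(" ", z))  # a b c d e f
```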

Solution 4

replace operates on a specific string, so you need to call it like this:

removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

But this is probably not what you need, since it will look for a single string containing all of those characters in the same order. You can do it with a regexp, as Danny Michaud pointed out.
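The substring behaviour is easy to check on a small example. The sketch below uses a made-up sample string, and the names text, unchanged, and cleaned are illustrative.

```python
special_chars = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"

text = "hello, world!"  # hypothetical sample text

# replace() only matches the whole 30-character sequence, which never occurs here.
unchanged = text.replace(special_chars, " ")
print(unchanged == text)  # True

# Replacing one character at a time does work, at the cost of one pass per character.
cleaned = text
for c in special_chars:
    cleaned = cleaned.replace(c, " ")
print(cleaned.split())  # ['hello', 'world']
```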

As a side note, you might want to look at BeautifulSoup, which is a library for parsing messy HTML formatted text like what you usually get from scraping websites.

Solution 5

You can replace the special characters with the desired characters as follows:

specialCharacterText = "H#y #@w @re &*)?"
inCharSet = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+\""
outCharSet = " " * len(inCharSet)  # one replacement character for each character in inCharSet
# In Python 3, maketrans is a static method of str, not a function in the string module
splCharReplaceList = str.maketrans(inCharSet, outCharSet)
splCharFreeString = specialCharacterText.translate(splCharReplaceList)

Solution 6

translate seems to be the fastest of the three approaches:

N=100000, 30 special characters, string length=70

replace: 0.3251810073852539
re.sub: 0.2859320640563965
translate: 0.12320685386657715
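The code behind these numbers is not shown. A rough sketch of how such a comparison might be set up with timeit is given below. The sample text, the function names, and the repetition count are all assumptions, and timings will vary from machine to machine.

```python
import re
import timeit

special_chars = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"  # the 30 characters from the question
text = "Hello, world! (this) is [a] {test}; a/b?c=d " * 2  # hypothetical sample text

table = {ord(c): " " for c in special_chars}
pattern = re.compile("[" + re.escape(special_chars) + "]")

def with_replace(s):
    # One replace() call per special character.
    for c in special_chars:
        s = s.replace(c, " ")
    return s

def with_re_sub(s):
    # A single regex substitution over a character set.
    return pattern.sub(" ", s)

def with_translate(s):
    # A single translate() call with a precomputed mapping.
    return s.translate(table)

# All three approaches should produce the same result.
assert with_replace(text) == with_re_sub(text) == with_translate(text)

n = 10000  # fewer repetitions than the N=100000 quoted above, to keep the run short
for func in (with_replace, with_re_sub, with_translate):
    elapsed = timeit.timeit(lambda: func(text), number=n)
    print(f"{func.__name__}: {elapsed:.4f} s")
```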


All methods are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
