How to add a space around special characters between words in a PySpark DataFrame using regex?

I have a DataFrame of reviews in which special characters appear between words without any surrounding spaces. I want to add a space on each side of such a character.

For example,

Spark)NLP -> Spark ) NLP
Machine-Learning -> Machine - Learning

Below is my dataframe

temp = spark.createDataFrame([
    (0, "This is 5years of Spark)world 5-6"),
    (1, "I wish Java-DL could use case-classes"),
    (2, "Data-science is  cool"),
    (3, "Machine")
], ["id", "words"])


+---+-------------------------------------+
|id |words                                |
+---+-------------------------------------+
|0  |This is 5years of Spark)world 5-6    |
|1  |I wish Java-DL could use case-classes|
|2  |Data-science is  cool                |
|3  |Machine                              |
+---+-------------------------------------+

I tried the code below, but it does not work:

temp_1 = temp.withColumn('words', F.regexp_replace('words', r'(?<! )(?=[.,!?()\/\-\+\'])|(?<=[.,!?()\/\-\+\'])(?! )', '$1 $2 $3'))

Desired output:

+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |This is 5years of Spark ) world 5 - 6    |
|1  |I wish Java - DL could use case - classes|
|2  |Data - science is  cool                  |
|3  |Machine                                  |
+---+-----------------------------------------+

Solution:

You can use

\b[^\w\s]\b|_

And replace each match with " $0 ", that is, the whole match ($0) with a space on each side.
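As a rough sketch of how this could be applied to the DataFrame from the question: the helper names add_spaces_spark and add_spaces_local below are made up for illustration, and the replacement is assumed to be " $0 " with a space on each side of the match. The local helper uses Python's re module only as a quick way to check the pattern without a Spark session. Spark itself evaluates the pattern with Java's regex engine, which agrees with re on this ASCII sample but can differ on non-ASCII text.

```python
import re

# Pattern from the answer: a single character that is neither a word
# character nor whitespace, sitting between two word characters, or an underscore.
PATTERN = r"\b[^\w\s]\b|_"


def add_spaces_spark(df, column="words"):
    """Hypothetical helper: pad matches in a Spark DataFrame column (needs pyspark)."""
    # Imported lazily so the local check below also runs where Spark is not installed.
    from pyspark.sql import functions as F

    # Spark uses Java regex syntax, where $0 stands for the whole match.
    return df.withColumn(column, F.regexp_replace(column, PATTERN, " $0 "))


def add_spaces_local(text):
    """Hypothetical helper: the same substitution with Python's re module."""
    # \g<0> is how Python's re spells Java's $0.
    return re.sub(PATTERN, r" \g<0> ", text)


# Sample rows copied from the question.
rows = [
    "This is 5years of Spark)world 5-6",
    "I wish Java-DL could use case-classes",
    "Data-science is  cool",
    "Machine",
]

for row in rows:
    print(add_spaces_local(row))
```

With a Spark session available, temp_1 = add_spaces_spark(temp) would be the equivalent call on the DataFrame above. Note that the attempt in the question refers to $1, $2 and $3 in the replacement although its pattern defines no capturing groups, whereas $0 always refers to the whole match.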

If you do not consider an underscore to be a special character, use just \b[^\w\s]\b, which matches any character that is neither a word character nor whitespace when it sits between two word characters. Note that word characters include underscores.

If there must be a letter or digit on each side, replace the word boundaries with lookarounds: (?<=[^\W_])[^\w\s](?=[^\W_])|_. To match special characters only between letters, use (?<=[^\W\d_])[^\w\s](?=[^\W\d_])|_ or (?<=\p{L})[^\w\s](?=\p{L})|_.
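The difference between these variants is easiest to see on one input. The sketch below is only an illustration: it uses Python's re module rather than Spark, drops the |_ alternative so that only the handling of the neighbouring characters differs, and leaves out the \p{L} form because re does not support that syntax (Java's engine, which Spark uses, does). The names BOUNDARY, ALNUM, LETTERS and pad are invented for this example.

```python
import re

# Word boundaries: underscores count as word characters on either side.
BOUNDARY = r"\b[^\w\s]\b"
# Lookarounds: a letter or digit is required on each side.
ALNUM = r"(?<=[^\W_])[^\w\s](?=[^\W_])"
# Lookarounds: a letter is required on each side.
LETTERS = r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])"


def pad(pattern, text):
    """Surround every match of pattern with single spaces."""
    return re.sub(pattern, r" \g<0> ", text)


# One hyphen between underscores, one between digits, one between letters.
sample = "a_-_b 5-6 x-y"

for name, pattern in [("BOUNDARY", BOUNDARY), ("ALNUM", ALNUM), ("LETTERS", LETTERS)]:
    print(name, "->", pad(pattern, sample))
```

The word-boundary version pads all three hyphens, the letter-or-digit version skips the one between underscores, and the letters-only version pads only the hyphen in x-y.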
