I have two PySpark DataFrame objects that I wish to concatenate. One of them, `df_a`, has a `unique_id` column derived using `pyspark.sql.functions.monotonically_increasing_id()`. The other DataFrame, `df_b`, does not. I want to append the rows of `df_b` to `df_a`, but I need to generate values for its `unique_id` column that do not coincide with any of the values already in `df_a`:

```python
df_a = spark.createDataFrame(
    [
        (1, "a", 42949672960),
        (2, "b", 85899345920),
        (3, "c", 128849018880),
    ],
    ["number", "letter", "unique_id"],
)

df_b = spark.createDataFrame(
    [(3, "c"), (4, "c"), (5, "d")],
    ["number", "letter"],
)

df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id())

df = df_a.union(df_b)
df.show()
```
I looked to see whether `pyspark.sql.functions.monotonically_increasing_id()` takes a parameter enforcing a minimum value, but it does not.
One final thing to note: `df_a` is a massive DataFrame that needs to be appended to regularly. If a long-term solution required reassigning `df_a`'s unique ids with a function other than `pyspark.sql.functions.monotonically_increasing_id()`, I could do that once, but not every time I append new data.
Any direction would be appreciated—thank you!
You can always add a constant offset to the newly generated ids, based on the maximum `unique_id` already present in `df_a`:

```python
n = df_a.select(F.max("unique_id").alias("max_n")).first().max_n
df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id() + F.lit(n + 1))
```
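To see why the offset guarantees no collisions, here is the same idea sketched with plain Python lists, no Spark session required (the existing id values are taken from the example above, and the raw new ids `[0, 1, 2]` are a hypothetical output of `monotonically_increasing_id()` for a single-partition `df_b`):

```python
# Ids already present in df_a's unique_id column.
existing_ids = [42949672960, 85899345920, 128849018880]

# What monotonically_increasing_id() might produce for df_b's rows.
new_raw_ids = [0, 1, 2]

# Shift every new id past the current maximum, exactly as the answer does
# with F.max("unique_id") and F.lit(n + 1).
offset = max(existing_ids) + 1
new_ids = [i + offset for i in new_raw_ids]

# Since every raw id is >= 0, every shifted id exceeds max(existing_ids),
# so the two id sets cannot overlap.
assert not set(new_ids) & set(existing_ids)
print(new_ids)  # [128849018881, 128849018882, 128849018883]
```

The same reasoning carries over to Spark: `monotonically_increasing_id()` always returns non-negative values, so adding `max(unique_id) + 1` keeps the new ids strictly above everything already in `df_a`.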