pyspark.sql.DataFrame.join¶
-
DataFrame.join(other: pyspark.sql.dataframe.DataFrame, on: Union[str, List[str], pyspark.sql.column.Column, List[pyspark.sql.column.Column], None] = None, how: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]¶ Joins with another
DataFrame, using the given join expression.New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- other
DataFrame Right side of the join
- onstr, list or
Column, optional a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
- howstr, optional
default
inner. Must be one of:inner,cross,outer,full,fullouter,full_outer,left,leftouter,left_outer,right,rightouter,right_outer,semi,leftsemi,left_semi,anti,leftantiandleft_anti.
- other
- Returns
DataFrameJoined DataFrame.
Examples
The following performs a full outer join between
df1anddf2.>>> from pyspark.sql import Row >>> from pyspark.sql.functions import desc >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name") >>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) >>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")]) >>> df4 = spark.createDataFrame([ ... Row(age=10, height=80, name="Alice"), ... Row(age=5, height=None, name="Bob"), ... Row(age=None, height=None, name="Tom"), ... Row(age=None, height=None, name=None), ... ])
Inner join on columns (default)
>>> df.join(df2, 'name').select(df.name, df2.height).show() +----+------+ |name|height| +----+------+ | Bob| 85| +----+------+ >>> df.join(df4, ['name', 'age']).select(df.name, df.age).show() +----+---+ |name|age| +----+---+ | Bob| 5| +----+---+
Outer join for both DataFrames on the ‘name’ column.
>>> df.join(df2, df.name == df2.name, 'outer').select( ... df.name, df2.height).sort(desc("name")).show() +-----+------+ | name|height| +-----+------+ | Bob| 85| |Alice| NULL| | NULL| 80| +-----+------+ >>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show() +-----+------+ | name|height| +-----+------+ | Tom| 80| | Bob| 85| |Alice| NULL| +-----+------+
Outer join for both DataFrams with multiple columns.
>>> df.join( ... df3, ... [df.name == df3.name, df.age == df3.age], ... 'outer' ... ).select(df.name, df3.age).show() +-----+---+ | name|age| +-----+---+ |Alice| 2| | Bob| 5| +-----+---+