Database tip: Understanding joins and set theory
Sybase expert Mich Talebzadeh provides an excerpt from his book explaining set theory and different types of joins.
Continue Reading This Article
Enjoy this article as well as all of our content, including E-Guides, news, tips and more.
![]() |
|
![]() |
![]() |
![]() |
To understand joins, one needs to understand the set theory. Here is some explanation:
Set theory is a branch of mathematics which defines sets (thought of a grouping of things) and operations on them. In a relational database, a dataset is composed of the data in a table and its columns.
Example:
CREATE TABLE test_table( column01 INT NOT NULL, column02 INT NOT NULL, CONSTRAINT ix01_test_table PRIMARY KEY NONCLUSTERED ( column01 ) ) go INSERT INTO test_table( column01, column02 ) SELECT 1, 1 go
The test_table dataset:
SELECT column01, column02 FROM test_table go 1 1
Let's give an overview of some basic set properties. Let's define three sets A, B, and C as follows:
JOINT SETS
Joint sets are sets that share elements. Joint sets can be visualized as follows:
DISJOINT SETS
Disjoint sets are sets that have no shared elements. Disjoint sets can be visualized as follows:
The following SET operations are discussed in more detail in the section on Joins. First, though, we will review how basic set operations are carried out for datasets.
SET Unions
A set union of datasets is the dataset which arises from the combining of all datasets. In addition, a set union can include the overlaps once or twice from a pair of datasets which are joint. These types of unions are typically carried out by outer joins, a union, or a union all.
SET Intersections
A set intersection of datasets is their overlap: it consists of those elements that exist in all of the datasets. Or simply put, an element which exists in one dataset must exist in the other dataset. Intersections are typically carried out by equijoins (aka natural joins or even inner joins), an IN, and/or an EXISTS.
SET Subtractions
A set subtraction is the removal of one dataset from a second dataset. Subtractions are typically carried out by a NOT IN and/or a NOT EXISTS.
SET Multiplications
A notion of set multiplication for datasets is the cartesian product. The cartesian product of a given collection of datasets "explodes" to a dataset consisting of all possible combinations of elements from all datasets. They are typically creating by forming joins with no conditions.
Joins
A join is an operation which combines elements of two or more datasets by specifying which combinations (within the cartesian product) are valid. In Sybase Adaptive Server Enterprise (ASE), we do this by specifying conditions on the columns or fields of the datasets row by row, and linking conditions between datasets row by row. A join then displays the combined results as a new dataset.
Note: In ASE, the tables or views specified in the join can be in the same database or in different databases.
Example:
SET ROWCOUNT 1 SELECT title, pub_name, advance FROM titles, publishers WHERE titles.pub_id = publishers.pub_id SET ROWCOUNT 0 go You Can Combat Computer Stress! New Age Books 10,125.00
A join may be thought of as the product of two or more sets, pivoted about some key or column and a comparison operation. The results that are returned or that are linked are those that match the search criteria specified. The results are those rows or records that are duplicated or joint on the join criteria.
You can embed a join in a select, update, insert, delete, or subquery. Other search conditions and clauses may follow the join conditions.
Joins use the following syntax:
start of SELECT, UPDATE, INSERT, DELETE, or subquery FROM {table_list | view_list} WHERE [NOT] [table_name. | view_name.]column_name join_operator [table_name. | view_name.]column_name [{AND | OR} [NOT] [table_name.|view_name.]column_name join_operator [table_name.|view_name.]column_name]... End of SELECT, UPDATE, INSERT, DELETE, or subquery
Joins and the relational model
The join operation is the hallmark of the relational model of database management. More than any other feature, the join distinguishes relational database management systems from other types of database management systems.
![]() |
||||
|
![]() |
|||
![]() |
In a relational database management system, relationships among data values are left unstated in the definition of a database. They become explicit when the data is manipulated when you query the database, not when you create it. You can ask any question that comes to mind about the data stored in the database, regardless of what was intended when the database was created.
According to the rules of good database design, called normalization rules, each table should describe one kind of entity – a person, place, event, or thing. That is why, when you want to compare information about two or more kinds of entities, you need the join operation. Relationships among data stored in different tables are discovered by joining them.
A corollary of this rule is that the join operation gives you unlimited flexibility in adding new kinds of data to your database. You can always create a new table that contains data about a different kind of entity. If the new table has a field with values similar to those in some field of an existing table or tables, it can be linked to those other tables by joining.
How joins are processed
Conceptually speaking, the first step in processing a join is to form the Cartesian product of the tables with all the possible combinations of the rows from each of the tables. The number of rows in a Cartesian product of two tables is equal to the number of rows in the first table, multiplied by the number of rows in the second table.
To illustrate joins, we'll consider the sample database included with ASE. The Cartesian product of the authors table and the publishers table is 69 (23 authors multiplied by 3 publishers). You can have a look at a Cartesian product with any query that includes columns from more than one table in the SELECT list, more than one table in the FROM clause, and no WHERE clause. For example, if you omit the WHERE clause on a join between the authors and publishers table, ASE combines each of the 23 authors with each of the 3 publishers, and returns all 69 rows.
This Cartesian product does not contain any particularly useful information. In fact, it is downright misleading, because it implies that every author in the database has a relationship with every publisher in the database, which is not true at all.
That is why you must include a WHERE clause in the join, which specifies the columns to be matched and the basis on which to match them. It may also include other restrictions. Once ASE forms the Cartesian product, it eliminates the rows that do not satisfy the join by using the conditions in the WHERE clause.
Equijoins and natural joins
Joins based on equality (=) are called equijoins. Equijoins compare the values in the columns being joined for equality and then includes all the columns in the tables being joined in the results.
Example:
SELECT * FROM authors, publishers WHERE authors.city = publishers.city
In the results of that statement, the city column appears twice. By definition, the results of an equijoin contain two identical columns. Because there is usually no point in repeating the same information, one of these columns can be eliminated by restating the query. The result is called a natural join. The query that results in the natural join of publishers and authors on the city column is:
SELECT publishers.pub_id, publishers.pub_name, publishers.state, authors.* FROM publishers, authors WHERE publishers.city = authors.city
The column publishers.city does not appear in the results.
Another example of a natural join is:
SELECT au_fname, au_lname, pub_name FROM authors, publishers WHERE authors.city = publishers.city
You can use more than one join operator to join more than two tables or to join more than two pairs of columns. These "join expressions" are usually connected with "AND", although "OR" is also legal.
The following are two examples of joins connected by "AND". The first lists information about books (type of book, author, and title), ordered by book type. Books with more than one author have multiple listings, one for each author.
SELECT type, au_lname, au_fname, title FROM authors, titles, titleauthor WHERE authors.au_id = titleauthor.au_id AND titles.title_id = titleauthor.title_id ORDER BY type
The second finds the names of authors and publishers that are located in the same city and state:
SELECT au_fname, au_lname, pub_name FROM authors, publishers WHERE authors.city = publishers.city AND authors.state = publishers.state
Joins with additional conditions
The "where" clause of a join query can include selection criteria as well as the join condition. For example, to retrieve the names and publishers of all the books for which advances of more than $7500 were paid (that is $ as in currency), the statement is:
SELECT title, pub_name, advance FROM titles, publishers WHERE titles.pub_id = publishers.pub_id AND advance > $7500 go You Can Combat Computer Stress! New Age Books 10,125.00 The Gourmet Microwave Binnet & Hardley 15,000.00 Secrets of Silicon Valley Algodata Infosystems 8,000.00 Sushi, Anyone? Binnet & Hardley 8,000.00
The columns being joined (pub_id from titles and publishers) do not need to appear in the select list and, therefore, do not show up in the results.
You can include as many selection criteria as you want in a join statement. The order of the selection criteria and the join condition is not important.
Much of the previous material on joins was taken from Chapter 4 of the ASE 12.5 Transact-SQL User's Guide.
Troubleshooting joins
Joining N tables will result in N-1 Join clauses
- Join columns should have matching data types.
- Join columns should have clustered indexes or non-clustered indexes on them with Index Page Cluster Ratio (IPCR) > 0.9. This value can be obtained through the function DERIVED_STAT('table_name', 'index_name','ipcr')
- Provide a valid SARG on the outer or driving table, otherwise expect a table scan!
- Join queries 8 tables at the time. Use #tables for intermediate results. This makes queries more consistent and your code more readable.
Mich Talebzadeh is a database expert with large DBA and architecture experience with special interests in Sybase and Oracle. He specializes in creating database architectures for large global trading systems involving heterogeneous databases. Mich is based in London and serves as Chairman of the Sybase Future Forum. As resident Sybase expert on SearchEnterpriseLinux.com, he can answer any of your Sybase questions.