This article provides a brief history and purpose for SQL and finishes off with a primer on how to summarize (aggregate) data with SQL.
Brief History and Purpose Of SQL
Structured Query Language (SQL) was created to shield the database programmer from understanding the specifics of how data is physically stored in each database management system and also to provide a universal foundation for updating, creating and extracting data from database systems that support an SQL interface. SQL approaches databases from a logical perspective rather than a physical perspective. In the old days (before SQL was around), in order to query and/or update a database, a programmer would have to understand how data was physically stored, for example the number of bytes in a record, stop tags etc. This requirement made code hard to read, difficult to port if the underlying database system changed, and required a lot more coding. SQL and its predecessors (other query languages) changed a lot of that by providing a logical abstraction to this physical layer. Now a DB programmer only had to worry about the tables that made up a database and how each table related - a programmer could now focus more on the meaning of the data and much less on how this data was physically stored. An interesting footnote is that SQL originated from work done in Mathematics Set Theory and a language called SEQUEL (Structured English QUery Language) an API created by IBM for its first relational database management system System/R. In fact many people still pronouce SQL as SEQUEL. For some more interesting tidbits on the origins of SQL and relational database theory, check out the links at the bottom of this article.
SQL Clauses Used in Aggregating Data
We summarize the basic SQL clauses and keywords used to aggregate data in the list below.
Clause | Purpose | Example |
SELECT | Used to specify what fields will be included in the query result. This clause is always found in SQL query statements. |
SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
|
FROM | Specifies what tables data will come from. This exists in all SQL query statements. |
SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
|
WHERE | Specifies what subset of the data will be used (always has non-aggregate conditions). This clause is almost always found in SQL query statements. |
SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
|
JOIN | The join clause is used to link more than one table together. This is often found in more complex queries that require retrieving data from more than one table. There are several formats (INNER JOIN, OUTER JOIN, LEFT JOIN, RIGHT JOIN, ..OUTER is usually optional |
SELECT Customers.CustomerName, Sales.CustomerID, Sum(Sales.Amount) As Total
FROM Sales INNER JOIN Customers ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.CustomerID='ABC'
|
GROUP BY | Used to specify about what fields data should be aggregated. In this example, we group by CustomerID so that we get a summary of the total purchases per customer |
SELECT CustomerID, Sum(Amount) As Total
FROM Sales
GROUP BY CustomerID
|
HAVING | HAVING is very similar to WHERE except the statements within it are of an aggregate nature. Note in this example - we are only returning summaries for customers who have purchased more than 60,000 worth of items |
SELECT CustomerID, Sum(Amount) As Total
FROM Sales
GROUP BY CustomerID
HAVING Sum(Amount) > 60000
|
Aggregation Functions SUM, COUNT, AVG | Aggregate functions are used to summarize data by rolling up a set of data items into a single item. There are a few basic ones that exist in most systems that support SQL, and a lot are specific to certain DBMS. An important thing to note is that if a column in the resultset is not an aggregate field, then it must be included in the GROUP BY clause. |
SELECT CustomerID, Sum(Amount) As Total , Count(*) As SaleCount, AVG(Amount) As AverageOrder
FROM Sales
GROUP BY CustomerID
|
Application
Data aggregation is the process of taking numerous records and collapsing them into a single summary record. When aggregating data, one must decide what records must be considered in the summary and how these records should be summarized. Data can be summarized based on certain fields (attributes) in a database or derived attributes.
The examples below were performed in SQL Server 7.0 so the syntax for these may be slightly different if you are working with a different DBMS.
|
SELECT SaleDate, Amount
FROM Sales
WHERE SaleDate > '1999-12-30'
|
In the above example with a date comparison using a string representation of a date. This represents the dataset that we will summarize. *Note SQL Server 7 does an implicit string to date conversion. Some DBMS require explicit casting such as CAST to date or CONVERT. Also valid string representations of dates may vary from local or DBMS system.
|
Once the data to summarize has been decided on, the next step is to decide how to aggregate the data. Data can be aggregated in an infinite number of ways. For example if we are doing an analysis of what are the best days of the week for sales we may want to know the average sale per day of week.
|
|
SELECT DATEPART(dw,SaleDate) As DayOfWeek, AVG(Amount) AvgDaySale
FROM Sales
WHERE SaleDate > '1999-12-30'
GROUP BY DATEPART(dw,SaleDate)
|
Links of Interest
|