Back to Papers and Articles
Summarizing data with SQL (Structured Query Language)
Copyright 2000 Paragon Corporation   ( September 08, 2000)
This article provides a brief history and purpose for SQL and finishes off with a primer on how to summarize (aggregate) data with SQL.

Brief History and Purpose Of SQL

Structured Query Language (SQL) was created to shield the database programmer from understanding the specifics of how data is physically stored in each database management system and also to provide a universal foundation for updating, creating and extracting data from database systems that support an SQL interface. SQL approaches databases from a logical perspective rather than a physical perspective. In the old days (before SQL was around), in order to query and/or update a database, a programmer would have to understand how data was physically stored, for example the number of bytes in a record, stop tags etc. This requirement made code hard to read, difficult to port if the underlying database system changed, and required a lot more coding. SQL and its predecessors (other query languages) changed a lot of that by providing a logical abstraction to this physical layer. Now a DB programmer only had to worry about the tables that made up a database and how each table related - a programmer could now focus more on the meaning of the data and much less on how this data was physically stored. An interesting footnote is that SQL originated from work done in Mathematics Set Theory and a language called SEQUEL (Structured English QUery Language) an API created by IBM for its first relational database management system System/R. In fact many people still pronouce SQL as SEQUEL. For some more interesting tidbits on the origins of SQL and relational database theory, check out the links at the bottom of this article.

SQL Clauses Used in Aggregating Data

We summarize the basic SQL clauses and keywords used to aggregate data in the list below.
ClausePurposeExample
SELECTUsed to specify what fields will be included in the query result. This clause is always found in SQL query statements. SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
FROMSpecifies what tables data will come from. This exists in all SQL query statements. SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
WHERESpecifies what subset of the data will be used (always has non-aggregate conditions). This clause is almost always found in SQL query statements. SELECT Sum(Amount) As Total
FROM Sales
WHERE CustomerID='ABC'
JOINThe join clause is used to link more than one table together. This is often found in more complex queries that require retrieving data from more than one table. There are several formats (INNER JOIN, OUTER JOIN, LEFT JOIN, RIGHT JOIN, ..OUTER is usually optional SELECT Customers.CustomerName, Sales.CustomerID,
Sum(Sales.Amount) As Total
FROM Sales INNER JOIN Customers ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.CustomerID='ABC'
GROUP BYUsed to specify about what fields data should be aggregated. In this example, we group by CustomerID so that we get a summary of the total purchases per customer SELECT CustomerID, Sum(Amount) As Total
FROM Sales
GROUP BY CustomerID
HAVINGHAVING is very similar to WHERE except the statements within it are of an aggregate nature. Note in this example - we are only returning summaries for customers who have purchased more than 60,000 worth of items SELECT CustomerID, Sum(Amount) As Total
FROM Sales
GROUP BY CustomerID
HAVING Sum(Amount) > 60000
Aggregation Functions SUM, COUNT, AVGAggregate functions are used to summarize data by rolling up a set of data items into a single item. There are a few basic ones that exist in most systems that support SQL, and a lot are specific to certain DBMS. An important thing to note is that if a column in the resultset is not an aggregate field, then it must be included in the GROUP BY clause. SELECT CustomerID, Sum(Amount) As Total , Count(*) As SaleCount, AVG(Amount) As AverageOrder
FROM Sales
GROUP BY CustomerID

Application

Data aggregation is the process of taking numerous records and collapsing them into a single summary record. When aggregating data, one must decide what records must be considered in the summary and how these records should be summarized. Data can be summarized based on certain fields (attributes) in a database or derived attributes.
The examples below were performed in SQL Server 7.0 so the syntax for these may be slightly different if you are working with a different DBMS.

SELECT SaleDate, Amount
FROM Sales
WHERE SaleDate > '1999-12-30'
In the above example with a date comparison using a string representation of a date. This represents the dataset that we will summarize.
*Note SQL Server 7 does an implicit string to date conversion. Some DBMS require explicit casting such as CAST to date or CONVERT. Also valid string representations of dates may vary from local or DBMS system.

Once the data to summarize has been decided on, the next step is to decide how to aggregate the data. Data can be aggregated in an infinite number of ways. For example if we are doing an analysis of what are the best days of the week for sales we may want to know the average sale per day of week.

SELECT DATEPART(dw,SaleDate) As DayOfWeek, AVG(Amount) AvgDaySale
FROM Sales
WHERE SaleDate > '1999-12-30'
GROUP BY DATEPART(dw,SaleDate)

Links of Interest

SQL: The Universal Database Language This is one of the best SQL primers I have ever read. It is a good read even if you are fairly proficient in SQL.
What's in a Name Published in the 11/3/98 issue of PC Magazine. Gives a brief history of the origins of SQL
Intelligent Enterprise Magazine - Celko Latest articles by the SQL Master Joe Celko.



Back to Papers and Articles