Database Management Tutorials

Storing an offset and a date range

August 19th, 2008

Clearly, summary tables that aggregate time have implied start dates, which can be utilized to store date offsets from the start. However, because the summary table is much smaller than a fact table, the processing time necessary to convert the date offset to a physical date could form a substantial percentage of the total time. If this is the case, we recommend that you do not use date offsets within the summary table. Disk I/O rates become less of an issue when the data volumes are small, and processing time becomes more significant.

Date ranges are useful in that they can represent a significant aggregation of rows within the table. As before, this technique will increase the processing time, because it makes use of the between statement to satisfy the query. If the degree of aggregation caused by use of the date range is high, the scan time saved by reducing the data volumes will still outweigh the additional processing time. However, if the degree of aggregation is low, it may be more appropriate to expand all rows represented by a date range into individual rows.

At this point, it is also worth considering the requirements for specific user access tools. Many tools are unable to execute queries against fact tables that utilize date ranges, because they expect a key to a time dimension table. Storing date ranges within the fact table can result in significant savings in data storage and query performance. The saving within summaries will be proportionately less, because there will be a much smaller number of rows. Because of this, it is typically more effective to avoid the use of date ranges within summaries.

Understanding MOLAP, ROLAP and OLAP

August 12th, 2008

The acronyms MOLAP and ROLAP stand for multidimensional OLAP and relational OLAP respectively. They are terms that have come into use to differentiate multidimensional tools from relational tools. The distinction is somewhat artificial, because many of the MOLAP tools have an SQL interface that allows them to extract data from a relational database. This said, the SQL interface is automatic, in that it generates the SQL automatically, but the SQL generated is not necessarily efficient. The advantage of using a multidimensional tool is that on a predefined set of data it gives user-friendly, fast access to powerful analytical and statistical functions.

Multidimensional tools suffer a heavy performance hit when uploading a cube, but once the data is in memory they can carry out certain operations on that data far more efficiently than a relational tool can. Operations such as time series analysis and top-ten bottom-ten selection can be performed extremely efficiently. These operations, while possible in a relational tool, are difficult to program in SQL, and will not perform efficiently. The other thing that multidimensional tools do very well is dimensional slicing. If data has been loaded into the cube dimensioned by office, region, sales_quantity, sales_value and product, the tool will be able to switch almost instantly from displaying data by region to displaying data by product or by sales_value.

Relational technology and Multidimensional analysis

August 2nd, 2008

Relational technology has been around for many years, and is fairly well understood these days. Again, there is a large body of literature on the subject. For our purposes it is sufficient to say that basically the relational model works by allowing data to be normalized into relations, usually referred to as tables. This normalization minimizes data duplication, making the data more manageable, while still allowing data to be efficiently manipulated. The power of relational technology is in the relational operations that allow data to be joined, unioned, intersected and so on. This, along with the standardization on a common SQL language, has made relational databases the norm in the marketplace today.

Multidimensional analysis is a technique whereby data can be analyzed in many dimensions at once. The term dimension in this context means an attribute such as cost, duration, or name. These attributes will generally be equivalent to a column in a relational table. The idea is that instead of analyzing the data in a two-dimensional table, the data is loaded into a multidimensional hypercube to be analyzed. Using matrix arithmetic and sparse matrix optimizations this allows the data to be stored space efficiently and analyzed very rapidly by the loaded dimensions. Some tools also allow the use of multiple separate but related hypercubes; this reduces the sparsity of the matrices and makes the dimension calculations more efficient.

Commonly used MOLAP and RLOP Tools

July 29th, 2008

MOLAP tools are good to use for analyzing aggregated data in conjunction with its dimension data. They are not so good if you need to drill down to detailed data at the fact level, or if you need to query very large quantities of base data. If you are going to use a MOLAP tool against the data warehouse it will be better to help the SQL performance by creating aggregations that will allow the commonly accessed cubes to be quickly built, or even prebuilt. This will allow the MOLAP tool to get up and running very quickly. It will also prevent inefficient generated SQL from trying to build data sets at the correct aggregated level for the desired cube. In effect, these aggregations are data marts designed specifically for the MOLAP tool.

ROLAP tools are the traditional SQL-oriented tools that have tight integration to the relational model. These tools have been around a long time, but are changing all the time. The current generation of ROLAP tools are powerful and easy to use. They use metadata to isolate the user from the underlying complexities of the data warehouse, and to present a business perspective of the data.
ROLAP tools can be distinguished from the data dippers by their range and depth of analytical functionality. As with the MOLAP tools these are business-aware tools that understand business terminology. Being relational they can be used as data browsers, and will have good drill-down capability from aggregation to detailed data. The future heralds ever more powerful tools that are data warehouse aware - tools that are designed for and are capable of dealing with very large databases.

Data analysis tools

July 20th, 2008

Data analysis tools are used to perform complex analysis of data. They will normally have a rich set of analytic functions, which allow sophisticated analysis of the data. These tools are designed for business analysis, and will generally understand the common business metrics, such as market share, churn and profitability. Data analysis tools can again be subdivided into two categories, MOLAP tools and ROLAP tools. MOLAP and ROLAP are terms that sprang up to distinguish two different approaches to analyzing data.

They came into existence because the term OLAP has become an industry buzzword. To understand the difference between them, first one has to know what an OLAP is. Apart from OLAP knowledge on is multidimensional analysis is also necessary. Much has been written about the subject in the computer literature, and for a detailed discussion you should consult some of that work. For our purposes it is sufficient to have a basic understanding of the term. The term OLAP is an acronym for online analytical processing. Much has been written about the subject in the computer literature, and for a detailed discussion you should consult some of that work. For our purposes it is sufficient to have a basic understanding of the term.

Hardware and Software Costs Of Data Marting

July 17th, 2008

Data marts require additional hardware and software, even if they are sited on the same physical hardware that is running the data warehouse. Essentially, the data mart database will require additional processing power and disk storage to handle the user queries. Having said that, we have to bear in mind that the total processing requirement may not have substantially changed, because the data marts may be offloading queries from the enterprise data warehouse.On the whole, the data mart is likely to require additional processing power, but it is unlikely to be substantial unless we plan to retain detailed data within the data warehouse as well. If there exists a requirement to retain detailed data within the data warehouse and the data mart, we shall incur additional costs to store and manage the replicated data.

This will be true even if the data mart is placed on the same MPP hardware that is running the enterprise data warehouse. The additional nodes running the data mart may be within the same cabinet, but they will be operating as separate systems. In all cases, data marts should not be used to avoid the need to store aggregations. Data marting is more expensive than aggregating, and so it should be used as an additional strategy, not as an alternative strategy. Also, aggregations can provide an overview of corporate information, which may not be available by data marting detailed fact data.

Different Database Design Options

July 8th, 2008

If we have to populate a database design specific to a user access tool, we shall probably have to transform the data into the required structure. This could simply be a transformation into different database tables or, alternatively, we may have to transform into structures that exist outside the data warehouse database. For example, if the user access tool operates against the data warehouse database, we would only have to transform from one. database schema to another. This is less complex than transforming into flat file structures, or to a non-relational database.

Typically, database transformations involve a degree of aggregation. This should be created from the detailed data in the normal way, as the data in the starflake schema is designed to be aggregated quickly and effectively. Complex transformations should be designed using certain guidelines. There is a great deal of similarity between the load and warehouse management process that loads and transforms into the data warehouse structure, and a data mart load process that transforms from the data warehouse into the data mart structure. Because these transformations may be complex, the bulk of the transformations may have to be implemented by stored procedures. Data cleaning and consistency will not be an issue, because the data will already have been cleaned by the load process into the data warehouse.

Is your Summary Table Too Big To Be Useful?

June 25th, 2008

The whole purpose of a summary table is to minimize the amount of data that needs to be scanned to satisfy a query. Clearly, if that figure is high, the query will take longer to respond. Before you can answer this question, it is important to understand that the significant factor is the volume of data typically scanned, as opposed to the full size of the table. In most cases, we would expect a proportion of the table to be needed within a query; it will be unusual for the majority of the queries to be scanning the bulk of the data within the summary tables. If we have a range of summary tables aggregating product sales over different time periods, product ranges, and locations, we would assume that either Most queries that look at corporate results will scan the bulk of those tables or Most queries that look at regional results will scan a proportion of the tables that store more detailed information.

Summary tables with low degrees of aggregation are so large that they warrant being treated as fact data. This means that their design should follow the guidelines recommended for fact data, using column reduction strategies, horizontal and vertical partitioning strategies. Summary tables with high degrees of aggregation will be so small that they should be designed to contain all the information required within a query, with dimensions denormalized into the summary. Summary tables with medium degrees of aggregation are the ones that will be borderline when it comes to deciding how much to put into each row. These tables should be sized carefully to prevent the total row length from affecting the overall scan time. As a guideline for medium aggregated summary tables, if the data to be scanned is more than 1-2 GB in size, the table is probably too large. You should expect highly aggregated summary tables to be no more than a few hundred megabytes in size.

Which Summaries To Create in a DB?

June 19th, 2008

This is the most common question that arises when one starts creating summaries in DB. Identifying which summary tables are required is a complex and ongoing process. As query profiles change, the determination of which summaries are necessary. will change as well. However, it is necessary to initiate the process by identifying the first set of summaries that are created within a new data warehouse.

If the data warehouse is replacing an existing MIS, it could be the case that a set of summaries already exist. In this instance, create those summaries within the data warehouse, in order to allow you to seamlessly cross over the existing user population, However, in most instances, you will have to determine what the likely query profile will be, possibly with no statistics on which to base your decision. One technique is to examine the various levels of aggregation within each key dimension, and determine the most likely combinations that may be of interest. This can be done by creating a table with the key dimensions, and looking for the combinations that match our understanding of the user interest. For example, in a retail sales analysis data warehouse, we would first create a table of the key dimensions. This process is continued until we have identified no more than 50 summary tables. These should seed the system, on the basis that this figure will increase as query profiles become better understood. This process can be automated by the warehouse manager.

Storing a physical date in a Database

June 3rd, 2008

This option should always be used unless the level of aggregation is very low. As already discussed, summary tables should always use intelligent keys, in order to reduce the need to join. This means that the physical date is stored within the summary table, in preference to a non-intelligent key to the date dimension table. One point to bear in mind is that when the time dimension is being aggregated at some level, even at a daily level, any time of day information is aggregated away, which implies that this aspect of date storage is hardly ever used in summaries.

I, for one would recommend that if you are using database date structures that embed the storage of time, you should always set the time of day to an amount that makes it clear that the value should be ignored. In broad terms, the focus of summary data is to speed up query response times, so each technique is amended in order to provide the best overall performance.