What are alternatives to building a data warehouse?

Question

Mohammed Haseeb Musba · Answer

There are many specialized software packages for building a data warehouse. The most popular is Oracle WMS. A warehouse management system often uses automatic identification and data capture technology, such as barcode scanners, mobile computers, and potentially RFID, to efficiently monitor the flow of products. Once data has been collected, it is either synchronized in batches with a central database or transmitted to it wirelessly in real time. The database can then provide useful reports about the status of goods in the warehouse.

Advanced users of Microsoft Excel can also manage warehouse data manually.

Emad Mohammed said abdalla · Answer

Some new paradigms for building a data warehouse:

1. Build a logical data warehouse using data virtualization and data federation technologies such as Composite Software. Leave the data where it is, i.e. do not ETL. Trying to come up with one schema that can represent the needs of the whole enterprise can be a 3-5 year project.

2. Use a Hadoop + NoSQL approach, i.e. dump all your data into a good folder structure, use Hive for occasional queries, and use NoSQL-based data marts for fast queries.

Now

Better Performance through Parallelism: Three Common Approaches

There are three widely used approaches for parallelizing work over additional hardware:

• shared memory

• shared disk

• shared nothing

Shared memory: In a shared-memory approach, as implemented on many symmetric multiprocessor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program: complex distributed locking and commit protocols are not needed, since the lock manager and buffer pool are both stored in the memory system, where they can be easily accessed by all the processors.

Unfortunately, shared-memory systems have fundamental scalability limitations: all I/O and memory requests have to be transferred over the same bus that all of the processors share, causing the bandwidth of this bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines with more than 8 or 16 processors unless they are custom-built from non-commodity parts, in which case they are very expensive. As a result, shared-memory systems offer very limited ability to scale.

Shared disk: Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.

Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.
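To make the centralized option concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class `CentralLockManager`, the helper `simulate`, and the exclusive-lock-per-page model are all hypothetical and not taken from any particular product. It shows that every lock and unlock from every node lands on the same lock table, so the load on that one processor grows with the size of the cluster.

```python
class CentralLockManager:
    """Hypothetical lock table that lives on one designated node."""

    def __init__(self):
        self.owners = {}          # page_id -> node_id holding an exclusive lock
        self.requests_served = 0  # every request from any node is counted here

    def acquire(self, node_id, page_id):
        """Grant the lock if the page is free or already held by this node."""
        self.requests_served += 1
        holder = self.owners.get(page_id)
        if holder is None or holder == node_id:
            self.owners[page_id] = node_id
            return True
        return False

    def release(self, node_id, page_id):
        """Release the lock only if this node actually holds it."""
        self.requests_served += 1
        if self.owners.get(page_id) == node_id:
            del self.owners[page_id]
            return True
        return False


def simulate(node_count, pages_per_node):
    """Each node locks and unlocks its own pages; return the manager's load."""
    manager = CentralLockManager()
    for node in range(node_count):
        for page in range(pages_per_node):
            page_id = (node, page)  # distinct pages, so no lock conflicts
            manager.acquire(node, page_id)
            manager.release(node, page_id)
    return manager.requests_served


if __name__ == "__main__":
    for nodes in (2, 4, 8):
        print(nodes, "nodes ->", simulate(nodes, 100), "requests at the manager")
```

Even with no lock conflicts at all, doubling the number of nodes doubles the requests that the single manager has to serve, which is the bottleneck the passage describes.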

To make shared-disk technology work better, vendors typically implement a “shared-cache” design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it:

1) First checks to see if the page is in its local buffer pool (“cache”)

2) If not, checks to see if the page is in the cache of any other node in the cluster

3) If not, reads the page from disk
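The three steps above can be sketched roughly as follows. This is a simplified illustration, not a vendor implementation: the `Node` class, the `read_page` function, and the use of plain dictionaries for caches and disk are assumptions made for the example, as is the choice to keep a local copy of any page fetched from a peer or from disk.

```python
class Node:
    """Hypothetical cluster node with its own local buffer pool."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # page_id -> page contents


def read_page(page_id, node, peers, disk):
    """Return (page, source), where source is 'local', 'remote' or 'disk'."""
    # 1) Check the local buffer pool first.
    if page_id in node.cache:
        return node.cache[page_id], "local"

    # 2) Otherwise ask every other node in the cluster.
    for peer in peers:
        if page_id in peer.cache:
            page = peer.cache[page_id]
            node.cache[page_id] = page  # assumption: keep a local copy
            return page, "remote"

    # 3) Otherwise fall back to the shared disk.
    page = disk[page_id]
    node.cache[page_id] = page
    return page, "disk"


if __name__ == "__main__":
    disk = {1: "page-1", 2: "page-2"}
    a, b = Node("a"), Node("b")
    print(read_page(1, a, [b], disk))  # first read goes to disk
    print(read_page(1, a, [b], disk))  # second read is a local hit
    print(read_page(1, b, [a], disk))  # node b finds it in a's cache
```

In a real cluster, step 2 is a network round trip to the other nodes, so a lookup that misses everywhere pays that cost before it ever touches the disk.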

Such a cache appears to work fairly well on OLTP, but has big problems with data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen, since warehouse queries are typically answered through sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache, thus placing the entire burden of answering such queries on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.
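A small simulation illustrates why repeated sequential scans get so little out of a cache that is smaller than the table. The sketch below assumes a least-recently-used (LRU) eviction policy, which the text does not specify, and the function name `scan_hit_rate` and the page counts are made up for the example.

```python
from collections import OrderedDict


def scan_hit_rate(table_pages, cache_pages, scans):
    """Fraction of page reads served from an LRU cache over repeated full scans."""
    cache = OrderedDict()  # page -> True, ordered from least to most recently used
    hits = 0
    total = 0
    for _ in range(scans):
        for page in range(table_pages):  # one sequential scan of the fact table
            total += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                cache[page] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)  # evict the least recently used page
    return hits / total


if __name__ == "__main__":
    # Cache holds 90% of the table: each scan evicts pages just before they are needed.
    print(scan_hit_rate(table_pages=1000, cache_pages=900, scans=5))
    # Cache holds the whole table: every scan after the first is served from memory.
    print(scan_hit_rate(table_pages=1000, cache_pages=1000, scans=5))
```

Under these assumptions, a cache that holds 90% of the table still produces no hits at all, because by the time a scan returns to a page it has already been evicted. Only when the whole table fits do the later scans hit the cache.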

In addition, the same scalability problems that exist in the shared-memory model also occur in the shared-disk architecture: the bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks can be a problem, particularly as the number of CPUs increases. To reduce bus contention, customers frequently configure their large clusters with many Fibre Channel controllers (disk buses), but this complicates system design because administrators must then partition data across the disks attached to the different controllers.
