Wednesday, April 1, 2015

Moore’s Law, Cloud Computing, and DW/BI



What is Moore’s Law?



Moore’s Law is the observation, made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years the pace slowed somewhat, but data density has doubled approximately every 18 months, and this is the current definition of Moore’s Law, which Moore himself has endorsed. Most experts, including Moore, expect the law to hold for at least another two decades. The advent and evolution of cloud computing is a testament to it.

What is Cloud Computing?



Cloud Computing provides a simple way to access servers, storage, databases and a broad set of application services over the Internet. Cloud Computing providers such as Amazon Web Services own and maintain the network-connected hardware required for these application services, while you provision and use what you need via a web application.
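To make the idea concrete, here is a minimal sketch of provisioning a server programmatically through a cloud provider's SDK (AWS's boto3 for Python). The region, machine image, and instance type are placeholder values chosen for illustration, not anything prescribed by AWS.

import boto3

# Ask AWS for one small virtual server; the provider owns and runs the hardware.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",   # hypothetical machine image ID
    InstanceType="t2.micro",  # small on-demand server
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])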





Let us now delve into some of the latest developments pertaining to cloud computing in the DW/BI sphere.

1. Amazon Redshift


Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. Amazon Redshift’s data warehouse architecture allows the user to automate most of the common administrative tasks associated with provisioning, configuring and monitoring a cloud data warehouse. Backups to Amazon S3 are continuous, incremental and automatic. Restores are fast; you can start querying in minutes while your data is spooled down in the background. Enabling disaster recovery across regions takes just a few clicks.
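Because Redshift speaks the PostgreSQL wire protocol, existing SQL and BI tooling can connect to it directly. As a rough illustration (the endpoint, credentials, and table below are made up), a query from Python might look like this:

import psycopg2

# Connect to the cluster endpoint shown in the AWS console (port 5439 by default).
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="analytics",
    user="bi_user",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)
conn.close()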


Customer case study: By moving to AWS and using Amazon Redshift as a fast, fully managed data warehouse, Nokia is able to run queries twice as fast as with its previous solution and can use business intelligence tools to mine and analyze big data at a 50% cost savings.

Benefits of Amazon Redshift:

  • Optimized for Data Warehousing: Amazon Redshift has a massively parallel processing (MPP) data warehouse architecture, parallelizing and distributing SQL operations to take advantage of all available resources.

  • Scalable: With a few clicks of the AWS Management Console or a simple API call, you can easily change the number or type of nodes in your cloud data warehouse as your performance or capacity needs change (see the sketch after this list).

  • Fault Tolerant: Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3.
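The "simple API call" mentioned in the Scalable point above might look like the following boto3 sketch; the cluster name, node type, and node count are hypothetical.

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Change the node type and count of an existing cluster; Redshift handles
# redistributing the data while the resize runs.
redshift.modify_cluster(
    ClusterIdentifier="my-dw-cluster",  # hypothetical cluster identifier
    NodeType="dc1.large",
    NumberOfNodes=4,
)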

2. Snowflake


Snowflake’s unique architecture takes full advantage of the cloud’s capabilities to store and process data. You can scale up and down at any time without costly redistribution of data, read-only downtime, or hours of delay before new resources can be used. Based on a patent-pending architecture, Snowflake’s cloud service delivers the power of data warehousing, the flexibility of big data platforms, and the elasticity of the cloud, at what Snowflake claims is a 90 percent lower cost than on-premises data warehouses.




Benefits of Snowflake Cloud Services:

  • Data warehousing as a service: Snowflake eliminates the pains associated with managing and tuning a database. That enables self-service access to data, so analysts can focus on getting value from data rather than on managing hardware and software.
  • Multidimensional elasticity: Unlike existing products, Snowflake’s elastic scaling technology makes it possible to independently scale users, data, and workloads, delivering optimal performance at any scale. Elastic scaling makes it possible to simultaneously load and query data because every user and workload can have exactly the resources needed, without contention.
  • Single service for all business data: Snowflake brings native storage of semi-structured data into a relational database that understands and fully optimizes querying of that data. Analysts can query structured and semi-structured data in a single system without compromise (see the sketch after this list).
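The sketch below shows what the elasticity and semi-structured-data points could look like in practice, using the snowflake-connector-python package. The account, warehouse, table, and JSON field names are invented for illustration.

import snowflake.connector

conn = snowflake.connector.connect(
    user="analyst",
    password="...",
    account="myorg-myaccount",  # hypothetical account identifier
)
cur = conn.cursor()

# Multidimensional elasticity: resize a virtual warehouse on the fly,
# without redistributing any data.
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Single service for all business data: query JSON stored in a VARIANT
# column with ordinary SQL.
cur.execute("""
    SELECT payload:device:os::string AS os, COUNT(*) AS events
    FROM raw_events
    GROUP BY os
""")
for os_name, event_count in cur.fetchall():
    print(os_name, event_count)

conn.close()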


Customer case study: Adobe implemented Snowflake’s cloud-based offering because the flexibility that comes from separating compute from storage gives users and applications on-demand access to business-critical data at the performance level and scale required. Adobe’s testing indicated that Snowflake’s cost/performance ratio could exceed that of alternative cloud-based solutions in the market.

Top 5 Trends in Cloud Data Warehousing and Analytics for 2015

  • Trend #1: Rapid Deployment of Large-Scale Cloud Data Warehouses

Given the urgency that today’s ever-increasing data volumes and complexity present, organizations are searching for ways to keep the focus on their business rather than their IT infrastructure. Advances in cloud-based infrastructure and technology are leading companies to trust more of their critical functions to the cloud, including large-scale cloud-based data marts and data warehouses.

  • Trend #2: Increased Enablement of Self-Service Data Access via Cloud Data Integration Services

Even the most mature analytics organizations struggle with the gap between business analysts who need access to information that is not in existing systems and actually making that information accessible. Developers on the IT side work to create applications to house and maintain this data, but these solutions are often disparate and have no governance from or integration to a data warehouse or one another. New cloud-based data integration and data refinery technologies can allow organizations to close this gap by providing APIs to easily move data between cloud data stores.
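As one concrete (and simplified) example of moving data between cloud stores, a Redshift COPY command can load files directly from S3. The bucket, table, and IAM role below are placeholders; this is a sketch of the pattern, not of any particular vendor's integration service.

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="analytics",
    user="bi_user",
    password="...",
)
with conn.cursor() as cur:
    # Pull JSON files from an S3 prefix straight into a staging table.
    cur.execute("""
        COPY staging.web_events
        FROM 's3://my-data-lake/events/2015/03/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """)
conn.commit()
conn.close()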

  • Trend #3: Continued Growth of NoSQL Adoption

NoSQL databases showed a 7% increase in adoption in 2014[1], with reasons for the increased interest ranging from faster and more flexible development to lower deployment costs. NoSQL databases not only offer a low-risk, low-cost entry point for organizations looking to get started with cloud-based analytics, but also provide one of the most efficient, scalable options for cloud data storage. Additionally, new types of NoSQL tools, such as graph databases for analyzing relationship networks and key-value databases for data stream analysis, are gaining popularity for specific analytic use cases.
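A toy sketch of the key-value pattern for stream analysis: keep running counters in Redis as events arrive. The host and key names are made up.

import redis

r = redis.Redis(host="localhost", port=6379)

def record_event(event):
    # One counter per page and one per country; O(1) writes as the stream flows in.
    r.incr("pageviews:%s" % event["page"])
    r.incr("views_by_country:%s" % event["country"])

record_event({"page": "/pricing", "country": "US"})
print(r.get("pageviews:/pricing"))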

  • Trend #4: Big Data Analytics in the Cloud

Big data has been a major focal point for many organizations in recent years. The challenge with big data analytics has always been bringing the data to the analytics tool. Now, with new technologies available for analyzing these data sets in the cloud, organizations are taking advantage of the increased scalability and lower overhead, and we are seeing a shift from physical machines to cloud-based big data solutions.

  • Trend #5: Cloud-Based Analytics and Data Discovery


Deploying cloud-based analytics and data discovery tools may be one of the simplest, most efficient ways for organizations to engage their users and provide self-service Business Intelligence capabilities to put the data in the hands of the business users who can get the most insight from it.

References:

http://www.snowflake.net/product/architecture/
http://aws.amazon.com/redshift/
https://www.ironsidegroup.com/2015/02/02/top-5-trends-in-cloud-data-warehousing-and-analytics-for-2015/
http://www.webopedia.com/TERM/M/Moores_Law.html

Thursday, March 5, 2015

DATA PRESENTATION & VISUALIZATION METHODS

Data visualization is the method of consolidating data into one collective, illustrative graphic. Traditionally, data visualization has been used for quantitative work, but ways to represent qualitative work have proven equally powerful. Data visualization excels at capturing a viewer’s attention and holding it through storytelling. It takes a complex problem that could easily be overlooked and simplifies it through design. Naturally, a new market for business has emerged: by turning data into visual content, users are more likely to engage with it and share it.
The three industries for which I will analyze the use of data visualization are:
  • Financial Management
  • Healthcare Management
  • E-commerce

Financial Management
Financial Management essentially deals with numbers and more numbers. Storing all of them in an Excel file and then massaging the data to reach a meaningful conclusion is tedious, confusing, and prone to human error.

Recommendation:
Executive Dashboard
An executive dashboard displays financial and sales metrics such as Margin by Month, Sales Distribution, Monthly Support Expenses, Monthly Revenue, etc.

Column charts, just like bar graphs, serve dashboard readers by helping them visualize categorical data and compare it side by side. The main purpose of both the column and line chart remains the same even when they are combined: columns are best used to represent categorical data, while lines display the distribution of data over time (the trend).
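For example, a combined column-and-line chart like the one described above can be sketched in a few lines of matplotlib; the revenue and margin figures are invented.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 150, 140, 160, 175]  # monthly revenue in $K (categorical columns)
margin_pct = [22, 24, 23, 26, 27, 29]     # margin trend over time (line)

fig, ax1 = plt.subplots()
ax1.bar(months, revenue, color="steelblue")
ax1.set_ylabel("Monthly Revenue ($K)")

ax2 = ax1.twinx()  # second y-axis for the trend line
ax2.plot(months, margin_pct, color="darkorange", marker="o")
ax2.set_ylabel("Margin %")

ax1.set_title("Revenue vs. Margin by Month")
plt.show()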


Key Visualizations:
Gauges: To visually depict the range of expenses
Maps, Area charts: To visually depict the sales distribution across locations
Line charts: To analyze the Margin, Revenue and Expenses.

Healthcare Management:
Anyone who has been a patient in a hospital will probably agree that the experience has room for improvement. Much of that sentiment stems from the fact that hospitals are complex production facilities. Instrumentation and the proper use of data and knowledge can make a real difference when it comes to improving patient care.



Recommendation:

Hospital executives can get a better picture of what’s going on from an operational point of view and can gain additional insight through analysis of “what if” scenarios: What if we discharge all mothers of newborn babies a day early? How many beds will then be available? How will it affect readmission? What will be the associated costs? Is there a subgroup for which earlier discharge provides higher benefits? Using a dashboard that incorporates a wide variety of graphs, meters, and displays, healthcare administrators can make informed short-term tactical decisions while gaining insight into how their decisions will affect various outcomes, staff groups, and finances.

Key Visualization:
Pie Charts: To analyze top insurance payers.
Stacked Bars: To compare utilization between doctors and nurses, and across departments.
Gauges: To track patient wait time by date and hour.


E-commerce
In the competitive e-commerce market, companies have to keep real-time track of their product performance by region, their top-performing stores, channels, etc. If all of this data is displayed individually, it is difficult to comprehend and overwhelming. An analyst would have to look through multiple files to assess the data by store, location, and product, and as the filter criteria multiply, so does the number of files. In such an event, it is also highly probable that some data may be misread or misanalyzed.


Recommendation:
Dashboards and e-commerce analytics provide visibility for different departments to see information that’s relevant to them. Distributors can use these tools to improve decision-making because they paint a big picture of the data.
Viewing this data in the form of a geo map and a bar chart will be easier and more meaningful.


Key Visualizations:
Waterfall Chart: To understand the cumulative effect of sequentially introduced positive or negative values (see the sketch after this list).
Geo Map: For a visually appealing overview of sales by region.
Stacked Bar Chart: To depict the region-wise top-performing stores in a state.
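The waterfall chart is the least familiar of the three, so here is a small matplotlib sketch of the idea: each bar starts where the running total of the previous positive or negative values left off. The figures are invented.

import matplotlib.pyplot as plt

labels = ["Start", "Online", "Retail", "Returns", "Shipping", "End"]
changes = [1000, 300, 450, -120, -80]  # sequential positive/negative contributions

bottoms, heights, running = [], [], 0
for change in changes:
    bottoms.append(min(running, running + change))  # bar starts at the lower edge
    heights.append(abs(change))
    running += change
bottoms.append(0)        # final bar shows the cumulative result from zero
heights.append(running)

colors = ["grey"] + ["green" if c >= 0 else "red" for c in changes[1:]] + ["grey"]
plt.bar(labels, heights, bottom=bottoms, color=colors)
plt.title("Cumulative effect of sequential changes (waterfall)")
plt.show()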


Conclusion
Data is often meaningless without context, and visually representing information gives audiences important context for understanding it. It helps that data visualization and aesthetics go hand in hand. Well-designed information can help viewers, especially visual learners, cut through unnecessary details to make sense of the world.


References:

http://www.birst.com/learn/resources/visualization-gallery#
http://www.sweetspotintelligence.com/en/2014/10/09/focus-visualizations-combo-charts/
http://www.dashboardinsight.com/articles/new-concepts-in-business-intelligence/big-data-what-it-means-for-data-visualization-and-dashboard-applications.aspx
http://www.conceptdraw.com/How-To-Guide/data-visualization-solutions

Wednesday, February 18, 2015

DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA




Unstructured Data:

Definition: “Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.”

Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. Unstructured data is of little use when forced into a schema or table.
Using email as an example, there are certain values from an email that can be fit into a table: sender, recipient, email body, etc. Although you can have a column for the email body, the information stored in that column would be useless when analyzed in such a way. What questions could analysts ask of all the entries in the “email body” column, and could they be answered? The answer is no.
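The contrast is easy to see in code. In the sketch below (standard library only, with an invented message), the headers map cleanly to table columns while the body remains free text:

from email import message_from_string

raw = """From: alice@example.com
To: bob@example.com
Subject: Q1 forecast

Bob, attached are my thoughts on the Q1 numbers..."""

msg = message_from_string(raw)

structured_row = {              # these fields fit naturally into table columns
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
}
unstructured_body = msg.get_payload()  # free text: hard to query with plain SQL

print(structured_row)
print(unstructured_body)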

Common Forms of Unstructured Data:


In addition to social media there are many other common forms of unstructured data:

•       Word docs, PDFs, and other text files - Books, letters, other written documents, audio and video transcripts
•       Audio files - Customer service recordings, voicemails, 911 phone calls
•       Presentations - PowerPoints, SlideShares
•       Videos - Police dash cam footage, personal video, YouTube uploads
•       Images - Pictures, illustrations, memes
•       Messaging - Instant messages, text messages

In all these instances, the data can provide compelling insights. Using the right tools, unstructured data can add a depth to data analysis that couldn’t be achieved otherwise.
Unstructured data is a valuable piece to the data pie of any business. Tools that are widely accessible today can help businesses use this data to its greatest potential.

Structured Data:

Definition:  Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations.

In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today’s data industry estimate that structured data accounts for only about 20% of the data available. It is clean, analytical, and usually stored in databases.
Today, big data tools and apps have allowed for the exploration of structured data that was once too expensive to gather and store.

Common Forms of Structured Data:



Machine Generated
•       Sensory Data - GPS data, manufacturing sensors, medical devices
•       Point-of-Sale Data - Credit card information, location of sale, product information
•       Call Detail Records - Time of call, caller and recipient information
•       Web Server Logs - Page requests and other server activity (see the parsing sketch below)

Human Generated
•       Input Data - Any data inputted into a computer: age, zip code, gender, etc.

Although its unstructured sibling outnumbers it, structured data has always played, and will always play, a critical role in data analytics. It functions as a backbone for critical business insights. Without structured data, it is difficult to know where to look for the insights hiding in your unstructured data sets.
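As promised above, here is a sketch of how a machine-generated web server log line becomes structured data: a regular expression pulls out named fields that map directly to table columns. The log line follows the common Apache format, and the values are invented.

import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

line = '192.168.1.10 - - [18/Feb/2015:10:15:32 -0500] "GET /pricing HTTP/1.1" 200 5123'

match = LOG_PATTERN.match(line)
if match:
    row = match.groupdict()  # a dict of columns, ready to load into a table
    print(row["ip"], row["path"], row["status"])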


VOLUME OF DATA AND HOW ORGANIZATIONS HANDLE THE DATA



A rough estimate is that 80% of the world’s data is unstructured, and it is growing rapidly in comparison to structured data. Organizations therefore need to formulate a plan for working with both kinds of data.


TYPES OF DATA IN A WAREHOUSE

•       Historical data:
A data warehouse typically contains several years of historical data. The amount of data that you decide to make available depends on available disk space and the types of analysis that you want to support. This data can come from your transactional database archives or other sources.

•       Derived data:
Derived data is generated from existing data using a mathematical operation or a data transformation. It can be created as part of a database maintenance operation or generated at run time in response to a query (see the sketch after this list).

•       Metadata:
Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
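As a tiny illustration of derived data (referenced above), the pandas sketch below computes margin columns from existing revenue and cost columns instead of loading them from a source system; the figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar"],
    "revenue": [120000, 135000, 150000],
    "cost":    [90000, 98000, 110000],
})

# Derived columns: produced by a transformation, not stored in the source data.
sales["margin"] = sales["revenue"] - sales["cost"]
sales["margin_pct"] = 100 * sales["margin"] / sales["revenue"]
print(sales)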


LIMITATIONS OF DATA WAREHOUSES

•          Extra Reporting Work:
Depending on the size of the organization, a data warehouse runs the risk of imposing extra work on departments. Each type of data that’s needed in the warehouse typically has to be produced by the IT team in each division of the business. This can be as simple as duplicating data from an existing database, but at other times it involves gathering data from customers or employees that wasn’t gathered before.

•          Cost/Benefit Ratio:
A commonly cited disadvantage of data warehousing is the cost/benefit ratio. A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT hours and budget to produce a tool that doesn’t get used often enough to justify the implementation expense. That is before even considering the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market.

•          Data Ownership Concerns:
Data warehouses are often, but not always, Software as a Service implementations, or cloud services applications. Your data security in this environment is only as good as your cloud vendor. Even if implemented locally, there are concerns about data access throughout the company.

•          Data Flexibility:
Data warehouses tend to have static data sets with minimal ability to "drill down" to specific answers. The data is imported and filtered through a schema, and it is often days or weeks old by the time it’s actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing and query speed. And while the queries are often ad hoc, they are still limited by the data relations that were defined when the warehouse was assembled.


FUTURE OF DATA WAREHOUSING

Based on the reports and articles that I read on this topic, my vote goes to Cloud-Based Data Warehousing Solutions.

Cloud-based data warehousing is a service model in which elements of the data analytics process are provided through a public or private cloud. This involves delivering business intelligence applications to end users from a hosted location. The model is scalable and makes getting started easier and less expensive.



REFERENCES
http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html
http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm
http://blogs.technet.com/b/dataplatforminsider/archive/2013/10/17/cloud-data-warehousing-the-fastest-time-to-value.aspx
http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data





Tuesday, February 3, 2015



TOOLS OVERVIEW










EVALUATION CRITERIA

WEIGHTED ASSESSMENT

JUSTIFICATION FOR THE SCORES ASSIGNED:

Qlikview [Rank: 5]

Pros
Qlikview is a self-service BI tool built for non-technical professionals that combines engaging graphics with data consolidation from multiple sources into a single place to greatly simplify data analysis. Qlik’s in-memory processing vastly speeds up the application and allows it to refresh in real time, providing insights as data is added. Qlikview distinguishes itself from other BI tools through its unique inference engine, which maintains data associations automatically. Relationships in the data are easily illuminated with Qlikview’s visuals.

Cons
However, some businesses struggled with end users applying filters that caused data to join unintentionally, either limiting or multiplying the results. While the associative model is a great feature, it is easily misused at both the end-user and developer levels. In general, some users struggled to grasp the concept of filtering and had trouble creating their own visuals.

SAP BO [Rank: 2]

Pros
SAP is for everyone, regardless of company size or industry. Anyone within a company, regardless of function, can access BI data via the SAP framework with little to no help from IT. SAP has roughly 14 BI solutions aimed at different company sizes and industries. For example, one BI solution is aimed at SMBs, while another is geared toward companies that work with Microsoft Office. Other solutions allow users to access their BI data via mobile devices or provide additional analytical features. Most BI platforms tend to be specific to a company function (e.g., HR or sales), but SAP’s BusinessObjects platform provides a broader solution. Users can perform BI analysis on multiple data sources and in different formats.

Cons
The licensing process with SAP has proven confusing for many users. Some have reported frequent support fee hikes, and there are multiple levels of maintenance required on the program that build up additional costs.

Tableau [Rank: 1]

Pros
Tableau is a streamlined, user-friendly business intelligence solution that provides a simple, quick way for non-experts to access data and create their own dashboards in just a few clicks. This simplicity and clarity come without sacrificing depth or range of insight. Tableau can also produce results very quickly, utilizing its rapid-fire intelligence tools that develop insights as data comes in, in real time. Tableau Desktop lets users click on data and drag and drop it wherever they want, so they can analyze exactly what they want. This makes connecting to data much simpler for users, who can create dashboards in moments. Tableau also makes it easy to incorporate multiple data sources.

Cons
Tableau is currently working on adapting its software for the Mac. Some users have reported struggles with OLAP calculations when working within Teradata and have expressed frustration with the tool’s usability in those cases. Essentially, Tableau has many advanced features, and it can be difficult to navigate between them.

IBM Cognos [Rank: 4]

Pros
IBM Cognos is a business intelligence tool that can be used to improve strategic management and monitor financial performance. Cognos is notable for its scalable products, which can be tailored to the size of the business, from an individual or a larger workgroup to an entire department, a small business, or a major corporation.

Cons
Critics of the product cite its difficulty of use, especially for those new to advanced software. Of particular note are the error messages that continually pop up; they have been noted to be very difficult to decipher and even more difficult to resolve. Data reports also take almost twice as long to compile in Cognos as in most competing products.

Microstrategy [Rank: 3]

Pros
The MicroStrategy Analytics Platform puts business intelligence in the hands of any user  – meaning the user will not need to rely on IT to provide analyses and reports.  There’s no complicated code to write or query to understand – just a click or drag and drop. This platform also gives the end-user flexible deployment options; it can either be hosted on site or in the cloud.

Cons
MicroStrategy users have noted that there is a particularly steep learning curve to the technology. IT resources will need to be devoted to help build up expertise within the user-base.

The tool also operates within very rigid data structures, which may lead to more time spent using Extract, Transform, and Load tools to get data in order.

The platform does not include any predictive or prescriptive analytics tools, and some have found the lack of more scientific visualizations frustrating.

MARKET USER SHARE

VERDICT