Wednesday, February 18, 2015

DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA




Unstructured Data:

Definition: Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.

Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. It loses most of its value when forced into a schema or table.
Take email as an example. Some values from an email fit neatly into a table: sender, recipient, email body, and so on. But although you can have a column for the email body, the free text stored in that column cannot be analyzed the way the other columns can. What questions could analysts ask of every entry in the “email body” column that a simple query could answer? Practically none.
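
The email example can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the table name, column names, and sample messages are invented for the sketch. It shows that the structured columns answer a question directly, while the body column supports little more than crude keyword matching.

```python
import sqlite3

# Hypothetical table and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (sender TEXT, recipient TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        ("ann@example.com", "bob@example.com", "Can we move the meeting to Friday?"),
        ("ann@example.com", "cy@example.com", "Invoice attached, thanks for the order."),
        ("bob@example.com", "ann@example.com", "Friday works. See you then."),
    ],
)

# Structured columns answer questions directly: how many emails per sender?
emails_per_sender = dict(
    conn.execute("SELECT sender, COUNT(*) FROM emails GROUP BY sender ORDER BY sender")
)

# The body column only supports keyword matching; the query cannot tell
# what a message is about or what it is asking for.
friday_mentions = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE body LIKE '%Friday%'"
).fetchone()[0]

print(emails_per_sender)
print(friday_mentions)
```

Getting real meaning out of the body text would need text-analysis tools rather than a table query, which is the point the example above is making.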

Common Forms of Unstructured Data:


In addition to social media there are many other common forms of unstructured data:

·      Word Docs, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
·      Audio Files - Customer service recordings, voicemails, 911 phone calls
·      Presentations - PowerPoints, SlideShares
·      Videos - Police dash cam, personal video, YouTube uploads
·      Images - Pictures, illustrations, memes
·      Messaging - Instant messages, text messages

In all these instances, the data can provide compelling insights. With the right tools, unstructured data can add a depth to data analysis that could not be achieved otherwise.
Unstructured data is a valuable slice of any business's data pie. Tools that are widely accessible today can help businesses use this data to its greatest potential.

Structured Data:

Definition:  Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations.

In contrast to unstructured data, structured data can be easily organized. It is clean, analytical and usually stored in databases. Despite that convenience, most experts in today’s data industry estimate that structured data accounts for only about 20% of the data available.
Today, big data tools and apps have allowed for the exploration of structured data that was once too expensive to gather and store.

Common Forms of Structured Data:



Machine Generated
•       Sensor Data - GPS data, manufacturing sensors, medical devices
•       Point-of-Sale Data - Credit card information, location of sale, product information
•       Call Detail Records - Time of call, caller and recipient information
•       Web Server Logs - Page requests, other server activity

Human Generated
•       Input Data - Any data entered into a computer: age, zip code, gender, etc.

Although its unstructured brother outnumbers it, structured data has always played and will always play a critical role in data analytics. It functions as a backbone to critical business insights. Without structured data, it is difficult to know where to find insights hiding in your unstructured data sets.
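
As a rough sketch of why structured records are so easy to search, here are call detail records modeled in Python with a fixed set of fields. The field names follow the list above; the class name and the sample values are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    # Every record has the same fields, which is what makes it "structured".
    time_of_call: datetime
    caller: str
    recipient: str


# Sample values are invented for this sketch.
records = [
    CallRecord(datetime(2015, 2, 18, 9, 5), "555-0100", "555-0199"),
    CallRecord(datetime(2015, 2, 18, 9, 40), "555-0101", "555-0100"),
    CallRecord(datetime(2015, 2, 18, 14, 15), "555-0100", "555-0123"),
]

# Because the fields are fixed, simple operations answer real questions.
calls_per_caller = Counter(r.caller for r in records)
morning_calls = [r for r in records if r.time_of_call.hour < 12]

print(calls_per_caller.most_common(1))
print(len(morning_calls))
```

The same two questions (who calls most, how many calls happen before noon) would be much harder to answer from a recording of the calls themselves.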


VOLUME OF DATA AND HOW ORGANIZATIONS HANDLE THE DATA



A rough estimate is that 80% of the world’s data is unstructured, and it is growing rapidly in comparison to structured data. Thus, organizations need to formulate a plan to work with both kinds of data.


TYPES OF DATA IN A WAREHOUSE

•       Historical data:
A data warehouse typically contains several years of historical data. The amount of data that you decide to make available depends on available disk space and the types of analysis that you want to support. This data can come from your transactional database archives or other sources.

•       Derived data:
Derived data is generated from existing data using a mathematical operation or a data transformation. It can be created as part of a database maintenance operation or generated at run-time in response to a query.

•       Metadata:
Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
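
The difference between derived data and metadata can be sketched with Python's built-in sqlite3 module. The sales table, its columns, and its rows are assumptions made up for this example; a real warehouse would use its own schema and tooling.

```python
import sqlite3

# Hypothetical historical data spanning several years.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, quantity INTEGER, unit_price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2013-06-01", 3, 2.5), ("2014-06-01", 2, 4.0), ("2015-01-15", 10, 1.5)],
)

# Derived data generated at run time: yearly revenue is not stored anywhere,
# it is computed from the existing quantity and unit_price columns.
revenue_by_year = dict(
    conn.execute(
        "SELECT substr(sale_date, 1, 4) AS year, SUM(quantity * unit_price) "
        "FROM sales GROUP BY year ORDER BY year"
    )
)

# Metadata: the database's own description of the table's columns.
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales)")]

print(revenue_by_year)
print(columns)
```

Here the rows are the historical data, the yearly revenue is derived data, and the column names and types are metadata that an application could read before deciding how to query the table.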


LIMITATIONS OF DATA WAREHOUSES

•          Extra Reporting Work:
Depending on the size of the organization, a data warehouse can create extra work for departments. Each type of data that's needed in the warehouse typically has to be generated by the IT teams in each division of the business. This can be as simple as duplicating data from an existing database, but at other times, it involves gathering data from customers or employees that wasn't gathered before.

•          Cost/Benefit Ratio:
A commonly cited disadvantage of data warehousing is its cost/benefit ratio. A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT man-hours and budget to produce a tool that doesn't get used often enough to justify the implementation expense. And that is before counting the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market.

•          Data Ownership Concerns:
Data warehouses are often, but not always, Software as a Service implementations, or cloud services applications. Your data security in this environment is only as good as your cloud vendor. Even if implemented locally, there are concerns about data access throughout the company.

•          Data Flexibility:
Data warehouses tend to have static data sets with minimal ability to "drill down" to specific solutions. The data is imported and filtered through a schema, and it is often days or weeks old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing speed and query speed. While the queries are often ad hoc, the queries are limited by what data relations were set when the aggregation was assembled.


FUTURE OF DATA WAREHOUSING

Based on the reports and articles that I read on this topic, my vote goes to cloud-based data warehousing solutions.

Cloud-based warehousing is a service model in which elements of the data analytics process are provided through a public or private cloud. This involves delivery of business intelligence applications to end users from a hosted location. This model is scalable and makes start-up easier and less expensive.



REFERENCES
http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html
http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm
http://blogs.technet.com/b/dataplatforminsider/archive/2013/10/17/cloud-data-warehousing-the-fastest-time-to-value.aspx
http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data





Tuesday, February 3, 2015



TOOLS OVERVIEW










EVALUATION CRITERIA

WEIGHTED ASSESSMENT
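
For readers unfamiliar with the method, a weighted assessment multiplies each criterion score by the weight of that criterion and sums the results per tool. The sketch below shows only the arithmetic. The criteria names, weights, scores, and tool names are invented placeholders, not the values used in this evaluation.

```python
# Hypothetical criteria weights (they sum to 1.0) and scores on a 1-5 scale.
weights = {"ease_of_use": 0.5, "cost": 0.25, "data_sources": 0.25}
scores = {
    "Tool A": {"ease_of_use": 4, "cost": 2, "data_sources": 4},
    "Tool B": {"ease_of_use": 3, "cost": 4, "data_sources": 3},
}


def weighted_score(tool_scores, weights):
    """Sum of (criterion weight x criterion score) for one tool."""
    return sum(weights[criterion] * tool_scores[criterion] for criterion in weights)


totals = {tool: weighted_score(s, weights) for tool, s in scores.items()}

# Highest weighted total first.
ranking = sorted(totals, key=totals.get, reverse=True)

print(totals)
print(ranking)
```

Changing the weights changes the ranking, which is why the choice of weights needs as much justification as the scores themselves.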

JUSTIFICATION FOR THE SCORES ASSIGNED:

QlikView [Rank: 5]

Pros
QlikView is a self-service BI tool built for non-technical professionals. It combines engaging graphics with data consolidated from multiple sources into a single place to greatly simplify data analysis. Qlik's in-memory processor vastly speeds up the application and allows it to refresh in real time, providing insights as data is added. QlikView distinguishes itself from other BI tools through its unique inference engine, which maintains data associations automatically. Relationships in data are easily illuminated with QlikView’s visuals.

Cons
However, some businesses found that end users' filters let fields join unintentionally, which either limited or multiplied the results. While automatic association is a great feature, it is easily misused at both the end-user and developer levels. In general, some users struggled to grasp the concept of filtering, and had trouble creating their own visuals.

SAP BO [Rank: 2]

Pros
SAP is for everyone, regardless of company size or industry. Anyone within a company, regardless of function, can access any BI data via the SAP framework with little or no help from IT. SAP has roughly 14 BI solutions that are aimed at different company sizes and industries. For example, one BI solution is aimed at SMBs, while another is geared toward companies that work with Microsoft Office. Other solutions allow users to access their BI data via their mobile devices or provide additional analytical features. Most BI platforms tend to be specific to a company function (e.g. HR or sales), but SAP’s BusinessObjects platform provides a broader solution. Users can perform BI analysis from multiple data sources and in different formats.

Cons
The licensing process with SAP has proven confusing for many users. Some have reported frequent support fee hikes, and there are multiple levels of maintenance required on the program that build up additional costs.

Tableau [Rank: 1]

Pros
Tableau is a streamlined, user-friendly business intelligence solution that provides a simple, quick way for non-experts to access data and create their own dashboards in just a few clicks. The simplicity and clarity of the solution are provided without sacrificing the depth and range of insight. Tableau can also produce very quick results using its rapid-fire intelligence tools, which develop insights in real time as data comes in. Tableau Desktop lets users click on data and drag and drop it where they want it, so that they can analyze exactly what they want. This makes connecting to data much simpler for users, who can create dashboards in just a few moments. Tableau also makes it easy to incorporate multiple data sources.

Cons
Tableau is currently working on adapting its software for Mac. Some users have reported struggles with OLAP calculations when working within Teradata, and have expressed frustration with how hard the tool can be to use. Essentially, Tableau has many advanced features, and it can prove difficult to navigate between them.

IBM Cognos [Rank: 4]

Pros
IBM Cognos software is a business intelligence tool that can be used to improve strategic management and monitor financial performance. Cognos is unique for its scalable products that can be tailored to the size of the business, ranging from an individual to a workgroup, an entire department, a small business, or a major corporation.

Cons
Critics of the product cite its difficulty of use, especially for those new to advanced software. Of particular note are the error messages that continually pop up, which have been noted to be very difficult to decipher and even more difficult to resolve. Data reports also take almost twice as long to compile with Cognos as with most competitors.

MicroStrategy [Rank: 3]

Pros
The MicroStrategy Analytics Platform puts business intelligence in the hands of any user, meaning the user will not need to rely on IT to provide analyses and reports. There’s no complicated code to write or query to understand, just a click or drag and drop. This platform also gives the end user flexible deployment options; it can be hosted either on site or in the cloud.

Cons
MicroStrategy users have noted that there is a particularly steep learning curve to the technology. IT resources will need to be devoted to help build up expertise within the user-base.

The tool also operates within very rigid data structures, which may lead to more time spent using Extract, Transform, and Load tools to get data in order.

The platform does not include any predictive or prescriptive analytics tools, and some have found the lack of more scientific visualizations frustrating.

MARKET USER SHARE

VERDICT