Wednesday, February 18, 2015

DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA




Unstructured Data:

Definition: Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.

Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. It loses most of its usefulness when forced into a schema or table.
Take email as an example. Certain values from an email fit neatly into a table: sender, recipient, email body, and so on. Although you can have a column for the email body, the free text stored in that column is of little use when analyzed that way. What questions could analysts ask of every entry in the “email body” column, and could a simple query answer them? In most cases, the answer is no.
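The email example can be sketched in a few lines of Python. This is only an illustration: the message text, the addresses, and the helper name email_to_row are made up for the example, and it uses the standard library's email module to split a message into the fields that fit a table and the body that does not.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration only.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Wed, 18 Feb 2015 09:30:00 +0000

Hi Bob, the figures look better than last quarter. Can we talk on Friday?
"""


def email_to_row(raw):
    """Split a raw email into table-friendly header fields and a free-text body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],        # structured: one well-defined value
        "recipient": msg["To"],       # structured: one well-defined value
        "subject": msg["Subject"],    # short text, still easy to filter on
        "body": msg.get_payload().strip(),  # unstructured: arbitrary prose
    }


row = email_to_row(RAW_EMAIL)
print(row["sender"])
print(row["body"])
```

The sender and recipient can be filtered, grouped, and counted like any other column. The body is just a string of prose, so answering a question such as "is this customer unhappy?" needs text analysis rather than a simple lookup.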

Common Forms of Unstructured Data:


In addition to social media there are many other common forms of unstructured data:

·      Word Docs, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
·      Audio Files - Customer service recordings, voicemails, 911 phone calls
·      Presentations - PowerPoints, SlideShares
·      Videos - Police dash cam, personal video, YouTube uploads
·      Images - Pictures, illustrations, memes
·      Messaging - Instant messages, text messages

In all these instances, the data can provide compelling insights. Using the right tools, unstructured data can add a depth to data analysis that couldn’t be achieved otherwise.
Unstructured data is a valuable piece of the data pie for any business. Tools that are widely accessible today can help businesses use this data to its greatest potential.

Structured Data:

Definition:  Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations.

In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today’s data industry estimate that structured data accounts for only about 20% of the data available. It is clean, analytical and usually stored in databases.
Today, big data tools and apps have allowed for the exploration of structured data that was once too expensive to gather and store.
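To make "readily searchable" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its column names, and the rows are assumptions invented for this example. It shows that once data has a fixed set of columns, a plain query can find exactly the records you want.

```python
import sqlite3

# Hypothetical point-of-sale rows, invented for illustration only.
SALES = [
    ("2015-02-16", "Austin", "notebook", 3, 4.50),
    ("2015-02-16", "Boston", "pen", 10, 1.20),
    ("2015-02-17", "Austin", "pen", 5, 1.20),
]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_date TEXT, city TEXT, product TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", SALES)

# Because every row has the same columns, a simple filter is enough.
austin_sales = conn.execute(
    "SELECT sale_date, product, quantity FROM sales "
    "WHERE city = ? ORDER BY sale_date",
    ("Austin",),
).fetchall()

print(austin_sales)
```

The same question asked of a folder of scanned receipts or call recordings would need far more work, which is the practical difference between the two kinds of data.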

Common Forms of Structured Data:



Machine Generated
•       Sensor Data - GPS data, manufacturing sensors, medical devices
•       Point-of-Sale Data - Credit card information, location of sale, product information
•       Call Detail Records - Time of call, caller and recipient information
•       Web Server Logs - Page requests, other server activity

Human Generated
•       Input Data - Any data inputted into a computer: age, zip code, gender, etc.

Although its unstructured brother outnumbers it, structured data has always played, and will always play, a critical role in data analytics. It functions as a backbone for critical business insights. Without structured data, it is difficult to know where to find the insights hiding in your unstructured data sets.


VOLUME OF DATA AND HOW ORGANIZATIONS HANDLE THE DATA



A rough estimate is that 80% of the world’s data is unstructured, and it is growing rapidly in comparison to structured data. Thus, organizations need to formulate a plan to work with both kinds of data.


TYPES OF DATA IN A WAREHOUSE

•       Historical data:
A data warehouse typically contains several years of historical data. The amount of data that you decide to make available depends on available disk space and the types of analysis that you want to support. This data can come from your transactional database archives or other sources.

•       Derived data:
Derived data is generated from existing data using a mathematical operation or a data transformation. It can be created as part of a database maintenance operation or generated at run-time in response to a query.

•       Metadata:
Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
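Derived data and metadata are easier to picture with a small example. The sketch below again uses Python's sqlite3 module with a made-up sales table (the table, columns, and values are assumptions for illustration, not taken from any real warehouse). Revenue is derived at run-time from stored columns, and the column list is read back from the database's own metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quantity INTEGER, unit_price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    # Hypothetical rows, invented for illustration only.
    [("Austin", 3, 4.50), ("Boston", 10, 1.25), ("Austin", 4, 1.25)],
)

# Derived data: revenue is never stored. It is computed from existing
# columns (quantity * unit_price) at run-time, in response to the query.
revenue_by_city = conn.execute(
    "SELECT city, SUM(quantity * unit_price) FROM sales "
    "GROUP BY city ORDER BY city"
).fetchall()

# Metadata: PRAGMA table_info describes the table itself. Each result row
# is (cid, name, type, notnull, dflt_value, pk); keep name and type only.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales)")]

print(revenue_by_city)
print(columns)
```

A warehouse could also store the revenue totals in their own table during a maintenance job instead of computing them per query. That is the other way of creating derived data mentioned above, and it trades storage for faster reads.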


LIMITATIONS OF DATA WAREHOUSES

•          Extra Reporting Work:
Depending on the size of the organization, a data warehouse risks creating extra work for departments. Each type of data that's needed in the warehouse typically has to be generated by the IT teams in each division of the business. This can be as simple as duplicating data from an existing database, but at other times it involves gathering data from customers or employees that wasn't gathered before.

•          Cost/Benefit Ratio:
A commonly cited disadvantage of data warehousing is its cost/benefit ratio. A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT man-hours and budget to produce a tool that doesn't get used often enough to justify the implementation expense. That is before counting the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market.

•          Data Ownership Concerns:
Data warehouses are often, but not always, Software as a Service implementations, or cloud services applications. Your data security in this environment is only as good as your cloud vendor. Even if implemented locally, there are concerns about data access throughout the company.

•          Data Flexibility:
Data warehouses tend to have static data sets with minimal ability to "drill down" to specific solutions. The data is imported and filtered through a schema, and it is often days or weeks old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing speed and query speed. While the queries are often ad hoc, the queries are limited by what data relations were set when the aggregation was assembled.


FUTURE OF DATA WAREHOUSING

Based on the reports and articles that I read on this topic, my vote goes to cloud-based data warehousing solutions.

Cloud-based warehousing is a service model in which elements of the data analytics process are provided through a public or private cloud. This involves delivering business intelligence applications to end users from a hosted location. This model is scalable and makes start-up easier and less expensive.



REFERENCES
http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html
http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm
http://blogs.technet.com/b/dataplatforminsider/archive/2013/10/17/cloud-data-warehousing-the-fastest-time-to-value.aspx
http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data




