Agile BI with automated, real-time Data Collection

10 essential questions help you zero in on key issues

These days, companies must incorporate Web data into their intelligence and analysis tools in order to compete. In some industries it’s a matter of survival.

The reason? The accumulation of real-time Web data is dramatically outpacing traditional Business Intelligence (BI) and information management data. The chart to the right shows how the volume of public Web data is growing relative to traditional sources of data.

Real-time data is where the answers are. It’s where market and customer trends are immediately identifiable. It’s where deals will be won and where winners will claim their trophies.

The new reality – real-time Web data

When defining Web data, most people focus on consumer and social media sites such as Google, Amazon, blogs, Wikipedia, Facebook, and Twitter. But what many people forget is that data behind your firewall, data you want from your partners and suppliers, and data behind password-protected logins are all accessible through a Web browser. So this, too, is Web data. And when you expand the definition to include all of this, and understand how easy it is to collect and make available to your employees with no API access and no coding, you begin to realize the enormous potential real-time Web data offers.

Traditional IT processes just aren’t built to access, transform, and load all this unstructured, real-time data. Most BI and analytics tools utilize data owned by the Enterprise (e.g. Financial, Sales, Manufacturing, Customer Surveys, Market Share, and Pricing data). While valuable, this data is often outdated, stale, biased, and incomplete.

As a result, executives are now demanding that CIOs deliver real-time reports on key market activities across product lines. They simply can’t trust manipulated P&L and marketing reports anymore.

So you need to harvest Web data, but it’s not easy. There are a lot of vendors out there claiming all kinds of capabilities.

Are you “learning to fish” or just “eating fish” that someone else caught?

You know the saying – “Give a man a fish and you feed him for a day. Teach him to fish and you feed him for life.” In the Web data collection business, lots of vendors take a “feed him for a day” approach. They masterfully present an initial set of data with colorful dashboards, UIs, and graphics.

Problems start to surface, however, when you want to:

• Make changes to your initial request
• Examine a different set of data
• Find errors within your data feed

“You want customizations?” they ask. Then they roll out the consulting teams to make manual tweaks. And you’re stuck with this expensive “value-added” service for the life of the contract.

That’s not a viable solution. It sounds great when the quote is reasonable and they show a dazzling Flash demo. But it doesn’t work in the real world.

The more sensible alternative is to invest in a system that teaches you to fish and feed on your own customized data. That requires a vendor who provides tools and training up front, as well as a flexible, robust Web data collection engine that’s built for human interaction.

You want a system that allows you to easily modify your data access processes and build new ones at any time. If you’re held captive by a vendor that’s feeding you the data they think you need, you’re going to get really uncomfortable really fast. Imagine being force fed cod fish when you really want salmon prepared just the way you like it – and you can’t leave the table or close your mouth!

And again, BI solutions are only as good as the data behind them. Garbage in, garbage out. You’re better off with a lower level BI tool with great data than a fancy BI product with poor data. That’s why getting the best Web Data ETL product is critical.


Complete ETL capability is crucial

The “feed him for a day” approach typically occurs when the Extraction phase of data gathering is emphasized at the expense of the Transformation phase.

Here’s a quick review of ETL just to set the table.

Extract: This phase can be performed manually via “cut and paste” or with any number of scraping, mining, and harvesting products available in the market today. Essentially, extracting data from the Internet involves capturing what you see in your Web browser. However, not all extraction tools work alike, and the coverage (the variety and amount of Web pages a tool can recognize and access) of competing tools varies vastly from vendor to vendor.

Transform: This is the key differentiator among tool vendors. During this phase, unstructured Web data gets converted to data with structure so that your analysis tools can recognize the data you’ve scraped. Many customer demos and Web site videos only scratch the surface of the complexity involved in transforming data. Not all tools have the same features and capabilities for covering all the potential scenarios that can undermine your data quality and accuracy.

Load: The Load phase moves your data to wherever you want to use it. Interestingly, not all tools have the same loading capabilities.
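
To make the three phases concrete, here is a minimal, hypothetical sketch in Python of a hand-rolled ETL pass over a single page. The URL, markup pattern, and output file are invented for illustration; a production tool handles far messier markup than a regex ever could.

```python
import csv
import re
import urllib.request

# Extract: fetch the raw page (hypothetical URL)
html = urllib.request.urlopen("https://example.com/prices").read().decode("utf-8")

# Transform: turn unstructured markup into structured rows
# (a regex is enough for this sketch; real tools handle far messier markup)
rows = [
    {"product": name.strip(), "price": float(price)}
    for name, price in re.findall(
        r'<td class="name">(.*?)</td>\s*<td class="price">\$([\d.]+)</td>', html
    )
]

# Load: deliver the structured data to wherever your analysis happens
with open("prices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(rows)
```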

The greatest variable among vendors comes during the Transform phase. While initial differences may seem subtle, the impact on your business could be major, to the point where one tool can get the job done and another completely falls apart. Automated or artificial intelligence (AI) tools typically fail here. This is the point where vendors make a lot of money charging for custom services.

Now, here’s the part where you can gain some clarity and weed out lesser vendors with just a few pointed questions.

Ten Essential Questions to ensure Quality Data and Smarter Workers

Vendor demos often hide flaws, so you need to ask the right questions up front.

Extract

Question 1: Can you cover 100% of all Web data, and does your tool easily harvest AJAX, JavaScript, Flash, and PDFs?

Different vendors use a variety of technology and manual scripting to get your data, so you need to be aware of whether your potential vendor can access the data you need. Many vendors access only 60% of Web sites out there (or less). Their solution? Add on consulting services to develop customized scripts. Also note that not all sites are created equally. While a vendor may be able to find a particular example of a Flash or AJAX site their tool works with (possibly with quite a bit of customization or manual scripting prior to showing you the demo), they may not be able to access the Flash or AJAX site you’re interested in. Provide your potential vendor with a list of the sites you want to access, not the ones they have prepared for you.
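
As a sanity check during evaluation, you can see for yourself whether a page’s content only appears after JavaScript runs. The sketch below, assuming the third-party Playwright library and a made-up URL and selectors, renders the page in a headless browser before extracting, which is roughly what a tool must do to harvest AJAX- and JavaScript-driven sites.

```python
from playwright.sync_api import sync_playwright  # third-party; pip install playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")   # hypothetical AJAX-driven page
    page.wait_for_selector(".product")          # wait until scripts have rendered the data
    titles = page.locator(".product .title").all_inner_texts()
    browser.close()

print(titles)
```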

Question 2: How well does your tool perform typical Web browsing navigation functions such as filling out forms, pagination (clicking to the “next” page), etc.?

As you discover the benefits of Web data and become more dependent on real-time access, you’ll need a tool that extracts data similar to the way you browse Web sites, namely filling in forms and passwords, clicking through search results, choosing and comparing items, and moving from one page to another. Make sure your tool vendor provides the features and capabilities to satisfy your needs.
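
A rough sketch of what “browse like a user” means in practice, again assuming Playwright and hypothetical form fields and selectors: fill a search form, then click through result pages until there is no “next” link.

```python
from playwright.sync_api import sync_playwright  # third-party; pip install playwright

results = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/search")       # hypothetical search page
    page.fill("input[name='q']", "widgets")       # fill out the form
    page.click("button[type='submit']")
    while True:
        page.wait_for_selector(".result")
        results += page.locator(".result").all_inner_texts()
        next_link = page.locator("a.next")
        if next_link.count() == 0:                # no more pages
            break
        next_link.click()                         # pagination: on to the next page
    browser.close()
```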

Transform

Question 3: How does your tool surgically transform unstructured Web data to provide superior data quality without any noise?

While an automated tool may appeal to you from an “ease of use” perspective, the real factor should be the quality of the transformed data. Why? Without the ability to transform correctly, scraped data is useless and noise in the data can prove catastrophic. Features to look for include the following (several are illustrated in the sketch after this list):

• Regular Expressions capability (with a graphical user interface): Search for text strings, including their variations, such as grey and gray, color and colour, or car and cartoon
• Encoding and Decoding (to deal with special characters): Many URLs contain characters such as %20, which you can detect and remove
• Date Formatting: Handle international dates, convert time zones, deal with relative dates, combine dates, and scrape data “within the last 7 days” or “1 hour ago”
• String Calculations
• Conditional Expressions (if, then, else, and, or)
• Numeric Calculation: Search a competitor’s price, compare it to your own, find the difference, and reduce your price by 10%
• Multiple language support (including multi-byte character sets)
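
To illustrate a few of these features, here is a small, hypothetical Python sketch that cleans one scraped record: a regular expression for spelling variants, URL decoding of %20, a relative date (“1 hour ago”) turned into a timestamp, and a conditional numeric calculation against a competitor’s price. The record and the 10% rule are invented for the example.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import unquote

record = {"title": "Grey%20Widget Deluxe", "posted": "1 hour ago", "competitor_price": "$24.99"}

# Regular expression: normalize grey/gray spelling variants (word boundaries avoid matching "cartoon"-style hits)
title = re.sub(r"\bgr[ae]y\b", "gray", unquote(record["title"]), flags=re.IGNORECASE)

# Relative date: turn "1 hour ago" into an absolute timestamp
m = re.match(r"(\d+) hour", record["posted"])
posted_at = datetime.utcnow() - timedelta(hours=int(m.group(1))) if m else None

# Conditional + numeric calculation: undercut the competitor by 10% only if they are cheaper
competitor = float(record["competitor_price"].lstrip("$"))
our_price = 27.50
new_price = round(competitor * 0.9, 2) if competitor < our_price else our_price
```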

Question 4: Does your tool provide automatic, customizable de-duplication of data? How about Data Cleansing, Data Normalization, Metadata Extraction, Data Mapping, and Data Linking?

Having a robust set of transformation functions will save you time, reduce confusion and eliminate headaches. The more sites you collect data from, the more important these features become.
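
A minimal sketch of what de-duplication and cleansing look like if you have to do them by hand, using invented records; the point is that a good tool does this for you, configurably, across every source.

```python
# Hypothetical scraped records: the same product, formatted differently
records = [
    {"sku": "A-100", "name": "Gray Widget ", "price": "24.99"},
    {"sku": "a-100", "name": "Gray  Widget", "price": "24.99"},
]

seen, cleaned = set(), []
for r in records:
    key = r["sku"].strip().lower()               # normalize the key used for de-duplication
    if key in seen:
        continue                                 # drop the duplicate
    seen.add(key)
    cleaned.append({
        "sku": key.upper(),
        "name": " ".join(r["name"].split()),     # data cleansing: collapse stray whitespace
        "price": float(r["price"]),              # data normalization: consistent types
    })
```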


Question 5: What are my different load output options?

Depending on where you are loading the data, you’ll need to be able to output the scraped data into multiple formats, such as into your own SQL database, a vendor-hosted database, a Java or C# data structure, a SOAP or REST Web service, RSS, CSV, or XML.
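
Two of the simpler load targets, sketched in Python with invented data: a CSV file and a SQL database (SQLite here, purely to keep the example self-contained).

```python
import csv
import sqlite3

rows = [{"sku": "A-100", "name": "Gray Widget", "price": 24.99}]  # invented output of the Transform phase

# Load option 1: CSV for spreadsheets or downstream BI imports
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sku", "name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Load option 2: a SQL database
conn = sqlite3.connect("products.db")
conn.execute("CREATE TABLE IF NOT EXISTS products (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT OR REPLACE INTO products VALUES (:sku, :name, :price)", rows)
conn.commit()
conn.close()
```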

Vendor

Question 6: How accurate and customizable is your Artificial Intelligence or Automated Engine/Agent/Process?

Automated data collection is a fairly complex task because Web sites are built and configured differently to make the user experience positive but also secure. Vendors that rely on Artificial Intelligence for their scraping processes have many weaknesses that prevent them from being able to access different sets of data as well as different styles of Web page programming. This is one of the main reasons low data coverage is common (60% and below). As noted in the Transform section, even scraping simple text pages may leave you unable to structure and load your data correctly. And when AI fails, you need to know if you have the ability to modify the process. Make sure you’re not paying more for work-arounds every time an automated process fails. Additionally, Web sites are dynamic and content is constantly changing. Do you have control over your agents and scripts to quickly manage these changes, or must you engage your vendor for additional consulting services?

Question 7: What platforms do you support? Do you support SaaS, Cloud and On-Premise?

Some vendors only support Windows. If you need Linux or UNIX support, be sure to verify this before purchasing. Further, if you would prefer to minimize your infrastructure investment and outsource the technology management of your offering to your vendor, ask if they offer a cloud-based offering. And finally, if you start with a SaaS solution, can you take it in-house later if you need to for security or IP reasons?

Question 8: Please tell me about scalability and performance

The more data you collect, the more the answer to this question matters. If you want to run multiple processes (robots, spiders, scripts, agents, etc.), does your vendor feature load-balancing and multi-threaded execution to scale linearly with the number of CPUs? Headless browser technology will also be an important factor to reduce overhead and decrease network bandwidth.
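
Multi-threaded execution is easy to reason about with a small sketch: a pool of workers runs several hypothetical agents in parallel, and the pool size caps how many requests are in flight at once. Real load-balancing across CPUs and servers is the vendor’s job, but the principle is the same.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical pages, each handled by its own agent
urls = [f"https://example.com/products?page={n}" for n in range(1, 9)]

def run_agent(url: str) -> int:
    # A real agent would extract, transform, and load; here we just fetch
    with urllib.request.urlopen(url, timeout=30) as resp:
        return len(resp.read())

# Run agents in parallel; max_workers bounds concurrent requests
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(run_agent, urls))
```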

Question 9: What about debugging and error handling?

Can the tool debug problems with automated scraping agents when they aren’t working? Can you immediately see when a script or robot breaks, and do you have the ability to fix it on the fly? As Web sites are dynamic and change constantly, how quickly will you be able to learn about the break and be able to fix it? Will you have to contact your vendor because their automated or artificial intelligence couldn’t detect and correct it? Look for a tool that allows you to identify the breaks, fix them, and redeploy as quickly as possible to eliminate gaps in your data.
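
Whatever tool you pick, you can wrap its output in a cheap health check so a broken robot surfaces immediately instead of silently leaving gaps in your data. A hypothetical sketch:

```python
import logging

logging.basicConfig(level=logging.WARNING)

def validate_batch(rows, required=("sku", "price"), min_rows=1):
    """An empty batch or missing fields after a run usually means the target
    site changed and the agent's selectors broke, so flag it right away."""
    if len(rows) < min_rows:
        logging.warning("Agent returned %d rows; its selectors may be broken", len(rows))
        return False
    for r in rows:
        if any(not r.get(field) for field in required):
            logging.warning("Row is missing required fields: %r", r)
            return False
    return True
```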

Question 10: Can you tell me more about pricing and total cost?

Is the pricing variable (based on volume of data scraped or number of agents built) or fixed (number of CPUs)? To achieve 100% data coverage with high data quality, do I need to engage your consulting or professional services organization to fix things when your tool does not live up to your promises? (Ask a follow-up question about the percentage of total revenue the vendor receives from post-sales consulting services, too.)

Now the Decision is Yours

As you move through an evaluation process, you’ll want to dig deeply into these questions, especially in the proof of concept phase. Make sure you’re not opening up your wallet every time coding changes are required. You also want to focus on specific ETL capabilities. You’ll see a huge payoff in ROI if you zero in on the details and choose your vendor wisely.

Think about learning to fish as opposed to accepting what’s served. You don’t want to be at the mercy of an AI vendor that delivers a limited menu or a force feeding session. And remember, Web Data Harvesting is like feeding data to a BI tool. Data quality and coverage are critical, and they both are 100% dependent on your vendor choice.

Kapow Technologies

260 Sheridan Ave, Suite 420, Palo Alto, CA 94306
Phone: +1 800 805 0828
Fax: +1 650 330 1062
Email: marketing@kapowtech.com
Website: KapowTech.com
Blog: KapowTech.com/blog

