Why Big Data Is Not All Hype: The Power of Unstructured Text Analytics

“Is big data really all it’s cracked up to be?” a New York Times writer asks. And the Financial Times wondered in March whether we are making a big mistake with big data. Such queries reflect a growing worry that big data may be little more than big hype, or, as FT continued, “A massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.”

Well, there are problems with Big Data, some of which are illuminated by Prof. Mariann Jelinek in her article on page 7 of this issue. But beneath the misgivings and skepticism are real applications that hold real promise, especially in the newest and fastest growing field of Big Data Analytics: Unstructured Text Analysis (UTA).

Over the past six years, CIMS, with help from IBM, has developed a powerful, relatively low-cost UTA platform that can help companies make better, more informed strategic decisions in five important areas:

  1. Identify data-driven trends often seen in marketing, customer and financial intelligence;
  2. Look for the “root cause” of situations that are not at all obvious (we call this “finding needles in haystacks”);
  3. Extract information or facts such as IDs, names, demographics, etc.;
  4. Use contextual data (often external to the firm) to enhance predictive models;
  5. Identify key influencers around a particular topic or event.

In the article below, North Carolina State University Professors Paul Mugge and Dick Kouri explain how the CIMS faculty is applying UTA to the first problem area; specifically, to help a large pharmaceutical company assess the impact on its business of a major healthcare trend — “personalized medicine” for the treatment of cancerous tumors.

Subsequent articles in this series will explore case histories of applications in the other four areas (see “UTA Case Studies,” at left).

UTA involves information retrieval, parsing and indexing of tokens and phrases, and lexical analysis in order to study word frequency distributions and pattern associations. Its overarching goal is, essentially, to turn text into data for analysis via the application of natural language processing (NLP) and analytical methods.
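
To make the idea of “turning text into data” concrete, here is a minimal sketch of tokenization and frequency counting in plain Python. It is not the CIMS platform or any IBM API; the function names and the two sample sentences are made up for illustration, and counting adjacent word pairs is only a crude stand-in for real phrase indexing.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def term_frequencies(documents):
    """Count how often each token occurs across a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def bigram_frequencies(documents):
    """Count adjacent token pairs within each document."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Made-up sample sentences, for illustration only.
docs = [
    "Personalized medicine targets the tumor.",
    "The grant funds personalized medicine research.",
]
tf = term_frequencies(docs)
bf = bigram_frequencies(docs)
print(tf.most_common(3))
print(bf[("personalized", "medicine")])
```

Once text is reduced to counts like these, ordinary statistical methods can be applied to it. A production system would add steps this sketch omits, such as stemming, stop-word handling and part-of-speech tagging.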

CIMS believes every organization will need the ability to accurately read and analyze unstructured text data. IBM estimates that 80% of the world’s information exists as unstructured text residing in company data warehouses and on the World Wide Web. Without UTA, this information is largely inaccessible to the organizations that need it.

The UTA platform we have developed over the last six years takes advantage of the technological advances made in parallel processing, cloud computing and distributed file systems, like Hadoop, to capture and collate extremely large amounts of text data.

At the heart of the platform are sophisticated NLP algorithms (IBM Content Analytics with Enterprise Search 3.0) that literally read free-form text by breaking it into subjects, nouns, verbs, and predicate phrases. By defining dictionaries of key words that describe a particular topic, the platform can be instructed to perform an array of tasks — at over 2 billion operations per second!
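
The dictionary idea can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the IBM Content Analytics interface: the topic names, the dictionary terms and the `tag_topics` helper are all hypothetical, and the matching is simple whole-phrase lookup rather than true linguistic parsing.

```python
import re

# Hypothetical topic dictionaries; the terms are illustrative only.
TOPIC_DICTIONARIES = {
    "personalized_medicine": ["personalized medicine", "targeted therapy", "genetic makeup"],
    "funding": ["grant", "seed capital", "venture capital"],
}

def compile_dictionary(terms):
    """Build one case-insensitive pattern matching any term as a whole word or phrase."""
    ordered = sorted(terms, key=len, reverse=True)  # try longer phrases first
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

PATTERNS = {topic: compile_dictionary(terms) for topic, terms in TOPIC_DICTIONARIES.items()}

def tag_topics(text):
    """Return {topic: number of dictionary hits} for topics with at least one hit."""
    hits = {}
    for topic, pattern in PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            hits[topic] = count
    return hits

sample = "The grant supports personalized medicine based on the genetic makeup of a tumor."
print(tag_topics(sample))
```

Because the terms are matched as whole words, “grant” here would not match “grants”; a real dictionary would need to list or normalize such variants.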

Follow the Money

A large pharmaceutical company wanted to gauge the impact on its business of a major new healthcare trend: “personalized medicine” for the treatment of cancerous tumors. These therapies promise great hope by targeting the unique genetic makeup of an individual patient’s tumor. Specifically, they wanted to know if and when personalized medicine was likely to become a standard for hospitals, with an attendant impact on their sales of cancer drugs.

The answer to a question like this begins with identifying how far the trend in question has advanced through the well-known Innovation Cycle, and how far that is from market entry. The stages of the cycle represent the major steps, or transitions, that all new trends and concepts follow: Basic and Applied Research -> Technology Development & Demonstration -> Product Commercialization & Market Development -> Market Entry & Market Volume (see “Innovation Cycle,” next page).

We capture the importance and timing of major events by monitoring financial transitions across the Innovation Cycle.

To determine the importance and ultimate timing of these trends, we simply “follow the money” being allocated, or not, at each stage of the cycle. Specifically, we use UTA to capture and quantify the monies major funding agencies are committing to key players at each stage.

For example, to understand which concepts are in Basic Research, and why, we search the websites of government funding agencies (NIH, NSF, etc.) to see which academic institutions they have issued grants to, and for what purposes. From these transactions we infer that the agencies (and, by extension, their countries) see these concepts as promising great societal benefit for their citizens and, as a consequence, that the concepts represent potential new market opportunities for our industry partners.
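
Once funding events have been extracted from text, “following the money” reduces to adding up commitments per stage. The sketch below assumes each event has already been tagged with a stage; the `FundingEvent` record, the helper functions, and the funders, recipients and amounts are all placeholders invented for illustration, not data from our project.

```python
from collections import defaultdict
from dataclasses import dataclass

# Stage names follow the Innovation Cycle described in this article.
STAGES = [
    "Basic and Applied Research",
    "Technology Development & Demonstration",
    "Product Commercialization & Market Development",
    "Market Entry & Market Volume",
]

@dataclass
class FundingEvent:
    funder: str
    recipient: str
    stage: str
    amount_usd: float

def funding_by_stage(events):
    """Sum committed funds per stage, keeping the stages in cycle order."""
    totals = defaultdict(float)
    for event in events:
        if event.stage not in STAGES:
            raise ValueError(f"unknown stage: {event.stage}")
        totals[event.stage] += event.amount_usd
    return {stage: totals.get(stage, 0.0) for stage in STAGES}

def stages_without_funding(events):
    """List the stages at which no money was observed."""
    totals = funding_by_stage(events)
    return [stage for stage in STAGES if totals[stage] == 0.0]

# Placeholder events; names and amounts are invented.
example_events = [
    FundingEvent("Agency A", "University X", STAGES[0], 1_000_000.0),
    FundingEvent("Agency B", "University Y", STAGES[0], 2_500_000.0),
    FundingEvent("Bank C", "Company Z", STAGES[3], 4_000_000.0),
]
print(funding_by_stage(example_events))
print(stages_without_funding(example_events))
```

A stage with little or no observed funding is a signal worth investigating, though it may also mean only that the relevant sources were not captured.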

Risk Types and Weightings

As the illustration shows, there are multiple players in the Innovation Cycle whose degrees of influence and activity vary along this chain. The nature of participation is linked to the risks involved: technology, financial and market.

Technology risk stems from uncertainty regarding the characteristics and performance of the technology itself. Technology risk is greatest in the early stages of development because so little is known about the technology. And, since decision makers in the process have not yet identified its ultimate market and the possible product manifestations of the technology, little attention is paid to its market risk in these early stages.

As the attendant demonstrations and scale-ups prove out product capability, the technology moves toward market entry, and the other downstream risks such as market and financial uncertainties (market definition, size, receptivity, uptake rates, etc.) dominate.

Overall, as one nears market entry and market volume, all three risk types typically decrease.

How the Money Flows

The primary players at the idea-generation and concept-development stages of both basic and applied research are the universities and colleges whose major sources of funding are from federal government and industry-based research and development labs.

Industry, often in the form of manufacturers or major technology users, will invest through all stages, as their individual interests and economic ability allow. However, industry generally focuses on the applied research and market development/market entry stages.

Early private equity in the form of seed capital or individual (angel) investors is a source of financing for fledgling companies prior to the Market Entry stage. Angel investors, who are often serial entrepreneurs themselves, will invest in individuals or start-up companies at the pre-seed capital stage, but will generally have a progressively smaller role to play as the product approaches Market Entry.

Formal financing through risk capital typically picks up after the seed capital stage, where products have been prototyped and demonstrated, but are not necessarily manufactured in volume. And certainly, these companies are not generating revenue. Venture capitalists (VCs) fund individual companies through these stages, and frequently realize their returns (“exit”) once industry, banks and IPO markets invest.

A number of funding sources exist to help finance the final stage of the Innovation Cycle and address the remaining market risk.

In short, monitoring the stocks and flows of money along the Innovation Cycle can be a great predictor of impending market trends, as our example will now show.

Personalized Medicine’s Funding Gap

In our case of personalized medicine for cancerous tumors, “following the money” allowed us to uncover some surprising information. The pharmaceutical firm wanted to know in which major medical centers personalized medicine for the treatment of cancerous tumors had become the standard of care. The answer we found through UTA was that it is not the standard of care in any of the major medical centers.

Our analysis revealed that personalized medicine is only used/applied in research and teaching hospitals — the same ones that had received substantial grants in basic and applied research. These grants were being used to support selective personalized medicine approaches; institutions without these funds could not afford to carry out these studies.

How exactly did UTA guide us to this conclusion? We first searched the website of the NCI (National Cancer Institute), and those of other federal funding agencies worldwide, for the academic institutions that had received major research grants in the field of personalized medicine. We then analyzed these grants to determine the names of their Principal Investigators and their key co-investigators.

Because these people had received major, prestigious grants for their contributions to personalized medicine, we deem them Key Opinion Leaders (KOLs). Again, we followed the money to these people and their academic affiliations.
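
As a rough sketch of this step, suppose each grant has already been parsed into a small record. The field names, the `collect_kols` helper and the optional award threshold are assumptions made for illustration (this article does not state a cut-off for a “major” grant), and the people and institutions are placeholders.

```python
from collections import defaultdict

def collect_kols(grants, min_award_usd=0.0):
    """Map each investigator on a qualifying grant to the set of their institutions."""
    kols = defaultdict(set)
    for grant in grants:
        if grant["award_usd"] < min_award_usd:
            continue  # skip grants below the assumed threshold
        people = [grant["principal_investigator"], *grant.get("co_investigators", [])]
        for person in people:
            kols[person].add(grant["institution"])
    return dict(kols)

# Placeholder grant records; names and amounts are invented.
example_grants = [
    {
        "institution": "University X",
        "principal_investigator": "Jane Doe",
        "co_investigators": ["John Roe"],
        "award_usd": 2_000_000.0,
    },
    {
        "institution": "University Y",
        "principal_investigator": "Jane Doe",
        "award_usd": 50_000.0,
    },
]
print(collect_kols(example_grants, min_award_usd=1_000_000.0))
```

The result pairs each candidate KOL with their academic affiliations, which is the form needed for the name searches that follow.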

We then compared the KOL dataset to our “VC dataset.” The VC dataset consists of the websites of the top 400 accredited venture capital organizations in North America focused on life sciences. The dataset contains approximately 3.5 million web files that we update monthly. This dataset describes these firms’ portfolios of fledgling new ventures and has proved vital for monitoring new technologies (not yet products) trying to traverse the familiar “Valley of Death.”

Traversing the Valley of Death

VCs and other private equity organizations routinely navigate the Valley of Death, the treacherous decision space between recognition of a potential commercial opportunity and its actual realization. A VC’s job is to spy promising new concepts still in early stage technology development, help their inventors demonstrate and develop these concepts into viable products, and launch them as profitable ventures.

Our logic was that KOLs are often the founders of these high-tech ventures, or at least sit on the “science advisory boards” of their own ventures and others. Searching the VC dataset with their names should yield the number and status of their ventures: Were they approaching IPO? Were established biopharma companies showing interest in them? Had those companies acquired these ventures or established manufacturing and distribution alliances with them? All of this information is readily discernible using UTA.
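
In its simplest form, a query like this is a name search across a document collection. The sketch below is a hypothetical, much-reduced version in plain Python: `find_mentions`, the two sample pages and the names are invented, and exact name matching stands in for the entity recognition a real platform would use.

```python
import re

def name_pattern(full_name):
    """Match the parts of a name in order, allowing any whitespace between them."""
    parts = [re.escape(part) for part in full_name.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

def find_mentions(names, documents):
    """Return {name: [doc_id, ...]} for every name found in at least one document."""
    patterns = {name: name_pattern(name) for name in names}
    mentions = {}
    for doc_id, text in documents.items():
        for name, pattern in patterns.items():
            if pattern.search(text):
                mentions.setdefault(name, []).append(doc_id)
    return mentions

# Placeholder names and pages, for illustration only.
kol_names = ["Jane Doe", "John Roe"]
vc_pages = {
    "vc_page_1": "Jane  Doe joined the science advisory board of a new venture.",
    "vc_page_2": "Portfolio update with no named advisers.",
}
print(find_mentions(kol_names, vc_pages))
```

Exact matching of this kind misses initials, middle names and misspellings, and it cannot tell two people with the same name apart, so each hit still has to be read in context.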

Our Surprising Finding

To our surprise, and that of the project’s sponsor, when we used our UTA platform to query the VC dataset containing millions of documents with the names of the KOLs, we had only 36 “hits.” Of these, only two KOLs were members of the science advisory boards of new ventures, and neither of these companies was associated with personalized medicine approaches.

That told us that VCs see the market risks associated with personalized medicine (and maybe the remaining technological risks as well) as being too formidable at this time for them to make a substantial financial commitment to it. These results point clearly to a pre-commercial funding gap for personalized medicine.

This information alerts our sponsor, and other major pharmaceutical firms, that if they want to lead their industry with these promising technologies, they shouldn’t wait for VCs to finance the development of personalized medicine, which could take years. Instead, they should engage KOLs directly to help them develop their technologies.

From its years of helping organizations traverse the Valley of Death, CIMS knows this is when technologies are most malleable and need to be linked to the enduring needs of customers in order to be successful. Given their proximity to the market and their considerable commercialization resources and experience, innovative pharmaceutical companies will find that this strategy makes good business sense.

Track Data-Driven Trends

This case shows how UTA can be used effectively to track data-driven trends, in this case the macro trends buffeting the healthcare industry that are highlighted by the financial transactions occurring at every point in the innovation cycle of new drugs and therapies. Given the vast amount of unstructured data that exists in health sciences, we could only have arrived at this result in the way we have described.

UTA is equally adept at tracking the “micro” trends impacting a company. In a project with a major computer manufacturer, we used UTA to determine the satisfaction of their customers with particular product types, service response times, the local sales team, contracting and pricing issues, etc.

We did this by “reading” and analyzing 100,000 posts on the company’s web forum and comparing them to 6,000 customer satisfaction survey responses. In other words, through UTA we were able to create a complete and almost real-time picture of the customer’s experience with the company.

Whether tracking data-driven macro or micro trends, UTA works. In future IMRs we will describe other, equally productive, projects with organizations where we have used additional features of UTA to inform their business and innovation strategy.

Paul Mugge is Executive Director, Center for Innovation Management Studies (CIMS), and Innovation Professor of the Poole College of Management, NC State University; pmugge@ncsu.edu

Richard Kouri is Executive Director, Bioscience Management MBA Initiative, in the Jenkins Graduate School of the Poole College of Management, NC State University; richard_kouri@ncsu.edu
