This was the theme of the Fall 2010 CIMS Sponsors Meeting held in Raleigh this past October. And what a meeting it was! Over 125 people attended presentations ranging from how to be both an “exploitive and exploring organization,” by CIMS Academic Fellow and Brown University engineering professor Angus Kingon, to a whole new business model for higher education, called CONII: Colleges Ontario Network for Industry Innovation, as explained by Trish Dryden, associate vice-president, research and corporate planning, Centennial College of Toronto, Canada.
But without question, the highest level of participation came on the day CIMS researchers described what CIMS is doing ‘in the cloud’ to give companies seeking competitive advantage meaning and relevance from the petabytes of data created every day in cyberspace.
In the Spring 2010 edition of this Technology Management Report, I announced that with the help of the IBM Company, CIMS was building a cloud computing environment on NC State’s Virtual Computing Lab (see “Together CIMS, NC State and IBM Seed Smarter Clouds” at http://cims.ncsu.edu/downloads/newsletters/62_CIMSspring10.pdf). Using advanced software tools from IBM, we would attempt to do “text analytics” in search of answers to common innovation questions. That is, we would gather, filter, annotate, and eventually make sense of the massive amounts of unstructured data that reside on the worldwide web in the form of web pages, wikis, blogs, etc.
Just What Is Text Analytics?
This was a big topic of discussion at the meeting. Current information retrieval techniques are often based on statistical models of frequency analysis performed on indexed data. This is the basis of popular search engines such as Google, Illumin8, and Thomson Innovation. While this approach produces rapid results, the relevance of those results to the searcher is often limited. The process sacrifices precision for speed. Statistics-based searches perform no contextual analysis, so the user must spend time determining the context of each result.
Text analytics, in contrast, is based on analyzing written language using software algorithms that capture the rules of grammar. With these rules, dictionaries and thesauri, users can generate search criteria that become more precise with regard to their required content. This allows the users to set priorities in evaluating search results.
By evaluating the association of words within a sentence, paragraph, page, or document, it becomes possible to establish a hierarchy of those web pages likely to contain the information the user is seeking. Literally millions of pages are reduced to a “readable” set. Further, visual displays and text highlighting techniques make final analysis of the reduced set of records fast and easy.
This is a fundamentally different approach from statistics-based determinations of relationships. The software is able to “pre-screen” the text for the context the user defines.
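The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the IBM software and does not parse grammar; sentence-level co-occurrence stands in for contextual analysis, and the thesaurus, page text, and function names are all hypothetical.

```python
import re

# Hypothetical thesaurus: each concept maps to terms treated as equivalent.
THESAURUS = {
    "polymer": {"polymer", "polymers", "resin", "resins"},
    "coating": {"coating", "coatings", "film", "films"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_score(page, concepts):
    """Statistics-style score: count matching terms anywhere on the page."""
    tokens = tokenize(page)
    return sum(1 for t in tokens for c in concepts if t in THESAURUS[c])

def context_score(page, concepts):
    """Context-style score: count sentences in which all concepts co-occur."""
    score = 0
    for sentence in re.split(r"[.!?]+", page):
        words = set(tokenize(sentence))
        if all(words & THESAURUS[c] for c in concepts):
            score += 1
    return score

# Invented page text, for illustration only.
pages = {
    "page_a": "Polymers are everywhere. Polymers matter. A coating protects steel.",
    "page_b": "Our lab develops resin films for turbine blades.",
}
concepts = ["polymer", "coating"]

# Rank pages by how often the concepts appear together in one sentence.
ranked = sorted(pages, key=lambda name: context_score(pages[name], concepts), reverse=True)
print(ranked)
```

In this toy example the frequency count favors page_a, which mentions the terms more often, while the context score favors page_b, the only page where both concepts appear in the same sentence.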
CIMS and Users Work Together
Understanding how to generate an objective question with defined terms that convert a user’s ideas to machine actions requires knowledge of the process beyond the simple models of information retrieval currently in use. CIMS faculty members work with users to clearly define the query.
In order for the tool to provide relevant information, we will break down the query into several objective questions. Because the process becomes iterative, complexity can be built in as results are obtained. The more specific the query becomes, the more relevant the returns. For example, the question of “who’s active in polymer coatings?” may not provide as much strategic information as the question of “how much are companies involved with polymer coating spending on R&D?” The software is capable of recognizing integers and performing mathematical calculations on the data analogous to many of the functions in common spreadsheet programs.
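As a rough illustration of recognizing integers in text and calculating with them, here is a minimal Python sketch. The company names, sentences, and dollar figures are invented, and the simple pattern match is an assumption made for the example, not a description of how the IBM tools work.

```python
import re

# Hypothetical snippets, invented for illustration only.
snippets = [
    "Acme Coatings spent 12 million dollars on R&D last year.",
    "Beta Polymers reported R&D spending of 30 million dollars.",
    "Gamma Films opened a new plant in Ohio.",
]

# Matches an integer followed by the words "million dollars".
AMOUNT = re.compile(r"(\d+) million dollars")

def rd_spend_millions(text):
    """Return the integer amount if the sentence mentions R&D, else None."""
    if "R&D" not in text:
        return None
    match = AMOUNT.search(text)
    return int(match.group(1)) if match else None

# Keep only the sentences that yielded a figure, then do spreadsheet-style math.
amounts = [a for a in map(rd_spend_millions, snippets) if a is not None]
total = sum(amounts)
average = total / len(amounts)
print(total, average)
```

A more specific question ("how much is spent on R&D?") yields numbers that can be totaled and averaged, whereas the broader question ("who is active?") yields only a list of names.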
The software can deal with both unstructured and structured data. Files generated from ERP systems can be incorporated for data analysis by these tools. All the data generated from these queries can be stored as text files.
Progress to Date
The prototype project was developed with NC State’s Office of Technology Transfer in response to the all-too-common question, “Who would be a good partner to help me develop my invention?” At the October meeting, we not only reported the successful conclusion of that prototype project but also announced that three more research projects had begun with CIMS members: the Drug Discovery Center of Innovation, the Plant Sciences division of BASF, and the Global Supply organization of Eisai.
All three of these projects are being performed under a special engagement model worked out with IBM and NC State’s Contracts and Grants Office (see illustration below). Under this special arrangement, CIMS members have free access to the VCL computing resources and IBM software for the duration of the research project. Moreover, they will be able to host their meta data files (MDF) on the VCL until we have properly prepared the people in their organizations to use this information to make more informed business decisions.
CIMS Academic Fellow Mariann Jelinek, a research team member, calls this process the “informating” of critical, strategic business decisions.
At this time, we constrain the extent of the crawls performed in the proof-of-concept experiments (Phase 2) because our current hardware storage capacity is limited to 5 TB. We generally crawl the web for two weeks, gathering up to 400 websites containing 10 million pages of text. However, in Phase 3 we plan to build much larger MDFs, perhaps 1,000 times as large! Right now work is underway to secure the hardware required to support these truly massive data sets.
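To put these figures in perspective, the back-of-the-envelope arithmetic can be sketched in Python. It assumes decimal terabytes and assumes that a full crawl must fit within the entire 5 TB; the article states neither, so the results are rough bounds rather than measured values.

```python
TB = 10**12  # decimal terabyte; an assumption, the unit is not specified
PB = 10**15  # decimal petabyte

capacity_bytes = 5 * TB        # current storage limit
pages_per_crawl = 10_000_000   # pages gathered in a typical crawl
scale_factor = 1_000           # "perhaps 1,000 times as large"

# Upper bound on average stored bytes per page if one crawl fills the capacity.
max_bytes_per_page = capacity_bytes // pages_per_crawl

# Storage implied if a Phase 3 data set were 1,000 times a full 5 TB store.
phase3_petabytes = capacity_bytes * scale_factor / PB

print(max_bytes_per_page, phase3_petabytes)
```

Under these assumptions, a crawl averages at most about 500 KB of stored data per page, and a data set 1,000 times as large would call for storage on the order of 5 PB.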
Why this Capability Is So Important
CIMS believes that competitive advantage will go to those companies that harness the meaning and power of the massive amounts of data being generated every day in order to understand:
-the macro trends buffeting their industry
-the new opportunities these trends permit
-the capabilities of present and emerging competitors, and
-the global network of qualified and eager “innovation partners” at their disposal.
How To Engage In a CIMS Research Project
If you believe your organization would benefit from having such an information advantage, or if you just have questions about any of the topics addressed in this article, please don’t hesitate to email me at Paul_Mugge@ncsu.edu.