The SIGIR 2010 Industry Track, organized by David Harper (Google, Switzerland) and Peter Schäuble (Eurospider, Switzerland), was a success. In the morning session, four keynote talks were given by influential technical leaders (Baidu, Google, Bing, Yandex). During the afternoon session, seven presentations showed interesting, novel, and innovative ideas from the search industry.
William Chang, Baidu
Abstract: The China Economic Miracle has produced thirty years of sustained 10% GDP growth, allowing China to overtake Japan. Recently, concerned with social issues, debt safety, high commodity prices and weak exports, China has sought to tame that part of GDP derived from “real estate as securities”, i.e., properties constructed for the purpose of trade instead of use. Instead, China has turned to urgently exhorting domestic consumption, but the country lacks many of the foundations of Information Enabled Commerce (IEC), which is, by contrast, the epitome of American ingenuity and the very source of America's global competitive edge. We will survey some of these inventions for IEC from the viewpoint of IR. We will also survey China's demographic, social, cultural, and economic background and the role information now plays in people's daily lives, showcasing successful applications and business models that can suggest further opportunities for IR and IEC. Although China's 30-year development hasn't really built many of the things that the West takes for granted, people there are beginning to try: China's emerging Internet commerce has already exceeded 1% of GDP and is expected to double this year.
Jan Pedersen, Bing
Abstract: Web Search is a modern marvel because it is able to produce very relevant results from relatively short queries evaluated over a vast database. Much of the magic is due to query understanding, the technology that analyzes a user query and produces a suitable backend search expression. This technology corrects common orthographic errors, expands terms to their semantically similar equivalents, and groups terms into concepts. I will discuss the language models behind these technologies and their role in a canonical web search engine architecture.
Ilya Segalovich, Yandex
Abstract: This talk will discuss the machine learning approach to search quality problems at Yandex, the largest search engine in the Russian Federation. We focus on a number of learning approaches that are vital in solving large-scale IR problems, and explore the capabilities and prospects of machine learning in search quality, as well as the problems that appear in handling real-world data sets, based on our experiences at Yandex. We also describe the Internet Mathematics 2009 contest, which was organized by Yandex to stimulate research in the fields of data analysis and ranking methods.
Andrei Broder, Yahoo! Labs
Abstract: The classic Web search experience, consisting of returning “ten blue links” in response to a short user query, is powered today by a mature technology where progress has become incremental and expensive. Furthermore, the “ten blue links” represent only a fractional part of the total Web search experience, which now is heavily directed to satisfying the unexpressed “query intent”. The latter is associated with several challenges, including, but not limited to:
We believe that these are the areas where breakthroughs are most likely and scientific research can be most worthwhile, and we invite the SIGIR community to direct its attention to these only partially explored frontiers.
Thomas Arni, Eurospider
Abstract: Since 2002, the Swiss Supreme Court has offered cross-language information retrieval to search for decisions by this court. We report the lessons learned when developing and maintaining the system. In particular, we elaborate on the customization of cross-language retrieval to the legal domain. Finally, we summarize the results of a log data analysis and discuss our conclusions.
Garret Swart, Oracle
Abstract: Real-time indexing means making documents searchable as they are inserted into the corpus. Achieving a high ingest rate while maintaining good query performance is the challenge. A KWIC index for a large corpus is often too large to be cost effectively stored in memory or even Solid State Disk. Storing a keyword’s posting list on rotating disk requires storing it in large contiguous extents that can be efficiently read from disk during a search. Indexing a new batch of documents requires appending new postings for each distinct keyword used in the batch. Appending the postings physically incurs seeks as the documents are indexed, while appending the postings logically incurs seeks when the posting lists are accessed while answering a query.
In this talk we present a hybrid approach for storing posting lists that is implemented in Oracle Text, a feature of the Oracle RDBMS. Oracle Text utilizes database structures to implement text indexes: hash partitioning for parallelism, B-Trees for indexing terms and semantic tags, LOBs for storing compressed posting lists, and write-ahead logs for quickly committing updates. We perform real-time index maintenance by using memory and Solid State Disk as a staging area for new postings, allowing postings to accumulate so we can reduce the number of writes to the rotating media. We show how to model online index maintenance and size the staging area appropriately for the ingest rate, the query rate, IO system performance and the maximum recovery time.
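The staging idea in the abstract above can be illustrated with a toy sketch. This is not Oracle Text's implementation: the class `StagedPostingIndex`, its `flush_threshold` parameter, and the in-memory dictionary standing in for rotating disk are all assumptions made for illustration. It only shows how letting postings accumulate before they are appended reduces the number of posting-list writes, while queries still see staged documents immediately.

```python
from collections import defaultdict


class StagedPostingIndex:
    """Toy inverted index with an in-memory staging area (hypothetical design)."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # staged postings allowed before a flush
        self.staged = defaultdict(list)         # term -> doc ids held in the staging area
        self.staged_count = 0
        self.on_disk = defaultdict(list)        # term -> doc ids in the main posting list
        self.disk_writes = 0                    # posting-list appends to the simulated disk

    def add_document(self, doc_id, text):
        # Stage one posting per distinct term in the document.
        for term in set(text.lower().split()):
            self.staged[term].append(doc_id)
            self.staged_count += 1
        if self.staged_count >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One append per distinct staged term, however many postings it holds.
        for term, doc_ids in self.staged.items():
            self.on_disk[term].extend(doc_ids)
            self.disk_writes += 1
        self.staged.clear()
        self.staged_count = 0

    def search(self, term):
        # Merge the main posting list with staged postings so that
        # newly inserted documents are searchable before any flush.
        term = term.lower()
        return self.on_disk.get(term, []) + self.staged.get(term, [])
```

With a threshold of 1, every document is flushed as it arrives, so two three-word documents that share two terms cost six appends; with a large threshold the same two documents cost four appends at flush time, one per distinct term. Sizing the threshold against ingest rate, query rate, and recovery time is the modeling problem the abstract refers to, and is not attempted here.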
Why: KWIC index management may seem like it's been done to death, but changes in system architectures, memory capacities, storage technologies (like Flash) and new requirements (like real-time) change the shape of the problem and the feasible solutions. Index performance often gates the introduction of advanced search quality algorithms and interactivity aids. The specific problem addressed, balancing query and document ingest performance, may seem tricky, but it is amenable to some generally useful modeling and optimization techniques. While performance was the primary topic at only one of last year's SIGIR sessions, it still remains central to the field and, presented in an engaging way, it should be accessible and interesting to your participants. Last year's industrial track appeared to have a business rather than an engineering focus, but it looks as though this year you are looking for new ideas rather than business models.
Daniel E. Rose, A9.com
Abstract: For years, commercial search engines have learned from and built upon decades of research from the information retrieval community. While this has been immensely valuable, commercial search developers often face a variety of usage scenarios that are quite different from what’s described in the literature. For example, most work in Information Retrieval has focused on unstructured text documents, viewed largely as a static corpus to be ranked in isolation from user interactions. In contrast, Product Search engines face a very different environment. Our documents are often highly structured, with a mix of trusted and untrusted attributes; traditional IR concepts may not function as intended. Our users routinely alternate between searching and browsing, between typing keywords and choosing refinements. Product search relevance is strongly influenced by our users’ behavior. Often we know their identity, and they have explicitly entrusted us with personal information. And long before real-time information sharing services became popular, we had to manage a system where items rapidly come and go from the index, as new merchants begin offering products for sale and existing products go out of stock; these changes need to be reflected in search results immediately. Working in this environment has given us a different perspective on some fundamental issues in search. In this presentation, I’ll talk about some of the unique characteristics of the product search problem, and then share some lessons and challenges that may apply to anyone working in the search field.
Xavier Amatriain, Telefónica
Abstract: Telefónica is the world’s 3rd largest telecommunications company by market cap. Its activities are centered mainly on the fixed and mobile telephony businesses, while its broadband business is the key growth driver underpinning both. It operates in 25 countries and its customer base exceeds 264 million globally.
Telefónica I+D is the innovation company of the Telefónica Group. Founded in 1988, it contributes to the Group's competitiveness and modernity through technological innovation. To achieve this aim, the company applies new ideas, concepts and practices in addition to developing advanced products and services. It is the largest private R+D center in Spain as regards activity and resources, and is the most active company in Europe in terms of European research projects in the ICT sector (Information and Communication Technology). It currently collaborates with technological leaders and numerous organizations in 42 different countries - among which figure more than 150 universities located in different parts of the world. It also participates in the most important international forums on technological know-how, thus creating one of the largest innovation ecosystems in the ICT sector.
Neel Sundaresan, eBay
Abstract: In this talk we will explore the technical and operational challenges and related research opportunities in searching, finding, buying and selling in a long tail marketplace like eBay. The challenges (and opportunities) are posed by the diversity of products, buyers, and sellers along many dimensions of categories, quality, price, reputation, formats, and inventory.
Typical item listings are expressed in just a few words in the title and further described through item descriptions. Most searches by prospective buyers are made over the terms in the title, which poses unique challenges for sellers to use the right terms and for buyers to search by these terms. The constrained nature of this space has given rise to a unique vocabulary between the buyers and sellers. Algorithmic approaches to spell checking, synonyms, linguistic tagging, query rewriting and expansion revolve around these issues.
Also, the non-catalog nature of the inventory, the diversity of sellers and buyers in terms of trust, expertise, and incentive, and the difference between buyer and seller language pose unique challenges for the search ranking problem and for building recommender systems.
In this talk we will explore the research and technical approaches to query understanding, ranking, and recommendations in our marketplace and how they differ from web search. We will also touch upon trust and reputation systems and how they play a role in the design of the search system. We will discuss how relevance, trust, value, and diversity interact in such a finding ecosystem.
Carlos Castillo, Yahoo! Research
Abstract: In recent years, clicks and query reformulation have become major signals in large-scale evaluation of Web Search engines. User activity, such as the sequence of clicks and query reformulations in a search session, is a key means of evaluating user satisfaction. User inactivity following the return of results for a given query (also known as abandonment) is typically considered a failure by the engine to satisfy the user's need.
However, all search engines now offer experiences that aim to satisfy users' needs by impression only, on the results page itself, without requiring any click. This "instant satisfaction" experience is achieved, for instance, through shortcuts/"oneboxes" (e.g., for weather forecasts, package tracking and flight information) and rich abstracts (e.g., by providing phone numbers and addresses of businesses). As this trend is growing, a fundamental question is how to automatically measure the success of such experiences, where the signal for success (no user interaction) is similar to the negative signal of abandonment.
This work posits the existence of a class of "tenacious" users, who very rarely abandon a search session when not satisfied. They will click on results with imperfect abstracts, reformulate queries, or otherwise persist in their attempts to obtain answers from the search engine. Thus, for them, no activity on the results page is actually an indication of satisfaction. This work uncovers such tenacious users through automatic analysis of search engine logs, and taps them to identify search sessions that were satisfied by impression.
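A rough sketch of how such a log analysis might look is given below. The abstract does not say how tenacious users are detected, so everything here is an assumption made for illustration: the session fields (`id`, `user`, `has_onebox`, `clicked`, `reformulated`), the thresholds `max_abandon_rate` and `min_sessions`, and the choice to estimate tenacity only from result pages without a onebox.

```python
from collections import defaultdict


def find_tenacious_users(sessions, max_abandon_rate=0.1, min_sessions=3):
    """Return users who rarely abandon plain result pages (assumed criterion)."""
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for s in sessions:
        if s["has_onebox"]:
            # Pages that may satisfy by impression are excluded, so that
            # inactivity here can only mean abandonment.
            continue
        totals[s["user"]] += 1
        if not s["clicked"] and not s["reformulated"]:
            abandoned[s["user"]] += 1
    return {
        user
        for user, n in totals.items()
        if n >= min_sessions and abandoned[user] / n <= max_abandon_rate
    }


def satisfied_by_impression(sessions, tenacious):
    """Return ids of sessions where a tenacious user showed no activity."""
    return [
        s["id"]
        for s in sessions
        if s["user"] in tenacious and not s["clicked"] and not s["reformulated"]
    ]
```

Under these assumptions, a user who clicks or reformulates on every plain result page is labeled tenacious, and a later session in which that user does nothing is counted as satisfied by impression rather than as abandonment. A user who often leaves plain result pages without acting contributes no such sessions.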