SIGIR-2010

The SIGIR 2010 Industry Track, organized by David Harper (Google, Switzerland) and Peter Schäuble (Eurospider, Switzerland), was a success. In the morning session, four keynote talks were given by influential technical leaders (Baidu, Google, Bing, Yandex). During the afternoon session, seven presentations showcased interesting, novel, and innovative ideas from the search industry.

The presentation in PDF format.

Speaker: William Chang, Baidu

Abstract: The China Economic Miracle has produced thirty years of sustained 10% GDP growth, allowing China to overtake Japan. Recently, concerned with social issues, debt safety, high commodity prices and weak exports, China has sought to tame that part of GDP derived from “real estate as securities”, i.e., properties constructed for the purpose of trade instead of use. Instead, China has turned to urgently encouraging domestic consumption, but the country lacks many of the foundations of Information Enabled Commerce (IEC), which, by contrast, is the epitome of American ingenuity and the very source of America's global competitive edge. We will survey some of these inventions for IEC from the viewpoint of IR. We will also survey China's demographic, social, cultural, and economic background and the role information now plays in people's daily lives, showcasing successful applications and business models that can suggest further opportunities for IR and IEC. Although China's 30-year development hasn't really built many of the things that the West takes for granted, people there are beginning to try: China's emerging Internet commerce has already exceeded 1% of GDP and is expected to double this year.

The presentation in PDF format.

Speaker: Yossi Matias, Google

Abstract: This talk will discuss some recent developments in Search, emerging in various shapes and forms. We will highlight some challenges, and point to some search trends that play an increasing role in multiple domains.

The presentation in PDF format.

Speaker: Jan Pedersen, Bing

Abstract: Web Search is a modern marvel because it is able to produce very relevant results from relatively short queries evaluated over a vast database. Much of the magic is due to query understanding; the technology that analyzes a user query and produces a suitable backend search expression. This technology corrects common orthographic errors, expands terms to their semantically similar equivalents, and groups terms into concepts. I will discuss the language models behind these technologies and their role in canonical web search engine architecture.
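The three rewriting steps named in this abstract (spelling correction, term expansion, and producing a backend search expression) can be illustrated with a toy sketch. It is not Bing's method: it swaps the language models mentioned in the abstract for simple string similarity from Python's standard `difflib` module, and the vocabulary, synonym table, and function name are hypothetical.

```python
import difflib

# Hypothetical toy data; a real system would derive these from logs and corpora.
VOCABULARY = ["cheap", "flights", "hotel", "zurich", "geneva"]
SYNONYMS = {"cheap": ["inexpensive", "budget"], "hotel": ["accommodation"]}


def rewrite_query(query):
    """Turn a raw user query into a boolean backend expression (toy sketch)."""
    clauses = []
    for token in query.lower().split():
        # Spelling correction: snap the token to the closest known term, if any.
        match = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.75)
        term = match[0] if match else token
        # Expansion: OR the term with its listed synonyms.
        alternatives = [term] + SYNONYMS.get(term, [])
        clauses.append("(" + " OR ".join(alternatives) + ")")
    # All query terms are required, so the clauses are ANDed together.
    return " AND ".join(clauses)


print(rewrite_query("cheep hotel"))
```

A production system would also weigh corrections by context and query frequency, which this sketch ignores.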

The presentation in PDF format.

Speaker: Ilya Segalovich, Yandex

Abstract: This talk will discuss the machine learning approach to search quality problems at Yandex, the largest search engine in the Russian Federation. We focus on a number of learning approaches that are vital in solving large-scale IR problems, and explore the capabilities and prospects of machine learning in search quality, as well as the problems that appear in handling real-world data sets, based on our experiences at Yandex. We also describe the Internet Mathematics 2009 contest, which was organized by Yandex to stimulate research in the fields of data analysis and ranking methods.

The presentation in PDF format.

Speaker: Andrei Broder, Yahoo! Labs

Abstract: The classic Web search experience, consisting of returning “ten blue links” in response to a short user query, is powered today by a mature technology where progress has become incremental and expensive. Furthermore, the “ten blue links” represent only a fractional part of the total Web search experience, which now is heavily directed to satisfying the unexpressed “query intent”. The latter is associated with several challenges, including, but not limited to:

Assisting the users in creating “better” queries reflecting their needs – this is typically supported by pre-search tools such as query suggestions, query completion, spellchecking, etc.
Integrating rich and complex data sources – from real-time social feeds and user profiles to pictures, videos, and maps, to massive amounts of editorial content and public commentaries
Integrating Web derived knowledge, well beyond entity extraction, towards building and representing interrelationships between known entities, thus enabling users to search not only the “Web of Pages” but the “Web of Things.”
Integrating tools that facilitate a superior post-search experience such as saving and annotating results, sharing them, extracting data, filtering, etc.
Facilitating the integration of third party applications to enable a richer, more diversified, and more satisfying user experience.
And the most intriguing of all, “implicit search” where users see their needs addressed without even specifying a query.
We believe that these are the areas where breakthroughs are most likely and scientific research can be most worthwhile, and we invite the SIGIR community to direct its attention to these only partially explored frontiers.

The presentation in PDF format.

Speaker: Thomas Arni, Eurospider

Abstract: Since 2002, the Swiss Supreme Court has offered cross-language information retrieval to search for decisions by this court. We report the lessons learned when developing and maintaining the system. In particular, we elaborate on the customization of cross-language retrieval to the legal domain. Finally, we summarize the results of a log data analysis and discuss our conclusions.

The presentation in PDF format.

Speaker: Garret Swart, Oracle

Abstract: Real-time indexing means making documents searchable as they are inserted into the corpus. Achieving a high ingest rate while maintaining good query performance is the challenge. A KWIC index for a large corpus is often too large to be cost effectively stored in memory or even Solid State Disk. Storing a keyword’s posting list on rotating disk requires storing it in large contiguous extents that can be efficiently read from disk during a search. Indexing a new batch of documents requires appending new postings for each distinct keyword used in the batch. Appending the postings physically incurs seeks as the documents are indexed, while appending the postings logically incurs seeks when the posting lists are accessed while answering a query.

In this talk we present a hybrid approach for storing posting lists that is implemented in Oracle Text, a feature of the Oracle RDBMS. Oracle Text utilizes database structures to implement text indexes: hash partitioning for parallelism, B-Trees for indexing terms and semantic tags, LOBs for storing compressed posting lists, and write-ahead logs for quickly committing updates. We perform real-time index maintenance by using memory and Solid State Disk as a staging area for new postings, allowing postings to accumulate so we can reduce the number of writes to the rotating media. We show how to model online index maintenance and size the staging area appropriately for the ingest rate, the query rate, IO system performance and the maximum recovery time.
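The staging idea described above can be sketched as a toy in-memory model. This is not Oracle Text's implementation: the class, its fields, and the flush policy (flush when a fixed number of postings has accumulated) are assumptions made for illustration, and the "disk" is just a second dictionary with a write counter.

```python
from collections import defaultdict


class StagedIndex:
    """Toy inverted index with a staging area for new postings (hypothetical)."""

    def __init__(self, staging_limit):
        self.staging_limit = staging_limit  # postings allowed in staging before a flush
        self.staging = defaultdict(list)    # term -> new postings (stands in for memory/SSD)
        self.staged_count = 0
        self.main = defaultdict(list)       # term -> posting list (stands in for rotating disk)
        self.main_writes = 0                # number of posting-list appends to the main store

    def add_document(self, doc_id, text):
        # New postings go to the staging area only, so ingest stays cheap.
        for term in set(text.lower().split()):
            self.staging[term].append(doc_id)
            self.staged_count += 1
        if self.staged_count >= self.staging_limit:
            self.flush()

    def flush(self):
        # One append per distinct term, however many documents contributed postings.
        for term, postings in self.staging.items():
            self.main[term].extend(postings)
            self.main_writes += 1
        self.staging.clear()
        self.staged_count = 0

    def search(self, term):
        # Queries merge both stores, so documents are searchable immediately.
        term = term.lower()
        return sorted(self.main.get(term, []) + self.staging.get(term, []))


idx = StagedIndex(staging_limit=100)
idx.add_document(1, "real time indexing")
idx.add_document(2, "real time search")
print(idx.search("real"), idx.main_writes)
idx.flush()
print(idx.search("real"), idx.main_writes)
```

In this toy model a larger `staging_limit` means fewer writes to the main store but more unflushed postings to recover after a failure, which is the kind of trade-off the sizing model in the abstract addresses.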

Why: KWIC index management may seem like it's been done to death, but changes in system architectures, memory capacities, storage technologies (like Flash) and new requirements (like real-time) change the shape of the problem and the feasible solutions. Index performance often gates the introduction of advanced search quality algorithms and interactivity aids. The specific problem addressed, balancing query and document ingest performance, may seem tricky but it is amenable to some generally useful modeling and optimization techniques. While performance was the primary topic at only one of last year's SIGIR sessions, it still remains central to the field and, presented in an engaging way, it should be accessible and interesting to your participants. Last year's industrial track appeared to have a business rather than an engineering focus but it looks as though this year you are looking for new ideas rather than business models.

The presentation in PDF format.

Speaker: Daniel E. Rose, A9.com

Abstract: For years, commercial search engines have learned from and built upon decades of research from the information retrieval community. While this has been immensely valuable, commercial search developers often face a variety of usage scenarios that are quite different from what’s described in the literature. For example, most work in Information Retrieval has focused on unstructured text documents, viewed largely as a static corpus to be ranked in isolation of user interactions. In contrast, Product Search engines face a very different environment. Our documents are often highly structured, with a mix of trusted and untrusted attributes; traditional IR concepts may not function as intended. Our users routinely alternate between searching and browsing, between typing keywords and choosing refinements. Product search relevance is strongly influenced by our users’ behavior. Often we know their identity, and they have explicitly entrusted us with personal information. And long before real‐time information sharing services became popular, we had to manage a system where Items rapidly come and go from the index, as new merchants begin offering products for sale and existing products go out of stock; these changes need to be reflected in search results immediately. Working in this environment has given us a different perspective on some fundamental issues in search. In this presentation, I’ll talk about some of the unique characteristics of the product search problem, and then share some lessons and challenges that may apply to anyone working in the search field.

The presentation in PDF format.

Speaker: Xavier Amatriain, Telefónica

Abstract: Telefónica is the world’s 3rd largest telecommunications company by market cap. Its activities are centered mainly on the fixed and mobile telephony businesses, while its broadband business is the key growth driver underpinning both. It operates in 25 countries and its customer base exceeds 264 million globally.

Telefónica I+D is the innovation company of the Telefónica Group. Founded in 1988, it contributes to the Group's competitiveness and modernity through technological innovation. To achieve this aim, the company applies new ideas, concepts and practices in addition to developing advanced products and services. It is the largest private R+D center in Spain as regards activity and resources, and is the most active company in Europe in terms of European research projects in the ICT sector (Information and Communication Technology). It currently collaborates with technological leaders and numerous organizations in 42 different countries - among which figure more than 150 universities located in different parts of the world. It also participates in the most important international forums on technological know-how, thus creating one of the largest innovation ecosystems in the ICT sector.

The presentation in PDF format.

Speaker: Neel Sundaresan, eBay

Abstract: In this talk we will explore the technical and operational challenges and related research opportunities in searching, finding, buying and selling in a long tail marketplace like eBay. The challenges (and opportunities) are posed by the diversity of products, buyers, and sellers along many dimensions of categories, quality, price, reputation, formats, and inventory.

Typical item listings are expressed in just a few words in the title and further described through item descriptions. Most searches by prospective buyers are made over the terms in the title, which poses unique challenges for sellers to use the right terms and for buyers to search by these terms. The constrained nature of this space has given rise to a unique vocabulary between the buyers and sellers. Algorithmic approaches to spell checking, synonyms, linguistic tagging, query rewriting and expansion revolve around these issues.

Also, the non-catalog nature of the inventory, seller and buyer diversity in terms of trust, expertise, and incentive, and the difference between buyer and seller language pose unique challenges in the search ranking problem and also in building recommender systems.

In this talk we will explore the research and technical approaches to query understanding, ranking, and recommendations in our marketplace and how they differ from web search. We will also touch upon trust and reputation systems and how they play a role in the design of the search system. We will discuss how relevance, trust, value, and diversity interact in such a finding ecosystem.

The presentation in PDF format.

Speaker: Carlos Castillo, Yahoo! Research

Abstract: In recent years, clicks and query reformulation have become major signals in large scale evaluation of Web Search engines. User activity, such as the sequence of clicks and query reformulations in search sessions, is a key means to evaluate user satisfaction. User inactivity following the return of results for a given query (also known as abandonment) is typically considered a failure by the engine to satisfy the user's need.

However, all search engines now offer experiences that aim to satisfy users' needs by impression only, on the results page itself, without requiring any click. This "instant satisfaction" experience is achieved for instance through shortcuts/"oneboxes" (e.g., for weather forecasts, package tracking and flight information), and rich abstracts (e.g., by providing phone numbers and addresses of businesses). As this trend is growing, a fundamental question is how to automatically measure the success of such experiences, where the signal for success (no user interaction) is similar to the negative signal of abandonment.

This work posits the existence of a class of "tenacious" users, who very rarely abandon a search session when not satisfied. They will click on results with imperfect abstracts, reformulate queries, or otherwise persist in their attempts to obtain answers from the search engine. Thus, for them, no activity on the results page is actually an indication of satisfaction. This work uncovers such tenacious users through automatic analysis of search engine logs, and taps them to identify search sessions that were satisfied by impression.
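One way to picture the log analysis is the sketch below. It is not the authors' method, which the abstract does not detail: it simply assumes a log of (user, had-activity) records and labels as "tenacious" any user whose abandonment rate stays under a threshold. The record layout, the thresholds, and the function name are all hypothetical.

```python
from collections import defaultdict


def find_tenacious_users(log, max_abandon_rate=0.05, min_queries=20):
    """Return users who rarely abandon, from (user_id, had_activity) records.

    had_activity is True if the user clicked or reformulated after the query.
    Both thresholds are illustrative assumptions, not values from the talk.
    """
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for user_id, had_activity in log:
        totals[user_id] += 1
        if not had_activity:
            abandoned[user_id] += 1
    return {
        user
        for user, count in totals.items()
        if count >= min_queries and abandoned[user] / count <= max_abandon_rate
    }


# Hypothetical log: user "a" almost always acts, user "b" abandons half the time.
log = [("a", True)] * 39 + [("a", False)] + [("b", True)] * 10 + [("b", False)] * 10
print(find_tenacious_users(log))
```

Under the abstract's hypothesis, a no-activity session from a user in the returned set would then be read as satisfaction by impression rather than as abandonment.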


Information Retrieval

The objective of Information Retrieval (IR) is to search large data collections for information relevant to a user’s information requirements. The term “information retrieval” was coined by Calvin Mooers in 1950. As with “research”, the word “retrieval” does not refer to finding something again. Rather, it relates to the information retrieval paradox: “If I knew what I was searching for, I wouldn’t be searching for it.”

Information retrieval focuses on three dimensions: systems and applications, theory and models, and evaluation. Various retrieval models exist, such as the Vector Space Model (VSM) and probabilistic and language models. For evaluation, recall and precision are often used. SMART was an early retrieval system that dealt with all three aspects. RankBrain is a more recent retrieval system based on TensorFlow.
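As a brief illustration of the two evaluation measures, the sketch below computes set-based precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that were retrieved). The function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```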

Open-Source Software

Eurospider uses open-source software (OSS) whenever it is appropriate. The main requirements are robustness, compatibility of the license with the business model, and an active community to maintain the software. Eurospider uses, for example, STRUS (Mozilla Public License) and Kaldi (Apache License).


26.07.2010 - SIGIR 2010 - Industry Track

FUTURE SEARCH: FROM INFORMATION RETRIEVAL TO INFORMATION ENABLED COMMERCE

SEARCH FLAVOURS - RECENT UPDATES AND TRENDS

QUERY UNDERSTANDING AT BING

MACHINE LEARNING IN SEARCH QUALITY AT YANDEX

THE NEW FRONTIERS OF WEB SEARCH: GOING BEYOND THE 10 BLUE LINKS

CROSS-LANGUAGE INFORMATION RETRIEVAL IN THE LEGAL DOMAIN

BUILDING AND CONFIGURING A REAL-TIME INDEXING SYSTEM

LESSONS AND CHALLENGES FROM PRODUCT SEARCH

BEING SOCIAL: RESEARCH IN CONTEXT-AWARE AND PERSONALIZED INFORMATION ACCESS @ TELEFONICA

SEARCHING AND FINDING IN A LONG TAIL MARKETPLACE

WHEN NO CLICKS ARE GOOD NEWS
