Unit 1 Reading Notes

FOA section 1.1 Finding Out About - a cognitive activity

The primary argument advanced is that progress requires that we appreciate the cognitive foundation we bring to this task as academics, as language users, and even as adaptive organisms. We have evolved a wide range of strategies for seeking useful information about our environment: making initial guesses about good paths, using complex sets of features to decide whether we seem to be on the right path, and proceeding forward.

  • As humans, language is the environment we have searched through the most.

    • oral: speaking and listening

      • how to get what we want
    • writing down important facts

  • relevant to our search

  • WWW –> too much information in general

  • FOA is centrally concerned with:

    • meaning: the semantics of the words, sentences, questions and documents involved

    • understanding the semantics of documents and topics –> “aboutness”, the notion most typical within the tradition of library science

  • good technical solutions must be informed by, and can contribute to, a broader philosophy of language

  • electronic artifacts - from email messages and WWW corpora to the browsing behaviors of millions of users all trying to FOA - bring an empirical grounding for new theories of language that may well be revolutionary

  • The FOA process of browsing readers can be imagined to involve three phases:

    • Asking a question

      • users will have questions

      • cognitive state: the user’s information need

      • query –> query language

      • the information need is often ill-defined

      • users must take their internal cognitive state and turn it into an external expression of their question

    • Constructing an answer

      • The following should be considered:

        • Can they translate the user’s ill-formed question into a better one?

        • Do they know the answer themselves?

        • Are they able to verbalize the answer?

        • Can they give the answer in terms that the user will understand?

        • Can they provide the necessary background knowledge for the user to understand the answer itself?

      • Q-A: search engine

      • each passage is considered a “document”

      • entire set of documents –> corpus

      • when the corpus is large, it is hard to retrieve relevant information

      • when it is a very small set, retrieval is more efficient

    • Assessing the answer

      • user waiting in line to ask a question of a professor

      • “closing of the loop” between asker and answerer

      • user provides an assessment of just how relevant they find the answer provided

      • FOA is a dialog between question asker and answerer; it doesn’t end with the search engine’s first delivery of an answer

      • asker and answerer exchange the passage

  • Working within the IR Tradition

    • IR is a field that has existed since computers were first used to count words

    • IR has also borrowed heavily from the field of linguistics, especially computational linguistics

    • computers capable of searching and retrieving from the entire biomedical literature, across an entire nation’s judicial system, or from all of the major newspaper and magazine articles, have created new markets among doctors, lawyers, journalists, students, … everyone on the Internet

    • “search engine” here does not refer to a particular implementation, but to an idealized system most typical of the many different generations and varieties of actual search engines now in use

    • a search engine performs a match between descriptive features mentioned by users in their queries and documents sharing those same features
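
The idea of a search engine as a match between query features and document features can be sketched in a few lines of Python. This is only a toy illustration, not anything given in FOA: it assumes the "descriptive features" are simply lowercase words, and the names `tokenize`, `match_documents`, and the three-document `corpus` are made up for the example.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def match_documents(query, corpus):
    """Return (doc_id, shared_terms) pairs for documents that share at least
    one term with the query, ordered by how many terms they share."""
    query_terms = set(tokenize(query))
    matches = []
    for doc_id, text in corpus.items():
        shared = query_terms & set(tokenize(text))
        if shared:
            matches.append((doc_id, sorted(shared)))
    # More shared features first; ties broken by document id for stable output.
    matches.sort(key=lambda pair: (-len(pair[1]), pair[0]))
    return matches


# Hypothetical corpus: each passage is treated as a "document".
corpus = {
    "d1": "information retrieval with search engines",
    "d2": "cooking pasta at home",
    "d3": "retrieval of legal documents",
}
results = match_documents("information retrieval", corpus)
for doc_id, shared in results:
    print(doc_id, shared)
```

Real search engines use far richer features and weighting than raw word overlap; the sketch only shows the "shared features" notion of a match.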

IES section 1.1 and 1.2

1\.1 What is information retrieval

Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

1\.1.1 Web Search

Machines identify a set of Web pages containing the terms in the query, compute a score for each page, eliminate duplicate and redundant pages, generate summaries of the remaining pages, and finally return the summaries and links back to the user for browsing.
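
The steps in that description (find pages containing the query terms, score them, drop duplicates, summarize, return) can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the scoring is a plain count of query-term occurrences, duplicates are detected only by identical normalized text, the "summary" is just the first few words, and the `search` function and `example.org` pages are hypothetical.

```python
def search(query, pages, top_k=10, summary_words=8):
    """Toy version of the web search steps: match, score, deduplicate, summarize."""
    terms = query.lower().split()

    # Steps 1 and 2: keep pages containing every query term and score them.
    candidates = []
    for url, text in pages.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            score = sum(words.count(term) for term in terms)  # toy score
            candidates.append((score, url, text))

    # Highest score first; ties broken by URL so the order is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1]))

    # Steps 3 and 4: skip exact duplicates, then build a short summary.
    seen = set()
    results = []
    for score, url, text in candidates:
        fingerprint = " ".join(text.lower().split())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        summary = " ".join(text.split()[:summary_words])
        results.append({"url": url, "score": score, "summary": summary})

    # Step 5: return summaries and links for browsing.
    return results[:top_k]


# Hypothetical pages; the second is an exact duplicate of the first.
pages = {
    "http://example.org/a": "Information retrieval is the study of information search",
    "http://example.org/b": "Information retrieval is the study of information search",
    "http://example.org/c": "Notes on retrieval of information from archives",
    "http://example.org/d": "A recipe for bread",
}
hits = search("information retrieval", pages)
for hit in hits:
    print(hit["score"], hit["url"], hit["summary"])
```

A production engine would use an index rather than scanning every page, and near-duplicate detection rather than exact text comparison; the sketch only mirrors the order of the steps.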

The snapshot of the Web that these machines search must be gathered and refreshed constantly by a Web crawler, also running on a cluster of hundreds or thousands of machines, and periodically downloading, perhaps once a week, a fresh copy of each page.

Consider the millions of Web pages that contain the words “information” and “retrieval”. This set includes many that are relevant to the subject of information retrieval but are much less general in scope than those that appear in the top ten.

The efficient implementation and evaluation of relevance ranking algorithms under a variety of contexts and requirements represents a core problem in information retrieval, and forms the central topic of this book.

1\.1.2 Other Search Applications

Desktop and file system search provides another example of a widely used IR application. A desktop search engine provides search and browsing facilities for files stored on a local hard disk and possibly on disks connected over a local network.

1\.1.3 Other IR Applications

  • document routing, filtering, and selective dissemination reverse the typical IR process

  • text clustering and categorization systems group documents according to shared properties

  • summarization systems reduce documents to a few key paragraphs, sentences, or phrases that describe their content

  • Information extraction systems identify named entities, such as places and dates, and combine this information into structured records that describe relationships between these entities

  • Topic detection and tracking systems identify events in streams of news articles and similar information sources, tracking these events as they evolve

  • expert search systems identify members of organizations who are experts in a specified area

  • question answering systems integrate information from multiple sources to provide concise answers to specific questions

  • multimedia information retrieval systems extend relevance ranking and other IR techniques to images, video, music and speech

Information Retrieval System

1\.2.1 Basic IR System Architecture

The fundamental goal of relevance ranking is frequently expressed in terms of the Probability Ranking Principle (PRP), which we phrase as:

If an IR system’s response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, then the overall effectiveness of the system to its users will be maximized.

The PRP overlooks important aspects of relevance that must be considered in practice.
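
As a rough illustration of the Probability Ranking Principle quoted above, the sketch below sorts documents by decreasing estimated probability of relevance. The probabilities in `estimates` are invented for the example (estimating them well is the hard part of IR), and the function names are hypothetical. The helper `expected_relevant_in_top_k` is one simple way to see why this ordering is attractive: the expected number of relevant documents among the first k results is largest when the most probable documents come first.

```python
def rank_by_prp(prob_relevance):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(prob_relevance, key=lambda doc_id: (-prob_relevance[doc_id], doc_id))


def expected_relevant_in_top_k(ranking, prob_relevance, k):
    """Expected number of relevant documents among the first k results."""
    return sum(prob_relevance[doc_id] for doc_id in ranking[:k])


# Made-up relevance estimates for a single query (assumption for illustration).
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}

prp_ranking = rank_by_prp(estimates)
arbitrary_ranking = ["d1", "d2", "d3", "d4"]

print(prp_ranking)
print(expected_relevant_in_top_k(prp_ranking, estimates, 2))
print(expected_relevant_in_top_k(arbitrary_ranking, estimates, 2))
```

The sketch treats each document's relevance as independent of the others, which is one of the simplifications the principle makes.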

Working across different contexts involves more than just handling different types of document formats.

MIR section 1.1-1.4

These sections mainly cover the history of IR and how it has developed over the years. Before the World Wide Web existed, the main way of getting information was through the library. Once the WWW came along, things changed rapidly.

Relevance of information still depends on each person’s own judgment.

An IR system still follows a process of retrieval and ranking based on the user query.

Wholly new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified [303]

The Web has changed the way information is searched:

  • characteristics of the document collection itself

  • the size of the collection and the volume of user queries submitted on a daily basis

  • vast size of the document collection

  • Web is not just a repository of documents and data, but also a medium to do business

  • a further impact of the Web on search derives from Web advertising and other economic incentives

Security, privacy, copyright and patent rights, scanning, optical character recognition, and cross-language retrieval are all practical issues on the Web.

Muddiest Points:

  • We now get more and more information from all kinds of sources, so retrieving the information you actually need effectively is the key. The readings introduced the process by which data is filtered down to what we want. As human beings, we can pick out relevant information just by reading the contents; what would be the most effective way to teach a machine to do something similar?