Fuzzy Logic in Information Retrieval

Fuzzy logic and fuzzy set theory were proposed in the 1960s by Lotfi Zadeh. In computer science, we are used to dealing with Boolean logic, under the assumption that the truth value of any variable is either 0 or 1. Fuzzy logic, by contrast, allows the truth value of a variable to be any real number between 0 and 1.

For example, I could make the statement, "It rained today". Under a Boolean model my statement would be either true or false, but under the fuzzy model you could say I was telling the truth to a degree of 0.33. That could convey that it rained only to a certain extent, perhaps a light drizzle; if the statement were true to a degree of 0.97, it could convey that it was pouring in torrents. Thus, fuzzy logic handles the concept of "partial" truth.

Fuzzy Set Theory takes this concept and applies it to sets. In classical or discrete set theory, an element either belongs or does not belong to a set, denoted with the help of an indicator function. Fuzzy sets instead contain elements with varying "degrees" of membership. A membership function which takes on a real value between 0 and 1 determines the "level" to which an element belongs to a set.
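The contrast between an indicator function and a membership function can be sketched in a few lines of Python (the function names and the "tall person" example are my own, purely illustrative):

```python
# A crisp indicator function returns only 0 or 1, while a fuzzy
# membership function returns any real value in [0, 1].

def crisp_membership(x, s):
    """Classical indicator: x either belongs to set s or it does not."""
    return 1 if x in s else 0

def fuzzy_membership(height_cm):
    """Degree to which a height belongs to the fuzzy set 'tall'."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30  # linear ramp between 160 and 190 cm

print(crisp_membership(3, {1, 2, 3}))  # 1
print(fuzzy_membership(175))           # 0.5
```

A person 175 cm tall is neither clearly "tall" nor clearly "not tall"; the fuzzy set records this as a membership of 0.5.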

In information retrieval, there are two well-known families of models: set-theoretic models, which treat the individual words of a document as elements of a set (here, the document), and algebraic models, which represent documents as vectors. Boolean retrieval and fuzzy retrieval are set-theoretic, while the vector space model is algebraic. Fuzzy retrieval combines the advantages of both: the flexibility and performance of the vector space model (which uses weights to compute a ranking function) and the simplicity of the Boolean model. Fuzzy logic also makes it easier to model the uncertainty inherent in words, as opposed to numbers, and is therefore a better candidate for modeling queries and documents in natural language.

The fuzzy model consists of three main steps that give each document a "relevance" value, which can then be used for fetching the results of a query.

1. Fuzzification - To explain this step, consider a variable associated with the document, such as the term frequency ratio (term frequency divided by document length). We assume that relevance is decided solely on the basis of this variable (other variables, such as document frequency, could also be considered). If this were a Boolean model, we could define a relevance function

       relevance(di) = { 0, if tfr(di) <= 0.5
                       { 1, if tfr(di) > 0.5

However, the fuzzy model bears in mind the varying degrees of relevance, whether high, medium or low. Hence, we can define a function

      high-relevance(di) = { 0, if tfr(di) <= 0.3
                           { (tfr(di) - 0.3) / 0.4, if 0.3 < tfr(di) < 0.7
                           { 1, if tfr(di) >= 0.7

Similarly, we would have to define functions for medium and low relevance.
The above is a piecewise-linear "ramp" membership function (so called because of its shape). The fuzzification step involves choosing these fuzzy variables, deciding the fuzzy values, defining the membership function(s), and computing the membership values.
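The fuzzification step can be sketched in Python. The high-relevance function below follows the ramp defined above; the low- and medium-relevance functions are illustrative choices of my own, since the text leaves them to the reader:

```python
# Ramp membership function for high relevance, as defined in the text:
# 0 below tfr = 0.3, 1 above tfr = 0.7, linear in between.
def high_relevance(tfr):
    if tfr <= 0.3:
        return 0.0
    if tfr >= 0.7:
        return 1.0
    return (tfr - 0.3) / 0.4

# One simple choice for low relevance: the complement of high relevance.
def low_relevance(tfr):
    return 1.0 - high_relevance(tfr)

# A triangular function peaking at tfr = 0.5 (an illustrative choice).
def medium_relevance(tfr):
    if tfr <= 0.3 or tfr >= 0.7:
        return 0.0
    return 1.0 - abs(tfr - 0.5) / 0.2

print(round(high_relevance(0.5), 2))  # 0.5
print(medium_relevance(0.5))          # 1.0
```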

2. Fuzzy inference system - If we use more than one fuzzy variable, each will have its own fuzzy values. For example, if we consider the df variable in addition to the tfr variable, we need fuzzy values for both. Take the tfr fuzzy values to be high-tfr, medium-tfr, and low-tfr. Each document will then have three values associated with each of its terms, representing the degree of the document's membership in each of these sets, which depends on the term's frequency in the document.

Similarly, say there are two df fuzzy values, high-df and low-df. Remember that df here is the number of documents in which the query term appears, divided by the number of documents in the corpus.

These fuzzy values now have to be related to the document relevance using a set of rules decided on the basis of past experience or simple intuition. Suppose we decide on two fuzzy relevance values (high-relevance and low-relevance); we can then frame our rules in the following fashion:

        if (high-tfr AND low-df) then high-relevance -> Rule 1
        if (medium-tfr AND low-df) then high-relevance -> Rule 2
        if (low-tfr AND high-df) then low-relevance -> Rule 3

We can then substitute the document's high-tfr fuzzy value for the parameter (high-tfr) to formally evaluate these rules, and likewise for the remaining parameters. This rule-evaluation step is what constitutes the fuzzy inference system (FIS).

We must also define the operators used in these rules. Zadeh operators are the equivalent of Boolean operators in the fuzzy world and are defined thus:

      NOT x = (1 - truth(x))
      x AND y = min(truth(x), truth(y))
      x OR y = max(truth(x), truth(y))
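The Zadeh operators translate directly into Python (function names are my own):

```python
# Zadeh operators on truth values in [0, 1]:
# NOT is complement, AND is minimum, OR is maximum.

def fuzzy_not(x):
    return 1.0 - x

def fuzzy_and(x, y):
    return min(x, y)

def fuzzy_or(x, y):
    return max(x, y)

print(fuzzy_and(0.7, 0.6))  # 0.6
print(fuzzy_or(0.6, 0.4))   # 0.6
print(fuzzy_not(0.3))       # 0.7
```

Note that when both inputs are 0 or 1, these reduce exactly to the classical Boolean NOT, AND, and OR.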

Say high-tfr = 0.7, medium-tfr = 0.4, low-tfr =0.1, high-df = 0.3, and low-df = 0.6, then:

     high-relevance = min(0.7, 0.6) = 0.6 (applying Rule 1)
     high-relevance = min (0.4, 0.6) = 0.4 (applying Rule 2)
     low-relevance = min(0.1, 0.3) = 0.1 (applying Rule 3)

The first two rules, which share the same output value, can be combined as high-relevance = max(0.6, 0.4) = 0.6.
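Putting the rules and the worked numbers above together, a minimal sketch of the inference step looks like this:

```python
# Membership values from the worked example in the text.
high_tfr, medium_tfr, low_tfr = 0.7, 0.4, 0.1
high_df, low_df = 0.3, 0.6

# AND within a rule becomes min (Zadeh AND).
rule1 = min(high_tfr, low_df)    # Rule 1 -> high-relevance
rule2 = min(medium_tfr, low_df)  # Rule 2 -> high-relevance
rule3 = min(low_tfr, high_df)    # Rule 3 -> low-relevance

# Rules firing for the same output value are combined with max (Zadeh OR).
high_relevance = max(rule1, rule2)
low_relevance = rule3

print(high_relevance, low_relevance)  # 0.6 0.1
```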

3. Defuzzification - Now that we have two fuzzy values for the document, we need a single final value to decide its relevance to the query and to rank it against other documents. We can apply a simple weighted average for this step. If we take the weights as 1 for high relevance and 0.1 for low relevance, then

    relevance-value = (1*0.6 + 0.1*0.1)/(1 + 0.1) ≈ 0.5545
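The weighted-average defuzzification can be written as a small helper; the weights (1 for high, 0.1 for low) simply follow the example above:

```python
# Weighted average of fuzzy output values: each membership is scaled
# by its weight, and the result is normalized by the weight sum.
def defuzzify(memberships, weights):
    num = sum(w * m for w, m in zip(weights, memberships))
    den = sum(weights)
    return num / den

score = defuzzify([0.6, 0.1], [1.0, 0.1])
print(round(score, 4))  # 0.5545
```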

I will now discuss a classical fuzzy retrieval model, the Mixed Min and Max (MMM) model.

The MMM model assumes that there is a fuzzy set Ti for each index term ti. Every document di has a weight dTi proportional to its membership in Ti. For a query of the type q1 = (ti OR tj), the relevant document should be in the union of the fuzzy sets Ti, Tj. Similarly for a query of the type q2 = (ti AND tj), the relevant document should be in the intersection of the fuzzy sets Ti, Tj. Hence the relevance in these is calculated as:

    q1 = max(dTi, dTj)    (union, OR)
    q2 = min(dTi, dTj)    (intersection, AND)

This can be applied to general Q_OR and Q_AND queries for the OR and AND of n terms t1, t2, ..., tn and a document D.

The MMM model finally calculates a query-document similarity function (for ranking).

Sim(Q_OR, D)=C_or1*max(dT1, dT2, ... dTn) + C_or2*min(dT1, dT2, ... dTn), where C_or1>C_or2

Sim(Q_AND, D)=C_and1*min(dT1, dT2, ... dTn) + C_and2*max(dT1, dT2, ... dTn), where C_and1>C_and2

where C_or1, C_or2 are the "softness" coefficients for the OR operator, and C_and1, C_and2 for AND. For simplicity it is generally assumed that C_or1 = (1 - C_or2) and C_and1 = (1 - C_and2). These coefficients try to soften or relax the calculation of these Boolean operators by considering the query-document similarity to be a linear combination of the min and max document weights.
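The MMM similarity functions can be sketched as follows; the coefficient value 0.8 is an arbitrary illustrative choice, and the simplifying assumptions C_or2 = 1 - C_or1 and C_and2 = 1 - C_and1 from the text are built in:

```python
# MMM similarity: a linear combination of the max and min of the
# document's index-term weights, with "softness" coefficients.

def sim_or(weights, c_or1=0.8):
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)

def sim_and(weights, c_and1=0.8):
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)

doc_weights = [0.9, 0.4, 0.2]  # dT1..dT3 for one document

# sim_or  = 0.8 * 0.9 + 0.2 * 0.2 = 0.76
# sim_and = 0.8 * 0.2 + 0.2 * 0.9 = 0.34
print(sim_or(doc_weights))
print(sim_and(doc_weights))
```

Setting c_or1 = c_and1 = 1 recovers the pure fuzzy union and intersection above, so the coefficients really do just "soften" the Boolean operators.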

Thus, in this blog, we explored the fundamentals of fuzzy logic and fuzzy set theory, applied those concepts to design a crude retrieval model, and finally saw how they are used in a classical information retrieval model.

